Indexing punctuation and symbols


Indexing punctuation and symbols

John Byrne-3
Hi,

Has anyone written an analyzer that preserves punctuation and symbols
("£", "$", "%" etc.) as tokens?

That way we could distinguish between searching for "100" and "100%" or
"$100".
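A minimal sketch of such a tokenizer, in plain Java with no Lucene dependency (the class name and the letter-or-digit splitting rule are assumptions for illustration, not an existing analyzer):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: split on whitespace, then peel leading and trailing
// symbol characters off each chunk and emit them as tokens of their own.
public class SymbolTokenizerSketch {

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String chunk : text.split("\\s+")) {
            if (chunk.isEmpty()) continue;
            int start = 0;
            int end = chunk.length();
            // emit leading symbols (the "$" in "$100") as individual tokens
            while (start < end && !Character.isLetterOrDigit(chunk.charAt(start))) {
                tokens.add(String.valueOf(chunk.charAt(start)));
                start++;
            }
            // find where the trailing symbols (the "%" in "100%") begin
            int coreEnd = end;
            while (coreEnd > start && !Character.isLetterOrDigit(chunk.charAt(coreEnd - 1))) {
                coreEnd--;
            }
            if (coreEnd > start) {
                tokens.add(chunk.substring(start, coreEnd));
            }
            for (int i = coreEnd; i < end; i++) {
                tokens.add(String.valueOf(chunk.charAt(i)));
            }
        }
        return tokens;
    }
}
```

With this, "$100 or 100%" tokenizes to the five tokens "$", "100", "or", "100" and "%", so searches for "100", "100%" and "$100" become distinguishable.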

Does anyone know of a reason why that wouldn't work? I notice that even
Google doesn't support it, but I can't think why.

Regards,
John B.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Indexing punctuation and symbols

Karl Wettin

WhitespaceAnalyzer?

You could also extend the lexical rules of StandardAnalyzer.


--
karl


Re: Indexing punctuation and symbols

John Byrne-3
WhitespaceAnalyzer does preserve those symbols, but not as separate
tokens; it simply leaves them attached to the original term.

As an example of what I'm talking about, consider a document that
contains (without the quotes) "foo, ".

Now, using WhitespaceAnalyzer, I could only get that document by
searching for "foo,". Using StandardAnalyzer or any analyzer that
removes punctuation, I could only find it by searching for "foo".

I want an analyzer that will allow me to find it if I build a phrase
query with the term "foo" followed immediately by ",". After all, the
comma may be relevant to the search, but is definitely not part of the
word.

Extending StandardAnalyzer is what I had in mind, but I don't know where
to start. I also wonder why no one seems to have done it before; it
makes me suspect that there's some reason I haven't seen yet that makes
it impossible or impractical.
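The token stream being asked for can be pictured as terms with consecutive positions; a plain-Java sketch (Tok is a hypothetical stand-in for Lucene's Token class, which also carries a position):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the token stream wanted for "foo, ": the comma becomes a
// token of its own at the position right after "foo", so a phrase query
// can require the two terms to be adjacent.
public class TokenStreamSketch {

    public record Tok(String term, int position) {}

    // assign consecutive positions to a list of terms
    public static List<Tok> withPositions(List<String> terms) {
        List<Tok> stream = new ArrayList<>();
        for (int i = 0; i < terms.size(); i++) {
            stream.add(new Tok(terms.get(i), i));
        }
        return stream;
    }
}
```

A phrase query for "foo" followed by "," then just demands the two terms at consecutive positions.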






Re: Indexing punctuation and symbols

Patrek
Hi,

I don't know the size of your dataset, but couldn't you index into two
fields with PerFieldAnalyzerWrapper, tokenizing with StandardAnalyzer
for one field and WhitespaceAnalyzer for the other?

Then use a multi-field query (there is a query parser for that, I just
don't remember the name right now).

Patrick




Re: Indexing punctuation and symbols

John Byrne-3
Well, the size wouldn't be a problem, we could afford the extra field.
But it would seem to complicate the search quite a lot. I'd have to run
the search terms through both analyzers. It would be much simpler if the
characters were indexed as separate tokens.





Re: Indexing punctuation and symbols

Patrek
Of course, it depends on the kind of query you are doing, but (I did
find the query parser in the meantime) something like this would do the
trick quite simply:

    MultiFieldQueryParser mfqp =
        new MultiFieldQueryParser(useFields, analyzer, boosts);
    Query query = mfqp.parse(queryString);

where analyzer can be a PerFieldAnalyzerWrapper.
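The two-field idea can be sketched without Lucene at all. Here the field names and the two toy tokenizers are illustrative assumptions, standing in for StandardAnalyzer and WhitespaceAnalyzer:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Plain-Java sketch of the per-field idea: the same text is tokenized
// differently depending on the field it is indexed under.
public class PerFieldSketch {

    // "standard"-like: lowercase, split on anything that is not a letter or digit
    static final Function<String, List<String>> STANDARD_LIKE =
        text -> Arrays.asList(text.toLowerCase().split("[^\\p{L}\\p{N}]+"));

    // "whitespace"-like: split on whitespace only, punctuation stays attached
    static final Function<String, List<String>> WHITESPACE_LIKE =
        text -> Arrays.asList(text.split("\\s+"));

    static final Map<String, Function<String, List<String>>> PER_FIELD =
        Map.of("body", STANDARD_LIKE, "body_raw", WHITESPACE_LIKE);

    public static List<String> analyze(String field, String text) {
        return PER_FIELD.get(field).apply(text);
    }
}
```

Analyzing "foo, bar" under "body" yields the tokens "foo" and "bar", while under "body_raw" it yields "foo," and "bar", which is the distinction the two fields would capture.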

Patrick




Re: Indexing punctuation and symbols

Erick Erickson
You might be able to create an analyzer that breaks your
stream up (from the example) into the tokens
"foo" and "," and then (using the same analyzer at query time)
search on phrases with a slop of 0. That seems like
it would do what you want.
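A sketch of the slop-0 matching this implies, over plain token lists (a hypothetical helper, not Lucene's PhraseQuery):

```java
import java.util.List;

// Sketch of the slop-0 idea: with punctuation indexed as separate tokens,
// matching "foo" immediately followed by "," is just a search for the
// phrase terms at consecutive positions.
public class PhraseSketch {

    public static boolean matchesPhrase(List<String> tokens, List<String> phrase) {
        outer:
        for (int i = 0; i + phrase.size() <= tokens.size(); i++) {
            for (int j = 0; j < phrase.size(); j++) {
                if (!tokens.get(i + j).equals(phrase.get(j))) {
                    continue outer;
                }
            }
            return true;
        }
        return false;
    }
}
```

Given the tokens ["foo", ",", "bar"], the phrase ["foo", ","] matches, while against ["foo", "bar", ","] it does not.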

Best
Erick


Changing the Punctuation definition for StandardAnalyzer

tareque
In reply to this post by John Byrne-3
I am using StandardAnalyzer for my indexes. I don't want whole email
addresses to be searchable as single terms; I want '@' to be considered
punctuation too, because my users would rather search by user id and/or
host name to return all matching email addresses than search by the
whole address. That way, they can still build a query that returns the
email addresses anyway.

How do I let StandardAnalyzer consider '@' as punctuation?

Thanks
Tareque



Re: Changing the Punctuation definition for StandardAnalyzer

Karl Wettin


A quick and dirty solution is to introduce a TokenFilter that splits
any token at '@', and to add it at the end of the chain of streams in
StandardAnalyzer#tokenStream.

It would probably be much more efficient if you modified the lexer
grammar StandardTokenizer is generated from.
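The quick-and-dirty pass can be illustrated in plain Java (a real implementation would be a Lucene TokenFilter appended to StandardAnalyzer's chain; this stand-alone helper only shows the splitting logic):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: split any token containing '@' into its parts, leaving all
// other tokens untouched.
public class SplitAtSignSketch {

    public static List<String> splitAtSign(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (token.indexOf('@') < 0) {
                out.add(token);
                continue;
            }
            for (String part : token.split("@")) {
                if (!part.isEmpty()) {
                    out.add(part);
                }
            }
        }
        return out;
    }
}
```

So the token "john@example.com" would come out as the two tokens "john" and "example.com", each searchable on its own.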

--
karl



Re: Changing the Punctuation definition for StandardAnalyzer

tareque
Thanks Karl,

I would rather modify the lexer grammar, but where exactly is it
defined? After a quick look, it seems
StandardTokenizerTokenManager.java may be where it is done.
The ampersand having a decimal value of 38, I was assuming that the
following step is taken when the lexer encounters one:

=============
              case 73:
                  if (curChar == 38)
                     jjstateSet[jjnewStateCnt++] = 74;
                  break;
=============

It's kind of complicated, so before I attempt to delve into it, I
thought I should ask whether I am looking at the right place.

Thanks again!
Tareque







Re: Changing the Punctuation definition for StandardAnalyzer

Karl Wettin


http://svn.apache.org/repos/asf/lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex

It can be generated with the Ant build.

--
karl



Re: Changing the Punctuation definition for StandardAnalyzer

tareque
Karl,

I should have mentioned before: I'm using Lucene 1.9.1.

In fact, I had previously located the grammar in StandardTokenizer.jj
(I just wasn't sure whether that was the one you were talking about)
and had commented out the EMAIL entries in all of the following files:

StandardTokenizer.java
StandardTokenizer.jj
StandardTokenizerConstants.java

But evidently the tokenizer then expected the email addresses to match
one of the other TOKEN types, and since they matched none of them it
threw a ParseException.

What puzzles me now is this: although I don't see the '@' sign (Unicode
value 0040) included in "LETTER" or any other definition, why is it not
splitting the words? It certainly isn't, which is why the tokenizer
expects the email address to be defined as a TYPE. My understanding,
looking at the code, is that any character not defined in the grammar
should act as a splitter, since it does not contribute to any TOKEN
definition.

Please let me know what I am missing.

Thanks
Tareque





Re: Changing the Punctuation definition for StandardAnalyzer

Karl Wettin

I think you'll find the JavaCC list a much better forum for this
question. You do, however, seem a bit confused about the fact that
StandardTokenizer and StandardTokenizerConstants are artifacts
generated via the Ant build from StandardTokenizer.jj.

Why was the TokenFilter solution not good enough? What were the results
from your benchmarks?


--
karl



Re: Changing the Punctuation definition for StandardAnalyzer

tareque
I actually hadn't implemented the TokenFilter solution before deciding
not to go with it, so I don't have any benchmarks.

But I have since taken care of the problem with a different variation
of your quick and dirty solution: I capture the character '@' in
FastCharStream.java and replace it with a blank space. That took
care of it.
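The same effect can be sketched as a stand-alone Reader wrapper (the actual change described was made inside Lucene's FastCharStream; this class is an illustrative equivalent, not that code):

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

// Sketch: a Reader that turns every '@' into a space before the
// tokenizer ever sees the characters, so "john@example.com" reaches
// the tokenizer as "john example.com".
public class AtSignStrippingReader extends FilterReader {

    public AtSignStrippingReader(Reader in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        return c == '@' ? ' ' : c;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        for (int i = off; i < off + n; i++) {
            if (buf[i] == '@') {
                buf[i] = ' ';
            }
        }
        return n;
    }
}
```

Wrapping the document Reader this way keeps the tokenizer itself untouched, which is why it sidesteps the grammar changes discussed above.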

Thanks for your help!
Tareque

