Confused about non-tokenized fields


Confused about non-tokenized fields

Max Pfingsthorn
Hi!

In my application, I index some strings (like filenames) untokenized, meaning via

doc.add(new Field(FIELD,VALUE,false,true,false));

When I later take a look at it with Luke, I still get tokens of the filenames (like "news" instead of "news-item.xml") in the list of most frequent terms. Shouldn't I get only the complete filenames there??

Also, how do I search case-insensitive over this kind of field?

Thanks!

Best regards,

Max Pfingsthorn

Hippo  

Oosteinde 11
1017WT Amsterdam
The Netherlands
Tel  +31 (0)20 5224466
-------------------------------------------------------------
[hidden email] / www.hippo.nl
--------------------------------------------------------------

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Confused about non-tokenized fields

Gusenbauer Stefan
Max Pfingsthorn wrote:

>Hi!
>
>In my application, I index some strings (like filenames) untokenized, meaning via
>
>doc.add(new Field(FIELD,VALUE,false,true,false));
>
>When I later take a look at it with Luke, I still get tokens of the filenames (like "news" instead of "news-item.xml") in the list of most frequent terms. Shouldn't I get only the complete filenames there??
>
>Also, how do I search case-insensitive over this kind of field?
>
To index untokenized fields, try the static method
Field.Keyword(String fieldname, String value); then the string is really
not tokenized. But I think new Field with your parameters should do the
same. Have you tried searching for the filename? It should only
return a result when you write out the whole filename.

Case-insensitive search is standard when you use the StandardAnalyzer, I
think. The code should look like this:
searcher.search(QueryParser.parse("the query string", "the fieldname", new StandardAnalyzer()));



RE: Confused about non-tokenized fields

Max Pfingsthorn
In reply to this post by Max Pfingsthorn
Hi!

Thanks for the reply. I figured already that fields are actually not tokenized... I lost track of the filenames/dirnames and there were some duplicates...

About case-insensitivity: Okay, I can make my query lower case, but my strings in the field are not... I guess I have to do that manually during indexing? Or is there some nicer way?

Thanks!
Max Pfingsthorn

-----Original Message-----
From: Gusenbauer Stefan [mailto:[hidden email]]
Sent: Friday, May 27, 2005 18:00
To: [hidden email]
Subject: Re: Confused about non-tokenized fields



Re: Confused about non-tokenized fields

Gusenbauer Stefan
Max Pfingsthorn wrote:

>Hi!
>
>Thanks for the reply. I figured already that fields are actually not tokenized... I lost track of the filenames/dirnames and there were some duplicates...
>
>About case-insensitivity: Okay, I can make my query lower case, but my strings in the field are not... I guess I have to do that manually during indexing? Or is there some nicer way?
>  
>
I think this is not a problem. This should be done automatically when
you make a case-insensitive search, so you don't have to think about
it. If it becomes a problem, write another email *g*
Stefan


Re: Confused about non-tokenized fields

Erik Hatcher
In reply to this post by Max Pfingsthorn

On May 27, 2005, at 11:22 AM, Max Pfingsthorn wrote:

> Hi!
>
> In my application, I index some strings (like filenames)  
> untokenized, meaning via
>
> doc.add(new Field(FIELD,VALUE,false,true,false));
>
> When I later take a look at it with Luke, I still get tokens of the  
> filenames (like "news" instead of "news-item.xml") in the list of  
> most frequent terms. Shouldn't I get only the
> complete filenames there??

Perhaps that "news" term is coming from a different field?  Are you  
sure that you're seeing the filename field tokenized?  Your usage of  
the field constructor looks fine to me and should not tokenize.

> Also, how do I search case-insensitive over this kind of field?

Lucene is case-sensitive.  I suggest lowercasing the field before  
indexing, and search using lowercase.  This is the simplest  
suggestion, but you may need to use some other technique such as  
having different fields (or different indexes) to deal with case-
sensitivity issues.

     Erik
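
A minimal sketch of the lowercasing approach suggested above, assuming the same normalization is applied to the value before it is handed to the untokenized field and to the search term before the query is built. The class name KeywordCase, its method names and the "_lc" field suffix are hypothetical and not part of Lucene. The second method illustrates the "different fields" idea by keeping the original value next to a lowercased copy.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration; not part of Lucene.
final class KeywordCase {

    // Lowercase with a fixed locale so index-time and query-time
    // results agree regardless of the JVM's default locale.
    static String normalize(String value) {
        return value.toLowerCase(Locale.ROOT);
    }

    // Field name -> value pairs to index untokenized: the original value
    // for case-sensitive lookups, a lowercased copy for case-insensitive ones.
    static Map<String, String> keywordFields(String field, String value) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put(field, value);
        fields.put(field + "_lc", normalize(value));
        return fields;
    }

    public static void main(String[] args) {
        // With the API used in this thread, each pair would be added with
        // something like doc.add(Field.Keyword(name, value)).
        System.out.println(keywordFields("filename", "News-Item.XML"));
        // Query side: normalize the user's input the same way before
        // building the term to search for in the "_lc" field.
        System.out.println(normalize("NEWS-item.xml"));
    }
}
```

Using one shared normalize method for both sides is the point of the sketch: if indexing and searching lowercase differently, exact-match lookups on an untokenized field will silently miss.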



Re: Confused about non-tokenized fields

Erik Hatcher
In reply to this post by Gusenbauer Stefan

On May 27, 2005, at 12:14 PM, Gusenbauer Stefan wrote:

> Max Pfingsthorn wrote:
>
>
>> Hi!
>>
>> Thanks for the reply. I figured already that fields are actually  
>> not tokenized... I lost track of the filenames/dirnames and there  
>> were some duplicates...
>>
>> About case-insensitivity: Okay, I can make my query lower case,  
>> but my strings in the field are not... I guess I have to do that  
>> manually during indexing? Or is there some nicer way?
>>
>>
>>
> I think this is not a problem. This should be done automatically when
> you make a case-insensitive search, so you don't have to think about
> it. If it becomes a problem, write another email *g*

If you index but do not tokenize, then case is preserved from the  
original text.  It's the tokenization process, via the specified  
Analyzer, that typically lowercases.

So, yes, you would need to do that manually on the text you hand to a  
Field for untokenized fields.

     Erik
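
To illustrate the difference, here is a toy model in plain Java with no Lucene classes. keywordTerm stands in for an untokenized field and returns the value unchanged as a single term. analyzedTerms is a deliberately simplified stand-in for an analyzer that splits on non-alphanumeric characters and lowercases each piece; this is an assumption for illustration and not how StandardAnalyzer actually tokenizes. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Toy model for illustration; not Lucene code.
final class TermModel {

    // Untokenized field: the whole value becomes one term, case preserved.
    static List<String> keywordTerm(String value) {
        List<String> terms = new ArrayList<>();
        terms.add(value);
        return terms;
    }

    // Simplified analyzer stand-in: split on runs of non-alphanumeric
    // characters and lowercase each resulting piece.
    static List<String> analyzedTerms(String value) {
        List<String> terms = new ArrayList<>();
        for (String piece : value.split("[^A-Za-z0-9]+")) {
            if (!piece.isEmpty()) {
                terms.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        String filename = "News-Item.xml";
        // A lowercased query term does not equal the stored, case-preserved term.
        System.out.println(keywordTerm(filename).contains("news-item.xml")); // false
        // The exact original string does match.
        System.out.println(keywordTerm(filename).contains("News-Item.xml")); // true
        // A tokenized field yields partial, lowercased terms such as "news".
        System.out.println(analyzedTerms(filename)); // [news, item, xml]
    }
}
```

The last line also shows how a term like "news" can show up in a term list when some other field holding the same text is tokenized.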



Re: Confused about non-tokenized fields

Gusenbauer Stefan
Erik Hatcher wrote:

>
> If you index but do not tokenize, then case is preserved from the
> original text.  It's the tokenization process, via the specified
> Analyzer, that typically lowercases.
>
> So, yes, you would need to do that manually on the text you hand to a
> Field for untokenized fields.
>
Thanks, that was new to me. I will be more careful before I give out
suggestions.
Stefan

