Is storing 20 fields in a lucene document desirable?

Kumar Limbu
Each of our documents contains 23 fields, and we STORE all of them in the Lucene index.

We have recently had some performance issues, and our analysis has shown the bottleneck to be Lucene search and retrieval.

We have been thinking about reducing the number of fields per document by removing unnecessary fields and by merging fields with similar weightings. Will reducing the number of fields help performance?

Another issue is that we currently retrieve around 9 fields after each search. Some are long text of up to 1000 words. Is retrieving long fields a large overhead?

We are considering separating search from retrieval, so that Lucene performs the search and MySQL stores the data. The Lucene index would hold only the INDEXED fields and the primary key. After searching we would return 1 field (the primary key) instead of 9, and use it to fetch the actual data from the MySQL database. Will reducing the number of fields retrieved from Lucene reduce the response time, or will using MySQL make it worse?

So our main concern is whether retrieving fields usually takes longer than searching. Which does Lucene spend most of its time on: search or retrieval? We are also concerned that using MySQL will cause performance issues, because we will be doing I/O for MySQL as well as for Lucene. We also add around 100k documents each day and remove around the same number. Will these frequent reads and writes affect performance?

Our current index size is around 8G and contains around 3M documents.

If there are any suggestions or questions, please reply.

Thank you all!

Regards,
Kumar

Re: Is storing 20 fields in a lucene document desirable?

Grant Ingersoll-2

On Nov 20, 2007, at 6:29 AM, kumarlimbu wrote:

> Our document contains a total of 23 fields in one document and we STORE all
> of them in lucene index.
>
> We have recently had some performance issues and our analysis has shown the
> bottleneck to be lucene search and retrieval.

Perhaps you can share your information on java-dev along with any  
detailed tests, etc. so that we can see if there is anything we can  
improve.

> We have been thinking about reducing the number of fields per document by
> removing unnecessary fields and by merging fields with similar weightings.
> Will reducing the number of fields help to optimize performance?

Yes

> Another issue is we are currently retrieving around 9 fields after we do a
> search. Some are long text of up to 1000 words. Is it a large overhead to
> retrieve long fields?

Yes. Are you using FieldSelector? Also, I would only STORE those
fields that you actually need to display, not all 23. Do you display
all 9 fields right away, or are some shown only when the user chooses a
document? If the latter, try the lazy-loading piece of FieldSelector.

> We are considering the option of separating the search and retrieve parts so
> that Lucene performs the search, MYSQL stores the data. We just store the
> INDEXED field and primary key in the lucene index. After searching we only
> return 1 field (primary key) instead of 9 fields. This field will be used
> for retrieving the actual information from the MySQL database. Will reducing
> the number of fields retrieved from lucene reduce the response time or will
> using MySQL database make it worse?

Hard to say, you will likely have to try it out.

> So our main concern is to find if retrieving fields usually takes longer
> than searching or not? What does lucene spend most time doing – search or
> retrieval? We are also concerned that using MySQL will have performance
> issues because we will be doing I/O for MySQL as well as Lucene. We also add
> around 100k documents each day and remove around the same number of
> documents. Will this frequent read and write have impact on performance?

What version of Lucene are you using? Naturally, frequent updates
will have some effect, but that may not be the deciding factor. I
would check out http://wiki.apache.org/lucene-java/BasicsOfPerformance

Also, you may want to try out (not in production) the trunk version of  
Lucene.

Last, but not least, do you have some sort of cache for your  
Documents?  Obviously, you need to have appropriate semantics for when  
the underlying docs change, but using a cache makes sense, too.
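
The caching suggestion above could be sketched roughly as follows. This is a minimal illustration, not Lucene code: `DocumentCache` and its methods are hypothetical names, the key is assumed to be the document's primary key, and the value stands in for whatever stored fields the application displays. It uses only `java.util.LinkedHashMap` in access order to evict the least recently used entry, and `invalidate` is the hook for the "appropriate semantics" when a document is updated or deleted.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical LRU cache; keys are primary keys, values stand in for loaded documents. */
final class DocumentCache<K, V> {
    private final Map<K, V> entries;

    DocumentCache(final int capacity) {
        // accessOrder = true keeps entries ordered from least to most recently used.
        this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                // Evict the least recently used entry once the capacity is exceeded.
                return size() > capacity;
            }
        };
    }

    synchronized V get(K key) {
        return entries.get(key);
    }

    synchronized void put(K key, V doc) {
        entries.put(key, doc);
    }

    /** Call this whenever the underlying document is updated or deleted. */
    synchronized void invalidate(K key) {
        entries.remove(key);
    }

    synchronized int size() {
        return entries.size();
    }
}
```

With roughly 100k additions and 100k deletions a day, as described in the original question, the invalidation path matters as much as the hit path; whether a cache of this kind pays off still has to be measured.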

-Grant


--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ




---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Is storing 20 fields in a lucene document desirable?

Erick Erickson
How are you doing your search? When you say "lucene is the
bottleneck", that encompasses a lot. You really need to pinpoint things
a bit more....

1> Are you iterating over the Hits object for many docs? This is bad.

2> Are you using a HitCollector and reading the doc each time you get
to the collect method? This is bad.

3> Is it the *search* or the fetch that's eating the time? Time *just* the
search call (making sure it doesn't go through a HitCollector) to answer
this.

4> How many documents are you returning? 5? 5,000? The answer may
tell us a lot.

5> Are you opening a new reader each time you search? This is bad.

6> Have you looked over
http://wiki.apache.org/lucene-java/ImproveSearchingSpeed?

Before you try to implement a solution, you *must* figure out what
operation is actually taking the time. An 8G index isn't that big, so I
suspect that there's something else going on. What kind of throughput are
you seeing? What do you need? Can you post some sample timings? Can you
post some sample search code?
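
One way to get the separate timings asked for in point 3 is a small harness like the one below. It is only a sketch under assumptions: `SearchTimer`, `Timings`, and the two callbacks are hypothetical names, and the callbacks stand in for the application's own "run the query and return ids" and "load the stored fields for one id" code, so nothing here is Lucene API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Supplier;

/** Hypothetical harness that times the search phase and the fetch phase separately. */
final class SearchTimer {

    /** Elapsed time per phase for a single request. */
    static final class Timings {
        final long searchNanos;
        final long fetchNanos;
        final int docsFetched;

        Timings(long searchNanos, long fetchNanos, int docsFetched) {
            this.searchNanos = searchNanos;
            this.fetchNanos = fetchNanos;
            this.docsFetched = docsFetched;
        }
    }

    /**
     * @param search runs the query only and returns matching ids (no document loading)
     * @param fetch  loads the stored fields for one id
     */
    static <I, D> Timings time(Supplier<List<I>> search, Function<I, D> fetch) {
        long start = System.nanoTime();
        List<I> ids = search.get();
        long afterSearch = System.nanoTime();

        List<D> docs = new ArrayList<D>(ids.size());
        for (I id : ids) {
            docs.add(fetch.apply(id));
        }
        long afterFetch = System.nanoTime();

        return new Timings(afterSearch - start, afterFetch - afterSearch, docs.size());
    }
}
```

Running this over a batch of representative queries, after a warm-up pass, and comparing the two totals shows which phase dominates before any redesign is attempted.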

I can't tell you how many times I've been *sure* I knew what the problem
was, but in the process of analyzing the problem discovered that my real
issue was something totally different.....

Best
Erick


help required urgent!!!!!!!!!!!

Shakti_Sareen
In reply to this post by Grant Ingersoll-2
Hi
I am using StandardAnalyzer to index the data,
but I want to do a "like" (prefix) search on a word containing a hyphen.
For example, I want to search for "soft-wa*".

I am getting no hits for that. It is said that if the word contains a
hyphen, the word should be enclosed in double quotes ("). But
enclosing the word in double quotes (") means an exact word search.

How can I perform a "like" search on a word containing a hyphen?

Please help.

Regards,
Shakti Sareen





DISCLAIMER:
This email (including any attachments) is intended for the sole use of the intended recipient/s and may contain material that is CONFIDENTIAL AND PRIVATE COMPANY INFORMATION. Any review or reliance by others or copying or distribution or forwarding of any or all of the contents in this message is STRICTLY PROHIBITED. If you are not the intended recipient, please contact the sender by email and delete all copies; your cooperation in this regard is appreciated.


Re: help required urgent!!!!!!!!!!!

Shai Erera
Hi

You can simply create a PrefixQuery. However, if you're using
StandardAnalyzer, and the word is added as Index.TOKENIZED,
soft-wa<something> will be broken into 'soft' and 'wa<something>'. Therefore
you'll need to add the word as Index.UN_TOKENIZED, or use a different
Analyzer when you index the data (for this field at least).

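Here's some sample code:

        // Store and Index are the nested classes Field.Store and Field.Index.
        // Indexing: add the value as a single, un-analyzed term.
        Document doc = new Document();
        doc.add(new Field("field", "soft-wash", Store.NO, Index.UN_TOKENIZED));

        // Search: match every term in "field" that starts with "soft-wa".
        Query q = new PrefixQuery(new Term("field", "soft-wa"));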

Does that help?
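
To see why the tokenized field gives no hits, here is a small simulation in plain Java. It does not call Lucene: `tokenized` is only a rough stand-in for what StandardAnalyzer does to a letters-only hyphenated word (lowercase, split at hyphens and whitespace), and all names in it are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

/** Hypothetical simulation of prefix matching against tokenized and un-tokenized terms. */
final class PrefixDemo {

    /** Rough stand-in for tokenized indexing: lowercase, then split at hyphens and whitespace. */
    static List<String> tokenized(String value) {
        List<String> terms = new ArrayList<String>();
        for (String t : value.toLowerCase(Locale.ROOT).split("[\\s-]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    /** Stand-in for un-tokenized indexing: the whole value becomes a single term. */
    static List<String> untokenized(String value) {
        return Collections.singletonList(value);
    }

    /** A prefix query can only match terms that actually exist in the index. */
    static boolean prefixMatches(List<String> terms, String prefix) {
        for (String term : terms) {
            if (term.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }
}
```

With the tokenized terms `soft` and `wash`, no term starts with `soft-wa`, so the prefix finds nothing; with the single un-tokenized term `soft-wash` it matches.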


--
Regards,

Shai Erera

RE: help required urgent!!!!!!!!!!!

Shakti_Sareen
Hi

But the file I am indexing is very big and I don't know which words will
contain a hyphen. What you suggest can be implemented only if the
hyphenated words in the file are known in advance.

I have no option other than StandardAnalyzer.

Thanks a lot for your reply.

Please suggest how I can proceed.

 
SHAKTI SAREEN
GE-GDC
STC HYDERABAD
9948777794


Re: help required urgent!!!!!!!!!!!

mark harwood
In reply to this post by Shakti_Sareen
>>Re: help required urgent!!!!!!!!!!!

Yikes!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

I'm guessing that the question was more about how to support this in the standard query syntax where there are multiple words.

For example: http://www.google.com/search?q=lucene+wildcard+in+phrase

This post seems close to a solution to that problem:
http://mail-archives.apache.org/mod_mbox/lucene-java-user/200607.mbox/%3CB50FE8BF4F9CF24FB800E28DBB1A2FE026D681@...%3E


Cheers,
Mark



Re: help required urgent!!!!!!!!!!!

Shai Erera
In reply to this post by Shakti_Sareen
The thing is, StandardAnalyzer splits words at hyphens. You'll need to work
around this, for example by extending StandardAnalyzer.

From the documentation of StandardTokenizer (which StandardAnalyzer uses):

    Splits words at hyphens, unless there's a number in the token, in which
    case the whole token is interpreted as a product number and is not split.

I've investigated StandardAnalyzer's tokenization and it doesn't look simple
to disable that behavior. What you can do is extend StandardAnalyzer and
override its tokenStream method to create a TokenStream of your own. If you
know your text is space separated, you can use StringTokenizer to split the
text on spaces. If a token contains '-', don't break it; otherwise pass it
on to the TokenStream returned by StandardAnalyzer.

Maybe someone else has a better answer, but if you insist on using
StandardAnalyzer, I have a feeling it will be problematic.
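
The splitting logic described above can be sketched in plain Java. This is not an Analyzer subclass: wiring it into a TokenStream is left out, `HyphenPreservingSplitter` is a hypothetical name, and `fallbackAnalyze` is only a crude stand-in (lowercase, split on non-alphanumerics) for handing the token to StandardAnalyzer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.StringTokenizer;

/** Hypothetical sketch: split on whitespace and keep hyphenated tokens whole. */
final class HyphenPreservingSplitter {

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        // The default StringTokenizer delimiters are whitespace characters.
        StringTokenizer st = new StringTokenizer(text);
        while (st.hasMoreTokens()) {
            String raw = st.nextToken();
            if (raw.indexOf('-') >= 0) {
                // Contains a hyphen: keep the token in one piece (lowercased only).
                tokens.add(raw.toLowerCase(Locale.ROOT));
            } else {
                // No hyphen: hand it to the normal analysis path.
                tokens.addAll(fallbackAnalyze(raw));
            }
        }
        return tokens;
    }

    /** Crude stand-in for StandardAnalyzer: lowercase and split on non-alphanumerics. */
    static List<String> fallbackAnalyze(String raw) {
        List<String> out = new ArrayList<String>();
        for (String part : raw.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                out.add(part);
            }
        }
        return out;
    }
}
```

One limitation of this simple version: punctuation attached to a hyphenated word (for example a trailing comma) stays part of the token, so a real implementation would probably want to trim it first. The same analysis has to be applied to query terms, or the indexed and searched terms will not line up.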



--
Regards,

Shai Erera

Re: help required urgent!!!!!!!!!!!

Shai Erera
In reply to this post by mark harwood
Yep - that's a good one. Only it may be a very heavy query, and it may throw
a TooManyClauses exception (BooleanQuery.TooManyClauses) if too many terms
start with the prefix. But that certainly would work.
BTW - MultiPhraseQuery's documentation specifically explains how to use it
for exactly this purpose.
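
The cost concern can be illustrated with a plain-Java sketch of prefix expansion. It does not use Lucene: the sorted set stands in for the index's term dictionary, `PrefixExpansion.expand` and `maxClauses` are hypothetical names, and the thrown `IllegalStateException` merely plays the role of the clause-limit failure mentioned above. As far as I recall, Lucene's default clause limit is 1024, but treat that number as an assumption and check it for your version.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;

/** Hypothetical sketch of expanding a prefix against a sorted term dictionary. */
final class PrefixExpansion {

    static List<String> expand(SortedSet<String> terms, String prefix, int maxClauses) {
        List<String> matches = new ArrayList<String>();
        // tailSet starts at the first term >= prefix; in natural String order
        // all terms sharing the prefix are contiguous from that point on.
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break;
            }
            if (matches.size() == maxClauses) {
                throw new IllegalStateException(
                        "prefix '" + prefix + "' expands to more than " + maxClauses + " terms");
            }
            matches.add(term);
        }
        return matches;
    }
}
```

Every matching term becomes one clause of the rewritten query, so a short prefix over a large vocabulary can hit the limit; a longer minimum prefix length is a common way to keep the expansion small.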


--
Regards,

Shai Erera

Re: help required urgent!!!!!!!!!!!

Matthijs Bierman
In reply to this post by Shakti_Sareen
Hi

Simply create your own analyzer with JavaCC. See the repository for the
latest StandardAnalyzer.jj file, and make sure the Analyzer accepts anything
with a hyphen as a single token.
And try not to yell, please. Most of the questions are urgent; there is
no need for emphasis, especially in this manner.

Good luck,
Matthijs




help required

Shakti_Sareen
In reply to this post by Shai Erera
Hi

Can anyone help me with code for a class that extends StandardAnalyzer
but does not split words at hyphens?

 
SHAKTI SAREEN

-----Original Message-----
From: Shai Erera [mailto:[hidden email]]
Sent: Thursday, November 22, 2007 9:53 PM
To: [hidden email]
Subject: Re: help required urgent!!!!!!!!!!!

The thing is - StandardAnalyzer breaks on hyphen. You'll need to work
around
this by either extend StandardAnalyzer

From StandardTokenizer's documentation (which is used by
StandardAnalyzer):
*   <li> *Splits words at hyphens, unless there's a number in the token,
in
which case
 *     the whole token is interpreted as a product number and is not
split.*

I've investigated StandardAnalyzer's tokenization and it doesn't look
simple
to disable that behavior. What you can do is extend StandardAnalyzer and
override its tokenStream method to create a TokenStream of your own. If
you
know your text is space separated, you can use StringTokenizer to split
the
text on spaces. If a token contains '-', don't break it, otherwise pass
it
forward the the TokenStream returned by StandardAnalyzer.

Maybe someone else has a better answer, but if you insist on using
StandardAnalyzer, I have a feeling it will be problematic.

On Nov 22, 2007 6:02 PM, Shakti_Sareen <[hidden email]> wrote:

> Hi
>
> But the file I am indexing is very big and I don't know which words
> will contain a hyphen. What you suggest can be implemented only if
> there are some specific words in the file.
>
> Apart from StandardAnalyzer I have no other option.
>
> Thanks a lot for your reply.
>
> Please suggest how I can go ahead.
>
>
> SHAKTI SAREEN
> GE-GDC
> STC HYDERABAD
> 9948777794
>
> -----Original Message-----
> From: Shai Erera [mailto:[hidden email]]
> Sent: Thursday, November 22, 2007 9:25 PM
> To: [hidden email]
> Subject: Re: help required urgent!!!!!!!!!!!
>
> Hi
>
> You can simply create a PrefixQuery. However, if you're using
> StandardAnalyzer and the word is added as Index.TOKENIZED,
> soft-wa<something> will be broken into 'soft' and 'wa<something>'.
> Therefore you'll need to add the word as Index.UN_TOKENIZED, or use a
> different Analyzer when you index the data (for this field at least).
>
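> Here's some sample code:
>
>        // Imports needed by the snippet (Lucene 2.x API).
>        import org.apache.lucene.document.Document;
>        import org.apache.lucene.document.Field;
>        import org.apache.lucene.index.Term;
>        import org.apache.lucene.search.PrefixQuery;
>        import org.apache.lucene.search.Query;
>
>        // Indexing: UN_TOKENIZED indexes "soft-wash" as a single term.
>        Document doc = new Document();
>        doc.add(new Field("field", "soft-wash", Field.Store.NO,
>                Field.Index.UN_TOKENIZED));
>
>        // Search: the prefix is compared against the raw indexed term.
>        Query q = new PrefixQuery(new Term("field", "soft-wa"));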
>
> Does that help?
>
> On Nov 22, 2007 5:46 PM, Shakti_Sareen <[hidden email]> wrote:
>
> > Hi
> > I am using StandardAnalyzer to index the data,
> > but I want to do a "like" search on a word containing a hyphen.
> > For example, I want to search for "soft-wa*".
> >
> > I am getting no hits for that. It is said that if there is a hyphen
> > in the word, the word should be enclosed in double quotes ("). But
> > enclosing the word in double quotes means an exact-word search.
> >
> > How can I perform a "like" search on a word containing a hyphen?
> >
> > Please help.
> >
> > Regards,
> > Shakti Sareen
>
> --
> Regards,
>
> Shai Erera



--
Regards,

Shai Erera


DISCLAIMER:
This email (including any attachments) is intended for the sole use of the intended recipient/s and may contain material that is CONFIDENTIAL AND PRIVATE COMPANY INFORMATION. Any review or reliance by others or copying or distribution or forwarding of any or all of the contents in this message is STRICTLY PROHIBITED. If you are not the intended recipient, please contact the sender by email and delete all copies; your cooperation in this regard is appreciated.



Re: help required

Mark Miller-3
Copy the current StandardAnalyzer and add '-' to the definition of a
LETTER. You might work with the StandardAnalyzer on the trunk, which
uses JFlex rather than the current JavaCC flavor; the new one is
something like 6-10 times faster.

- Mark
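
To illustrate the idea of widening the LETTER definition, here is a small plain-Java sketch that uses regular expressions instead of the real tokenizer grammar. It is not the StandardTokenizer grammar; the class name LetterClassDemo and both patterns are assumptions made up for this example. The only point is that once '-' is part of the character class that makes up a word, a hyphenated word comes out as one token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LetterClassDemo {

    // ASSUMPTION: simplified "word" definition, letters and digits only.
    static final Pattern DEFAULT_WORD = Pattern.compile("[\\p{L}\\p{N}]+");

    // Same definition with '-' added to the character class.
    static final Pattern HYPHEN_WORD = Pattern.compile("[\\p{L}\\p{N}-]+");

    // Collect every maximal run of characters matching the word pattern.
    static List<String> tokens(Pattern word, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = word.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }
}
```

With the first pattern "soft-wash gel" yields soft, wash, gel; with the second it yields soft-wash, gel. A side effect of the wider class is that a hyphen standing on its own also becomes a token, so a real grammar change would probably need to guard against that.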
