Fastest Method for Searching (need all results)

rjohara
My index contains approximately 5 million documents.  During a  
search, I need to grab the value of a field for every document in the  
result set.  I am currently using a HitCollector to search.  Below is  
my code:

searcher.search(query, new HitCollector() {
    public void collect(int doc, float score) {
        if (searcher.doc(doc).get("SYM") != null) {
            addSymbolsToHash(searcher.doc(doc).get("SYM").split("ENDOFSYM"));
        }
    }
});

This is fairly fast for small and medium-sized result sets.  However,  
it gets slow as the result set grows.  I read this on HitCollector's  
API page:

"For good search performance, implementations of this method should  
not call Searcher.doc(int) or Reader.document(int) on every document  
number encountered. Doing so can slow searches by an order of  
magnitude or more."

Along with this implementation, I've also tried using FieldCache.
This fared better with large result sets, but worse with small and
medium-sized result sets.  Does anyone have any ideas about what the
best approach might be?

Thanks a lot,
Ryan
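
One way to follow the Javadoc advice quoted above is to record only the doc ids inside collect() and fetch the field values after the search has finished. The sketch below is a minimal illustration of that two-phase idea and makes no Lucene calls: the Callback interface, the search() method, and the String[] "store" are hypothetical stand-ins. The number of value fetches stays the same, so whether this helps depends on where the time actually goes.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Two-phase sketch: record matching doc ids first, fetch values afterwards.
// Callback and the String[] "store" are hypothetical stand-ins, not Lucene types.
class TwoPhaseCollectSketch {
    interface Callback { void collect(int doc, float score); }

    // Stand-in search: reports every even doc id below maxDoc as a hit.
    static void search(int maxDoc, Callback cb) {
        for (int doc = 0; doc < maxDoc; doc += 2) cb.collect(doc, 1.0f);
    }

    static List<String> run(String[] store) {
        final BitSet hits = new BitSet(store.length);
        // Phase 1: the callback only sets a bit, so it stays cheap.
        search(store.length, (doc, score) -> hits.set(doc));
        // Phase 2: walk the set bits in increasing doc-id order and fetch.
        List<String> values = new ArrayList<>();
        for (int doc = hits.nextSetBit(0); doc >= 0; doc = hits.nextSetBit(doc + 1)) {
            if (store[doc] != null) values.add(store[doc]);
        }
        return values;
    }
}
```

Walking the BitSet visits doc ids in increasing order, which tends to be friendlier to sequential reads than fetching in whatever order the hits arrive.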

Re: Fastest Method for Searching (need all results)

Otis Gospodnetic-2
I haven't had the chance to use this new feature yet, but have you tried with selective field loading, so that you can load only that 1 field from your index and not all of them?

Otis
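
The idea behind selective field loading can be illustrated without the Lucene API. The class below is a hypothetical stand-in, not Lucene code: SelectiveLoadSketch, store(), loadAll(), and loadOnly() are made-up names, and the "stored" map only imitates fields kept as raw bytes. It shows that loading the whole document decodes every field, while a selective load decodes only the one requested.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a stored document; not a Lucene class.
class SelectiveLoadSketch {
    // Each stored field is kept as raw UTF-8 bytes, imitating on-disk storage.
    private final Map<String, byte[]> stored = new LinkedHashMap<>();
    int decodedFields = 0; // counts how many field values were decoded

    void store(String name, String value) {
        stored.put(name, value.getBytes(StandardCharsets.UTF_8));
    }

    // Decodes every stored field, like loading the whole document.
    Map<String, String> loadAll() {
        Map<String, String> doc = new HashMap<>();
        for (Map.Entry<String, byte[]> e : stored.entrySet()) {
            doc.put(e.getKey(), decode(e.getValue()));
        }
        return doc;
    }

    // Decodes only the requested field and skips the rest.
    String loadOnly(String name) {
        byte[] raw = stored.get(name);
        return raw == null ? null : decode(raw);
    }

    private String decode(byte[] raw) {
        decodedFields++;
        return new String(raw, StandardCharsets.UTF_8);
    }
}
```

With documents that carry several large stored fields, the saving per hit is roughly the cost of the fields that are never decoded.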

----- Original Message ----
From: Ryan O'Hara <[hidden email]>
To: [hidden email]
Sent: Friday, July 21, 2006 2:43:41 PM
Subject: Fastest Method for Searching (need all results)

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Fastest Method for Searching (need all results)

Mark Miller-3
In reply to this post by rjohara
Perhaps I am speaking too quickly, but I would start by not grabbing
the value of the field for every document in the result set. Will
someone really see or use that value for a couple million hits? Could
be, I suppose... but if not, then axe it. Grab the first few thousand
(or MUCH fewer), and if they need more, head back in and grab more.


- mark


Re: Fastest Method for Searching (need all results)

rjohara
> Perhaps I am speaking too quickly, but I would try by not grabbing  
> the value of the field for every document in the results set.  
> Someone will see that value or use it for a couple million hits?  
> Could be I suppose...but if not than axe it. Grab the first few  
> thousand (or MUCH less) and if they need more head back in and grab  
> more.
>
>
> - mark

I need all values of a certain field from each document.  More  
specifically, I need a compilation of all symbols in the result set.


Re: Fastest Method for Searching (need all results)

Mark Miller-3
In reply to this post by Otis Gospodnetic-2
> Provides a new api, IndexReader.document(int doc, String[] fields).  A document containing
> only the specified fields is created.  The other fields of the document are not loaded, although
> unfortunately uncompressed strings still have to be scanned because the length information
> in the index is for UTF-8 encoded chars and not bytes.  This is useful for applications that
> need quick access to a small subset of the fields.  It can be used in conjunction with or
> for some uses instead of ParallelReader.

Does this mean that you must compress the fields to really take advantage of this? Or does 'scanned' not imply a load?

- mark


Otis Gospodnetic wrote:
> I haven't had the chance to use this new feature yet, but have you tried with selective field loading, so that you can load only that 1 field from your index and not all of them?
>
> Otis

Re: Fastest Method for Searching (need all results)

rjohara
In reply to this post by Otis Gospodnetic-2
> I haven't had the chance to use this new feature yet, but have you  
> tried with selective field loading, so that you can load only that  
> 1 field from your index and not all of them?

I have not tried selective field loading, but it sounds like a good  
idea.  What class is it in?  Any more information would be  
appreciated.  Thanks again.

Ryan


Re: Fastest Method for Searching (need all results)

eks dev
In reply to this post by rjohara
Have you tried collecting only the doc ids to see whether the speed problem is still there, or fetching only the field values? If you have dense results, it could easily be split() or addSymbolsToHash() that takes the time.

I see three things that could be slow: getting the doc ids, fetching the field value, or doing something with that value.

It would be interesting to know what you get here.

Yeah, I know it sounds too naive, but sometimes repeating the obvious helps.
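
A rough way to check which of the three stages dominates is to time each one in isolation. The sketch below uses synthetic data and makes no Lucene calls: collectIds(), fetchValues(), and process() are hypothetical stand-ins for the search, the field lookup, and the split()/addSymbolsToHash() work. In a real test, each stand-in would be swapped for the actual call while the other two are left out.

```java
import java.util.HashSet;
import java.util.Set;

// Times the three stages separately: collecting ids, fetching values,
// and processing values. The data is synthetic; no Lucene calls are made.
class StageTimingSketch {
    static final String SEP = "ENDOFSYM";

    // Stage 1 stand-in: pretend every doc id in [0, n) matched.
    static int[] collectIds(int n) {
        int[] ids = new int[n];
        for (int i = 0; i < n; i++) ids[i] = i;
        return ids;
    }

    // Stage 2 stand-in: build a synthetic field value for each id.
    static String[] fetchValues(int[] ids) {
        String[] values = new String[ids.length];
        for (int i = 0; i < ids.length; i++) {
            values[i] = "S" + (ids[i] % 100) + SEP + "T" + (ids[i] % 7);
        }
        return values;
    }

    // Stage 3: split each value and add the symbols to a set.
    static Set<String> process(String[] values) {
        Set<String> symbols = new HashSet<>();
        for (String v : values) {
            for (String s : v.split(SEP)) symbols.add(s);
        }
        return symbols;
    }

    public static void main(String[] args) {
        long t0 = System.nanoTime();
        int[] ids = collectIds(1_000_000);
        long t1 = System.nanoTime();
        String[] values = fetchValues(ids);
        long t2 = System.nanoTime();
        Set<String> symbols = process(values);
        long t3 = System.nanoTime();
        System.out.printf("ids: %d ms, fetch: %d ms, process: %d ms, symbols: %d%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000,
                (t3 - t2) / 1_000_000, symbols.size());
    }
}
```

Single-run wall-clock numbers on the JVM are noisy, so it is worth repeating each measurement a few times before drawing conclusions.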


Re: Fastest Method for Searching (need all results)

rjohara
eks dev,

The best way of looping through all results that I have come across
is using a HitCollector and grabbing the field values via FieldCache.
This holds under two conditions:

1) The FieldCache arrays are initialized only once, since creating
these arrays incurs serious overhead, especially if you have millions
of documents in your index.  I use Tomcat as my application server, so
I accomplished this by creating a listener class that implements
ServletContextListener.  When Tomcat starts or restarts, the
contextInitialized method in the listener class is executed,
initializing the arrays only once.  These arrays are then accessible
to all users across all sessions.

2) You have enough RAM to store the arrays.  If you are dealing with
millions of documents, you can easily use up hundreds of megabytes of
RAM, so keep this in mind.

Just thought I would let you know how I made out.  Thanks again.

Ryan
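
The two ingredients described above, a per-document array that is built exactly once and a collector that does an array lookup per hit, can be sketched in plain Java. This is an illustration under stated assumptions, not the code from this post: SymbolCacheSketch, loadAll(), and the sample values are hypothetical, the array stands in for what FieldCache would return, and a static holder class is used in place of the ServletContextListener to keep the example free of servlet dependencies.

```java
import java.util.Set;

// Hypothetical stand-in for a per-document value cache; not Lucene's FieldCache.
class SymbolCacheSketch {
    static int loads = 0; // how many times the array was built

    // Initialization-on-demand holder: the JVM runs this initializer once,
    // on first access, and publishes the array safely to all threads.
    private static final class Holder {
        static final String[] SYM_BY_DOC = loadAll();
    }

    // Stand-in for the expensive pass over the index (assumed sample data).
    private static String[] loadAll() {
        loads++;
        return new String[] { "IBMENDOFSYMMSFT", null, "MSFTENDOFSYMORCL" };
    }

    static String[] symbols() {
        return Holder.SYM_BY_DOC;
    }

    // What a collector callback would do per hit: an array lookup by doc id
    // instead of loading the stored document.
    static void collect(int doc, Set<String> out) {
        String value = symbols()[doc];
        if (value != null) {
            for (String s : value.split("ENDOFSYM")) out.add(s);
        }
    }
}
```

The array is indexed by doc id, so it is only valid for the index snapshot it was built from and would need rebuilding whenever the index is reopened.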
