confused about an entry in the FAQ

Stephane Nicoll
From the FAQ:

"Don't iterate over more hits than needed.
Iterating over all hits is slow for two reasons. Firstly, the search()
method that returns a Hits object re-executes the search internally
when you need more than 100 hits. Solution: use the search method that
takes a HitCollector instead."

I had a look at HitCollector, but it only hands back the document ID, and
the javadoc recommends not fetching the original document there.

I have to return *one* indexed field from the query result.
Currently I am iterating over all results and it's slow. Can you explain
a bit more how I could improve this?

Thanks,
Stéphane


--
"Large Systems Suck: This rule is 100% transitive. If you build one,
you suck" -- S.Yegge

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: confused about an entry in the FAQ

John Wang-9
If your indexed field is not needed for further filtering or scoring of the
documents, you should use some sort of priority-queue mechanism to gather
the top N documents. You can then call reader.document() on those documents
if necessary.

-John
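
A minimal sketch of the priority-queue idea described above, using only the JDK. The class name TopNCollector and its methods are illustrative and are not Lucene API. In a real search, the body of collect(int, float) would sit inside a HitCollector subclass, and reader.document() would be called only for the N document numbers that survive.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical helper: keeps only the N best-scoring document numbers seen so far. */
final class TopNCollector {

    /** A (document number, score) pair, ordered by ascending score. */
    static final class ScoredDoc implements Comparable<ScoredDoc> {
        final int doc;
        final float score;

        ScoredDoc(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }

        @Override
        public int compareTo(ScoredDoc other) {
            return Float.compare(score, other.score);
        }
    }

    private final int n;
    // Min-heap: the head is the weakest hit currently retained.
    private final PriorityQueue<ScoredDoc> queue = new PriorityQueue<ScoredDoc>();

    TopNCollector(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be positive");
        }
        this.n = n;
    }

    /** Same shape as the collect(int, float) callback; performs no document I/O. */
    void collect(int doc, float score) {
        if (queue.size() < n) {
            queue.add(new ScoredDoc(doc, score));
        } else if (score > queue.peek().score) {
            queue.poll();                        // evict the weakest retained hit
            queue.add(new ScoredDoc(doc, score));
        }
    }

    /** Returns the retained document numbers, best score first. */
    List<Integer> topDocs() {
        List<ScoredDoc> sorted = new ArrayList<ScoredDoc>(queue);
        Collections.sort(sorted, Collections.reverseOrder());
        List<Integer> docs = new ArrayList<Integer>(sorted.size());
        for (ScoredDoc sd : sorted) {
            docs.add(sd.doc);
        }
        return docs;
    }
}
```

Each search would use its own TopNCollector instance, so nothing is shared between threads. Memory stays bounded by N no matter how many hits the query produces.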

On Sat, May 10, 2008 at 6:35 AM, Stephane Nicoll <[hidden email]>
wrote:

> I have to return *one* indexed field from the query result and
> currently I am iterating on all results and it's slow. Can you explain
> a bit more how I could improve this?

Re: confused about an entry in the FAQ

Stephane Nicoll
Sorry, I don't get it. Do you have some sample code?

Sent from my iPhone

On 10 May 2008, at 17:43, "John Wang" <[hidden email]> wrote:

> If your indexed field is not used to further filtering out the doc nor
> further scoring, you should use some sort of priority queueing mechanism to
> gather the top N documents. You can then call reader.document() on those
> docs if necc.

Re: confused about an entry in the FAQ

Patrek
In reply to this post by Stephane Nicoll
Did you try the IndexSearcher.doc(int i, FieldSelector fieldSelector) method?

It could be faster because Lucene doesn't have to "prepare" the whole document.

Patrick


Re: confused about an entry in the FAQ

Stephane Nicoll
I tried all this and I am confused by the result. I am trying to
implement a hybrid query handler: I fetch the IDs from a database
criteria query and the IDs from a full-text Lucene query, then
intersect them to return the result to the user. The database query
and the intersection work fine even under high load. However, the
Lucene query gets much slower as the number of concurrent users
rises.

Here is what I am doing on the Lucene side:

        final QueryParser queryParser =
                new QueryParser(criteria.getDefaultField(), analyzer);
        final Query q = queryParser.parse(criteria.getFullTextQuery());
        // Index Searcher is shared for all threads and is not
        // reopened during the load test
        final IndexSearcher indexSearcher = getIndexSearcher();
        final Set<Long> result = new TreeSet<Long>();
        indexSearcher.search(q, new HitCollector() {
            public void collect(int i, float v) {
                try {
                    final Document d =
                            indexSearcher.getIndexReader().document(i, new FieldSelector() {
                        public FieldSelectorResult accept(String s) {
                            if (s.equals(CatalogItem.ATTR_ID)) {
                                return FieldSelectorResult.LOAD;
                            } else {
                                return FieldSelectorResult.NO_LOAD;
                            }
                        }
                    });
                    result.add(Long.parseLong(d.get(CatalogItem.ATTR_ID)));
                } catch (IOException e) {
                    throw new RuntimeException("Could not collect lucene IDs", e);
                }
            }
        });
        return result;


When running with one thread, I have the following figures per test:

Database query is done in[125 msecs] (size=598]
Lucene query is done in[80 msecs (size=15204]
Intersect is done in[4 msecs] (size=103]
Hybrid query is done in[97 msecs]

-> 327 msec / user

When running with ten threads, I have the following figures per user per test:

Database query is done in[222 msecs] (size=94]
Lucene query is done in[2364 msecs (size=15367]
Intersect is done in[0 msecs] (size=12]
Hybrid query is done in[18 msecs]

-> 2.5 sec / user !!

I am just wondering how I can improve this. Clearly there is something
wrong in my code, since it's much slower with multiple threads running
concurrently on the same index. The size of the index is 5 MB, and I only
store:

* an "id" field (which is the primary key of the related object in the db)
* a "class" field, which is the class name of the related object
  (Hibernate Search does that for me)

The "keywords" field is indexed but not stored, as it is a
representation of other data stored in the db. The searches are
performed on the keywords field only ("foo AND bar" is a typical
query).

Any help is appreciated. If you also know a Spring bean that could
take care of opening/closing the index readers properly, let me know.
Hibernate Search introduces deadlocks with multiple threads, and the
Lucene integration in Spring Modules does not seem to do what I want.

Thanks,
Stéphane


On Sat, May 10, 2008 at 8:05 PM, Patrick Turcotte <[hidden email]> wrote:

> Did you try the IndexSearcher.doc(int i, FieldSelector fieldSelector)  method?
>
>  Could be faster because Lucene don't have do "prepare" the whole document.
>
>  Patrick

Re: confused about an entry in the FAQ

Stephane Nicoll
ping. Sorry for the long email but I prefer to provide all information first.


Re: confused about an entry in the FAQ

Otis Gospodnetic-2
In reply to this post by Stephane Nicoll
pong.
Is that the optimal use of FieldSelector? What happens if you remove it from that HitCollector.collect method?
It looks like you are creating a new FieldSelector object for each hit found in each search thread.

If it's not that, is the index optimized?
If not, does optimizing it make a difference?

You are also examining each and every Document in the result set. Do you really need to do that? That's expensive and you may be witnessing the cost.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
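
One way to act on the first and last points above, sketched with the JDK only: record the matching document numbers inside the callback, then read the stored field afterwards through a single loader that is created once. DeferredIdLoader and the idLoader parameter are hypothetical names. idLoader stands in for a call such as reader.document(doc, selector).get(...) made with one shared FieldSelector instance. Whether this removes the slowdown reported in this thread is not established here.

```java
import java.util.BitSet;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.IntFunction;

/** Hypothetical helper: records hits during the search, loads stored values afterwards. */
final class DeferredIdLoader {

    private final BitSet matches;

    /** maxDoc is an initial size hint; the BitSet grows if a larger number is set. */
    DeferredIdLoader(int maxDoc) {
        this.matches = new BitSet(maxDoc);
    }

    /** Same shape as collect(int, float): only flips a bit, no I/O in the inner loop. */
    void collect(int doc, float score) {
        matches.set(doc);
    }

    /**
     * Visits the recorded hits in increasing document order and resolves each
     * one through idLoader (an assumption standing in for the index reader).
     */
    Set<Long> loadIds(IntFunction<String> idLoader) {
        Set<Long> ids = new TreeSet<Long>();
        for (int doc = matches.nextSetBit(0); doc >= 0; doc = matches.nextSetBit(doc + 1)) {
            ids.add(Long.parseLong(idLoader.apply(doc)));
        }
        return ids;
    }
}
```

The callback stays cheap because it allocates nothing per hit, and the field reads happen in one pass after the search has finished. One instance per search keeps the state private to the calling thread.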


----- Original Message ----

> From: Stephane Nicoll <[hidden email]>
> To: [hidden email]
> Sent: Wednesday, May 14, 2008 2:38:25 AM
> Subject: Re: confused about an entry in the FAQ
>
> ping. Sorry for the long email but I prefer to provide all information first.

Re: confused about an entry in the FAQ

Emmanuel Bernard
In reply to this post by Stephane Nicoll
Hi Stephane,
Can you tell me a bit more about the deadlocks you experience with
Hibernate Search? I have not seen such a situation so far and am
interested to see how to fix the problem.

Emmanuel

On  May 12, 2008, at 06:13, Stephane Nicoll wrote:

> Hibernate Search introduces deadlock with multiple threads and the
> lucene integration in spring modules does not seeem to do what I want.



Re: confused about an entry in the FAQ

Stephane Nicoll
On Sat, May 24, 2008 at 12:39 AM, Emmanuel Bernard
<[hidden email]> wrote:
> Hi Stephane
> Can you tell me a bit more about the deadlocks you experience with Hibernate
> Search. I have not seen such a situation so far and am interested to see how
> to fix the problem.

It is hard to extract a unit test since it relies on many factors.
You need a significant amount of data (100,000 documents) and
you need to browse all results in the Lucene index (15,000 results for
a typical query in my case). I still haven't found an optimized
way to do this, even though I only need one field from the search
result and the index is 5 MB. I could load that into memory, but that's
not a viable solution mid-term.

I've stopped using Lucene. I am using SQL LIKE for now and we are
investigating Oracle Text and the Postgres text extension.

If anyone has an idea, I'm interested. For instance, knowing that I get
fewer than 500 IDs from the database, would it be reasonable to build
a Lucene query like

"my search query AND (id IN (the list of 500 ids))"

Will this hit the TooManyClauses exception? How can I build such a query
efficiently?

Thanks,
Stéphane
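
A sketch of how the ID restriction asked about above might be written as a query string. It assumes the classic QueryParser syntax with field grouping, an "id" field indexed as a single untokenized term, and non-negative IDs. IdRestrictedQuery is a hypothetical helper, not part of Lucene. BooleanQuery's default limit is 1024 clauses, so a list of 500 IDs should stay under it unless that limit has been lowered.

```java
import java.util.Collection;
import java.util.Iterator;

/** Hypothetical helper: restricts a full-text query to a known set of IDs. */
final class IdRestrictedQuery {

    // Default clause limit of Lucene's BooleanQuery; going past it raises TooManyClauses.
    static final int DEFAULT_MAX_CLAUSES = 1024;

    /**
     * Builds "+(textQuery) +idField:(id1 OR id2 ...)".
     * Both parts are required, so only documents matching the text query
     * AND carrying one of the given IDs are returned.
     */
    static String build(String textQuery, String idField, Collection<Long> ids) {
        if (ids.isEmpty()) {
            throw new IllegalArgumentException("no ids to match");
        }
        if (ids.size() > DEFAULT_MAX_CLAUSES) {
            throw new IllegalArgumentException("too many ids for one boolean query: " + ids.size());
        }
        StringBuilder sb = new StringBuilder();
        sb.append("+(").append(textQuery).append(") +").append(idField).append(":(");
        Iterator<Long> it = ids.iterator();
        while (it.hasNext()) {
            sb.append(it.next());
            if (it.hasNext()) {
                sb.append(" OR ");
            }
        }
        return sb.append(')').toString();
    }
}
```

For example, build("foo AND bar", "id", ids) with the IDs 3, 17 and 42 yields +(foo AND bar) +id:(3 OR 17 OR 42), which could then be handed to the same QueryParser used for the plain text query. With this shape the search returns at most as many hits as there are IDs, instead of the 15,000 hits of the unrestricted query.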

