efficient way to filter out unwanted results


Jay-98
Hi everyone,

I am trying to remove several docs from the search results each time I do a
query. The docs can be identified by external ids which are
saved/indexed. I could use a Query or QueryFilter to achieve this, but I'm
not sure it's the most efficient way to do it.
Does anyone have any experience or ideas?
Thanks!

Jay
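
For reference, the Query/QueryFilter approach Jay mentions could look roughly
like the sketch below (Lucene 2.x API). The field name "extId" and the
variable names are assumptions, not taken from Jay's setup:

import java.io.IOException;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.TermQuery;

public class ExcludeByExternalId {
    // Wraps the user's query so that any document whose "extId" field matches
    // one of the given external ids is dropped from the result set.
    public static Hits searchExcluding(Searcher searcher, Query userQuery,
                                       String[] excludedIds) throws IOException {
        BooleanQuery q = new BooleanQuery();
        q.add(userQuery, BooleanClause.Occur.MUST);
        for (int i = 0; i < excludedIds.length; i++) {
            q.add(new TermQuery(new Term("extId", excludedIds[i])),
                  BooleanClause.Occur.MUST_NOT);
        }
        return searcher.search(q);
    }
}

The MUST_NOT clauses are evaluated as part of the query, so the cost grows
with the number of excluded ids; the HitCollector approaches suggested later
in the thread trade this for a per-hit check instead.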

Re: efficient way to filter out unwanted results

ssharma
Hello Jay,

I am not sure how well I have understood your problem, but as far as I can
tell, you can try the HitCollector class and its collect method. There you
get the doc ID for each hit and can drop unwanted documents while searching.

Hope it will be useful.

Sawan
(Chambal.com inc. NJ USA)
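
A minimal sketch of this idea (Lucene 2.x, where HitCollector was the
low-level collector API). It assumes the Lucene-internal doc ids to skip are
already known, which, as Jay notes below, is usually not the case when only
external ids are available:

import java.util.BitSet;
import org.apache.lucene.search.HitCollector;

public class SkippingCollector extends HitCollector {
    private final BitSet excludedDocIds;            // internal doc ids to drop
    private final BitSet collectedDocIds = new BitSet();

    public SkippingCollector(BitSet excludedDocIds) {
        this.excludedDocIds = excludedDocIds;
    }

    // Called once for every matching document; unwanted ones are simply ignored.
    public void collect(int doc, float score) {
        if (!excludedDocIds.get(doc)) {
            collectedDocIds.set(doc);
        }
    }

    public BitSet getCollectedDocIds() {
        return collectedDocIds;
    }
}

It would be run with something like
searcher.search(query, new SkippingCollector(excluded)).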





Re: efficient way to filter out unwanted results

Jay-98
Thanks, Sawan, for the suggestion.
I guess this will work for statically known doc ids. In my case, I only know
the external ids that I want to exclude from the result set for each search.
Of course, I can always exclude these docs in a post-search step; I am
curious whether there is a more efficient approach.

Thanks again for your help.

Jay



Re: efficient way to filter out unwanted results

adb

When you open a searcher, you could create a cached array of all your external
Ids with their Lucene DocId.  Using a custom HitCollector, which can be created
with the Ids you wish to exclude, you can get a document's external Id during
the collect() method using the docid.  Then just check the external Id of the
matched document against the exclusion list.

As long as you have your searcher open, the cache will remain valid.
Antony
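
A rough sketch of Antony's suggestion (Lucene 2.x API). FieldCache is used
here as one way to build the docid-to-external-id array when the searcher is
opened; the field name "extId" is an assumption, and FieldCache requires that
field to hold a single untokenized term per document:

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.HitCollector;

public class ExternalIdExcludingCollector extends HitCollector {
    private final String[] docIdToExtId;     // cached once per open reader
    private final Set<String> excludedExtIds;
    private final Set<Integer> keptDocIds = new HashSet<Integer>();

    public ExternalIdExcludingCollector(IndexReader reader,
                                        Set<String> excludedExtIds) throws IOException {
        // Valid for as long as this reader/searcher stays open.
        this.docIdToExtId = FieldCache.DEFAULT.getStrings(reader, "extId");
        this.excludedExtIds = excludedExtIds;
    }

    // Look up the external id for each hit and drop it if it is excluded.
    public void collect(int doc, float score) {
        if (!excludedExtIds.contains(docIdToExtId[doc])) {
            keptDocIds.add(Integer.valueOf(doc));
        }
    }

    public Set<Integer> getKeptDocIds() {
        return keptDocIds;
    }
}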





Re: efficient way to filter out unwanted results

Jay-98
Thanks, Antony, for the idea.
The only thing that may prevent it from working well is that the index is
updated frequently, so the docid-to-external-id cache needs to be rebuilt
frequently, which may affect performance.

Thanks again for your help.



Lucene index performance

Lee Li Bin
Hi,

I would like to know what the performance of indexing and searching would be
like on large index files.

Also, is it possible to create multiple index files and search across them?
If so, how could it be done?

Thanks a lot.






Re: Lucene index performance

Mark Miller-3


Lee Li Bin wrote:
> I would like to know what the performance of indexing and searching would
> be like on large index files.
Fast.
> And is it possible to create multiple index files and search across them?
Yes.
> If so, how could it be done?
Check out MultiSearcher.
http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/search/MultiSearcher.html
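
A minimal usage sketch (Lucene 2.x API; the index paths, field name, and
query string are placeholders):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searchable;

public class MultiIndexSearch {
    public static void main(String[] args) throws Exception {
        // One IndexSearcher per index directory, combined by MultiSearcher.
        Searchable[] searchers = {
            new IndexSearcher("/path/to/index1"),
            new IndexSearcher("/path/to/index2")
        };
        MultiSearcher searcher = new MultiSearcher(searchers);

        Query query = new QueryParser("contents", new StandardAnalyzer())
                .parse("lucene");
        Hits hits = searcher.search(query);
        System.out.println(hits.length() + " hits across both indexes");

        searcher.close();
    }
}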


Re: Lucene index performance

Andreas Guther-2-2
Searching on multiple index files is incredibly fast.  We have 10 different
index folders with different sizes.  All folders together have a size of 7
GB.  Results usually come back within less than 50 ms.  Getting results out
of the index, i.e. reading documents, is expensive, and you will have to
spend time here to get good performance.  You will need to look into
- TopDocs
- Extracting results in an ordered way, i.e. sorted by index and, within an
index, by document id.  This helps to minimize disk head jumps and gave me a
tremendous boost.
- Extracting only what you need (using a special read filter; I do not recall
the name right now and I do not have access to my sources at the moment of
writing this).  See the sketch below.

Andreas
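
A sketch of the "read in doc id order, fetch only what you need" advice above
(Lucene 2.x API). The read filter Andreas could not recall is presumably the
FieldSelector/MapFieldSelector API; the field name "title" and the result
count are placeholders:

import java.util.Arrays;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.FieldSelector;
import org.apache.lucene.document.MapFieldSelector;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;

public class OrderedFetch {
    public static void printTitles(IndexSearcher searcher, Query query) throws Exception {
        TopDocs top = searcher.search(query, null, 100);

        // Sort the hits by Lucene doc id so the document reads walk the index
        // sequentially instead of jumping around on disk.
        int[] docIds = new int[top.scoreDocs.length];
        for (int i = 0; i < docIds.length; i++) {
            docIds[i] = top.scoreDocs[i].doc;
        }
        Arrays.sort(docIds);

        // Load only the fields actually needed for display.
        FieldSelector onlyTitle = new MapFieldSelector(new String[] {"title"});
        IndexReader reader = searcher.getIndexReader();
        for (int i = 0; i < docIds.length; i++) {
            Document doc = reader.document(docIds[i], onlyTitle);
            System.out.println(doc.get("title"));
        }
    }
}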



RE: Lucene index performance

Fang_Li
Hi Andreas,
        I am very interested in the multiple-index setup for indexing and
searching. Could you kindly help me with the following questions?
1) Why do you use multiple index files? How much is the performance gain for
both indexing and searching? Someone reported that there is no big
performance difference unless the number of indices is huge, like 1000.
2) Are these index files located on a single machine or distributed across
multiple machines?
3) How do you distribute the documents into several index files?

Thanks a lot,
Li


Re: Lucene index performance

Otis Gospodnetic-2
----- Original Message ----
From: Lee Li Bin <[hidden email]>

> I would like to know what the performance of indexing and searching would
> be like on large index files.

OG: It depends ;)
- on your hardware (fast disk?  lots of RAM?  multi-CPU?  multi-core?)
- on the size of the data you're indexing (one field with 1 KB of data or 10 fields with 10 KB each?)
- on the field indexing options (indexing? tokenizing? storing? compressing? term vectors enabled? payloads?) -- a sketch follows below

Who was it that mentioned a DB server that can insert 1000 rows/second the other day?  Sounds rather high. :)


Otis
--

Lucene Consulting -- http://lucene-consulting.com/
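
For illustration, the per-field indexing options listed above map onto the
Field constructor arguments in the 2.x API. A minimal sketch; the field names
and values are placeholders, not from anyone's actual schema:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class FieldOptionsExample {
    // Builds a document showing a few combinations of store/index/term-vector options.
    public static Document build(String bodyText, String externalId, String rawData) {
        Document doc = new Document();
        // Full-text field: tokenized, stored, with term vectors (the most expensive combination).
        doc.add(new Field("body", bodyText,
                          Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.YES));
        // Identifier field: indexed as a single term, stored, no term vectors.
        doc.add(new Field("extId", externalId,
                          Field.Store.YES, Field.Index.UN_TOKENIZED));
        // Stored-only field: compressed, never searched.
        doc.add(new Field("raw", rawData, Field.Store.COMPRESS, Field.Index.NO));
        return doc;
    }
}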







RE: Lucene index performance

Andreas Guther
Hi Li,

Sorry for taking so long to answer your questions.

We came up with splitting our index into smaller units after we realized
that we have to deal with an index many GB in size.  Updating and
optimizing such large files becomes a bottleneck.  We partitioned our
index based on when the indexed units were created.  Updates usually
happen only on current units and rarely on units from previous years.

In terms of performance I think there is very little difference and, as
stated in another response, it really depends on your hardware.

All index directories are located on the same box and drive.

The documents are not distributed into several files.  I suppose you are
not talking about a Lucene document but rather about an indexed unit.  It
really depends on how you organize your index, but my experience is not to
split one indexed unit into parts.  When I started to index our units we
separated metadata from aggregated units, for example a book's meta
information (ISBN etc.) and its pages.  Each page (or aggregated unit) was
a single Lucene document.  This made it somewhat difficult to assemble the
information as the UI dictated, and we went back to treating one unit and
its aggregates as a single Lucene document, which made the reading faster.

Andreas

