Large queries

Large queries

Trond Aksel Myklebust-2
How does Lucene handle very large queries? I have 6 million documents, each
of which has a "docID" field. There are 20,000 distinct docIDs in total, so
many documents share the same docID, which is a filename (name only, no
path).

Sometimes I need to get all documents that have one of 10 docIDs, and
sometimes all documents that have one of 10,000 docIDs. Is there any other
way than running a query like docID:(file1 file2 file3 file4..) ?

 

Trond A Myklebust

 


Re: Large queries

jian chen
Hi, Trond,

Lucene should have no problem handling 6 million documents.

For your query, it seems you want a disjunctive (OR'ed) query over multiple
terms, 10 terms or 10,000 terms for example. In the worst case, you can
quite easily write your own query class to handle this, using the TermDocs
iterator class.

Say you want the documents that have one of 10 docIDs: you then have 10
TermDocs, one per term. You can do a multi-way (in this case, 10-way) merge
of these 10 TermDocs and generate the final list of doc ids (Lucene's
internal document numbers, not your docID field).
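
To make the merge concrete, here is a minimal sketch in plain Java. It does
not use the Lucene API: each int[] is an assumed stand-in for the sorted
document numbers that one TermDocs would return, and the class name
PostingsUnion is made up for illustration. Treat it as an outline of the
idea, not as a ready-made query class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical helper, not part of Lucene. Each int[] stands in for the
// sorted document numbers that one TermDocs iterator would produce.
class PostingsUnion {

    /** Merges sorted lists into one sorted list without duplicates. */
    static List<Integer> union(int[][] postings) {
        // Heap entries are {current doc number, list index, position in list},
        // ordered by current doc number.
        PriorityQueue<int[]> heap =
                new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        for (int i = 0; i < postings.length; i++) {
            if (postings[i].length > 0) {
                heap.add(new int[] {postings[i][0], i, 0});
            }
        }

        List<Integer> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            int doc = top[0];
            // Emit each document once, even if several lists contain it.
            if (result.isEmpty() || result.get(result.size() - 1) != doc) {
                result.add(doc);
            }
            // Advance the list that supplied this entry.
            int list = top[1];
            int next = top[2] + 1;
            if (next < postings[list].length) {
                heap.add(new int[] {postings[list][next], list, next});
            }
        }
        return result;
    }
}
```

With k lists and n entries in total, the heap keeps the cost at roughly
n log k comparisons, so going from 10 to 10,000 terms mostly grows the heap,
not the per-entry work.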

I suggest looking at PhraseQuery and PhraseScorer to see how they do the
conjunctive merge to find the docs that contain all the terms. In your case,
instead of an intersection, you are doing a union of all the term docs,
right?
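
For contrast, a conjunctive merge over two sorted lists can be sketched as
below. This is plain Java with assumed stand-in arrays and a made-up class
name; it is not how PhraseScorer is actually written, only an illustration of
what "intersection" means for sorted document numbers.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene. Both arrays are assumed to hold
// document numbers in ascending order.
class PostingsIntersection {

    /** Returns the document numbers that appear in both sorted lists. */
    static List<Integer> intersect(int[] a, int[] b) {
        List<Integer> result = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {
                // Present in both lists: keep it and advance both cursors.
                result.add(a[i]);
                i++;
                j++;
            } else if (a[i] < b[j]) {
                i++; // a is behind, let it catch up
            } else {
                j++; // b is behind, let it catch up
            }
        }
        return result;
    }
}
```

A union walks the lists the same way but keeps every document number it
sees instead of only the ones found in all lists.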

Maybe a query class that ships with Lucene already does this. Even so, the
method I described should help, just in case.

Cheers,

Jian


On 10/16/05, Trond Aksel Myklebust <[hidden email]> wrote:

> How is Lucene handling very large queries? I have 6million documents, which
> each has a "docID" field. There is a total of 20000 distinct docID's, so
> many documents got the same docID which consists of a filename (only name,
> not path).
>
> Sometimes, I must get all documents that has one of 10 docID's, and
> sometimes I need to get all documents that has one of 10000 docIDs. Is there
> any other way than doing a query: docID:(file1 file2 file3 file4..) ?
>
> Trond A Myklebust

Re: Large queries

jian chen
In reply to this post by Trond Aksel Myklebust-2
Hi, Trond,

By the way, it appears to me that Lucene uses the iterator pattern a lot,
for example in SegmentTermEnum, TermDocs, and TermPositions. Each iterator
uses an underlying fixed-size buffer to load one chunk of data at a time. So
even if you have millions of documents, you shouldn't run into memory
problems, given Lucene's index data structure.

That said, there are some parameters you can tweak for a really large number
of documents. But I suggest you don't worry about them unless you actually
run into a problem.
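
As a rough illustration of that pattern (not Lucene's actual code), the
sketch below iterates over a stream through a fixed-size buffer, so memory
use depends on the buffer size and not on the amount of data. The class name
ChunkedReader and the one-byte-per-entry layout are assumptions made for the
example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical iterator, not part of Lucene. It treats every byte of the
// stream as one entry and refills a fixed-size buffer when it runs dry.
class ChunkedReader {
    private final InputStream in;
    private final byte[] buffer;
    private int pos = 0;   // next unread position in the buffer
    private int limit = 0; // number of valid bytes in the buffer

    ChunkedReader(InputStream in, int bufferSize) {
        this.in = in;
        this.buffer = new byte[bufferSize];
    }

    /** Returns the next entry (0 to 255), or -1 when the stream is exhausted. */
    int next() {
        if (pos == limit) {
            try {
                // Load the next chunk; only buffer.length bytes are ever held.
                limit = in.read(buffer, 0, buffer.length);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            pos = 0;
            if (limit <= 0) {
                limit = 0;
                return -1;
            }
        }
        return buffer[pos++] & 0xFF;
    }
}
```

A caller loops with something like for (int v = r.next(); v != -1; v =
r.next()) and never sees the chunk boundaries.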

Cheers,

Jian
