Get all terms of a specific field

Philippe-52
Hi,

what would be the fastest way to get all terms for all documents
matching a specific query?

So far I:

1.) Query the index
2.) Retrieve all ScoreDocs
3.) Iterate over the ScoreDocs and retrieve all terms using the getValues
method and a customised "FieldSelector"

However, retrieving and iterating over the ScoreDocs is quite costly. Is
there a better/faster way to do this?

Cheers,
     Philippe

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: Get all terms of a specific field

Grant Ingersoll-2


On Jul 27, 2010, at 8:50 AM, Philippe wrote:

> Hi,
>
> what would be the fastest way to get all terms for all documents matching a specific query?
>
> So far I:
>
> 1.) Query the index
> 2.) Retrieve all scoreDocs
> 3.) Iterate the scoreDocs and retrieve all terms using the getValues method and a customised "FieldSelector"
>
> However, retrieving and iterating the scoredocs is quite costly.  So is there a better/faster way to perform this?


If you can afford to store TermVectors (disk is cheap, right?), they will give you back the terms post-analysis, so you won't have to split the text again, which you would have to do with the getValues() approach. You might also hook into the Collector (HitCollector) and build the term set as you go, assuming you don't need the ScoreDocs structure.
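
A minimal, library-free sketch of the term-vector idea follows. It is a model, not Lucene code: the class `TermVectorSketch`, its `analyze` method (a stand-in for an analyzer), and the in-memory map are all assumptions made for illustration. The point it shows is that the analyzed terms are computed once per document at index time, so collecting the terms of the matching documents is a plain union with no re-splitting. In Lucene 3.x the corresponding pieces would be indexing the field with `Field.TermVector.YES` and reading `IndexReader.getTermFreqVector(docId, field).getTerms()`; check those names against the version in use.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Library-free model of per-document term vectors; not Lucene's API. */
class TermVectorSketch {
    // docId -> analyzed terms of one field, computed once at "index time"
    private final Map<Integer, Set<String>> termVectors = new TreeMap<>();
    private int nextDocId = 0;

    /** Hypothetical stand-in for an analyzer: lowercase, split on non-letters/digits. */
    static Set<String> analyze(String text) {
        Set<String> terms = new TreeSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        return terms;
    }

    /** Analyzes the field value once and stores the resulting term vector. */
    int addDocument(String fieldValue) {
        int docId = nextDocId++;
        termVectors.put(docId, analyze(fieldValue));
        return docId;
    }

    /** Union of the stored term vectors; no text is re-analyzed at query time. */
    Set<String> termsFor(Iterable<Integer> matchingDocIds) {
        Set<String> all = new TreeSet<>();
        for (int docId : matchingDocIds) {
            Set<String> vector = termVectors.get(docId);
            if (vector != null) {
                all.addAll(vector);
            }
        }
        return all;
    }

    public static void main(String[] args) {
        TermVectorSketch index = new TermVectorSketch();
        int a = index.addDocument("Lucene stores term vectors");
        index.addDocument("An unrelated document");
        int c = index.addDocument("Term vectors cost disk space");
        // Pretend documents a and c matched the query.
        System.out.println(index.termsFor(Arrays.asList(a, c)));
    }
}
```

The trade-off is the one mentioned above: the term vectors take extra space in the index in exchange for skipping analysis at retrieval time.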

-Grant



Re: Get all terms of a specific field

Philippe-52
Hi Grant,

thanks for the ideas. I implemented a custom Collector, which returns
all doc IDs. In the next step I collect all terms using a customised
FieldSelector. This implementation is about 2 to 3 times faster than my
previous implementation using only a customised FieldSelector.

However, I did not fully understand your first idea. During indexing I
can store the TermVectors on disk. What do I have to do during
retrieval? I mean, does Lucene automatically profit from the
TermVectors? Or do I have to use something different instead of getValues()?
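
The collect-as-you-go approach described above can be sketched roughly as follows. This is a library-free model under stated assumptions: `HitCallback`, `DocIdCollector`, and the toy `search` method are hypothetical names, not Lucene's API. It mirrors the shape of Lucene 3.x's `Collector`, where `collect(int doc)` receives segment-relative IDs and `setNextReader(reader, docBase)` supplies the offset needed to turn them into global doc IDs.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

/** Library-free model of a collect-as-you-go callback; names are hypothetical. */
class DocIdCollectorSketch {

    /** Minimal callback shape: the searcher reports each hit, with no scores or sorting. */
    interface HitCallback {
        void startSegment(int docBase);   // offset of the segment's first document
        void collect(int segmentDocId);   // doc ID relative to the current segment
    }

    /** Records global doc IDs in a bit set instead of building a ranked hit list. */
    static final class DocIdCollector implements HitCallback {
        private final BitSet hits = new BitSet();
        private int docBase = 0;

        @Override
        public void startSegment(int docBase) {
            this.docBase = docBase;
        }

        @Override
        public void collect(int segmentDocId) {
            hits.set(docBase + segmentDocId);
        }

        BitSet hits() {
            return hits;
        }
    }

    /** Toy searcher: a segment is a list of field values; a hit contains the query string. */
    static void search(List<List<String>> segments, String query, HitCallback callback) {
        int docBase = 0;
        for (List<String> segment : segments) {
            callback.startSegment(docBase);
            for (int i = 0; i < segment.size(); i++) {
                if (segment.get(i).contains(query)) {
                    callback.collect(i);
                }
            }
            docBase += segment.size();
        }
    }

    public static void main(String[] args) {
        List<List<String>> segments = Arrays.asList(
                Arrays.asList("red apple", "green pear"),
                Arrays.asList("yellow banana", "green apple"));
        DocIdCollector collector = new DocIdCollector();
        search(segments, "apple", collector);
        System.out.println(collector.hits()); // global doc IDs of the matches
    }
}
```

Skipping scores and the sorted hit array is plausibly where the speed-up comes from; forgetting to add the segment offset is the usual mistake with this pattern.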

Regards,
     Philippe

On 27.07.2010 at 17:17, Grant Ingersoll wrote:

> If you can afford to store TermVectors (disk is cheap, right?) then it will give you back the terms post analysis and you won't have to split again, which you would have to do if you use the getValues() approach.  You might also hook into the Collector (HitCollector) and build it as you go, assuming you don't need the score docs structure.