distinct field values

Akanksha Baid
I have indexed multiple documents, each of which has 3 fields (id, tag,
text). Is there an easy way to determine the set of tags for a given
query without iterating through all the hits?
For example, say I have 100 documents in my index and my set of tags is
{A, B, C}. Query Q on the text field returns 15 docs with tag A, 10 with
tag B and none with tag C (total of 25 hits). Is there a way to
determine that the set of tags for query Q is {A, B} without iterating
through all 25 hits?

Any ideas?

Thanks!
Akanksha


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: distinct field values

Anshum-2
Hi,

You could try changing (or extending) TopFieldDocCollector and doing your
processing there (that is what I tried, and it worked fine). But that
would mean changing the Lucene code a little bit.
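
As a rough illustration of the collector idea, here is a minimal sketch in
plain Python rather than Lucene code. The class name, the tag_by_doc array
(doc id -> tag) and the top_n parameter are assumptions made up for this
example; the only thing borrowed from Lucene is the shape of a collector
that gets called once per matching document with a doc id and a score.

```python
import heapq


class TagCollectingCollector:
    """Toy stand-in for a custom hit collector (not Lucene's actual API)."""

    def __init__(self, tag_by_doc, top_n):
        self.tag_by_doc = tag_by_doc  # hypothetical doc id -> tag array
        self.top_n = top_n
        self.tags = set()             # distinct tags seen among all hits
        self._heap = []               # min-heap of (score, doc_id), size <= top_n

    def collect(self, doc_id, score):
        # Called once per matching document while the query runs.
        self.tags.add(self.tag_by_doc[doc_id])
        if len(self._heap) < self.top_n:
            heapq.heappush(self._heap, (score, doc_id))
        elif (score, doc_id) > self._heap[0]:
            heapq.heapreplace(self._heap, (score, doc_id))

    def top_docs(self):
        # Best-scoring hits first.
        return [doc_id for _score, doc_id in sorted(self._heap, reverse=True)]
```

Every hit is still visited once, but the tag set is built in the same pass
that ranks the results, so no second walk over the hits is needed.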

--
Anshum Gupta
Naukri Labs!
http://ai-cafe.blogspot.com

The facts expressed here belong to everybody, the opinions to me. The
distinction is yours to draw............


On Tue, Oct 14, 2008 at 12:53 PM, Akanksha Baid <[hidden email]> wrote:

> I have indexed multiple documents - each of them have 3 fields ( id, tag ,
> text). Is there an easy way to determine the set of tags for a given query
> without iterating through all the hits?
> For example if I have 100 documents in my index and my set of tag = {A, B,
> C}. Query Q on the text field returns 15 docs with tag A , 10 with tag B and
> none with tag C (total of 25 hits). Is there a way to determine that the set
> of tags for query Q = {A, B} without iterating through all 25 hits.
>
> Any ideas?
>
> Thanks!
> Akanksha

Re: distinct field values

Akanksha Baid
Is there something I could do to index the documents differently to
accomplish this? Currently I am looking at all the hits to generate the
set of tags for the query.
If I implement the same thing within Lucene, I am not sure I will gain
anything performance-wise. Or am I wrong about this?



Anshum wrote:

> Hi,
>
> You could try changing (or extending) TopFieldDocCollector and do your
> processing there (that is what I tried... and it worked fine). But that
> would mean changing lucene code a little bit.

Re: distinct field values

hossman
In reply to this post by Akanksha Baid

: For example if I have 100 documents in my index and my set of tag = {A, B, C}.
: Query Q on the text field returns 15 docs with tag A , 10 with tag B and none
: with tag C (total of 25 hits). Is there a way to determine that the set of
: tags for query Q = {A, B} without iterating through all 25 hits.

What you are describing is a subset of a broader topic known as
"faceted searching" ... if you search the archives for that or "category
counts" you'll find quite a bit of discussion on the approaches that can
be used to achieve this.



-Hoss



Re: distinct field values

Khawaja Shams
Hi,

You may also want to take a look at Carrot2:
http://demo.carrot2.org/demo-stable/main

The Lucene documentation references them, but I was disappointed to see that
they have an open source version (really old) and one that you can buy. It
may work for you.

Also, take a look at Solr's implementation of faceted searching:
http://wiki.apache.org/solr/SolrFacetingOverview

I have been dealing with a very similar problem. Iterating over all the
hits may not be too slow if you do it right. One brute-force way to
deal with this is to use the FieldCache for the tag field to quickly look
up the tag for each document id that came back with the search. The constant
lookup time in an array makes it nice and easy, but it may not scale well
with the number of documents. If you know that the number of distinct tags
in your dataset is fairly small (5-10), you can always just run the query
with each tag added as a constraint to find out if there are any matches, or
you can maintain a filter for each value to deduce even faster whether it
exists in the result set.
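
To make the two approaches concrete, here is a minimal sketch in plain
Python, not Lucene code. The tag_by_doc list stands in for a
FieldCache-style array (doc id -> tag), and the per-tag filters are modelled
as integer bitsets; all function names here are made up for the example.

```python
from collections import Counter


def tag_counts(hit_doc_ids, tag_by_doc):
    """Count tags among the hits with one array lookup per hit."""
    return Counter(tag_by_doc[doc_id] for doc_id in hit_doc_ids)


def build_tag_filters(tag_by_doc):
    """Precompute one bitset per tag; bit i is set if doc i carries the tag."""
    filters = {}
    for doc_id, tag in enumerate(tag_by_doc):
        filters[tag] = filters.get(tag, 0) | (1 << doc_id)
    return filters


def tags_present(hit_bits, tag_filters):
    """Return the tags whose bitset intersects the bitset of hits."""
    return {tag for tag, bits in tag_filters.items() if bits & hit_bits}


if __name__ == "__main__":
    tag_by_doc = ["A", "B", "C", "A", "B", "C"]  # toy index of 6 docs
    hits = [0, 1, 3]                             # doc ids matched by query Q
    print(tag_counts(hits, tag_by_doc))
    hit_bits = sum(1 << doc_id for doc_id in hits)
    print(tags_present(hit_bits, build_tag_filters(tag_by_doc)))
```

The array lookup costs one step per hit, while the filter approach costs one
intersection per distinct tag, which is why it only pays off when the number
of tags is small.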


 Please share your experience as you find more clues on this problem.



Regards,
Khawaja Shams

On Tue, Oct 14, 2008 at 8:15 PM, Chris Hostetter <[hidden email]> wrote:

> What you are describing is a subset of a broader topic known as
> "faceted searching" ... if you search the archives for that or "category
> counts" you'll find quite a bit of discussion on the approaches that can
> be used to achieve this.
>
> -Hoss

Re: distinct field values

Anshum-2
In reply to this post by Akanksha Baid
You could go through this implementation. I have been using an adapted
version of it for a while now. There might be better ways to do this too,
so you could check!

http://www.gossamer-threads.com/lists/lucene/java-user/35704?search_string=categorycounts;#35704
--
Anshum Gupta


On Wed, Oct 15, 2008 at 12:49 AM, Akanksha Baid <[hidden email]> wrote:

> Is there something I could do to Index the documents differently to
> accomplish this? Currently I am looking at all the hits to generate the set
> of tags for the query.
> If I need to implement the same thing within Lucene, I am not sure if I
> will gain anything performance wise. Or am I wrong about this?

Re: distinct field values

adb
In reply to this post by Akanksha Baid
Akanksha Baid wrote:
> I have indexed multiple documents - each of them have 3 fields ( id, tag
> , text). Is there an easy way to determine the set of tags for a given
> query without iterating through all the hits?
> For example if I have 100 documents in my index and my set of tag = {A,
> B, C}. Query Q on the text field returns 15 docs with tag A , 10 with
> tag B and none with tag C (total of 25 hits). Is there a way to
> determine that the set of tags for query Q = {A, B} without iterating
> through all 25 hits.

Another way is to use a HitCollector to collect all the hits into a Map and then
use TermEnum + TermDocs to walk the tag terms and their docs to see which tag
each hit comes from. This differs from walking the Hits and loading each
Document to fetch its tag. I'm not sure how efficient this is, though; it
depends on the document count.
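
A minimal sketch of that term walk, in plain Python rather than Lucene code:
the postings dict (tag term -> doc ids containing it) is a made-up stand-in
for what TermEnum and TermDocs would enumerate, and the sketch assumes each
document carries exactly one tag.

```python
def tag_for_each_hit(hit_doc_ids, postings):
    """Map each hit doc id to its tag by walking the tag field's postings.

    postings: hypothetical mapping of tag term -> iterable of doc ids.
    """
    hit_to_tag = dict.fromkeys(hit_doc_ids)  # the "Map" filled by the collector
    for tag, doc_ids in postings.items():    # like enumerating the tag terms
        for doc_id in doc_ids:               # like walking one term's docs
            if doc_id in hit_to_tag:
                hit_to_tag[doc_id] = tag
    return hit_to_tag


if __name__ == "__main__":
    postings = {"A": [0, 3], "B": [1, 4], "C": [2, 5]}
    hit_to_tag = tag_for_each_hit([0, 1, 3], postings)
    print(hit_to_tag)
    print(set(hit_to_tag.values()))
```

The walk touches every posting of every tag term, so its cost follows the
size of the index rather than the number of hits.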

Antony




