Deduplication in 1.4

Kaktu Chakarabati
Hey,
I've been trying to find some documentation on using this feature in 1.4, but the wiki page is a little sparse.
Specifically, here's what I'm trying to do:

I have a field, say 'duplicate_group_id', that I'll populate based on an offline document deduplication process I have.

All I want is for Solr to compute a 'duplicate_signature' field based on this one at update time, so that when I search for documents later, all documents with the same original 'duplicate_group_id' value will be rolled up (i.e. I'll just get the first one that came back according to relevancy).

I enabled the deduplication processor and put it into the updater, but I'm not seeing any difference in the returned results (i.e. results with the same 'duplicate_group_id' are still returned separately).

Is there anything I need to supply at query time for this to take effect? What should the behaviour be? Is there a working example of this?

Anything would be helpful.

Thanks,
Chak

Re: Deduplication in 1.4

Otis Gospodnetic-2
Hi,

As far as I know, the point of deduplication in Solr ( http://wiki.apache.org/solr/Deduplication ) is to detect a duplicate document before indexing it, so that duplicates never reach the index in the first place.
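
To make that concrete, here is a minimal sketch in plain Python of what index-time deduplication amounts to. It is not Solr code: the signature() and add_document() helpers, the in-memory dict standing in for the index, and the choice of MD5 are all assumptions made for illustration. Only the field names come from the original question.

```python
import hashlib


def signature(doc, fields):
    """Hash the chosen fields into a hex digest (MD5 is an arbitrary choice here)."""
    joined = "\x1f".join(str(doc.get(f, "")) for f in fields)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def add_document(index, doc, fields, signature_field="duplicate_signature"):
    """Store doc in a toy in-memory index keyed by its signature.

    A later document with the same signature overwrites the earlier one,
    so only one document per signature is ever kept.
    """
    stored = dict(doc)
    stored[signature_field] = signature(doc, fields)
    index[stored[signature_field]] = stored
    return stored


index = {}
add_document(index, {"id": "a", "duplicate_group_id": "g1"}, ["duplicate_group_id"])
add_document(index, {"id": "b", "duplicate_group_id": "g1"}, ["duplicate_group_id"])
add_document(index, {"id": "c", "duplicate_group_id": "g2"}, ["duplicate_group_id"])

print(sorted(d["id"] for d in index.values()))  # ['b', 'c']
```

The point is that everything in this model happens at update time; nothing in it changes how results are grouped at query time.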

What you are describing is closer to the field collapsing patch in SOLR-236.
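
For comparison, the rolled-up behaviour asked about in the original post can be sketched as a pass over an already-ranked result list that keeps only the first hit per group. This is a hypothetical client-side illustration in plain Python, not the SOLR-236 implementation; collapse() and the sample hits are made up for the example.

```python
def collapse(ranked_hits, group_field):
    """Keep only the first (highest-ranked) hit for each value of group_field.

    Hits that lack the field are passed through untouched.
    """
    seen = set()
    collapsed = []
    for hit in ranked_hits:
        key = hit.get(group_field)
        if key is None:
            collapsed.append(hit)
            continue
        if key not in seen:
            seen.add(key)
            collapsed.append(hit)
    return collapsed


# Sample hits, already sorted by descending relevancy score.
hits = [
    {"id": "a", "score": 3.2, "duplicate_group_id": "g1"},
    {"id": "c", "score": 2.9, "duplicate_group_id": "g2"},
    {"id": "b", "score": 2.1, "duplicate_group_id": "g1"},
    {"id": "d", "score": 1.0, "duplicate_group_id": "g2"},
]

result = collapse(hits, "duplicate_group_id")
print([h["id"] for h in result])  # ['a', 'c']
```

A client-side pass like this only sees the hits it has already fetched, so page sizes and total counts will not line up, which is a good reason to want the collapsing done inside the search engine.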

 Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



----- Original Message ----

> From: KaktuChakarabati <[hidden email]>
> To: [hidden email]
> Sent: Tue, November 24, 2009 5:29:00 PM
> Subject: Deduplication in 1.4

Re: Deduplication in 1.4

Kaktu Chakarabati
Hey Otis,
Yep, I realized this myself after playing with the dedupe feature a bit yesterday.
So it does look like field collapsing is pretty much what I need.
Any idea how close it is to being production-ready?

Thanks,
-Chak

Re: Deduplication in 1.4

Martijn v Groningen
Field collapsing has been used by many in their production
environments. Over the last few months the stability of the patch has
improved, as quite a few bugs were fixed. The only big feature currently
missing is caching of the collapsing algorithm. I'm currently working on
that and I will put it in a new patch in the coming days. So yes, the
patch is very near to being production-ready.

Martijn

Re: Deduplication in 1.4

Otis Gospodnetic-2
Hi Martijn,

 
----- Original Message ----

> From: Martijn v Groningen <[hidden email]>
> To: [hidden email]
> Sent: Thu, November 26, 2009 3:19:40 AM
> Subject: Re: Deduplication in 1.4
>
> Field collapsing has been used by many in their production
> environments.

Got any pointers to public sites you know use it? I know of a high-traffic site that used an early version, and it caused performance problems. Is double-tripping still required?

> Over the last few months the stability of the patch has improved, as
> quite a few bugs were fixed. The only big feature currently missing is
> caching of the collapsing algorithm. I'm currently working on that and

Is it also fully distributed-search-ready?

> I will put it in a new patch in the coming days. So yes, the
> patch is very near to being production-ready.

Thanks,
Otis

Re: Deduplication in 1.4

Martijn v Groningen
Two sites that use field collapsing:
1) www.ilocal.nl
2) www.welke.nl
I'm not sure what you mean by double-tripping. The sites mentioned
do not have performance problems caused by field collapsing.

Field collapsing currently supports only quasi-distributed
field collapsing (as I have described on the Solr wiki). I don't
currently know of a distributed field-collapsing algorithm that works
properly without making the search slow.

Martijn
