duplicate deleting


duplicate deleting

Zaheed Haque
Hi:

I have the following use case: I have five large XML dumps from five
different databases. All the XML files have the same structure, but
there are duplicate entries that appear in different files under
different IDs (1 DB entry = 1 Solr doc). How can I remove duplicates --
where a record counts as a duplicate when three or more fields match --
while making sure exactly one copy of each document survives? Example:

file1.xml
=======
<add>
<doc>
<field name="id">121</field>
<field name="name">Acme inc</field>
<field name="contact">[hidden email]</field>
</doc>

<doc>
<field name="id">122</field>
<field name="name">ABC inc</field>
<field name="contact">[hidden email]</field>
</doc>

<doc>
<field name="id">123</field>
<field name="name">XYZ inc</field>
<field name="contact">[hidden email]</field>
</doc>
</add>

file2.xml
======

<add>
<doc>
<field name="id">221</field>
<field name="name">Acme inc</field>
<field name="contact">[hidden email]</field>
</doc>

<doc>
<field name="id">222</field>
<field name="name">BBC inc</field>
<field name="contact">[hidden email]</field>
</doc>

<doc>
<field name="id">223</field>
<field name="name">CNN inc</field>
<field name="contact">[hidden email]</field>
</doc>
</add>

file3.xml
======
<add>
<doc>
<field name="id">321</field>
<field name="name">NBC inc</field>
<field name="contact">[hidden email]</field>
</doc>

<doc>
<field name="id">322</field>
<field name="name">ABC inc</field>
<field name="contact">[hidden email]</field>
</doc>

<doc>
<field name="id">323</field>
<field name="name">BBC inc</field>
<field name="contact">[hidden email]</field>
</doc>
</add>


I have a field called last modified, and that field should determine
which record gets to be the one in the Solr index.

These files are huge, and I need an automated way to clean them up on a
weekly basis. Yes, I could clean up the files before handing them over
to Solr, but I thought there must be some way to do it without writing
custom modifications.

Any tips/tricks very much appreciated.

Regards

Re: duplicate deleting

Chris Hostetter-3

: I have a field called last modified, and that field should determine
: which record gets to be the one in the Solr index.
:
: These files are huge, and I need an automated way to clean them up
: on a weekly basis. Yes, I could clean up the files before handing
: them over to Solr, but I thought there must be some way to do it
: without writing custom modifications.

First off: keep in mind that you don't *need* to create files on disk ...
you said all of this data was being dumped from a database ... instead of
dumping to files, stream the data out of your DB and send it to Solr
directly -- now your "deduping" problem becomes something you can manage
in your database (or in the code streaming the data out of the DB).
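
for example, a rough sketch of the streaming approach in Python -- the
update URL, the use of sqlite3, and the "companies" table with its
id/name/contact/last_modified columns are all assumptions on my part,
not anything you told us:

import sqlite3
import urllib.request
from xml.sax.saxutils import escape

# Assumed location of Solr's XML update handler.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"

def post(xml_body):
    """POST one XML update message to Solr."""
    req = urllib.request.Request(
        SOLR_UPDATE_URL,
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    urllib.request.urlopen(req).read()

def stream(db_paths):
    # Dedupe while streaming: key on the fields that define a
    # duplicate and keep only the row with the newest last_modified.
    latest = {}
    for path in db_paths:
        conn = sqlite3.connect(path)
        # Table and column names are hypothetical.
        for id_, name, contact, modified in conn.execute(
            "SELECT id, name, contact, last_modified FROM companies"
        ):
            key = (name, contact)
            if key not in latest or modified > latest[key][3]:
                latest[key] = (id_, name, contact, modified)
        conn.close()

    docs = "".join(
        "<doc>"
        f'<field name="id">{escape(str(id_))}</field>'
        f'<field name="name">{escape(name)}</field>'
        f'<field name="contact">{escape(contact)}</field>'
        "</doc>"
        for id_, name, contact, _ in latest.values()
    )
    post("<add>" + docs + "</add>")
    post("<commit/>")

stream(["vendor1.db", "vendor2.db"])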

that said, if you'd still like Solr to handle the "deduping" for you, then
the 3 fields that define a duplicate record need to be combined into a
single field which you define as your uniqueKey ... you could concat them
if they are simple enough, or you could use something like an md5sum ...
if the docs are added in order of lastModifiedDate, Solr will ensure the
most recent one "sticks".
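
a rough sketch of the md5 idea -- the field name "signature" and the
choice of name + contact as the duplicate-defining fields are just
assumptions; in schema.xml you'd declare a string field with that name
and point <uniqueKey> at it:

import hashlib

def signature(*fields):
    """Collapse the duplicate-defining fields into one stable key."""
    # md5 over a normalized concatenation; any stable normalization
    # works, as long as all copies of a duplicate agree on it.
    joined = "\x1f".join(f.strip().lower() for f in fields)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

# The Acme entries from file1.xml (id 121) and file2.xml (id 221)
# collapse to the same uniqueKey, so whichever is added last wins.
# (The address is a placeholder; the real one was hidden above.)
print(signature("Acme inc", "contact@example.com"))
print(signature("Acme inc ", "CONTACT@example.com"))  # same digest

each <doc> then carries <field name="signature">...</field> as its
uniqueKey, and the per-database ids stop mattering for the dedup.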


...this doesn't help you clean up the files on disk though, but like I
said: you don't need the files to be on disk.


-Hoss


Re: duplicate deleting

Zaheed Haque
Chris:

Thanks for the reply.

On 11/27/06, Chris Hostetter <[hidden email]> wrote:

>
> : I have a field called last modified, and that field should determine
> : which record gets to be the one in the Solr index.
> :
> : These files are huge, and I need an automated way to clean them up
> : on a weekly basis. Yes, I could clean up the files before handing
> : them over to Solr, but I thought there must be some way to do it
> : without writing custom modifications.
>
> First off: keep in mind that you don't *need* to create files on disk ...
> you said all of this data was being dumped from a database ... instead of
> dumping to files, stream the data out of your DB and send it to Solr
> directly.

Eventually we will be able to do that. Currently this data comes from
external parties (four different external vendors), so I have no control
over it. We have discussed the issue you point out, and have considered,
for example, running a web service from the vendor DBs to Solr, but
security is an issue from various corporate standpoints.

> -- now your "deduping" problem becomes something you can manage
> in your database (or in the code streaming the data out of the DB).

Same as above. I don't know what's in the database (i.e., the content) in advance.

> that said, if you'd still like Solr to handle the "deduping" for you, then
> the 3 fields that define a duplicate record need to be combined into a
> single field which you define as your uniqueKey ... you could concat them
> if they are simple enough, or you could use something like an md5sum ...
> if the docs are added in order of lastModifiedDate, Solr will ensure the
> most recent one "sticks".
>
Thanks. This is what I wanted to know. Excellent. I will give this a try.

Cheers.