Deleting an index document in nutch

Deleting an index document in nutch

Dennis Kubes-2
Anybody know how to delete an index document in a distributed search
server?  Is that even possible?

Dennis

Re: Deleting an index document in nutch

John Mendenhall
> Anybody know how to delete an index document in a distributed search
> server?  Is that even possible?

I will assume by index document, you are
referring to a document that has been indexed.
If not, delete and forget.

When we need to remove a document, we filter it
out using the following procedure:

1. build temporary nutch configuration directory
     build special filter files based on document(s) to be filtered out
     point NUTCH_CONF_DIR env var to temporary nutch configuration directory
2. run bin/nutch mergedb $NEWCRAWLDBDIR $CRAWLDBDIR -filter
3. run bin/nutch mergesegs $NEWSEGMENTSDIR -dir $SEGMENTSDIR -filter
4. run bin/nutch mergelinkdb $NEWLINKDBDIR $LINKDBDIR -filter
5. run the standard set of commands to rebuild the index:
     bin/nutch index $NEWINDEXESDIR $CRAWLDBDIR $LINKDBDIR $NEWSEGLIST
     bin/nutch dedup $NEWINDEXESDIR
     bin/nutch merge -workingdir $NUTCHTMPDIR $NEWINDEXDIR $NEWINDEXESDIR
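
Step 1 can be sketched as a small shell helper. This is only a sketch under assumptions that are not stated in the thread: that the "special filter files" are the regex URL filter's regex-urlfilter.txt (rules prefixed with - to reject and + to accept, first match wins), and that the regex URL filter plugin is enabled. The function names and the paths in the usage comment are hypothetical.

```shell
#!/bin/sh
# Sketch: write a URL filter file that rejects specific URLs.
# Assumption: the filter file is regex-urlfilter.txt and the regex
# URL filter plugin is enabled in the Nutch configuration.

# Escape regex metacharacters so a URL is matched literally.
escape_url() {
    printf '%s\n' "$1" | sed 's/[][\\.^$*+?(){}|]/\\&/g'
}

# write_filter FILE URL...
# Writes one "-^<url>$" rejection rule per URL, then accepts the rest.
write_filter() {
    out=$1
    shift
    : > "$out"
    for url in "$@"; do
        printf '%s\n' "-^$(escape_url "$url")\$" >> "$out"
    done
    printf '%s\n' '+.' >> "$out"
}

# Usage sketch (hypothetical paths):
#   cp -r "$NUTCH_HOME/conf" /tmp/nutch-conf-filter
#   write_filter /tmp/nutch-conf-filter/regex-urlfilter.txt \
#       'http://example.com/bad.html'
#   export NUTCH_CONF_DIR=/tmp/nutch-conf-filter
```

Copying the whole conf directory first keeps the rest of the configuration identical, so only the filter rules differ while the merge commands run.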

The variable names should be self-explanatory.  If not,
just let me know.
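
For illustration, steps 2 through 5 could be wrapped in one script along these lines. The commands are the ones listed above; every directory value is a hypothetical placeholder, and the DRY_RUN switch and run helper are additions for this sketch, not part of Nutch. By default the script only prints what it would run.

```shell
#!/bin/sh
# Sketch of steps 2-5. All paths are hypothetical placeholders.
NUTCH=${NUTCH:-bin/nutch}
DRY_RUN=${DRY_RUN:-1}                 # 1 = only print the commands

CRAWLDBDIR=crawl/crawldb              # existing crawl db
NEWCRAWLDBDIR=crawl/crawldb.new       # filtered crawl db
SEGMENTSDIR=crawl/segments            # existing segments
NEWSEGMENTSDIR=crawl/segments.new     # filtered, merged segments
LINKDBDIR=crawl/linkdb                # existing link db
NEWLINKDBDIR=crawl/linkdb.new         # filtered link db
NEWINDEXESDIR=crawl/indexes.new       # per-part indexes
NEWINDEXDIR=crawl/index.new           # final merged index
NUTCHTMPDIR=/tmp/nutch-merge          # scratch space for the merge
# Placeholder: replace with the segment dir(s) that mergesegs produced.
NEWSEGLIST=${NEWSEGLIST:-$NEWSEGMENTSDIR/SEGMENT_NAME}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# The && chain stops at the first failing step.
# NEWSEGLIST is unquoted on purpose: it may name several segments.
filter_and_reindex() {
    run "$NUTCH" mergedb "$NEWCRAWLDBDIR" "$CRAWLDBDIR" -filter &&
    run "$NUTCH" mergesegs "$NEWSEGMENTSDIR" -dir "$SEGMENTSDIR" -filter &&
    run "$NUTCH" mergelinkdb "$NEWLINKDBDIR" "$LINKDBDIR" -filter &&
    run "$NUTCH" index "$NEWINDEXESDIR" "$CRAWLDBDIR" "$LINKDBDIR" $NEWSEGLIST &&
    run "$NUTCH" dedup "$NEWINDEXESDIR" &&
    run "$NUTCH" merge -workingdir "$NUTCHTMPDIR" "$NEWINDEXDIR" "$NEWINDEXESDIR"
}

filter_and_reindex
```

Running it with DRY_RUN=0 would execute the commands for real, with NUTCH_CONF_DIR already pointing at the temporary configuration directory from step 1.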

JohnM

--
john mendenhall
[hidden email]
surf utopia
internet services

Re: Deleting an index document in nutch

Dennis Kubes-2
An easier way to do this (after some digging) is to use:

bin/nutch org.apache.nutch.tools.PruneIndexTool

You would first need to stop the DistributedSearch$Server, run the tool
(which also has a dry-run mode), then restart the server.  A more
brute-force way to do this, if your indexes are in the form part-00000,
is to delete an entire part-xxxxx.  The prune tool needs to be run on
each part-xxxxx within a single shard.
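
As a rough sketch, running the tool over every part-xxxxx in a shard might look like the loop below. The -dryrun and -queries flags are given as I recall them from the tool's usage message, so treat them as assumptions and run the tool without arguments to confirm. The shard directory and queries file (assumed to hold one Lucene query per line) are hypothetical. The function only prints the commands so they can be reviewed before anything is touched.

```shell
#!/bin/sh
# Sketch: print one PruneIndexTool invocation per part-xxxxx index.
# Assumption: -dryrun and -queries are accepted as shown; verify
# against the usage message of your Nutch version.
NUTCH=${NUTCH:-bin/nutch}

# prune_shard SHARD_DIR QUERIES_FILE
prune_shard() {
    shard_dir=$1      # hypothetical: directory holding part-xxxxx indexes
    queries=$2        # hypothetical: file of queries selecting documents
    for part in "$shard_dir"/part-*; do
        [ -d "$part" ] || continue    # skip when nothing matches
        set -- "$NUTCH" org.apache.nutch.tools.PruneIndexTool \
            "$part" -dryrun -queries "$queries"
        printf '%s\n' "$*"
    done
}

# Usage sketch (hypothetical paths):
#   prune_shard /data/shard1/indexes prune-queries.txt
```

Once the dry run reports the expected documents, the same commands without -dryrun would do the actual pruning, with the search server stopped as described above.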

Be aware that this will not stop URLs from coming back when content is
reindexed; it only removes them from the current index.

Dennis
