[jira] [Commented] (NUTCH-2616) Review routing of deletions by Exchange component


JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/NUTCH-2616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16540204#comment-16540204 ]

Roannel Fernández Hernández commented on NUTCH-2616:

Sending deletions to all index writers seems to be the best option. That was the behavior before the Exchange component existed, right?

Passing documents with a single field might work, but then only the ID/URL field can be used in JEXL expressions to make deletions match the exchange (at least for exchange-jexl), because it is the only field available. For example, with {{<param name="expr" value="doc.getFieldValue('host')=='example.org'" />}}, every indexed document with host='example.org' matches, but a delete action does not match even when id='http://example.org/', because the 'host' field doesn't exist in the deletion document.
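
To make the mismatch concrete, here is a minimal sketch in plain Java. It does not use the Nutch or JEXL APIs: a Map stands in for the NutchDocument and a Predicate stands in for the JEXL expression above. The class and method names are hypothetical and exist only for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

public class DeletionRoutingSketch {

    // Stand-in (assumption) for the JEXL expression
    // doc.getFieldValue('host')=='example.org'
    public static final Predicate<Map<String, String>> HOST_RULE =
        doc -> "example.org".equals(doc.get("host"));

    // Document as it would look for an index action: id plus other fields.
    public static Map<String, String> indexDoc(String url, String host) {
        Map<String, String> doc = new HashMap<>();
        doc.put("id", url);
        doc.put("host", host);
        return doc;
    }

    // Document as it would look for a delete action: only the id field.
    public static Map<String, String> deleteDoc(String url) {
        Map<String, String> doc = new HashMap<>();
        doc.put("id", url);
        return doc;
    }

    public static void main(String[] args) {
        Map<String, String> added = indexDoc("http://example.org/", "example.org");
        Map<String, String> deleted = deleteDoc("http://example.org/");

        // The index action matches the rule, the delete action does not,
        // because the single-field document has no 'host' field.
        System.out.println("index matches:  " + HOST_RULE.test(added));   // true
        System.out.println("delete matches: " + HOST_RULE.test(deleted)); // false
    }
}
```

In this sketch the deletion for http://example.org/ would never reach the writer that received the original document, unless the rule is written against the id field only.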

Another option could be to pass documents with a single field and modify the exchange component to run a different routine depending on the action being executed. The expression to apply in each case would be configured in the exchanges.xml file.
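
A rough sketch of that second idea, again in plain Java rather than against the real Exchange plugin interface: one rule per action, where the delete rule may only look at the id field. The Action enum, the class name, and the way the rules are wired up are all assumptions for illustration; in Nutch the two expressions would come from exchanges.xml.

```java
import java.util.EnumMap;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

public class ActionAwareExchangeSketch {

    // Hypothetical action type; not taken from the Nutch code base.
    public enum Action { INDEX, DELETE }

    // One matching rule per action (stand-ins for configured expressions).
    private final Map<Action, Predicate<Map<String, String>>> rules =
        new EnumMap<>(Action.class);

    public ActionAwareExchangeSketch(Predicate<Map<String, String>> indexRule,
                                     Predicate<Map<String, String>> deleteRule) {
        rules.put(Action.INDEX, indexRule);
        rules.put(Action.DELETE, deleteRule);
    }

    // Picks the rule that belongs to the action and applies it to the document.
    public boolean match(Action action, Map<String, String> doc) {
        return rules.get(action).test(doc);
    }

    // Example wiring: index by 'host', delete by a prefix of 'id'.
    public static ActionAwareExchangeSketch example() {
        return new ActionAwareExchangeSketch(
            doc -> "example.org".equals(doc.get("host")),
            doc -> {
                String id = doc.get("id");
                return id != null && id.startsWith("http://example.org/");
            });
    }

    public static void main(String[] args) {
        ActionAwareExchangeSketch exchange = example();

        // Single-field document, as it would be passed for a deletion.
        Map<String, String> deleted = new HashMap<>();
        deleted.put("id", "http://example.org/page.html");

        System.out.println(exchange.match(Action.DELETE, deleted)); // true
        System.out.println(exchange.match(Action.INDEX, deleted));  // false
    }
}
```

The cost of this design is that every exchange needs two expressions that have to be kept consistent by hand.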

> Review routing of deletions by Exchange component
> -------------------------------------------------
>                 Key: NUTCH-2616
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2616
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.15
>            Reporter: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.15
> If the exchange component (NUTCH-2412) is enabled, it must also route deletions (404, etc.) to the configured index writers. Deletions are done using the document ID (URL) alone; there is no NutchDocument (or it's null), which needs to be handled to avoid an NPE in the Exchanges class or the exchange plugins.
> NUTCH-2412 has added a new delete method in the IndexWriters class:
> - {{delete(String, NutchDocument)}} is now called from the indexing job ({{bin/nutch index ... -deleteGone}}). However, the NutchDocument is always null in case of deletions, see IndexerMapReduce.DELETE_ACTION.
> - {{delete(String)}} is now a no-op but is still called from CleaningJob ({{bin/nutch clean ...}})
> We could ([~roannel], are there better options?)
> - send deletions to all index writers. This causes a certain overhead (could be critical if deletion lists are long).
> - pass a document containing only a single field (the document ID / URL) to the exchange component.

This message was sent by Atlassian JIRA