[jira] [Commented] (NUTCH-2414) Allow LanguageIndexingFilter to actually filter documents by language.


Parth (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-2414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16144264#comment-16144264 ]

Jorge Luis Betancourt Gonzalez commented on NUTCH-2414:
-------------------------------------------------------

+1. This would also help to deprecate the {{mimetype-filter}} plugin and avoid having the responsibility for allowing or blocking documents from being indexed scattered across several plugins.

> Allow LanguageIndexingFilter to actually filter documents by language.
> ----------------------------------------------------------------------
>
>                 Key: NUTCH-2414
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2414
>             Project: Nutch
>          Issue Type: Improvement
>          Components: plugin
>    Affects Versions: 1.13
>            Reporter: Yossi Tamari
>            Priority: Minor
>
> It is often useful to only index pages in select languages (e.g. only those languages that we intend to search in). At first glance it seems that this is done by LanguageIndexingFilter, but currently all the filter does is add the language as a field to the index.
> We can add a configuration property to LanguageIndexingFilter that will allow it to only index languages specified in this property.
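
As a rough illustration of the proposal above, the allow-list check could look something like the sketch below. This is not the actual patch for this issue: the class name `LanguageAllowList` and the comma-separated property format are assumptions made for the example, and the code deliberately avoids any Nutch classes so that it stands alone.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

/**
 * Hypothetical helper (not part of Nutch): decides whether a document's
 * detected language is in a configured allow-list.
 */
final class LanguageAllowList {

  private final Set<String> allowed;

  private LanguageAllowList(Set<String> allowed) {
    this.allowed = allowed;
  }

  /**
   * Parses a comma-separated property value such as "en, de".
   * A null, blank, or otherwise empty value means "no restriction".
   */
  static LanguageAllowList parse(String propertyValue) {
    if (propertyValue == null || propertyValue.trim().isEmpty()) {
      return new LanguageAllowList(Collections.emptySet());
    }
    Set<String> codes = Arrays.stream(propertyValue.split(","))
        .map(s -> s.trim().toLowerCase(Locale.ROOT))
        .filter(s -> !s.isEmpty())
        .collect(Collectors.toSet());
    return new LanguageAllowList(codes);
  }

  /** Returns true if a document in the given language should be indexed. */
  boolean allows(String detectedLanguage) {
    if (allowed.isEmpty()) {
      return true; // no allow-list configured: keep every document
    }
    if (detectedLanguage == null) {
      return false; // allow-list configured but language unknown: drop
    }
    return allowed.contains(detectedLanguage.trim().toLowerCase(Locale.ROOT));
  }
}
```

Inside the filter, the idea would be to build this once from the configuration and to skip the document whenever `allows(lang)` is false. As far as I understand the indexing filter chain, returning null from the filter is how a document gets dropped, but that wiring is left out here. Whether documents with an undetected language should be dropped or kept is a design choice; the sketch drops them.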



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Re: [jira] [Commented] (NUTCH-2414) Allow LanguageIndexingFilter to actually filter documents by language.

BlackIce
+1 This way one could have a very focused crawl/search

On Mon, Aug 28, 2017 at 10:08 PM, Jorge Luis Betancourt Gonzalez (JIRA) <[hidden email]> wrote:

    [ https://issues.apache.org/jira/browse/NUTCH-2414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16144264#comment-16144264 ]

Jorge Luis Betancourt Gonzalez commented on NUTCH-2414:
-------------------------------------------------------

+1. This would also help to deprecate the {{mimetype-filter}} plugin and avoid having the responsibility for allowing or blocking documents from being indexed scattered across several plugins.

> Allow LanguageIndexingFilter to actually filter documents by language.
> ----------------------------------------------------------------------
>
>                 Key: NUTCH-2414
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2414
>             Project: Nutch
>          Issue Type: Improvement
>          Components: plugin
>    Affects Versions: 1.13
>            Reporter: Yossi Tamari
>            Priority: Minor
>
> It is often useful to only index pages in select languages (e.g. only those languages that we intend to search in). At first glance it seems that this is done by LanguageIndexingFilter, but currently all the filter does is add the language as a field to the index.
> We can add a configuration property to LanguageIndexingFilter that will allow it to only index languages specified in this property.


