Metadata not indexed after migrating to Nutch 2.4

Metadata not indexed after migrating to Nutch 2.4

Anton Skarp
Hi,

After migrating from Nutch 2.3.1 to 2.4, I have not been able to configure Nutch to index metadata into Elasticsearch. Indexchecker retrieves the metadata correctly, though.
I have tried both HBase (version 0.9.8-hadoop2) and MongoDB as the storage backend. In both cases the store contained the expected metadata.

I have done some debugging, and the problem seems to be that the page parameter passed to MetadataIndexer's filter method does not contain the metadata at all.

Neither Nutch nor Elasticsearch outputs any exceptions or errors.

Any ideas on what the problem is and how I should approach fixing it?


Regards, Anton

Re: Metadata not indexed after migrating to Nutch 2.4

Sebastian Nagel-2
Hi Anton,

After a short look into MetadataIndexer:
- it does not request any fields from the webpage
  (see its getFields() method)
- this is a bug (though it was already present in 2.3.1)
- it can be worked around by activating another
  plugin which requests the METADATA field/column,
  e.g. language-identifier/LanguageIndexingFilter

That's one possible explanation.
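
To illustrate the mechanism: as far as I understand it, the indexing job only loads the fields that the active filters ask for through getFields(), so a filter that requests nothing sees the metadata only if some other active filter requests it. The sketch below is a simplified, self-contained model of that idea and not Nutch code. The names Field, Filter, MetadataFilter, LanguageFilter and fieldsToLoad are hypothetical stand-ins chosen for this example.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

public class FieldRequestSketch {

    // Hypothetical stand-in for the stored columns of a web page.
    enum Field { TITLE, TEXT, METADATA }

    // Simplified model of an indexing filter: it declares the fields it needs.
    interface Filter {
        Collection<Field> getFields();
    }

    // Models the behaviour described above: no fields are requested.
    static class MetadataFilter implements Filter {
        @Override
        public Collection<Field> getFields() {
            return Collections.emptySet();
        }
    }

    // Models a second filter that does request the METADATA field.
    static class LanguageFilter implements Filter {
        @Override
        public Collection<Field> getFields() {
            return EnumSet.of(Field.METADATA);
        }
    }

    // Assumption: the job loads the union of all fields requested by active filters.
    static Set<Field> fieldsToLoad(List<Filter> activeFilters) {
        Set<Field> fields = EnumSet.noneOf(Field.class);
        for (Filter filter : activeFilters) {
            fields.addAll(filter.getFields());
        }
        return fields;
    }

    public static void main(String[] args) {
        List<Filter> metadataOnly = Arrays.<Filter>asList(new MetadataFilter());
        List<Filter> withLanguage =
            Arrays.<Filter>asList(new MetadataFilter(), new LanguageFilter());

        System.out.println("METADATA loaded, metadata filter only: "
            + fieldsToLoad(metadataOnly).contains(Field.METADATA));
        System.out.println("METADATA loaded, language filter added: "
            + fieldsToLoad(withLanguage).contains(Field.METADATA));
    }
}
```

In this model, adding the second filter is what causes METADATA to be loaded, which corresponds to the workaround above. A proper fix would presumably be for the metadata filter itself to request the field.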

Please note that further releases in the 2.x series of Nutch
are unlikely; see the release announcement
for more details.

Best,
Sebastian



Re: Metadata not indexed after migrating to Nutch 2.4

Anton Skarp
Hi Sebastian,

Your suggestion of adding the plugin solved the problem. Thank you for your help.

Regards, Anton


________________________________
From: Sebastian Nagel <[hidden email]>
Sent: Monday, November 11, 2019 3:08 PM
To: [hidden email] <[hidden email]>
Subject: Re: Metadata not indexed after migrating to Nutch 2.4