[jira] [Created] (NUTCH-2787) CrawlDb JSON dump does not export metadata primitive data types correctly

Hudson (Jira)
Patrick Mézard created NUTCH-2787:
-------------------------------------

             Summary: CrawlDb JSON dump does not export metadata primitive data types correctly
                 Key: NUTCH-2787
                 URL: https://issues.apache.org/jira/browse/NUTCH-2787
             Project: Nutch
          Issue Type: Bug
          Components: crawldb
    Affects Versions: 1.17
         Environment: Reproduced with:
{code:java}
commit 9139d6ec7a98aea1af943755e9802066803b02b7 (HEAD -> master, origin/master, origin/HEAD)
Merge: e61a8a3b f971ca1b
Author: Sebastian Nagel <[hidden email]>
Date:   Thu May 14 17:43:14 2020 +0200    Merge pull request #526 from sebastian-nagel/NUTCH-2419-urlfilter-rule-file-precedence
   
    NUTCH-2419 Some URL filters and normalizers do not respect command-line override for rule file {code}
            Reporter: Patrick Mézard


To reproduce:
 * Activate scoring-depth plugin
 * Create a new crawldb from a seed URL:
 * Dump the crawldb as json
 * Look at the json

{code:java}
$ nutch inject crawl/crawldb seeds.txt
$ rm -rf out; nutch readdb crawl/crawldb -dump out -format json
$ cat out/part-r-00000 | head -1 | python -m json.tool
{
    "url": "http://clustree.com/",
    "statusCode": 1,
    "statusName": "db_unfetched",
    "fetchTime": "Thu Jun 04 15:19:02 CEST 2020",
    "modifiedTime": "Thu Jan 01 01:00:00 CET 1970",
    "retriesSinceFetch": 0,
    "retryIntervalSeconds": 2592000,
    "retryIntervalDays": 30,
    "score": 1.0,
    "signature": "null",
    "metadata": {
        "_depth_": {},
        "_maxdepth_": {}
    }
}{code}
KO => _`_depth_` and `_maxdepth_` are not exported as integers; their values are dumped as empty objects._

The fields are correct in the crawldb, as shown by a CSV dump:
{code:java}
$ rm -rf out; nutch readdb crawl/crawldb -dump out -format csv
$ cat out/part-r-00000
Url,Status code,Status name,Fetch Time,Modified Time,Retries since fetch,Retry interval seconds,Retry interval days,Score,Signature,Metadata
"http://clustree.com/",1,"db_unfetched",Thu Jun 04 15:19:02 CEST 2020,Thu Jan 01 01:00:00 CET 1970,0,2592000.0,30.0,1.0,"null","_depth_:1|||_maxdepth_:5|||" {code}
Code is here:

[https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/CrawlDbReader.java#L269]

I do not know Java very well, but I think it comes from IntWritable & co. not being POJO types (or at least not the kind of type the JSON serializer expects).
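One way to see why the dump shows `{}`: if the JSON output goes through a bean-style serializer, it introspects `IntWritable` for JavaBean getters, finds none (the accessor is `get()`, not `getXxx()`), and emits an empty object. A small stdlib sketch of that introspection, using a hypothetical stub in place of the Hadoop class:

```java
import java.beans.BeanInfo;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;

// Hypothetical stand-in mimicking org.apache.hadoop.io.IntWritable's accessor shape.
class IntWritableStub {
    private int value;
    public int get() { return value; }           // not a JavaBean getter (no "getXxx" name)
    public void set(int value) { this.value = value; }
}

public class BeanIntrospectionDemo {
    public static void main(String[] args) throws Exception {
        // Introspect the stub, excluding properties inherited from Object.
        BeanInfo info = Introspector.getBeanInfo(IntWritableStub.class, Object.class);
        PropertyDescriptor[] props = info.getPropertyDescriptors();
        // No JavaBean properties are found, so a bean-based JSON serializer
        // has nothing to emit for the value: hence "{}" in the dump.
        System.out.println(props.length); // prints 0
    }
}
```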

One fix might be to:
 * Map each primitive-type Writable class to a function that casts to the concrete type and calls "get" (boxing the value as needed).
 * Call that function in the metadata conversion loop.
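A minimal sketch of that idea (all names here are hypothetical, and the `Writable` stubs stand in for the real `org.apache.hadoop.io` classes so the sketch is self-contained):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Stand-ins for org.apache.hadoop.io types so this sketch compiles on its own;
// the real fix would target the actual Hadoop Writable classes.
interface Writable {}

class IntWritable implements Writable {
    private final int value;
    IntWritable(int value) { this.value = value; }
    int get() { return value; }
}

class FloatWritable implements Writable {
    private final float value;
    FloatWritable(float value) { this.value = value; }
    float get() { return value; }
}

public class WritableUnboxer {
    // Map each primitive-type Writable class to a function that extracts
    // and boxes its underlying value.
    private static final Map<Class<?>, Function<Writable, Object>> UNBOXERS = new HashMap<>();
    static {
        UNBOXERS.put(IntWritable.class, w -> ((IntWritable) w).get());
        UNBOXERS.put(FloatWritable.class, w -> ((FloatWritable) w).get());
    }

    /** Return the boxed primitive for known Writables, else fall back to toString(). */
    static Object unbox(Writable w) {
        Function<Writable, Object> f = UNBOXERS.get(w.getClass());
        return f != null ? f.apply(w) : w.toString();
    }

    public static void main(String[] args) {
        // The metadata conversion loop would call unbox() per entry instead of
        // handing the Writable itself to the JSON serializer.
        Map<String, Writable> metadata = new HashMap<>();
        metadata.put("_depth_", new IntWritable(1));
        metadata.put("_maxdepth_", new IntWritable(5));
        Map<String, Object> jsonMetadata = new TreeMap<>();
        for (Map.Entry<String, Writable> e : metadata.entrySet()) {
            jsonMetadata.put(e.getKey(), unbox(e.getValue()));
        }
        System.out.println(jsonMetadata); // prints {_depth_=1, _maxdepth_=5}
    }
}
```

In CrawlDbReader itself, this would go in the metadata conversion loop at the line linked above.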



--
This message was sent by Atlassian Jira
(v8.3.4#803005)