[jira] [Commented] (NUTCH-2787) CrawlDb JSON dump does not export metadata primitive data types correctly

Sergey Smolyakov (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-2787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17129465#comment-17129465 ]

ASF GitHub Bot commented on NUTCH-2787:
---------------------------------------

sebastian-nagel opened a new pull request #531:
URL: https://github.com/apache/nutch/pull/531


   - add JsonSerializer to write common Writable types (null, boolean, numbers)
   - remaining "unknown" Writables are written after calling toString()
   
   ```json
   {
     "url": "https://nutch.apache.org/",
     ...,
     "metadata": {
       "_depth_": 1,
       "_maxdepth_": 1000
     }
   }
   ```



> CrawlDb JSON dump does not export metadata primitive data types correctly
> -------------------------------------------------------------------------
>
>                 Key: NUTCH-2787
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2787
>             Project: Nutch
>          Issue Type: Bug
>          Components: crawldb
>    Affects Versions: 1.17
>         Environment: Reproduced with:
> {code:java}
> commit 9139d6ec7a98aea1af943755e9802066803b02b7 (HEAD -> master, origin/master, origin/HEAD)
> Merge: e61a8a3b f971ca1b
> Author: Sebastian Nagel <[hidden email]>
> Date:   Thu May 14 17:43:14 2020 +0200
>
>     Merge pull request #526 from sebastian-nagel/NUTCH-2419-urlfilter-rule-file-precedence
>    
>     NUTCH-2419 Some URL filters and normalizers do not respect command-line override for rule file {code}
>            Reporter: Patrick Mézard
>            Assignee: Sebastian Nagel
>            Priority: Minor
>             Fix For: 1.17
>
>
> To reproduce:
>  * Activate scoring-depth plugin
>  * Create a new crawldb from a seed URL
>  * Dump the crawldb as JSON
>  * Inspect the JSON output
> {code:java}
> $ nutch inject crawl/crawldb seeds.txt
> $ rm -rf out; nutch readdb crawl/crawldb -dump out -format json
> $ cat out/part-r-00000 | head -1 | python -m json.tool
> {
>     "url": "http://example.com/",
>     "statusCode": 1,
>     "statusName": "db_unfetched",
>     "fetchTime": "Thu Jun 04 15:19:02 CEST 2020",
>     "modifiedTime": "Thu Jan 01 01:00:00 CET 1970",
>     "retriesSinceFetch": 0,
>     "retryIntervalSeconds": 2592000,
>     "retryIntervalDays": 30,
>     "score": 1.0,
>     "signature": "null",
>     "metadata": {
>         "_depth_": {},
>         "_maxdepth_": {}
>     }
> }{code}
> KO => `_depth_` and `_maxdepth_` are exported as empty objects (`{}`) instead of integers.
> The fields are correct in the crawldb, as shown by a CSV dump:
> {code:java}
> $ rm -rf out; nutch readdb crawl/crawldb -dump out -format csv
> $ cat out/part-r-00000
> Url,Status code,Status name,Fetch Time,Modified Time,Retries since fetch,Retry interval seconds,Retry interval days,Score,Signature,Metadata
> "http://example.com/",1,"db_unfetched",Thu Jun 04 15:19:02 CEST 2020,Thu Jan 01 01:00:00 CET 1970,0,2592000.0,30.0,1.0,"null","_depth_:1|||_maxdepth_:5|||" {code}
> Code is here:
> [https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/CrawlDbReader.java#L269]
> I do not know Java very well, but I think it comes from IntWritable & co. not being POJO types (or at least not in the way we want them to be).
> One fix might be to:
>  * Map each primitive-type Writable class to a function that casts from the base interface and calls "get" (maybe boxing the value as well).
>  * Call that function in the metadata conversion loop.
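
The fix suggested in the issue could be sketched roughly as follows. This is a minimal illustration, not the code from the pull request: the nested `Writable`, `IntWritable`, `FloatWritable`, `BooleanWritable` and `Text` types are hypothetical stand-ins for the Hadoop classes of the same names (so that the example compiles without Hadoop on the classpath), and `unwrap` and `toPlainMap` are names invented for this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: the nested types below are simplified stand-ins for the
// Hadoop classes in org.apache.hadoop.io, not the real implementations.
class WritableJsonSketch {

  interface Writable {}

  static final class IntWritable implements Writable {
    private final int value;
    IntWritable(int value) { this.value = value; }
    int get() { return value; }
  }

  static final class FloatWritable implements Writable {
    private final float value;
    FloatWritable(float value) { this.value = value; }
    float get() { return value; }
  }

  static final class BooleanWritable implements Writable {
    private final boolean value;
    BooleanWritable(boolean value) { this.value = value; }
    boolean get() { return value; }
  }

  static final class Text implements Writable {
    private final String value;
    Text(String value) { this.value = value; }
    @Override public String toString() { return value; }
  }

  // Hypothetical helper: cast known primitive wrappers and call get(),
  // which boxes the value; anything else falls back to toString().
  static Object unwrap(Writable w) {
    if (w == null) return null;
    if (w instanceof IntWritable) return ((IntWritable) w).get();
    if (w instanceof FloatWritable) return ((FloatWritable) w).get();
    if (w instanceof BooleanWritable) return ((BooleanWritable) w).get();
    return w.toString();
  }

  // Hypothetical metadata conversion loop: produces a map of plain Java
  // values that a JSON library can serialize as numbers, booleans or strings.
  static Map<String, Object> toPlainMap(Map<Writable, Writable> metadata) {
    Map<String, Object> out = new LinkedHashMap<>();
    for (Map.Entry<Writable, Writable> e : metadata.entrySet()) {
      out.put(e.getKey().toString(), unwrap(e.getValue()));
    }
    return out;
  }

  public static void main(String[] args) {
    Map<Writable, Writable> metadata = new LinkedHashMap<>();
    metadata.put(new Text("_depth_"), new IntWritable(1));
    metadata.put(new Text("_maxdepth_"), new IntWritable(5));
    System.out.println(toPlainMap(metadata)); // prints {_depth_=1, _maxdepth_=5}
  }
}
```

With values unwrapped this way, a JSON writer would receive `Integer` objects for `_depth_` and `_maxdepth_` rather than wrapper beans it cannot introspect.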


