[jira] [Commented] (NUTCH-2391) Spurious Duplications for MD5

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/NUTCH-2391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16075010#comment-16075010 ]

Hudson commented on NUTCH-2391:
-------------------------------

SUCCESS: Integrated in Jenkins build Nutch-trunk #3433 (See [https://builds.apache.org/job/Nutch-trunk/3433/])
NUTCH-2391 use URL for MD5 digest as fall-back if content is empty (snagel: [https://github.com/apache/nutch/commit/d35b433c397c03e78245c3e262ecaa31c78a564e])
* (edit) src/java/org/apache/nutch/crawl/MD5Signature.java


> Spurious Duplications for MD5
> -----------------------------
>
>                 Key: NUTCH-2391
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2391
>             Project: Nutch
>          Issue Type: Bug
>          Components: commoncrawl
>    Affects Versions: 1.11
>            Reporter: David Johnson
>            Priority: Minor
>             Fix For: 1.14
>
>
> We're seeing a large number of documents being marked as duplicates in our crawl.
> We traced it back to one of the crawl plugins returning an empty array for the content field. The existing null check does not catch an empty array, so every such document gets the same MD5 signature.
> We'd like to propose changing the MD5 signature generation from:
> {code}
> public byte[] calculate(Content content, Parse parse) {
>   byte[] data = content.getContent();
>   if (data == null)
>     data = content.getUrl().getBytes();
>   return MD5Hash.digest(data).getDigest();
> }
> {code}
> to:
> {code}
> public byte[] calculate(Content content, Parse parse) {
>   byte[] data = content.getContent();
>   if ((data == null) || (data.length == 0))
>     data = content.getUrl().getBytes();
>   return MD5Hash.digest(data).getDigest();
> }
> {code}
> to address the issue.
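
To illustrate why an empty content array produces spurious duplicates, here is a minimal stand-alone sketch. It is not Nutch code: the class `EmptyContentSignatureDemo` and its helpers `md5Hex` and `signature` are hypothetical names, it uses the JDK's `MessageDigest` in place of Nutch's `MD5Hash`, and it assumes UTF-8 for the URL bytes (the snippets above use the platform default charset).

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical demo class, not part of Nutch.
class EmptyContentSignatureDemo {

  // Hex-encoded MD5 of the given bytes, computed with the JDK's MessageDigest.
  static String md5Hex(byte[] data) {
    try {
      byte[] digest = MessageDigest.getInstance("MD5").digest(data);
      StringBuilder sb = new StringBuilder();
      for (byte b : digest) {
        sb.append(String.format("%02x", b));
      }
      return sb.toString();
    } catch (NoSuchAlgorithmException e) {
      // MD5 is one of the algorithms every Java platform must provide.
      throw new IllegalStateException(e);
    }
  }

  // Mirrors the proposed check: fall back to the URL when the content
  // is null OR empty. UTF-8 is an assumption made for this sketch.
  static String signature(byte[] content, String url) {
    byte[] data = content;
    if (data == null || data.length == 0) {
      data = url.getBytes(StandardCharsets.UTF_8);
    }
    return md5Hex(data);
  }

  public static void main(String[] args) {
    byte[] empty = new byte[0];
    // Without the fallback, every empty document hashes to the same value.
    System.out.println(md5Hex(empty)); // d41d8cd98f00b204e9800998ecf8427e
    // With the fallback, two empty documents at different URLs differ.
    System.out.println(signature(empty, "http://example.com/a"));
    System.out.println(signature(empty, "http://example.com/b"));
  }
}
```

Because the MD5 of zero bytes is a fixed constant, a null-only check lets every empty document share one signature; the added length check gives each such document a signature derived from its URL, while documents with real content are unaffected.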



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)