[jira] Created: (NUTCH-420) DeleteDuplicates.HashPartitioner depends on the order of IndexDocs

[jira] Created: (NUTCH-420) DeleteDuplicates.HashPartitioner depends on the order of IndexDocs

JIRA jira@apache.org
DeleteDuplicates.HashPartitioner depends on the order of IndexDocs
------------------------------------------------------------------

                 Key: NUTCH-420
                 URL: http://issues.apache.org/jira/browse/NUTCH-420
             Project: Nutch
          Issue Type: Bug
          Components: indexer
    Affects Versions: 0.9.0
            Reporter: Dogacan Güney
            Priority: Minor


DeleteDuplicates.HashPartitioner.reduce():

// byScore case
if (value.score > highest.score) {
  highest.keep = false;
  LOG.debug("-discard " + highest + ", keep " + value);
  output.collect(highest.url, highest);     // delete highest
  highest = value;
}
// !byScore is also similar

So assume two docs with the same hash are in values. If the second has a higher score than the first, then the first doc will be deleted. But if the second has a lower (or equal) score, then neither will be deleted. AFAICS, there should be an else branch that deletes value and keeps highest as it is.
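
A minimal sketch of the else branch proposed above, not the attached patch or the actual Nutch code. `DedupSketch`, `Doc`, and `discard` are hypothetical stand-ins; only the `url`, `score`, and `keep` fields mirror names from the snippet, and the null check for the first document is an assumption about the surrounding loop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified stand-in for the reducer logic (byScore case only).
class DedupSketch {
  // Minimal stand-in for IndexDoc: only the fields used by the quoted snippet.
  static class Doc {
    final String url;
    final float score;
    boolean keep = true;
    Doc(String url, float score) { this.url = url; this.score = score; }
  }

  // Returns the docs to delete from a group of docs that share one hash.
  static List<Doc> discard(List<Doc> values) {
    List<Doc> deleted = new ArrayList<>();
    Doc highest = null;
    for (Doc value : values) {
      if (highest == null) {
        highest = value;              // first doc seen is the initial candidate
      } else if (value.score > highest.score) {
        highest.keep = false;         // old candidate loses to the new value
        deleted.add(highest);
        highest = value;
      } else {
        value.keep = false;           // the missing branch: value loses, highest stays
        deleted.add(value);
      }
    }
    return deleted;
  }

  public static void main(String[] args) {
    // Hypothetical URLs and scores, for illustration only.
    Doc original = new Doc("http://example.com/original.html", 2.0f);
    Doc dup = new Doc("http://example.com/dup.html", 1.0f);
    // Both orderings discard the lower-scored document.
    System.out.println(discard(Arrays.asList(original, dup)).get(0).url);
    System.out.println(discard(Arrays.asList(dup, original)).get(0).url);
  }
}
```

With the else branch in place, a group of n docs always yields n - 1 deletions regardless of iteration order. On equal scores this sketch keeps the earlier doc, which is a choice made here and not something the report specifies.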

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



[jira] Updated: (NUTCH-420) DeleteDuplicates.HashPartitioner depends on the order of IndexDocs

JIRA jira@apache.org
     [ http://issues.apache.org/jira/browse/NUTCH-420?page=all ]

Dogacan Güney updated NUTCH-420:
--------------------------------

    Attachment: dedup.patch

Patch for the problem. This patch also slightly refactors the code.

[jira] Updated: (NUTCH-420) DeleteDuplicates.HashPartitioner depends on the order of IndexDocs

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/NUTCH-420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dogacan Güney updated NUTCH-420:
--------------------------------

    Attachment: dedup-v2.patch

[jira] Commented: (NUTCH-420) DeleteDuplicates.HashPartitioner depends on the order of IndexDocs

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/NUTCH-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12462173 ]

Dogacan Güney commented on NUTCH-420:
-------------------------------------

I realized that my last patch wraps some irrelevant LOG.debug calls in if statements. Attaching a new version that doesn't do that.

Also, the bug is in DeleteDuplicates.HashReducer, not HashPartitioner (I can't believe I wrote that wrong, twice). So if there are n documents with the same content, dedup may fail to delete the duplicates in _HashReducer_.
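
To illustrate the n-document case, here is a small sketch that replays only the byScore branch quoted in the report over bare scores. `DedupOrderDemo` and `discardedByQuotedLogic` are hypothetical names, and the null check for the first value is an assumption about the surrounding loop, which the report does not show.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo of the order dependence; not the actual Nutch reducer.
class DedupOrderDemo {
  // Mirrors the quoted branch: a value is discarded only when a later,
  // higher-scored value replaces it as the current highest.
  static List<Float> discardedByQuotedLogic(List<Float> scores) {
    List<Float> discarded = new ArrayList<>();
    Float highest = null;
    for (Float value : scores) {
      if (highest == null) {          // assumed initialisation with the first doc
        highest = value;
        continue;
      }
      if (value > highest) {
        discarded.add(highest);       // "delete highest"
        highest = value;
      }
      // no else branch: a lower-scored value is silently kept
    }
    return discarded;
  }

  public static void main(String[] args) {
    // Ascending scores: every doc but the last is discarded.
    System.out.println(discardedByQuotedLogic(Arrays.asList(1f, 2f, 3f)).size()); // prints 2
    // Descending scores: nothing is discarded, so all three duplicates survive.
    System.out.println(discardedByQuotedLogic(Arrays.asList(3f, 2f, 1f)).size()); // prints 0
  }
}
```

The same three documents produce two deletions or none depending only on the order in which the reducer sees them.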



[jira] Commented: (NUTCH-420) DeleteDuplicates.HashPartitioner depends on the order of IndexDocs

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/NUTCH-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12463056 ]

Dogacan Güney commented on NUTCH-420:
-------------------------------------

I thought I would attach an index which exhibits this bug. If you run dedup on the attached file, you can see that neither dup.html nor original.html is removed from the index even though they have the same digest.

[jira] Updated: (NUTCH-420) DeleteDuplicates.HashPartitioner depends on the order of IndexDocs

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/NUTCH-420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dogacan Güney updated NUTCH-420:
--------------------------------

    Attachment: index.tar.gz

[jira] Commented: (NUTCH-420) DeleteDuplicates.HashPartitioner depends on the order of IndexDocs

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/NUTCH-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12463059 ]

Sami Siren commented on NUTCH-420:
----------------------------------

The feather 'Licensed for inclusion in ASF works' is missing from the 2nd patch. Can you also add a test case for this?

[jira] Updated: (NUTCH-420) DeleteDuplicates.HashPartitioner depends on the order of IndexDocs

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/NUTCH-420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dogacan Güney updated NUTCH-420:
--------------------------------

    Attachment: dedup-v3.patch

[jira] Commented: (NUTCH-420) DeleteDuplicates.HashPartitioner depends on the order of IndexDocs

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/NUTCH-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12463214 ]

Dogacan Güney commented on NUTCH-420:
-------------------------------------

Attaching the patch with a testcase (I hope that I got it right, but I am new to junit).

[jira] Closed: (NUTCH-420) DeleteDuplicates.HashPartitioner depends on the order of IndexDocs

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/NUTCH-420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  closed NUTCH-420.
-----------------------------------

       Resolution: Fixed
    Fix Version/s: 0.9.0
         Assignee: Andrzej Bialecki

Fixed in rev. 495397. Thank you!