[jira] Created: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap


Hudson (Jira)
Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap
----------------------------------------------------------------------------------

                 Key: NUTCH-497
                 URL: https://issues.apache.org/jira/browse/NUTCH-497
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: 0.9.0, 0.8.1, 1.0.0
         Environment: all
            Reporter: Dennis Kubes
            Assignee: Dennis Kubes
             Fix For: 1.0.0


Some web pages contain a form of spider trap: tags nested thousands of levels deep.  When extracting outlinks, DomContentUtils uses a recursive method to walk the parsed HTML, so this kind of nesting exhausts the call stack and the parse fails with a StackOverflowError.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

Hudson (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-497:
-------------------------------

    Attachment: ExtremeNestedTags.patch

This is a rudimentary fix for those who want an immediate workaround for this issue.  The patch simply alters DomContentUtils so that it stops looking for links in nodes nested more than 50 levels deep.  I will provide a more robust patch with configuration options and a unit test when time allows.  I have run this patch successfully on a production system.
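
For readers without the attachment, here is a minimal sketch of what a depth cutoff of this kind could look like. It is not the code from ExtremeNestedTags.patch: the class name DepthLimitedLinks, the helper nestedPage, and the choice to count the root as depth 0 are all assumptions made for illustration. Only the limit of 50 comes from the comment above.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical illustration only; this is not the code in ExtremeNestedTags.patch.
final class DepthLimitedLinks {
  // Cutoff taken from the comment above; the root node counts as depth 0 (an assumption).
  static final int MAX_DEPTH = 50;

  /** Returns the href of every <a> element that sits at depth MAX_DEPTH or shallower. */
  static List<String> collect(Node root) {
    List<String> hrefs = new ArrayList<>();
    walk(root, 0, hrefs);
    return hrefs;
  }

  private static void walk(Node node, int depth, List<String> hrefs) {
    if (depth > MAX_DEPTH) {
      return; // too deep: give up on this subtree, so the recursion stays bounded
    }
    if (node.getNodeType() == Node.ELEMENT_NODE
        && "a".equalsIgnoreCase(node.getNodeName())) {
      String href = ((Element) node).getAttribute("href");
      if (!href.isEmpty()) {
        hrefs.add(href);
      }
    }
    for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
      walk(child, depth + 1, hrefs);
    }
  }

  /** Demo helper: wraps one <a href> element in `depth` nested <div> elements. */
  static Element nestedPage(int depth, String href) {
    Document doc;
    try {
      doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
    } catch (ParserConfigurationException e) {
      throw new IllegalStateException(e);
    }
    Element current = doc.createElement("a");
    current.setAttribute("href", href);
    // Build from the inside out so the link ends up `depth` levels below the root.
    for (int i = 0; i < depth; i++) {
      Element wrapper = doc.createElement("div");
      wrapper.appendChild(current);
      current = wrapper;
    }
    return current;
  }
}
```

The trade-off of this approach is that links below the cutoff are silently dropped, which is acceptable for a spider trap but would also affect any legitimate page nested that deeply.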


[jira] Updated: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

Hudson (Jira)
In reply to this post by Hudson (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-497:
-------------------------------

    Attachment: nested-tags-trap.patch

This patch reworks DomContentUtils.getOutlinks to use an explicit stack instead of recursion.  This fixes the spider-trap problem, where pages with extremely nested tags cause a StackOverflowError.  A nested spider trap test page has also been added to the fetcher tests.
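
The general idea of replacing recursion with an explicit stack can be sketched as follows. This is an illustration under assumptions, not the contents of nested-tags-trap.patch: the class IterativeLinks and its helpers are hypothetical, and the real getOutlinks does more than collect href attributes.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical illustration only; this is not the code in nested-tags-trap.patch.
final class IterativeLinks {
  /** Collects the href of every <a> element in document order without recursing. */
  static List<String> collect(Node root) {
    List<String> hrefs = new ArrayList<>();
    Deque<Node> pending = new ArrayDeque<>(); // lives on the heap, not the call stack
    pending.push(root);
    while (!pending.isEmpty()) {
      Node node = pending.pop();
      if (node.getNodeType() == Node.ELEMENT_NODE
          && "a".equalsIgnoreCase(node.getNodeName())) {
        String href = ((Element) node).getAttribute("href");
        if (!href.isEmpty()) {
          hrefs.add(href);
        }
      }
      // Push children last-to-first so the first child is popped next,
      // which keeps the visit order the same as a recursive pre-order walk.
      for (Node child = node.getLastChild(); child != null; child = child.getPreviousSibling()) {
        pending.push(child);
      }
    }
    return hrefs;
  }

  /** Demo helper: creates an empty DOM document. */
  static Document newDocument() {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
    } catch (ParserConfigurationException e) {
      throw new IllegalStateException(e);
    }
  }

  /** Demo helper: wraps one <a href> element in `depth` nested <div> elements. */
  static Element nestedPage(int depth, String href) {
    Document doc = newDocument();
    Element current = doc.createElement("a");
    current.setAttribute("href", href);
    for (int i = 0; i < depth; i++) {
      Element wrapper = doc.createElement("div");
      wrapper.appendChild(current);
      current = wrapper;
    }
    return current;
  }
}
```

Calling IterativeLinks.collect(IterativeLinks.nestedPage(20000, "http://example.com/deep")) uses a constant number of call-stack frames regardless of the nesting depth, because the nodes still to be visited are held in the ArrayDeque.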


[jira] Commented: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

Hudson (Jira)
In reply to this post by Hudson (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506596 ]

Dennis Kubes commented on NUTCH-497:
------------------------------------

The newest patch is the nested-tags-trap.patch file.


[jira] Commented: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

Hudson (Jira)
In reply to this post by Hudson (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506616 ]

Doğacan Güney commented on NUTCH-497:
-------------------------------------

Dennis, your patch is not using the variable curNodeDepth at all. I guess you can remove that.

(btw, after the change to use a stack, we no longer get an OOM or StackOverFlow no matter the depth of tag-nesting, right?)


[jira] Updated: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

Hudson (Jira)
In reply to this post by Hudson (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-497:
-------------------------------

    Attachment: nested-tags-trap2.patch

Patch with the curNodeDepth removed.  The patch file is nested-tags-trap2.patch.


[jira] Commented: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

Hudson (Jira)
In reply to this post by Hudson (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506725 ]

Dennis Kubes commented on NUTCH-497:
------------------------------------

Doğacan, that is correct.  By using the stack we shouldn't get a StackOverflow error any more no matter what the depth.  The process can still run out of memory if the stack itself gets too big but realistically I don't know of any webpage that would cause this.
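
To make the memory point concrete, the sketch below measures how large an explicit stack gets during a depth-first walk. It is a hypothetical illustration (the class PendingStackSize and its helpers do not come from any of the patches) and it assumes the walk pushes all children of a node as soon as the node is visited.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical illustration only; not part of any patch attached to this issue.
final class PendingStackSize {
  /** Walks the tree depth-first and returns the largest number of nodes ever waiting on the stack. */
  static int peak(Node root) {
    Deque<Node> pending = new ArrayDeque<>();
    pending.push(root);
    int peak = 1;
    while (!pending.isEmpty()) {
      Node node = pending.pop();
      for (Node child = node.getLastChild(); child != null; child = child.getPreviousSibling()) {
        pending.push(child);
      }
      peak = Math.max(peak, pending.size());
    }
    return peak;
  }

  private static Document newDocument() {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
    } catch (ParserConfigurationException e) {
      throw new IllegalStateException(e);
    }
  }

  /** Demo helper: `levels` <div> elements, each nested inside the previous one (levels >= 1). */
  static Element chain(int levels) {
    Document doc = newDocument();
    Element current = doc.createElement("div");
    for (int i = 1; i < levels; i++) {
      Element wrapper = doc.createElement("div");
      wrapper.appendChild(current);
      current = wrapper;
    }
    return current;
  }

  /** Demo helper: one <div> with `width` direct <p> children. */
  static Element flat(int width) {
    Document doc = newDocument();
    Element root = doc.createElement("div");
    for (int i = 0; i < width; i++) {
      root.appendChild(doc.createElement("p"));
    }
    return root;
  }
}
```

With this push-all-children scheme, a pure chain of nested tags never has more than one node waiting, while a node with 1,000 direct children puts 1,000 entries on the stack at once. The explicit stack therefore grows with the number of unvisited siblings along the current path rather than with depth alone, and since the DOM itself is already in memory, the stack is unlikely to be what exhausts the heap.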


[jira] Commented: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

Hudson (Jira)
In reply to this post by Hudson (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506775 ]

Andrzej Bialecki  commented on NUTCH-497:
-----------------------------------------

The patch looks good to me as it is now - however, I've seen similar issues with getTextHelper, too, or for that matter with any other DOM tree traversal present in Nutch (all other places in DOMContentUtils, HTMLMetaTags, CCParseFilter and HtmlLanguageParser).

We can apply the patch as is, but it would be good to come up with a general method of stack-based DOM traversal, so that we can use it in other places, too.


[jira] Commented: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

Hudson (Jira)
In reply to this post by Hudson (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506894 ]

Dennis Kubes commented on NUTCH-497:
------------------------------------

I agree, I think it would be better to have something generic if we are having this same issue (or at least the possibility) in multiple places. Let's hold off on committing this right now and let me see if I can make this more general.  Besides, if anyone needs the workaround immediately they can still get the current patch from here.


[jira] Updated: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

Hudson (Jira)
In reply to this post by Hudson (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-497:
-------------------------------

    Attachment: nested-tags-trap3.patch

Adds a utility class called NodeWalker, which provides a generic framework for stack-based walking of Node trees.  This framework is then applied to DomContentUtils and HtmlLanguageParser, reworking functionality that used to be handled by recursion.  The patch file is nested-tags-trap3.patch.
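
A reusable walker of this kind might look roughly like the sketch below. It is deliberately named SimpleNodeWalker to make clear that it is a hypothetical stand-in and not the NodeWalker class from the patch; the method names, the Iterator interface, and the names helper are assumptions for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical stand-in; this is not the NodeWalker class from nested-tags-trap3.patch.
final class SimpleNodeWalker implements Iterator<Node> {
  private final Deque<Node> pending = new ArrayDeque<>();
  private Node current;
  private boolean childrenPending;

  SimpleNodeWalker(Node root) {
    pending.push(root);
  }

  @Override
  public boolean hasNext() {
    return !pending.isEmpty();
  }

  /** Returns the next node in document order and queues its children. */
  @Override
  public Node next() {
    if (pending.isEmpty()) {
      throw new NoSuchElementException();
    }
    current = pending.pop();
    for (Node child = current.getLastChild(); child != null; child = child.getPreviousSibling()) {
      pending.push(child);
    }
    childrenPending = true;
    return current;
  }

  /** Discards the subtree below the node most recently returned by next(). */
  void skipChildren() {
    if (!childrenPending) {
      return; // nothing queued for the current node, or already skipped
    }
    // The children of `current` are exactly the top entries of the stack.
    for (Node child = current.getFirstChild(); child != null; child = child.getNextSibling()) {
      pending.pop();
    }
    childrenPending = false;
  }

  /** Example caller: lists node names, optionally not descending into one tag. */
  static List<String> names(Node root, String skipInside) {
    List<String> out = new ArrayList<>();
    SimpleNodeWalker walker = new SimpleNodeWalker(root);
    while (walker.hasNext()) {
      Node node = walker.next();
      out.add(node.getNodeName());
      if (node.getNodeName().equalsIgnoreCase(skipInside)) {
        walker.skipChildren();
      }
    }
    return out;
  }

  /** Demo helper: builds div > [script > [a], p > [a]]. */
  static Element sample() {
    Document doc;
    try {
      doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
    } catch (ParserConfigurationException e) {
      throw new IllegalStateException(e);
    }
    Element root = doc.createElement("div");
    Element script = doc.createElement("script");
    script.appendChild(doc.createElement("a"));
    Element p = doc.createElement("p");
    p.appendChild(doc.createElement("a"));
    root.appendChild(script);
    root.appendChild(p);
    return root;
  }
}
```

Keeping the traversal state inside one small class means each caller (link extraction, text extraction, meta tag parsing) only has to supply the per-node logic, and none of them needs its own recursive helper.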


[jira] Updated: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

Hudson (Jira)
In reply to this post by Hudson (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-497:
-------------------------------

    Attachment: nested-tags-trap3.patch

Added nested-tags-trap3.patch with the Apache license grant.


[jira] Updated: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

Hudson (Jira)
In reply to this post by Hudson (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-497:
-------------------------------

    Attachment: nested-tags-trap2.patch

Added nested-tags-trap2.patch with the Apache license grant.


[jira] Updated: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

Hudson (Jira)
In reply to this post by Hudson (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-497:
-------------------------------

    Attachment:     (was: nested-tags-trap3.patch)


[jira] Updated: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

Hudson (Jira)
In reply to this post by Hudson (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-497:
-------------------------------

    Attachment:     (was: nested-tags-trap2.patch)


Re: [jira] Updated: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

Dennis Kubes-2
In reply to this post by Hudson (Jira)
If no one has any objections, I will go ahead and commit this.

Dennis Kubes

Dennis Kubes (JIRA) wrote:

>      [ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Dennis Kubes updated NUTCH-497:
> -------------------------------
>
>     Attachment: nested-tags-trap3.patch
>
> added nested-tags-trap3.patch with apache grant

Re: [jira] Updated: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

chrismattmann
Dennis, +1


On 6/25/07 4:42 PM, "Dennis Kubes" <[hidden email]> wrote:

> If no one has any objections, I will go ahead and commit this.
>
> Dennis Kubes

[jira] Closed: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

Hudson (Jira)
In reply to this post by Hudson (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes closed NUTCH-497.
------------------------------


Issue resolved and committed.


[jira] Resolved: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

Hudson (Jira)
In reply to this post by Hudson (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes resolved NUTCH-497.
--------------------------------

    Resolution: Fixed

Committed with revision 550669.


[jira] Commented: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

Hudson (Jira)
In reply to this post by Hudson (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508083 ]

Hudson commented on NUTCH-497:
------------------------------

Integrated in Nutch-Nightly #129 (See [http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/129/])
