[jira] Created: (NUTCH-555) StackOverflowError in DomContentUtils


[jira] Created: (NUTCH-555) StackOverflowError in DomContentUtils

JIRA jira@apache.org
StackOverflowError in DomContentUtils
-------------------------------------

                 Key: NUTCH-555
                 URL: https://issues.apache.org/jira/browse/NUTCH-555
             Project: Nutch
          Issue Type: Bug
    Affects Versions: 0.9.0
            Reporter: Karsten Dello


Parsing the attached webpage (which contains very bad HTML) causes a StackOverflowError, because the recursion depth is too high (more than 1000).
Parsing should be stable, though; it is definitely better to just skip pages like this.

parseOutlinks in DomContentUtils is implemented recursively.
An iterative implementation would fix this, but it may be easier to simply limit the recursion to a reasonable depth.
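
To illustrate the iterative option, here is a minimal sketch of a DOM walk that uses an explicit stack instead of recursion, so nesting depth no longer consumes call-stack frames. It is not the actual Nutch code: the class IterativeOutlinkSketch and the methods collectHrefs and nestedDocument are hypothetical names, and only the standard org.w3c.dom API is assumed.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical illustration; not the real DomContentUtils.
class IterativeOutlinkSketch {

    /** Collects href values of <a> elements in document order, without recursion. */
    static List<String> collectHrefs(Node root) {
        List<String> hrefs = new ArrayList<>();
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            Node node = stack.pop();
            if (node.getNodeType() == Node.ELEMENT_NODE
                    && "a".equalsIgnoreCase(node.getNodeName())) {
                String href = ((Element) node).getAttribute("href");
                if (!href.isEmpty()) {
                    hrefs.add(href);
                }
            }
            // Push children last-to-first so the first child is visited next.
            for (Node c = node.getLastChild(); c != null; c = c.getPreviousSibling()) {
                stack.push(c);
            }
        }
        return hrefs;
    }

    /** Builds a document with `depth` nested <div> elements around a single link. */
    static Document nestedDocument(int depth, String href) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        Element current = doc.createElement("html");
        doc.appendChild(current);
        for (int i = 0; i < depth; i++) {
            Element child = doc.createElement("div");
            current.appendChild(child);
            current = child;
        }
        Element link = doc.createElement("a");
        link.setAttribute("href", href);
        current.appendChild(link);
        return doc;
    }

    public static void main(String[] args) {
        // 5000 levels of nesting, well beyond the depth reported in this issue.
        Document doc = nestedDocument(5000, "http://example.org/");
        System.out.println(collectHrefs(doc));
    }
}
```

The pending nodes live on the heap in the ArrayDeque, so the only limit on nesting depth is available memory rather than the thread's stack size.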


 

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (NUTCH-555) StackOverflowError in DomContentUtils

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/NUTCH-555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karsten Dello updated NUTCH-555:
--------------------------------

    Attachment: stacktrace.txt



[jira] Updated: (NUTCH-555) StackOverflowError in DomContentUtils

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/NUTCH-555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karsten Dello updated NUTCH-555:
--------------------------------

    Description:
Parsing certain pages (which contain very bad HTML) causes a StackOverflowError, because the recursion depth is too high (more than 1000).
Parsing should be stable, though; it is probably better to just skip pages like this.

Attached are:
a) the stacktrace
b) the segmentreader-get output for the URL where the exception is thrown

Possible fixes:
parseOutlinks in DomContentUtils is implemented recursively.
An iterative implementation would fix this, but it may be easier to simply limit the recursion to a reasonable depth.
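
As a rough sketch of the depth-limit option, the recursion can carry a depth counter and refuse to descend past a cap, reporting to the caller that the page was not fully walked so it can be skipped. This is not the actual Nutch code: the class DepthLimitedOutlinkSketch, its methods, and the cap of 500 are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical illustration; not the real DomContentUtils.
class DepthLimitedOutlinkSketch {

    /** Assumed cap; a real value would have to be tuned against the JVM stack size. */
    static final int MAX_DEPTH = 500;

    /**
     * Recursively collects href values of <a> elements.
     * Returns false if any subtree was cut off at maxDepth, so the caller
     * can decide to skip the page instead of risking a StackOverflowError.
     */
    static boolean collectHrefs(Node node, int depth, int maxDepth, List<String> out) {
        if (depth > maxDepth) {
            return false;
        }
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "a".equalsIgnoreCase(node.getNodeName())) {
            String href = ((Element) node).getAttribute("href");
            if (!href.isEmpty()) {
                out.add(href);
            }
        }
        boolean complete = true;
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            // Non-short-circuit &= keeps visiting siblings after a cut-off.
            complete &= collectHrefs(c, depth + 1, maxDepth, out);
        }
        return complete;
    }

    /** Builds a document with `depth` nested <div> elements around a single link. */
    static Document nestedDocument(int depth, String href) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        Element current = doc.createElement("html");
        doc.appendChild(current);
        for (int i = 0; i < depth; i++) {
            Element child = doc.createElement("div");
            current.appendChild(child);
            current = child;
        }
        Element link = doc.createElement("a");
        link.setAttribute("href", href);
        current.appendChild(link);
        return doc;
    }

    public static void main(String[] args) {
        List<String> links = new ArrayList<>();
        boolean complete = collectHrefs(nestedDocument(2000, "http://example.org/"), 0, MAX_DEPTH, links);
        System.out.println("complete=" + complete + ", links=" + links);
    }
}
```

The trade-off is that links nested deeper than the cap are silently lost unless the caller checks the return value, whereas an explicit-stack traversal would still find them.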


  was:
Parsing the attached webpage (which contains very bad HTML) causes a StackOverflowError, because the recursion depth is too high (more than 1000).
Parsing should be stable, though; it is definitely better to just skip pages like this.

parseOutlinks in DomContentUtils is implemented recursively.
An iterative implementation would fix this, but it may be easier to simply limit the recursion to a reasonable depth.


 




[jira] Updated: (NUTCH-555) StackOverflowError in DomContentUtils

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/NUTCH-555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karsten Dello updated NUTCH-555:
--------------------------------

    Attachment:     (was: stacktrace.txt)



[jira] Updated: (NUTCH-555) StackOverflowError in DomContentUtils

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/NUTCH-555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karsten Dello updated NUTCH-555:
--------------------------------

    Attachment:     (was: readseg.txt)



[jira] Updated: (NUTCH-555) StackOverflowError in DomContentUtils

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/NUTCH-555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karsten Dello updated NUTCH-555:
--------------------------------

    Attachment: readseg.txt



[jira] Updated: (NUTCH-555) StackOverflowError in DomContentUtils

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/NUTCH-555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karsten Dello updated NUTCH-555:
--------------------------------

    Attachment: readseg.txt



[jira] Updated: (NUTCH-555) StackOverflowError in DomContentUtils

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/NUTCH-555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karsten Dello updated NUTCH-555:
--------------------------------

    Attachment: stacktrace.txt



[jira] Resolved: (NUTCH-555) StackOverflowError in DomContentUtils

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/NUTCH-555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes resolved NUTCH-555.
--------------------------------

    Resolution: Duplicate

Solved by NUTCH-497



[jira] Closed: (NUTCH-555) StackOverflowError in DomContentUtils

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/NUTCH-555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes closed NUTCH-555.
------------------------------

    Assignee: Dennis Kubes

Issue closed, fixed by NUTCH-497

