[jira] [Commented] (NUTCH-2769) Nutch 1.15 unable to parse certain outlinks


Chris Mattmann (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17047290#comment-17047290 ]

Prajeeth Emanuel commented on NUTCH-2769:

Hey guys, thanks for the quick reply!

Does this mean the bug can be fixed in parse-html? If not, how reliable is parse-tika for parsing web pages, and how does it differ? We currently use it to parse all our PDFs.

> Nutch 1.15 unable to parse certain outlinks
> --------------------------------------------
>                 Key: NUTCH-2769
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2769
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.15, 1.16
>            Reporter: Prajeeth Emanuel
>            Priority: Major
> Nutch is unable to parse certain outlinks in pages. 
> For example:
> Crawling [http://d4fdot.com/pbfdot/PBC-North_index.asp] does not parse the outlinks: 
> [congress_avenue_lighting_improvements.asp|http://www.d4fdot.com/pbfdot/congress_avenue_lighting_improvements.asp]
> [blue_heron_boulevard_bridge_fender_replacement.asp|http://www.d4fdot.com/pbfdot/blue_heron_boulevard_bridge_fender_replacement.asp]
> [indiantown_road_intersection_improvements.asp|http://www.d4fdot.com/pbfdot/indiantown_road_intersection_improvements.asp]
> Crawling [http://www.d4fdot.com/pbfdot/index.asp], however, parses [congress_avenue_lighting_improvements.asp|http://www.d4fdot.com/pbfdot/congress_avenue_lighting_improvements.asp] correctly, even though the anchor element is structured similarly.
> URL filters and normalizers have been pared down to near no-ops, so the current config does not ignore any URLs or outlinks, yet the error still occurs.
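
One way to narrow down a report like the one quoted above is to check, outside of Nutch, which anchors are present in the served HTML and how their relative hrefs resolve. The following is a minimal sketch using only the Python standard library. It does not reproduce the logic of parse-html or parse-tika; the `AnchorCollector` class, the `extract_links` helper, and the sample markup are hypothetical and not taken from the real page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collects absolute link targets from <a href> tags, honoring <base href>."""

    def __init__(self, page_url):
        super().__init__()
        self.base_url = page_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before this call.
        attrs = dict(attrs)
        href = attrs.get("href")
        if not href:
            return
        if tag == "base":
            # A <base> element changes how later relative links resolve.
            self.base_url = urljoin(self.base_url, href)
        elif tag == "a":
            self.links.append(urljoin(self.base_url, href))


def extract_links(html, page_url):
    """Return the absolute URLs of all anchors found in the given HTML text."""
    parser = AnchorCollector(page_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical markup for illustration only; not copied from the real page.
SAMPLE_HTML = """
<html><body>
<a href="congress_avenue_lighting_improvements.asp">Congress Avenue</a>
<A HREF="indiantown_road_intersection_improvements.asp">Indiantown Road</A>
<a name="top">no href here</a>
</body></html>
"""

if __name__ == "__main__":
    for link in extract_links(SAMPLE_HTML, "http://d4fdot.com/pbfdot/PBC-North_index.asp"):
        print(link)
```

If the anchors show up in a check like this but not among the outlinks Nutch records for the same page, that points at the parser plugin rather than at the URL filters or normalizers.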

This message was sent by Atlassian Jira