Can't parse html on some urls



Enrico Triolo-2
Hi, I used Nutch to fetch and index only this page:

When I run a query to retrieve this document, the hit comes back correctly,
but I can't get 'clean' content: the text I get back contains the HTML markup
(*and* the content). If I perform the same operation on other URLs,
everything works as expected.

Here's the code I use to extract the content:

NutchBean bean = ...; //Instantiate bean

//Perform query

Hit hit = hits.getHit(0);
HitDetails details = bean.getDetails(hit);

String content = bean.getParseText(details).getText(); // getText() already returns a String
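
To narrow down which URLs are affected, a small check along these lines could be run on the extracted text. This is only a sketch: MarkupCheck and containsMarkup are hypothetical names, not part of Nutch, and the regular expression is a rough heuristic rather than a real HTML parser.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of Nutch): flags text that still looks like HTML.
class MarkupCheck {
    // Rough heuristic: an opening or closing tag such as <p>, </div>, <br/> or <a href="...">.
    private static final Pattern TAG =
            Pattern.compile("</?[a-zA-Z][a-zA-Z0-9]*(\\s[^<>]*)?/?>");

    // Returns true if the text contains at least one tag-like sequence.
    static boolean containsMarkup(String text) {
        return text != null && TAG.matcher(text).find();
    }

    public static void main(String[] args) {
        // Sample strings made up for illustration only.
        System.out.println(containsMarkup("<html><body>Hello</body></html>"));
        System.out.println(containsMarkup("Hello, plain text with 1 < 2"));
    }
}
```

With the code above, calling MarkupCheck.containsMarkup(content) for each hit would show whether the parse text for that document still carries tags.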

I guess it is a problem in the parsing routine?