Can't parse HTML on some URLs

Enrico Triolo-2
Hi, I used Nutch to fetch and index only this page:

http://www.althack.com/index.php?option=com_content&task=view&id=24&Itemid=27

When I run a query that matches this document, the hit comes back correctly,
but the parse text is not 'clean': it contains the raw HTML markup *and* the page content.
If I perform the same operation on other URLs, everything works as expected.

Here's the code I use to extract the content:

NutchBean bean = ...; // Instantiate bean

// Perform query
...

Hit hit = hits.getHit(0);
HitDetails details = bean.getDetails(hit);

// getText() already returns a String, so no copy via new String(...) is needed
String content = bean.getParseText(details).getText();
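
To confirm that the returned string really still carries markup (rather than text that merely contains '<' or '>'), a small check along these lines could help. It is only a sketch: the class name MarkupCheck, the method containsMarkup, and the sample strings are made up for illustration and are not part of Nutch; it uses only java.util.regex.

```java
import java.util.regex.Pattern;

public class MarkupCheck {
    // Matches something shaped like an opening, closing, or self-closing HTML tag,
    // e.g. <td class="x">, </p>, or <br/>. A lone '<' followed by a space does not match.
    private static final Pattern TAG =
            Pattern.compile("</?[a-zA-Z][a-zA-Z0-9]*(\\s[^<>]*)?/?>");

    /** Returns true if the text contains at least one tag-like token. */
    public static boolean containsMarkup(String text) {
        return text != null && TAG.matcher(text).find();
    }

    public static void main(String[] args) {
        // Hypothetical samples standing in for the value of `content` above.
        System.out.println(containsMarkup("<td class=\"contentheading\">Title</td>")); // true
        System.out.println(containsMarkup("Title and body text only"));               // false
    }
}
```

Calling containsMarkup(content) on the problem page and on a page that parses fine would show whether the difference is in the stored parse text itself or somewhere later in the display path.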

I guess it is a problem in the parsing routine?

Cheers,
Enrico