When I run a query to retrieve this document, I get the correct hit,
but I can't get the 'clean' content: the extracted text includes the HTML markup (*and* the content).
If I perform the same operation on other URLs, everything works as expected.
Here's the code I use to extract the content:
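NutchBean bean = ...; // instantiate bean
Hit hit = hits.getHit(0); // hits: the result of the earlier search
HitDetails details = bean.getDetails(hit);
// getText() already returns a String, so wrapping it in new String(...) is unnecessary
String content = bean.getParseText(details).getText();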