HTML Support for jsoup-extractor in Nutch 2.x?

Michael Chen
Hi,

I'm trying to use the new jsoup-extractor in Nutch 2.x, but it throws a
"The markup in the document following the root element must be
well-formed" error when I hand it HTML. I re-read the description of
NUTCH-2389, and it seems to be designed to parse XML only.

I'm still quite new to Nutch, so I wanted some opinions on this: should I
try to implement HTML DOM building for jsoup-extractor, or is that too
much work or not feasible in Nutch 2.x? Any suggestions would be greatly
appreciated!

Go Nutch!

Michael


Re: HTML Support for jsoup-extractor in Nutch 2.x?

Michael Chen
Never mind, the problem doesn't exist. After reading the code, I realized
that the error comes from the out-of-the-box jsoup-extractor.xml, which is
missing an <extractor> root element. The example XML is correct, though.
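
For anyone hitting the same message, here is a minimal sketch of why a
config file without a single root element trips the XML parser. It uses
only the JDK's built-in DOM parser, not Nutch's actual config loader, and
the <rule> element name is made up for illustration; only the <extractor>
root comes from the real file.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class RootElementCheck {

    /** Returns the parser's error message, or null if the XML is well-formed. */
    static String parseError(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return null;
        } catch (SAXException e) {
            return e.getMessage();
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical config body: two top-level elements, no common root.
        String withoutRoot = "<rule/><rule/>";
        // Same body wrapped in the <extractor> root element.
        String withRoot = "<extractor><rule/><rule/></extractor>";

        System.out.println("without root: " + parseError(withoutRoot));
        System.out.println("with root:    " + parseError(withRoot));
    }
}
```

The first string should be rejected with a message about markup following
the root element, and the wrapped one should parse cleanly. So the error
comes from reading the config file, not from the HTML being parsed.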

So HTML is supported, via the jsoup HTML parser. I'm not getting any
extracted values yet, but I'll keep trying.

Thanks!

Michael


On 08/02/2017 02:42 PM, Michael Chen wrote:

> Hi,
>
> I'm trying to use the new jsoup-extractor in Nutch 2.x but it gives
> "The markup in the document following the root element must be
> well-formed" error when I hand it HTML. I re-read the descriptions in
> NUTCH-2389 and it seems that it's designed to parse XML only.
>
> I'm still quite new to Nutch so I wanted some opinions on this, should
> I try to implement HTML DOM building for jsoup-extractor or is it too
> much work/not feasible in Nutch 2.x? Any suggestions would be greatly
> appreciated!
>
> Go Nutch!
>
> Michael
>
