Iterating spidered pages


Fredrik Andersson
Hi!

I'm new to this list, so hello to you all.

Here's the gig - I have crawled and indexed a bunch of pages. The HTML
parser used in Nutch only parses out the title, text, metadata and
outlinks. Is there any way to extend this set of attributes
post-crawling (i.e., without rewriting HtmlParser.java)? I'd like to
iterate over all the crawled pages, access their raw data, parse out
some chunk of text and save it as a detail field or similar.

I haven't really got the hang of all the connections in the API yet,
so forgive a poor guy for being a newbie.

Big thanks in advance,
Fredrik

Re: Iterating spidered pages

Andy Liu
You can use a SegmentReader object to give you references to the
FetcherOutput, ParseData, and Content objects for each page in the
segment.  The raw page data is encapsulated within the Content object
so you can parse out whatever you want from it.

However, somebody correct me if I'm wrong, but I don't think you can
update individual ArrayFile entries once they've been written.  So
while you're looping over each ParseData entry, you can write your
updated ParseData objects to a temporary ArrayFile and replace the
old one with it when you're done.

Andy
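For what it's worth, that loop-and-rewrite pattern can be sketched in plain Java. An ordinary text file stands in here for the segment's ArrayFile (whose entries, as noted, can't be updated in place), and addDetail() stands in for whatever re-parsing you'd do on each page; both are made up for illustration:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

public class RewriteEntries {

    // Stand-in for re-parsing a page's raw data into a new detail field.
    static String addDetail(String entry) {
        return entry + ":detail";
    }

    // Loop over every entry, write updated copies to a temporary file,
    // then replace the old file with the new one in a single move.
    static void rewrite(Path original) throws IOException {
        Path updated = original.resolveSibling(original.getFileName() + ".tmp");
        List<String> out = new ArrayList<>();
        for (String entry : Files.readAllLines(original)) {
            out.add(addDetail(entry));
        }
        Files.write(updated, out);
        Files.move(updated, original, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("segment", ".dat");
        Files.write(p, List.of("page-1", "page-2"));
        rewrite(p);
        System.out.println(Files.readAllLines(p)); // [page-1:detail, page-2:detail]
    }
}
```

The same two steps apply with the real SegmentReader/ArrayFile classes: read each entry, write the transformed entry to a scratch segment, and swap it in for the old one when the pass is complete.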


Re: Iterating spidered pages

Andrzej Białecki
Andy Liu wrote:

> However, somebody correct me if I'm wrong, but I don't think you can
> update individual ArrayFile entries once they've been written.  So
> while you're looping over each ParseData entry, you can write your
> updated ParseData objects to a temporary ArrayFile and replace the
> old one with it when you're done.

Yes, that's correct. Currently the only place where one can add some
custom data without changing the core classes (Content, ParseData,
ParseText) would be the metadata attributes. There are actually two
metadata collections - one at the protocol level (Content.metadata) and
the other at parse level (ParseData.metadata).
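If I remember the 0.7-era API correctly, both of those collections are plain java.util.Properties, so stashing a custom field is just a string key/value round trip. A minimal sketch (the key name "x-my-detail" and the accessor comments are my own illustration, not part of the API):

```java
import java.util.Properties;

public class MetadataSketch {
    public static void main(String[] args) {
        // Stand-in for ParseData.getMetadata() (or Content.getMetadata());
        // in the 0.7-era API both are plain java.util.Properties.
        Properties parseMeta = new Properties();

        // Store a custom field extracted from the raw page post-crawl...
        parseMeta.setProperty("x-my-detail", "chunk of text parsed from raw content");

        // ...and read it back later, e.g. from an indexing plugin.
        System.out.println(parseMeta.getProperty("x-my-detail"));
        // prints: chunk of text parsed from raw content
    }
}
```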


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com