getting page content for nutch search result

Leslie Rohde

The list archives contain _some_ questions and answers on this,
but nothing that is definitive.  What I want is like the "cache"
button in Google results pages -- not as a user interface feature,
but to be able to access and process the entire web page that
is referenced by a Nutch search result.

Page does not do the trick (despite the name).
Content I have not figured out.
SegmentReader was suggested in one of the archived messages,
but the relation it has to search results is far from clear.

I gotta' believe that this is simple and I just don't know where
to look.  All pointers appreciated.

Thanks,
Leslie.

Re: getting page content for nutch search result

Leslie Rohde
duh!  Simple as I thought, just much higher up in the "stack".

public byte[] getContent(HitDetails hit) in NutchBean

This works for me.

Although, BTW, digging into this method, it would be
convoluted to pass around a reference to the document
content without the NutchBean instance.  NutchBean ties
together the several pieces of data required to get
to the documents found via a query.  As far as I can
tell at this point, the bean is the only place these
individual parts are brought together.
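
One way to avoid handing the whole bean to every piece of processing code
is to hide it behind a one-method interface. The sketch below is only an
illustration of that idea and does not use the real Nutch classes:
HitDetails and ContentSource here are hypothetical stand-ins, modeled
only on the getContent signature quoted above, and decoding as UTF-8 is
an assumption, since real pages may use other encodings.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class CachedContentSketch {

    // Hypothetical stand-in for Nutch's HitDetails; only a URL is modeled.
    static final class HitDetails {
        final String url;
        HitDetails(String url) { this.url = url; }
    }

    // Hypothetical narrow interface mirroring the one method named above:
    //   public byte[] getContent(HitDetails hit)
    // Assumption: with the real classes, a thin adapter around the bean
    // could implement this, handling any checked exceptions it throws.
    interface ContentSource {
        byte[] getContent(HitDetails hit);
    }

    // Processing code depends only on ContentSource, not on the bean.
    // UTF-8 decoding is an assumption made for this sketch.
    static String pageText(ContentSource source, HitDetails hit) {
        byte[] raw = source.getContent(hit);
        return raw == null ? "" : new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // In-memory fake standing in for the stored page content.
        Map<String, byte[]> store = new HashMap<>();
        store.put("http://example.com/",
                  "<html>hello</html>".getBytes(StandardCharsets.UTF_8));

        ContentSource fake = hit -> store.get(hit.url);
        System.out.println(pageText(fake, new HitDetails("http://example.com/")));
    }
}
```

The same pageText call would work unchanged whether the source is this
in-memory fake or an adapter that delegates to a live NutchBean.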

leslie


