How to get Text and Parse data for URL

Dennis Kubes
Can somebody direct me on how to get the stored text and parse metadata
for a given url?

Dennis

Re: How to get Text and Parse data for URL

Doug Cutting
NutchBean.getContent() and NutchBean.getParseData() do this, but require
a HitDetails instance.  In the non-distributed case, the only required
field of the HitDetails for these calls is "url".  In the distributed
case, the "segment" field must also be provided, so that the request can
be routed to a node serving that segment.  These are implemented by
FetchedSegments.java and DistributedSearch.java.
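
To make the url-versus-segment distinction concrete, here is a toy model in Python.  It is not the Nutch API: LocalStore, DistributedStore, and the plain dict used as "details" are hypothetical stand-ins for FetchedSegments, DistributedSearch, and HitDetails.  The segment names are made up, and scanning every local segment is an assumption about how a url-only lookup could work.  The only behavior taken from the explanation above is that a local lookup needs just "url", while a distributed lookup also needs "segment" to pick a node.

```python
class LocalStore:
    """All segments live in one process (hypothetical stand-in)."""

    def __init__(self, segments):
        # segments: {segment_name: {url: parse_text}}
        self.segments = segments

    def get_parse_text(self, details):
        url = details["url"]  # the only field required here
        # Assumption: with every segment local, we can simply check each one.
        for pages in self.segments.values():
            if url in pages:
                return pages[url]
        return None


class DistributedStore:
    """Each node serves some segments (hypothetical stand-in)."""

    def __init__(self, nodes):
        # Build a segment -> node routing table from what each node serves.
        self.route = {seg: node for node in nodes for seg in node.segments}

    def get_parse_text(self, details):
        segment = details.get("segment")
        if segment is None:
            raise ValueError('distributed lookup needs the "segment" field')
        # Route the request to the node serving that segment.
        return self.route[segment].get_parse_text(details)


node_a = LocalStore({"seg-a": {"http://example.com/": "hello"}})
node_b = LocalStore({"seg-b": {"http://example.org/": "world"}})
cluster = DistributedStore([node_a, node_b])

local_text = node_a.get_parse_text({"url": "http://example.com/"})
remote_text = cluster.get_parse_text(
    {"url": "http://example.org/", "segment": "seg-b"}
)
```

In this sketch the routing table is the whole reason the "segment" field matters: without it the cluster has no way to choose a node short of asking all of them.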

Doug


Re: How to get Text and Parse data for URL

Dennis Kubes
That got me started.  I think I do not fully understand the role the
segments directory and its contents play.  It looks like it holds parse
text and parse data in map files, but what is the content folder (also a
map file)?  And are the segment contents still used once the index is
created?

Dennis Kubes



Re: How to get Text and Parse data for URL

Dennis Kubes
Truly, I just do not understand the concept of a segment.


Re: How to get Text and Parse data for URL

Andrzej Białecki-2
In reply to this post by Dennis Kubes
Dennis Kubes wrote:
> Can somebody direct me on how to get the stored text and parse
> metadata for a given url?

From a single segment, or from a set of segments?

From a single segment: please see how SegmentReader.get() does this
(although it's a bit obscured by the fact that it uses multiple threads
to retrieve different parts of the data).

For multiple segments, it would help to know in advance which segment
holds the data associated with the URL; that is what the Lucene index is
normally for ;) - please see FetchedSegments for details.
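
The multi-segment case can be sketched as a two-step lookup: first ask an index which segment holds the URL, then read from that segment only.  This is a toy model in Python, not Nutch code.  The dict built by build_url_index is a hypothetical stand-in for the Lucene index, the nested dicts stand in for segment data, and the segment names are made up.

```python
def build_url_index(segments):
    """Map each URL to the name of the segment that holds it."""
    index = {}
    for name, pages in segments.items():
        for url in pages:
            # If a URL appears in several segments, the one iterated
            # last wins in this sketch.
            index[url] = name
    return index


def lookup(url, index, segments):
    """Return (segment_name, data) for a URL, or None if it is unknown."""
    segment = index.get(url)
    if segment is None:
        return None
    # Only the one segment named by the index is read.
    return segment, segments[segment][url]


segments = {
    "seg-1": {"http://example.com/a": "text of a"},
    "seg-2": {"http://example.com/b": "text of b"},
}
index = build_url_index(segments)
found = lookup("http://example.com/b", index, segments)
```

Without the index, the only option in this model would be to probe every segment in turn, which is what knowing the segment in advance avoids.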

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: How to get Text and Parse data for URL

Doug Cutting
In reply to this post by Dennis Kubes
Dennis Kubes wrote:
> I think that I am not fully understanding the role
> the segments directory and its contents play.

A segment is simply a set of URLs fetched in the same round, and the
data associated with these URLs.  The content subdirectory contains the
raw HTTP content.  The parse-text subdirectory contains the extracted
text, used when indexing and when building snippets for hits.  The index
subdirectory holds a Lucene index of the pages in the segment.  Etc.  It
is an independent chunk of Nutch data.

In 0.8, each segment subdirectory is further split into parts, the
result of distributed processing.  The parts are split by the hash of
the URL.
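
Splitting by the hash of the URL can be sketched as follows.  This is a minimal Python illustration, not Nutch code: part_for and split_into_parts are hypothetical names, and zlib.crc32 is a stand-in hash chosen only because it is stable across runs.  Which hash function Nutch actually uses is not stated here.

```python
import zlib


def part_for(url, num_parts):
    """Pick a part number from a hash of the URL."""
    # Stand-in hash: any deterministic hash of the URL would do here.
    return zlib.crc32(url.encode("utf-8")) % num_parts


def split_into_parts(urls, num_parts):
    """Distribute URLs over num_parts lists according to part_for()."""
    parts = [[] for _ in range(num_parts)]
    for url in urls:
        parts[part_for(url, num_parts)].append(url)
    return parts


urls = [
    "http://example.com/",
    "http://example.com/about",
    "http://example.org/",
    "http://example.net/index.html",
]
parts = split_into_parts(urls, 3)
```

The useful property is that the same URL always lands in the same part, so a reader can compute which part to open instead of searching all of them.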

Does that help?

Doug

RE: How to get Text and Parse data for URL

Prashant Purkar
In reply to this post by Dennis Kubes
One trick is to search on the URL; the "explain" link shows which
segment it belongs to, say 1200604211450.

Then use the segread command (this works for 0.7.2):

bin/nutch segread -dumpsort -nocontent segments/1200604211450

That shows the text and parse data for a URL.

Thanks
P




-----Original Message-----
From: Dennis Kubes [mailto:[hidden email]]
Sent: Wednesday, April 26, 2006 1:42 AM
To: [hidden email]
Subject: How to get Text and Parse data for URL

Can somebody direct me on how to get the stored text and parse metadata
for a given url?

Dennis