Does nutch 0.8.x have a command like bin/nutch fetchlist -dumpurls

Bryan Woliner
Hi,

When I was using nutch 0.7, I found the bin/nutch fetchlist -dumpurls
command to be very useful. However, I have not been able to find an
equivalent command in nutch 0.8.x.

Essentially all I want to do is dump all urls stored in a certain segment
(or group of segments) into a text file.

In nutch 0.7.x I would call a command like this:

$ bin/nutch org.apache.nutch.pagedb.FetchListEntry -dumpurls $s1 >foo.txt

Any suggestions for how this can be accomplished in nutch 0.8.x are very
much appreciated.

Thanks,
Bryan

Re: Does nutch 0.8.x have a command like bin/nutch fetchlist -dumpurls

kettle
You could do something like this:

bin/nutch readseg -dump $NUTCH_HOME/crawl/segment/SEGNAME OUTPUT_DIR/ \
    -nocontent -nogenerate -noparse -noparsedata -noparsetext

This will write a file called 'dump' to OUTPUT_DIR/ containing the fetcher
data only.  Each entry will look something like:

Recno:: 4
URL:: http://www.examplepage.com

CrawlDatum::
Version: 4
Status: 5 (fetch_success)
Fetch time: Tue Nov 07 22:54:09 JST 2006
Modified time: Thu Jan 01 09:00:00 JST 1970
Retries since fetch: 0
Retry interval: 30.0 days
Score: 1.0
Signature: 71fc0f7885a5766980c785a72934dcb0
Metadata: null

You could then grab the URLs based on the 'Status' value, for example
keeping only the entries marked fetch_success.  Dumping only the content
instead gives similarly structured entries.  If there is a faster way,
please let me know!
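
As a rough sketch of that filtering step, the awk-based shell function below pulls the URLs out of a dump file. It assumes every record has the layout shown above (a "URL:: " line followed later by a "Status: " line with the status name in parentheses). The function name extract_urls and the default status fetch_success are illustrative choices, not part of nutch.

```shell
#!/bin/sh
# Print the URL of each record in a readseg dump whose Status line
# carries the wanted status name (default: fetch_success).
# Assumption: records look like the sample entry shown in this thread.
extract_urls() {
    dump_file=$1
    wanted=${2:-fetch_success}
    awk -v wanted="$wanted" '
        # Remember the URL of the record currently being read.
        /^URL:: /   { url = substr($0, 7); printed = 0 }
        # Print that URL once if a Status line names the wanted status.
        /^Status: / {
            if (!printed && url != "" && index($0, "(" wanted ")")) {
                print url
                printed = 1
            }
        }
    ' "$dump_file"
}
```

With the dump written as above, something like `extract_urls OUTPUT_DIR/dump > foo.txt` should give one URL per line.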

Check out bin/nutch readseg
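
For a group of segments, one option is to repeat the same readseg call per segment. The loop below is only a sketch: the function name dump_segments and the NUTCH_BIN override are hypothetical, and it assumes each segment is a subdirectory of the directory you pass in. It writes each segment's dump to its own output subdirectory.

```shell
#!/bin/sh
# Run the readseg dump from this thread once per segment directory.
# NUTCH_BIN is a hypothetical override; it defaults to bin/nutch.
dump_segments() {
    segments_dir=$1
    out_root=$2
    for seg in "$segments_dir"/*; do
        # Skip anything that is not a directory.
        [ -d "$seg" ] || continue
        "${NUTCH_BIN:-bin/nutch}" readseg -dump "$seg" \
            "$out_root/$(basename "$seg")" \
            -nocontent -nogenerate -noparse -noparsedata -noparsetext
    done
}
```

It could be called as `dump_segments $NUTCH_HOME/crawl/segment OUTPUT_DIR`, leaving one 'dump' file per segment under OUTPUT_DIR/.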

cheers!

