nutch 0.8: invertlinks IOException segments/parse_data

nutch 0.8: invertlinks IOException segments/parse_data

Alexander E Genaud
Hello, I am receiving an IOException when running a whole-web crawl
via Cygwin. Interestingly (to me at least), the error reads:

..../crawl/segments/parse_data

rather than

..../crawl/segments/20060729123456/parse_data


$ nutch-0.8/bin/nutch invertlinks crawl/linkdb crawl/segments
Exception in thread "main" java.io.IOException: Input directory
c:/alex/vicaya-root/trunk/dist/vicaya-0.2.0/vicaya/crawl
/segments/parse_data in local is invalid.
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
        at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:212)
        at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:316)
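A quick way to see why the error names crawl/segments/parse_data rather than a timestamped subdirectory: without a directory-scanning option, each argument seems to be treated as a single segment, so parse_data is expected directly underneath it. A minimal sketch of the on-disk layout (the timestamped segment name here is made up, not from my actual crawl):

```shell
# Illustrative layout only; the timestamped segment name is made up.
demo=$(mktemp -d)
mkdir -p "$demo/crawl/segments/20060729123456/parse_data"
cd "$demo"
# There is no parse_data directly under segments/ ...
ls -d crawl/segments/parse_data 2>/dev/null || echo "missing: crawl/segments/parse_data"
# ... it lives one level down, inside each timestamped segment:
ls -d crawl/segments/*/parse_data
```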


My crawl deviates from the tutorial in that I am hitting localhost, I
have created the URL seeds manually, my crawl/crawldb etc. directories
are in a different location, and my regex-urlfilter.txt looks like
this:


-^(file|ftp|mailto):
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$
-[?*!@=]
-.*(/.+?)/.*?\1/.*?\1/
+^http://([a-z0-9]*\.)*localhost:8108/
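For what it's worth, the accept pattern can be sanity-checked quickly outside Nutch. Note that Nutch evaluates these as Java regexes, so grep -E is only a rough approximation, and the URLs below are made up:

```shell
# Rough check with grep -E (Nutch itself uses Java regexes, so this is
# only an approximation). A URL is accepted if it matches the '+'
# pattern and none of the '-' patterns.
accept='^http://([a-z0-9]*\.)*localhost:8108/'
echo "http://localhost:8108/index.html" | grep -qE "$accept" && echo pass
echo "http://example.com/index.html"    | grep -qE "$accept" || echo reject
```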


Does anything seem immediately/obviously wrong to anyone?

Re: nutch 0.8: invertlinks IOException segments/parse_data

Sami Siren-2
Please try:

bin/nutch invertlinks crawl/linkdb -dir crawl/segments/
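The difference appears to be that -dir scans crawl/segments/ for segment subdirectories, whereas without it each argument must itself be a segment. Listing the segments explicitly via a shell glob should therefore be equivalent (flag behaviour inferred from this thread, not checked against the Nutch source; segment names below are made up):

```shell
# Illustrative: with two made-up segments, the shell glob expands to the
# per-segment paths that invertlinks (without -dir) expects as arguments.
demo=$(mktemp -d)
cd "$demo"
mkdir -p crawl/segments/20060729123456 crawl/segments/20060730123456
echo bin/nutch invertlinks crawl/linkdb crawl/segments/*
```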

--
  Sami Siren



Re: nutch 0.8: invertlinks IOException segments/parse_data

Alexander E Genaud
Thanks, Sami Siren,

The "-dir" flag appears to work (at least I get no exceptions), though
mind you, I have not verified that the segments are searchable. I have
also run a script much closer to the tutorial (but with localhost
seeds) on Linux.

On neither Windows/Cygwin nor Linux did invertlinks work for me when
creating a crawldb from scratch without the "-dir" flag.

I will test this all more thoroughly on my Windows and Linux boxes and
report the success/fail cases.

Note also that the updatedb command works well on Linux but failed
intermittently on Windows/Cygwin; I just cleaned up and tried again.
Success was about 50/50 per update, so a crawl of numRounds updates
completes with probability of roughly 1/(2^numRounds).
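The 1/(2^numRounds) figure is just the per-update coin flip compounding, assuming the failures are independent between updates. A quick arithmetic check:

```shell
# If each updatedb call succeeds with probability ~0.5 and failures are
# independent, a crawl of numRounds updates all succeeding has
# probability 0.5^numRounds.
numRounds=4
awk -v n="$numRounds" 'BEGIN { printf "%g\n", 0.5 ^ n }'
```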

Hope some of this is useful for debugging.

Alex

--
-- 55.67N 12.588E
CCC7 D19D D107 F079 2F3D BF97 8443 DB5A 6DB8 9CE1
--