Local filesystem crawl problem

Local filesystem crawl problem

Paolo Mazzoni
Hi all,
I'm very new to Nutch and I'd like to set up a single search server.

I'm currently working on Windows Vista for testing purposes.
I was able to crawl some websites, but when I try to configure Nutch
to crawl the local file system I get the following error:

Paolo@PC-Paolo /cygdrive/c/sviluppo/CVS/apps/nutch-0.9
$ bin/nutch crawl urls -dir crawl-localfs -depth 3 -topN 50
crawl started in: crawl-localfs
rootUrlDir = urls
threads = 10
depth = 3
topN = 50
Injector: starting
Injector: crawlDb: crawl-localfs/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:162)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:115)

-----

My crawl-urlfilter.txt is:

# The url filter file used by the crawl command.

# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls
#-^(file|ftp|mailto):
-^(http|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# accept hosts in MY.DOMAIN.NAME
#+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

# skip everything else
#-.
+.*

----

And my nutch-site.xml is (I have a lot of doubts about this):

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
 <property>
  <name>plugin.includes</name>
  <value>protocol-file|urlfilter-regex|parse-(xml|text|html|js|pdf)|index-basic|query-(basic|site|url)</value>
 </property>
 <property>
  <name>file.content.limit</name> <value>-1</value>
 </property>
</configuration>

----

Under the urls/ directory I have a text file urls.txt containing the folders to index:

file:///cygdrive/c/Temp/


Re: Local filesystem crawl problem (SOLVED)

Paolo Mazzoni
In hadoop.log I found this message:

2008-08-11 10:17:38,097 WARN  mapred.LocalJobRunner - job_2kz6mm
java.lang.RuntimeException: No scoring plugins - at least one scoring plugin is required!
 at org.apache.nutch.scoring.ScoringFilters.<init>(ScoringFilters.java:85)
 at org.apache.nutch.crawl.Injector$InjectMapper.configure(Injector.java:61)
 at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
 at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
 at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
 at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
 at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:170)
 at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)

So I added scoring-opic to the plugin.includes value in the nutch-site.xml file:

...|scoring-opic...

That solved it. I now have a new error that I will post soon.
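
For reference, a minimal sketch of how the updated property could look, assuming scoring-opic is simply appended to the plugin.includes value from the original post:

 <property>
  <name>plugin.includes</name>
  <!-- assumption: scoring-opic appended to the original value -->
  <value>protocol-file|urlfilter-regex|parse-(xml|text|html|js|pdf)|index-basic|query-(basic|site|url)|scoring-opic</value>
 </property>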

Thank you
Paolo

Re: Local filesystem crawl problem

Paolo Mazzoni
Following up on the earlier error (see the original post for the config files),
I now get this error:

...
fetching file:///cygdrive/c/Temp
org.apache.nutch.protocol.file.FileError: File Error: 404
        at org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:100)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:145)
fetch of file:///cygdrive/c/Temp failed with: org.apache.nutch.protocol.file.FileError: File Error: 404
Fetcher: done
....

In the text file urls/urls.txt I configured the crawl to start in the directory
file:///cygdrive/c/Temp

The folder exists, but Nutch doesn't seem to find it; I expected the crawler to
find the files inside it and fetch them, but it doesn't.

Thank you
Paolo

Re: Local filesystem crawl problem

Andrzej Białecki-2
Paolo Mazzoni wrote:

> fetch of file:///cygdrive/c/Temp failed with:
> org.apache.nutch.protocol.file.FileError: File Error: 404
>
> In the text file urls/urls.txt I configured the crawl to start in the
> directory file:///cygdrive/c/Temp.
>
> The folder exists, but Nutch doesn't seem to find it; I expected the
> crawler to find the files inside it and fetch them, but it doesn't.

This is not a real path, but a virtual mount point under Cygwin. Java is
completely unaware of the Cygwin layer and uses the Windows filesystem
API. You should change your seed URL to this:

        file:///c:/Temp
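
With that change, the single seed line in urls/urls.txt from the earlier post
would presumably read as follows (the trailing slash, carried over from the
original seed file, is my assumption here):

        file:///c:/Temp/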


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Problem with conf files

Anton Potekhin
Hello!
I have a problem ;-(
I downloaded the latest (nightly) version of Nutch and started the
namenode, datanode, jobtracker, and tasktracker. I then created a
nutch-site.xml file in the Nutch conf folder and set the plugin.folders
property there. But when I start the injector task (bin/nutch inject ...)
I see that Nutch still tries to load plugins from the "plugins" folder.
I don't understand why this happens. Why doesn't Nutch pick up the
plugin.folders property from nutch-site.xml? I also tried setting
plugin.folders in nutch-default.xml, and it still uses the "plugins"
folder.
How can I set the folder for plugins?
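
For what it's worth, a minimal sketch of the kind of override being attempted,
assuming /path/to/my/plugins is just a placeholder for the actual plugin
directory:

<?xml version="1.0"?>
<configuration>
 <property>
  <name>plugin.folders</name>
  <!-- placeholder path; point this at the directory that contains the plugin subfolders -->
  <value>/path/to/my/plugins</value>
 </property>
</configuration>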