problem in crawling


problem in crawling

riyal
Hi,

I'm using Nutch 0.9 on Ubuntu on a single machine in pseudo-distributed mode.
When I execute the following command

bin/nutch crawl urls -dir crawled -depth 10

this is what I get in the Hadoop log:

2008-08-03 03:10:17,392 INFO  crawl.Crawl - crawl started in: crawled
2008-08-03 03:10:17,392 INFO  crawl.Crawl - rootUrlDir = urls
2008-08-03 03:10:17,392 INFO  crawl.Crawl - threads = 10
2008-08-03 03:10:17,392 INFO  crawl.Crawl - depth = 10
2008-08-03 03:10:17,461 INFO  crawl.Injector - Injector: starting
2008-08-03 03:10:17,461 INFO  crawl.Injector - Injector: crawlDb: crawled/crawldb
2008-08-03 03:10:17,461 INFO  crawl.Injector - Injector: urlDir: urls
2008-08-03 03:10:17,461 INFO  crawl.Injector - Injector: Converting injected urls to crawl db entries.
2008-08-03 03:10:35,227 INFO  crawl.Injector - Injector: Merging injected urls into crawl db.
2008-08-03 03:10:59,724 INFO  crawl.Injector - Injector: done
2008-08-03 03:11:00,791 INFO  crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
2008-08-03 03:11:00,792 INFO  crawl.Generator - Generator: starting
2008-08-03 03:11:00,792 INFO  crawl.Generator - Generator: segment: crawled/segments/20080803031100
2008-08-03 03:11:00,792 INFO  crawl.Generator - Generator: filtering: false
2008-08-03 03:11:00,792 INFO  crawl.Generator - Generator: topN: 2147483647
2008-08-03 03:11:24,239 INFO  crawl.Generator - Generator: Partitioning selected urls by host, for politeness.
2008-08-03 03:11:47,583 INFO  crawl.Generator - Generator: done.
2008-08-03 03:11:47,583 INFO  fetcher.Fetcher - Fetcher: starting
2008-08-03 03:11:47,583 INFO  fetcher.Fetcher - Fetcher: segment: crawled/segments/20080803031100
2008-08-03 03:12:36,915 INFO  fetcher.Fetcher - Fetcher: done
2008-08-03 03:12:36,951 INFO  crawl.CrawlDb - CrawlDb update: starting
2008-08-03 03:12:36,952 INFO  crawl.CrawlDb - CrawlDb update: db: crawled/crawldb
2008-08-03 03:12:36,952 INFO  crawl.CrawlDb - CrawlDb update: segments: [crawled/segments/20080803031100]
2008-08-03 03:12:36,952 INFO  crawl.CrawlDb - CrawlDb update: additions allowed: true
2008-08-03 03:12:36,952 INFO  crawl.CrawlDb - CrawlDb update: URL normalizing: true
2008-08-03 03:12:36,952 INFO  crawl.CrawlDb - CrawlDb update: URL filtering: true
2008-08-03 03:12:36,967 INFO  crawl.CrawlDb - CrawlDb update: Merging segment data into db.
2008-08-03 03:13:20,341 INFO  crawl.CrawlDb - CrawlDb update: done
2008-08-03 03:13:21,374 INFO  crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
2008-08-03 03:13:21,374 INFO  crawl.Generator - Generator: starting
2008-08-03 03:13:21,374 INFO  crawl.Generator - Generator: segment: crawled/segments/20080803031321
2008-08-03 03:13:21,374 INFO  crawl.Generator - Generator: filtering: false
2008-08-03 03:13:21,374 INFO  crawl.Generator - Generator: topN: 2147483647
2008-08-03 03:13:39,667 INFO  crawl.Generator - Generator: Partitioning selected urls by host, for politeness.
2008-08-03 03:14:04,963 INFO  crawl.Generator - Generator: done.
2008-08-03 03:14:04,963 INFO  fetcher.Fetcher - Fetcher: starting
2008-08-03 03:14:04,963 INFO  fetcher.Fetcher - Fetcher: segment: crawled/segments/20080803031321
2008-08-03 03:21:26,809 INFO  fetcher.Fetcher - Fetcher: done
2008-08-03 03:21:26,851 INFO  crawl.CrawlDb - CrawlDb update: starting
2008-08-03 03:21:26,852 INFO  crawl.CrawlDb - CrawlDb update: db: crawled/crawldb
2008-08-03 03:21:26,852 INFO  crawl.CrawlDb - CrawlDb update: segments: [crawled/segments/20080803031321]
2008-08-03 03:21:26,852 INFO  crawl.CrawlDb - CrawlDb update: additions allowed: true
2008-08-03 03:21:26,852 INFO  crawl.CrawlDb - CrawlDb update: URL normalizing: true
2008-08-03 03:21:26,852 INFO  crawl.CrawlDb - CrawlDb update: URL filtering: true
2008-08-03 03:21:26,866 INFO  crawl.CrawlDb - CrawlDb update: Merging segment data into db.
2008-08-03 03:22:13,223 INFO  crawl.CrawlDb - CrawlDb update: done
2008-08-03 03:22:14,251 INFO  crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
2008-08-03 03:22:14,252 INFO  crawl.Generator - Generator: starting
2008-08-03 03:22:14,252 INFO  crawl.Generator - Generator: segment: crawled/segments/20080803032214
2008-08-03 03:22:14,252 INFO  crawl.Generator - Generator: filtering: false
2008-08-03 03:22:14,252 INFO  crawl.Generator - Generator: topN: 2147483647
2008-08-03 03:22:34,459 INFO  crawl.Generator - Generator: Partitioning selected urls by host, for politeness.
2008-08-03 03:22:59,733 INFO  crawl.Generator - Generator: done.
2008-08-03 03:22:59,734 INFO  fetcher.Fetcher - Fetcher: starting
2008-08-03 03:22:59,734 INFO  fetcher.Fetcher - Fetcher: segment: crawled/segments/20080803032214
2008-08-03 04:24:53,193 INFO  fetcher.Fetcher - Fetcher: done

Here is what I found when executing these commands:
bin/hadoop dfs -ls
Found 2 items
/user/nutch/crawled     <dir>
/user/nutch/urls        <dir>
$ bin/hadoop dfs -ls crawled
Found 2 items
/user/nutch/crawled/crawldb     <dir>
/user/nutch/crawled/segments    <dir>

Where are the linkdb, indexes, and index directories? Please tell me what the error might be.
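
For reference, the crawl command normally creates linkdb, indexes, and index only after the whole generate/fetch/update loop has finished, so if the crawl stops early they will be missing. They can also be built by hand from the existing crawldb and segments; the following is only a rough sketch based on the Nutch 0.9 command-line tools, with directory names taken from the listing above:

# rough sketch: build the link database from all fetched segments
bin/nutch invertlinks crawled/linkdb -dir crawled/segments

# index a segment against the crawldb and linkdb (repeat for each segment)
bin/nutch index crawled/indexes crawled/crawldb crawled/linkdb crawled/segments/20080803031100

# optionally remove duplicates and merge the part indexes into a single index
bin/nutch dedup crawled/indexes
bin/nutch merge crawled/index crawled/indexes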

Here is my hadoop-site.xml:

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
  <name>fs.default.name</name>
  <value>sysmonitor:9000</value>
  <description>
    The name of the default file system. Either the literal string
    "local" or a host:port for NDFS.
  </description>
</property>
<property>
  <name>mapred.job.tracker</name>
  <value>sysmonitor:9001</value>
  <description>
    The host and port that the MapReduce job tracker runs at. If
    "local", then jobs are run in-process as a single map and
    reduce task.
  </description>
</property>
<property>
  <name>mapred.tasktracker.tasks.maximum</name>
  <value>2</value>
  <description>
    The maximum number of tasks that will be run simultaneously by
    a task tracker. This should be adjusted according to the heap size
    per task, the amount of RAM available, and CPU consumption of each task.
  </description>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx200m</value>
  <description>
    You can specify other Java options for each map or reduce task here,
    but most likely you will want to adjust the heap size.
  </description>
</property>
<property>
  <name>dfs.name.dir</name>
  <value>/nutch/filesystem/name</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/nutch/filesystem/data</value>
</property>

<property>
  <name>mapred.system.dir</name>
  <value>/nutch/filesystem/mapreduce/system</value>
</property>
<property>
  <name>mapred.local.dir</name>
  <value>/nutch/filesystem/mapreduce/local</value>
</property>

<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
</configuration>


My urls/urllist.txt contains almost 100 seed URLs and the depth is 10, but it seems very little crawling was done.


regards
--monirul


     

Re: problem in crawling

Alexander Aristov
Hi

What is in your crawl-urlfilter.txt file?

Did you include your URLs in the filter? By default, all URLs are excluded.

Alexander





--
Best Regards
Alexander Aristov

Re: problem in crawling

riyal
In reply to this post by riyal
Hi,

Thanks for your reply. In my crawl-urlfilter.txt I included the following line, as I want to crawl Wikipedia:

+^http://([a-z0-9]*\.)*wikipedia.org/

My urls/urllist.txt contains Wikipedia URLs like the one below:

http://en.wikipedia.org/

I used Nutch 0.9 previously on Fedora 8 and it worked fine.

Please tell me if you have any idea.

best regards,

--monirul





Re: problem in crawling

Tristan Buckner
Are your URLs of the form http://en.wikipedia.org/w/wiki.phtml?title=_&curid=foo ? If so, Wikipedia's robots.txt excludes these.

Also, is there a line above that one in the filter that the URLs are failing on?
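
For context, the rules in crawl-urlfilter.txt are applied from top to bottom and the first match wins, so a skip rule higher up can reject a URL before the Wikipedia line is ever reached. Below is a shortened sketch of the relevant rules from the stock Nutch 0.9 file (your copy may differ slightly); note the -[?*!@=] rule, which drops any URL containing typical query characters:

# abridged sketch of the stock crawl-urlfilter.txt rules
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# accept hosts in MY.DOMAIN.NAME (the line usually edited for the target site)
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
# skip everything else
-.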


Re: problem in crawling

riyal
In reply to this post by riyal

Hi,

The only change I made in crawl-urlfilter.txt is to add the line

+^http://([a-z0-9]*\.)*wikipedia.org/

I also commented out the previous line, like this:

#+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

I also tried many other URLs, but each time I got the same type of result.

Another important thing: I am trying Nutch on Ubuntu now, which is showing the problem, but when I used it on Fedora Core 8 it worked fine.

I was previously trying pseudo-distributed mode, but after running into the problem I tried stand-alone mode yesterday and it returned the same type of result.

When I look at hadoop.log, it indicates that lots of pages were being fetched, with many errors: fatal errors regarding http.robots.agents, parser not found, java.net.SocketTimeoutException, and so on.
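
For reference, the relevant lines can be pulled out of the log with something like the following (the log path and the grep patterns are only examples, matching the messages mentioned above):

# rough sketch: show the error lines mentioned above from the default Nutch log location
grep -E "robots|parse|SocketTimeout|FATAL" logs/hadoop.log | less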

Please tell me where I am going wrong.

regards,
--monirul
 




Re: problem in crawling

Alexander Aristov
Do you have a proxy in your network?
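
If so, Nutch can usually be pointed at it through the HTTP proxy properties in conf/nutch-site.xml; a rough sketch (the host and port below are placeholders):

<property>
  <!-- placeholder proxy host, replace with your own -->
  <name>http.proxy.host</name>
  <value>proxy.example.com</value>
</property>
<property>
  <!-- placeholder proxy port -->
  <name>http.proxy.port</name>
  <value>8080</value>
</property>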




--
Best Regards
Alexander Aristov

Re: problem in crawling

brainstorm-2-2
> fatal error regarding http.robots.agents

You should check and properly configure the following properties in
nutch-site.xml, for example as sketched below the list:

  <name>http.max.delays</name>
  <name>http.robots.agents</name>
  <name>http.agent.name</name>
  <name>http.agent.description</name>
  <name>http.agent.url</name>
  <name>http.agent.email</name>
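
A rough sketch of how these might look in nutch-site.xml (every value below is a placeholder, not a recommendation):

<!-- all values below are placeholders; use your own crawler name and contact details -->
<property>
  <name>http.agent.name</name>
  <value>MyTestCrawler</value>
</property>
<property>
  <name>http.robots.agents</name>
  <value>MyTestCrawler,*</value>
</property>
<property>
  <name>http.agent.description</name>
  <value>Test crawl with Nutch 0.9</value>
</property>
<property>
  <name>http.agent.url</name>
  <value>http://example.com/crawler.html</value>
</property>
<property>
  <name>http.agent.email</name>
  <value>crawler@example.com</value>
</property>
<property>
  <name>http.max.delays</name>
  <value>100</value>
</property>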


On Tue, Aug 5, 2008 at 8:56 AM, Alexander Aristov
<[hidden email]> wrote:

> Do you have proxy in your network?
>
> 2008/8/5 Mohammad Monirul Hoque <[hidden email]>
>
>>
>> Hi,
>>
>> What i only modify in crawl-urlfilter.txt is to add the line
>>
>> +^http://([a-z0-9]*\.)*wikipedia.org/
>>
>> I also commented out the previous line like the following:
>>
>> #+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
>>
>> I also tried many other  urls  but  each time  it  returned  same  type  of
>>  result.
>>
>> Another imp things : I am trying nutch on ubuntu now which is showing
>> problem but when i used it in fedora core 8 it just worked fine.
>>
>> I was trying previously on pseudo-distributed  mode  but  after having
>> problem i tried yesterday  in stand-alone mode it returned same type of
>> result.
>>
>> When i see the hadoop.log it indicates that lots of pages were being
>> fetched  with  lots of  error,  fatal error  regarding  http.robots.agents,
>> parser not found, java.net.SocketTimeOut exection etc.
>>
>> Pls tell me where i m wrong.
>>
>> regards,
>> --monirul
>>
>>
>>
>>
>> ----- Original Message ----
>> From: Tristan Buckner <[hidden email]>
>> To: [hidden email]
>> Sent: Tuesday, August 5, 2008 12:46:21 AM
>> Subject: Re: problem in crawling
>>
>> Are your urls of the form
>> http://en.wikipedia.org/w/wiki.phtml?title=_&curid=foo
>>  ?  If it does the robots file excludes these.
>>
>> Also is there a line above that line for which the urls fail?
>>
>> On Aug 4, 2008, at 11:37 AM, Mohammad Monirul Hoque wrote:
>>
>> > Hi,
>> >
>> > Thanks for ur reply. In my crawl-urlfilter.txt i included the
>> > following line
>> >
>> > +^http://([a-z0-9]*\.)*wikipedia.org/  as i want to crawl wiki.
>> >
>> > My urls/urllist.txt contains urls of wikipedia like below:
>> >
>> > http://en.wikipedia.org/
>> >
>> > I used nutch 0.9 previously in fedora 8.It worked fine.
>> >
>> > So pls tell me if u have any idea.
>> >
>> > best regards,
>> >
>> > --monirul
>> >
>> >
>> >
>> >
>> > ----- Original Message ----
>> > From: Alexander Aristov <[hidden email]>
>> > To: [hidden email]
>> > Sent: Monday, August 4, 2008 1:28:58 PM
>> > Subject: Re: problem in crawling
>> >
>> > Hi
>> >
>> > what is in your crawl -urlfilter.txt file?
>> >
>> > Did you include your URLs in the filter? By default all urls are
>> > excluded.
>> >
>> > Alexander
>> >
>> > --
>> > Best Regards
>> > Alexander Aristov
>> >
>> >
>> >
>>
>>
>>
>>
>
>
>
> --
> Best Regards
> Alexander Aristov
>
Reply | Threaded
Open this post in threaded view
|

Re: problem in crawling

Alexander Aristov
Set any name. Read the Nutch manual for more information.

Alex
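
For reference, a nutch-site.xml that sets the agent-related properties from the list quoted below could look roughly like this sketch; every value shown is an invented placeholder, not a setting taken from this thread:

<configuration>
<property>
<name>http.agent.name</name>
<value>MyNutchSpider</value>
<description>
  Placeholder crawler name. Leaving this empty is a common cause of the
  fatal error regarding http.robots.agents mentioned in this thread.
</description>
</property>
<property>
<name>http.robots.agents</name>
<value>MyNutchSpider,*</value>
<description>
  Comma-separated agent strings checked against robots.txt; conventionally
  this starts with the http.agent.name value and keeps * at the end.
</description>
</property>
<property>
<name>http.agent.description</name>
<value>experimental wikipedia crawl</value>
</property>
<property>
<name>http.agent.url</name>
<value>http://example.org/crawler.html</value>
</property>
<property>
<name>http.agent.email</name>
<value>crawler at example dot org</value>
</property>
</configuration>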

2008/8/5 brainstorm <[hidden email]>

> fatal error  regarding  http.robots.agents
>
> You should check or configure the following properties in
> nutch-site.xml:
>
>  <name>http.max.delays</name>
>  <name>http.robots.agents</name>
>  <name>http.agent.name</name>
>  <name>http.agent.description</name>
>  <name>http.agent.url</name>
>  <name>http.agent.email</name>
>
>
> On Tue, Aug 5, 2008 at 8:56 AM, Alexander Aristov
> <[hidden email]> wrote:
> > Do you have proxy in your network?
> >
> > 2008/8/5 Mohammad Monirul Hoque <[hidden email]>
> >
> >>
> >> Hi,
> >>
> >> The only thing I modified in crawl-urlfilter.txt was to add the line
> >>
> >> +^http://([a-z0-9]*\.)*wikipedia.org/
> >>
> >> I also commented out the previous line like the following:
> >>
> >> #+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
> >>
> >> I also tried many other urls, but each time it returned the same type of
> >> result.
> >>
> >> Another important thing: I am trying nutch on ubuntu now, which is showing
> >> the problem, but when I used it in fedora core 8 it just worked fine.
> >>
> >> I was previously trying pseudo-distributed mode, but after having this
> >> problem I tried stand-alone mode yesterday; it returned the same type of
> >> result.
> >>
> >> When I look at hadoop.log it indicates that lots of pages were being
> >> fetched with lots of errors: a fatal error regarding http.robots.agents,
> >> parser not found, java.net.SocketTimeoutException, etc.
> >>
> >> Please tell me where I am wrong.
> >>
> >> regards,
> >> --monirul
> >>
> >>
> >>
> >>
> >> ----- Original Message ----
> >> From: Tristan Buckner <[hidden email]>
> >> To: [hidden email]
> >> Sent: Tuesday, August 5, 2008 12:46:21 AM
> >> Subject: Re: problem in crawling
> >>
> >> Are your urls of the form
> >> http://en.wikipedia.org/w/wiki.phtml?title=_&curid=foo
> >> ? If so, the robots file excludes these.
> >>
> >> Also is there a line above that line for which the urls fail?
> >>
> >> On Aug 4, 2008, at 11:37 AM, Mohammad Monirul Hoque wrote:
> >>
> >> > Hi,
> >> >
> >> > Thanks for your reply. In my crawl-urlfilter.txt I included the
> >> > following line
> >> >
> >> > +^http://([a-z0-9]*\.)*wikipedia.org/  as I want to crawl wiki.
> >> >
> >> > My urls/urllist.txt contains urls of wikipedia like the one below:
> >> >
> >> > http://en.wikipedia.org/
> >> >
> >> > I used nutch 0.9 previously in fedora 8. It worked fine.
> >> >
> >> > So please tell me if you have any idea.
> >> >
> >> > best regards,
> >> >
> >> > --monirul
> >> >
> >> >
> >> >
> >> >
> >> > ----- Original Message ----
> >> > From: Alexander Aristov <[hidden email]>
> >> > To: [hidden email]
> >> > Sent: Monday, August 4, 2008 1:28:58 PM
> >> > Subject: Re: problem in crawling
> >> >
> >> > Hi
> >> >
> >> > What is in your crawl-urlfilter.txt file?
> >> >
> >> > Did you include your URLs in the filter? By default all urls are
> >> > excluded.
> >> >
> >> > Alexander
> >> >
> >> > --
> >> > Best Regards
> >> > Alexander Aristov
> >> >
> >> >
> >> >
> >>
> >>
> >>
> >>
> >
> >
> >
> > --
> > Best Regards
> > Alexander Aristov
> >
>



--
Best Regards
Alexander Aristov
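
For completeness, the crawl-urlfilter.txt change Mohammad describes above amounts to something like the following sketch. Only the touched lines are shown; the rest of the stock filter file stays as shipped, and the stock file normally ends with a catch-all "-." rule, so the wikipedia line has to sit above it:

# accept hosts in wikipedia.org
+^http://([a-z0-9]*\.)*wikipedia.org/

# stock local-domain rule, commented out as described above
#+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

As for the proxy question and the java.net.SocketTimeoutException errors, the relevant properties can also be overridden in nutch-site.xml. A sketch with invented placeholder values, only needed if the machine really does sit behind a proxy or needs a longer fetch timeout:

<property>
<name>http.proxy.host</name>
<value>proxy.example.org</value>
</property>
<property>
<name>http.proxy.port</name>
<value>8080</value>
</property>
<property>
<name>http.timeout</name>
<value>20000</value>
<description>Placeholder HTTP timeout in milliseconds.</description>
</property>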