
Please help - Nutch fetch command not fetching data


Please help - Nutch fetch command not fetching data

apachenutch
Hi all,

I recently configured Nutch (GORA branch) on my Cassandra DB. My colleague referred me to the link below, which is awesome.
http://sujitpal.blogspot.in/2012/01/exploring-nutch-gora-with-cassandra.html

I followed the steps in the blog as-is. The problem I am having is that the first time, everything goes well - inject, generate, fetch, and parse. But when I iterate, nutch fetch does not fetch any data. As a result, my Solr index only has 10 records (from the first successful run, of course) and is not picking up data from the subsequent runs.

Results from my nutch fetch (After iterating)-

andrew@andrew-ubuntu:~/nutch/branch/runtime/local$ bin/nutch fetch 1329855266-1107256220
FetcherJob: starting
FetcherJob : timelimit set for : -1
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob: batchId: 1329855266-1107256220
Using queue mode : byHost
Fetcher: threads: 10
QueueFeeder finished: total 0 records. Hit by time limit :0
-finishing thread FetcherThread0, activeThreads=0
-finishing thread FetcherThread1, activeThreads=0
-finishing thread FetcherThread2, activeThreads=0
-finishing thread FetcherThread3, activeThreads=0
-finishing thread FetcherThread4, activeThreads=0
-finishing thread FetcherThread5, activeThreads=0
-finishing thread FetcherThread6, activeThreads=0
-finishing thread FetcherThread7, activeThreads=0
-finishing thread FetcherThread8, activeThreads=0
-finishing thread FetcherThread9, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues= 0, fetchQueues.totalSize=0
-activeThreads=0
FetcherJob: done

*************************************
vs the author of the above blog -

sujit@cyclone:local$ bin/nutch fetch 1325709400-776802111
FetcherJob: starting
FetcherJob : timelimit set for : -1
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob: batchId: 1325709400-776802111
Using queue mode : byHost
Fetcher: threads: 10
fetching http://www.parathyroid.com/parathyroid.htm
QueueFeeder finished: total 47 records. Hit by time limit :0
-activeThreads=10, spinWaiting=9, fetchQueues= 1, fetchQueues.totalSize=46
fetching http://www.parathyroid.com/Parathyroid-Surgeon.htm
-activeThreads=10, spinWaiting=10, fetchQueues= 1, fetchQueues.totalSize=45
fetching http://www.parathyroid.com/paratiroide/index.html
fetching http://www.parathyroid.com/diagnosis.htm
-activeThreads=10, spinWaiting=10, fetchQueues= 1, fetchQueues.totalSize=43
fetching http://www.parathyroid.com/parathyroid-adenoma.htm
fetching http://www.parathyroid.com/age.htm
-activeThreads=10, spinWaiting=10, fetchQueues= 1, fetchQueues.totalSize=41
fetching http://www.parathyroid.com/FHH.htm
-activeThreads=10, spinWaiting=9, fetchQueues= 1, fetchQueues.totalSize=40
fetching http://www.parathyroid.com/treatment-surgery.htm
-activeThreads=10, spinWaiting=10, fetchQueues= 1, fetchQueues.totalSize=39
fetching http://www.parathyroid.com/who's_eligible.htm
fetching http://www.parathyroid.com/parathyroid-disease.htm
-activeThreads=10, spinWaiting=10, fetchQueues= 1, fetchQueues.totalSize=37
fetching http://www.parathyroid.com/FAQ.htm
fetching http://www.parathyroid.com/finding-parathyroid.htm
-activeThreads=10, spinWaiting=10, fetchQueues= 1, fetchQueues.totalSize=35
fetching http://www.parathyroid.com/hyperparathyroidism-diagnosis.htm
-activeThreads=10, spinWaiting=9, fetchQueues= 1, fetchQueues.totalSize=34
fetching http://www.parathyroid.com/index.htm
-activeThreads=10, spinWaiting=10, fetchQueues= 1, fetchQueues.totalSize=33
fetching http://www.parathyroid.com/parathyroid-pictures.htm
fetching http://www.parathyroid.com/Parathyroid-Surgeon-Map.htm
-activeThreads=10, spinWaiting=10, fetchQueues= 1, fetchQueues.totalSize=31
fetching http://www.parathyroid.com/mini-surgery.htm
fetching http://www.parathyroid.com/about-Parathyroid.htm
-activeThreads=10, spinWaiting=10, fetchQueues= 1, fetchQueues.totalSize=29
fetching http://www.parathyroid.com/disclaimer.htm
-activeThreads=10, spinWaiting=9, fetchQueues= 1, fetchQueues.totalSize=28
fetching http://www.parathyroid.com/parathyroid-function.htm
-activeThreads=10, spinWaiting=10, fetchQueues= 1, fetchQueues.totalSize=27
fetching http://www.parathyroid.com/paratiroide
fetching http://www.parathyroid.com/low-vitamin-d.htm
-activeThreads=10, spinWaiting=10, fetchQueues= 1, fetchQueues.totalSize=25
fetching http://www.parathyroid.com/parathyroid-symptoms-cartoon.htm
fetching http://www.parathyroid.com/sestamibi.htm
-activeThreads=10, spinWaiting=9, fetchQueues= 1, fetchQueues.totalSize=23
fetching http://www.parathyroid.com/osteoporosis.htm
-activeThreads=10, spinWaiting=9, fetchQueues= 1, fetchQueues.totalSize=22
-activeThreads=10, spinWaiting=10, fetchQueues= 1, fetchQueues.totalSize=22
fetching http://www.parathyroid.com/surgery_cure_rates.htm
fetching http://www.parathyroid.com/low-calcium.htm
-activeThreads=10, spinWaiting=10, fetchQueues= 1, fetchQueues.totalSize=20
fetching http://www.parathyroid.com/Sensipar-high-calcium.htm
fetching http://www.parathyroid.com/Dr.Norman.htm
-activeThreads=10, spinWaiting=10, fetchQueues= 1, fetchQueues.totalSize=18
fetching http://www.parathyroid.com/parathyroid-anatomy.htm
fetching http://www.parathyroid.com/parathyroid-surgery.htm
-activeThreads=10, spinWaiting=9, fetchQueues= 1, fetchQueues.totalSize=16
-activeThreads=10, spinWaiting=10, fetchQueues= 1, fetchQueues.totalSize=16
fetching http://www.parathyroid.com/hypoparathyroidism.htm
fetching http://www.parathyroid.com/endocrinology.htm
-activeThreads=10, spinWaiting=10, fetchQueues= 1, fetchQueues.totalSize=14
fetching http://www.parathyroid.com/parathyroid-cancer.htm
fetching http://www.parathyroid.com/testimonials.htm
-activeThreads=10, spinWaiting=10, fetchQueues= 1, fetchQueues.totalSize=12
fetching http://www.parathyroid.com/hyperparathyroidism-videos.htm
fetching http://www.parathyroid.com/high-calcium.htm
-activeThreads=10, spinWaiting=9, fetchQueues= 1, fetchQueues.totalSize=10
-activeThreads=10, spinWaiting=10, fetchQueues= 1, fetchQueues.totalSize=10
fetching http://www.parathyroid.com/osteoporosis2.htm
fetching http://www.parathyroid.com/MEN-Syndrome.htm
-activeThreads=10, spinWaiting=10, fetchQueues= 1, fetchQueues.totalSize=8
fetching http://www.parathyroid.com/causes.htm
fetching http://www.parathyroid.com/MIRP-Surgery.htm
-activeThreads=10, spinWaiting=10, fetchQueues= 1, fetchQueues.totalSize=6
fetching http://www.parathyroid.com/Re-Operation.htm
fetching http://www.parathyroid.com/pregnancy.htm
-activeThreads=10, spinWaiting=9, fetchQueues= 1, fetchQueues.totalSize=4
* queue: http://www.parathyroid.com


I am thinking a depth needs to be specified somewhere? If yes, where?
I followed all the steps in the blog and don't see a single error in my log file. My seed list directory is in /home/andrew/nutch/:

andrew@andrew-ubuntu:~/nutch$ pwd
/home/andrew/nutch/
andrew@andrew-ubuntu:~/nutch$ ls -ltr
total 20
drwxrwxr-x  5 pooja pooja 4096 2012-02-19 19:38 workspace
drwxrwxr-x  3 pooja pooja 4096 2012-02-19 21:23 install
drwxrwxr-x 13 pooja pooja 4096 2012-02-20 08:06 gora
drwxrwxr-x  9 pooja pooja 4096 2012-02-20 09:21 branch
drwxrwxr-x  2 pooja pooja 4096 2012-02-21 12:05 web_seeds

andrew@andrew-ubuntu:~/nutch$ cd web_seeds/

andrew@andrew-ubuntu:~/nutch/web_seeds$ ls -ltr
total 4
-rwxr-xr-x 1 andrew andrew 19 2012-02-21 11:03 nutch.txt

andrew@andrew-ubuntu:~/nutch/web_seeds$ cat *
http://www.cnn.com

For your reference, I have also pasted below the nutch inject, generate, fetch, and parse from my first run.

andrew@andrew-ubuntu:~/nutch/branch/runtime/local$ bin/nutch inject /home/andrew/nutch/web_seeds
InjectorJob: starting
InjectorJob: urlDir: /home/andrew/nutch/web_seeds
InjectorJob: finished
andrew@andrew-ubuntu:~/nutch/branch/runtime/local$ bin/nutch generate
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: true
GeneratorJob: done
GeneratorJob: generated batch id: 1329855121-1496717092

andrew@andrew-ubuntu:~/nutch/branch/runtime/local$ bin/nutch fetch 1329855121-1496717092
FetcherJob: starting
FetcherJob : timelimit set for : -1
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob: batchId: 1329855121-1496717092
Using queue mode : byHost
Fetcher: threads: 10
QueueFeeder finished: total 1 records. Hit by time limit :0
fetching http://www.cnn.com/
-finishing thread FetcherThread1, activeThreads=1
-finishing thread FetcherThread3, activeThreads=1
-finishing thread FetcherThread2, activeThreads=1
-finishing thread FetcherThread4, activeThreads=1
-finishing thread FetcherThread5, activeThreads=1
-finishing thread FetcherThread6, activeThreads=1
-finishing thread FetcherThread7, activeThreads=1
-finishing thread FetcherThread8, activeThreads=1
-finishing thread FetcherThread9, activeThreads=1
-finishing thread FetcherThread0, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues= 0, fetchQueues.totalSize=0
-activeThreads=0
FetcherJob: done

andrew@andrew-ubuntu:~/nutch/branch/runtime/local$ bin/nutch parse 1329855121-1496717092
ParserJob: starting
ParserJob: resuming: false
ParserJob: forced reparse: false
ParserJob: batchId: 1329855121-1496717092
Parsing http://www.cnn.com/
ParserJob: success

andrew@andrew-ubuntu:~/nutch/branch/runtime/local$ bin/nutch updatedb
DbUpdaterJob: starting
DbUpdaterJob: done



Re: Please help - Nutch fetch command not fetching data

apachenutch
Any suggestions?

Re: Please help - Nutch fetch command not fetching data

Sujit Pal
In reply to this post by apachenutch
Hi apachenutch,

I am the author of the blog post...thanks for the kind words...

Did you miss the updatedb by any chance? This takes the outlinks from the parsed pages and adds them back to the fetch list so generate can then make these available for fetching...

So initial cycle: inject, generate, fetch, parse, updatedb
next cycle: generate, fetch, parse, updatedb
...
finally: solrindex

-sujit


Re: Please help - Nutch fetch command not fetching data

apachenutch
updatedb was done after inject, generate, fetch, and parse.
I tried iterating after doing the update.

Re: Please help - Nutch fetch command not fetching data

Sujit Pal
Hi apachenutch,

Something of a wild guess here. Given that you are using the same seed file as I am, I would have expected to see a single URL in the index at the end of the first iteration, not 10. The only time I have observed similar behavior was when the fetcher truncated the file because of the http.content.limit setting; you may want to set it to -1 and see if the problem gets fixed.

You can verify whether this is needed by looking at the cnt column for the seed URL and checking whether the stored contents of the page match what you get from a view-source of the seed URL page in your browser.
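As a quick cross-check (a sketch only, assuming curl is available and using the seed URL from this thread), you can compare the size of the live page against http.content.limit; if the page is larger than the limit (65536 bytes by default), the stored content would have been truncated:

```shell
# Byte size of the live page; compare this against http.content.limit
# (default 65536). If the number here exceeds the limit, the fetcher
# would have truncated the stored content.
curl -s http://www.cnn.com/ | wc -c
```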

Also, to answer your original question: the depth is the iteration number. Each iteration goes one level deeper, because you are putting the outlinks from the previous cycle back into the fetch list and fetching/parsing them. You can of course script it and specify a depth parameter that controls the number of iterations...
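A minimal driver script along those lines (a sketch only; the DEPTH value, the seed directory path, and the solrindex arguments are assumptions - check the usage output of your build, and run it from runtime/local) could look like:

```shell
#!/bin/sh
# Hypothetical crawl driver for the Nutch 2.x (nutchgora) command cycle:
# DEPTH is the number of generate/fetch/parse/updatedb iterations, i.e. the depth.
SEED_DIR=/home/andrew/nutch/web_seeds   # assumed seed directory
DEPTH=3                                 # assumed depth

bin/nutch inject "$SEED_DIR"
i=1
while [ "$i" -le "$DEPTH" ]; do
  # generate prints "GeneratorJob: generated batch id: <id>"; capture the id
  # so fetch and parse operate on the batch that was just generated.
  BATCH_ID=$(bin/nutch generate | sed -n 's/.*generated batch id: //p')
  bin/nutch fetch "$BATCH_ID"
  bin/nutch parse "$BATCH_ID"
  bin/nutch updatedb
  i=$((i + 1))
done
# Finally push everything to Solr (URL and flag are assumptions; verify
# against "bin/nutch solrindex" usage on your version).
bin/nutch solrindex http://localhost:8983/solr -reindex
```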

-sujit



Re: Please help - Nutch fetch command not fetching data

apachenutch
Thank you. I changed the value, but no luck. (Changed in runtime/local/conf/nutch-default.xml.)

<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the http
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
</property>
Output --------------------


andrew@andrew-ubuntu:~/nutch/branch/runtime/local$ bin/nutch inject ../../../web_seeds
InjectorJob: starting
InjectorJob: urlDir: ../../../web_seeds
InjectorJob: finished
andrew@andrew-ubuntu:~/nutch/branch/runtime/local$ bin/nutch generate
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: true
GeneratorJob: done
GeneratorJob: generated batch id: 1329930779-110515839
andrew@andrew-ubuntu:~/nutch/branch/runtime/local$ bin/nutch fetch 1329930779-110515839
FetcherJob: starting
FetcherJob : timelimit set for : -1
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob: batchId: 1329930779-110515839
Using queue mode : byHost
Fetcher: threads: 10
QueueFeeder finished: total 1 records. Hit by time limit :0
fetching http://www.q1a.com/
-finishing thread FetcherThread1, activeThreads=1
-finishing thread FetcherThread2, activeThreads=1
-finishing thread FetcherThread3, activeThreads=1
-finishing thread FetcherThread4, activeThreads=1
-finishing thread FetcherThread5, activeThreads=1
-finishing thread FetcherThread6, activeThreads=1
-finishing thread FetcherThread7, activeThreads=1
-finishing thread FetcherThread8, activeThreads=1
-finishing thread FetcherThread9, activeThreads=1
-activeThreads=1, spinWaiting=0, fetchQueues= 1, fetchQueues.totalSize=0
-finishing thread FetcherThread0, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues= 0, fetchQueues.totalSize=0
-activeThreads=0
FetcherJob: done
andrew@andrew-ubuntu:~/nutch/branch/runtime/local$ bin/nutch parse 1329930779-110515839
ParserJob: starting
ParserJob: resuming: false
ParserJob: forced reparse: false
ParserJob: batchId: 1329930779-110515839
Parsing http://www.q1a.com/
Skipping http://www.q1a.com/q1a; different batch id - Why does it say skipping here?
ParserJob: success
andrew@andrew-ubuntu:~/nutch/branch/runtime/local$ bin/nutch updatedb
DbUpdaterJob: starting
DbUpdaterJob: done



************************ The first iteration ****************************

andrew@andrew-ubuntu:~/nutch/branch/runtime/local$ bin/nutch generate
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: true
GeneratorJob: done
GeneratorJob: generated batch id: 1329930901-1268252438
andrew@andrew-ubuntu:~/nutch/branch/runtime/local$ bin/nutch fetch 1329930901-1268252438
FetcherJob: starting
FetcherJob : timelimit set for : -1
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob: batchId: 1329930901-1268252438
Using queue mode : byHost
Fetcher: threads: 10
QueueFeeder finished: total 1 records. Hit by time limit :0
fetching http://www.q1a.com/q1a
-finishing thread FetcherThread1, activeThreads=1
-finishing thread FetcherThread2, activeThreads=1
-finishing thread FetcherThread3, activeThreads=1
-finishing thread FetcherThread4, activeThreads=1
-finishing thread FetcherThread5, activeThreads=1
-finishing thread FetcherThread6, activeThreads=1
-finishing thread FetcherThread7, activeThreads=1
-finishing thread FetcherThread8, activeThreads=1
-finishing thread FetcherThread9, activeThreads=1
-finishing thread FetcherThread0, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues= 0, fetchQueues.totalSize=0
-activeThreads=0
FetcherJob: done


I stopped here, since it's not doing what it is supposed to. Please suggest.

Re: Please help - Nutch fetch command not fetching data

glumet
Hello, I would like to ask whether you solved the problem? I have exactly the same situation right now.

The generator generates a batchId and the fetcher is given this batchId, but the result is:

.
.
.
0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
-activeThreads=0
FetcherJob: done

and no URLs are fetched.

Re: Re: Please help - Nutch fetch command not fetching data

kkzxak47
In reply to this post by apachenutch
The file you should be modifying is not "{NUTCH_HOME}/runtime/local/conf/nutch-default.xml"; that is the configuration template file. You can copy the <property> element from it and change the value. The actual config file is "nutch-site.xml" in the same directory.
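For example, moving the override into runtime/local/conf/nutch-site.xml (a sketch; only the http.content.limit override discussed earlier in the thread is shown) would look like:

```xml
<?xml version="1.0"?>
<!-- runtime/local/conf/nutch-site.xml: properties here override nutch-default.xml -->
<configuration>
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>
</configuration>
```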



--
View this message in context: http://lucene.472066.n3.nabble.com/Please-help-Nutch-fetch-command-not-fetching-data-tp3764751p4118565.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Re: Please help - Nutch fetch command not fetching data

kkzxak47
In reply to this post by apachenutch
There are two options to solve this problem; the property descriptions are enough to explain why. I believe it is a chain reaction: the page gets truncated because of http.content.limit, the parser then skips the truncated page because of parser.skip.truncated, so updatedb never adds any outlinks and there is nothing left to fetch:
<property>
  <name>http.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content using the http
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
</property>
<property>
  <name>parser.skip.truncated</name>
  <value>true</value>
  <description>Boolean value for whether we should skip parsing for truncated documents. By default this
  property is activated due to extremely high levels of CPU which parsing can sometimes take.  
  </description>
</property>
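A sketch of the corresponding nutch-site.xml overrides (the values are suggestions; either one should break the chain):

```xml
<!-- Option 1: never truncate downloaded content -->
<property>
  <name>http.content.limit</name>
  <value>-1</value>
</property>

<!-- Option 2: keep the limit, but parse truncated documents anyway -->
<property>
  <name>parser.skip.truncated</name>
  <value>false</value>
</property>
```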




Re: Re: Please help - Nutch fetch command not fetching data

glumet
Unfortunately, it didn't solve the problem. I ran the bin/crawl script many times and everything worked fine, but suddenly something went wrong and nothing is fetched... I suspect that the generator reports it generated a batch with ID XXXXXX-XXX but no URLs were actually generated... and that could be the reason nothing is fetched.

Does anybody know how to solve it?

Re: Re: Please help - Nutch fetch command not fetching data

Bayu Widyasanyata
Hi,

Have you checked hadoop.log?

Re: Re: Please help - Nutch fetch command not fetching data

glumet
Hi,

When I look into hadoop.log, I can see

2014-02-22 16:16:19,174 INFO  mapreduce.GoraRecordReader - gora.buffer.read.limit = 10000
2014-02-22 16:16:21,200 INFO  store.HBaseStore - Keyclass and nameclass match but mismatching table names  mappingfile schema is 'webpage' vs actual schema 'webpage_webpage' , assuming they are the same.
2014-02-22 16:16:21,220 INFO  store.HBaseStore - Keyclass and nameclass match but mismatching table names  mappingfile schema is 'webpage' vs actual schema 'webpage_webpage' , assuming they are the same.
2014-02-22 16:16:21,249 INFO  store.HBaseStore - Keyclass and nameclass match but mismatching table names  mappingfile schema is 'webpage' vs actual schema 'webpage_webpage' , assuming they are the same.
2014-02-22 16:16:21,284 INFO  mapreduce.GoraRecordReader - gora.buffer.read.limit = 10000

etc. etc. ... this repeats many times. My HBase table is being walked through, and I think it cannot find the specified batch ID...

When I look into the HBase log, there is

xbouj19@ir:~$ tail -f /opt/ir/hbase/logs/hbase-root-master-ir.log
2014-02-22 16:15:20,697 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction started; Attempting to free 19.38 MB of total=164.39 MB
2014-02-22 16:15:20,699 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction completed; freed=19.43 MB, total=144.96 MB, single=86.25 MB, multi=72.5 MB, memory=4.05 MB
2014-02-22 16:15:21,257 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction started; Attempting to free 19.35 MB of total=164.37 MB
2014-02-22 16:15:21,258 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction completed; freed=19.43 MB, total=145.08 MB, single=86.22 MB, multi=72.5 MB, memory=4.05 MB
2014-02-22 16:15:21,645 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction started; Attempting to free 19.35 MB of total=164.37 MB
2014-02-22 16:15:21,646 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction completed; freed=19.38 MB, total=145.07 MB, single=86.22 MB, multi=72.5 MB, memory=4.05 MB
2014-02-22 16:15:22,251 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction started; Attempting to free 19.36 MB of total=164.37 MB
2014-02-22 16:15:22,252 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction completed; freed=19.36 MB, total=145.01 MB, single=86.23 MB, multi=72.5 MB, memory=4.05 MB
2014-02-22 16:15:22,596 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction started; Attempting to free 19.38 MB of total=164.4 MB
2014-02-22 16:15:22,598 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction completed; freed=19.42 MB, total=144.98 MB, single=86.25 MB, multi=72.5 MB, memory=4.05 MB
2014-02-22 16:15:23,694 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction started; Attempting to free 19.39 MB of total=164.41 MB
2014-02-22 16:15:23,696 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction completed; freed=19.41 MB, total=144.99 MB, single=86.26 MB, multi=72.5 MB, memory=4.05 MB
2014-02-22 16:15:24,164 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction started; Attempting to free 19.39 MB of total=164.4 MB
2014-02-22 16:15:24,165 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction completed; freed=19.49 MB, total=144.91 MB, single=86.25 MB, multi=72.5 MB, memory=4.05 MB
2014-02-22 16:15:24,432 INFO org.apache.zookeeper.server.NIOServerCnxn: Accepted socket connection from /127.0.0.1:37587
2014-02-22 16:15:24,432 INFO org.apache.zookeeper.server.NIOServerCnxn: Client attempting to establish new session at /127.0.0.1:37587
2014-02-22 16:15:24,434 INFO org.apache.zookeeper.server.NIOServerCnxn: Established session 0x1444be389493276 with negotiated timeout 40000 for client /127.0.0.1:37587
2014-02-22 16:15:24,440 INFO org.apache.zookeeper.server.PrepRequestProcessor: Processed session termination for sessionid: 0x1444be389493276
2014-02-22 16:15:24,441 INFO org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /127.0.0.1:37587 which had sessionid 0x1444be389493276
2014-02-22 16:15:24,454 INFO org.apache.zookeeper.server.NIOServerCnxn: Accepted socket connection from /127.0.0.1:37588
2014-02-22 16:15:24,454 INFO org.apache.zookeeper.server.NIOServerCnxn: Client attempting to establish new session at /127.0.0.1:37588
2014-02-22 16:15:24,455 INFO org.apache.zookeeper.server.NIOServerCnxn: Established session 0x1444be389493277 with negotiated timeout 40000 for client /127.0.0.1:37588
2014-02-22 16:15:24,461 INFO org.apache.zookeeper.server.PrepRequestProcessor: Processed session termination for sessionid: 0x1444be389493277
2014-02-22 16:15:24,462 INFO org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /127.0.0.1:37588 which had sessionid 0x1444be389493277
2014-02-22 16:15:24,474 INFO org.apache.zookeeper.server.NIOServerCnxn: Accepted socket connection from /127.0.0.1:37589
2014-02-22 16:15:24,474 INFO org.apache.zookeeper.server.NIOServerCnxn: Client attempting to establish new session at /127.0.0.1:37589
2014-02-22 16:15:24,476 INFO org.apache.zookeeper.server.NIOServerCnxn: Established session 0x1444be389493278 with negotiated timeout 40000 for client /127.0.0.1:37589
2014-02-22 16:15:24,484 INFO org.apache.zookeeper.server.PrepRequestProcessor: Processed session termination for sessionid: 0x1444be389493278
2014-02-22 16:15:24,485 INFO org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /127.0.0.1:37589 which had sessionid 0x1444be389493278
2014-02-22 16:15:24,810 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction started; Attempting to free 19.37 MB of total=164.39 MB
2014-02-22 16:15:24,812 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction completed; freed=19.39 MB, total=145 MB, single=86.24 MB, multi=72.5 MB, memory=4.05 MB
2014-02-22 16:15:25,236 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction started; Attempting to free 19.4 MB of total=164.42 MB
2014-02-22 16:15:25,237 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction completed; freed=19.43 MB, total=145.11 MB, single=86.27 MB, multi=72.5 MB, memory=4.05 MB
2014-02-22 16:15:25,680 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction started; Attempting to free 19.36 MB of total=164.37 MB
2014-02-22 16:15:25,681 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction completed; freed=19.42 MB, total=145.07 MB, single=86.35 MB, multi=72.5 MB, memory=4.05 MB
2014-02-22 16:15:25,963 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction started; Attempting to free 19.37 MB of total=164.38 MB
2014-02-22 16:15:25,965 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction completed; freed=19.37 MB, total=145.13 MB, single=86.36 MB, multi=72.5 MB, memory=4.05 MB
2014-02-22 16:15:26,310 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction started; Attempting to free 19.35 MB of total=164.36 MB
2014-02-22 16:15:26,312 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction completed; freed=19.35 MB, total=145.01 MB, single=86.22 MB, multi=72.5 MB, memory=4.05 MB

etc. etc. ... there are no ERRORs or anything else suspicious.