Help needed - how to import local files into Nutch 0.8?


Carl Dorestos
I need to index hundreds of GBs of documents that I already have on a
local filesystem at my site. I need the content and the index to be
distributed on dfs for distributed search.

What is the best way to import these files (all HTML docs) into Nutch
0.8 using dfs and mapred?

I tried putting the files on an HTTP server at my site, then crawling the
files from my dfs/mapred Nutch cluster.
- The servers are connected by 1 Gbit/s Ethernet, but I could only get crawl
bandwidth of 200 kb/s.
- It is not a CPU utilization issue. I checked the CPU utilization on the
slaves, and it was low as expected (5%-10%).
- The crawl doesn't go through a firewall.
- The crawl-urlfilter.txt file is very simple, with only a few lines.
- Is it a politeness issue? If so, how do I override the politeness settings?

I'd appreciate your help.

Carl

Re: Help needed - how to import local files into Nutch 0.8?

sudhendra seshachala
Please refer to http://www.mail-archive.com/nutch-user@.../msg04056.html

I hope you find it useful. Just follow every instruction there.

Let me know if you need anything else.

Thanks,
Sudhi

  Sudhi Seshachala
  http://sudhilogs.blogspot.com/
   


               

Re: Help needed - how to import local files into Nutch 0.8?

Doug Cutting
Carl Dorestos wrote:
> - Is it a politeness issue? If so how to override the politeness settings?

To disable politeness, you would change fetcher.server.delay to zero and
fetcher.threads.per.host to something larger than fetcher.threads.fetch.

Doug
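
As a rough illustration of the settings Doug names, the overrides might look
like the fragment below. This is a sketch, not a tested configuration: it
assumes the properties are overridden in conf/nutch-site.xml, inside that
file's existing root element, and the thread counts (10 and 20) are
placeholder values chosen only so that fetcher.threads.per.host is larger
than fetcher.threads.fetch.

```xml
<!-- Assumed location: conf/nutch-site.xml, inside the existing root element. -->

<property>
  <name>fetcher.server.delay</name>
  <!-- No wait between successive requests to the same server. -->
  <value>0</value>
</property>

<property>
  <name>fetcher.threads.fetch</name>
  <!-- Placeholder value: total number of fetcher threads. -->
  <value>10</value>
</property>

<property>
  <name>fetcher.threads.per.host</name>
  <!-- Placeholder value: must exceed fetcher.threads.fetch so the
       per-host limit never throttles the fetcher threads. -->
  <value>20</value>
</property>
```

Because all of the files in Carl's test sit on a single HTTP server, any
per-host limit applies to the whole crawl, which would be consistent with the
low bandwidth he reports. Settings like these are only appropriate against a
server you control.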