Nutch internals


Uroš Gruber-2

I'm making some changes to CrawlDatum, but there are some things I don't quite understand.

My idea is to add an int hop field to CrawlDatum and set it to 0 in the Injector.
Then, after fetching, the hop of each newly found url can be calculated as the parent url's hop + 1.

I'm trying to find where new urls are added to the WebDB. Could somebody
explain this to me? Here is how I understand the process:

1. Inject: urls are read from the url file, filtered through the enabled
filters, and stored in the WebDB.
2. Generate: the WebDB is read to create a list of urls to fetch.
3. Fetch: the Fetcher fetches the urls and stores the results in the segment dirs.

4. updatedb: if I understand correctly, the data from segment/*/crawl_parse
is merged with the current WebDB. If so, that per-segment db data must be
created during fetching.
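
If the merge in step 4 works as described, a hop value would have to survive it. A minimal sketch of one way that could behave, assuming the db is modelled as a plain url-to-hop map and that the smaller hop wins when a url is reached by more than one path. HopMergeSketch and merge are hypothetical names, not Nutch classes or methods:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the updatedb merge step; not Nutch's real API.
class HopMergeSketch {

    /**
     * Merges urls discovered in a segment into the existing db.
     * Assumption: when a url is already known, the smaller hop is kept,
     * since it reflects the shortest known path from an injected url.
     */
    static Map<String, Integer> merge(Map<String, Integer> webDb,
                                      Map<String, Integer> discovered) {
        Map<String, Integer> merged = new HashMap<>(webDb);
        for (Map.Entry<String, Integer> entry : discovered.entrySet()) {
            // Inserts the url if absent, otherwise keeps the minimum hop.
            merged.merge(entry.getKey(), entry.getValue(), Integer::min);
        }
        return merged;
    }
}
```

Keeping the minimum is only one possible policy; keeping the first value seen would also be consistent with the idea above.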

I think it should be possible to get the CrawlDatum of the url being fetched
while fetching, and then use its hop number to calculate the hop for all
the other urls found on the current page and store it with them.
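
The calculation itself is small. A minimal sketch, assuming the parent page's hop is available at the point where its outlinks are collected; HopPropagationSketch, INJECTED_HOP and hopsForOutlinks are hypothetical names used only for illustration, not part of Nutch:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of hop propagation; not Nutch's real API.
class HopPropagationSketch {

    // Hop value assigned to urls that come straight from the Injector.
    static final int INJECTED_HOP = 0;

    /**
     * Assigns every url found on a page the parent's hop + 1.
     * The result maps each outlink url to its hop, in discovery order.
     */
    static Map<String, Integer> hopsForOutlinks(int parentHop, List<String> outlinks) {
        Map<String, Integer> hops = new LinkedHashMap<>();
        for (String url : outlinks) {
            hops.put(url, parentHop + 1);
        }
        return hops;
    }
}
```

For an injected page, hopsForOutlinks(HopPropagationSketch.INJECTED_HOP, outlinks) gives every outlink a hop of 1.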

Maybe I've missed the whole concept here.

After that I can use this hop number to limit which urls go into the generated fetch lists.
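
A minimal sketch of such a limit, assuming the db can be viewed as a url-to-hop map and that a maximum hop is passed in when generating. HopLimitSketch, selectForFetch and maxHop are hypothetical names, not Nutch classes or options:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of a hop limit at generate time; not Nutch's real API.
class HopLimitSketch {

    /**
     * Returns the urls whose hop does not exceed maxHop.
     * Urls further away from the injected urls are left out of the fetch list.
     */
    static List<String> selectForFetch(Map<String, Integer> webDb, int maxHop) {
        List<String> fetchList = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : webDb.entrySet()) {
            if (entry.getValue() <= maxHop) {
                fetchList.add(entry.getKey());
            }
        }
        return fetchList;
    }
}
```

With maxHop set to 0, only the injected urls would ever be fetched.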