CrawlDb and input dirs


CrawlDb and input dirs

Stefan Groschupf
Hi,

there is something else that confuses me, and it would be great to
get some hints.
The call CrawlDb.createJob(...) creates the crawldb update job. In
this method the main input folder is defined:
job.addInputDir(new File(crawlDb, CrawlDatum.DB_DIR_NAME));
However, in the update() method (lines 48 and 49) two more input
dirs are added.
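
For reference, update() currently looks roughly like this
(paraphrasing from the trunk sources; the names are from memory and
may be slightly off):

  public void update(File crawlDb, File segment) throws IOException {
    JobConf job = CrawlDb.createJob(getConf(), crawlDb);
    // the two extra input dirs added outside of createJob():
    job.addInputDir(new File(segment, CrawlDatum.FETCH_DIR_NAME)); // crawl_fetch
    job.addInputDir(new File(segment, CrawlDatum.PARSE_DIR_NAME)); // crawl_parse
    JobClient.runJob(job);
    CrawlDb.install(job, crawlDb);
  }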

This confuses me: in theory I understand that the parsed data are
needed to add fresh URLs to the crawldb, but I'm surprised, first of
all, that both folders are added.
Secondly, I can't find the code that writes CrawlDatum objects into
these folders; instead I found that the FetcherOutputFormat writes
ParseImpl and Content into them.
I also found no code where these objects are converted or merged
together.
So I'm asking myself why these folders are added, and where and how
the fresh CrawlDatum objects come from that will be merged into the
new crawldb.

Thirdly, wouldn't it be cleaner to move the adding of these folders
into the createJob() method as well?

Thanks for any hints.
Stefan

Re: CrawlDb and input dirs

Doug Cutting
Stefan Groschupf wrote:
> The call CrawlDb.createJob(...) creates the crawldb update job. In
> this method the main input folder is defined:
> job.addInputDir(new File(crawlDb, CrawlDatum.DB_DIR_NAME));
> However, in the update() method (lines 48 and 49) two more input
> dirs are added.
>
> This confuses me: in theory I understand that the parsed data are
> needed to add fresh URLs to the crawldb, but I'm surprised, first
> of all, that both folders are added.

One is from the fetcher, the other from the parser.

The fetcher writes a CrawlDatum for each page fetched, with STATUS_FETCH_*.

The parser writes a CrawlDatum for each link found, with STATUS_LINKED.
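
Schematically, the two sides emit entries like this (a simplified
sketch, not the literal code; the writer names fetchOut and crawlOut
are just illustrative):

  // fetcher side, one entry per fetched page -> segment/crawl_fetch:
  fetchOut.append(new UTF8(url),
      new CrawlDatum(CrawlDatum.STATUS_FETCH_SUCCESS, interval));

  // parser side, one entry per outlink found -> segment/crawl_parse:
  crawlOut.append(new UTF8(toUrl),
      new CrawlDatum(CrawlDatum.STATUS_LINKED, interval));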

> Secondly, I can't find the code that writes CrawlDatum objects into
> these folders; instead I found that the FetcherOutputFormat writes
> ParseImpl and Content into them.

FetcherOutputFormat line 73, and ParseOutputFormat line 107.

> I also found no code where these objects are converted or merged together.

CrawlDbReducer.reduce().
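
In outline the merge does something like this (heavily simplified:
it ignores retries, fetch failures and score updates):

  public void reduce(WritableComparable key, Iterator values,
                     OutputCollector output, Reporter reporter)
    throws IOException {
    CrawlDatum old = null, fetched = null, linked = null;
    while (values.hasNext()) {
      CrawlDatum datum = (CrawlDatum)values.next();
      switch (datum.getStatus()) {
      case CrawlDatum.STATUS_FETCH_SUCCESS: fetched = datum; break; // crawl_fetch
      case CrawlDatum.STATUS_LINKED:        linked = datum;  break; // crawl_parse
      default:                              old = datum;            // existing db entry
      }
    }
    CrawlDatum result;
    if (fetched != null) {            // page was just fetched: update its entry
      result = fetched;
      result.setStatus(CrawlDatum.STATUS_DB_FETCHED);
    } else if (old != null) {         // nothing new for this url: keep the old entry
      result = old;
    } else {                          // previously unknown url, seen only as a link
      result = linked;
      result.setStatus(CrawlDatum.STATUS_DB_UNFETCHED);
    }
    output.collect(key, result);
  }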

> Thirdly, wouldn't it be cleaner to move the adding of these folders
> into the createJob() method as well?

No, the createJob() method is also used by the Injector, where these
directories are not appropriate.
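
The Injector, for instance, runs the same merge job but adds its own
input instead (a sketch; tempDir stands for wherever the inject step
wrote its sorted CrawlDatum entries):

  JobConf mergeJob = CrawlDb.createJob(conf, crawlDb);
  mergeJob.addInputDir(tempDir);  // freshly injected urls
  JobClient.runJob(mergeJob);
  CrawlDb.install(mergeJob, crawlDb);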

Doug

Re: CrawlDb and input dirs

Stefan Groschupf
Thanks for the clarification, I missed all these cross links!
You definitely 'are in the know'. :-)
Stefan


