incremental growing index


Mathijs Homminga
Hi everyone,

Our crawler generates and fetches segments continuously. We'd like to
index and merge each new segment immediately (or with a small delay)
such that our index grows incrementally. This is unlike the normal
situation where one would create a linkdb and an index of all segments
at once, after the crawl has finished.

The problem we have is that Nutch currently needs the complete linkdb
and crawldb each time we want to index a single segment.

The Indexer map task processes all keys (urls) from the input files
(linkdb, crawldb and segment). This includes all data from the linkdb
and crawldb that we actually don't need since we are only interested in
the data that corresponds to the keys (urls) in our segment (this is
filtered out in the Indexer reduce task).
Obviously, as the linkdb and crawldb grow, this becomes more and more of
a problem.

Any ideas on how to tackle this issue?
Is it feasible to lookup the corresponding linkdb and crawldb data for
each key (url) in the segment before or during indexing?

Thanks!
Mathijs Homminga

--
Knowlogy
Helperpark 290 C
9723 ZA Groningen

[hidden email]
+31 (0)6 15312977
http://www.knowlogy.nl



Re: incremental growing index

Andrzej Białecki-2
Mathijs Homminga wrote:

> Hi everyone,
>
> Our crawler generates and fetches segments continuously. We'd like to
> index and merge each new segment immediately (or with a small delay)
> such that our index grows incrementally. This is unlike the normal
> situation where one would create a linkdb and an index of all segments
> at once, after the crawl has finished.
>
> The problem we have is that Nutch currently needs the complete linkdb
> and crawldb each time we want to index a single segment.

The reason for wanting the linkdb is the anchor information. If you
don't need any anchor information, you can provide an empty linkdb.

The reason the crawldb is needed is to get the current page status
information (which may have changed in the meantime due to subsequent
crawldb updates from newer segments). If you don't need this
information, you can modify the Indexer.reduce() method (~line 212) to
allow for this, and then remove the line in Indexer.index() that adds
the crawldb to the list of input paths.
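[Editor's note: for illustration, here is a simplified, self-contained sketch of the join logic being relaxed. This is not the actual Nutch Indexer source; the class, record, and field names are hypothetical stand-ins for the value types the real reducer receives. It only shows the shape of the change: tolerating a missing crawldb datum instead of dropping the key.]

```java
import java.util.Arrays;
import java.util.List;

public class IndexerJoinSketch {

    // Hypothetical stand-in for the values the reducer sees for one URL key:
    // fetch status from the segment, parse text from the segment, and
    // (optionally) the page status datum from the crawldb.
    record Value(String kind, String payload) {}

    /** Returns the document to index, or null if the key is skipped. */
    static String reduce(String url, List<Value> values, boolean requireDbDatum) {
        String fetchDatum = null, dbDatum = null, parseText = null;
        for (Value v : values) {
            switch (v.kind()) {
                case "fetch" -> fetchDatum = v.payload();
                case "db"    -> dbDatum    = v.payload();  // from crawldb
                case "parse" -> parseText  = v.payload();  // from segment
            }
        }
        // Strict behavior: skip the key unless every input contributed a value.
        // Relaxed behavior: index even when no crawldb datum is present.
        if (fetchDatum == null || parseText == null) return null;
        if (requireDbDatum && dbDatum == null) return null;
        return url + " -> " + parseText;
    }

    public static void main(String[] args) {
        // A key that appears only in the segment (no crawldb in the input).
        List<Value> segmentOnly = Arrays.asList(
                new Value("fetch", "FETCH_SUCCESS"),
                new Value("parse", "page text"));
        // Strict check drops it; relaxed check indexes it without db status.
        System.out.println(reduce("http://example.com/", segmentOnly, true));
        System.out.println(reduce("http://example.com/", segmentOnly, false));
    }
}
```

When crawldb is dropped from the input paths, every key arrives without a db datum, so the strict null-check would discard all documents; relaxing it is what makes segment-only indexing work.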

>
> The Indexer map task processes all keys (urls) from the input files
> (linkdb, crawldb and segment). This includes all data from the linkdb
> and crawldb that we actually don't need since we are only interested in
> the data that corresponds to the keys (urls) in our segment (this is
> filtered out in the Indexer reduce task).
> Obviously, as the linkdb and crawldb grow, this becomes more and more of
> a problem.

Is this really a problem for you now? Unless your segments are tiny, the
indexing process will be dominated by I/O from the processing of
parseText / parseData and Lucene operations.

>
> Any ideas on how to tackle this issue?
> Is it feasible to lookup the corresponding linkdb and crawldb data for
> each key (url) in the segment before or during indexing?

It would probably be too slow, unless you made a copy of the
linkdb/crawldb on the local filesystems of each node. But at that point
the benefit of this change would be doubtful, because of all the I/O you
would need to do to prepare each task's environment ...


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: incremental growing index

Mathijs Homminga
Thanks Andrzej,

Perhaps these numbers make our issue more clear:

- after a week of (internet) crawling, the crawldb contains about 22M
documents.
- 6M documents are fetched, in 257 segments (topN = 25,000)
- size of the crawldb = 4,399 MB (22M docs, 0.2 kB/doc)
- size of the linkdb = 75,955 MB (22M docs, 3.5 kB/doc)
- size of a segment = somewhere between 100 and 500 MB (25K docs, 20
kB/doc (max))

As you can see: for a segment of 500 MB, more than 99% of the IO during
indexing is due to the linkdb and crawldb.
We could increase the size of our segments, but in the end this only
delays the problem.
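[Editor's note: a quick sanity check of that 99% figure, using the sizes quoted above. This is plain arithmetic, nothing Nutch-specific.]

```java
public class IoShare {
    public static void main(String[] args) {
        // Sizes in MB, taken from the crawl statistics above.
        double crawldb = 4_399;
        double linkdb  = 75_955;
        double segment = 500;  // largest segment
        double share = (crawldb + linkdb) / (crawldb + linkdb + segment);
        System.out.printf("db share of indexing input: %.2f%%%n", share * 100);
        // prints: db share of indexing input: 99.38%
    }
}
```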

We are now indexing without the linkdb. This reduces the time needed by
a factor of 10. But we would really like to have the anchor texts back
again in the future.

Thanks,
Mathijs

Andrzej Bialecki wrote:

> Mathijs Homminga wrote:
>> Hi everyone,
>>
>> Our crawler generates and fetches segments continuously. We'd like to
>> index and merge each new segment immediately (or with a small delay)
>> such that our index grows incrementally. This is unlike the normal
>> situation where one would create a linkdb and an index of all
>> segments at once, after the crawl has finished.
>>
>> The problem we have is that Nutch currently needs the complete linkdb
>> and crawldb each time we want to index a single segment.
>
> The reason for wanting the linkdb is the anchor information. If you
> don't need any anchor information, you can provide an empty linkdb.
>
> The reason the crawldb is needed is to get the current page status
> information (which may have changed in the meantime due to subsequent
> crawldb updates from newer segments). If you don't need this
> information, you can modify the Indexer.reduce() method (~line 212) to
> allow for this, and then remove the line in Indexer.index() that adds
> the crawldb to the list of input paths.
>
>>
>> The Indexer map task processes all keys (urls) from the input files
>> (linkdb, crawldb and segment). This includes all data from the linkdb
>> and crawldb that we actually don't need since we are only interested
>> in the data that corresponds to the keys (urls) in our segment (this
>> is filtered out in the Indexer reduce task).
>> Obviously, as the linkdb and crawldb grow, this becomes more and more
>> of a problem.
>
> Is this really a problem for you now? Unless your segments are tiny,
> the indexing process will be dominated by I/O from the processing of
> parseText / parseData and Lucene operations.
>
>>
>> Any ideas on how to tackle this issue?
>> Is it feasible to lookup the corresponding linkdb and crawldb data
>> for each key (url) in the segment before or during indexing?
>
> It would probably be too slow, unless you made a copy of the
> linkdb/crawldb on the local filesystems of each node. But at that
> point the benefit of this change would be doubtful, because of all the
> I/O you would need to do to prepare each task's environment ...
>
>

--
Knowlogy
Helperpark 290 C
9723 ZA Groningen

[hidden email]
+31 (0)6 15312977
http://www.knowlogy.nl



spam detect

Anton Potekhin
Does Nutch have any modules for spam detection?
Does anyone know where I can find information (blogs, articles, FAQs)
about it?

P.S. Sorry, I posted these questions to the nutch-dev mailing list
first, but I got no answers and decided to try here.