Creating different binary databases for indexing

Creating different binary databases for indexing

Dennis Kubes
I am working on a boosting solution where I have to create more
binary databases than just the linkdb, crawldb, etc.  For example, I
create one for uncommon words in a page.  Then I want to use these
database objects inside the indexing process, in the filters, by key,
along with the linkdb, parse text, parse data and so on.

The link database and parse text and data are passed into the filters
directly through the filter interface.  I can't pass other databases
alongside them because I would have to change the interface, which means
I would have to refactor all existing indexing filters.  The easiest way
I have found so far is modifying the parse interface to also hold the
database objects that I need, but that doesn't feel like a good
long-term solution.

Is there a better way to pass other keyed value (database) objects into
the indexing filters?  Should we start a discussion about whether we need
this functionality in Nutch and how best to implement it?  I would be
happy to implement it, but I want some discussion and opinions first.

Dennis

Re: Creating different binary databases for indexing

Andrzej Białecki-2
Dennis Kubes wrote:

> I am working on a boosting solution where I have to create more
> binary databases than just the linkdb, crawldb, etc.  For example, I
> create one for uncommon words in a page.  Then I want to use these
> database objects inside the indexing process, in the filters, by
> key, along with the linkdb, parse text, parse data and so on.
> The link database and parse text and data are passed into the filters
> directly through the filter interface.  I can't pass other databases
> alongside them because I would have to change the interface, which
> means I would have to refactor all existing indexing filters.  The
> easiest way I have found so far is modifying the parse interface to
> also hold the database objects that I need, but that doesn't feel like
> a good long-term solution.
>
> Is there a better way to pass other keyed value (database) objects
> into the indexing filters?  Should we start a discussion about whether
> we need this functionality in Nutch and how best to implement it?  I
> would be happy to implement it, but I want some discussion and opinions
> first.

I'm not sure I understood all your requirements.  Anyway, you can
pass arbitrary Writable objects to Indexer map() and reduce(), because
they will be wrapped into ObjectWritable. In particular, you could pass
some data retrieved from an input file (using SequenceFileInputFormat),
if you previously stored your database values in such a file. Or you could
put the primary key of the DB record inside CrawlDatum.metaData, and
then retrieve the record data from the DB during reduce.

All of the above you can accomplish without changing any of the
interfaces, just by adding properly formatted data to the input, and
then using an indexing plugin that can consume this particular type of
input data.
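
As an illustration of the first suggestion, here is a minimal Java sketch of the wrap-and-dispatch idea. It does not use the real Hadoop or Nutch classes: Wrapped is a simplified stand-in for ObjectWritable, and ParseTextStub, UncommonWordsStub and ReduceSketch are hypothetical names made up for this example. The point is only that values of different types can arrive under the same URL key and be told apart by type inside reduce.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for ObjectWritable (assumption): it only carries a
// value; the real class also handles serialization.
final class Wrapped {
    private final Object value;
    Wrapped(Object value) { this.value = value; }
    Object get() { return value; }
}

// Hypothetical record types, one per input "database". Not Nutch classes.
final class ParseTextStub {
    final String text;
    ParseTextStub(String text) { this.text = text; }
}

final class UncommonWordsStub {
    final List<String> words;
    UncommonWordsStub(List<String> words) { this.words = words; }
}

final class ReduceSketch {
    // Mimics one reduce() call: every value collected for a single URL key
    // arrives together, and each one is unwrapped and dispatched by type.
    static String reduce(String url, List<Wrapped> values) {
        String text = null;
        List<String> uncommon = new ArrayList<>();
        for (Wrapped w : values) {
            Object v = w.get();
            if (v instanceof ParseTextStub) {
                text = ((ParseTextStub) v).text;
            } else if (v instanceof UncommonWordsStub) {
                uncommon.addAll(((UncommonWordsStub) v).words);
            }
            // Unknown types are ignored, so extra inputs do not break
            // code that does not know about them.
        }
        return url + " text=" + text + " uncommon=" + uncommon;
    }

    public static void main(String[] args) {
        List<Wrapped> values = new ArrayList<>();
        values.add(new Wrapped(new ParseTextStub("hello zyzzyva")));
        values.add(new Wrapped(new UncommonWordsStub(Arrays.asList("zyzzyva"))));
        System.out.println(reduce("http://example.com/", values));
    }
}
```

In a real job the extra records would presumably come from an additional input path keyed by URL, which is why no interface needs to change for the data to reach reduce.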

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Creating different binary databases for indexing

Dennis Kubes
I am doing that, and I have changed Indexer to retrieve the
ObjectWritable just as it does with the Inlinks and CrawlDb.  But my
problem is that those objects are passed into the indexing filters
directly (well, parse text and parse data are wrapped in Parse, but
they still go in directly).  What if I want to pass another object to
the filters?  What would be a good way to do that without changing the
IndexingFilter interface?

Dennis


Re: Creating different binary databases for indexing

Andrzej Białecki-2
Dennis Kubes wrote:
> I am doing that, and I have changed Indexer to retrieve the
> ObjectWritable just as it does with the Inlinks and CrawlDb.  But my
> problem is that those objects are passed into the indexing filters
> directly (well, parse text and parse data are wrapped in Parse, but
> they still go in directly).  What if I want to pass another object to
> the filters?  What would be a good way to do that without changing the
> IndexingFilter interface?

Use CrawlDatum.metaData. You can put any Writable there.
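
A rough Java sketch of that idea follows. It is self-contained and deliberately avoids the real Nutch types: DatumStub stands in for CrawlDatum (only the metadata map is modeled, with plain Objects instead of Writables), FilterStub stands in for an indexing filter with a fixed method signature, and the key name and boost formula are invented for the example. It only shows that extra per-URL data can ride along inside an argument the filter already receives.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for CrawlDatum (assumption): only a metadata map is modeled.
final class DatumStub {
    private final Map<String, Object> metaData = new HashMap<>();
    Map<String, Object> getMetaData() { return metaData; }
}

// Stand-in for a filter interface whose signature must not change.
interface FilterStub {
    float filter(float boost, String url, DatumStub datum);
}

// A filter that consumes the extra data if present and ignores it otherwise.
final class UncommonWordBoost implements FilterStub {
    // Hypothetical metadata key, chosen for this example only.
    static final String KEY = "uncommon.word.count";

    @Override
    public float filter(float boost, String url, DatumStub datum) {
        Object v = datum.getMetaData().get(KEY);
        if (!(v instanceof Integer)) {
            return boost; // no extra data for this URL: leave boost unchanged
        }
        // Arbitrary illustrative formula: 10% extra per uncommon word.
        return boost * (1.0f + 0.1f * (Integer) v);
    }
}

final class MetaDataSketch {
    public static void main(String[] args) {
        DatumStub datum = new DatumStub();
        // An earlier step would attach the value to the datum for this URL.
        datum.getMetaData().put(UncommonWordBoost.KEY, 5);
        float boosted = new UncommonWordBoost()
                .filter(1.0f, "http://example.com/", datum);
        System.out.println(boosted);
    }
}
```

Filters that do not know about the key never look it up, so existing filters would keep working unmodified under this scheme.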




Re: Creating different binary databases for indexing

Dennis Kubes
Yeah.  That is what I needed.  Thanks.
