adding dmoz meta data to index.

ned@bcit
Hi All,

I need to add dmoz meta-data to my index. I see some people have commented about it but I didn't find a solution. Can someone read the steps below and give me some hints or pointers? This is the code that I added:

1) Injector.java: datum.setCategory("dmoz-cat");

2) CrawlDatum.java: added a new private field 'category' along with set and get methods for it.

3) BasicIndexingFilter.java: doc.add(new Field("category", datum.getCategory(), Field.Store.YES, Field.Index.UN_TOKENIZED));

However, the code breaks at the third step (when I run index), saying that category is null.

Alternatively, am I supposed to add the category to the metadata in CrawlDatum instead? In that case, do I have to modify the readFields() method of CrawlDatum?

Thanks in advance.


Re: adding dmoz meta data to index.

Sebastian Steinmetz
Hello,

I'm implementing something similar at the moment. I'm feeding Nutch
a URL list in which each URL is annotated with an ID. This ID must go
into the Lucene index so that I can set up a 1:many relation between
a database and the crawled pages.

I've added the custom data to the metadata field of the datum. See
InjectMapper:

// add myID to the crawlDatum as metaData
MapWritable meta = new MapWritable();
meta.put(new Text("myID"), new Text(myID));
datum.setMetaData(meta);
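
A likely reason the metadata route works where a bare extra field comes back null is serialization: a record that is handed from one job to the next only keeps what its write() and readFields() methods actually store. The following is a minimal stand-alone sketch of that effect, using plain java.io only. DatumSketch and roundTrip are hypothetical names made up for illustration; this is not Nutch's CrawlDatum, and it only assumes that the metadata map is serialized while the newly added field is not.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a Writable-style record; NOT Nutch's CrawlDatum.
class DatumSketch {
    // Extra field that write()/readFields() deliberately ignore.
    String category;
    // Metadata map that write()/readFields() do serialize.
    Map<String, String> meta = new HashMap<String, String>();

    void write(DataOutput out) throws IOException {
        out.writeInt(meta.size());
        for (Map.Entry<String, String> e : meta.entrySet()) {
            out.writeUTF(e.getKey());
            out.writeUTF(e.getValue());
        }
        // 'category' is never written, so it cannot be restored later.
    }

    void readFields(DataInput in) throws IOException {
        meta.clear();
        int n = in.readInt();
        for (int i = 0; i < n; i++) {
            String key = in.readUTF();
            String value = in.readUTF();
            meta.put(key, value);
        }
    }

    // Simulates passing the record between two jobs as bytes.
    static DatumSketch roundTrip(DatumSketch original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));
            DatumSketch copy = new DatumSketch();
            copy.readFields(new DataInputStream(
                    new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            // In-memory streams should not fail; rethrow unchecked.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        DatumSketch d = new DatumSketch();
        d.category = "dmoz-cat";
        d.meta.put("category", "dmoz-cat");
        DatumSketch copy = roundTrip(d);
        System.out.println("field: " + copy.category);
        System.out.println("meta:  " + copy.meta.get("category"));
    }
}
```

In this sketch the plain field is lost after the round trip while the map entry survives, which is why putting the value into the metadata avoids having to touch write() and readFields() at all.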

Now the ID is saved in the CrawlDatum object. On the indexing side
I've written a new plugin, index-id, but it's simply a modified
index-basic ;) The essence is:

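MapWritable meta = datum.getMetaData();

// the key may be missing for URLs that were injected without an ID
Text idText = (Text) meta.get(new Text("myID"));
String id = (idText == null) ? "" : idText.toString();

// compare by content, not by reference (id != "" would always be true here)
if (id.length() > 0) {
        Field myid = new Field("myid", id, Field.Store.YES, Field.Index.UN_TOKENIZED);
        myid.setBoost(5.0f);
        doc.add(myid);
        LOG.info("The following ID was added to the index: " + id);
}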

So that's where I stand at the moment. Now I have to build a custom
query interface so that I can search my MySQL database and enrich
the results with my crawled sites.

Maybe we can join forces. Feel free to contact me :) Greetings,
        Sebastian Steinmetz