Linking url metadata to nutch search results

Linking url metadata to nutch search results

marco-89
I am using nutch to crawl & index an intranet consisting of an initial
fixed set of urls (approx. 3000). For my application I need to reference
some metadata (stored in a database) for each of the original 3000 urls.

Does nutch assign a unique integer id for each starting url in the
crawldb? If so, does the API allow me to get it? When a search is
performed can/is this id returned for each 'hit'?

I want my 'display search results' page to return the nutch results for
each 'hit' as well as the metadata for the hit url if it is one of the
original 3000. I'd rather use an integer ID than have to match on the url
string itself.


Marco Rondelli.


Re: Linking url metadata to nutch search results

Andrzej Białecki-2
marco-89 wrote:
> I am using nutch to crawl & index an intranet consisting of an initial
> fixed set of urls (approx. 3000). For my application I need to reference
> some metadata (stored in a database) for each of the original 3000 urls.
>
> Does nutch assign a unique integer id for each starting url in the
> crawldb? If so, does the API allow me to get it? When a search is
> performed can/is this id returned for each 'hit'?
>  

Nutch uses the full URL as a unique identifier.

If your collection is relatively small (on the order of a few million
docs or less) you can use MD5Hash.digest(url).halfDigest(), which
returns a 64-bit long value. At that scale a collision is very
unlikely, so in practice you can treat the value as unique.
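
For illustration, here is a minimal sketch of the same idea using only
the JDK, so it can be tried without Nutch or Hadoop on the classpath.
The class name UrlIds and the example URL are made up for this sketch.
It takes the first 8 bytes of the MD5 digest as a big-endian long, which
is my understanding of what halfDigest() does; I have not verified that
the result matches MD5Hash byte for byte for non-ASCII URLs.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Nutch: derives a 64-bit id from a URL.
class UrlIds {

    /** Returns the first 8 bytes of MD5(url) packed big-endian into a long. */
    static long halfDigest(String url) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(url.getBytes(StandardCharsets.UTF_8));
            long value = 0;
            for (int i = 0; i < 8; i++) {
                // Mask to avoid sign extension, then shift into position.
                value |= (digest[i] & 0xffL) << (8 * (7 - i));
            }
            return value;
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        // Example URL is an assumption; substitute one of your seed URLs.
        System.out.println(halfDigest("http://intranet.example/page.html"));
    }
}
```

The value can be negative because Java longs are signed, so the database
column holding it should be a signed 64-bit integer type.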

> I want my 'display search results' page to return the nutch results for
> each 'hit' as well as the metadata for the hit url if it is one of the
> original 3000. I'd rather use an integer ID than have to match on the url
> string itself.
>  

Nutch doesn't number the URLs, so you will need to somehow map URLs to
integers. You could do this sequentially, but each time you add/remove
URLs from the crawldb you will get different numbers for the same URLs.
You could also use a perfect hash function which maps String to Integer,
but a perfect hash is built for a fixed set of keys, so even in this
case there is a small probability that existing URLs will be re-numbered
when the set changes. The space of int is too small to use random
hashing and hope there are no collisions.
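
To put rough numbers on the int-versus-long point, here is a small
sketch that uses the standard birthday approximation
p = 1 - exp(-n(n-1) / (2 * 2^bits)) for n randomly hashed URLs. The
class and method names are made up for this example, and it assumes the
hash behaves like a uniform random function.

```java
// Hypothetical helper: estimates the chance of at least one hash collision.
class CollisionEstimate {

    /**
     * Birthday approximation for n items hashed uniformly into 2^bits slots.
     * Uses expm1 so that very small probabilities are not rounded to zero.
     */
    static double collisionProbability(long n, int bits) {
        double space = Math.pow(2.0, bits);
        double x = (double) n * (n - 1) / (2.0 * space);
        return -Math.expm1(-x);
    }

    public static void main(String[] args) {
        long[] sizes = {3_000L, 3_000_000L};
        int[] widths = {32, 64};
        for (long n : sizes) {
            for (int bits : widths) {
                System.out.printf("n=%d bits=%d p=%.3e%n",
                        n, bits, collisionProbability(n, bits));
            }
        }
    }
}
```

Under these assumptions, 3000 URLs hashed into a 32-bit int collide with
a probability of roughly 0.1%, while a few million URLs make a collision
practically certain. With 64 bits the estimate stays far below one in a
million even for a few million URLs.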

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com