Ask for expertise and advice

Emmanuel JOKE
Two things:

1- Today, every time we parse a page, we generate many Outlinks. These
Outlinks can be either links within the same website or links to an
external website (different hostname). They are then recorded in
the CrawlDB.

I would like to identify the CrawlDB entries that were created from an
Outlink found on a page with a different hostname.

For instance, parsing www.foo.com generates 2 Outlinks:
www.foo.com/map.html and www.nutch.com/index.html.
I would like to add an entry to the CrawlDB metadata for
www.nutch.com/index.html, namely _referal_="www.foo.com".

I was thinking of extending the Outlink object so that it can carry
metadata. It would then be possible to modify any Outlink object using an
HTMLParseFilter. Later, when the Outlink is transformed into a CrawlDatum,
we could read that metadata and add it to the CrawlDatum during the CrawlDB
update process.
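
To make the idea concrete, here is a minimal sketch of the flow I have in
mind. It does not use the real Nutch classes: ReferralSketch, MetaOutlink,
Datum, tagExternal and toDatum are hypothetical stand-ins that I made up for
illustration, and the example URLs carry an explicit http:// scheme so that
the hostname can be extracted.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins only; these are NOT the real Nutch Outlink/CrawlDatum.
public class ReferralSketch {

    public static final String REFERRAL_KEY = "_referal_";

    /** Simplified outlink that carries a metadata map (the proposed extension). */
    public static class MetaOutlink {
        public final String toUrl;
        public final Map<String, String> metadata = new HashMap<>();
        public MetaOutlink(String toUrl) { this.toUrl = toUrl; }
    }

    /** Simplified CrawlDB entry holding only metadata. */
    public static class Datum {
        public final Map<String, String> metadata = new HashMap<>();
    }

    /** Returns the hostname of a URL, or null if it has none (e.g. no scheme). */
    public static String hostOf(String url) {
        return URI.create(url).getHost();
    }

    /** Parse-filter step: tag every outlink whose host differs from the source page. */
    public static void tagExternal(String fromUrl, Iterable<MetaOutlink> outlinks) {
        String fromHost = hostOf(fromUrl);
        for (MetaOutlink link : outlinks) {
            String toHost = hostOf(link.toUrl);
            if (fromHost != null && toHost != null
                    && !fromHost.equalsIgnoreCase(toHost)) {
                link.metadata.put(REFERRAL_KEY, fromHost);
            }
        }
    }

    /** Update step: copy the outlink metadata into the new CrawlDB entry. */
    public static Datum toDatum(MetaOutlink link) {
        Datum datum = new Datum();
        datum.metadata.putAll(link.metadata);
        return datum;
    }

    public static void main(String[] args) {
        List<MetaOutlink> links = Arrays.asList(
                new MetaOutlink("http://www.foo.com/map.html"),
                new MetaOutlink("http://www.nutch.com/index.html"));
        tagExternal("http://www.foo.com/", links);
        for (MetaOutlink link : links) {
            System.out.println(link.toUrl + " -> " + toDatum(link).metadata);
        }
    }
}
```

In this sketch only the external link ends up with a _referal_ entry; the
link to the same host is left untouched.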

Could you tell me if this is the correct way to do it, or is there
another way?

2- Nutch has a URL filter mechanism to eliminate URLs. This mechanism
calls each configured filter plugin and passes only the URL string as a
parameter, so URLs are filtered based only on their string representation.

However, I would like to filter some URLs based on a few criteria that can
be found in the metadata of the CrawlDatum corresponding to the URL.
I was thinking it could be interesting to improve this filter so that it
also receives the CrawlDatum as a parameter.
That would give every user more flexibility to define an advanced filter.
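
As a rough illustration of what such a filter could look like, here is a
small self-contained sketch. The names DatumFilterSketch,
DatumAwareUrlFilter, rejectReferredBy and applyAll are hypothetical and not
part of Nutch, and the CrawlDatum is reduced to a plain metadata map. I
assume the same convention as the existing string-based filters: return the
URL to keep it, or null to reject it.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; DatumAwareUrlFilter is NOT an existing Nutch interface.
public class DatumFilterSketch {

    public static final String REFERRAL_KEY = "_referal_";

    /** Proposed contract: the filter sees the URL and the entry's metadata. */
    public interface DatumAwareUrlFilter {
        /** Returns the URL to keep it, or null to reject it. */
        String filter(String url, Map<String, String> metadata);
    }

    /** Example filter: reject URLs that were discovered from a given host. */
    public static DatumAwareUrlFilter rejectReferredBy(String blockedHost) {
        return (url, metadata) ->
                blockedHost.equals(metadata.get(REFERRAL_KEY)) ? null : url;
    }

    /** Runs the filters in order and stops at the first rejection. */
    public static String applyAll(List<DatumAwareUrlFilter> filters,
                                  String url, Map<String, String> metadata) {
        String current = url;
        for (DatumAwareUrlFilter f : filters) {
            if (current == null) {
                return null;
            }
            current = f.filter(current, metadata);
        }
        return current;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        metadata.put(REFERRAL_KEY, "www.foo.com");
        List<DatumAwareUrlFilter> chain =
                Arrays.asList(rejectReferredBy("www.foo.com"));
        String result = applyAll(chain, "http://www.nutch.com/index.html", metadata);
        System.out.println(result == null ? "rejected" : "kept: " + result);
    }
}
```

A decision like this one cannot be made from the URL string alone, which is
why the metadata would have to be passed to the filter.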

Don't you think it could be interesting for the community? Should I create
a JIRA issue and attach my patch?

Thanks in advance for your feedback