Focussed Web Crawling with Nutch

Alex McLintock
I've been using a Perl-based focussed web crawler with a MySQL back
end, but am now looking at Nutch instead. It seems that a few other
people have done something similar. I'm wondering whether we could
pool our resources and work together on this?

It seems to me that we would be building a few extra plugins. Here is
how I see a focussed Nutch working.

1) Injecting new URLs works as before.
2) The initial generate works as before, but we might want to do
something smarter with DMOZ or Wikipedia.
3) Fetch works as before, based upon the initial URLs. We do not follow
links, but we still store them as outlinks as usual.
4) We build a new index based upon some new relevance algorithm (e.g.
the page mentions items that we are interested in) and mark pages as
relevant or not.
5) Instead of doing an old-style generate or updatedb, we go through
all the pages which we marked as relevant and take their outlinks for
our next iteration.
6) We also inject more URLs which are added by the users, and
potentially the contents of RSS files which we know are relevant to
our topic.
7) We loop back to 3 above.
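
The steps above can be sketched as a small simulation. This is only an illustration of the loop, not Nutch code: the in-memory `WEB` dictionary, the keyword test in `is_relevant`, and the `extra_urls` parameter are all assumptions made up for the example, standing in for real fetching, a real relevance algorithm, and user-injected URLs.

```python
# Hypothetical in-memory "web": URL -> (page text, outlinks).
# It stands in for real fetching and parsing.
WEB = {
    "a": ("nutch crawler tips", ["b", "c"]),
    "b": ("cooking recipes", ["d"]),
    "c": ("focussed crawler with nutch", ["e"]),
    "d": ("more recipes", []),
    "e": ("nutch plugins", []),
}


def is_relevant(text, keywords):
    """Step 4: naive relevance test, true if the page mentions any keyword."""
    words = set(text.lower().split())
    return any(k in words for k in keywords)


def focused_crawl(seeds, keywords, max_rounds=10, extra_urls=None):
    """Return (relevant URLs in fetch order, set of all fetched URLs).

    extra_urls maps a round number to URLs injected after that round (step 6).
    """
    extra = extra_urls or {}
    seen = set()
    relevant = []
    frontier = list(seeds)                      # steps 1-2: inject and generate
    for round_no in range(max_rounds):
        # Drop duplicates and anything already fetched.
        frontier = [u for u in dict.fromkeys(frontier) if u not in seen]
        if not frontier:
            break
        next_frontier = []
        for url in frontier:                    # step 3: fetch
            seen.add(url)
            page = WEB.get(url)
            if page is None:
                continue
            text, outlinks = page
            if is_relevant(text, keywords):     # step 4: mark as relevant
                relevant.append(url)
                next_frontier.extend(outlinks)  # step 5: only relevant outlinks
        next_frontier.extend(extra.get(round_no, []))  # step 6: extra injects
        frontier = next_frontier                # step 7: loop back to fetch
    return relevant, seen


relevant, fetched = focused_crawl(["a", "b"], {"nutch"})
```

With these made-up pages, "b" is fetched but judged irrelevant, so its outlink "d" is never fetched.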

Eventually we end up with a Lucene-style index as usual, which can be
used with the Nutch web app, or Solr, or some other code.

Who is interested in this, or has done it in the past? Can we
chat about it?

Alex

Re: Focussed Web Crawling with Nutch

kkrugler
Hi Alex,

There has been discussion of focused web crawling using Nutch in the
past, so you probably want to check the archives.

The key aspect is using the scoring plugin API to rate pages (and
outlinks from pages). Those scores can then be used to fetch more
efficiently the pages that are likely to be of interest, because they
have more interesting pages pointing to them.
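
As a rough illustration of that idea, here is a sketch that passes page scores on to outlinks and then picks the highest-scoring candidates to fetch next. It does not use the Nutch scoring plugin API; the function names, the input layout, and the choice to split a page's score evenly among its outlinks are all assumptions for the example.

```python
import heapq
from collections import defaultdict


def distribute_scores(pages):
    """pages: {url: (relevance score, [outlinks])}.

    Each page splits its score evenly among its outlinks (an assumed
    policy), so a target linked from several interesting pages
    accumulates a higher score.
    """
    scores = defaultdict(float)
    for url, (score, outlinks) in pages.items():
        if not outlinks:
            continue
        share = score / len(outlinks)
        for target in outlinks:
            scores[target] += share
    return dict(scores)


def top_n(scores, n):
    """Return the n highest-scoring URLs, best first."""
    best = heapq.nlargest(n, scores.items(), key=lambda kv: kv[1])
    return [url for url, _ in best]


# Made-up example: "x" is linked from two interesting pages.
pages = {
    "p1": (1.0, ["x", "y"]),
    "p2": (0.8, ["x"]),
    "p3": (0.1, ["z"]),
}
outlink_scores = distribute_scores(pages)
fetch_next = top_n(outlink_scores, 2)
```

Here "x" ends up ahead of "y" and "z" because two relevant pages point to it, which is the effect described above.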

-- Ken


On Jul 31, 2009, at 3:07am, Alex McLintock wrote:

> [...]

--------------------------
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-210-6378


Re: Focussed Web Crawling with Nutch

MilleBii
I do something like this: I update the URL scores based on my own
algorithm, which works on the parse data.
It works great.
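
The post does not say what the algorithm is, so the following is only a guess at the general shape of such an update, not the actual method. It assumes a relevance measure based on the share of topic terms in the parsed text, blended with the existing URL score by a hypothetical `weight` parameter.

```python
def relevance(parse_text, topic_terms):
    """Fraction of tokens in the parsed text that are topic terms (0.0 to 1.0)."""
    tokens = parse_text.lower().split()
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in topic_terms)
    return hits / len(tokens)


def update_scores(url_scores, parsed_pages, topic_terms, weight=0.5):
    """Blend each URL's old score with the relevance of its parsed text.

    url_scores: {url: current score}; parsed_pages: {url: parsed text}.
    Returns a new dict and leaves url_scores untouched.
    """
    updated = dict(url_scores)
    for url, text in parsed_pages.items():
        old = updated.get(url, 0.0)
        updated[url] = (1 - weight) * old + weight * relevance(text, topic_terms)
    return updated


# Made-up data for illustration.
scores = update_scores(
    {"u1": 1.0, "u2": 1.0},
    {"u1": "nutch nutch crawler faq", "u2": "cooking recipes"},
    {"nutch", "crawler"},
)
```

With this made-up data the on-topic page keeps a higher score than the off-topic one.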

2009/7/31 Ken Krugler <[hidden email]>

> [...]


--
-MilleBii-