Using Nutch for special content pages

Using Nutch for special content pages

Tor Harald Thorland

Hello,

I have a question about Nutch.
I'm a total newbie and am wondering:
Is it possible to set up Nutch to crawl any address it finds, and only
store pages where it finds something about a given subject?
I'd like to build a search site for ship/engine-related material, and
was thinking of starting with .no domains. (I have lots of time for
this, and the pages I'm looking for don't really get "outdated",
but I don't want to waste a lot of disk space etc. on pages that
don't include what I'm looking for.)

Best Regards
Tor Harald Thorland


Re: Using Nutch for special content pages

Damian Florczyk-2
My company has done something like that, but we needed to write our own plugin
for it.

--
Damian Florczyk
Gentoo/NetBSD Development Lead

Re: Using Nutch for special content pages

Zaheed Haque
In reply to this post by Tor Harald Thorland
Hi:

To find a specific text, subject, or group of terms, you need
to process the document, i.e. you need to download the page to your
disk, process it, and then delete or keep it based on rules. But you still need
to download the page. This means you will need a lot of disk space "temporarily"
if you are planning to crawl the world :-)
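
The "process it, then delete or keep based on rules" step could be sketched roughly as below. This is only an illustration, not Nutch code: the class `TopicFilterSketch`, its methods, the keyword list, and the `minHits` threshold are all made-up names and values, and a real setup would have to hook a check like this into a Nutch plugin.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, NOT part of Nutch: decides whether the text of an
// already downloaded page is on-topic enough to keep.
final class TopicFilterSketch {

    // Counts how many of the given keywords occur in the text.
    // Matching is a case-insensitive substring test, so "ship" also
    // matches "shipping" (and, less usefully, "relationship").
    static int countKeywordHits(String text, List<String> keywords) {
        String lower = text.toLowerCase(Locale.ROOT);
        int hits = 0;
        for (String keyword : keywords) {
            if (lower.contains(keyword.toLowerCase(Locale.ROOT))) {
                hits++;
            }
        }
        return hits;
    }

    // Rule: keep the page only if at least minHits distinct keywords appear.
    static boolean shouldKeep(String text, List<String> keywords, int minHits) {
        return countKeywordHits(text, keywords) >= minHits;
    }

    public static void main(String[] args) {
        // Assumed example keywords for a ship/engine search site.
        List<String> keywords = Arrays.asList("ship", "engine", "propeller");
        System.out.println(shouldKeep("Marine diesel engine for a cargo ship", keywords, 2));
        System.out.println(shouldKeep("Recipes for waffles", keywords, 2));
    }
}
```

Note that the page text still has to be fetched before a rule like this can run, so the sketch only saves long-term storage, not the temporary disk space.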

There is a Creative Commons plugin in Nutch, src/plugin/creativecommons, which
does something similar and could be a good starting point. As you have a lot
of time, it would be best to make the new plugin a bit generic :-) so we can all
enjoy it!

Cheers


Re: Using Nutch for special content pages

Justin Hartman
On 1/9/07, Zaheed Haque <[hidden email]> wrote:
> there is a creative commons plugin in nutch src/plugin/creativecommons .. which
> does somewhat similar things could be good starting point.

Sorry to change the subject on this one but what exactly does the
creativecommons plugin do and how would you use it? I've been very
interested in this plugin but it's not altogether documented that well
(I don't think).
--
Regards
Justin Hartman
PGP Key ID: 102CC123

Re: Using Nutch for special content pages

Zaheed Haque
Hi:

In general terms, the CC plugin looks for the "CC:license" on the web pages
it crawls. You can see an example at http://creativecommons.org/ at the end
of the page: there is a CC logo and some copyright text. If you
view the source, you will see the HTML for that bit of the page.
Whenever the Nutch crawler finds such a page it indexes it; otherwise it
deletes the page and moves on to the next one. In essence, this HTML
snippet could be anything, i.e. specific text, a group of terms, and so
on.

Whenever the CC plugin finds a CC page, it also adds some CC-specific
fields to the Lucene index for querying etc. I think all of the above, i.e. the
CCparser, CCindexer and CCquery filters, are under the CC plugin
directory.
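
As a rough illustration of the idea (look for a marker snippet in the HTML, and if it is there, record extra fields for the index), here is a small standalone sketch. It does not use the real plugin's classes or field names: `LicenseMarkerSketch`, the regular expression, and the field names "url" and "license" are assumptions made up for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, NOT the actual creativecommons plugin code.
final class LicenseMarkerSketch {

    // Assumed marker: a link whose href points at a creativecommons.org
    // license URL. Any other HTML snippet could be used instead.
    private static final Pattern LICENSE_LINK = Pattern.compile(
            "href=[\"'](https?://creativecommons\\.org/licenses/[^\"']+)[\"']",
            Pattern.CASE_INSENSITIVE);

    // Returns the extra fields to index for this page, or an empty map
    // when the marker is absent (meaning the page would be skipped).
    static Map<String, String> extractFields(String url, String html) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher matcher = LICENSE_LINK.matcher(html);
        if (matcher.find()) {
            fields.put("url", url);
            fields.put("license", matcher.group(1)); // made-up field name
        }
        return fields;
    }

    public static void main(String[] args) {
        String html = "<a rel=\"license\" "
                + "href=\"http://creativecommons.org/licenses/by/2.5/\">CC</a>";
        System.out.println(extractFields("http://example.org/", html));
        System.out.println(extractFields("http://example.org/", "<p>no marker</p>").isEmpty());
    }
}
```

For a ship/engine crawler, the same shape would apply with a different marker and different fields.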

Cheers
