protocol-foo: How to tell nutch about more URLs to fetch?


protocol-foo: How to tell nutch about more URLs to fetch?

Hiran Chaudhuri

Hi there.

While implementing protocol-foo, a plugin for the example protocol with URLs like foo://something, I find it difficult to decide when to tell Nutch to look for more URLs and when not to. The response is either something like a directory listing, or no directory listing but content.

Is it possible that a protocol-plugin cannot do much without a parser-plugin? And if I were to implement such a parser-plugin, would I then have to implement the directory listing handling plus all the content parsing, as Tika does?

Hiran

Hiran Chaudhuri
Principal Support Engineer

Service Reliability Engineering - Custom

Amadeus Data Processing GmbH
Berghamer Strasse 6
85435 Erding
T: +49-8122-43x3662
[hidden email]

http://amadeus.com

 


Re: protocol-foo: How to tell nutch about more URLs to fetch?

Sebastian Nagel
Hi,

> It would be something like a directory listing, or no directory listing but content.

Have a look at the protocol-file plugin: it wraps a directory listing in an HTML page,
similar to what the Apache web server does when a directory has no index.html.
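
To illustrate the idea, here is a minimal sketch of rendering a "directory" as an HTML index page. It is not the protocol-file code and uses no Nutch classes; the class name FooDirectoryListing, the method toHtml and the sample child names are assumptions made up for this example. A real protocol plugin would still have to hand the resulting page back to Nutch with a text/html content type.

```java
import java.util.List;

/** Hypothetical helper, not part of Nutch: renders the children of a foo:// "directory" as HTML. */
public class FooDirectoryListing {

    /** Escapes the characters that are special in HTML text and attribute values. */
    public static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /**
     * Builds an index page for baseUrl (assumed to end with '/').
     * Each child becomes one anchor, so a generic HTML parser can pick it up as a link.
     */
    public static String toHtml(String baseUrl, List<String> children) {
        StringBuilder sb = new StringBuilder();
        sb.append("<html><head><title>Index of ").append(escape(baseUrl))
          .append("</title></head><body>\n");
        sb.append("<h1>Index of ").append(escape(baseUrl)).append("</h1>\n<ul>\n");
        for (String child : children) {
            sb.append("<li><a href=\"").append(escape(baseUrl + child)).append("\">")
              .append(escape(child)).append("</a></li>\n");
        }
        sb.append("</ul>\n</body></html>\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed sample data: one file and one subdirectory below foo://something/
        System.out.print(toHtml("foo://something/", List.of("readme.txt", "subdir/")));
    }
}
```

With a page like this, a directory is just another HTML document full of links, while a plain file is returned as its own content with no links to follow.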

> It is possible that a protocol-plugin cannot do much without a parser-plugin?

No. Nutch is a web crawler and crawling file systems or file servers is only done
by "emulating" web pages.

> And if I were to implement such a parser-plugin

There is already parse-html and parse-tika...
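
As a rough illustration of why no custom parser is needed: once the listing is an ordinary HTML page, plain anchor extraction recovers the URLs. The sketch below is not how parse-html or parse-tika work internally; the class name ListingLinks and the regex approach are assumptions for this example only, and a real HTML parser is far more robust.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical demo, not part of Nutch: pulls href values out of a simple HTML listing page. */
public class ListingLinks {

    // Matches <a ... href="..."> with a double-quoted href; deliberately simplistic.
    private static final Pattern HREF =
            Pattern.compile("<a\\s+[^>]*?href=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    /** Returns the raw href values in document order (HTML entities are not decoded). */
    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        // Assumed sample listing page for foo://something/
        String html = "<ul>"
                + "<li><a href=\"foo://something/readme.txt\">readme.txt</a></li>"
                + "<li><a href=\"foo://something/subdir/\">subdir/</a></li>"
                + "</ul>";
        System.out.println(extractLinks(html));
    }
}
```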

In general, if it's only about indexing a file system, it may be
easier to send the documents directly to Solr (or another indexer).
But often you have a mix of content providers (file system/server,
web site, wiki, etc.) and usually many of them already provide
a web frontend to browse (or crawl) the content.

Best,
Sebastian

On 09/27/2017 06:57 AM, Hiran CHAUDHURI wrote:

> Hi there.
>
> While I am trying to create the protocol-foo, an implementation for the example protocol with URLs
> like foo://something I see difficulty in distinguishing when to tell nutch to search for more URLs
> and when not to. It would be something like a directory listing, or no directory listing but content.
>
> It is possible that a protocol-plugin cannot do much without a parser-plugin? And if I were to
> implement such a parser-plugin, would I then have to implement the directory listing plus all the
> content parsing like Tika?
>
> Hiran


RE: [EXT] Re: protocol-foo: How to tell nutch about more URLs to fetch?

Hiran Chaudhuri
>> It would be something like a directory listing, or no directory listing but content.
>Have a look at the protocol-file plugin: it wraps a directory listing into a HTML page similar
>as the Apache web server does if there is no index.html in a directory.

Yep, I did it. The protocol-foo can now fetch some documents, and at least for me it is clean enough that Nutch neither breaks nor complains.

Thanks for the efficient hint. :-)

Hiran

RE: [EXT] Re: protocol-foo: How to tell nutch about more URLs to fetch?

Hiran Chaudhuri
In reply to this post by Sebastian Nagel
Hello Sebastian,

I extended the protocol-foo implementation so it can serve a small number of URLs.

I believe you should not see issues with multiple parallel instances of PluginRepository any more, and protocol-foo should no longer throw Exceptions as it comes with a better implementation. For me this seems to work pretty well, so from my side I consider it done.

Would you mind taking a look again?

Hiran