Injection from webservice

Injection from webservice

Roannel Fernandez Hernandez-2
Hi folks,

Is there any way in Nutch 1.15 to inject a remote seed file (accessible via http or https)?

I mean this, for instance:

bin/nutch inject crawl http://example.org/seed

Regards
1519-2019: 500th Anniversary of the Town of San Cristóbal de La Habana
For Havana, the greatest. #Habana500 #UCIxHabana500

Re: Injection from webservice

Jorge Betancourt
Hi Roannel,

The current implementation of the injector only accepts a path (actually an
org.apache.hadoop.fs.Path). This means there is no way to feed it a URL
directly; you have to download the content first.
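
A rough sketch of that download-first workaround, in Python. It assumes the remote file is plain text with one URL per line and that the script runs from the Nutch runtime directory so that bin/nutch resolves. The helper names and the urls/seed.txt location are made up for illustration; they are not part of Nutch.

```python
import subprocess
import urllib.request
from pathlib import Path


def parse_seed_text(text: str) -> list[str]:
    """Keep non-empty, non-comment lines and strip surrounding whitespace."""
    urls = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            urls.append(line)
    return urls


def save_seed_file(urls: list[str], seed_dir: Path) -> Path:
    """Write one URL per line to <seed_dir>/seed.txt (hypothetical file name)."""
    seed_dir.mkdir(parents=True, exist_ok=True)
    seed_file = seed_dir / "seed.txt"
    seed_file.write_text("\n".join(urls) + "\n", encoding="utf-8")
    return seed_file


def inject_command(crawldb: str, seed_dir: Path) -> list[str]:
    """Build the inject call; the last argument is a path, not a URL."""
    return ["bin/nutch", "inject", crawldb, str(seed_dir)]


def inject_remote_seeds(seed_url: str, crawldb: str, seed_dir: Path) -> None:
    """Download the remote list, store it locally, then run the injector."""
    with urllib.request.urlopen(seed_url, timeout=30) as response:
        text = response.read().decode("utf-8")
    save_seed_file(parse_seed_text(text), seed_dir)
    subprocess.run(inject_command(crawldb, seed_dir), check=True)
```

Calling inject_remote_seeds("http://example.org/seed", "crawl", Path("urls")) would then stand in for the command from the original question. Keeping the seed directory dedicated to this one file avoids injecting stale lists left there by earlier runs.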

If you use the REST API you can send the seed file using the API endpoint.
Otherwise, you could write your own injector with the proper logic to deal
with a list of URLs coming from a URL.

The REST API implementation just writes the content in the expected format
(https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/service/resources/SeedResource.java#L92-L113).
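
For illustration, a minimal Python sketch of talking to that endpoint. Everything about the wire format here is an assumption based on one reading of SeedResource and should be checked against your Nutch version: the /seed/create path, the JSON field names (name, seedUrls, url), and a plain-text response. The server address is whatever your Nutch server (started with bin/nutch startserver) listens on.

```python
import json
import urllib.request


def seed_list_payload(name: str, urls: list[str]) -> dict:
    """Build the JSON body; field names are ASSUMED, verify against your version."""
    return {"name": name, "seedUrls": [{"url": url} for url in urls]}


def expected_seed_file_text(payload: dict) -> str:
    """Mirror what the linked code appears to do: write one URL per line."""
    return "".join(seed["url"] + "\n" for seed in payload["seedUrls"])


def create_seed_file(server: str, payload: dict) -> str:
    """POST the seed list to the ASSUMED /seed/create path and return the response body."""
    request = urllib.request.Request(
        server.rstrip("/") + "/seed/create",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8")
```

The point of expected_seed_file_text is only to show that the "expected format" is nothing more exotic than a text file with one URL per line, which is also what the injector reads from a local path.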

Best Regards,
Jorge

On Mon, Sep 16, 2019 at 4:59 PM Roannel Fernandez Hernandez <[hidden email]> wrote:

> [snip]

Re: Injection from webservice

Dave Beckstrom-2
Or use a scheduled wget job to pull them from the remote server and store
them on a path that Nutch can access locally.
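
If the scheduled job is a script rather than a bare wget, one detail worth handling is that an inject run could start while the file is still being written, or after a failed download left it empty. A small Python sketch of one way to guard against that: write to a temporary file, then swap it in atomically. The function names and file locations are hypothetical.

```python
import os
import tempfile
import urllib.request
from pathlib import Path


def publish_seed_file(content: bytes, target: Path) -> bool:
    """Atomically replace target with content; refuse to publish an empty list."""
    if not content.strip():
        return False  # keep the previous seed file rather than wiping it
    target.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the same directory so os.replace is a plain rename.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, prefix=".seed-", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(content)
        os.chmod(tmp_name, 0o644)  # mkstemp creates the file as owner-only
        os.replace(tmp_name, target)
    except BaseException:
        os.unlink(tmp_name)
        raise
    return True


def refresh_seeds(seed_url: str, target: Path) -> bool:
    """Download the remote list and publish it for the next inject run."""
    with urllib.request.urlopen(seed_url, timeout=30) as response:
        return publish_seed_file(response.read(), target)
```

A cron entry would then call refresh_seeds with the remote URL and a path Nutch can read, and the regular inject step keeps pointing at that local path.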

Regards,

Dave Beckstrom
Technical Delivery Manager / Senior Developer

On Mon, Sep 16, 2019 at 12:14 PM Jorge Betancourt <[hidden email]> wrote:

> [snip]

Re: Injection from webservice

Roannel Fernandez Hernandez-2
In reply to this post by Jorge Betancourt
Thanks, Jorge, for your answer. Do you think an injector that accepts local/HDFS paths and, in addition, API endpoints could be a good improvement for Nutch?

Regards, Roannel

Re: Injection from webservice

Jorge Betancourt
TBH I'm not entirely sure. Downloading the file can be scripted around
without much trouble. My feeling is that the Injector class has a good
enough scope already. There are valid reasons for having a custom injector
(reading the seed URLs from a DB comes to mind). When I needed a custom
injector it was for very specific requirements, and it made more sense to
have a custom injector instead of generating a seed file (this was before
having a REST API, which right now provides a nice API around the injector).

It is a valid point that we don't have an extension point for the Injector
logic which could allow for having different seed URL providers without
developers needing to worry about the specific injection logic.
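
To make the idea concrete, here is a purely hypothetical sketch of what such a seed-provider abstraction could look like. It is written in Python only to show the shape; none of these names exist in Nutch (whose plugins are Java), and it sidesteps the injector entirely by producing an ordinary seed file.

```python
from pathlib import Path
from typing import Callable, Iterable, Protocol


class SeedProvider(Protocol):
    """Hypothetical extension point: anything that can yield seed URLs."""

    def seeds(self) -> Iterable[str]: ...


class FileSeedProvider:
    """Reads seeds from a local text file, one URL per line."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def seeds(self) -> Iterable[str]:
        for line in self.path.read_text(encoding="utf-8").splitlines():
            if line.strip():
                yield line.strip()


class CallableSeedProvider:
    """Wraps any function returning URLs, e.g. an HTTP fetch or a DB query."""

    def __init__(self, fetch: Callable[[], Iterable[str]]) -> None:
        self.fetch = fetch

    def seeds(self) -> Iterable[str]:
        return self.fetch()


def write_seed_file(providers: Iterable[SeedProvider], target: Path) -> int:
    """Merge all providers into one seed file, dropping duplicate URLs."""
    seen: dict[str, None] = {}  # dict keeps first-seen order
    for provider in providers:
        for url in provider.seeds():
            seen.setdefault(url, None)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text("".join(url + "\n" for url in seen), encoding="utf-8")
    return len(seen)
```

In this shape a provider only answers "which URLs?", and the existing injection logic stays untouched because it still reads a plain seed file.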

My main concern is whether we want to put this additional complexity in Nutch.
Is it really valuable to all of our users to have HTTP/DB/custom injectors
available out of the box in a pluggable way?

I would love to hear what other people have to say.

Best Regards,
Jorge

On Mon, Sep 16, 2019 at 8:53 PM Roannel Fernandez Hernandez <[hidden email]> wrote:

> [snip]

Re: Injection from webservice

lewis john mcgibbney-2
In reply to this post by Roannel Fernandez Hernandez-2
Hi Folks,
I've implemented what Dave suggested. It is clean and easy, but it may not
be quite as ad-hoc-capable as one would always want. For my use cases it
was acceptable.
More responses inline.

On Thu, Sep 19, 2019 at 2:47 PM <[hidden email]> wrote:

> From: Jorge Betancourt <[hidden email]>
>
> [snip]
>
> My main concern is whether we want to put this additional complexity in Nutch.
> Is it really valuable to all of our users to have HTTP/DB/custom injectors
> available out of the box in a pluggable way?
>
> I would love to hear what other people have to say.

In all honesty, I would like to see as much of the REST logic and WebUI
extracted out of the core codebase as possible. I feel like we should have
done it this way around initially but didn't.
Considering 'separation of concerns' for Nutch is important and, Jorge,
you're spot on with your reservations.

Lewis