extract elements from each url as json and write it to s3

srinir
Hi nutch-users,

I would like to write a Nutch plugin that parses each URL, extracts
different elements from the page (using something like the jsoup parser),
constructs a JSON document, and writes it to S3 (I am running my Nutch
cluster in AWS). I am curious to know whether there is any existing plugin
that can do some of the work for me.

I do see an example of how to write a parser plugin at
https://wiki.apache.org/nutch/WritingPluginExample-1.2
I would like to hear from people who have tried a similar use case, to
learn from others' experience.

Thanks
Srini
Re: extract elements from each url as json and write it to s3

lsroudi
Hi,
I'm currently working on a custom plugin that adds some extra fields to the
Elasticsearch index; you can do the same for your own logic. I learned a lot
by reading the code of some existing plugins like tika, elastic indexer...



On Mon, Mar 13, 2017 at 8:25 PM, Srinivasan Ramaswamy <[hidden email]>
wrote:

> Hi nutch-users,
>
> I would like to write a nutch plugin to parse each url and extract
> different elements from the page (using something like jsoup parser) and
> construct a json and write it to s3 (I am running my nutch cluster in AWS).
> I am curious to know whether there is any existing plugin that can do some
> of the work for me.
>
> I do see an example of how to write a parser plugin over at
> https://wiki.apache.org/nutch/WritingPluginExample-1.2
> I am curious to hear from people who have tried a similar use case, to
> learn from others experience.
>
> Thanks
> Srini
>



--
Symfony2 web designer and developer
https://github.com/lsroudi
http://lsroudi.com/
RE: extract elements from each url as json and write it to s3

Markus Jelsma-2
Hello - you can make an HTML parse filter plugin to extract the things you need from the source HTML. Add everything you extract to the parse metadata. You must then build an indexing filter that selects the key/value pairs you added earlier to the parse metadata; each pair must be added as a field to the NutchDocument that is emitted by the indexing filter.
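
A minimal sketch of these two steps, assuming nothing about the real plugin interfaces: it is plain Java with no Nutch dependency, where ordinary maps stand in for the parse metadata and for the NutchDocument, and regular expressions stand in for a proper HTML parser such as jsoup. The class name, method names, and the "extract." key prefix are all hypothetical. In an actual plugin the same logic would live in your HtmlParseFilter and IndexingFilter implementations.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified model only: plain maps stand in for Nutch's parse metadata
// and NutchDocument. None of these names come from Nutch itself.
class ParseToIndexSketch {

    // Hypothetical prefix marking the metadata keys this plugin owns.
    static final String PREFIX = "extract.";

    private static final Pattern TITLE = Pattern.compile(
        "<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern H1 = Pattern.compile(
        "<h1[^>]*>(.*?)</h1>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Step 1 (parse filter role): pull values out of the HTML and
    // store them as key/value pairs in the parse metadata.
    static Map<String, String> parseFilter(String html) {
        Map<String, String> parseMeta = new LinkedHashMap<>();
        putFirstMatch(parseMeta, "title", TITLE, html);
        putFirstMatch(parseMeta, "h1", H1, html);
        return parseMeta;
    }

    private static void putFirstMatch(Map<String, String> meta, String name,
                                      Pattern pattern, String html) {
        Matcher m = pattern.matcher(html);
        if (m.find()) {
            meta.put(PREFIX + name, m.group(1).trim());
        }
    }

    // Step 2 (indexing filter role): select only the pairs added in step 1
    // and copy each one into the document as a field.
    static Map<String, String> indexingFilter(Map<String, String> parseMeta) {
        Map<String, String> doc = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : parseMeta.entrySet()) {
            if (e.getKey().startsWith(PREFIX)) {
                doc.put(e.getKey().substring(PREFIX.length()), e.getValue());
            }
        }
        return doc;
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Example page</title></head>"
                    + "<body><h1>Hello</h1></body></html>";
        Map<String, String> parseMeta = parseFilter(html);
        // Metadata written by other components is left out of the document.
        parseMeta.put("Content-Type", "text/html");
        System.out.println(indexingFilter(parseMeta));
    }
}
```

Prefixing the keys is one way to let the indexing filter tell your extracted values apart from whatever else ends up in the parse metadata.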

Finally, you must build an indexing backend plugin; this is where you write the JSON. The indexing backend plugin receives the NutchDocument you built in your indexing filter. Use it to build your JSON document.
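
The JSON-building part of such a backend could be sketched as below. This is an illustration under assumptions, not Nutch code: a plain map stands in for the NutchDocument fields, the class and method names are hypothetical, and the output goes to a java.io.Writer, which in a real plugin you would replace with whatever uploads to S3. The JSON is built by hand only to keep the example dependency-free; a JSON library would be the more usual choice.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified model only: a map stands in for the fields of a NutchDocument.
class JsonBackendSketch {

    // Escape a string so it is safe inside a JSON string literal.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters use the \ uXXXX form.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    // Turn the document fields into one flat JSON object of string values.
    static String toJson(Map<String, String> doc) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> e : doc.entrySet()) {
            if (!first) {
                sb.append(',');
            }
            first = false;
            sb.append('"').append(escape(e.getKey())).append("\":\"")
              .append(escape(e.getValue())).append('"');
        }
        return sb.append('}').toString();
    }

    // Write one JSON object per line to the given destination.
    static void write(Map<String, String> doc, Writer out) throws IOException {
        out.write(toJson(doc));
        out.write('\n');
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("url", "http://example.com/");
        doc.put("title", "Example page");
        StringWriter out = new StringWriter(); // stand-in for the S3 destination
        write(doc, out);
        System.out.print(out);
    }
}
```

Writing one object per line keeps each document independent, so the output of many crawled pages can be concatenated into a single file.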

Regards,
Markus
 
 
Re: extract elements from each url as json and write it to s3

suyashaoc
Hi,
I think you have to use a database like MongoDB. Write your custom Gora
MongoDB mapping.xml and pass your JSON object to it.

Thanks,
suyash
