Custom Parser / Indexer Starting points


David Ferrero
TL;DR: If I wanted to learn about the Nutch pipeline at a high level and then write a custom parser/indexer of my own, where would a good starting point be?

I have used the latest Nutch 1.x to crawl a few specific websites and have been disappointed with the results. Even after experimenting with the new HTML microdata capabilities that came with the Any23 updates incorporated into Nutch, I am still not (yet) excited. The bottom line is that website data is not well structured and not very friendly to algorithmic consumption (but you already knew that). To that end, I am interested in developing custom parsers per internet domain in an effort to capture domain-specific data. It currently looks like plugin.includes does not allow a per-domain approach for parsers/indexers. Could someone point me toward a high-level view of the Nutch data pipeline, and then toward where to get started creating custom parsers that might support a per-domain approach?

Thanks,
David

RE: Custom Parser / Indexer Starting points

Yossi Tamari
Hi David,

The interfaces related to extending Nutch parser/indexer are actually very
simple. However, finding up-to-date documented samples is not. Luckily,
Nutch comes with plenty built-in, so my suggestion would be to pick one, and
dive into its implementation. Then just copy its folder and use it as a
skeleton, replacing the specific logic (and plugin metadata).

The first question you need to ask yourself is if you really want to write a
Parser/Indexer or just a HtmlParseFilter/IndexingFilter. I suspect that the
default behaviour of the Nutch Parser and Indexer is useful for you, and you
just want to add more functionality (that is what Any23 is doing). You can
chain Filters, so your code could also leverage the Any23 logic, for
example.

The documentation starting point is the Wiki
(https://wiki.apache.org/nutch/). For your specific question, this is the
most relevant page: https://wiki.apache.org/nutch/AboutPlugins.

One (old) example of writing a custom parser can be found here:
http://www.treselle.com/blog/apache-nutch-with-custom-parser/. I suggest you
Google for more information as needed, but always keep in mind that things
may have changed over time.

I think the best approach for domain-specific parsers is to have a custom
parser that maps from the URL to the specific code. This can be just one big
if/else, or a Map of domain->code (possibly using functional programming),
or you can even have this map configurable in some file.
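
A minimal sketch of the "Map of domain->code" idea, in plain Java with no Nutch dependencies. The class name DomainDispatch, the method names, and the use of Function<String, String> as the per-domain "code" are all assumptions made for illustration; in a real plugin the handler would receive whatever the Nutch extension point passes in, and the map could be loaded from a configuration file instead of being registered in code.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical dispatcher: picks extraction logic based on the host of the URL.
class DomainDispatch {
    // host name (lower case) -> extraction logic for that site
    private final Map<String, Function<String, String>> handlers = new HashMap<>();
    // used when no handler is registered for the host, or the URL has no host
    private final Function<String, String> fallback;

    DomainDispatch(Function<String, String> fallback) {
        this.fallback = fallback;
    }

    void register(String host, Function<String, String> handler) {
        handlers.put(host.toLowerCase(Locale.ROOT), handler);
    }

    String extract(String url, String html) {
        String host = null;
        try {
            host = URI.create(url).getHost(); // null when the URL has no authority part
        } catch (IllegalArgumentException e) {
            // malformed URL: fall through to the fallback handler
        }
        Function<String, String> handler = (host == null)
                ? fallback
                : handlers.getOrDefault(host.toLowerCase(Locale.ROOT), fallback);
        return handler.apply(html);
    }

    public static void main(String[] args) {
        DomainDispatch dispatch = new DomainDispatch(html -> "generic handler");
        dispatch.register("example.com", html -> "example.com handler");
        System.out.println(dispatch.extract("https://example.com/page", "<html></html>"));
        System.out.println(dispatch.extract("https://other.org/", "<html></html>"));
    }
}
```

Keeping the lookup in one place means that adding a new site only requires registering one more handler, and the fallback keeps the default behaviour for every other domain.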

Once you have more specific questions/problems, I suggest you email
[hidden email]. [hidden email] is intended for discussing code
contributions to Nutch, as far as I understand, and I think fewer people see
your messages here. (Also, more people will benefit from your questions
there...)

In summary, from my experience, writing any one of these plugins is really
easy (discounting your own complex logic, of course), just implementing one
or a few methods, changing some plugin XML file, and adding your extension
to the global build (Ant) files. But to really understand how the passed
data looks, and what you can do with it, debugging (in local mode) is the
ultimate tool, and in the end is much more time-efficient than looking for
information on the web. This is partly because a lot of the data is passed
in Map-like form, so even the JavaDoc doesn't really tell you what will be
there (it depends on what plugins you have configured, and how you
configured those plugins...).

        Yossi.


> -----Original Message-----
> From: David Ferrero [mailto:[hidden email]]
> Sent: 11 February 2018 04:00
> To: [hidden email]
> Subject: Custom Parser / Indexer Starting points

Re: Custom Parser / Indexer Starting points

David Ferrero
Thank you for all the tips. I think I need to better understand the pipeline of parsers and whether, and how, their order in plugin.includes matters.

> On Feb 11, 2018, at 1:18 AM, Yossi Tamari <[hidden email]> wrote:

RE: Custom Parser / Indexer Starting points

Yossi Tamari
The order of plugins in plugin.includes does not matter.
To define the order of HtmlParseFilters, use the property
htmlparsefilter.order.
To define the order of Parsers, use the file conf/parse-plugins.xml. Note
that once a single Parser returns a result, the following parsers will not
be run.
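
The "first Parser that returns a result wins" rule can be modelled with a small plain-Java sketch. This is not Nutch code: the class FirstMatchChain, the use of Function<String, Optional<String>> to stand in for a parser, and an empty Optional meaning "no result" are all assumptions chosen to illustrate the behaviour described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical model of an ordered parser chain that stops at the first result.
class FirstMatchChain {
    static Optional<String> parse(List<Function<String, Optional<String>>> parsers,
                                  String content) {
        for (Function<String, Optional<String>> parser : parsers) {
            Optional<String> result = parser.apply(content);
            if (result.isPresent()) {
                return result; // parsers later in the list are never invoked
            }
        }
        return Optional.empty(); // no parser produced a result
    }

    public static void main(String[] args) {
        List<String> invoked = new ArrayList<>();
        List<Function<String, Optional<String>>> parsers = Arrays.asList(
                c -> { invoked.add("first"); return Optional.empty(); },
                c -> { invoked.add("second"); return Optional.of("parsed by second"); },
                c -> { invoked.add("third"); return Optional.of("parsed by third"); });

        System.out.println(parse(parsers, "<html></html>").orElse("no result"));
        System.out.println("invoked: " + invoked);
    }
}
```

The practical consequence is that a domain-specific Parser placed early in the order must decline (return no result) for content it does not handle, otherwise the default parser behind it never gets a chance to run.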

> -----Original Message-----
> From: David Ferrero [mailto:[hidden email]]
> Sent: 12 February 2018 06:23
> To: [hidden email]
> Subject: Re: Custom Parser / Indexer Starting points

Re: Custom Parser / Indexer Starting points

Evert Wagenaar
You should start with the extension points that Nutch offers. These are very similar to OSGi and Eclipse plug-ins.

Once you understand these, start writing your parser. Test and implement.

Hope this helps.

Best regards,

Evert Wagenaar.


On Mon, 12 Feb 2018 at 08:56 Yossi Tamari <[hidden email]> wrote: