RE: [MASSMAIL]RE: Removing header,Footer and left menus while crawling


RE: [MASSMAIL]RE: Removing header,Footer and left menus while crawling

Markus Jelsma-2
As I remember, the DefaultExtractor gives the same result as ArticleExtractor, which is fine for contiguous regions of text. CanolaExtractor should be used if you expect lots of non-contiguous regions of text. The latter is also more prone to picking up the boilerplate text you want to avoid in the first place.
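
To make the choice concrete, here is a minimal sketch of a nutch-site.xml fragment that switches to CanolaExtractor. It assumes the tika.extractor and tika.extractor.boilerpipe.algorithm properties quoted further down in this thread, and a setup where parse-tika already handles HTML. The value shown is an illustration, not a recommendation.

```xml
<!-- Sketch only: place inside the <configuration> element of nutch-site.xml -->
<property>
  <name>tika.extractor</name>
  <value>boilerpipe</value>
</property>
<property>
  <name>tika.extractor.boilerpipe.algorithm</name>
  <!-- ArticleExtractor or DefaultExtractor suit contiguous text; Canola suits scattered text -->
  <value>CanolaExtractor</value>
</property>
```

Re-parsing a few representative pages after changing the value is probably the quickest way to see which extractor fits a given site.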

By the way, if you intend to extract text from CJK websites, you need to manually modify Boilerpipe to take into account the different character-to-information ratio, or try Canola.
 
-----Original message-----

> From:Michael Coffey <[hidden email]>
> Sent: Wednesday 15th November 2017 23:00
> To: [hidden email]
> Subject: Re: [MASSMAIL]RE: Removing header,Footer and left menus while crawling
>
> I found a lot of detail about the boilerpipe algorithm in
> http://www.l3s.de/~kohlschuetter/publications/wsdm187-kohlschuetter.pdf
>
>
> Seems like very short paragraphs can be a problem, since one of the primary features used for determining boilerplate is the length of a given text block.
>
> I would also look into the tika.extractor.boilerpipe.algorithm setting. It can be DefaultExtractor, ArticleExtractor or CanolaExtractor. I don't know what the differences are, but I bet ArticleExtractor (the default algorithm) inserts the title.
>
>
>
> ________________________________
> From: Markus Jelsma <[hidden email]>
> To: "[hidden email]" <[hidden email]>
> Sent: Wednesday, November 15, 2017 1:38 PM
> Subject: RE: [MASSMAIL]RE: Removing header,Footer and left menus while crawling
>
>
>
> Boilerpipe is a crude tool, but cheap and effective enough for many sorts of websites. It does have a problem with pages with little text, just as all extractors have some degree of trouble with little text.
>
>
> I believe Boilerpipe adds the title hardcoded, or it is TikaParser doing it. I am not sure, but I remember you can get rid of it by removing some lines of code. See TikaParser.java, I think it is there.
>
>
> Regards,
>
> Markus
>
>
> > non-open source contribution, you could try our extractor if you want, there is a (low speed) test available at https://www.openindex.io/saas/data-extraction/ . It is not free or open source but available and actively developed, and does much more than just text extraction.
>
>
>
>
> -----Original message-----
>
> > From:Rushikesh K <[hidden email]>
>
> > Sent: Wednesday 15th November 2017 22:21
>
> > To: [hidden email]; [hidden email]
>
> > Subject: Re: [MASSMAIL]RE: Removing header,Footer and left menus while crawling
>
> >
>
> > Hello,
>
> >
>
> >
>
> > Eyeris - Thanks for your response. I was able to get it working with Tika boilerpipe, but now I have a different problem: some of the crawled pages don't have the expected data.
>
> > For some pages it brings back only the title and skips all the content; I am not sure in which special cases this happens. In my case I have two problems now:
>
> > 1. When my page has an image and 1 or 2 lines of text, it doesn't get those lines of text (the text is in a <p> tag).
>
> > 2. Why is it adding the title to the start of the content? Is there a way not to include it?
>
> >
>
> > For example, see the following image: for the first URL it came back without any data.
>
> >
>
> >
>
> >
>
> > On Wed, Nov 15, 2017 at 8:57 AM, Eyeris Rodriguez Rueda <[hidden email] <mailto:[hidden email]>> wrote:
>
> > Hello.
>
>
> >
>
>
> > I am using Tika boilerpipe with good results on approximately 2000 websites.
>
> > Rushikesh, if Tika boilerpipe is not working for you, maybe it is because you are not parsing documents with Tika. Please check this configuration and tell us.
>
> > Make sure that the Tika plugin (parse-tika) is activated in the plugin.includes property, then check:
>
>
> >
>
>
> > ***********************************************
> > Use tika parser instead of parse-html.
> >
> > parse-plugins.xml
> >
> > <mimeType name="text/html">
> >         <plugin id="parse-tika" />
> > </mimeType>
> >
> > <mimeType name="application/xhtml+xml">
> >         <plugin id="parse-tika" />
> > </mimeType>
> > ***********************************************
> >
> > ***********************************************
> > nutch-site.xml
> > <property>
> >   <name>tika.extractor</name>
> >   <value>boilerpipe</value>
> >   <description>
> >   Which text extraction algorithm to use. Valid values are: boilerpipe or none.
> >   </description>
> > </property>
> >
> > <property>
> >   <name>tika.extractor.boilerpipe.algorithm</name>
> >   <value>ArticleExtractor</value>
> >   <description>
> >   Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor, ArticleExtractor
> >   or CanolaExtractor.
> >   </description>
> > </property>
> > ****************************************
> > ----- Original message -----
> > From: "Markus Jelsma" <[hidden email] <mailto:[hidden email]>>
> > To: [hidden email] <mailto:[hidden email]>
> > Sent: Tuesday, 14 November 2017 17:40:08
> > Subject: [MASSMAIL]RE: Removing header,Footer and left menus while crawling
>
>
> >
>
>
> > Hello Rushikesh - why is Boilerpipe not working for you? Are you having trouble getting it configured? It is really just setting a boolean value. Or does it work, but not to your satisfaction?
>
>
> >
>
>
> > The Bayan solution should work, theoretically, but just with a lot of tedious manual per-site configuration.
>
>
> >
>
>
> > Regards,
>
>
> > Markus
>
>
> >
>
>
> > -----Original message-----
>
>
> > > From:Rushikesh K <[hidden email] <mailto:[hidden email]>>
>
>
> > > Sent: Tuesday 14th November 2017 23:30
>
>
> > > To: [hidden email] <mailto:[hidden email]>
>
>
> > > Cc: Sebastian Nagel <[hidden email] <mailto:[hidden email]>>; [hidden email] <mailto:[hidden email]>
>
>
> > > Subject: Re: Removing header,Footer and left menus while crawling
>
>
> > >
>
>
> > > Hello,
>
>
> > >
>
>
> > > *Jorge*
>
>
> > > Thanks for the response, and sorry for the confusion: I am using Nutch 1.13. I also
> > > tried configuring Tika boilerpipe with this version, but it doesn't work
> > > for me. You suggested writing our own parser, but I am not a Java developer.
> > > If you or anyone in the community happens to have a working file, it would be
> > > great if you could share it, because many people are facing this
> > > issue (I found this out when I googled).
>
>
> > >
>
>
> > > Mark Vega
>
>
> > > We also tried the Bayan Group extractor plugin with Nutch 1.13, but this is also
> > > not working. We followed the same steps. I can share the changes if you want
> > > to take a look.
>
>
> > >
>
>
> > > I appreciate your quick suggestions!
>
>
> > >
>
>
> > > Thanks
>
>
> > > Rushikesh
>
>
> > >
>
>
> > > On Tue, Nov 14, 2017 at 8:34 AM, Jorge Betancourt <
>
>
> > > [hidden email] <mailto:[hidden email]>> wrote:
>
>
> > >
>
>
> > > > Hello Rushikesh,
>
>
> > > >
>
>
> > > > Are you using Nutch 1.3 or Nutch 1.13? If you're using Nutch 1.13, then you
> > > > could use the Tika boilerpipe implementation. In nutch-site.xml you
> > > > need to enable this feature with:
>
>
> > > >
>
>
> > > > <property>
>
>
> > > >   <name>tika.extractor</name>
>
>
> > > >   <value>boilerpipe</value>
>
>
> > > >   <description>
>
>
> > > >   Which text extraction algorithm to use. Valid values are: boilerpipe or
>
>
> > > > none.
>
>
> > > >   </description>
>
>
> > > > </property>
>
>
> > > >
>
>
> > > > And configure the proper extractor with
>
>
> > > > the tika.extractor.boilerpipe.algorithm setting.
>
>
> > > >
>
>
> > > > This is not a perfect solution, but I've used it successfully in the past.
> > > > Of course, your results will depend on the structure (markup) of the
> > > > website.
>
>
> > > >
>
>
> > > > Another option could be to implement your own parser if you need more
> > > > control over what to include/exclude from the HTML. You can take a look at
> > > > this issue, https://issues.apache.org/jira/browse/NUTCH-585, which contains
> > > > some info and old patches.
>
>
> > > >
>
>
> > > > Best Regards,
>
>
> > > > Jorge
>
>
> > > >
>
>
> > > > On Mon, Nov 13, 2017 at 8:58 PM Rushikesh K <[hidden email] <mailto:[hidden email]>>
>
>
> > > > wrote:
>
>
> > > >
>
>
> > > > > Hello Sebastian,
>
>
> > > > > We are very excited to be using Nutch 1.3 (with Solr 6.4) for crawling
> > > > > our website, and we are happy with the search results, but we have a
> > > > > requirement to skip the header, footer, left menus and some other parts
> > > > > of the page. Can you please guide us on how to exclude those parts? I have
> > > > > tried various approaches found on Google, but nothing works for me.
> > > > >
> > > > > I appreciate your help in advance!
>
>
> > > > > --
>
>
> > > > > Regards
>
>
> > > > > Rushikesh M
>
>
> > > > > .Net Developer
>
>
> > > > >
>
>
> > > >
>
>
> > >
>
>
> > >
>
>
> > >
>
>
> > > --
>
>
> > > Regards
>
>
> > > Rushikesh M
>
>
> > > .Net Developer
>
>
> > >
>
>
> >
>
> > --
>
> > Regards
>
> > Rushikesh M
>
> > .Net Developer
>

Re: [MASSMAIL]RE: Removing header,Footer and left menus while crawling

Michael Coffey
Also, try the boilerpipe demo online at https://boilerpipe-web.appspot.com/
