RE: [MASSMAIL]RE: Removing header,Footer and left menus while crawling


RE: [MASSMAIL]RE: Removing header,Footer and left menus while crawling

Markus Jelsma-2
Boilerpipe is a crude tool but cheap and effective enough for many sorts of websites. It does have a problem with pages with little text, just as all extractors have some degree of trouble with little text.

I believe Boilerpipe adds the title hardcoded, or it is TikaParser doing it. I am not sure, but I remember you can get rid of it by removing some lines of code. See TikaParser.java; I think it is there.

Regards,
Markus

> non-open source contribution, you could try our extractor if you want, there is a (low speed) test available at https://www.openindex.io/saas/data-extraction/ . It is not free or open source but available and actively developed, and does much more than just text extraction.


 
-----Original message-----

> From:Rushikesh K <[hidden email]>
> Sent: Wednesday 15th November 2017 22:21
> To: [hidden email]; [hidden email]
> Subject: Re: [MASSMAIL]RE: Removing header,Footer and left menus while crawling
>
> Hello,
>
>
> Eyeris - Thanks for your response. I was able to get Tika Boilerpipe working, but now I have a different problem: some of the crawled pages don't have the expected data.
> For some pages it returns only the title and skips all the content; I am not sure in which cases this happens. In my case I have two problems now:
> 1. When a page has an image and one or two lines of text, it doesn't extract those lines (the text is in a <p> tag).
> 2. Why is it adding the title to the start of the content? Is there a way not to include it?
>
> For example, see the following image: for the first URL it came back without any data.
>
>
>
> On Wed, Nov 15, 2017 at 8:57 AM, Eyeris Rodriguez Rueda <[hidden email]> wrote:
> Hello.
>
> I am using Tika Boilerpipe with good results on approximately 2000 websites.
> Rushikesh, if Tika Boilerpipe is not working for you, maybe it is because you are not parsing documents with Tika. Please check this configuration and tell us.
>
> Make sure that the Tika plugin (parse-tika) is activated in the plugin.includes property, then check:
>
> ***********************************************
> Use the Tika parser instead of parse-html.
>
> parse-plugins.xml
>
> <mimeType name="text/html">
>         <plugin id="parse-tika" />
> </mimeType>
>
> <mimeType name="application/xhtml+xml">
>         <plugin id="parse-tika" />
> </mimeType>
> ***********************************************
>
> ***********************************************
> nutch-site.xml
> <property>
>   <name>tika.extractor</name>
>   <value>boilerpipe</value>
>   <description>
>   Which text extraction algorithm to use. Valid values are: boilerpipe or none.
>   </description>
> </property>
>
> <property>
>   <name>tika.extractor.boilerpipe.algorithm</name>
>   <value>ArticleExtractor</value>
>   <description>
>   Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor, ArticleExtractor
>   or CanolaExtractor.
>   </description>
> </property>
> ***********************************************
>
> ----- Original message -----
> From: "Markus Jelsma" <[hidden email]>
> To: [hidden email]
> Sent: Tuesday, 14 November 2017 17:40:08
> Subject: [MASSMAIL]RE: Removing header,Footer and left menus while crawling
>
> Hello Rushikesh - why is Boilerpipe not working for you? Are you having trouble getting it configured? It is really just a matter of setting one property. Or does it work, but not to your satisfaction?
>
> The Bayan solution should work in theory, but only with a lot of tedious manual per-site configuration.
>
> Regards,
> Markus
>
> -----Original message-----
> > From: Rushikesh K <[hidden email]>
> > Sent: Tuesday 14th November 2017 23:30
> > To: [hidden email]
> > Cc: Sebastian Nagel <[hidden email]>; [hidden email]
> > Subject: Re: Removing header,Footer and left menus while crawling
> >
> > Hello,
> >
> > *Jorge*
> > Thanks for the response, and sorry for the confusion. I am using Nutch 1.13. I
> > tried configuring Tika Boilerpipe with this version, but it doesn't work
> > for me. You suggested writing our own parser, but I am not a Java developer.
> > If you or anyone in the community has a working file, it would be
> > great if you could share it, because many people are facing this
> > issue (I found this out when I googled it).
> >
> > Mark Vega
> > We also tried the Bayan Group extractor plugin with Nutch 1.13, but it is also
> > not working. We followed the same steps. I can share the changes if you want
> > to take a look.
> >
> > I appreciate your quick suggestions!
> >
> > Thanks
> > Rushikesh
> >
> > On Tue, Nov 14, 2017 at 8:34 AM, Jorge Betancourt <[hidden email]> wrote:
> >
> > > Hello Rushikesh,
> > >
> > > Are you using Nutch 1.3 or Nutch 1.13? If you're using Nutch 1.13, then you
> > > could use the Tika Boilerpipe implementation. In nutch-site.xml you
> > > need to enable this feature with:
> > >
> > > <property>
> > >   <name>tika.extractor</name>
> > >   <value>boilerpipe</value>
> > >   <description>
> > >   Which text extraction algorithm to use. Valid values are: boilerpipe or none.
> > >   </description>
> > > </property>
> > >
> > > And configure the proper extractor with
> > > the tika.extractor.boilerpipe.algorithm setting.
> > >
> > > This is not a perfect solution, but I've used it successfully in the past.
> > > Of course, your results will depend on the structure (markup) of the
> > > website.
> > >
> > > Another option could be to implement your own parser if you need more
> > > control over what to include or exclude from the HTML. You can take a look at
> > > this issue, https://issues.apache.org/jira/browse/NUTCH-585, which contains
> > > some info and old patches.
> > >
> > > Best Regards,
> > > Jorge
> > >
> > > On Mon, Nov 13, 2017 at 8:58 PM Rushikesh K <[hidden email]> wrote:
> > >
> > > > Hello Sebastian,
> > > > We are excited to be using Nutch 1.3 (with Solr 6.4) for crawling
> > > > our website, and we are happy with the search results, but we have a
> > > > requirement to skip the header, footer, left menus and some other parts
> > > > of the page. Can you please advise how we can exclude those parts? I have
> > > > tried various approaches found on Google, but nothing works for me.
> > > >
> > > > Thanks in advance for your help!
> > > > --
> > > > Regards
> > > > Rushikesh M
> > > > .Net Developer
> > > >
> > >
> >
> > --
> > Regards
> > Rushikesh M
> > .Net Developer
> >
> --
> Regards
> Rushikesh M
> .Net Developer

Re: [MASSMAIL]RE: Removing header,Footer and left menus while crawling

Michael Coffey
I found a lot of detail about the Boilerpipe algorithm in
http://www.l3s.de/~kohlschuetter/publications/wsdm187-kohlschuetter.pdf


Seems like very short paragraphs can be a problem, since one of the primary features used for determining boilerplate is the length of a given text block.
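
To make the point about short paragraphs concrete, here is a toy sketch of a length-based block filter. It is not Boilerpipe's actual classifier: the Block class, the thresholds MIN_WORDS and MAX_LINK_DENSITY, and the sample blocks are all made up for illustration. It only shows why a one- or two-line <p> next to an image can be discarded together with the navigation menu when word count is a primary feature.

```python
# Toy illustration only: NOT Boilerpipe's real classifier.
# MIN_WORDS and MAX_LINK_DENSITY are hypothetical thresholds chosen for this example.
from dataclasses import dataclass

MIN_WORDS = 15
MAX_LINK_DENSITY = 0.33


@dataclass
class Block:
    text: str
    linked_words: int = 0  # number of words that sit inside <a> tags

    @property
    def num_words(self) -> int:
        return len(self.text.split())

    @property
    def link_density(self) -> float:
        return self.linked_words / self.num_words if self.num_words else 0.0


def is_content(block: Block) -> bool:
    """Keep a block only if it is long enough and not dominated by links."""
    return block.num_words >= MIN_WORDS and block.link_density <= MAX_LINK_DENSITY


def extract(blocks):
    """Return the text of every block the heuristic classifies as content."""
    return [b.text for b in blocks if is_content(b)]


blocks = [
    Block("Home Products About Contact", linked_words=4),  # navigation menu
    Block("Our office is closed on Friday."),              # short <p> next to an image
    Block("This longer paragraph describes the product in enough detail that a "
          "length-based heuristic treats it as real content rather than boilerplate."),
]
kept = extract(blocks)
print(kept)
```

With these assumed thresholds only the long paragraph survives; the short <p> is dropped for the same reason as the menu, which matches the symptom described in the thread.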

I would also look into the tika.extractor.boilerpipe.algorithm setting. It can be DefaultExtractor, ArticleExtractor or CanolaExtractor. I don't know what the differences are, but I bet ArticleExtractor (the default algorithm) inserts the title.



________________________________
From: Markus Jelsma <[hidden email]>
To: "[hidden email]" <[hidden email]>
Sent: Wednesday, November 15, 2017 1:38 PM
Subject: RE: [MASSMAIL]RE: Removing header,Footer and left menus while crawling



Boilerpipe is a crude tool but cheap and effective enough for many sorts of websites. It does have a problem with pages with little text, just as all extractors have some degree of trouble with little text.

I believe Boilerpipe adds the title hardcoded, or it is TikaParser doing it. I am not sure, but I remember you can get rid of it by removing some lines of code. See TikaParser.java; I think it is there.

Regards,
Markus


> non-open source contribution, you could try our extractor if you want, there is a (low speed) test available at https://www.openindex.io/saas/data-extraction/ . It is not free or open source but available and actively developed, and does much more than just text extraction.



