Removing header,Footer and left menus while crawling

Removing header,Footer and left menus while crawling

Rushikesh K
Hello Sebastian,
We are excited to be using Nutch 1.3 (with Solr 6.4) to crawl our website,
and we are happy with the search results. However, we have a requirement to
skip the header, footer, left menus, and some other parts of the page.
Could you please advise how we can exclude those parts? I have tried
various approaches found on Google, but nothing has worked for me.

Thanks in advance for your help!
--
Regards
Rushikesh M
.Net Developer

Re: Removing header,Footer and left menus while crawling

Jorge Betancourt
Hello Rushikesh,

Are you using Nutch 1.3 or Nutch 1.13? If you're using Nutch 1.13, you
could use the Tika Boilerpipe implementation. To enable this feature, add
the following to nutch-site.xml:

<property>
  <name>tika.extractor</name>
  <value>boilerpipe</value>
  <description>
  Which text extraction algorithm to use. Valid values are: boilerpipe or
  none.
  </description>
</property>

Then configure the appropriate extractor with the
tika.extractor.boilerpipe.algorithm setting.

This is not a perfect solution, but I've used it successfully in the past.
Of course, your results will depend on the structure (markup) of the
website.

Another option is to implement your own parser if you need more control
over what to include in or exclude from the HTML. You can take a look at
this issue, https://issues.apache.org/jira/browse/NUTCH-585, which contains
some information and old patches.

Best Regards,
Jorge


Re: Removing header,Footer and left menus while crawling

Michael Coffey
That is a very interesting note. I have been wanting something like that. I use the Python-based "newspaper" package, but it is not directly compatible with the Nutch/Hadoop infrastructure.



RE: Removing header,Footer and left menus while crawling

Mark Vega
Michael,
I don't know if it's compatible with v1.13, but I've been using an extractor plug-in from Bayan Group (https://github.com/BayanGroup/nutch-custom-search) with v1.10 to strip content that repeats on every page (header, footer, TOC/nav) and index only the main content section into the default search field. The plug-in is easy to configure and use, and it lets you specify multiple elements to remove from the indexable content by element type, id, name, or CSS class. It also lets you map multiple elements from different sites with different element naming/classing conventions into the same field, which is helpful if you've got multiple sites that each name or class their main content section differently. I've been using it without issue for about four years now.

--
Mark F. Vega
Programmer/Analyst
UC Irvine Libraries - Web Services
[hidden email]
949.824.9872
--



Re: Removing header,Footer and left menus while crawling

Rushikesh K
In reply to this post by Jorge Betancourt
Hello,

*Jorge*
Thanks for the response, and sorry for the confusion: I am using Nutch 1.13.
I tried configuring Tika Boilerpipe with this version, but it doesn't work
for me. As for your suggestion to write my own parser, I am not a Java
developer.
If you or anyone in the community has a working file, it would be great if
you could share it, because many people are facing this issue (I found
this out when I googled it).

Mark Vega
We also tried the Bayan Group extractor plugin with Nutch 1.13, but it is
also not working. We followed the same steps. I can share the changes if
you want to take a look.

I appreciate your quick suggestions!

Thanks
Rushikesh


RE: Removing header,Footer and left menus while crawling

Markus Jelsma-2
In reply to this post by Rushikesh K
Hello Rushikesh - why is Boilerpipe not working for you? Are you having trouble getting it configured? It is really just setting a boolean value. Or does it work, but not to your satisfaction?

The Bayan solution should work, theoretically, but only with a lot of tedious manual per-site configuration.

Regards,
Markus


Re: [MASSMAIL]RE: Removing header,Footer and left menus while crawling

Eyeris
Hello.

I am using Tika Boilerpipe with good results on approximately 2000 websites.
Rushikesh, if Tika Boilerpipe is not working for you, maybe it is because you are not parsing documents with Tika. Please check this configuration
and tell us.

Make sure that the Tika plugin (parse-tika) is activated in the plugin.includes property, then check:

***********************************************
Use tika parser instead of parse-html.

parse-plugins.xml

<mimeType name="text/html">
        <plugin id="parse-tika" />
</mimeType>

<mimeType name="application/xhtml+xml">
        <plugin id="parse-tika" />
</mimeType>
***********************************************

***********************************************
nutch-site.xml
<property>
  <name>tika.extractor</name>
  <value>boilerpipe</value>
  <description>
  Which text extraction algorithm to use. Valid values are: boilerpipe or none.
  </description>
</property>

<property>
  <name>tika.extractor.boilerpipe.algorithm</name>
  <value>ArticleExtractor</value>
  <description>
  Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor, ArticleExtractor
  or CanolaExtractor.
  </description>
</property>
****************************************
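
For the plugin.includes check mentioned above, a minimal sketch of the nutch-site.xml entry might look like the following. The plugin list here is an assumption modeled on a typical Nutch 1.x setup, so adapt it to the value you already use; the only part that matters for Boilerpipe is that parse-tika is matched by the pattern.

```xml
<!-- Assumed example value: keep your own plugin list and only make sure
     that parse-tika is included (here via parse-(html|tika)). -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

With parse-html still in the list, the parse-plugins.xml mapping shown above is what routes text/html to parse-tika.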

Re: [MASSMAIL]RE: Removing header,Footer and left menus while crawling

Michael Coffey
I am curious, is it possible to send boilerpipe output to Solr as a separate "plaintext" field, in addition to the usual "content" field (rather than replacing it)? If so, would someone please give an overview of how to do it?

RE: [MASSMAIL]RE: Removing header,Footer and left menus while crawling

Markus Jelsma-2
You could do that, but you would need to fiddle around in TikaParser.java. Using TeeContentHandler, you can attach both the normal ContentHandler and the Boilerpipe version.
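
On the Solr side, the extra field would also have to exist in the schema. A minimal sketch, assuming the field is called "plaintext" as in the question and that your Solr schema defines a text_general field type (both are assumptions, not something confirmed in this thread):

```xml
<!-- Hypothetical extra field for the Boilerpipe output; the name and
     the text_general type are assumptions to adapt to your schema. -->
<field name="plaintext" type="text_general" indexed="true" stored="true"/>
```

This only declares the field. Getting Nutch to fill it still requires the TikaParser.java change described above, plus an indexing step that copies the extra text into the document sent to Solr.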

 
 

Re: [MASSMAIL]RE: Removing header,Footer and left menus while crawling

Rushikesh K
In reply to this post by Eyeris
Hello,

Eyeris - Thanks for your response. I was able to get Tika Boilerpipe working, but now I have a different problem: some of the crawled pages don't have the expected data.

For some pages it brings back only the title and skips all the content; I am not sure in which special cases it does this. In my case I have two problems now:
1. When my page has an image and 1 or 2 lines of text, it doesn't pick up those lines (the text is in a <p> tag).
2. Why is it adding the title to the start of the content? Is there a way not to include it?

For example, see the following image: the first URL came back without any data.
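
One thing that may be worth trying for short pages is a different Boilerpipe algorithm. ArticleExtractor is tuned for article-like pages, so a page with only one or two lines of text may fare better with DefaultExtractor. This is a sketch of a possible experiment, not a confirmed fix; the property and its valid values are the ones listed earlier in this thread:

```xml
<!-- Experiment: switch from ArticleExtractor to DefaultExtractor and
     re-parse a few of the affected pages to compare the output. -->
<property>
  <name>tika.extractor.boilerpipe.algorithm</name>
  <value>DefaultExtractor</value>
</property>
```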


