Tell Nutch to only crawl parts of document


Christian Kunz-2
Hi everybody,

we've got a problem using Nutch: the website that has to be crawled has a navigation bar at the top of each page. Nutch crawls the navigation of each page, so for certain queries (those matching terms in the navigation) every page is returned as a result.

Is there a way to tell Nutch to only crawl parts of a page like only the main content?

Thanks in advance and regards,
Christian

RE: Tell Nutch to only crawl parts of document

Markus Jelsma-2
Hello Christian, you are probably talking about text extraction, which is done in the parse step. Nutch's Tika parser supports Boilerpipe text extraction. It is not very accurate in some cases, but it is the open source solution that is available. Check nutch-default.xml for its settings.
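
To give a feel for what this kind of extraction does, here is a much simplified Python sketch of one heuristic that Boilerpipe-style extractors rely on: blocks whose text is mostly link text are treated as navigation and dropped. This is not Boilerpipe's or Tika's implementation; the block tag list, the 0.5 threshold, and the names `LinkDensityExtractor` and `main_text` are assumptions made for illustration. In Nutch itself the behaviour is switched on through configuration, and in recent 1.x releases the relevant properties appear to be `tika.extractor` and `tika.extractor.boilerpipe.algorithm` (please verify against the nutch-default.xml of your version).

```python
from html.parser import HTMLParser

# Tags that start a new text block (an illustrative, incomplete list).
BLOCK_TAGS = {"p", "div", "ul", "ol", "li", "nav", "header", "footer",
              "section", "article", "h1", "h2", "h3", "table", "tr", "td"}


class LinkDensityExtractor(HTMLParser):
    """Split a page into text blocks and count the link words in each."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # finished blocks as (words, link_word_count)
        self._words = []       # words of the block currently being built
        self._link_words = 0   # how many of those words sit inside <a>
        self._in_link = 0      # nesting depth of open <a> elements

    def flush_block(self):
        """Close the current block and start a new, empty one."""
        if self._words:
            self.blocks.append((self._words, self._link_words))
        self._words, self._link_words = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush_block()
        elif tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush_block()
        elif tag == "a" and self._in_link:
            self._in_link -= 1

    def handle_data(self, data):
        words = data.split()
        self._words.extend(words)
        if self._in_link:
            self._link_words += len(words)


def main_text(html, max_link_density=0.5):
    """Return the text of all blocks whose share of link words is low."""
    parser = LinkDensityExtractor()
    parser.feed(html)
    parser.close()
    parser.flush_block()
    kept = [" ".join(words) for words, link_words in parser.blocks
            if link_words / len(words) <= max_link_density]
    return " ".join(kept)
```

A heuristic like this also shows why the results are not always accurate: a content paragraph that consists mostly of links would be discarded, and a navigation block with a lot of plain text would be kept.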

Regards,
Markus

 
 
-----Original message-----

> From:Christian Kunz <[hidden email]>
> Sent: Thursday 2nd February 2017 15:23
> To: [hidden email]
> Subject: Tell Nutch to only crawl parts of document
>
> Hi everybody,
>
> we've got a problem using Nutch: On the website that has to be crawled, there is a navigation on top of each page. Nutch crawls the navigation of each page which leads to the situation that for certain queries (that are included in the navigation) every page is delivered as a result.
>
> Is there a way to tell Nutch to only crawl parts of a page like only the main content?
>
> Thanks in advance and regards,
> Christian
>

RE: Tell Nutch to only crawl parts of document

Christian Kunz-2
Hi Markus,

thanks, I will check this.

Regards,
Christian


-----Original Message-----
From: Markus Jelsma [mailto:[hidden email]]
Sent: Thursday, 2 February 2017 16:36
To: [hidden email]
Subject: RE: Tell Nutch to only crawl parts of document

Hello Christian, you are probably talking about text extraction, which is done in the parse step. Nutch's Tika parser supports Boilerpipe text extraction. It is not very accurate in some cases, but it is the open source solution that is available. Check nutch-default.xml for its settings.

Regards,
Markus

 
 

RE: Tell Nutch to only crawl parts of document

André Schild
In reply to this post by Christian Kunz-2
Hello Christian,

>we've got a problem using Nutch: On the website that has to be crawled, there is
>a navigation on top of each page. Nutch crawls the navigation of each page
>which leads to the situation that for certain queries (that are included in the navigation) every page is delivered as a result.

We have always used the blacklist-whitelist plugin for this.
With it you can specify tags, ids and classes in your HTML to whitelist or blacklist.
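
The idea behind blacklisting can be sketched in a few lines of Python. This is only an illustration of the technique, not the plugin's code or its configuration syntax: the names `BlacklistTextExtractor` and `extract_text` are made up for this example, and the sketch assumes reasonably well-formed HTML in which every non-void element is closed.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class BlacklistTextExtractor(HTMLParser):
    """Collect page text, skipping elements with a blacklisted id or class."""

    def __init__(self, blocked_ids=(), blocked_classes=()):
        super().__init__()
        self.blocked_ids = set(blocked_ids)
        self.blocked_classes = set(blocked_classes)
        self.chunks = []
        self._skip_depth = 0   # > 0 while inside a blacklisted element

    def _is_blocked(self, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        return (attrs.get("id") in self.blocked_ids
                or any(c in self.blocked_classes for c in classes))

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        # Once inside a blacklisted element, every nested element is skipped too.
        if self._skip_depth or self._is_blocked(attrs):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html, blocked_ids=(), blocked_classes=()):
    """Return the page text without the blacklisted elements."""
    parser = BlacklistTextExtractor(blocked_ids, blocked_classes)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Called as `extract_text(page, blocked_ids={"nav"}, blocked_classes={"footer"})`, it drops everything inside the element with id `nav` and inside any element carrying the class `footer`. A whitelist works the other way round: text is collected only while inside a listed element.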

http://lucene.472066.n3.nabble.com/HTML-tag-filtering-td4116686.html

Here is a version compiled for Nutch 1.12 with Java 8.

https://aarboard.oncloud7.ch/index.php/s/MfFDlsUBWMWW5ZM


André

RE: Tell Nutch to only crawl parts of document

Mark Vega
In reply to this post by Christian Kunz-2
Christian,
I am using a Nutch plugin called Extractor from BayanGroup (https://github.com/BayanGroup/nutch-custom-search) that lets you select content elements on the page with XPath expressions or CSS selectors. I've mapped all the repeating content elements (navs, headers, footers, search bars, etc.) on my sites to specific custom Solr fields, and the non-repeating content is indexed into the default 'content' field in Solr. Only the 'content' field is used when conducting a search, which sidesteps the issue you've encountered of every page showing up in results for searches that match repeated content.

I think the plugin may have changed somewhat since I included it in my Nutch 1.10 installation, but it was easy to set up and has worked well for several years now. I still index the repeating elements, but that information now sits in custom Solr fields that are not searched (I indexed them anyway in case I have some reason to search those fields in the future).

One caveat: when I first set this up, I was indexing 7 sites that used basically the same theme but had no consistent template across sites, i.e., the main 'content' section and the repeating content sections had different CSS selectors on each site. The only way to, say, grab the left navs of every site and separate that content from the main searchable content was to create a very detailed Extractor config file that mapped each individual site's elements into a shared set of custom Solr fields. Again, only the main 'content' section from each site is indexed into the default Solr content field, and repeating content is indexed into custom global nav, left nav, global search, header, and footer fields in Solr.

When we redesigned our public sites last year, I took special pains to make sure that each site used the same CSS selectors for the repeating content elements and the main content section of all pages. Now my Extractor config file is much smaller and still works great!
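
The field mapping described above can be illustrated with a small Python sketch. It does not use the Extractor plugin or its config format. The site names, element ids, and field names in `SITE_MAPPINGS` are hypothetical, matching is done on element ids only rather than full XPath or CSS selectors, and well-formed HTML is assumed.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}

# Hypothetical per-site mappings: different element ids, shared field names.
SITE_MAPPINGS = {
    "site-a": {"topnav": "global_nav", "ftr": "footer"},
    "site-b": {"menu": "global_nav", "page-footer": "footer"},
}


class FieldRouter(HTMLParser):
    """Send the text of mapped elements to their field, the rest to 'content'."""

    def __init__(self, id_to_field, default_field="content"):
        super().__init__()
        self.id_to_field = id_to_field
        self.fields = defaultdict(list)
        self._stack = [default_field]   # target field of each open element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        element_id = dict(attrs).get("id")
        # Inherit the enclosing field unless this element is mapped itself.
        self._stack.append(self.id_to_field.get(element_id, self._stack[-1]))

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and len(self._stack) > 1:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.fields[self._stack[-1]].append(text)


def split_into_fields(html, id_to_field):
    """Return a dict of field name to text for one page."""
    router = FieldRouter(id_to_field)
    router.feed(html)
    router.close()
    return {name: " ".join(parts) for name, parts in router.fields.items()}
```

With one mapping per site, pages from differently built sites end up with the same set of fields, and a search that queries only `content` never matches the navigation or footer text. Once all sites share the same ids, a single mapping is enough.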

--
Mark F. Vega
Programmer/Analyst
UC Irvine Libraries - Web Services
[hidden email]
949.824.9872
--

