fetching pdfs from our website


fetching pdfs from our website

d.kumar@technisat.de
Hey,

currently we are on Nutch 2.3.1 and using it to crawl our websites.
One of our goals is to get all the PDFs on our website crawled. Links on the different websites look like this: https://assets0.mysite.com/asset /DB_product.pdf
I tried different things:
In the configuration I removed every occurrence of pdf from regex-urlfilter.txt and added the download URL, added parse-tika to the plugins in nutch-site.xml, added application/pdf to http.accept in default-site.xml, and added pdf to parse-plugins.xml.
But still no PDF links are being fetched.

regex-urlfilter.txt
+https://assets.*. mysite.com/asset

parse-plugins.xml
<mimeType name="application/pdf">
    <plugin id="parse-tika" />
</mimeType>

nutch-site.xml
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(html|metatags|tika)|index-(basic|anchor|more|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|indexer-solr</value>
</property>

default-site.xml
<property>
  <name>http.accept</name>
  <value>application/pdf,text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8</value>
  <description>Value of the "Accept" request header field.
  </description>
</property>

Is there anything else I have to configure?

Thanks

David




Re: fetching pdfs from our website

Sebastian Nagel
Hi David,

for PDFs you usually need to increase the following property:

<property>
  <name>http.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content using the http
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
</property>
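
For example, an override in nutch-site.xml might look like this (the value is only an illustration; as the description above says, any negative value means no truncation at all):

<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>Do not truncate fetched content, so that large PDFs are
  downloaded completely.</description>
</property>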

If in doubt, also set the equivalent properties ftp.content.limit and file.content.limit.

Best,
Sebastian


Re: fetching pdfs from our website

d.kumar@technisat.de
Hey Sebastian,

Thanks a lot. I already increased it to around 65 MB. All our PDFs are about 3 to 8 MB in size.
Any other suggestions?
;)



Thanks
David


Re: fetching pdfs from our website

Sebastian Nagel
Hi David,

there are a couple of options to configure how links are followed by the crawler, esp.
  db.max.outlinks.per.page
  db.ignore.external.links

Is the white space in the URLs intended?
> https://assets0.mysite.com/asset /DB_product.pdf
>>> +https://assets.*. mysite.com/asset

URLs normally require spaces to be encoded as '%20' (percent encoding)
or '+' (form encoding, after the '?').
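
If the space is only a copy-paste artifact, the regex-urlfilter.txt rule would presumably look more like this (dots escaped, no whitespace; the exact host pattern is an assumption):

+^https://assets[0-9]*\.mysite\.com/asset/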

If in doubt, start debugging:

1. check the logs

2. try
   $ bin/nutch parsechecker -dumpText http://.../xyz.html        (PDFs linked from here)
   $ bin/nutch parsechecker -dumpText http://.../xyz.pdf
   $ bin/nutch indexchecker http://.../xyz.pdf

3. inspect the storage (HBase, etc.) to see what is stored for the fetched PDFs or for the HTML pages that contain the links.
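
   In Nutch 2.x the web table can be inspected with the readdb command, for example something like this (URL and crawl id are just placeholders):
   $ bin/nutch readdb -url https://assets0.mysite.com/asset/DB_product.pdf -crawlId <yourCrawlId>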

Best,
Sebastian


Re: fetching pdfs from our website

d.kumar@technisat.de
Hey Sebastian,

I already changed the value of db.max.outlinks.per.page to "-1".
And db.ignore.external.links is set to "false", as the assets are sometimes on other domains.
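
Roughly, the two settings look like this (a sketch, assuming they are overridden in nutch-site.xml):

<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
  <description>A negative value means all outlinks of a page are processed.</description>
</property>
<property>
  <name>db.ignore.external.links</name>
  <value>false</value>
  <description>Also follow outlinks that point to other domains.</description>
</property>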

Sorry about the whitespace, that was a copy-paste mistake. Normally there is no whitespace, or it is encoded the right way.

thanks

David

