crawl and index ppt, msword, excel (xls, .xlsx) in Apache Nutch 1.14

3 messages

crawl and index ppt, msword, excel (xls, .xlsx) in Apache Nutch 1.14

polu.amar
Hi All,

We are trying to crawl and index PPT, MS Word, and Excel MIME type documents
linked from a seed URL that is an .html page, i.e. a seed URL whose page has
PPT, MS Word, and Excel files attached.

ex: http://abc.com/solr-tika.html 

I have made the changes below to test PDF/PPT crawling: I went through the
existing parse-plugins.xml for reference, added the PPT, Word, and Excel
related entries to the same file, and tried again.

Tika-parse ref: https://wiki.apache.org/nutch/Features
Mime type ref:
https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/MIME_types/Complete_list_of_MIME_types

*Change 1:*
New entries added in parse-plugins.xml:

<mimeType name="application/vnd.ms-powerpoint">
  <plugin id="parse-tika" />
</mimeType>

<mimeType name="application/vnd.openxmlformats-officedocument.presentationml.presentation">
  <plugin id="parse-tika" />
</mimeType>
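Change 2 below also whitelists the Word and Excel types, so parse-plugins.xml would presumably need matching entries for those as well. A sketch following the same pattern (these additional mappings are an assumption, not tested config):

```xml
<!-- Sketch: same pattern as the PowerPoint entries above. -->
<mimeType name="application/msword">
  <plugin id="parse-tika" />
</mimeType>
<mimeType name="application/vnd.openxmlformats-officedocument.wordprocessingml.document">
  <plugin id="parse-tika" />
</mimeType>
<mimeType name="application/vnd.ms-excel">
  <plugin id="parse-tika" />
</mimeType>
<mimeType name="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet">
  <plugin id="parse-tika" />
</mimeType>
```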
               
*Change 2:*
Allowed/enabled mime type via mimetype-filter.txt

# allow documents with the following MIME types
application/pdf
application/vnd.ms-powerpoint
application/vnd.openxmlformats-officedocument.presentationml.presentation
application/msword
application/vnd.openxmlformats-officedocument.wordprocessingml.document
application/vnd.ms-excel
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet

*Change 3:*

Added the entry below in nutch-site.xml.
Ref:
https://grokbase.com/t/nutch/user/09b5e59k3s/can-nutch-crawl-xls-and-xlsx-file

<property>
  <name>mime.types.file</name>
  <value>tika-mimetypes.xml</value>
  <description>Name of file in CLASSPATH containing filename extension and
  magic sequence to mime types mapping information. Overrides the default
  Tika config if specified.
  </description>
</property>
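Since mime.types.file names a file looked up on the classpath, a quick way to see why loading might fail is to check whether the file actually sits under Nutch's conf directory. A rough sketch (the install path and variable names are assumptions):

```shell
# Sketch (paths are assumptions): mime.types.file is resolved from the
# CLASSPATH, so tika-mimetypes.xml must live in a directory Nutch puts
# on it, e.g. $NUTCH_HOME/conf. If it is not found there, Nutch logs an
# ERROR and falls back to Tika's built-in definitions.
NUTCH_CONF="${NUTCH_HOME:-/opt/nutch}/conf"
if [ -f "$NUTCH_CONF/tika-mimetypes.xml" ]; then
  STATUS="found"
else
  STATUS="missing"
fi
echo "tika-mimetypes.xml: $STATUS"
```

If the file is missing, the "Can't load mime.types.file" error in the log below is expected, and copying the file into conf/ (or simply dropping the property to use Tika's defaults) would presumably resolve it.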

After adding the above changes I ran a crawl, which fails with the log
output below. Could someone please review it and guide me on the next steps?


2018-09-10 18:27:54,977 INFO  anchor.AnchorIndexingFilter - Anchor
deduplication is: off
2018-09-10 18:27:55,162 INFO  util.MimeUtil - Using custom mime.types.file:
tika-mimetypes.xml
2018-09-10 18:27:55,164 ERROR util.MimeUtil - Can't load mime.types.file :
tika-mimetypes.xml using Tika's default
2018-09-10 18:27:56,553 INFO  indexer.IndexWriters - Adding
org.apache.nutch.indexwriter.solr.SolrIndexWriter
2018-09-10 18:27:56,719 INFO  solr.SolrMappingReader - source: content dest:
content
2018-09-10 18:27:56,719 INFO  solr.SolrMappingReader - source: title dest:
title
2018-09-10 18:27:56,719 INFO  solr.SolrMappingReader - source: host dest:
host
2018-09-10 18:27:56,719 INFO  solr.SolrMappingReader - source: segment dest:
segment
2018-09-10 18:27:56,719 INFO  solr.SolrMappingReader - source: boost dest:
boost
2018-09-10 18:27:56,719 INFO  solr.SolrMappingReader - source: digest dest:
digest
2018-09-10 18:27:56,719 INFO  solr.SolrMappingReader - source: tstamp dest:
tstamp
2018-09-10 18:27:56,739 INFO  solr.SolrIndexWriter - Indexing 1/1 documents
2018-09-10 18:27:56,739 INFO  solr.SolrIndexWriter - Deleting 0 documents
2018-09-10 18:27:57,107 INFO  solr.SolrIndexWriter - Indexing 1/1 documents
2018-09-10 18:27:57,107 INFO  solr.SolrIndexWriter - Deleting 0 documents
2018-09-10 18:27:57,128 WARN  mapred.LocalJobRunner -
job_local1216759318_0001
java.lang.Exception:
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error
from server at http://127.0.0.1:8983/solr: Expected mime type
application/octet-stream but got text/html. <html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<title>Error 404 Not Found</title>
</head>
<body>
HTTP ERROR 404

<p>Problem accessing /solr/update. Reason:
<pre>    Not Found</pre></p>
</body>
</html>

        at
org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
        at
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by:
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error
from server at http://127.0.0.1:8983/solr: Expected mime type
application/octet-stream but got text/html. <html>

Thanks,
Amarnath Polu



--
Sent from: http://lucene.472066.n3.nabble.com/Nutch-User-f603147.html
Re: crawl and index ppt, msword, excel (xls, .xlsx) in Apache Nutch 1.14

Sebastian Nagel-2
Hi,

Crawling and indexing Office documents should work out of the box without any
configuration changes; the parse-tika plugin is enabled by default in recent
Nutch versions. The only recommended change is to increase the content limit:

 <property>
  <name>http.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content using the http://
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
 </property>

Office documents tend to be larger than 64 kB and usually fail to parse
if truncated.
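If truncation keeps biting, the limit can also be disabled entirely via nutch-site.xml; a sketch (whether unlimited downloads are acceptable depends on your crawl):

```xml
<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>A negative value disables truncation of downloaded
  content entirely.</description>
</property>
```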

The Solr URL seems to be wrong: it's required to add the name of the "core", e.g.,
  http://localhost:8983/solr/nutch
see https://wiki.apache.org/nutch/NutchTutorial#Setup_Solr_for_search
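This matches the 404 on /solr/update in the log above: without the core name in the URL, that update handler does not exist. A minimal sketch of building the corrected endpoint (the core name "nutch" is an assumption; substitute your own):

```shell
# The Solr endpoint must include the core name; without it, Solr
# answers /solr/update with HTTP 404 ("Error 404 Not Found").
SOLR_CORE=nutch
SOLR_URL="http://localhost:8983/solr/${SOLR_CORE}"
echo "indexing against: $SOLR_URL"
# e.g.: bin/crawl -i -D solr.server.url=$SOLR_URL -s urls/ crawl/ 2
```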


Best,
Sebastian


On 09/10/2018 04:32 PM, polu.amar wrote:


Re: crawl and index ppt, msword, excel (xls, .xlsx) in Apache Nutch 1.14

polu.amar
Hi Sebastian,

Thanks for the update. With the default settings it is not crawling/indexing
Microsoft Office documents (PPT, Word, Excel, etc.).

We have already set the *http.content.limit* property to unlimited (-1).

Do we need to make any changes on the development side (the page is built
with AEM 6.3) for these kinds of Office documents, or any Solr-side changes?

Note: I did pass the Solr URL properly as part of the crawl script (it seems
it was missing from the ticket):

bin/crawl -i -D solr.server.url=http://localhost:8983/solr/tikaparsecollection -s urls/ crawl/ -1

solr collection name: tikaparsecollection
seed.txt: http://abc.com/solr-tika.html 

Kindly assist us with how to achieve this kind of case in Nutch crawling.
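One way to narrow down whether this is a fetch/parse problem or an indexing problem is Nutch's parsechecker tool, which fetches and parses a single URL and prints the parse status, metadata, and (with -dumpText) the extracted text. A sketch, using the example URL from this thread (run the echoed command from the Nutch install directory against a reachable URL):

```shell
# Sketch: parsechecker shows whether parse-tika actually handles the
# page and its linked Office documents, independent of Solr indexing.
SEED_URL="http://abc.com/solr-tika.html"   # example URL from this thread
CMD="bin/nutch parsechecker -dumpText $SEED_URL"
echo "$CMD"
```

If parsechecker succeeds on the individual document URLs but nothing reaches Solr, the problem is presumably on the indexing side rather than in Tika.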


Thanks,
Amarnath Polu


