CHM Files and Tika

CHM Files and Tika

Jan Riewe
Hey there,

I am trying to parse CHM (Microsoft Help) files with Nutch, but I get:

Can't retrieve Tika parser for mime-type application/vnd.ms-htmlhelp

I've tried Nutch 1.4 (Tika 0.10) and Nutch 1.5.1 (Tika 1.1), which
should be able to parse those files:
https://issues.apache.org/jira/browse/TIKA-245

In tika-mimetypes.xml I do find an entry for
application/vnd.ms-htmlhelp

Has anyone run into the same issue and knows how to fix it?

Bye
Jan
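
Editorial note: since the error names the MIME type, detection itself appears to work and only the parser lookup fails. To double-check that a given file really is a CHM container, one can look at its first bytes. This is a minimal sketch, assuming that CHM files start with the ASCII signature "ITSF"; the helper name looks_like_chm is hypothetical and is not part of Tika or Nutch.

```python
import os
import tempfile

# Assumption: a CHM (ITS format) container starts with these four bytes.
CHM_MAGIC = b"ITSF"


def looks_like_chm(path):
    """Return True if the file at `path` starts with the CHM signature."""
    with open(path, "rb") as handle:
        return handle.read(len(CHM_MAGIC)) == CHM_MAGIC


if __name__ == "__main__":
    # Demo on a synthetic file: it carries only the signature and is
    # not a valid CHM archive.
    with tempfile.NamedTemporaryFile(suffix=".chm", delete=False) as tmp:
        tmp.write(CHM_MAGIC + b"\x00" * 16)
    try:
        print(looks_like_chm(tmp.name))
    finally:
        os.remove(tmp.name)
```

If the check passes for the crawled files, the problem is more likely in how the parser is looked up than in the files themselves.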

Re: CHM Files and Tika

Sebastian Nagel
Hi Jan,

confirmed: Nutch cannot parse, while Tika (same version used by Nutch)
can parse chm. The chm parsers are in tika-parser*.jar which is contained
in the Nutch package.

Any ideas?

Sebastian

On 08/08/2012 12:03 PM, Jan Riewe wrote:

> Hey there,
>
> I am trying to parse CHM (Microsoft Help) files with Nutch, but I get:
>
> Can't retrieve Tika parser for mime-type application/vnd.ms-htmlhelp
>
> I've tried Nutch 1.4 (Tika 0.10) and Nutch 1.5.1 (Tika 1.1), which
> should be able to parse those files:
> https://issues.apache.org/jira/browse/TIKA-245
>
> In tika-mimetypes.xml I do find an entry for
> application/vnd.ms-htmlhelp
>
> Has anyone run into the same issue and knows how to fix it?
>
> Bye
> Jan

SolrIndex command

marora
Hi there,
I am a new Nutch user. I am using Nutch to crawl and then send the crawl data
to Solr. I have a question about the bin/nutch solrindex command: which Tika
libraries are used for indexing? Are they the Tika libraries in Nutch, or
does Nutch let Solr do the indexing, so that Solr's Tika libraries are used?
I think I read somewhere that Nutch focuses on crawling and parsing and lets
Solr do the indexing, so Solr's libraries should be used.

Specifically, I am having problems extracting tags, e.g. <H1>, from
PDF files using the Nutch/Solr combination. The extract-contrib module defined
in schema.xml should get used.

Thanks in advance,
Madhvi


RE: CHM Files and Tika

Markus Jelsma-2
In reply to this post by Sebastian Nagel
Hmm, I'm not sure, but maybe we don't include all Tika parser deps in our build.xml?

 
 
-----Original message-----

> From:Sebastian Nagel <[hidden email]>
> Sent: Thu 09-Aug-2012 23:18
> To: [hidden email]
> Subject: Re: CHM Files and Tika
>
> Hi Jan,
>
> confirmed: Nutch cannot parse, while Tika (same version used by Nutch)
> can parse chm. The chm parsers are in tika-parser*.jar which is contained
> in the Nutch package.
>
> Any ideas?
>
> Sebastian

Re: CHM Files and Tika

Julien Nioche-4
new JIRA?

On 9 August 2012 23:30, Markus Jelsma <[hidden email]> wrote:

> hmm, i'm not sure but maybe we don't include all Tika parser deps in our
> build.xml?

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: CHM Files and Tika

Sebastian Nagel
Hi Jan,

opened a Jira issue: https://issues.apache.org/jira/browse/NUTCH-1454
Thanks!

Beyond the "can't retrieve parser" error:
I've tried a couple of chm files (among them the test files from Tika)
but I wasn't able to get Tika to extract content.

 % java -jar tika-app/target/tika-app-1.3-SNAPSHOT.jar -v \
    tika-parsers/src/test/resources/test-documents/testChm2.chm

only extracts:

<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="Content-Length" content="10807437"/>
<meta name="Content-Type" content="application/vnd.ms-htmlhelp"/>
<meta name="resourceName" content="testChm2.chm"/>
<title/>
</head>
<body/></html>

A CHM-viewer shows much more content. What's wrong?

Sebastian
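
Editorial note: when trying several CHM files, it can help to check automatically whether Tika's XHTML output contains any body text at all. This is a minimal sketch that uses only the Python standard library; the function name body_text is hypothetical, and the sample is the output quoted above.

```python
import xml.etree.ElementTree as ET

# Namespace used by the XHTML output shown in the post.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"

SAMPLE = """<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="Content-Length" content="10807437"/>
<meta name="Content-Type" content="application/vnd.ms-htmlhelp"/>
<meta name="resourceName" content="testChm2.chm"/>
<title/>
</head>
<body/></html>"""


def body_text(xhtml):
    """Return the stripped text content of <body>, or "" if there is none."""
    # Parse from bytes so the XML declaration's encoding is honoured.
    data = xhtml.encode("utf-8") if isinstance(xhtml, str) else xhtml
    root = ET.fromstring(data)
    body = root.find(XHTML_NS + "body")
    if body is None:
        return ""
    return "".join(body.itertext()).strip()


if __name__ == "__main__":
    text = body_text(SAMPLE)
    print("empty body" if not text else "%d characters extracted" % len(text))
```

An empty result for a file that a CHM viewer displays correctly points at the extraction step rather than at MIME detection.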

On 08/10/2012 09:32 AM, Julien Nioche wrote:

> new JIRA?

Re: CHM Files and Tika

Jan Riewe
Hey Sebastian,

as far as I can tell, the Tika CHM parser is far from perfect,
but I would expect the included test files to produce correct
results.

There is an alternative library (http://sourceforge.net/projects/chm4j/),
but I don't think there are enough potential users to justify switching
to a different parser for this file type.

Jan

On Tuesday, 14.08.2012, 22:28 +0200, Sebastian Nagel wrote:

> Hi Jan,
>
> opened a Jira issue: https://issues.apache.org/jira/browse/NUTCH-1454
> Thanks!
>
> Beyond the "can't retrieve parser" error:
> I've tried a couple of chm files (among them the test files from Tika)
> but I wasn't able to get Tika to extract content.
>
>  % java -jar tika-app/target/tika-app-1.3-SNAPSHOT.jar -v \
>     tika-parsers/src/test/resources/test-documents/testChm2.chm
>
> only extracts:
>
> <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta name="Content-Length" content="10807437"/>
> <meta name="Content-Type" content="application/vnd.ms-htmlhelp"/>
> <meta name="resourceName" content="testChm2.chm"/>
> <title/>
> </head>
> <body/></html>
>
> A CHM-viewer shows much more content. What's wrong?
>
> Sebastian