parse-plugins.xml

Marko Bauhardt-2
Hi all,
I have a question about parse-plugins.xml and application/pdf.
Why is the TextParser used for parsing PDF files? The mimeType
"application/pdf" is mapped to "parse-pdf" and "parse-text", but the
TextParser does not support PDF files.
The problem is that if the PDF file is truncated, the TextParser "parses"
this content anyway and the indexer indexes garbage. So what is the reason
for mapping "application/pdf" to the "parse-text" plugin?



        <mimeType name="application/pdf">
                <plugin id="parse-pdf" />
                <plugin id="parse-text" />
        </mimeType>


Thanks for any hints,
Marko



Re: parse-plugins.xml

chrismattmann
Hi Marko,

   Thanks for your question. Basically it was set up as a sort of "last
resort" for getting at least *some* information from the PDF file, albeit
littered with garbage. If parse-text does not really make sense
as a backup parser to handle PDF files and get at least some text
to index, then we may think of either (a) removing it from the default
parse-plugins.xml, or (b) writing a simple PdfParser that can handle
truncation as a backup to the existing PdfParser. Basically the philosophy
behind each mimeType entry in parse-plugins.xml is to map the set of
existing Nutch parse plugins to the available content types, giving each
mimeType as many options as possible in terms of getting some content out of
it.
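
As a rough illustration of that fallback idea, the ordered plugin list for a mimeType can be modeled as a chain in which each parser is tried in turn until one succeeds. This is only a sketch and not Nutch's actual code: every name below (ParseError, parse_pdf, parse_text, parse_with_fallback) is hypothetical, and the "PDF parser" is a stand-in that merely checks for the %%EOF trailer.

```python
from typing import Callable, List, Optional

# Hypothetical model of an ordered parser chain; this is NOT Nutch's API.
Parser = Callable[[bytes], str]


class ParseError(Exception):
    """Raised by a parser that cannot handle the given content."""


def parse_pdf(content: bytes) -> str:
    # Stand-in for a real PDF parser: rejects files without the %%EOF trailer.
    if not content.rstrip().endswith(b"%%EOF"):
        raise ParseError("truncated PDF")
    return "<text extracted by the PDF parser>"


def parse_text(content: bytes) -> str:
    # Stand-in for a plain-text parser: accepts any bytes, garbage included.
    return content.decode("latin-1")


def parse_with_fallback(content: bytes, parsers: List[Parser]) -> Optional[str]:
    """Try each parser in order and return the first successful result."""
    for parser in parsers:
        try:
            return parser(content)
        except ParseError:
            continue
    return None


truncated = b"%PDF-1.4\n\x9c\xed\x00\x01binary"
# With the text parser as a fallback, the truncated file is "parsed" into garbage.
with_fallback = parse_with_fallback(truncated, [parse_pdf, parse_text])
# Without it, the failure is reported and nothing is handed to the indexer.
without_fallback = parse_with_fallback(truncated, [parse_pdf])
```

In this model, dropping the text parser from the list for PDF corresponds to option (a): a truncated file yields no parse result instead of raw bytes.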

Cheers,
  Chris






Re: parse-plugins.xml

Andrzej Białecki-2
Chris Mattmann wrote:
> Hi Marko,
>
>    Thanks for your question. Basically it was set up as a sort of "last
> resort" of getting at least * some * information from the PDF file, albeit
> littered with garbage. If indeed the parse-text does not really make sense
>  

IMO it doesn't make sense. PDF text content, even if it's available in
plain text, is usually compressed. The percentage of non-compressed PDFs
out there in my experience is negligible.
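
A small sketch of why a plain-text parser gets nothing useful out of compressed content: PDF streams are commonly stored with the FlateDecode filter (zlib/deflate), and the readable words do not survive in the compressed bytes. The sample text below is made up for illustration, and the snippet compresses a bare string rather than building a real PDF.

```python
import zlib

# Made-up sample text standing in for the contents of a PDF page stream.
page_text = b"Nutch is an open source web search engine. " * 20

# The FlateDecode filter stores stream data zlib/deflate-compressed like this.
compressed = zlib.compress(page_text)

# A plain-text parser sees only the compressed bytes, not the words.
word_visible_raw = b"search engine" in page_text
word_visible_compressed = b"search engine" in compressed

# Only a parser that understands the filter can get the text back.
recovered = zlib.decompress(compressed)
```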

> in terms of a backup parser to handle PDF files and get at least some text
> to index, then we may think of either (a) removing it from the default
>  

+1

> parse-plugins.xml, or (b) writing a simple PdfParser that can handle
> truncation as a backup to the existing PdfParser. Basically the philosophy
>  

I think that "simple PDF parser" is an oxymoron ... ;)


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: parse-plugins.xml

chrismattmann
Hey Andrzej,


On 8/3/06 8:19 AM, "Andrzej Bialecki" <[hidden email]> wrote:

> Chris Mattmann wrote:
>> Hi Marko,
>>
>>    Thanks for your question. Basically it was set up as a sort of "last
>> resort" of getting at least * some * information from the PDF file, albeit
>> littered with garbage. If indeed the parse-text does not really make sense
>>  
>
> IMO it doesn't make sense. PDF text content, even if it's available in
> plain text, is usually compressed. The percentage of non-compressed PDFs
> out there in my experience is negligible.
>
>> in terms of a backup parser to handle PDF files and get at least some text
>> to index, then we may think of either (a) removing it from the default
>>  
>
> +1

Okey-doke, you'll find a quick patch for this at:

http://issues.apache.org/jira/browse/NUTCH-338

I decided to create an issue to just keep track of the fact that we made
this change, and additionally because I tried pasting the quick patch into
my email program here on my Mac and it looked like it was coming out weird
:-)

>
>> parse-plugins.xml, or (b) writing a simple PdfParser that can handle
>> truncation as a backup to the existing PdfParser. Basically the philosophy
>>  
>
> I think that "simple PDF parser" is an oxymoron ... ;)

Heh, I agree with you on that one. If everyone would just move to XML
DocBook, then it would be great! ;)


Thanks!

Cheers,
  Chris




Antwort: Re: parse-plugins.xml

marcel.schnippe
In reply to this post by Andrzej Białecki-2
Andrzej Bialecki <[hidden email]> wrote on 03.08.2006 17:19:02:
> Chris Mattmann wrote:
> > Hi Marko,
> >
> >    Thanks for your question. Basically it was set up as a sort of "last
> > resort" of getting at least * some * information from the PDF file, albeit
> > littered with garbage. If indeed the parse-text does not really make sense
> >
>
> IMO it doesn't make sense. PDF text content, even if it's available in
> plain text, is usually compressed. The percentage of non-compressed PDFs
> out there in my experience is negligible.
>
Using parse-text as a default only makes sense for unknown file types. For
known file types whose own parser fails, parse-text is a bad choice. Users
are not interested in bytestreams.

> > in terms of a backup parser to handle PDF files and get at least some text
> > to index, then we may think of either (a) removing it from the default
> +1
-1
This is not the right way. Better to keep parse-text as the default parser,
but not to fall back to parse-text automatically when the custom parser
fails. The custom parser (PDF in this case) can itself choose to retry with
parse-text.

> > parse-plugins.xml, or (b) writing a simple PdfParser that can handle
> > truncation as a backup to the existing PdfParser. Basically the philosophy

> I think that "simple PDF parser" is an oxymoron ... ;)
+1, this won't work.

But what about (c): a better default parser is needed. It could
determine whether the analysed bytestream is statistically human language and
decide to use the bytestream, or to drop it if it's binary. Maybe it could
filter out the language words contained in binary data (like .exe).
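
One very rough way to approximate option (c) is to measure how much of the bytestream decodes to printable characters. The sketch below is only an assumption about what such a check might look like: the function name and the 0.9 threshold are made up, and it assumes UTF-8 input, so it does not attempt any real statistical language profiling.

```python
def looks_like_text(data: bytes, threshold: float = 0.9) -> bool:
    """Heuristic: True if most of the stream decodes to printable characters.

    Hypothetical helper; the threshold is an arbitrary illustrative value.
    """
    if not data:
        return False
    # Undecodable bytes become U+FFFD, which is counted as non-text below.
    decoded = data.decode("utf-8", errors="replace")
    printable = sum(
        1 for ch in decoded
        if ch != "\ufffd" and (ch.isprintable() or ch in "\t\n\r")
    )
    return printable / len(decoded) >= threshold


plain = "Nutch parses text, including non-ASCII names like Białecki.\n".encode("utf-8")
binary = bytes(range(256)) * 4  # every byte value, like a chunk of an .exe

is_plain_text = looks_like_text(plain)
is_binary_text = looks_like_text(binary)
```

A default parser built on a check like this would index the first input and drop the second.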


Best regards,
Marcel Schnippe

Re: Antwort: Re: parse-plugins.xml

Andrzej Białecki-2
[hidden email] wrote:

>>> to index, then we may think of either (a) removing it from the default
>>>      
>> +1
>>    
> -1
> This is not the right way. Better keep parse-text as default parser. But
> do not
> fall back to parse-text automatically, when the custom parser fails. The
> custom parser (PDF in this case) can choose itself to retry with
> parse-text.
>  

... which involves the first step, which we just did, i.e. removing
parse-text from parse-plugins.xml entry for PDF. The enhancement that
you propose makes sense of course, but it's the next step. Would you
like to prepare a patch for this?


> +1, this won't work
>
> But what about: C) A better default parser is needed. It could
> determine if the analysed bytestream is statistically human language and
>  

Of which humans? Statistical profiles for, say, Devanagari, Kanji,
Arabic and Latin are somewhat different ...

> decide to use the bytestream, or to drop it if it's binary. Maybe it could
> filter language words contained in binary data (like .exe).
>  

What you probably mean is something equivalent to Unix strings(1). I
have a plugin that implements this, which I could contribute if there's
interest.
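
For readers unfamiliar with strings(1): it prints the runs of printable characters found in a binary file. The following is a minimal sketch of that idea, not the plugin mentioned above (whose code is not shown in this thread). The function name is hypothetical, only ASCII is handled, and the default minimum run length of 4 mirrors the usual strings default.

```python
import re
from typing import List


def extract_strings(data: bytes, min_len: int = 4) -> List[str]:
    """Return runs of at least min_len printable ASCII bytes, like strings(1).

    Hypothetical helper for illustration; ASCII only.
    """
    # 0x20-0x7E is the printable ASCII range (space through tilde).
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.decode("ascii") for match in pattern.findall(data)]


blob = b"\x00\x01MZ\x90\x00This program cannot be run\x00\xff\xfeabc\x00kernel32.dll\x00"
found = extract_strings(blob)
```

Here the short runs "MZ" and "abc" fall below the minimum length and are dropped, while the two longer runs are kept.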

--
Best regards,
Andrzej Bialecki     <><



Re: Antwort: Re: parse-plugins.xml

Jérôme Charron
> What you probably mean is something equivalent to Unix strings(1). I
> have a plugin that implements this, which I could contribute if there's
> interest.

+1

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Re: Antwort: Re: parse-plugins.xml

Sami Siren-2
Jérôme Charron wrote:

>> What you probably mean is something equivalent to Unix strings(1). I
>> have a plugin that implements this, which I could contribute if there's
>> interest.
>
>
> +1
>
Hmm... running strings on a couple of randomly selected PDFs gives me
content I wouldn't want to search against.

--
 Sami Siren