Select tika output for extract-only?


Peter Wolanin-2
I had been assuming that I could choose among possible tika output
formats when using the extracting request handler in extract-only mode
as if from the CLI with the tika jar:

    -x or --xml        Output XHTML content (default)
    -h or --html       Output HTML content
    -t or --text       Output plain text content
    -m or --metadata   Output only metadata

However, looking at the docs and source, it seems that only the xml
option is available (hard-coded) in ExtractingDocumentLoader:

serializer = new XMLSerializer(writer, new OutputFormat("XML", "UTF-8", true));

In addition, it seems that the metadata is always appended to the response.

Are there any open issues relating to this, or opinions on whether
adding additional flexibility to the response format would be of
interest for 1.4?
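
For illustration, here is a minimal sketch of what a selectable output format could look like. It uses only the JDK's SAX/TrAX classes rather than Tika or Solr code. The class name ExtractFormatSketch and the helpers newSerializer and render are hypothetical and not part of ExtractingDocumentLoader. The hand-built `p` element merely stands in for the XHTML events a parser would emit. The idea is that the same SAX event stream can be serialized as "xml", "html" or "text" by setting the transformer's output method instead of hard-coding an XML serializer:

```java
import java.io.StringWriter;
import java.io.Writer;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.helpers.AttributesImpl;

class ExtractFormatSketch {

    // Hypothetical helper: build a SAX serializer for the requested format.
    // "format" is assumed to be one of the standard TrAX output methods:
    // "xml", "html" or "text".
    static ContentHandler newSerializer(Writer out, String format) throws Exception {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        // Set output properties before attaching the result.
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, format);
        handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        handler.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        handler.setResult(new StreamResult(out));
        return handler;
    }

    // Feed a tiny hand-built event stream (a stand-in for parser output)
    // through the serializer and return what was written.
    static String render(String format, String text) throws Exception {
        StringWriter out = new StringWriter();
        ContentHandler h = newSerializer(out, format);
        h.startDocument();
        h.startElement("", "p", "p", new AttributesImpl());
        h.characters(text.toCharArray(), 0, text.length());
        h.endElement("", "p", "p");
        h.endDocument();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(render("xml", "hello"));   // markup plus text
        System.out.println(render("text", "hello"));  // text only
    }
}
```

In a real handler the format string would presumably come from a request parameter, and whether to append metadata could be a separate switch; both are assumptions here, not existing options.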

Thanks,

Peter

--
Peter M. Wolanin, Ph.D.
Momentum Specialist,  Acquia. Inc.
[hidden email]

Re: Select tika output for extract-only?

Grant Ingersoll-2

On Jul 11, 2009, at 5:39 PM, Peter Wolanin wrote:

> I had been assuming that I could choose among possible tika output
> formats when using the extracting request handler in extract-only mode
> as if from the CLI with the tika jar:
>
>    -x or --xml        Output XHTML content (default)
>    -h or --html       Output HTML content
>    -t or --text       Output plain text content
>    -m or --metadata   Output only metadata
>
> However, looking at the docs and source, it seems that only the xml
> option is available (hard-coded) in ExtractingDocumentLoader:
>
> serializer = new XMLSerializer(writer, new OutputFormat("XML",  
> "UTF-8", true));
>
> In addition, it seems that the metadata is always appended to the  
> response.
>
> Are there any open issues relating to this,

Not that I know of.

> or opinions on whether
> adding additional flexibility to the response format would be of
> interest for 1.4?
>

Sure, patches welcome.

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search


Re: Select tika output for extract-only?

Yonik Seeley-2-2
In reply to this post by Peter Wolanin-2
Peter, I'm hacking up solr cell right now, trying to simplify the
parameters and fix some bugs (see SOLR-284)
A quick patch to specify the output format should make it into 1.4 -
but you may want to wait until I finish.

-Yonik
http://www.lucidimagination.com


Re: Select tika output for extract-only?

Peter Wolanin-2
OK, thanks. I played with it enough to get plain text out at least,
but I'll wait for the resolution of SOLR-284.

-Peter

On Sun, Jul 12, 2009 at 9:20 AM, Yonik Seeley<[hidden email]> wrote:

> Peter, I'm hacking up solr cell right now, trying to simplify the
> parameters and fix some bugs (see SOLR-284)
> A quick patch to specify the output format should make it into 1.4 -
> but you may want to wait until I finish.
>
> -Yonik
> http://www.lucidimagination.com

--
Peter M. Wolanin, Ph.D.
Momentum Specialist,  Acquia. Inc.
[hidden email]