ExtractingRequestHandler

ExtractingRequestHandler

spring
Hi,

I want to index various file types in Solr, which can easily be done with
ExtractingRequestHandler. But I also need the extracted content back.
I know about ext.extract.only, but then nothing gets indexed, right?

Can I index the document AND get the content back as with ext.extract.only?
In a single request?

Thank you



Re: ExtractingRequestHandler

Erick Erickson
Yes, you can, but... Generally, storing the raw input in Solr is
not the best approach. The problem here is that pretty soon
you get a huge index that contains *everything*. Solr was not
intended to be a data store.

Besides, you then need to store the binary form of the file. Solr
only deals with text, not markup.

Most people index the text in Solr, and enough information
so the application knows where to go to fetch the original
document when the user drills down (e.g. file path, database
PK, etc). Would that work for your situation?

Best
Erick

On Sat, Mar 31, 2012 at 3:55 PM,  <[hidden email]> wrote:

> Can I index the document AND get the content back as with ext.extract.only?
> In a single request?

RE: ExtractingRequestHandler

spring
Hi Erick,

I think there is a misunderstanding.

I want to index the text of the docs in Solr (only indexed, NOT stored).

But I want the text (Tika output) back so that I can:

* reindex faster later (some text extraction, such as OCR, takes a really long time)
* use the text for other processing

The original doc is NOT stored in solr.


So my question was if I can index the original doc via
ExtractingRequestHandler in Solr AND get back the text output, in a single
call.

AFAIK I can do it only in 2 calls:

1) ExtractingRequestHandler?ext.extract.only=true -> Text
2) Index the text from 1) in solr


Thx
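
The two-call workflow described above can be sketched with nothing but the Python standard library. This is only an illustration, not something posted in the thread: the base URL, the /update/extract handler path, and the field names id and text are assumptions about a default setup. The extract-only parameter is left configurable because the thread calls it ext.extract.only, while released Solr versions document it as extractOnly.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET


def build_extract_only_request(base_url, file_bytes, content_type,
                               extract_only_param="extractOnly"):
    """Call 1: have Solr Cell run Tika and return the text without indexing."""
    query = urllib.parse.urlencode({extract_only_param: "true"})
    return urllib.request.Request(
        f"{base_url}/update/extract?{query}",  # assumed handler path
        data=file_bytes,                       # raw file as the request body
        headers={"Content-Type": content_type},
        method="POST",
    )


def build_index_text_request(base_url, doc_id, text,
                             id_field="id", text_field="text"):
    """Call 2: index the already extracted text with a plain XML update message."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    # Field names are assumptions; they must match the fields in schema.xml.
    ET.SubElement(doc, "field", name=id_field).text = doc_id
    ET.SubElement(doc, "field", name=text_field).text = text
    body = ET.tostring(add, encoding="utf-8")  # ElementTree escapes <, > and &
    return urllib.request.Request(
        f"{base_url}/update?commit=true",
        data=body,
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )
```

Each request would then be sent with urllib.request.urlopen(request); the text pulled out of the first response is what gets passed to the second call, and it can be saved on the client in between.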


Re: ExtractingRequestHandler

Erick Erickson
Ahhh, OK. Sure, anything you store in Solr you can get back. The key
is not Tika, but your schema.xml file, and setting stored="true" on the field.

bq: So my question was if I can index the original doc via
ExtractingRequestHandler in Solr AND get back the text output, in a single
call.

I know of no way to do this using Solr Cell. That said, you can always
use SolrJ and Tika on the client to separate the Tika parsing from
the indexing steps. Then you have all the parts available on the
client to do whatever you want.
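
One way to read this suggestion: once extraction happens on the client, the extracted text can be kept around, so a later reindex does not have to repeat slow steps such as OCR. A minimal sketch of that idea follows. It is written in Python rather than SolrJ, the extract callable is a hypothetical stand-in for whatever actually runs Tika, and the on-disk cache layout is an assumption made for illustration.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cached_extract(file_bytes: bytes,
                   extract: Callable[[bytes], str],
                   cache_dir: Path) -> str:
    """Return the extracted text, running the slow extractor only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Key the cache on the file content, so an unchanged file is never re-parsed.
    key = hashlib.sha256(file_bytes).hexdigest()
    cache_file = cache_dir / f"{key}.txt"
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    text = extract(file_bytes)  # hypothetical: e.g. a call into Tika
    cache_file.write_text(text, encoding="utf-8")
    return text
```

The returned text can then be sent to Solr as an ordinary document, and the same text stays available on the client for any other processing.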

Solr Cell is great for proof-of-concept, but for heavy-duty applications,
you're offloading all the processing onto the Solr server, which can be a
problem.

Here's a writeup describing how to use Tika independently of
Solr while indexing data to Solr that might help:

http://www.lucidimagination.com/blog/2012/02/14/indexing-with-solrj/

Hope that helps
Erick

On Sun, Apr 1, 2012 at 1:27 PM,  <[hidden email]> wrote:

> So my question was if I can index the original doc via
> ExtractingRequestHandler in Solr AND get back the text output, in a single
> call.

Re: ExtractingRequestHandler

Billnbell
In reply to this post by Erick Erickson
I have had good luck with creating a separate core for just the data. This is a different core from the indexed core.

Very fast.
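
A rough sketch of how a client might use such a two-core layout: search one core for matching ids, then fetch the stored text for a hit from the other. The core names search and data, the field names id and text, and the base URL are all hypothetical; nothing in the thread specifies them.

```python
import urllib.parse


def select_url(base_url, core, params):
    """Build a /select URL for one core with properly encoded parameters."""
    return f"{base_url}/{core}/select?{urllib.parse.urlencode(params)}"


def search_url(base_url, user_query):
    # The search core holds the indexed-only text, so ask for ids only.
    return select_url(base_url, "search",
                      {"q": user_query, "fl": "id", "wt": "json"})


def fetch_text_url(base_url, doc_id):
    # The data core holds the stored text; look it up by id as a quoted phrase.
    escaped = doc_id.replace("\\", "\\\\").replace('"', '\\"')
    return select_url(base_url, "data",
                      {"q": f'id:"{escaped}"', "fl": "id,text", "wt": "json"})
```

The design keeps the search index small, while the bulky stored text lives in a core that is only queried by id.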

Bill Bell


On Apr 1, 2012, at 11:15 AM, Erick Erickson <[hidden email]> wrote:

> Yes, you can, but... Generally, storing the raw input in Solr is
> not the best approach. The problem here is that pretty soon
> you get a huge index that contains *everything*. Solr was not
> intended to be a data store.

RE: ExtractingRequestHandler

spring
In reply to this post by Erick Erickson
> Solr Cell is great for proof-of-concept, but for heavy-duty applications,
> you're offloading all the processing on the Solr server, which can be a
> problem.

Good point!

Thank you


Re: ExtractingRequestHandler

Ravish Bhagdev
In reply to this post by Erick Erickson
(Bit off-topic but...) I understand that Solr isn't meant to
'store' everything, but because highlighting matches requires a field to be
stored, I would expect most people end up having to store the full document
content in their indexes? I can't think of any good workaround for
this...

Rav

On Sun, Apr 1, 2012 at 6:15 PM, Erick Erickson <[hidden email]> wrote:

> Yes, you can, but... Generally, storing the raw input in Solr is
> not the best approach. The problem here is that pretty soon
> you get a huge index that contains *everything*. Solr was not
> intended to be a data store.