Indexing and ExtractingRequestHandler

Harry Hochheiser
I'm trying to use Solr to index the contents of an Excel file, using
the ExtractingRequestHandler (CSV handler won't work for me - I need
to consider the whole spreadsheet as one document), and I'm running
into some trouble.

Is there any way to see what's going on during the indexing process?
I'm concerned that I may be losing some terms, and I'd like to
snoop on the terms as they are added to the index.
How might I do this?

Barring that, how can I inspect the index after the fact? I have tried to
use Luke to see what's in the index, but I get an error: "Unknown
format version -10". Is it possible to get Luke to work?

My Solr build is straight out of SVN.

thanks,

harry

Re: Indexing and ExtractingRequestHandler

Jan Høydahl / Cominvent
Hi,

You can run the Tika command-line tool on your Excel file. It shows you the exact text that Tika extracts, which is what will be indexed into Solr, so you can check whether anything is missing.

Are you sure you are using a version of Luke that supports your version of Lucene?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com
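
The Tika check suggested above could be scripted along these lines. This is only a sketch under assumptions: the jar name `tika-app.jar` and the file name `spreadsheet.xls` are placeholders not taken from the thread, Java is assumed to be on the PATH, and the helper names are made up for illustration. The `--text` option asks the Tika command-line app for plain-text output.

```python
import subprocess
from typing import List


def tika_text_command(tika_jar: str, document: str) -> List[str]:
    """Build the command line that asks Tika for plain-text output."""
    # tika_jar is a placeholder path; point it at your own tika-app jar.
    return ["java", "-jar", tika_jar, "--text", document]


def extract_text(tika_jar: str, document: str) -> str:
    """Run Tika and return whatever text it extracted from the document."""
    result = subprocess.run(
        tika_text_command(tika_jar, document),
        stdout=subprocess.PIPE,
        check=True,  # raise if Tika exits with an error
    )
    # Tika's output encoding depends on the platform; undecodable bytes
    # are replaced rather than raising.
    return result.stdout.decode("utf-8", errors="replace")
```

Calling `extract_text("tika-app.jar", "spreadsheet.xls")` would return the text to compare against what ends up in the index.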

In reply to Harry Hochheiser's message of 11 Aug 2010, 23:33.


Re: Indexing and ExtractingRequestHandler

Harry Hochheiser
Thanks.

I've run the Tika command line on the Excel file, and its output contains
content that doesn't appear to be indexed. I've also tried using Tika to
parse the Excel file and then using the ExtractingRequestHandler to index
the resulting text, and that doesn't work either.

As far as Luke goes, I've built it from scratch. It still bombs. Is it
possible that it's not compatible with Lucene builds based on trunk?

thanks,


-harry
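
One way to pin down which content goes missing is to compare the tokens in the Tika output with a list of terms obtained from the index. The sketch below assumes you already have both in hand. The tokenizer is a deliberately crude stand-in and not the analyzer Solr applies to the field, so stemming, stopword removal, or different splitting rules would show up as false positives. All function names are hypothetical.

```python
import re
from typing import Iterable, Set


def rough_tokens(text: str) -> Set[str]:
    """Lowercase the text and split it on runs of non-word characters.

    This only approximates a Solr analyzer chain.
    """
    return {tok for tok in re.split(r"\W+", text.lower()) if tok}


def missing_terms(extracted_text: str, indexed_terms: Iterable[str]) -> Set[str]:
    """Return tokens found in the extracted text but not among the indexed terms."""
    indexed = {term.lower() for term in indexed_terms}
    return rough_tokens(extracted_text) - indexed
```

If the missing tokens all come from the tail end of the extracted text, that points at truncation somewhere in the pipeline rather than at the parser.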

In reply to Jan Høydahl's message of Wed, Aug 11, 2010, 6:48 PM.

Re: Indexing and ExtractingRequestHandler

Lance Norskog-2
That is probably true about Luke. Trunk has a new Lucene index format
and does not read any previous format. Trunk is also a busy code base.
The 3.1 branch is slated to be the next Solr release, and is probably
a better base for your testing. Best of all is to use the Solr 1.4.1
binary release.
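
When the standalone Luke tool cannot open an index, Solr's own LukeRequestHandler may be an alternative, since it runs inside the same Solr build that wrote the index. The sketch below builds a request for the top terms of one field. It assumes the handler is registered at `/admin/luke`, that Solr listens at `http://localhost:8983/solr`, and that the field is called `text`; none of these come from the thread, and the helper names are illustrative only.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def luke_url(solr_base: str, field: str, num_terms: int = 50) -> str:
    """Build a LukeRequestHandler URL asking for the top terms of one field."""
    # fl restricts the report to one field; numTerms caps the term list.
    params = urlencode({"fl": field, "numTerms": num_terms, "wt": "json"})
    return f"{solr_base.rstrip('/')}/admin/luke?{params}"


def fetch_field_info(solr_base: str, field: str, num_terms: int = 50) -> dict:
    """Fetch and parse the handler's JSON response (needs a running Solr)."""
    with urlopen(luke_url(solr_base, field, num_terms)) as response:
        return json.loads(response.read().decode("utf-8"))
```

The exact layout of the response is not assumed here; print the returned dictionary and look for the term list of the requested field.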

In reply to Harry Hochheiser's message of Wed, Aug 11, 2010, 8:08 PM.

--
Lance Norskog