Map tika attribute to be the id in Solr Cell

Map tika attribute to be the id in Solr Cell

Eric Pugh-4
Hi all,

I want to use the Tika attribute stream_name as my unique key, which I
can do if I specify <uniqueKey>stream_name</uniqueKey> and run curl:

curl "http://localhost:8983/solr/karaoke/update/extract?ext.def.fl=text&ext.capture=stream_name&ext.map.stream_name=stream_name" -F "file=@angeleyes.kar"

However, this means that I can't use ext.metadata.prefix to
capture the other metadata fields via:

curl "http://localhost:8983/solr/karaoke/update/extract?ext.def.fl=text&ext.metadata.prefix=metadata_&ext.capture=stream_name&ext.map.stream_name=stream_name" -F "file=@angeleyes.kar"

If I do, it seems like stream_name is lost because it is now
metadata_stream_name, but I can't use that name in my ext.capture and
ext.map:

curl "http://localhost:8983/solr/karaoke/update/extract?ext.def.fl=text&ext.metadata.prefix=metadata_&ext.capture=metadata_stream_name&ext.map.metadata_stream_name=stream_name" -F "file=@angeleyes.kar"

Any ideas?  Currently seems like an either/or, but I'd like both!

Eric


-----------------------------------------------------
Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com
Free/Busy: http://tinyurl.com/eric-cal




Re: Map tika attribute to be the id in Solr Cell

Grant Ingersoll-2

On May 28, 2009, at 11:29 AM, Eric Pugh wrote:

> Hi all,
>
> I want to use the Tika attribute stream_name as my unique key, which
> I can do if I specify <uniqueKey>stream_name</uniqueKey> and run
> curl:
>
> curl "http://localhost:8983/solr/karaoke/update/extract?ext.def.fl=text&ext.capture=stream_name&ext.map.stream_name=stream_name" -F "file=@angeleyes.kar"
>


Why do you need to have the ext.capture, and why do you need to map
stream_name to stream_name?  If the name in the Tika metadata is a field
name, you don't need to map.

Also, I assume I'm missing something here: why can't you just
pass in id=<name of the stream>?  Presumably, in your examples
anyway, you have this info.  If not, I don't know where else
you are getting it from, because stream_name is a Solr thing, not a
Tika thing.  In fact, that reminds me, I should document those values
that the ERH (ExtractingRequestHandler) adds to the Metadata.


--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search

Re: Map tika attribute to be the id in Solr Cell

Eric Pugh-4
Grant, you are quite right!  I was too far down in the weeds, and
didn't need to be doing all that craziness.

However, one other comment, I saw you edited the wiki (thank you!) and  
the line:

+ It is highly recommend that you try using the extract only option to  
see what values actually get set for these.

I am not sure that is correct, although it is what I would expect.
When I run:

budapest:karaoke epugh$ curl "http://localhost:8983/solr/karaoke/update/extract?ext.def.fl=text&ext.extract.only=true" -F "file=@mccm.pdf"

The response I get back (via curl) looks like:
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">1728</int></lst><str name="mccm.pdf">&lt;?xml version="1.0" encoding="UTF-8"?&gt;

SNIP LOTS OF DOCUMENT CONTENT

&lt;/div&gt;
&lt;/body&gt;
&lt;/html&gt;
</str>
</response>

And I don't actually see the metadata fields, although I would expect
to!

Eric







Re: Map tika attribute to be the id in Solr Cell

Grant Ingersoll-2

On May 28, 2009, at 8:47 PM, Eric Pugh wrote:

> Grant, you are quite right!  I was too far down in the weeds, and
> didn't need to be doing all that craziness.
>
>
> And I don't actually see the metadata fields.  I would expect to  
> however!

What revision are you running?

The following was added to ERH on 4/24/09, r768281, (see SOLR-1128) to  
solve this exact problem:
           String[] names = metadata.names();
           NamedList metadataNL = new NamedList();
           for (int i = 0; i < names.length; i++) {
             String[] vals = metadata.getValues(names[i]);
             metadataNL.add(names[i], vals);
           }
           rsp.add(stream.getName() + "_metadata", metadataNL);


Re: Map tika attribute to be the id in Solr Cell

Eric Pugh-4
In reply to this post by Grant Ingersoll-2
Grant, I went back and tried to recreate my bug using the example
app.  Indexing example/site/tutorial.pdf, I get the error with this
command:

budapest:site epugh$ curl "http://localhost:8983/solr/update/extract?ext.def.fl=text&ext.metadata.prefix=metadata_&ext.map.stream_name=id" -F "file=@tutorial.pdf"

If I remove the ext.metadata.prefix, then I am okay, but then I can't
use dynamic fields for indexing metadata fields.  So this works, but
I have to manually create all my fields:

budapest:site epugh$ curl "http://localhost:8983/solr/update/extract?ext.def.fl=text&ext.map.stream_name=id" -F "file=@tutorial.pdf"


Eric









Re: Map tika attribute to be the id in Solr Cell

Eric Pugh-4
In reply to this post by Grant Ingersoll-2
Updating to the latest and greatest added that data, thank you for the
pointer.  I have too many copies of Solr 1.4 trunk, and I'd neglected
to update.

However, the issue with the mapping not working with
ext.metadata.prefix seems to remain:

budapest:site epugh$ curl "http://localhost:8983/solr/update/extract?ext.def.fl=text&ext.map.stream_name=id&ext.metadata.prefix=metadata_" -F "file=@tutorial.pdf"

<body><h2>HTTP ERROR: 500</h2><pre>org.apache.solr.common.SolrException: Document [null] missing required field: id

Eric







Re: Map tika attribute to be the id in Solr Cell

Grant Ingersoll-2

On May 28, 2009, at 9:46 PM, Eric Pugh wrote:

> Updating to latest and greatest added that data, thank you for the  
> pointer.  Too many copies of Solr 1.4 trunk, and I'd neglected to  
> update.
>
> However, the issue with the mapping not working with the  
> ext.metadata.prefix seems to remain:
>
> budapest:site epugh$ curl "http://localhost:8983/solr/update/extract?ext.def.fl=text&ext.map.stream_name=id&ext.metadata.prefix=metadata_" -F "file=@tutorial.pdf"
>
> <body><h2>HTTP ERROR: 500</h2><pre>org.apache.solr.common.SolrException: Document [null] missing required field: id

AFAICT, here's what's going on:

The first thing SolrContentHandler does is try to add the Metadata to
the document.  For every metadata item, it looks up the metadata name
to see if there is a mapping _and_ it attaches the prefix (see
findMappedMetadataName()).  So, in your case, stream_name is mapped to
id and then prefixed, and you end up with metadata_id, which leaves the
required id field unset.
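
To make that order of operations concrete, here is a toy model of it. This is not the real SolrContentHandler code: MetadataNameModel, resolve(), and the demo class are hypothetical names, and the only behavior modeled is what is described above (apply the ext.map mapping, then prepend ext.metadata.prefix).

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the lookup described above; NOT the real SolrContentHandler.
// All class and method names here are hypothetical.
class MetadataNameModel {
    private final Map<String, String> fieldMappings; // models ext.map.<from>=<to>
    private final String metadataPrefix;             // models ext.metadata.prefix

    MetadataNameModel(Map<String, String> fieldMappings, String metadataPrefix) {
        this.fieldMappings = fieldMappings;
        this.metadataPrefix = metadataPrefix == null ? "" : metadataPrefix;
    }

    // Apply the mapping first, then prepend the prefix to whatever came out.
    String resolve(String metadataName) {
        String mapped = fieldMappings.getOrDefault(metadataName, metadataName);
        return metadataPrefix + mapped;
    }
}

class MetadataNameDemo {
    public static void main(String[] args) {
        Map<String, String> mappings = new HashMap<>();
        mappings.put("stream_name", "id"); // ext.map.stream_name=id

        MetadataNameModel withPrefix = new MetadataNameModel(mappings, "metadata_");
        MetadataNameModel withoutPrefix = new MetadataNameModel(mappings, null);

        System.out.println(withPrefix.resolve("stream_name"));    // metadata_id
        System.out.println(withoutPrefix.resolve("stream_name")); // id
    }
}
```

With the prefix set, nothing ever lands in a field named id, which matches the "missing required field: id" error.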

I still wonder, however, why you even need to do that.  Why not just
&ext.literal.id=tutorial.pdf?

If you really want what you describe above, the current best way would  
be to extend SolrContentHandler and override findMappedMetadataName()  
to check to see if the name is stream_name or to check if it is being  
mapped to the Solr unique field.
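
A rough sketch of that override idea follows. It uses a self-contained stand-in rather than the real SolrContentHandler, whose constructor, fields, and exact findMappedMetadataName() signature are not shown in this thread and are assumed here. NameMapper, UniqueKeyAwareNameMapper, and uniqueKeyField are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the handler's name lookup. The real
// SolrContentHandler has its own constructor and fields; only the method
// name findMappedMetadataName comes from the thread.
class NameMapper {
    protected final Map<String, String> fieldMappings; // models ext.map.*
    protected final String metadataPrefix;             // models ext.metadata.prefix

    NameMapper(Map<String, String> fieldMappings, String metadataPrefix) {
        this.fieldMappings = fieldMappings;
        this.metadataPrefix = metadataPrefix;
    }

    // Assumed default behavior: map the name, then always add the prefix.
    protected String findMappedMetadataName(String name) {
        return metadataPrefix + fieldMappings.getOrDefault(name, name);
    }
}

// Override that leaves the unique key field unprefixed.
class UniqueKeyAwareNameMapper extends NameMapper {
    private final String uniqueKeyField;

    UniqueKeyAwareNameMapper(Map<String, String> fieldMappings,
                             String metadataPrefix, String uniqueKeyField) {
        super(fieldMappings, metadataPrefix);
        this.uniqueKeyField = uniqueKeyField;
    }

    @Override
    protected String findMappedMetadataName(String name) {
        String mapped = fieldMappings.getOrDefault(name, name);
        if (uniqueKeyField.equals(mapped)) {
            return mapped; // the unique key keeps its plain name
        }
        return super.findMappedMetadataName(name);
    }
}

class UniqueKeyDemo {
    public static void main(String[] args) {
        Map<String, String> mappings = new HashMap<>();
        mappings.put("stream_name", "id"); // ext.map.stream_name=id
        NameMapper mapper = new UniqueKeyAwareNameMapper(mappings, "metadata_", "id");
        System.out.println(mapper.findMappedMetadataName("stream_name")); // id
        System.out.println(mapper.findMappedMetadataName("title"));       // metadata_title
    }
}
```

Note that in this sketch a metadata item whose own name already equals the unique key would also skip the prefix; whether that is desirable depends on the schema.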

Alternatively, you could make your unique id be metadata_id.

Otherwise, I'm not sure if this is a bug or not.

HTH,
Grant