Tika Update, no Data


Jörg Agatz
Hey,

I work with Tika and Solr. At the moment I can index the document information, but not the content.

The details:

Part of my config:

<requestHandler name="/update/extract"
                class="org.apache.solr.handler.extraction.ExtractingRequestHandler"
                startup="lazy">
  <lst name="defaults">
    <str name="ext.map.Last-Modified">last_modified</str>
    <bool name="ext.ignore.und.fl">true</bool>
    <str name="fmap.content">text</str>
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>
    <str name="captureAttr">true</str>
    <str name="fmap.a">links</str>
    <str name="fmap.div">ignored_</str>
  </lst>
</requestHandler>

Part of my Schema:

<field name="id" type="string" indexed="true" stored="true"
       required="true"/>
<field name="sku" type="textTight" indexed="true" stored="true"
       omitNorms="true"/>
<field name="name" type="textgen" indexed="true" stored="true"/>
<field name="alphaNameSort" type="alphaOnlySort" indexed="true"
       stored="false"/>
<field name="manu" type="textgen" indexed="true" stored="true"
       omitNorms="true"/>
<field name="cat" type="text_ws" indexed="true" stored="true"
       multiValued="true" omitNorms="true"/>
<field name="features" type="text" indexed="true" stored="true"
       multiValued="true"/>
<field name="includes" type="text" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>
<field name="weight" type="float" indexed="true" stored="true"/>
<field name="price" type="float" indexed="true" stored="true"/>
<field name="popularity" type="int" indexed="true" stored="true"/>
<field name="inStock" type="boolean" indexed="true" stored="true"/>
<field name="title" type="text" indexed="true" stored="true"
       multiValued="true"/>
<field name="subject" type="text" indexed="true" stored="true"/>
<field name="description" type="text" indexed="true" stored="true"/>
<field name="comments" type="text" indexed="true" stored="true"/>
<field name="author" type="textgen" indexed="true" stored="true"/>
<field name="keywords" type="textgen" indexed="true" stored="true"/>
<field name="category" type="textgen" indexed="true" stored="true"/>
<field name="content_type" type="string" indexed="true" stored="true"
       multiValued="true"/>
<field name="last_modified" type="date" indexed="true" stored="true"/>
<field name="links" type="string" indexed="true" stored="true"
       multiValued="true"/>
<field name="text" type="text" indexed="true" stored="false"
       multiValued="true"/>

Curl command:

curl "http://192.168.105.66:8983/solr/update/extract?literal.id=1234&uprefix=attr_commit=true" \
  -F "myfile=@Word-Text.doc"

Result in Solr:

<doc>
  <arr name="attr_commit=trueapplication_name">
    <str>TX_WORD 10.1.210.500</str>
  </arr>
  <arr name="attr_commit=truestream_content_type">
    <str>application/octet-stream</str>
  </arr>
  <arr name="attr_commit=truestream_name">
    <str>Word-Text.doc</str>
  </arr>
  <arr name="attr_commit=truestream_size">
    <str>43592</str>
  </arr>
  <arr name="attr_commit=truestream_source_info">
    <str>myfile</str>
  </arr>
  <arr name="content_type">
    <str>application/msword</str>
  </arr>
  <str name="id">1234</str>
</doc>

_________________________________________________________
_________________________________________________________
_________________________________________________________

But I need the content too. What am I doing wrong?

Thanks for the help

King
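
A side note on the result above: the field names begin with `attr_commit=true`, which suggests that the `&` between `uprefix=attr_` and `commit=true` was left out of the URL, so the whole string was taken as the prefix and no separate commit parameter was sent. The following is a minimal sketch of that split. It uses Python's standard `urllib.parse`, not Solr's own parser (assumed here to separate parameters on `&` in the same way), and the "intended" URL is only an assumption about what was meant.

```python
from urllib.parse import urlsplit, parse_qsl

# URL exactly as posted in the curl command.
posted = ("http://192.168.105.66:8983/solr/update/extract"
          "?literal.id=1234&uprefix=attr_commit=true")
# Assumed intent: an '&' between the prefix and the commit flag.
intended = ("http://192.168.105.66:8983/solr/update/extract"
            "?literal.id=1234&uprefix=attr_&commit=true")


def query_params(url):
    """Return the query-string parameters of url as a dict."""
    return dict(parse_qsl(urlsplit(url).query))


posted_params = query_params(posted)
intended_params = query_params(intended)

print(posted_params)    # uprefix swallows "commit=true"
print(intended_params)  # uprefix and commit are separate parameters
```

With the posted URL, `uprefix` carries the value `attr_commit=true`, which matches the prefix seen on the unknown fields in the result.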

Re: Tika Update, no Data

arnaud gaudinat
On 14.01.2011 16:28, Jörg Agatz wrote:
> <field name="text" type="text" indexed="true" stored="false"
> multiValued="true"/>
If I understood your problem correctly, try:

<field name="text" type="text" indexed="true" stored="true"
multiValued="true"/>


So set stored="true" to get the content back. With stored="false" the field is only searchable and its value is not returned in results.

Arnaud
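
To check that the content now comes back, one could query the document and ask for the `text` field. Here is a sketch in Python that only builds the request URL. It assumes the default `/solr/select` handler on the host from the first post, the id `1234` used there, and that the document was re-indexed after the schema change.

```python
from urllib.parse import urlencode

SOLR = "http://192.168.105.66:8983/solr"  # host taken from the first post


def select_url(doc_id, fields):
    """Build a /select URL that asks for the given fields of one document."""
    params = {"q": f"id:{doc_id}", "fl": ",".join(fields)}
    return f"{SOLR}/select?{urlencode(params)}"


url = select_url("1234", ["id", "text"])
print(url)
```

Fetching that URL with curl or a browser should show `text` in the response only if the field is stored.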

Re: Tika Update, no Data

Jörg Agatz
Hey!

Thanks a lot, nice tip. It works fine.

But I have one more problem: indexing a ZIP file. I tried:

curl "http://192.168.105.66:8983/solr/update/extract?literal.id=zip&uprefix=attr_commit=true" \
  -F "[hidden email]"

and I get:
Warning: Illegally formatted input field!
curl: option -F: is badly used here
curl: try 'curl --help' or 'curl --manual' for more information
service@joa-Desktop:~/Downloads$

Maybe you have an idea?

Re: Tika Update, no Data

Stefan Matheis
Are you missing the = character between myfile and @filename.ext?

On Mon, Jan 17, 2011 at 2:47 PM, Jörg Agatz <[hidden email]> wrote:

> Hey!
>
> Thanks a lot, nice tip. It works fine.
>
> But I have one more problem: indexing a ZIP file. I tried:
>
> curl "http://192.168.105.66:8983/solr/update/extract?literal.id=zip&uprefix=attr_commit=true" \
>   -F "[hidden email]"
>
> and I get:
> Warning: Illegally formatted input field!
> curl: option -F: is badly used here
> curl: try 'curl --help' or 'curl --manual' for more information
> service@joa-Desktop:~/Downloads$
>
> Maybe you have an idea?
>
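
For reference, curl's `-F` option expects `name=@path` for a file upload, as in the first post's `myfile=@Word-Text.doc`. Below is a small Python sketch that assembles such a command without running it. The file name `archive.zip` is hypothetical (the real one is hidden in the post), and the URL assumes that an `&` before `commit=true` was intended.

```python
import shlex


def extract_command(url, field, path):
    """Return the argv list for uploading `path` as multipart field `field`."""
    # curl's -F needs "name=@path"; the thread shows curl rejecting the
    # option when the "=" is missing.
    return ["curl", url, "-F", f"{field}=@{path}"]


# Assumption: '&' separates uprefix and commit.
url = ("http://192.168.105.66:8983/solr/update/extract"
       "?literal.id=zip&uprefix=attr_&commit=true")
argv = extract_command(url, "myfile", "archive.zip")  # hypothetical file name

# Print a shell-safe version of the command for copy and paste.
print(shlex.join(argv))
```

Keeping the command as a list also means it could be passed to `subprocess.run` without worrying about shell quoting of the `&` in the URL.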

Re: Tika Update, no Data

Jörg Agatz
Ohh, you're right. Embarrassing!

I tried it and it works, but not perfectly: the txt documents inside the ZIP are not indexed, only the names of the documents inside the ZIP.

King