how to add more metadata to tika extraction?

eShard
Hi,
I didn't know where else to post this, so apologies in advance.
Here's my quandary:
I'm using ManifoldCF v1.1.1 to crawl non-standard (IBM) RSS feeds and custom RSS feeds.
There's additional metadata in each item that we need to capture.
I added the additional fields to the Solr schema (4.0 final), but the additional fields are nowhere to be found.
I used Fiddler to confirm that ManifoldCF is indeed sending all the data to Solr.
I can only assume that Tika is ignoring it or removing it.
I turned on <str name="uprefix">attr_</str> in solrconfig.xml, but that didn't work either.

Can anyone tell me how to modify Solr and/or Tika to accept the additional fields from the feed?
I looked into the tika.config file option, but I couldn't find any examples, and I found one post that says it's obsolete.
I also tried putting the additional metadata in the content field, but the XML tags were stripped out, leaving only the data. So I used a double pipe as a delimiter, but that had mixed results.

Here's what my solrconfig.xml extraction handler looks like for the RSS feed:
<requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="fmap.content">content</str>
    <str name="fmap.title">solr.title</str>
    <str name="fmap.name">solr.name</str>
    <str name="link">link</str>
    <str name="pubdateiso">pubdateiso</str>
    <str name="summary">summary</str>
    <str name="description">comments</str>
    <str name="authoremail">authoremail</str>
    <str name="modifier">modifier</str>
    <str name="modifieremail">modifieremail</str>
    <str name="authoremail">authoremail</str>
    <str name="published">published</str>
    <str name="updated">updated</str>
    <str name="modified">modified</str>
    <str name="created">created</str>
    <str name="fmap.Last-Modified">last_modified</str>
    <str name="uprefix">attr_</str>
    <str name="lowernames">true</str>
    <str name="fmap.div">ignored_</str>
  </lst>
  <lst name="date.formats">
    <str>yyyy-MM-dd</str>
  </lst>
</requestHandler>

Please advise...

Thanks,

Re: how to add more metadata to tika extraction?

Nick Burch-2
On Wed, 27 Feb 2013, eShard wrote:
> Here's my quandary:
> I'm using manifoldcf v1.1.1 to crawl non standard (IBM) RSS feeds and custom
> RSS feeds.
> There's additional metadata in each item that we need to capture.
> I added the additional fields to the Solr schema (4.0 final) but the
> additional fields are nowhere to be found.

Does Tika extract this metadata? Maybe try using the tika-app with
--metadata to check. That'll let us know if the problem is with getting
the metadata out of the RSS feed, or with how the Solr plugin handles the
data.
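
A minimal sketch of that check, assuming the tika-app jar has been downloaded and the feed saved locally. The file names tika-app.jar and feed.xml and the helper names are placeholders, not anything from the thread; --metadata, --xml and --text are tika-app output flags.

```python
import subprocess
from pathlib import Path

# Hypothetical paths: point these at the real tika-app jar and saved feed.
TIKA_APP_JAR = Path("tika-app.jar")
FEED_FILE = Path("feed.xml")


def tika_command(jar, document, mode="--metadata"):
    """Build the tika-app command line for one document.

    mode is a tika-app output flag such as --metadata, --xml or --text.
    """
    return ["java", "-jar", str(jar), mode, str(document)]


def run_tika(jar, document, mode="--metadata"):
    """Run tika-app and return what it prints on standard output."""
    result = subprocess.run(
        tika_command(jar, document, mode),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    if TIKA_APP_JAR.exists() and FEED_FILE.exists():
        print(run_tika(TIKA_APP_JAR, FEED_FILE))
    else:
        # Nothing to run against, so just show the command that would be used.
        print(" ".join(tika_command(TIKA_APP_JAR, FEED_FILE)))
```

Running the same command by hand in a terminal works just as well; the wrapper only makes it easy to repeat the check with a different flag.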

Nick

Re: how to add more metadata to tika extraction?

eShard
Hi Nick,
Sorry, but can you tell me how to do that exactly?

Thanks for the reply, I greatly appreciate it.

Re: how to add more metadata to tika extraction?

eShard
Ok,
I figured it out.
I manually ran tika-app with --gui and dropped the RSS feed into it.
Here's the metadata it output:

Content-Length: 615913
Content-Type: application/rss+xml
dc:description: This is an IBM C3 Public Files feed generated by a Java application.
dc:title: IBM - C3 Public Files RSS feed
description: This is an IBM C3 Public Files feed generated by a Java application.
title: IBM - C3 Public Files RSS feed

That's not what I was expecting. Where are the items?
The items are in the XML, but Tika isn't showing them.

I tried using it on the original IBM feed, but it failed with SSL errors.
So I saved the feed as an XML file and gave it to Tika, and it had even less metadata:
Content-Length: 2068565
Content-Type: application/xml
resourceName: c3files-2-6-2013.xml

Please advise...

Thanks,



Re: how to add more metadata to tika extraction?

Nick Burch-2
On Wed, 27 Feb 2013, eShard wrote:

> I manually ran the tika-app --gui and I dropped the rss feed into it.
> Here's what the metadata output:
>
> Content-Length: 615913
> Content-Type: application/rss+xml
> dc:description: This is an IBM C3 Public Files feed generated by a Java
> application.
> dc:title: IBM - C3 Public Files RSS feed
> description: This is an IBM C3 Public Files feed generated by a Java
> application.
> title: IBM - C3 Public Files RSS feed

Looks like the metadata you want isn't being pulled out as metadata by
Tika.

> that's not what I was expecting. where are the items? the items are in
> the xml but tika isn't showing them...

Metadata != content

I'd suspect that if you look at the content output (e.g. run tika-app with
the --xml flag rather than --gui) you'll see the items there. Do you?
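
To illustrate the difference, here is a rough sketch that lists the per-item fields of an RSS feed using only the Python standard library. The sample feed, the authoremail element and the helper names are made up for illustration; they are not taken from the IBM feed. The point is that these values live inside each item in the document body, whereas the metadata output quoted above only carries the feed-level title and description.

```python
import xml.etree.ElementTree as ET

# Made-up sample feed; a real feed would be read from a file instead.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <description>Feed-level description</description>
    <item>
      <title>First file</title>
      <link>http://example.com/files/1</link>
      <authoremail>alice@example.com</authoremail>
    </item>
    <item>
      <title>Second file</title>
      <link>http://example.com/files/2</link>
      <authoremail>bob@example.com</authoremail>
    </item>
  </channel>
</rss>
"""


def local_name(tag):
    """Drop any '{namespace}' prefix ElementTree puts in front of a tag."""
    return tag.rsplit("}", 1)[-1]


def item_fields(feed_xml):
    """Return one dict per <item>, mapping child element name to its text."""
    root = ET.fromstring(feed_xml)
    items = []
    for element in root.iter():
        if local_name(element.tag) == "item":
            items.append(
                {local_name(child.tag): (child.text or "").strip() for child in element}
            )
    return items


if __name__ == "__main__":
    for fields in item_fields(SAMPLE_FEED):
        print(fields)
```

Comparing a listing like this with the --metadata and --xml output shows which per-item values only ever appear in the content.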

Nick