rdf output

turnguard
hi, i just checked out Tika for the first time.

since i could make good use of it, i have a few newbie questions.

1. when will Tika switch to Apache PDFBox? (is it still not mature enough?)
2. is it possible to skip html tags with Tika? (say i don't want to have <script> or <style> contents in my resulting plain text)

and most important

3. are there any plans for outputting the result as RDF? (currently i'm using Aperture, but i would be more than happy to switch to an Apache project,
    and i'm also willing to contribute on that one.)

any insight appreciated
wkr www.turnguard.com

Re: rdf output

Jukka Zitting
Hi,

On Sun, Sep 20, 2009 at 10:22 PM, jakobitsch juergen
<[hidden email]> wrote:
> 1. when will Tika switch to Apache PDFBox? (is it still not mature enough?)

As soon as the PDFBox 0.8.0 release is officially out and available
from the central Maven repository. I expect this to happen within a
week, so Tika 0.5 will be based on Apache PDFBox.

> 2. is it possible to skip html tags with Tika? (say i don't want to have <script>
> or <style> contents in my resulting plain text)

Yes. That's actually what the HTML parser in Tika is programmed to do
by default. See the DISCARD_ELEMENTS set in
org.apache.tika.parser.html.HtmlParser.
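
For illustration, the effect of such a discard set can be sketched with plain JDK SAX classes. This is not Tika's actual code: the class name `DiscardingTextHandler` and the helper `extractText` are hypothetical, and the input is assumed to be well-formed XHTML so that the JDK's XML parser can read it (real-world HTML needs a more lenient parser).

```java
import java.io.StringReader;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch, not Tika's implementation: collects character data
// but drops everything nested inside the elements listed in DISCARD_ELEMENTS.
class DiscardingTextHandler extends DefaultHandler {
    private static final Set<String> DISCARD_ELEMENTS = Set.of("script", "style");

    private final StringBuilder text = new StringBuilder();
    private int discardDepth = 0; // greater than 0 while inside a discarded element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Enter discard mode on <script>/<style>, and track nesting while in it.
        if (discardDepth > 0 || DISCARD_ELEMENTS.contains(qName.toLowerCase(Locale.ROOT))) {
            discardDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (discardDepth > 0) {
            discardDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Only text outside discarded elements reaches the output.
        if (discardDepth == 0) {
            text.append(ch, start, length);
        }
    }

    // Parses well-formed XHTML and returns the text that was not discarded.
    static String extractText(String xhtml) {
        DiscardingTextHandler handler = new DiscardingTextHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XHTML", e);
        }
        return handler.text.toString();
    }
}
```

Calling `DiscardingTextHandler.extractText(...)` on a page with a `<style>` block in the head and a `<script>` block in the body returns only the visible paragraph text.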

> 3. are there any plans for outputting the result as RDF? (currently i'm using Aperture,
> but i would be more than happy to switch to an Apache project, and i'm also willing
> to contribute on that one.)

We've had discussions about using XMP for expressing and handling
extracted document metadata. So far we haven't reached clear consensus
and not much work has yet been done about this, but contributions are
of course welcome.
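
Purely as an illustration of what RDF output of extracted metadata could look like (nothing here is an existing or planned Tika API), the sketch below writes a map of metadata values as N-Triples. The class `NTriplesSketch`, its method names, and the choice of Dublin Core property IRIs are all assumptions made for this example, and the IRIs passed in are assumed to be valid already.

```java
import java.util.Map;

// Hypothetical sketch: serializes (property IRI -> literal value) pairs
// about one document as N-Triples. Not part of Tika.
class NTriplesSketch {
    // Dublin Core element set namespace, used here only as an example vocabulary.
    static final String DC = "http://purl.org/dc/elements/1.1/";

    // Escapes the characters that must not appear raw inside an N-Triples literal.
    static String escapeLiteral(String value) {
        StringBuilder sb = new StringBuilder();
        for (char c : value.toCharArray()) {
            switch (c) {
                case '\\': sb.append("\\\\"); break;
                case '"':  sb.append("\\\""); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    // Emits one triple per map entry, in the map's iteration order.
    static String toNTriples(String documentIri, Map<String, String> properties) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : properties.entrySet()) {
            sb.append('<').append(documentIri).append("> <")
              .append(e.getKey()).append("> \"")
              .append(escapeLiteral(e.getValue())).append("\" .\n");
        }
        return sb.toString();
    }
}
```

For example, `NTriplesSketch.toNTriples("http://example.org/doc.pdf", Map.of(NTriplesSketch.DC + "title", "Some title"))` yields a single line stating the document's title.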

BR,

Jukka Zitting

Re: rdf output

kkrugler
Hi Jukka,

On Sep 20, 2009, at 2:26pm, Jukka Zitting wrote:

>> 2. is it possible to skip html tags with Tika? (say i don't want to have <script>
>> or <style> contents in my resulting plain text)
>
> Yes. That's actually what the HTML parser in Tika is programmed to do
> by default. See the DISCARD_ELEMENTS set in
> org.apache.tika.parser.html.HtmlParser.

I recently ran into the need to customize the behavior of the
HtmlParser, in terms of which tags it passes through.

In particular, the <span> tag contained attributes I wanted, but these  
aren't part of the "SAFE_ELEMENTS" set.

1. From what I can see, <span> should be part of the XHTML safe set.

2. It would be great to have some way to easily customize this  
behavior, e.g. a protected isSafeElement() method.

3. It looks like the code currently skips calling
startElement/endElement for non-safe tags, but still outputs any
characters found between those tags.

Depending on where the non-safe tag occurs, this could result in
an invalid XHTML document, e.g. if you had

<body><non-safe tag>some text</non-safe tag></body>

this would output

<body>some text</body>
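
To make points 2 and 3 concrete, here is a minimal sketch using only JDK SAX classes. It is not the HtmlParser code: `SafeElementFilter`, `SpanFriendlyFilter`, and the contents of the safe set are hypothetical, and the input is assumed to be well-formed XHTML. Unsafe tags are dropped while their text is kept, and a protected `isSafeElement()` hook lets a subclass let `<span>` through.

```java
import java.io.StringReader;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: re-serializes only "safe" elements, but always
// passes character data through, mirroring the behavior described above.
class SafeElementFilter extends DefaultHandler {
    private static final Set<String> SAFE_ELEMENTS = Set.of("html", "head", "title", "body", "p", "a");
    private final StringBuilder out = new StringBuilder();

    // Extension point: subclasses can widen or narrow the safe set.
    protected boolean isSafeElement(String name) {
        return SAFE_ELEMENTS.contains(name);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (!isSafeElement(qName)) {
            return; // tag is dropped, its text content is not
        }
        out.append('<').append(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            out.append(' ').append(atts.getQName(i)).append("=\"")
               .append(escape(atts.getValue(i))).append('"');
        }
        out.append('>');
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (isSafeElement(qName)) {
            out.append("</").append(qName).append('>');
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        out.append(escape(new String(ch, start, length)));
    }

    // Parses well-formed XHTML and returns the filtered markup.
    String filter(String xhtml) {
        out.setLength(0);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), this);
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XHTML", e);
        }
        return out.toString();
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace("\"", "&quot;");
    }
}

// Customization via the hook: additionally treat <span> as safe.
class SpanFriendlyFilter extends SafeElementFilter {
    @Override
    protected boolean isSafeElement(String name) {
        return "span".equals(name) || super.isSafeElement(name);
    }
}
```

With the base class, an unsafe tag such as `<font>` inside `<body>` disappears and leaves bare text directly under `<body>`, which is exactly the case that can produce invalid XHTML. With `SpanFriendlyFilter`, a `<span>` and its attributes survive.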

Thanks,

-- Ken
