[jira] Created: (TIKA-272) Expose characters offsets information while parsing text-based inputs.


Akash (Jira)
Expose characters offsets information while parsing text-based inputs.
----------------------------------------------------------------------

                 Key: TIKA-272
                 URL: https://issues.apache.org/jira/browse/TIKA-272
             Project: Tika
          Issue Type: New Feature
          Components: parser
    Affects Versions: 0.4
            Reporter: David Causse
            Priority: Minor


It would be useful to have access to actual character offset information when parsing text-based files (I don't know whether this is interesting, usable, or doable for binary formats).
If I use Tika to parse HTML and feed the parsed strings into Lucene, I cannot tell the Lucene analyzer where each character is located in the original input.
If Tika exposed this information, it would be possible to use unmodified Lucene analyzers behind Tika and implement, for example, highlighting of search results in the original document (see the Google cache view).
With the new Lucene Attribute API, it could be fairly easy to provide a sort of TikaOffsetRectifierTokenFilter in Lucene contrib and use a stack like Tika -> unmodified Lucene analyzer -> Tika offset correction.
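
To make the "offset correction" step concrete, here is a minimal sketch in plain Java, with no Tika or Lucene dependency. It assumes the parser could report, for each run of extracted text, where that run started in the original input. The class name OffsetMap and its methods are hypothetical and are not part of either library.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: maps offsets in the extracted text back to the original input. */
class OffsetMap {
    // One segment records that extracted text [extractedStart, extractedStart + length)
    // was copied verbatim from the original input starting at originalStart.
    private static final class Segment {
        final int extractedStart;
        final int originalStart;
        final int length;

        Segment(int extractedStart, int originalStart, int length) {
            this.extractedStart = extractedStart;
            this.originalStart = originalStart;
            this.length = length;
        }
    }

    private final List<Segment> segments = new ArrayList<>();

    void addSegment(int extractedStart, int originalStart, int length) {
        segments.add(new Segment(extractedStart, originalStart, length));
    }

    /** Returns the original offset of an extracted character, or -1 if it is not covered. */
    int toOriginal(int extractedOffset) {
        for (Segment s : segments) {
            if (extractedOffset >= s.extractedStart
                    && extractedOffset < s.extractedStart + s.length) {
                return s.originalStart + (extractedOffset - s.extractedStart);
            }
        }
        return -1;
    }

    /** Maps an exclusive end offset by mapping the last character and adding one. */
    int toOriginalEnd(int extractedEnd) {
        int last = toOriginal(extractedEnd - 1);
        return last < 0 ? -1 : last + 1;
    }
}
```

For the input `<p>Hello <b>world</b></p>`, the extracted text is "Hello world", which gives the segments (0, 3, 6) and (6, 12, 5). A token "world" that an unmodified analyzer reports at extracted offsets 6 to 11 would then map to 12 to 17 in the original HTML. A token filter like the one proposed above could apply this kind of lookup to each token's start and end offsets.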

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (TIKA-272) Expose characters offsets information while parsing text-based inputs.

Akash (Jira)

    [ https://issues.apache.org/jira/browse/TIKA-272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753880#action_12753880 ]

Jukka Zitting commented on TIKA-272:
------------------------------------

There are basically two ways for us to do this:

1) Use the SAX Locator API to report current parse location (line, column) whenever the ContentHandler implementation wants to know it.

2) Explicitly add XML attributes like tika:location="..." to the XHTML elements emitted by a parser.

The latter option would be more accurate and could also be adapted to things like PDF coordinates, etc., so that seems like a better alternative.
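
For comparison, the first option can be sketched with the standard JDK SAX parser. The handler below (LocatingHandler is a hypothetical name) stores the Locator it receives and records the reported line and column for each characters() event. Note that a SAX Locator reports line and column rather than a character offset, and the SAX documentation describes the position as approximate, which fits the accuracy concern above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler that records the parse location of each text chunk. */
class LocatingHandler extends DefaultHandler {
    private Locator locator;
    final List<String> chunks = new ArrayList<>();
    final List<Integer> lines = new ArrayList<>();
    final List<Integer> columns = new ArrayList<>();

    @Override
    public void setDocumentLocator(Locator locator) {
        // Called by the parser before startDocument(), if it supports locators.
        this.locator = locator;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        chunks.add(new String(ch, start, length));
        // -1 marks "unknown" when the parser supplied no locator.
        lines.add(locator != null ? locator.getLineNumber() : -1);
        columns.add(locator != null ? locator.getColumnNumber() : -1);
    }

    /** Parses an XML string and returns the handler with the recorded locations. */
    static LocatingHandler parse(String xml) {
        LocatingHandler handler = new LocatingHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return handler;
    }
}
```

Calling `LocatingHandler.parse("<p>hello</p>")` yields the recorded text chunks alongside the line and column the parser reported for each of them.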

I'm not sure how to handle all the details here. Do we have a simple, concrete use case that could serve as both an example and a test case for a first approximation of the implementation?
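
As one possible starting point for such a test case, the attribute-based option could look roughly like the sketch below. Everything specific here is an assumption made for illustration: the namespace URI, the "start-end" value format, and the LocationEmitter class are hypothetical and are not an existing Tika API.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch: emit an XHTML paragraph carrying a tika:location attribute. */
class LocationEmitter {
    static final String XHTML = "http://www.w3.org/1999/xhtml";
    // Placeholder namespace, chosen only for this example.
    static final String TIKA_NS = "http://example.org/tika-location";

    /** Emits <p tika:location="start-end">text</p> to the given handler. */
    static void emitParagraph(ContentHandler handler, String text, int start, int end)
            throws SAXException {
        AttributesImpl attrs = new AttributesImpl();
        // Assumed value format: character offsets in the original input, end exclusive.
        attrs.addAttribute(TIKA_NS, "location", "tika:location", "CDATA", start + "-" + end);
        handler.startElement(XHTML, "p", "p", attrs);
        handler.characters(text.toCharArray(), 0, text.length());
        handler.endElement(XHTML, "p", "p");
    }

    /** Emits one paragraph and returns the location value a consumer would read back. */
    static String locationOf(String text, int start, int end) {
        final String[] seen = new String[1];
        DefaultHandler consumer = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName,
                                     Attributes atts) {
                seen[0] = atts.getValue(TIKA_NS, "location");
            }
        };
        try {
            emitParagraph(consumer, text, start, end);
        } catch (SAXException e) {
            throw new IllegalStateException("emit failed", e);
        }
        return seen[0];
    }
}
```

A test could then assert that the paragraph text "Hello " taken from offsets 3 to 9 of an HTML input arrives at the consumer with the location value "3-9".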

