[jira] Created: (TIKA-596) NetCDF and HDF files don't parse correctly from the command line via tika-app

Hudson (Jira)
NetCDF and HDF files don't parse correctly from the command line via tika-app
-----------------------------------------------------------------------------

                 Key: TIKA-596
                 URL: https://issues.apache.org/jira/browse/TIKA-596
             Project: Tika
          Issue Type: Bug
          Components: packaging, parser
    Affects Versions: 0.8
         Environment: while prepping 0.9 RC
            Reporter: Chris A. Mattmann
            Assignee: Chris A. Mattmann
            Priority: Blocker
             Fix For: 0.9


The tika-app command line interface seems to be broken for HDF and NetCDF files. For example:

{noformat}
[chipotle:trunk/tika-app/target] mattmann% java -jar tika-app-0.9-SNAPSHOT.jar -m /Users/mattmann/src/tika/trunk/tika-parsers/target/test-classes/test-documents/test.he5
[chipotle:trunk/tika-app/target] mattmann%
{noformat}

and:

{noformat}
[chipotle:trunk/tika-app/target] mattmann% java -jar tika-app-0.9-SNAPSHOT.jar -m /Users/mattmann/src/tika/tags/0.8/tika-parsers/src/test/resources/test-documents/sresa1b_ncar_ccsm3_0_run1_200001.nc
[chipotle:trunk/tika-app/target] mattmann%
{noformat}

--
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] Resolved: (TIKA-596) NetCDF and HDF files don't parse correctly from the command line via tika-app

Hudson (Jira)

     [ https://issues.apache.org/jira/browse/TIKA-596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann resolved TIKA-596.
------------------------------------

    Resolution: Fixed

- Fixed in r1070359. I added a special NoDocumentMetHandler that checks whether endDocument was called (which only happens when the parser produces XHTML output). If endDocument isn't called, the OutputType in use forces the call so that the metadata still gets output.
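
A minimal sketch of the workaround described above, assuming only the JDK's SAX classes. EndDocumentTrackingHandler, TrackerDemo, and the sample metadata value are hypothetical names for illustration; this is not the code committed in r1070359, and the real NoDocumentMetHandler may differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for the handler described above (not the r1070359 code).
// It prints the metadata in endDocument and remembers whether that call happened.
class EndDocumentTrackingHandler extends DefaultHandler {
    private final Map<String, String> metadata;
    private final StringBuilder out;
    private boolean endDocumentCalled = false;

    EndDocumentTrackingHandler(Map<String, String> metadata, StringBuilder out) {
        this.metadata = metadata;
        this.out = out;
    }

    @Override
    public void endDocument() {
        endDocumentCalled = true;
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            out.append(e.getKey()).append(": ").append(e.getValue()).append('\n');
        }
    }

    boolean wasEndDocumentCalled() {
        return endDocumentCalled;
    }
}

class TrackerDemo {
    // Simulates a metadata-only parser: it fills in metadata but emits no SAX events.
    static void metadataOnlyParse(DefaultHandler handler, Map<String, String> metadata) {
        metadata.put("Content-Type", "application/x-netcdf"); // illustrative value
    }

    static String run() {
        Map<String, String> metadata = new LinkedHashMap<>();
        StringBuilder out = new StringBuilder();
        EndDocumentTrackingHandler handler = new EndDocumentTrackingHandler(metadata, out);
        metadataOnlyParse(handler, metadata);
        // The driver-side workaround: force the call if the parser never made it.
        if (!handler.wasEndDocumentCalled()) {
            handler.endDocument();
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(run());
    }
}
```

The check lives in the driver rather than in the parser, which is why it counts as special-case code in the command-line tool.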

> NetCDF and HDF files don't parse correctly from the command line via tika-app
> -----------------------------------------------------------------------------
>
>                 Key: TIKA-596
>                 URL: https://issues.apache.org/jira/browse/TIKA-596
>             Project: Tika
>          Issue Type: Bug
>          Components: packaging, parser
>    Affects Versions: 0.8
>         Environment: while prepping 0.9 RC
>            Reporter: Chris A. Mattmann
>            Assignee: Chris A. Mattmann
>            Priority: Blocker
>              Labels: cmd, error, hdf, line, netcdf, packaging, tika-app
>             Fix For: 0.9
>
>
> The tika-app command line interface seems to be broken for HDF and NetCDF files. For example:
> {noformat}
> [chipotle:trunk/tika-app/target] mattmann% java -jar tika-app-0.9-SNAPSHOT.jar -m /Users/mattmann/src/tika/trunk/tika-parsers/target/test-classes/test-documents/test.he5
> [chipotle:trunk/tika-app/target] mattmann%
> {noformat}
> and:
> {noformat}
> [chipotle:trunk/tika-app/target] mattmann% java -jar tika-app-0.9-SNAPSHOT.jar -m /Users/mattmann/src/tika/tags/0.8/tika-parsers/src/test/resources/test-documents/sresa1b_ncar_ccsm3_0_run1_200001.nc
> [chipotle:trunk/tika-app/target] mattmann%
> {noformat}


[jira] Commented: (TIKA-596) NetCDF and HDF files don't parse correctly from the command line via tika-app

Hudson (Jira)

    [ https://issues.apache.org/jira/browse/TIKA-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12994288#comment-12994288 ]

Jukka Zitting commented on TIKA-596:
------------------------------------

After thinking about this a bit, I believe an architecturally better solution would be for the parsers to always output XHTML, even if it is empty. This is more in line with the expectations set in the Parser javadoc and avoids the need for special-case code like the one you added in revision 1070359.

Adding empty XHTML output is as simple as this:

{code}
XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
xhtml.startDocument();
xhtml.endDocument();
{code}
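
To illustrate why the empty document matters, here is a self-contained sketch that uses only the JDK's SAX interfaces in place of Tika's classes. MetadataOnEndHandler, EmptyDocumentDemo, and the sample metadata are hypothetical; the assumption is that the output handler writes metadata only when endDocument arrives, as described earlier in this thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical output handler: writes the metadata only when the document ends.
class MetadataOnEndHandler extends DefaultHandler {
    final Map<String, String> metadata;
    final StringBuilder out = new StringBuilder();

    MetadataOnEndHandler(Map<String, String> metadata) {
        this.metadata = metadata;
    }

    @Override
    public void endDocument() {
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            out.append(e.getKey()).append(": ").append(e.getValue()).append('\n');
        }
    }
}

class EmptyDocumentDemo {
    // Metadata-only parser that emits no SAX events at all.
    static void parseSilently(ContentHandler handler, Map<String, String> metadata) {
        metadata.put("title", "example"); // illustrative value
    }

    // Same parser, but it always emits an (empty) document.
    static void parseWithEmptyDocument(ContentHandler handler, Map<String, String> metadata)
            throws SAXException {
        metadata.put("title", "example"); // illustrative value
        handler.startDocument();
        handler.endDocument();
    }

    static String output(boolean emitEmptyDocument) {
        Map<String, String> metadata = new LinkedHashMap<>();
        MetadataOnEndHandler handler = new MetadataOnEndHandler(metadata);
        try {
            if (emitEmptyDocument) {
                parseWithEmptyDocument(handler, metadata);
            } else {
                parseSilently(handler, metadata);
            }
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return handler.out.toString();
    }

    public static void main(String[] args) {
        System.out.println("silent parser printed: [" + output(false) + "]");
        System.out.println("empty-document parser printed: [" + output(true) + "]");
    }
}
```

With the events emitted by the parser itself, the handler needs no knowledge of which parsers are metadata-only.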




[jira] Commented: (TIKA-596) NetCDF and HDF files don't parse correctly from the command line via tika-app

Hudson (Jira)

    [ https://issues.apache.org/jira/browse/TIKA-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12994325#comment-12994325 ]

Chris A. Mattmann commented on TIKA-596:
----------------------------------------

No probs, Jukka. I'll add those method calls to the NetCDF and HDF parsers, and remove my workaround in the command line driver. After that I'll roll 0.9 RC #2.

