Contribution of parser for FITS file format to Apache Tika

Contribution of parser for FITS file format to Apache Tika

Rahul Khanna
Hi,

I'm a developer who has used Apache Tika in a Research Data Repository
System at The Australian National University. As part of the
requirements of the project we extended the functionality of Apache Tika
by creating a parser that extracts the headers of files in the FITS
format
(http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=657)
using the nom.tam.fits library available at
http://heasarc.gsfc.nasa.gov/docs/heasarc/fits/java/v1.0/ .

 

Apache Tika already has the ability to identify FITS files (without
parsing them) as per https://issues.apache.org/jira/browse/TIKA-874 . Is
your team willing to review and potentially incorporate the parser into
Tika? The parser in its current form is available at
https://github.com/anu-doi/anudc/blob/master/DcShared/src/main/java/au/edu/anu/dcbag/metadata/FitsParser.java .

 

Thank you,

Rahul Khanna

[hidden email]

 


Re: Contribution of parser for FITS file format to Apache Tika

Nick Burch-2
On Wed, 5 Dec 2012, Rahul Khanna wrote:
> I'm a developer who has used Apache Tika in a Research Data Repository
> System at The Australian National University. As part of the
> requirements of the project we extended the functionality of Apache Tika
> by creating a parser that extracts the headers of files in the FITS
> format
> (http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=657)
> using the nom.tam.fits library available at
> http://heasarc.gsfc.nasa.gov/docs/heasarc/fits/java/v1.0/ .

Four questions spring to mind:
* How stable is the nom.tam.fits library? Lots of changes at the moment,
   or few?
* Is the library already in maven central?
* How complicated is the parser? Is it a fairly simple one (basically call
   the library, then process the output into Tika structures/formats) or
   does it do a large amount of work?
* Are there unit tests?
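
To make the "fairly simple" case concrete, here is a rough sketch of what header extraction amounts to, assuming the standard FITS layout of 80-character ASCII header cards terminated by an END card. It uses only the JDK: the class and method names are hypothetical, the hand-rolled card reading stands in for the nom.tam.fits library, and the plain Map stands in for Tika's metadata object. It is not the ANU parser and makes no claim about how that parser is written.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only: not the ANU FitsParser, not the nom.tam.fits API.
class FitsHeaderSketch {
    static final int CARD_LENGTH = 80; // each FITS header card is 80 ASCII characters

    // Reads cards up to END and returns keyword -> raw value text (comments stripped).
    static Map<String, String> readPrimaryHeader(InputStream in) throws IOException {
        Map<String, String> header = new LinkedHashMap<>();
        DataInputStream data = new DataInputStream(in);
        byte[] card = new byte[CARD_LENGTH];
        while (true) {
            data.readFully(card); // EOFException if the header is truncated
            String text = new String(card, StandardCharsets.US_ASCII);
            String keyword = text.substring(0, 8).trim();
            if (keyword.equals("END")) {
                break;
            }
            // Cards that carry a value have "= " in columns 9-10.
            if (text.startsWith("= ", 8)) {
                String value = text.substring(10);
                int slash = commentStart(value);
                if (slash >= 0) {
                    value = value.substring(0, slash);
                }
                header.put(keyword, value.trim());
            }
        }
        return header;
    }

    // Convenience overload for in-memory data; wraps the checked exception.
    static Map<String, String> readPrimaryHeader(byte[] bytes) {
        try {
            return readPrimaryHeader(new ByteArrayInputStream(bytes));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Index of the first '/' outside a quoted string, or -1 if there is none.
    static int commentStart(String value) {
        boolean inQuote = false;
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (c == '\'') {
                inQuote = !inQuote;
            } else if (c == '/' && !inQuote) {
                return i;
            }
        }
        return -1;
    }

    // Pads a card to the fixed 80-character width.
    static String card(String text) {
        return String.format("%-80s", text);
    }

    public static void main(String[] args) {
        String header = card(String.format("%-8s= %20s / standard FITS", "SIMPLE", "T"))
                + card(String.format("%-8s= %20s", "BITPIX", "16"))
                + card("OBJECT  = 'M31 / core' / target name")
                + card("END");
        System.out.println(readPrimaryHeader(header.getBytes(StandardCharsets.US_ASCII)));
    }
}
```

A parser of this shape would copy each map entry into the Tika metadata for the document; anything beyond that (data units, extensions, typed values) is where the "large amount of work" would come from.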

Nick

Re: Contribution of parser for FITS file format to Apache Tika

Mattmann, Chris A (3010)
In reply to this post by Rahul Khanna
Hey Rahul,

This is great and I'm totally willing to work with you to shepherd this
in. The first step would be to create a JIRA issue for your parser, and
then to submit a patch to incorporate it into the tika-parsers module. Of
course, you can start with changing the namespace to org.apache.* (from
its current au.edu.anu.* package).

Then, it would be nice to create a unit test for the parser, and include a
sample FITS file that the unit tests can run against. There are a number
of existing examples under test-resources within tika-parsers.
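
If no small sample file is at hand, one option is to generate one. The sketch below builds a minimal header-only FITS file in memory, assuming the standard layout of 80-character cards padded with spaces to a 2880-byte block. The class name, method name, and output file name are hypothetical and are not part of Tika or nom.tam.fits.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper for producing a tiny test resource; not part of Tika.
class SampleFitsFile {
    static final int CARD_LENGTH = 80;    // header card width
    static final int BLOCK_LENGTH = 2880; // FITS files are made of 2880-byte blocks

    // Fixed-format card: keyword in columns 1-8, "= " in 9-10, value right-justified to column 30.
    static String valueCard(String keyword, String value) {
        return String.format("%-8s= %20s", keyword, value);
    }

    // Returns one header block with the mandatory keywords and no data unit (NAXIS = 0).
    static byte[] minimalHeaderOnlyFile() {
        String[] cards = {
            valueCard("SIMPLE", "T"),
            valueCard("BITPIX", "8"),
            valueCard("NAXIS", "0"),
            "END"
        };
        byte[] block = new byte[BLOCK_LENGTH];
        Arrays.fill(block, (byte) ' '); // unused header space is padded with ASCII spaces
        for (int i = 0; i < cards.length; i++) {
            byte[] ascii = cards[i].getBytes(StandardCharsets.US_ASCII);
            System.arraycopy(ascii, 0, block, i * CARD_LENGTH, ascii.length);
        }
        return block;
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = minimalHeaderOnlyFile();
        if (args.length > 0) {
            // Write the sample to the path given on the command line.
            Files.write(Paths.get(args[0]), bytes);
        }
        System.out.println("Generated " + bytes.length + " bytes");
    }
}
```

Running it with a path argument (for example a hypothetical `sample-minimal.fits`) writes the file; a unit test could then assert that the parser reports SIMPLE, BITPIX and NAXIS for it. A real instrument file would exercise more keywords, so this is only a starting point.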

While you are doing all this, you might want to file an Apache Individual
Contributor License Agreement (ICLA) -- and to submit the application to
[hidden email] to cover your contributions:

http://www.apache.org/licenses/icla.txt


Again I'd be happy to help and thanks for wanting to contribute to the
project!

Cheers,
Chris

On 12/4/12 3:18 PM, "Rahul Khanna" <[hidden email]> wrote:

>Hi,
>
>I'm a developer who has used Apache Tika in a Research Data Repository
>System at The Australian National University. As part of the
>requirements of the project we extended the functionality of Apache Tika
>by creating a parser that extracts the headers of files in the FITS
>format
>(http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=657)
>using the nom.tam.fits library available at
>http://heasarc.gsfc.nasa.gov/docs/heasarc/fits/java/v1.0/ .
>
>
>
>Apache Tika already has the ability to identify FITS files (without
>parsing them) as per https://issues.apache.org/jira/browse/TIKA-874 . Is
>your team willing to review and potentially incorporate the parser into
>Tika? The parser in its current form is available at
>https://github.com/anu-doi/anudc/blob/master/DcShared/src/main/java/au/edu/anu/dcbag/metadata/FitsParser.java .
>
>
>
>Thank you,
>
>Rahul Khanna
>
>[hidden email]
>
>
>