[DISCUSS] Centralizing JSON handling of Metadata

Allison, Timothy B.
All,

  Nick recommended I put the question to the dev list for discussion.  It might be useful to centralize our JSON handling of Metadata.  We are currently using different libraries and doing different things in the CLI and in tika-server.

 1) Do we want to centralize JSON handling of Metadata?

 2) If so, where?  Core?  I share Nick's hesitance to add a dependency to core.  OTOH, GSON is only 186k, but this would add potential for jar conflicts with folks integrating Tika, and it doesn't feel like a core function to me...it is a handy decorator for applications.

 3) Wherever it goes, what package do we want to put it in?  I like Nick's recommendations, with a slight preference for the second (org.apache.tika.utils.json).

Thank you!

          Best,

                  Tim

-----Original Message-----
From: Nick Burch (JIRA) [mailto:[hidden email]]
Sent: Wednesday, May 28, 2014 12:41 PM
To: [hidden email]
Subject: [jira] [Commented] (TIKA-1311) Centralize JSON handling of Metadata


    [ https://issues.apache.org/jira/browse/TIKA-1311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011287#comment-14011287 ]

Nick Burch commented on TIKA-1311:
----------------------------------

If we put it into core, we'd need to add another dependency (on GSON), which isn't ideal, so we might want to run the plan past the dev list first to see what people think (core tends to have a very minimal set of deps, unlike the other modules).

Package-wise, org.apache.tika.metadata.json is what I'd lean towards, otherwise utils.json.

> Centralize JSON handling of Metadata
> ------------------------------------
>
>                 Key: TIKA-1311
>                 URL: https://issues.apache.org/jira/browse/TIKA-1311
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Minor
>
> When JSON was initially added to the Tika CLI (TIKA-213), there was a recommendation to centralize JSON handling of Metadata, potentially putting it in core.  During a recent bug fix (TIKA-1291), the same recommendation was repeated, especially noting that we now handle JSON/Metadata differently in the CLI and the server.
> Let's centralize JSON handling in core and use GSON.  We should add a serializer and a deserializer so that users don't have to reinvent that wheel.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
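
To make the serializer half of the TIKA-1311 proposal a bit more concrete, here is a minimal sketch of what "JSON handling of Metadata" could mean. It assumes metadata can be viewed as a map from names to lists of string values; the class MetadataJsonSketch and its methods are hypothetical and are not Tika API. The JSON is built by hand only to keep the example dependency-free. A real module would presumably delegate to GSON or Jackson, as discussed in this thread.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: writes multi-valued metadata as a JSON object. */
public class MetadataJsonSketch {

    /**
     * Serializes name -> values pairs. A name with exactly one value is
     * written as a JSON string; any other count is written as an array.
     * (This single-vs-array choice is an assumption of the sketch.)
     */
    public static String toJson(Map<String, List<String>> metadata) {
        StringBuilder sb = new StringBuilder("{");
        boolean firstKey = true;
        for (Map.Entry<String, List<String>> e : metadata.entrySet()) {
            if (!firstKey) {
                sb.append(',');
            }
            firstKey = false;
            sb.append(quote(e.getKey())).append(':');
            List<String> values = e.getValue();
            if (values.size() == 1) {
                sb.append(quote(values.get(0)));
            } else {
                sb.append('[');
                for (int i = 0; i < values.size(); i++) {
                    if (i > 0) {
                        sb.append(',');
                    }
                    sb.append(quote(values.get(i)));
                }
                sb.append(']');
            }
        }
        return sb.append('}').toString();
    }

    /** Wraps a string in double quotes and escapes JSON special characters. */
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters need a \\uXXXX escape.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append('"').toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so the output is stable.
        Map<String, List<String>> md = new LinkedHashMap<>();
        md.put("Content-Type", Arrays.asList("application/pdf"));
        md.put("dc:creator", Arrays.asList("A", "B"));
        System.out.println(toJson(md));
    }
}
```

Having one such class that both the CLI and tika-server call would remove the current divergence between the two; a matching deserializer would live alongside it.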

Re: [DISCUSS] Centralizing JSON handling of Metadata

Ray Gauss II-2
Hi Tim,

1) Sounds good to me.

2) I do think we want core as lean as possible, so my vote would be for a separate project/module, similar to what was done with tika-xmp.  Perhaps something like tika-serialization-json, to indicate that other formats may follow the same precedent?

3) Similar to above, perhaps org.apache.tika.metadata.serialization.json?

Just curious, any particular reason for GSON over Jackson?

Regards,

Ray


RE: [DISCUSS] Centralizing JSON handling of Metadata

Allison, Timothy B.
Thank you, Ray!

In almost reverse order, I've been using Jackson for this already, but I used GSON in TIKA-1291 because that's what CLI was already using.  In GSON's favor, the jar is a bit smaller, but I have no real preference or reason to pick one over the other.  I'm not a json-blackbelt (or, I guess that would be blckbelt), so I'm happy to go with either.

A new module makes sense, but I'm wondering whether we want to be that specific.  tika-serialization?  Or maybe just tika-utils?

Package name looks good to me.

Thanks, again!

Best,

        Tim


RE: [DISCUSS] Centralizing JSON handling of Metadata

Ray Gauss II-2
I’ve used Jackson a bit but I don’t have a strong preference either.

I’m generally a fan of splitting things up into very small projects to keep the dependency hierarchy as clean as possible.  In this example, if we decided in the future to serialize directly to, say, a Mongo DBObject, the JSON project wouldn’t need to bring in Mongo dependencies.  Apache Camel does a good job of segmenting things [1].

However, that sort of modularization is probably a broader discussion than what we need for this particular issue, so between those two I’d vote for tika-serialization.

Regards,

Ray


[1] https://git-wip-us.apache.org/repos/asf?p=camel.git;a=tree;f=components;h=1132bd1bb98a446aec97d5c7bc4d032276a65d83;hb=HEAD


RE: [DISCUSS] Centralizing JSON handling of Metadata

Nick Burch-2
On Wed, 28 May 2014, Ray Gauss II wrote:
> However, that sort of modularization is probably a broader discussion
> than what we need for this particular issue, so between those two I’d
> vote for tika-serialization.

Tika-CLI and Tika-Server will likely want to depend on all of the
serialisation methods. So, I'd suggest we go for a single component for
now, tika-seriali{s,z}ation seems good to me. Later on, we can always
split that into something like:

   tika-serialisation
     (no code)
     depends on:
       tika-serialisation-json
       tika-serialisation-mongo
       tika-serialisation-blah

If and when there's a strong enough use case for the splitting!

Nick