[jira] Created: (TIKA-321) Optimize type detection speed

JIRA jira@apache.org
Optimize type detection speed
-----------------------------

                 Key: TIKA-321
                 URL: https://issues.apache.org/jira/browse/TIKA-321
             Project: Tika
          Issue Type: Improvement
          Components: mime
            Reporter: Jukka Zitting
            Priority: Minor


It would be good to run some simple benchmarks on the type detection code (Tika.detect) to see whether there are obvious performance optimizations we could make. In some use cases, such as attaching file type information to directory listings, type detection speed is important and not necessarily dwarfed by IO waits.
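
A simple benchmark of the kind described above could look roughly like the following sketch. It is not taken from the Tika code base: the class name DetectBenchmark, the stand-in detect method, and the sample data are all hypothetical, and the stand-in only checks a few magic-byte prefixes. To measure the real detector, the Function argument would be swapped for a call into Tika. Hand-rolled timing loops like this are only indicative; a harness such as JMH gives more trustworthy numbers.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

public class DetectBenchmark {

    // Written to from the timing loop so the JIT cannot discard the calls.
    static volatile int sink;

    // Hypothetical stand-in for the detector under test (NOT Tika's code).
    static String detect(byte[] prefix) {
        if (startsWith(prefix, "%PDF-")) return "application/pdf";
        if (startsWith(prefix, "PK")) return "application/zip";
        if (startsWith(prefix, "<?xml")) return "application/xml";
        return "application/octet-stream";
    }

    static boolean startsWith(byte[] data, String magic) {
        byte[] m = magic.getBytes(StandardCharsets.US_ASCII);
        if (data.length < m.length) return false;
        for (int i = 0; i < m.length; i++) {
            if (data[i] != m[i]) return false;
        }
        return true;
    }

    // Runs one untimed warm-up pass, then one timed pass, and returns the
    // average number of nanoseconds per detector call in the timed pass.
    static double nanosPerCall(Function<byte[], String> detector,
                               byte[][] samples, int rounds) {
        runAll(detector, samples, rounds);            // warm-up
        long start = System.nanoTime();
        runAll(detector, samples, rounds);            // measured
        long elapsed = System.nanoTime() - start;
        return (double) elapsed / ((long) rounds * samples.length);
    }

    static void runAll(Function<byte[], String> detector,
                       byte[][] samples, int rounds) {
        int acc = 0;
        for (int r = 0; r < rounds; r++) {
            for (byte[] sample : samples) {
                acc += detector.apply(sample).length();
            }
        }
        sink = acc;
    }

    public static void main(String[] args) {
        byte[][] samples = {
            "%PDF-1.4".getBytes(StandardCharsets.US_ASCII),
            "<?xml version=\"1.0\"?>".getBytes(StandardCharsets.US_ASCII),
            "plain text".getBytes(StandardCharsets.US_ASCII),
        };
        double ns = nanosPerCall(DetectBenchmark::detect, samples, 100_000);
        System.out.println("average ns per detect call: " + ns);
    }
}
```

Working on in-memory byte prefixes keeps IO out of the measurement, which matches the directory-listing use case where detection cost itself is the concern.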

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Resolved: (TIKA-321) Optimize type detection speed

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-321.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.6
         Assignee: Jukka Zitting

I've made a number of optimizations to the type detection code, and as a result it is already over an order of magnitude faster than before. I believe there's *still* an order of magnitude of improvement available (check the most common types first, short-circuit matching to only the subtypes of already detected types, etc.), but I've already reached the performance goals I had, so I'll mark this as resolved for Tika 0.6. We can follow up with another issue in case anyone has stricter performance requirements.
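
The two remaining ideas mentioned above (checking the most common types first, and restricting further matching to subtypes of an already detected type) might be sketched as follows. This is an illustration under stated assumptions, not the Tika implementation: the OrderedDetector and Rule names are hypothetical, the rule set is tiny, and the "most common first" ordering is assumed rather than measured. The EPUB subtype rule relies on the container convention that the first ZIP entry is an uncompressed file named "mimetype", so its name and content begin at byte offset 30.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OrderedDetector {

    // One magic-byte rule; subtypes are only examined after this rule matches.
    static final class Rule {
        final String type;
        final int offset;
        final byte[] magic;
        final List<Rule> subtypes = new ArrayList<>();

        Rule(String type, int offset, String magic) {
            this.type = type;
            this.offset = offset;
            // ISO-8859-1 maps chars 0..255 directly to single bytes.
            this.magic = magic.getBytes(StandardCharsets.ISO_8859_1);
        }

        Rule sub(Rule child) {
            subtypes.add(child);
            return this;
        }

        boolean matches(byte[] data) {
            if (data.length < offset + magic.length) return false;
            for (int i = 0; i < magic.length; i++) {
                if (data[offset + i] != magic[i]) return false;
            }
            return true;
        }
    }

    // Top-level rules, listed in an ASSUMED most-common-first order.
    static final List<Rule> RULES = Arrays.asList(
        new Rule("application/pdf", 0, "%PDF-"),
        new Rule("application/zip", 0, "PK\u0003\u0004")
            .sub(new Rule("application/epub+zip", 30,
                          "mimetypeapplication/epub+zip")),
        new Rule("application/xml", 0, "<?xml")
    );

    // Returns on the first matching top-level rule, then refines it.
    static String detect(byte[] data) {
        for (Rule rule : RULES) {
            if (rule.matches(data)) return refine(rule, data);
        }
        return "application/octet-stream";
    }

    // Short-circuit: only subtypes of the matched rule are checked.
    static String refine(Rule matched, byte[] data) {
        for (Rule child : matched.subtypes) {
            if (child.matches(data)) return refine(child, data);
        }
        return matched.type;
    }
}
```

With this layout a PDF is identified after a single comparison, and the EPUB rule is never evaluated for input that is not already known to be a ZIP archive.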

> Optimize type detection speed
> -----------------------------
>
>                 Key: TIKA-321
>                 URL: https://issues.apache.org/jira/browse/TIKA-321
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime
>            Reporter: Jukka Zitting
>            Assignee: Jukka Zitting
>            Priority: Minor
>             Fix For: 0.6
>
>
> It would be good to run some simple benchmarks on the type detection code (Tika.detect) to see whether there are obvious performance optimizations we could make. In some use cases, such as attaching file type information to directory listings, type detection speed is important and not necessarily dwarfed by IO waits.
