[jira] [Commented] (TIKA-1568) AutoDetectReader performance problem


Tim Allison (Jira)

    [ https://issues.apache.org/jira/browse/TIKA-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16880151#comment-16880151 ]

Sebastian Nagel commented on TIKA-1568:
---------------------------------------

Actually, the performance impact is significant: it looks like 75% of the time is spent looking up and loading classes, while the actual work (detecting charsets) requires only 25% of the time. A detailed profile of CC's Nutch WARC writer is attached to the [issue report on github|https://github.com/commoncrawl/nutch/issues/7].

> AutoDetectReader performance problem
> ------------------------------------
>
>                 Key: TIKA-1568
>                 URL: https://issues.apache.org/jira/browse/TIKA-1568
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.7
>            Reporter: Andrzej Bialecki
>            Priority: Major
>
> Parsing performance of many text files suffers from repeated calls to ServiceLoader.loadServiceProviders(EncodingDetector.class). This happens in TXTParser, HTMLParser and SourceCodeParser. In most cases, when Tika is using the default ServiceLoader instance created in the Parser's static section, this cost can be avoided by caching the resulting List<EncodingDetector> at a higher level in the Parser (as a static property). If using custom ServiceLoader-s, this can be achieved by putting this list in ParsingContext, or by caching these lists at a lower level in the ServiceLoader component.
> The relevant part of the stack trace follows:
> {code}
>    java.lang.Thread.State: BLOCKED (on object monitor)
> at java.util.zip.ZipFile.getEntry(ZipFile.java:304)
> - locked <0x00000007909d2e48> (a java.util.jar.JarFile)
> at java.util.jar.JarFile.getEntry(JarFile.java:227)
> at java.util.jar.JarFile.getJarEntry(JarFile.java:210)
> at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:840)
> at sun.misc.URLClassPath$JarLoader.findResource(URLClassPath.java:818)
> at sun.misc.URLClassPath$1.next(URLClassPath.java:226)
> at sun.misc.URLClassPath$1.hasMoreElements(URLClassPath.java:236)
> at java.net.URLClassLoader$3$1.run(URLClassLoader.java:583)
> at java.net.URLClassLoader$3$1.run(URLClassLoader.java:581)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader$3.next(URLClassLoader.java:580)
> at java.net.URLClassLoader$3.hasMoreElements(URLClassLoader.java:605)
> at java.util.Collections.list(Collections.java:3687)
> at org.eclipse.jetty.webapp.WebAppClassLoader.toList(WebAppClassLoader.java:337)
> at org.eclipse.jetty.webapp.WebAppClassLoader.getResources(WebAppClassLoader.java:321)
> at org.apache.tika.config.ServiceLoader.findServiceResources(ServiceLoader.java:210)
> at org.apache.tika.config.ServiceLoader.identifyStaticServiceProviders(ServiceLoader.java:277)
> at org.apache.tika.config.ServiceLoader.loadStaticServiceProviders(ServiceLoader.java:306)
> at org.apache.tika.config.ServiceLoader.loadServiceProviders(ServiceLoader.java:228)
> at org.apache.tika.detect.AutoDetectReader.<init>(AutoDetectReader.java:104)
> at org.apache.tika.parser.txt.TXTParser.parse(TXTParser.java:70)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
> at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
> ...
> {code}
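
The caching idea from the issue description could be sketched roughly as follows. This is plain Java with no Tika dependency, and it is not the fix that was applied in Tika: `CachedProviders` and the supplier are hypothetical names introduced for illustration. In a real setup the supplier would presumably wrap the `ServiceLoader.loadServiceProviders(EncodingDetector.class)` call seen in the stack trace, so that the classpath scan runs once instead of once per `AutoDetectReader`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical helper (not part of Tika): loads a provider list once and reuses it. */
final class CachedProviders<T> {
    private final Supplier<List<T>> loader; // the expensive lookup, e.g. a classpath scan
    private volatile List<T> cached;        // null until the first successful load

    CachedProviders(Supplier<List<T>> loader) {
        this.loader = loader;
    }

    /** Returns the cached list, invoking the loader only on the first call. */
    List<T> get() {
        List<T> result = cached;
        if (result == null) {
            synchronized (this) {
                result = cached;
                if (result == null) {
                    // Copy defensively and publish an unmodifiable view.
                    List<T> copy = new ArrayList<>(loader.get());
                    result = Collections.unmodifiableList(copy);
                    cached = result;
                }
            }
        }
        return result;
    }
}

/** Small demo: the stand-in "expensive" loader runs once across many lookups. */
final class CachedProvidersDemo {
    public static void main(String[] args) {
        AtomicInteger loads = new AtomicInteger();
        CachedProviders<String> detectors = new CachedProviders<>(() -> {
            loads.incrementAndGet(); // stands in for the costly service lookup
            return Arrays.asList("detectorA", "detectorB");
        });
        for (int i = 0; i < 1000; i++) {
            detectors.get();
        }
        System.out.println("loader invocations: " + loads.get());
    }
}
```

Holding one such instance in a static field of the parser corresponds to the "static property" option in the description. For custom ServiceLoader instances, one cache per loader would be needed, which this sketch does not cover.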



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)