[jira] [Commented] (NUTCH-2578) Avoid lock by MimeUtil in constructor of protocol.Content

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/NUTCH-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16479176#comment-16479176 ]

Yossi Tamari commented on NUTCH-2578:

Hi [~wastl-nagel], maybe I'm missing something, but it seems that putting the Tika instance in the ObjectCache would mean it is created only once, removing the need to cache MimeUtil at a higher level. This would not add any thread safety, but it would remove the external requirement that MimeUtil be thread-safe; thread safety would become an implementation detail of the class. This is purely about better encapsulation, not actual functionality.
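
To make the suggestion concrete, here is a rough, self-contained sketch of the idea. It does not use the real Nutch or Tika classes: ExpensiveDetector, SketchObjectCache and SketchMimeUtil are hypothetical stand-ins for the Tika instance, the ObjectCache and MimeUtil, and the cache API shown is an assumption made for illustration only.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical stand-in for an expensive, thread-safe object such as the Tika instance.
final class ExpensiveDetector {
    // Counts constructions so the effect of the cache is observable.
    static final AtomicInteger CREATED = new AtomicInteger();

    ExpensiveDetector() {
        CREATED.incrementAndGet();
    }

    String detect(String name) {
        return name.endsWith(".html") ? "text/html" : "application/octet-stream";
    }
}

// Hypothetical stand-in for a per-configuration object cache (not the Nutch ObjectCache API).
final class SketchObjectCache {
    private static final Map<Object, SketchObjectCache> CACHES = new ConcurrentHashMap<>();
    private final Map<String, Object> objects = new ConcurrentHashMap<>();

    static SketchObjectCache get(Object conf) {
        return CACHES.computeIfAbsent(conf, c -> new SketchObjectCache());
    }

    // Runs the factory at most once per key, even when called from several threads.
    Object getOrCreate(String key, Supplier<Object> factory) {
        return objects.computeIfAbsent(key, k -> factory.get());
    }
}

// Hypothetical stand-in for MimeUtil: cheap to construct because the
// expensive detector is looked up in the cache instead of being rebuilt.
final class SketchMimeUtil {
    private final ExpensiveDetector detector;

    SketchMimeUtil(Object conf) {
        this.detector = (ExpensiveDetector) SketchObjectCache.get(conf)
            .getOrCreate(ExpensiveDetector.class.getName(), ExpensiveDetector::new);
    }

    String forName(String name) {
        return detector.detect(name);
    }
}
```

With something along these lines, callers could keep creating one SketchMimeUtil per Content object; the sharing of the detector stays hidden inside the class.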

> Avoid lock by MimeUtil in constructor of protocol.Content
> ---------------------------------------------------------
>                 Key: NUTCH-2578
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2578
>             Project: Nutch
>          Issue Type: Improvement
>          Components: protocol
>    Affects Versions: 1.14
>            Reporter: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.15
> The constructor of the class o.a.n.protocol.Content instantiates a new MimeUtil object. This is not cheap: it always creates a new tika.MimeTypes object, and there is a lock on the job/jar file when config files are read:
> {noformat}
> "FetcherThread" #146 daemon prio=5 os_prio=0 tid=0x00007f70523c3800 nid=0x1de2 waiting for monitor entry [0x00007f70193a8000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at java.util.zip.ZipFile.getEntry(ZipFile.java:314)
>         - waiting to lock <0x00000005e0285758> (a java.util.jar.JarFile)
>         at java.util.jar.JarFile.getEntry(JarFile.java:240)
>         at java.util.jar.JarFile.getJarEntry(JarFile.java:223)
>         at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:1042)
>         at sun.misc.URLClassPath$JarLoader.findResource(URLClassPath.java:1020)
>         at sun.misc.URLClassPath$1.next(URLClassPath.java:267)
>         at sun.misc.URLClassPath$1.hasMoreElements(URLClassPath.java:277)
>         at java.net.URLClassLoader$3$1.run(URLClassLoader.java:601)
>         at java.net.URLClassLoader$3$1.run(URLClassLoader.java:599)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at java.net.URLClassLoader$3.next(URLClassLoader.java:598)
>         at java.net.URLClassLoader$3.hasMoreElements(URLClassLoader.java:623)
>         at sun.misc.CompoundEnumeration.next(CompoundEnumeration.java:45)
>         at sun.misc.CompoundEnumeration.hasMoreElements(CompoundEnumeration.java:54)
>         at java.util.Collections.list(Collections.java:5239)
>         at org.apache.tika.config.ServiceLoader.identifyStaticServiceProviders(ServiceLoader.java:325)
>         at org.apache.tika.config.ServiceLoader.loadStaticServiceProviders(ServiceLoader.java:352)
>         at org.apache.tika.config.ServiceLoader.loadServiceProviders(ServiceLoader.java:274)
>         at org.apache.tika.detect.DefaultEncodingDetector.<init>(DefaultEncodingDetector.java:45)
>         at org.apache.tika.config.TikaConfig.getDefaultEncodingDetector(TikaConfig.java:92)
>         at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:248)
>         at org.apache.tika.config.TikaConfig.getDefaultConfig(TikaConfig.java:386)
>         at org.apache.tika.Tika.<init>(Tika.java:116)
>         at org.apache.nutch.util.MimeUtil.<init>(MimeUtil.java:69)
>         at org.apache.nutch.protocol.Content.<init>(Content.java:83)
>         at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:316)
>         at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:341)
> {noformat}
> If there are many Fetcher threads, this may cause a significant bottleneck. Running a Fetcher with 120 threads, I've found up to 50 threads waiting for this lock:
> {noformat}
> # pid 7195 is a Fetcher map task
> % sudo -u yarn jstack 7195 \
>       | grep -A25 'waiting to lock' \
>       | grep -F 'org.apache.tika.Tika.<init>' \
>       | wc -l
> 49
> {noformat}
> As MimeUtil is thread-safe [including the called Tika detector|https://www.mail-archive.com/user@.../msg00296.html], the best solution seems to be to cache the MimeUtil object in the actual protocol implementation, as is done in Nutch 2.x ([lib-http HttpBase, line #151|https://github.com/apache/nutch/blob/2.x/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java#L151]).
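
For comparison, the approach proposed in the issue description (one MimeUtil held by the protocol implementation and handed to each Content) might look roughly like the sketch below. All class names and signatures here are hypothetical stand-ins chosen for illustration; they are not the actual Nutch classes or the code at the linked HttpBase line.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for MimeUtil: assumed thread-safe and costly to construct.
final class CostlyMimeUtil {
    // Counts constructions so that sharing can be verified.
    static final AtomicInteger INSTANCES = new AtomicInteger();

    CostlyMimeUtil() {
        INSTANCES.incrementAndGet();
    }

    // Prefers the declared type; otherwise guesses from the URL suffix.
    String resolve(String declaredType, String url) {
        if (declaredType != null) {
            return declaredType;
        }
        return url.endsWith(".pdf") ? "application/pdf" : "text/html";
    }
}

// Hypothetical stand-in for protocol.Content: it receives the shared
// MimeUtil instead of constructing its own.
final class SketchContent {
    final String url;
    final String contentType;

    SketchContent(String url, String declaredType, CostlyMimeUtil mimeTypes) {
        this.url = url;
        this.contentType = mimeTypes.resolve(declaredType, url);
    }
}

// Hypothetical stand-in for a protocol implementation: the MimeUtil is
// created once per protocol instance and reused for every fetched document.
final class SketchHttpProtocol {
    private final CostlyMimeUtil mimeTypes = new CostlyMimeUtil();

    SketchContent getProtocolOutput(String url, String declaredType) {
        return new SketchContent(url, declaredType, mimeTypes);
    }
}
```

In this variant the constructor of the content class no longer triggers any expensive initialization, but callers have to rely on the shared object being thread-safe.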

This message was sent by Atlassian JIRA