[jira] [Commented] (NUTCH-2578) Avoid lock by MimeUtil in constructor of protocol.Content


JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/NUTCH-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16510780#comment-16510780 ]

ASF GitHub Bot commented on NUTCH-2578:
---------------------------------------

sebastian-nagel closed pull request #338: NUTCH-2578 Avoid lock by MimeUtil in constructor of protocol.Content
URL: https://github.com/apache/nutch/pull/338

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:


diff --git a/src/java/org/apache/nutch/protocol/Content.java b/src/java/org/apache/nutch/protocol/Content.java
index 7de491617..2b49f7dbb 100755
--- a/src/java/org/apache/nutch/protocol/Content.java
+++ b/src/java/org/apache/nutch/protocol/Content.java
@@ -81,6 +81,29 @@ public Content(String url, String base, byte[] content, String contentType,
     this.metadata = metadata;
 
     this.mimeTypes = new MimeUtil(conf);
+
+    this.contentType = getContentType(contentType, url, content);
+  }
+
+  public Content(String url, String base, byte[] content, String contentType,
+      Metadata metadata, MimeUtil mimeTypes) {
+
+    if (url == null)
+      throw new IllegalArgumentException("null url");
+    if (base == null)
+      throw new IllegalArgumentException("null base");
+    if (content == null)
+      throw new IllegalArgumentException("null content");
+    if (metadata == null)
+      throw new IllegalArgumentException("null metadata");
+
+    this.url = url;
+    this.base = base;
+    this.content = content;
+    this.metadata = metadata;
+
+    this.mimeTypes = mimeTypes;
+
     this.contentType = getContentType(contentType, url, content);
   }
 
diff --git a/src/java/org/apache/nutch/util/MimeUtil.java b/src/java/org/apache/nutch/util/MimeUtil.java
index 7f7ec0c1b..d380427ae 100644
--- a/src/java/org/apache/nutch/util/MimeUtil.java
+++ b/src/java/org/apache/nutch/util/MimeUtil.java
@@ -66,8 +66,12 @@
       .getLogger(MethodHandles.lookup().lookupClass());
 
   public MimeUtil(Configuration conf) {
-    tika = new Tika();
     ObjectCache objectCache = ObjectCache.get(conf);
+    tika = (Tika) objectCache.getObject(Tika.class.getName());
+    if (tika == null) {
+      tika = new Tika();
+      objectCache.setObject(Tika.class.getName(), tika);
+    }
     MimeTypes mimeTypez = (MimeTypes) objectCache.getObject(MimeTypes.class
         .getName());
     if (mimeTypez == null) {
diff --git a/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java b/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
index d9284c9aa..145a8aefa 100644
--- a/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
+++ b/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
@@ -41,6 +41,7 @@
 import org.apache.nutch.protocol.ProtocolOutput;
 import org.apache.nutch.protocol.ProtocolStatus;
 import org.apache.nutch.util.GZIPUtils;
+import org.apache.nutch.util.MimeUtil;
 import org.apache.nutch.util.DeflateUtils;
 import org.apache.hadoop.util.StringUtils;
 
@@ -105,6 +106,13 @@
   /** The nutch configuration */
   private Configuration conf = null;
 
+  /**
+   * MimeUtil for MIME type detection. Note (see NUTCH-2578): MimeUtil object is
+   * used concurrently by parallel fetcher threads, methods to detect MIME type
+   * must be thread-safe.
+   */
+  private MimeUtil mimeTypes = null;
+
   /** Do we use HTTP/1.1? */
   protected boolean useHttp11 = false;
 
@@ -158,6 +166,7 @@ public void setConf(Configuration conf) {
         .trim();
     this.acceptCharset = conf.get("http.accept.charset", acceptCharset).trim();
     this.accept = conf.get("http.accept", accept).trim();
+    this.mimeTypes = new MimeUtil(conf);
     // backward-compatible default setting
     this.useHttp11 = conf.getBoolean("http.useHttp11", false);
     this.responseTime = conf.getBoolean("http.store.responsetime", true);
@@ -282,7 +291,7 @@ public ProtocolOutput getProtocolOutput(Text url, CrawlDatum datum) {
       byte[] content = response.getContent();
       Content c = new Content(u.toString(), u.toString(),
           (content == null ? EMPTY_CONTENT : content),
-          response.getHeader("Content-Type"), response.getHeaders(), this.conf);
+          response.getHeader("Content-Type"), response.getHeaders(), mimeTypes);
 
       if (code == 200) { // got a good response
         return new ProtocolOutput(c); // return it
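
The MimeUtil change in the diff above follows a get / null check / set sequence against the object cache. A minimal, self-contained sketch of that sequence is given below. SketchObjectCache, CostlyDetector and CachedLookupSketch are hypothetical stand-ins invented for illustration; they are not Nutch's ObjectCache or Tika, and only the shape of the lookup is taken from the patch.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for an object cache; not Nutch's ObjectCache. */
class SketchObjectCache {
  private final Map<String, Object> objects = new ConcurrentHashMap<>();

  Object getObject(String key) {
    return objects.get(key);
  }

  void setObject(String key, Object value) {
    objects.put(key, value);
  }
}

/** Hypothetical stand-in for an object that is expensive to construct. */
class CostlyDetector {
  /** Counts constructor calls so the effect of caching is observable. */
  static final AtomicInteger CREATED = new AtomicInteger();

  CostlyDetector() {
    CREATED.incrementAndGet();
  }
}

class CachedLookupSketch {
  /** Same get / null check / set sequence as in the patched constructor. */
  static CostlyDetector lookupOrCreate(SketchObjectCache cache) {
    String key = CostlyDetector.class.getName();
    CostlyDetector detector = (CostlyDetector) cache.getObject(key);
    if (detector == null) {
      // Only reached when nothing is cached under this key yet.
      detector = new CostlyDetector();
      cache.setObject(key, detector);
    }
    return detector;
  }
}
```

Note that the check and the set are two separate steps in this sketch, so two threads racing on an empty cache could each construct an instance once. That cost is bounded, unlike constructing one instance per fetched document.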


 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


> Avoid lock by MimeUtil in constructor of protocol.Content
> ---------------------------------------------------------
>
>                 Key: NUTCH-2578
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2578
>             Project: Nutch
>          Issue Type: Improvement
>          Components: protocol
>    Affects Versions: 1.14
>            Reporter: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.15
>
>
> The constructor of the class o.a.n.protocol.Content instantiates a new MimeUtil object. That's not cheap: it always creates a new Tika object, and there is a lock on the job/jar file when config files are read:
> {noformat}
> "FetcherThread" #146 daemon prio=5 os_prio=0 tid=0x00007f70523c3800 nid=0x1de2 waiting for monitor entry [0x00007f70193a8000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at java.util.zip.ZipFile.getEntry(ZipFile.java:314)
>         - waiting to lock <0x00000005e0285758> (a java.util.jar.JarFile)
>         at java.util.jar.JarFile.getEntry(JarFile.java:240)
>         at java.util.jar.JarFile.getJarEntry(JarFile.java:223)
>         at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:1042)
>         at sun.misc.URLClassPath$JarLoader.findResource(URLClassPath.java:1020)
>         at sun.misc.URLClassPath$1.next(URLClassPath.java:267)
>         at sun.misc.URLClassPath$1.hasMoreElements(URLClassPath.java:277)
>         at java.net.URLClassLoader$3$1.run(URLClassLoader.java:601)
>         at java.net.URLClassLoader$3$1.run(URLClassLoader.java:599)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at java.net.URLClassLoader$3.next(URLClassLoader.java:598)
>         at java.net.URLClassLoader$3.hasMoreElements(URLClassLoader.java:623)
>         at sun.misc.CompoundEnumeration.next(CompoundEnumeration.java:45)
>         at sun.misc.CompoundEnumeration.hasMoreElements(CompoundEnumeration.java:54)
>         at java.util.Collections.list(Collections.java:5239)
>         at org.apache.tika.config.ServiceLoader.identifyStaticServiceProviders(ServiceLoader.java:325)
>         at org.apache.tika.config.ServiceLoader.loadStaticServiceProviders(ServiceLoader.java:352)
>         at org.apache.tika.config.ServiceLoader.loadServiceProviders(ServiceLoader.java:274)
>         at org.apache.tika.detect.DefaultEncodingDetector.<init>(DefaultEncodingDetector.java:45)
>         at org.apache.tika.config.TikaConfig.getDefaultEncodingDetector(TikaConfig.java:92)
>         at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:248)
>         at org.apache.tika.config.TikaConfig.getDefaultConfig(TikaConfig.java:386)
>         at org.apache.tika.Tika.<init>(Tika.java:116)
>         at org.apache.nutch.util.MimeUtil.<init>(MimeUtil.java:69)
>         at org.apache.nutch.protocol.Content.<init>(Content.java:83)
>         at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:316)
>         at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:341)
> {noformat}
> If there are many Fetcher threads this may cause a significant bottleneck. Running a Fetcher with 120 threads, I've found up to 50 threads waiting for this lock:
> {noformat}
> # pid 7195 is a Fetcher map task
> % sudo -u yarn jstack 7195 \
>       | grep -A25 'waiting to lock' \
>       | grep -F 'org.apache.tika.Tika.<init>' \
>       | wc -l
> 49
> {noformat}
> As MimeUtil is thread-safe [including the called Tika detector|https://www.mail-archive.com/user@.../msg00296.html], the best solution seems to be to cache the MimeUtil object in the actual protocol implementation, as is done in Nutch 2.x ([lib-http HttpBase, line #151|https://github.com/apache/nutch/blob/2.x/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java#L151]).
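
The proposal in the issue description (build the detector once in the protocol implementation and let all fetcher threads share it) can be illustrated with a small, self-contained sketch. All class names below (SketchMimeDetector, SketchProtocol, SharedDetectorSketch) and the toy detection rule are assumptions made for illustration; they only mimic the roles of MimeUtil and HttpBase and are not the Nutch classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for MimeUtil: costly to build, thread-safe to use. */
class SketchMimeDetector {
  /** Counts constructor calls so sharing can be verified. */
  static final AtomicInteger INSTANCES = new AtomicInteger();

  SketchMimeDetector() {
    INSTANCES.incrementAndGet();
  }

  /** Stateless toy rule, hence safe to call from many threads at once. */
  String detect(String url) {
    return url.endsWith(".html") ? "text/html" : "application/octet-stream";
  }
}

/** Hypothetical stand-in for a protocol plugin such as HttpBase. */
class SketchProtocol {
  private SketchMimeDetector mimeTypes;

  /** Called once at configuration time, in the role of setConf(). */
  void configure() {
    this.mimeTypes = new SketchMimeDetector();
  }

  /** Called per fetched URL by every thread; reuses the one detector. */
  String fetchContentType(String url) {
    return mimeTypes.detect(url);
  }
}

class SharedDetectorSketch {
  /** Runs the given number of fetches on a pool and returns the detected types. */
  static List<String> run(int threads, int fetches) {
    SketchProtocol protocol = new SketchProtocol();
    protocol.configure();
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<String>> futures = new ArrayList<>();
      for (int i = 0; i < fetches; i++) {
        String url = "http://example.org/page" + i + ".html";
        Callable<String> task = () -> protocol.fetchContentType(url);
        futures.add(pool.submit(task));
      }
      List<String> types = new ArrayList<>();
      for (Future<String> future : futures) {
        types.add(future.get());
      }
      return types;
    } catch (InterruptedException | ExecutionException e) {
      throw new IllegalStateException("fetch task failed", e);
    } finally {
      pool.shutdown();
    }
  }
}
```

Calling SharedDetectorSketch.run(8, 100) constructs a single SketchMimeDetector regardless of the number of threads or fetches. This design is only safe when the shared object's detection methods are thread-safe, which the issue description states is the case for MimeUtil.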



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)