[jira] [Commented] (NUTCH-2579) Fetcher to use parsed URL to call ProtocolFactory.getProtocol(url)


JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/NUTCH-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16510767#comment-16510767 ]

ASF GitHub Bot commented on NUTCH-2579:
---------------------------------------

sebastian-nagel closed pull request #334: NUTCH-2579 Fetcher to use parsed URL to call ProtocolFactory.getProtocol(url)
URL: https://github.com/apache/nutch/pull/334
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

diff --git a/src/java/org/apache/nutch/fetcher/FetcherThread.java b/src/java/org/apache/nutch/fetcher/FetcherThread.java
index 4f67391f0..3b000f995 100644
--- a/src/java/org/apache/nutch/fetcher/FetcherThread.java
+++ b/src/java/org/apache/nutch/fetcher/FetcherThread.java
@@ -296,14 +296,14 @@ public void run() {
               LOG.debug("redirectCount={}", redirectCount);
             }
             redirecting = false;
-            Protocol protocol = this.protocolFactory.getProtocol(fit.url
-                .toString());
-            BaseRobotRules rules = protocol.getRobotRules(fit.url, fit.datum, robotsTxtContent);
+            Protocol protocol = this.protocolFactory.getProtocol(fit.u);
+            BaseRobotRules rules = protocol.getRobotRules(fit.url, fit.datum,
+                robotsTxtContent);
             if (robotsTxtContent != null) {
               outputRobotsTxt(robotsTxtContent);
               robotsTxtContent.clear();
             }
-            if (!rules.isAllowed(fit.u.toString())) {
+            if (!rules.isAllowed(fit.url.toString())) {
               // unblock
               ((FetchItemQueues) fetchQueues).finishFetchItem(fit, true);
               if (LOG.isDebugEnabled()) {
diff --git a/src/java/org/apache/nutch/protocol/ProtocolFactory.java b/src/java/org/apache/nutch/protocol/ProtocolFactory.java
index 6a4205931..b39155b79 100644
--- a/src/java/org/apache/nutch/protocol/ProtocolFactory.java
+++ b/src/java/org/apache/nutch/protocol/ProtocolFactory.java
@@ -65,16 +65,36 @@ public ProtocolFactory(Configuration conf) {
    * @return The appropriate {@link Protocol} implementation for a given
    *         {@link URL}.
    * @throws ProtocolNotFound
-   *           when Protocol can not be found for urlString
+   *           when Protocol can not be found for urlString or urlString is not
+   *           a valid URL
    */
-  public synchronized Protocol getProtocol(String urlString)
+  public Protocol getProtocol(String urlString) throws ProtocolNotFound {
+    try {
+      URL url = new URL(urlString);
+      return getProtocol(url);
+    } catch (MalformedURLException e) {
+      throw new ProtocolNotFound(urlString, e.toString());
+    }
+  }
+
+  /**
+   * Returns the appropriate {@link Protocol} implementation for a url.
+   *
+   * @param url
+   *          URL to be fetched by returned {@link Protocol} implementation
+   * @return The appropriate {@link Protocol} implementation for a given
+   *         {@link URL}.
+   * @throws ProtocolNotFound
+   *           when Protocol can not be found for url
+   */
+  public synchronized Protocol getProtocol(URL url)
       throws ProtocolNotFound {
     ObjectCache objectCache = ObjectCache.get(conf);
     try {
-      URL url = new URL(urlString);
       String protocolName = url.getProtocol();
-      if (protocolName == null)
-        throw new ProtocolNotFound(urlString);
+      if (protocolName == null) {
+        throw new ProtocolNotFound(url.toString());
+      }
 
       String cacheId = Protocol.X_POINT_ID + protocolName;
       Protocol protocol = (Protocol) objectCache.getObject(cacheId);
@@ -90,10 +110,8 @@ public synchronized Protocol getProtocol(String urlString)
       protocol = (Protocol) extension.getExtensionInstance();
       objectCache.setObject(cacheId, protocol);
       return protocol;
-    } catch (MalformedURLException e) {
-      throw new ProtocolNotFound(urlString, e.toString());
     } catch (PluginRuntimeException e) {
-      throw new ProtocolNotFound(urlString, e.toString());
+      throw new ProtocolNotFound(url.toString(), e.toString());
     }
   }
 
diff --git a/src/java/org/apache/nutch/protocol/RobotRulesParser.java b/src/java/org/apache/nutch/protocol/RobotRulesParser.java
index a5a2d975e..1cddeea44 100644
--- a/src/java/org/apache/nutch/protocol/RobotRulesParser.java
+++ b/src/java/org/apache/nutch/protocol/RobotRulesParser.java
@@ -285,7 +285,7 @@ public int run(String[] args) {
       }
       ProtocolFactory factory = new ProtocolFactory(conf);
       try {
-        protocol = factory.getProtocol(robotsTxtUrl.toString());
+        protocol = factory.getProtocol(robotsTxtUrl);
       } catch (ProtocolNotFound e) {
         LOG.error("No protocol found for {}: {}", args[0],
             StringUtils.stringifyException(e));


 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


> Fetcher to use parsed URL to call ProtocolFactory.getProtocol(url)
> ------------------------------------------------------------------
>
>                 Key: NUTCH-2579
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2579
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher, protocol
>    Affects Versions: 1.14
>            Reporter: Sebastian Nagel
>            Priority: Minor
>             Fix For: 1.15
>
>
> The call to ProtocolFactory.getProtocol(url) is synchronized, so threads in a multi-threaded fetcher wait for its lock. It takes the URL string, although it would be more efficient to pass the parsed URL already held in the FetchItem: the lock could then be released sooner. In addition, parsing the URL inside the synchronized method contends for a second lock, on the Hashtable used by URL.getURLStreamHandler:
> {noformat}
> "FetcherThread" #37 daemon prio=5 os_prio=0 tid=0x00007f21edea2000 nid=0x5c20 waiting for monitor entry [0x00007f21bacb4000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at java.util.Hashtable.get(Hashtable.java:363)
>         - waiting to lock <0x00000005e01b5840> (a java.util.Hashtable)
>         at java.net.URL.getURLStreamHandler(URL.java:1135)
>         at java.net.URL.<init>(URL.java:599)
>         at java.net.URL.<init>(URL.java:490)
>         at java.net.URL.<init>(URL.java:439)
>         at org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:74)
>         - locked <0x00000005fc5f4fb8> (a org.apache.nutch.protocol.ProtocolFactory)
>         at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:299)
> {noformat}
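
The pattern described in the issue can be sketched in isolation as follows. This is only an illustration under stated assumptions, not the Nutch code: SchemeRegistry, lookup() and the plain HashMap cache are hypothetical stand-ins for ProtocolFactory, getProtocol() and ObjectCache. The String overload parses the URL before the monitor is taken and then delegates, so only the cache access runs under the lock.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for ProtocolFactory; names are illustrative only. */
final class SchemeRegistry {

  /** One cached handler object per URL scheme (assumed cache, not ObjectCache). */
  private final Map<String, Object> cache = new HashMap<>();

  /**
   * Not synchronized: the URL is parsed before the monitor is taken,
   * so parsing never extends the time the lock is held.
   */
  Object lookup(String urlString) {
    try {
      return lookup(new URL(urlString));
    } catch (MalformedURLException e) {
      throw new IllegalArgumentException("Not a valid URL: " + urlString, e);
    }
  }

  /** Synchronized: callers that already hold a parsed URL skip parsing entirely. */
  synchronized Object lookup(URL url) {
    return cache.computeIfAbsent(url.getProtocol(), scheme -> new Object());
  }
}
```

A caller that keeps the parsed URL next to the string form, as FetchItem does with its u and url fields, would call the URL overload directly and pay only for the cache lookup while holding the lock.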



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)