[jira] [Commented] (NUTCH-2429) Fix Plugin System to allow protocol plugins to bundle their URLStreamHandlers


JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/NUTCH-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16193272#comment-16193272 ]

ASF GitHub Bot commented on NUTCH-2429:
---------------------------------------

lewismc commented on a change in pull request #222: NUTCH-2429 Fix Plugin System to allow protocol plugins to bundle their URLStreamHandlers
URL: https://github.com/apache/nutch/pull/222#discussion_r143005046
 
 

 ##########
 File path: src/plugin/protocol-foo/src/java/my/foo/Foo.java
 ##########
 @@ -0,0 +1,126 @@
+package my.foo;
+
+import java.net.MalformedURLException;
+import java.net.URL;
+import java.util.List;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.Text;
+import org.apache.nutch.crawl.CrawlDatum;
+import org.apache.nutch.metadata.Metadata;
+import org.apache.nutch.net.protocols.HttpDateFormat;
+import org.apache.nutch.plugin.URLStreamHandlerFactory;
+import org.apache.nutch.protocol.Content;
+import org.apache.nutch.protocol.Protocol;
+import org.apache.nutch.protocol.ProtocolOutput;
+import org.apache.nutch.protocol.ProtocolStatus;
+import org.apache.nutch.protocol.RobotRulesParser;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import crawlercommons.robots.BaseRobotRules;
+
+public class Foo implements Protocol {
+  protected static final Logger LOG = LoggerFactory.getLogger(Foo.class);
+
+  private Configuration conf;
+
+  @Override
+  public Configuration getConf() {
+    LOG.debug("getConf()");
+    return conf;
+  }
+
+  @Override
+  public void setConf(Configuration conf) {
+    LOG.debug("setConf(...)");
+    this.conf = conf;
+  }
+
+  /**
+   * This is a dummy implementation only. So what we will do is return this
+   * structure:
+   *
+   * <pre>
+   * foo://example.com - will contain one directory and one file
+   * foo://example.com/a - directory, will contain two files
+   * foo://example.com/a/aa.txt - text file
+   * foo://example.com/a/ab.txt - text file
+   * foo://example.com/a.txt - text file
+   * </pre>
+   */
+  @Override
+  public ProtocolOutput getProtocolOutput(Text url, CrawlDatum datum) {
+    LOG.debug("getProtocolOutput(" + url + ", " + datum + ")");
+
+    try {
+      String urlstr = String.valueOf(url);
+      URL u = new URL(urlstr);
+      URL base = new URL(u, ".");
+      byte[] bytes = new byte[0];
+      String contentType = "foo/something";
+      ProtocolStatus status = ProtocolStatus.STATUS_GONE;
+
+      switch (urlstr) {
+      case "foo://example.com": {
+        String time = HttpDateFormat.toString(System.currentTimeMillis());
+        contentType = "text/html";
+        StringBuffer sb = new StringBuffer();
+        sb.append("<html><head>");
+        sb.append("<title>Index of /</title></head>\n");
+        sb.append("<body><h1>Index of /</h1><pre>\n");
+        sb.append("<a href='a/" + "'>a/</a>\t"+ time + "\t-\n"); // add directory
+        sb.append("<a href='a.txt'>a.txt</a>\t" + time + "\t" + 0 + "\n"); // add file
+        sb.append("</pre></html></body>");
+        bytes = sb.toString().getBytes();
+        status = ProtocolStatus.STATUS_SUCCESS;
+        break;
+      }
+      case "foo://example.com/a/": {
+        String time = HttpDateFormat.toString(System.currentTimeMillis());
+        contentType = "text/html";
+        StringBuffer sb = new StringBuffer();
+        sb.append("<html><head>");
+        sb.append("<title>Index of /a/</title></head>\n");
+        sb.append("<body><h1>Index of /a/</h1><pre>\n");
+        sb.append("<a href='aa.txt'>aa.txt</a>\t" + time + "\t" + 0 + "\n"); // add file
+        sb.append("<a href='ab.txt'>ab.txt</a>\t" + time + "\t" + 0 + "\n"); // add file
+        sb.append("</pre></html></body>");
+        bytes = sb.toString().getBytes();
+        status = ProtocolStatus.STATUS_SUCCESS;
+        break;
+      }
+      case "foo://example.com/a.txt":
+      case "foo://example.com/a/aa.txt":
+      case "foo://example.com/a/ab.txt": {
+        contentType = "text/plain";
+        bytes = "In publishing and graphic design, lorem ipsum is a filler text or greeking commonly used to demonstrate the textual elements of a graphic document or visual presentation. Replacing meaningful content with placeholder text allows designers to design the form of the content before the content itself has been produced.".getBytes();
+        status = ProtocolStatus.STATUS_SUCCESS;
+        break;
+      }
+      default:
+        LOG.warn("Unknown url '" + url + "'. This dummy implementation only supports 'foo://example.com'");
+        // all our default values are set for URLs that do not exist.
+        break;
+      }
+
+      Metadata metadata = new Metadata();
+      Content content = new Content(String.valueOf(url), String.valueOf(base),
+          bytes, contentType, metadata, getConf());
+
+      return new ProtocolOutput(content, status);
+    } catch (MalformedURLException mue) {
+      LOG.error("Could not retrieve " + url, mue);
 
 Review comment:
   Please use parameterized logging here instead of string concatenation.
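
   To illustrate the point of the comment, here is a minimal sketch of why parameterized logging is preferred over concatenation. It is not part of the pull request. To stay dependency-free it uses java.util.logging as a stand-in for SLF4J, and the class and helper names (ParameterizedLoggingSketch, CountingUrl) are hypothetical. With SLF4J, the line under review would typically read LOG.error("Could not retrieve {}", url, mue), and the debug call LOG.debug("getProtocolOutput({}, {})", url, datum).

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.logging.Level;
import java.util.logging.Logger;

// Sketch only: java.util.logging stands in for SLF4J so that the example
// runs without extra dependencies. All names here are hypothetical.
final class ParameterizedLoggingSketch {
  static final Logger LOG = Logger.getLogger("sketch");
  static final AtomicInteger TO_STRING_CALLS = new AtomicInteger();

  /** Stand-in for a log argument; counts how often it is turned into a String. */
  static final class CountingUrl {
    @Override
    public String toString() {
      TO_STRING_CALLS.incrementAndGet();
      return "foo://example.com";
    }
  }

  /** Concatenation builds the message even though FINE is disabled. */
  static int callsWithConcatenation() {
    LOG.setLevel(Level.INFO);
    TO_STRING_CALLS.set(0);
    LOG.fine("getProtocolOutput(" + new CountingUrl() + ")");
    return TO_STRING_CALLS.get();
  }

  /** The parameterized form passes the argument through unformatted. */
  static int callsWithParameters() {
    LOG.setLevel(Level.INFO);
    TO_STRING_CALLS.set(0);
    LOG.log(Level.FINE, "getProtocolOutput({0})", new CountingUrl());
    return TO_STRING_CALLS.get();
  }

  public static void main(String[] args) {
    System.out.println("concatenation toString() calls: " + callsWithConcatenation());
    System.out.println("parameterized toString() calls: " + callsWithParameters());
  }
}
```

   The concatenated call pays for toString() on every invocation, while the parameterized call defers formatting until the message is actually going to be written.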
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


> Fix Plugin System to allow protocol plugins to bundle their URLStreamHandlers
> -----------------------------------------------------------------------------
>
>                 Key: NUTCH-2429
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2429
>             Project: Nutch
>          Issue Type: Improvement
>          Components: commoncrawl
>    Affects Versions: 1.14
>         Environment: Tested on both Nutch 1.13 and 1.14 in Ubuntu Linux with OpenJDK 1.8.
>            Reporter: Hiran Chaudhuri
>             Fix For: 1.14
>
>
> While trying to use the protocol-smb plugin (which is not part of the Nutch distribution) I realized there are four steps to successfully make use of a protocol plugin:
> 1 - put the artifact into the plugins directory
> 2 - modify the Nutch configuration files to allow smb:// URLs and add the plugin to the list of loaded plugins
> 3 - extract jcifs.jar and place it on the system classpath
> 4 - run Nutch with the correct system property
> While steps 1 and 2 seem obvious, steps 3 and 4 require knowledge of plugin internals, which does not feel right for Nutch and plugin users. Moreover, jcifs.jar would then exist twice on the classpath, which could cause further problems at runtime.
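
For context on why steps 3 and 4 are needed at all, here is a minimal sketch of how the JVM resolves URL schemes. It is an illustration under assumptions, not code from the issue or the pull request: UrlHandlerSketch and FooHandler are hypothetical names, and the foo:// scheme is borrowed from the dummy plugin. By default, new URL(...) throws MalformedURLException for any scheme without a registered URLStreamHandler. One standard JVM mechanism for registering handlers is the java.protocol.handler.pkgs system property, which requires the handler class on the system classpath. Whether that is the property meant in step 4 is an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;

// Sketch only: shows the default handler lookup failing for an unknown
// scheme and succeeding once a handler is supplied explicitly.
final class UrlHandlerSketch {

  /** Hypothetical handler for the dummy foo:// scheme; it serves no real content. */
  static final class FooHandler extends URLStreamHandler {
    @Override
    protected URLConnection openConnection(URL u) {
      return new URLConnection(u) {
        @Override
        public void connect() {
          // nothing to connect to in this sketch
        }

        @Override
        public InputStream getInputStream() {
          return new ByteArrayInputStream(new byte[0]);
        }
      };
    }
  }

  /** Returns true if the JVM's default lookup finds a handler for the scheme. */
  static boolean parsesWithDefaultLookup(String spec) {
    try {
      new URL(spec);
      return true;
    } catch (MalformedURLException e) {
      return false;
    }
  }

  /** Parses the URL with an explicitly supplied handler, bypassing the lookup. */
  static URL parseWithHandler(String spec) {
    try {
      return new URL(null, spec, new FooHandler());
    } catch (MalformedURLException e) {
      throw new IllegalArgumentException(e);
    }
  }

  public static void main(String[] args) {
    String spec = "foo://example.com/a.txt";
    System.out.println("default lookup parses foo://: " + parsesWithDefaultLookup(spec));
    URL u = parseWithHandler(spec);
    System.out.println(u.getProtocol() + " " + u.getHost() + " " + u.getPath());
  }
}
```

A plugin that can bundle its own handler would let Nutch perform the second kind of resolution on the plugin's behalf, so users would not have to touch the system classpath or JVM properties.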



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)