Nutch Plugin Lifecycle broken due to lazy loading?


Hiran Chaudhuri

Hello there.

Is it possible that the plugin lifecycle is broken, or at least buggy?

I'm trying to set up Nutch 1.13 with Solr 6.6.1 so that it crawls the intranet. Many of our documents are accessed via SMB, and to make the URLs in the search results actually clickable, I want to enable Nutch to fetch the documents via SMB/jcifs.

So first I configured Nutch to scan URLs like smb://server/share.

Nutch logs that the smb protocol is unknown and therefore skips the URL (yes, it had already passed all the regex filters).

Then I installed the protocol-smb plugin from here: https://issues.apache.org/jira/browse/NUTCH-427

Nutch confirms that protocol-smb is loaded on startup and registered in the PluginRepository.

But right after that, Nutch again logs that the smb protocol is unknown and skips the URL.

So I wondered what had happened and went to check the plugin source code.

It seems that as soon as the protocol-smb plugin is instantiated, it writes a log message indicating this fact. Then it tries to register the SMB protocol URL handler with the JVM and again writes a log message. I have not seen either of these two messages.

Then I checked the Nutch 1.13 source code, especially the PluginRepository class. It detects and successfully registers the plugins, and the code comments describe it as sparing with resources by instantiating plugins only when they are required. So it is intentional that the protocol-smb plugin is registered but not instantiated, which creates a chicken-and-egg problem.

If the protocol plugin does not get instantiated, it cannot register its protocol. So although the plugin is registered, the smb://.... URLs will throw MalformedURLExceptions.

And more generally speaking: plugins are not able to initialize after being registered, only just before they are loaded. My feeling is that something is missing from the plugin lifecycle....
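
To illustrate the effect, here is a minimal sketch of a lazily instantiating registry. All class names are hypothetical and this is not the Nutch PluginRepository code; it only shows how a plugin can be "registered" while the setup in its constructor has not yet run.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical stand-in for a plugin registry; none of these names are Nutch APIs.
class LazyRegistryDemo {

    /** Set by the plugin constructor; stands in for "register the URL handler with the JVM". */
    static boolean handlerRegistered = false;

    /** Hypothetical plugin whose setup only happens when it is instantiated. */
    static class SmbLikePlugin {
        SmbLikePlugin() {
            handlerRegistered = true;
        }
    }

    /** Registry that stores factories and creates instances only on first use. */
    static class Registry {
        private final Map<String, Supplier<Object>> factories = new LinkedHashMap<>();
        private final Map<String, Object> instances = new LinkedHashMap<>();

        void register(String id, Supplier<Object> factory) {
            factories.put(id, factory); // nothing is instantiated here
        }

        boolean isRegistered(String id) {
            return factories.containsKey(id);
        }

        Object getInstance(String id) {
            // The factory runs only when the instance is requested for the first time.
            return instances.computeIfAbsent(id, k -> factories.get(k).get());
        }
    }

    public static void main(String[] args) {
        Registry registry = new Registry();
        registry.register("protocol-smb", SmbLikePlugin::new);
        System.out.println("registered: " + registry.isRegistered("protocol-smb"));
        System.out.println("handler registered before first use: " + handlerRegistered);
        registry.getInstance("protocol-smb");
        System.out.println("handler registered after first use: " + handlerRegistered);
    }
}
```

If nothing ever asks for the instance, because the URL is rejected before a protocol lookup happens, the constructor never runs.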

 

Any ideas? Or should this post go to the developers' list?

Hiran

Hiran Chaudhuri
Principal Support Engineer

Service Reliability Engineering - Custom

Amadeus Data Processing GmbH
Berghamer Strasse 6
85435 Erding
T: +49-8122-43x3662
[hidden email]

http://amadeus.com

 


Re: Nutch Plugin Lifecycle broken due to lazy loading?

Sebastian Nagel
Hi Hiran,

> Is it possible that the plugin lifecycle is broken or at least buggy?

The Nutch plugin system is complex but in general a good idea
(https://wiki.apache.org/nutch/WhyNutchHasAPluginSystem). It's definitely not broken, although there may
be issues (e.g., the recently fixed NUTCH-2378).

Regarding the protocol plugins: I haven't tried protocol-smb but other protocol plugins
(protocol-file or protocol-ftp) use the same mechanism to register the supported protocol:

The plugin.xml defines the supported protocol:

  <extension id="org.apache.nutch.protocol.smb" name="SMBProtocol"
             point="org.apache.nutch.protocol.Protocol">
    <implementation id="org.apache.nutch.protocol.smb.SMB"
                    class="org.apache.nutch.protocol.smb.SMB">
      <parameter name="protocolName" value="smb" />
    </implementation>
  </extension>

The check whether a protocol is supported by one of the registered plugins is done
using only the plugin.xml, without instantiating any protocol plugin.
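
As a rough sketch of what such a descriptor-only check can look like (this is not the actual Nutch implementation; the class and method names below are hypothetical), the protocolName parameter can be read from the XML without loading the implementation class:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Nutch: maps protocol names to implementation classes.
class ProtocolDescriptorScan {

    /** The extension element quoted above, as a Java string. */
    static final String SAMPLE =
          "<extension id=\"org.apache.nutch.protocol.smb\" name=\"SMBProtocol\""
        + " point=\"org.apache.nutch.protocol.Protocol\">"
        + "<implementation id=\"org.apache.nutch.protocol.smb.SMB\""
        + " class=\"org.apache.nutch.protocol.smb.SMB\">"
        + "<parameter name=\"protocolName\" value=\"smb\" />"
        + "</implementation></extension>";

    /** Returns protocolName -> implementation class name; no plugin class is loaded. */
    static Map<String, String> protocolsDeclaredIn(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> result = new HashMap<>();
            NodeList impls = doc.getElementsByTagName("implementation");
            for (int i = 0; i < impls.getLength(); i++) {
                Element impl = (Element) impls.item(i);
                NodeList params = impl.getElementsByTagName("parameter");
                for (int j = 0; j < params.getLength(); j++) {
                    Element param = (Element) params.item(j);
                    if ("protocolName".equals(param.getAttribute("name"))) {
                        result.put(param.getAttribute("value"), impl.getAttribute("class"));
                    }
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse descriptor", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(protocolsDeclaredIn(SAMPLE));
    }
}
```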

If the protocol "smb" is not supported you should find a message:
  org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=smb

You can try this via (here for file:// URLs):

  # file:// not supported (ProtocolNotFound exception)
  bin/nutch parsechecker -Dplugin.includes='protocol-http|parse-html' file://...
  Exception in thread "main" org.apache.nutch.protocol.ProtocolNotFound:
    protocol not found for url=file
        at org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:85)
        at org.apache.nutch.parse.ParserChecker.run(ParserChecker.java:137)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:268)

  # enable protocol-file and retry:
  bin/nutch parsechecker -Dplugin.includes='protocol-file|parse-html' file://...

If you see a MalformedURLException the problem is located somewhere else.
What is the exact error message and the full stack trace?

Thanks,
Sebastian

On 09/14/2017 09:35 PM, Hiran CHAUDHURI wrote:

> Hello there.
>
>  
>
> Is it possible that the plugin lifecycle is broken or at least buggy?


RE: [EXT] Re: Nutch Plugin Lifecycle broken due to lazy loading?

Hiran Chaudhuri
Hello Sebastian,

>> Is it possible that the plugin lifecycle is broken or at least buggy?
>
>The Nutch plugin system is complex but in general a good idea
>(https://wiki.apache.org/nutch/WhyNutchHasAPluginSystem). It's definitely not broken,
>although there may be issues (e.g., the recently fixed NUTCH-2378).
>
> Regarding the protocol plugins: I haven't tried protocol-smb but other protocol plugins
> (protocol-file or protocol-ftp) use the same mechanism to register the supported protocol:

I'm afraid the file and ftp protocols are not good examples, as they are known to the Java platform out of the box.
I just tried this sample application:

----8<----------------------------------------------------
package test;

import java.net.URL;

public class Test {

    public static void main(String[] args) throws Exception {
        new URL("http://foo/bar");
        new URL("https://foo/bar");
        new URL("file://foo/blar");
        new URL("ftp://foo/bar");
        new URL("smb://foo/bar");
        new URL("foo://bar/baz");
    }
   
}
---------------------------------------->8----------------

The output is, as expected, "Exception in thread "main" java.net.MalformedURLException: unknown protocol: smb".
The smb protocol, as well as the foo protocol, needs to be installed in the JVM by setting the system property java.protocol.handler.pkgs.
An example is visible at https://jcifs.samba.org/src/src/jcifs/Config.java, in the method registerSmbURLHandler().
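
For reference, a minimal sketch of setting that property from code. The property holds a '|'-separated list of package prefixes, and the JVM looks for a class named <package>.<protocol>.Handler. The helper names below are my own, and this is only in the spirit of the jcifs method, not a copy of it:

```java
import java.util.Arrays;

// Hypothetical helper; only the property name and the '|' separator come from the JDK.
class HandlerPkgs {

    static final String PROPERTY = "java.protocol.handler.pkgs";

    /** Returns the new property value with pkg appended, unless it is already listed. */
    static String withPackage(String current, String pkg) {
        if (current == null || current.isEmpty()) {
            return pkg;
        }
        if (Arrays.asList(current.split("\\|")).contains(pkg)) {
            return current; // already registered, keep the value unchanged
        }
        return current + "|" + pkg;
    }

    /** Applies the change to the running JVM. */
    static void register(String pkg) {
        System.setProperty(PROPERTY, withPackage(System.getProperty(PROPERTY), pkg));
    }

    public static void main(String[] args) {
        register("jcifs");
        System.out.println(PROPERTY + "=" + System.getProperty(PROPERTY));
    }
}
```

Setting the property alone is not enough: the handler class (for smb with the jcifs prefix, jcifs.smb.Handler) also has to be loadable when the JVM resolves the protocol.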

>The plugin.xml defines the supported protocol:
>
> <extension id="org.apache.nutch.protocol.smb" name="SMBProtocol"
>             point="org.apache.nutch.protocol.Protocol">
>    <implementation id="org.apache.nutch.protocol.smb.SMB"
>                    class="org.apache.nutch.protocol.smb.SMB">
>      <parameter name="protocolName" value="smb" />
>    </implementation>
>  </extension>
>
>The check whether a protocol is supported by one of the registered plugins is done without any protocol plugin instantiated just using the plugin.xml.

My feeling is that, whether or not this check happens, at some point Nutch calls the java.net.URL constructor, which relies not on the PluginRepository but on the JVM's own handler lookup, and that lookup is unaware of the new protocol.
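
A small variant of my test program makes this visible without aborting at the first failure. It is only a sketch (the class and method names are mine) and asks the plain JVM, with no Nutch involved, which schemes it can parse:

```java
import java.net.MalformedURLException;
import java.net.URL;

// Sketch: probes the JVM's own protocol lookup; no Nutch classes are involved.
class SchemeProbe {

    /** True if the JVM finds a stream handler for the scheme of the given URL string. */
    static boolean jvmKnows(String spec) {
        try {
            new URL(spec);
            return true;
        } catch (MalformedURLException e) {
            return false; // e.g. "unknown protocol: smb"
        }
    }

    public static void main(String[] args) {
        String[] specs = {
            "http://foo/bar", "https://foo/bar", "file://foo/bar",
            "ftp://foo/bar", "smb://foo/bar", "foo://bar/baz"
        };
        for (String spec : specs) {
            System.out.println(spec + " -> " + (jvmKnows(spec) ? "known" : "unknown protocol"));
        }
    }
}
```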

>If the protocol "smb" is not supported you should find a message:
>  org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=smb

>If you see a MalformedURLException the problem is located somewhere else.
>What is the exact error message and the full stack trace?

Here is what I get:
----8<----------------------------------------------------
Executing bin/crawl --index -D solr.server.url=http://172.17.0.9:8983/solr/nutch -D java.protocol.handler.pkgs=jcifs urls crawl 1
Injecting seed URLs
/nutch/bin/nutch inject crawl/crawldb urls
2017-09-18 19:25:20,324 INFO  org.apache.nutch.crawl.Injector - Injector: starting at 2017-09-18 19:25:20
2017-09-18 19:25:20,326 INFO  org.apache.nutch.crawl.Injector - Injector: crawlDb: crawl/crawldb
2017-09-18 19:25:20,327 INFO  org.apache.nutch.crawl.Injector - Injector: urlDir: urls
2017-09-18 19:25:20,327 INFO  org.apache.nutch.crawl.Injector - Injector: Converting injected urls to crawl db entries.
2017-09-18 19:25:20,610 WARN  org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2017-09-18 19:25:22,620 INFO  org.apache.nutch.plugin.PluginRepository - Plugins: looking in: /nutch/plugins
2017-09-18 19:25:22,851 WARN  org.apache.nutch.plugin.PluginRepository - Error while loading plugin `/nutch/plugins/parse-replace/plugin.xml` java.io.FileNotFoundException: /nutch/plugins/parse-replace/plugin.xml (No such file or directory)
2017-09-18 19:25:22,904 WARN  org.apache.nutch.plugin.PluginRepository - Error while loading plugin `/nutch/plugins/plugin/plugin.xml` java.io.FileNotFoundException: /nutch/plugins/plugin/plugin.xml (No such file or directory)
2017-09-18 19:25:22,956 WARN  org.apache.nutch.plugin.PluginRepository - Error while loading plugin `/nutch/plugins/publish-rabitmq/plugin.xml` java.io.FileNotFoundException: /nutch/plugins/publish-rabitmq/plugin.xml (No such file or directory)
2017-09-18 19:25:23,052 INFO  org.apache.nutch.plugin.PluginRepository - Plugin Auto-activation mode: [true]
2017-09-18 19:25:23,052 INFO  org.apache.nutch.plugin.PluginRepository - Registered Plugins:
2017-09-18 19:25:23,052 INFO  org.apache.nutch.plugin.PluginRepository -        Regex URL Filter (urlfilter-regex)
2017-09-18 19:25:23,052 INFO  org.apache.nutch.plugin.PluginRepository -        Html Parse Plug-in (parse-html)
2017-09-18 19:25:23,053 INFO  org.apache.nutch.plugin.PluginRepository -        HTTP Framework (lib-http)
2017-09-18 19:25:23,053 INFO  org.apache.nutch.plugin.PluginRepository -        the nutch core extension points (nutch-extensionpoints)
2017-09-18 19:25:23,053 INFO  org.apache.nutch.plugin.PluginRepository -        Basic Indexing Filter (index-basic)
2017-09-18 19:25:23,053 INFO  org.apache.nutch.plugin.PluginRepository -        Anchor Indexing Filter (index-anchor)
2017-09-18 19:25:23,053 INFO  org.apache.nutch.plugin.PluginRepository -        Tika Parser Plug-in (parse-tika)
2017-09-18 19:25:23,053 INFO  org.apache.nutch.plugin.PluginRepository -        Basic URL Normalizer (urlnormalizer-basic)
2017-09-18 19:25:23,053 INFO  org.apache.nutch.plugin.PluginRepository -        Regex URL Filter Framework (lib-regex-filter)
2017-09-18 19:25:23,053 INFO  org.apache.nutch.plugin.PluginRepository -        Regex URL Normalizer (urlnormalizer-regex)
2017-09-18 19:25:23,053 INFO  org.apache.nutch.plugin.PluginRepository -        CyberNeko HTML Parser (lib-nekohtml)
2017-09-18 19:25:23,053 INFO  org.apache.nutch.plugin.PluginRepository -        OPIC Scoring Plug-in (scoring-opic)
2017-09-18 19:25:23,053 INFO  org.apache.nutch.plugin.PluginRepository -        Pass-through URL Normalizer (urlnormalizer-pass)
2017-09-18 19:25:23,053 INFO  org.apache.nutch.plugin.PluginRepository -        SMB Protocol Plug-in (protocol-smb)
2017-09-18 19:25:23,053 INFO  org.apache.nutch.plugin.PluginRepository -        Http Protocol Plug-in (protocol-http)
2017-09-18 19:25:23,053 INFO  org.apache.nutch.plugin.PluginRepository -        File Protocol Plug-in (protocol-file)
2017-09-18 19:25:23,054 INFO  org.apache.nutch.plugin.PluginRepository -        SolrIndexWriter (indexer-solr)
2017-09-18 19:25:23,054 INFO  org.apache.nutch.plugin.PluginRepository - Registered Extension-Points:
2017-09-18 19:25:23,054 INFO  org.apache.nutch.plugin.PluginRepository -        Nutch Content Parser (org.apache.nutch.parse.Parser)
2017-09-18 19:25:23,054 INFO  org.apache.nutch.plugin.PluginRepository -        Nutch URL Filter (org.apache.nutch.net.URLFilter)
2017-09-18 19:25:23,054 INFO  org.apache.nutch.plugin.PluginRepository -        HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2017-09-18 19:25:23,054 INFO  org.apache.nutch.plugin.PluginRepository -        Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2017-09-18 19:25:23,054 INFO  org.apache.nutch.plugin.PluginRepository -        Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2017-09-18 19:25:23,055 INFO  org.apache.nutch.plugin.PluginRepository -        Nutch Publisher (org.apache.nutch.publisher.NutchPublisher)
2017-09-18 19:25:23,055 INFO  org.apache.nutch.plugin.PluginRepository -        Nutch Protocol (org.apache.nutch.protocol.Protocol)
2017-09-18 19:25:23,055 INFO  org.apache.nutch.plugin.PluginRepository -        Nutch URL Ignore Exemption Filter (org.apache.nutch.net.URLExemptionFilter)
2017-09-18 19:25:23,055 INFO  org.apache.nutch.plugin.PluginRepository -        Nutch Index Writer (org.apache.nutch.indexer.IndexWriter)
2017-09-18 19:25:23,055 INFO  org.apache.nutch.plugin.PluginRepository -        Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
2017-09-18 19:25:23,055 INFO  org.apache.nutch.plugin.PluginRepository -        Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2017-09-18 19:25:23,131 INFO  org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
2017-09-18 19:25:23,469 WARN  org.apache.nutch.crawl.Injector - Skipping smb://nas/Documents:java.net.MalformedURLException: unknown protocol: smb
2017-09-18 19:25:23,473 INFO  org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
2017-09-18 19:25:23,771 INFO  org.apache.nutch.crawl.Injector - Injector: overwrite: false
2017-09-18 19:25:23,772 INFO  org.apache.nutch.crawl.Injector - Injector: update: false
2017-09-18 19:25:24,285 INFO  org.apache.nutch.crawl.Injector - Injector: Total urls rejected by filters: 2
2017-09-18 19:25:24,285 INFO  org.apache.nutch.crawl.Injector - Injector: Total urls injected after normalization and filtering: 0
2017-09-18 19:25:24,285 INFO  org.apache.nutch.crawl.Injector - Injector: Total urls injected but already in CrawlDb: 0
2017-09-18 19:25:24,286 INFO  org.apache.nutch.crawl.Injector - Injector: Total new urls injected: 0
2017-09-18 19:25:24,288 INFO  org.apache.nutch.crawl.Injector - Injector: finished at 2017-09-18 19:25:24, elapsed: 00:00:03
---------------------------------------->8----------------

Note that both the plugin and the extension point are listed, and still there is this warning line reporting a MalformedURLException.

Hiran

Re: [EXT] Re: Nutch Plugin Lifecycle broken due to lazy loading?

Sebastian Nagel
Hi Hiran,

OK, got it. The problem is already described in https://issues.apache.org/jira/browse/NUTCH-427 :)

In this case, you're right. The plugin system wasn't designed to manipulate Java system properties.
But it should be possible to do it by adding a static hook which is called before instantiation.
The second problem would be the class loader encapsulation: the class java.net.URL is used in many
places and the protocol handler (jcifs.smb.Handler) must be globally available.

But be pragmatic - protocol-smb will not make it into the "official" Nutch package because of the
LGPL license [1].  To get protocol-smb working for "your" Nutch package:

1. Set the system property accordingly. If you use bin/nutch, modify it or pass the property via the
environment variable
    export NUTCH_OPTS=-Djava.protocol.handler.pkgs=jcifs

2. Make sure that the jcifs jar is added as a global dependency:
    - add it to ivy/ivy.xml
    - or copy it to runtime/local/lib/  (local mode for quick testing)
   (or alternatively copy jcifs/smb/Handler.java and its dependencies
    to your source tree)
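
To verify both steps, a small diagnostic along these lines may help. It is a sketch with hypothetical names, assuming the JVM resolves the property by looking for <package>.<protocol>.Handler through the system class loader:

```java
// Hypothetical diagnostic, not part of Nutch or jcifs.
class HandlerVisibility {

    /** Class name the JVM would look up for a package prefix and a protocol. */
    static String handlerClassName(String pkg, String protocol) {
        return pkg + "." + protocol + ".Handler";
    }

    /** True if the handler class can be loaded by the system class loader. */
    static boolean visibleToSystemLoader(String pkg, String protocol) {
        try {
            // initialize=false: only check that the class can be found, do not run static code
            Class.forName(handlerClassName(pkg, protocol), false,
                          ClassLoader.getSystemClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("java.protocol.handler.pkgs="
            + System.getProperty("java.protocol.handler.pkgs"));
        System.out.println(handlerClassName("jcifs", "smb") + " visible: "
            + visibleToSystemLoader("jcifs", "smb"));
    }
}
```

If the class is reported as not visible although the plugin is installed, the jar is most likely only on the plugin's class path and not on the global one.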

Best,
Sebastian

[1] https://www.apache.org/legal/resolved.html#category-x

On 09/18/2017 09:42 PM, Hiran CHAUDHURI wrote:

> Hello Sebastian,


RE: [EXT] Re: Nutch Plugin Lifecycle broken due to lazy loading?

Hiran Chaudhuri
>Hi Hiran,
>
>ok, got it. - the problem is already given in https://issues.apache.org/jira/browse/NUTCH-427 :)

Indeed, on rereading that issue, it describes exactly what I observed.

>In this case, you're right. The plugin system wasn't designed to manipulate Java system properties.

If it does not, then setting the system property when using the crawl script should have helped. But then I probably missed putting the jar on the system classpath.

> But it should be possible to do it by adding a static hook which is called before instantiation.

When you look at the protocol-smb plugin, it comes with this static hook, but as it is never executed, it does not help.

> The second problem would be the class loader encapsulation: the class java.net.URL is used in many places and the protocol handler (jcifs.smb.Handler) must be globally available.

True. That is where I almost assumed the Nutch configuration code would at some point collect all the protocol plugins (everything registered to the protocol extension point) and set the system property globally, but I could not find such code.

> But be pragmatic - protocol-smb will not make it into the "official" Nutch package because of the LGPL license [1].

That is understood. Although I could think of two other exercises that would help:
- create a tutorial to add some arbitrary protocol (e.g. the foo://bar/baz url)
- modify the protocol-smb plugin to make use of the smbclient binary.

I'd be willing to do the latter but would like to see a less clumsy behaviour for plugins. Adding the plugin plus modifying config files should be enough in my eyes.

> To make protocol-smb working for "your" Nutch package:
>
> 1. set the system property accordingly. If you use bin/nutch, modify it or pass it via the environment variable
>    export NUTCH_OPTS=-Djava.protocol.handler.pkgs=jcifs
>
> 2. make sure that the jcifs jar is added as global dependency
>    - add it to ivy/ivy.xml
>    - or copy it to runtime/local/lib/  (local mode for quick testing)
>   (or alternatively copy the jcifs/smb/Handler.java and dependencies
>    to your source tree)

So far I used
bin/crawl --index -D  solr.server.url=http://172.17.0.9:8983/solr/nutch -D  java.protocol.handler.pkgs=jcifs urls crawl

but I will try your hints. Will need a few days for this.

Hiran

Re: [EXT] Re: Nutch Plugin Lifecycle broken due to lazy loading?

Sebastian Nagel
> When you look at the protocol-smb hook it comes with this static hook,
> but as it is never executed does not help.

Yes, it has to be called.

> That is understood. Although I could think of two other exercises that would help:
> - create a tutorial to add some arbitrary protocol (e.g. the foo://bar/baz url)
> - modify the protocol-smb plugin to make use of the smbclient binary.
>
> I'd be willing to do the latter but would like to see a less clumsy behaviour for plugins.

Great! Nutch could not exist without voluntary work. Thanks!

Sorry, that integration will not be that easy. The problem has indeed been known for a long time and
should have been tested better; see also [1] and [2]. The class
org.apache.nutch.protocol.sftp.Handler (a dummy handler) has been lost; you'll find it in the zip
file attached to NUTCH-714.

However, I would not call encapsulation and lazy instantiation "clumsy behavior"; they are useful
for heavyweight plugins (e.g., parse-tika, which brings 50 MB of dependencies).

A solution should be possible. Actually, it's easy for protocol-sftp on 2.x:
 - place the dummy handler from the zip file in
    src/java/org/apache/nutch/protocol/sftp/Handler.java
   and recompile
 - register the package org.apache.nutch.protocol as handler
    e.g. NUTCH_OPTS=-Djava.protocol.handler.pkgs=org.apache.nutch.protocol
    but that could be done also from inside Nutch

   NUTCH_OPTS=-Djava.protocol.handler.pkgs=org.apache.nutch.protocol \
     .../2.x/runtime/local/bin/nutch parsechecker -Dplugin.includes='protocol-sftp' sftp://..

   (no MalformedURLException)

 - in case the handler class shall live in the plugin's protocol-sftp.jar, some more work would
   be necessary to make it available also in the main class loader, not only in the plugin's child
   class loader. But as it could be a small class without dependencies, a solution should be
   possible.
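
To illustrate the dummy-handler idea above, here is a minimal sketch. It is not the lost class from NUTCH-714; the class names DummySftpHandler and DummySftpHandlerDemo are made up for this example. For the java.protocol.handler.pkgs lookup the class would have to be named org.apache.nutch.protocol.sftp.Handler. The sketch passes the handler to the URL constructor explicitly, so it runs without any system property.

```java
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;

/**
 * Hypothetical stand-in for a dummy handler. It only makes sftp:// URLs
 * parseable; it does not fetch anything.
 */
class DummySftpHandler extends URLStreamHandler {

    @Override
    protected int getDefaultPort() {
        return 22; // standard SSH/SFTP port
    }

    @Override
    protected URLConnection openConnection(URL u) throws IOException {
        // Fetching is the job of the protocol plugin, not of java.net.URL.
        throw new IOException("not supported: use the protocol plugin to fetch " + u);
    }
}

class DummySftpHandlerDemo {

    /** Parses a URL with the dummy handler supplied explicitly. */
    static URL parse(String spec) {
        try {
            return new URL(null, spec, new DummySftpHandler());
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        URL url = parse("sftp://example.org/pub/file.txt");
        System.out.println(url.getProtocol() + " " + url.getHost()
            + " default port " + url.getDefaultPort());
    }
}
```

Such a class has no dependencies beyond the JDK, which is why it could live on the main class path even when the rest of the plugin stays in its own class loader.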

Thanks, looking forward to seeing how you solve it,
Sebastian

[1] https://issues.apache.org/jira/browse/NUTCH-714
[2] http://grokbase.com/t/nutch/dev/1192bgy9fc/protocol-not-found-or-malformedurl-protocol-sftp

On 09/19/2017 06:52 AM, Hiran CHAUDHURI wrote:

>> Hi Hiran,
>>
>> ok, got it. - the problem is already given in https://issues.apache.org/jira/browse/NUTCH-427 :)
>
> Indeed - when rereading that article it exactly describes my perception.
>
>> In this case, you're right. The plugin system wasn't designed to manipulate Java system properties.
>
> If it does not then setting the system property when using the crawl script should have helped - but then I probably missed putting the jar into the system classpath.
>
>> But it should be possible to do it by adding a static hook which is called before instantiation.
>
> When you look at the protocol-smb hook it comes with this static hook, but as it is never executed does not help.
>
>> The second problem would be the class loader encapsulation: the class java.net.URL is used in many places and the protocol handler (jcifs.smb.Handler) must be globally available.
>
> True. That is where I almost assumed the Nutch configuration code would at some point collect all the protocol plugins (everything registered to the protocol extension point) and set the system property globally but could not find it.
>
>> But be pragmatic - protocol-smb will not make it into the "official" Nutch package because of the LGPL license [1].
>
> That is understood. Although I could think of two other exercises that would help:
> - create a tutorial to add some arbitrary protocol (e.g. the foo://bar/baz url)
> - modify the protocol-smb plugin to make use of the smbclient binary.
>
> I'd be willing to do the latter but would like to see a less clumsy behaviour for plugins. Adding the plugin plus modifying config files should be enough in my eyes.
>
>> To make protocol-smb working for "your" Nutch package:
>>
>> 1. set the system property accordingly. If you use bin/nutch, modify it or pass it via the environment variable
>>    export NUTCH_OPTS=-Djava.protocol.handler.pkgs=jcifs
>>
>> 2. make sure that the jcifs jar is added as global dependency
>>    - add it to ivy/ivy.xml
>>    - or copy it to runtime/local/lib/  (local mode for quick testing)
>>   (or alternatively copy the jcifs/smb/Handler.java and dependencies
>>    to your source tree)
>
> So far I used
> bin/crawl --index -D  solr.server.url=http://172.17.0.9:8983/solr/nutch -D  java.protocol.handler.pkgs=jcifs urls crawl
>
> but I will try your hints. Will need a few days for this.
>
> Hiran
>


RE: [EXT] Re: Nutch Plugin Lifecycle broken due to lazy loading?

Hiran Chaudhuri
>> When you look at the protocol-smb hook it comes with this static hook,
>> but as it is never executed does not help.
>
>Yes, it has to be called.

So when would Nutch call this static hook? In practice this does not happen before the plugin is required, but by then it is too late because the MalformedURLException has already been thrown.
And this approach cannot cover the classpath issue.
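
The timing problem can be shown in plain Java, independent of Nutch. This is a sketch under assumptions: StaticHookDemo and FooPlugin are hypothetical names, and the static block stands in for the registration hook of a protocol plugin. A class that is only known by name, without being initialized, never runs its static block; the hook fires only on first real use.

```java
import java.util.ArrayList;
import java.util.List;

class StaticHookDemo {
    /** Records side effects so the order of events can be inspected. */
    static final List<String> EVENTS = new ArrayList<>();

    /** Hypothetical plugin class; the static block plays the role of the hook. */
    static class FooPlugin {
        static {
            EVENTS.add("static hook ran");
        }
    }

    /**
     * Returns two counts: hook executions after the class was merely loaded
     * by name, and hook executions after the first instantiation.
     */
    static int[] observe() {
        try {
            // initialize=false: the class is loaded, but its static block does not run
            Class.forName(FooPlugin.class.getName(), false,
                StaticHookDemo.class.getClassLoader());
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
        int afterLoad = EVENTS.size();

        new FooPlugin(); // first real use triggers class initialization
        int afterInstantiation = EVENTS.size();

        return new int[] { afterLoad, afterInstantiation };
    }
}
```

On a first call, observe() reports no hook execution after loading and exactly one after instantiation. Any URL parsed between those two points would not yet see a handler that the hook registers.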

>> - create a tutorial to add some arbitrary protocol (e.g. the  foo://bar/baz url)
>> - modify the protocol-smb plugin to make use of the smbclient binary.
>>
>> I'd be willing to do the latter but would like to see a less clumsy behaviour for plugins.
>
>Great! Nutch could not exist without voluntary work. Thanks!
>
>Sorry, that integration will not be that easy. The problem was indeed already known since long and should have been better tested, see also [1] and [2] - the class org.apache.nutch.protocol.sftp.Handler (a dummy handler) has been lost, you'll find it in the zip file attached to NUTCH-714.
>
>However, encapsulation and lazy instantiation I would not call "clumsy behavior", it's useful for heavy-weight plugins (e.g., parse-tika which brings 50 MB dependencies).

Both concepts, encapsulation and lazy instantiation, are great. What I call clumsy is that the encapsulation does not work. Look at it from the perspective of a user of the protocol-smb plugin.
It comes as a (set of) jars, together with an XML descriptor. This could be nicely wrapped in a zip file, making it a single artifact that can easily be versioned and distributed.

But as soon as I want to install it, I have to
1 - put the artifact into the plugins directory
2 - modify Nutch configuration files to allow smb:// urls and add the plugin to the list of loaded plugins
3 - extract jcifs.jar and place it on the system classpath
4 - run nutch with the correct system property

While items 1 and 2 can be understood easily and maybe one day come with a nice management interface, items 3 and 4 require knowledge about the internals of the plugin. Where did the encapsulation go? This is where I'd like to improve, and I have an idea how that could be established. Need to test it though.

>Thanks, looking forward how you get it solved, Sebastian

It seems I may need some support to go further. Maybe, with your help, two documents could come out of this:
- Building nutch from source
- Developing a (protocol) plugin

I would need the first to test modifications to the plugin system.
With the second I would then create an SMB plugin that is subject to limitations other than the LGPL. ;-)

Hiran

RE: [EXT] Re: Nutch Plugin Lifecycle broken due to lazy loading?

Yossi Tamari
Hi Hiran,

I recently needed the documents you requested myself, and the two below were the most helpful. Keep in mind that like most Nutch documentation, they are not totally up to date, so you need to be a bit flexible.
The most important difference for me was getting the source from GitHub rather than SVN.

https://wiki.apache.org/nutch/RunNutchInEclipse
https://florianhartl.com/nutch-plugin-tutorial.html



-----Original Message-----
From: Hiran CHAUDHURI [mailto:[hidden email]]
Sent: 20 September 2017 09:50
To: [hidden email]
Subject: RE: [EXT] Re: Nutch Plugin Lifecycle broken due to lazy loading?

>> When you look at the protocol-smb hook it comes with this static
>> hook, but as it is never executed does not help.
>
>Yes, it has to be called.

So when would Nutch call this static hook? In practice this does not happen before the plugin is required, but then it is too late as the MalformedURLException is thrown already.
And this approach cannot cover the classpath issue.

>> - create a tutorial to add some arbitrary protocol (e.g. the  
>> foo://bar/baz url)
>> - modify the protocol-smb plugin to make use of the smbclient binary.
>>
>> I'd be willing to do the latter but would like to see a less clumsy behaviour for plugins.
>
>Great! Nutch could not exist without voluntary work. Thanks!
>
>Sorry, that integration will not be that easy. The problem was indeed already known since long and should have been better tested, see also [1] and [2] - the class org.apache.nutch.protocol.sftp.Handler (a dummy handler) has been lost, you'll find it in the zip file attached to NUTCH-714.
>
>However, encapsulation and lazy instantiation I would not call "clumsy behavior", it's useful for heavy-weight plugins (e.g., parse-tika which brings 50 MB dependencies).

Both concepts, encapsulation and lazy instantiation are great. What I call clumsy is that the encapsulation does not work. Look at it from a user perspective of the protocol-smb plugin.
It comes as a (set of) jars, together with an XML descriptor. This could be nicely wrapped in a zip file and thus is one artifact that can easily be versioned and distributed.

But as soon as I want to install it, I have to
1 - put the artifact into the plugins directory
2 - modify Nutch configuration files to allow smb:// urls plus include the plugin to the loaded list
3 - extract jcifs.jar and place it on the system classpath
4 - run nutch with the correct system property

While items 1 and 2 can be understood easily and maybe one day come with a nice management interface, items 3 and 4 require knowledge about the internals of the plugin. Where did the encapsulation go? This is where I'd like to improve, and I have an idea how that could be established. Need to test it though.

>Thanks, looking forward how you get it solved, Sebastian

It seems I may need some support to go further. Maybe as you help me two documents could arise:
- Building nutch from source
- Developing a (protocol) plugin

I would need the first to test modifications to the plugin system.
Then with the second I would create a smb plugin that would suffer other limitations than the LGPL. ;-)

Hiran


RE: [EXT] Re: Nutch Plugin Lifecycle broken due to lazy loading?

Hiran Chaudhuri
In reply to this post by Hiran Chaudhuri
Hello all.

This time following up on my own post...

>>> When you look at the protocol-smb hook it comes with this static
>>> hook, but as it is never executed does not help.
>>
>>Yes, it has to be called.
>
>So when would Nutch call this static hook? In practice this does not happen before the plugin is required, but then it is too late as the MalformedURLException is thrown already.
>And this approach cannot cover the classpath issue.

It seems Nutch would never call this static hook. That is why I patched the PluginRepository class.

>>> - create a tutorial to add some arbitrary protocol (e.g. the  
>>> foo://bar/baz url)
>>> - modify the protocol-smb plugin to make use of the smbclient binary.
>>>
>>> I'd be willing to do the latter but would like to see a less clumsy behaviour for plugins.
>>
>>Great! Nutch could not exist without voluntary work. Thanks!
>>
>>Sorry, that integration will not be that easy. The problem was indeed already known since long and should have been better tested, see also [1] and [2] - the class
>>org.apache.nutch.protocol.sftp.Handler (a dummy handler) has been lost, you'll find it in the zip file attached to NUTCH-714.
>>
>>However, encapsulation and lazy instantiation I would not call "clumsy behavior", it's useful for heavy-weight plugins (e.g., parse-tika which brings 50 MB dependencies).
>
>Both concepts, encapsulation and lazy instantiation are great. What I call clumsy is that the encapsulation does not work. Look at it from a user perspective of the protocol-smb plugin.
>It comes as a (set of) jars, together with an XML descriptor. This could be nicely wrapped in a zip file and thus is one artifact that can easily be versioned and distributed.
>
>But as soon as I want to install it, I have to
>1 - put the artifact into the plugins directory
>2 - modify Nutch configuration files to allow smb:// urls plus include the plugin to the loaded list
>3 - extract jcifs.jar and place it on the system classpath
>4 - run nutch with the correct system property
>
>While items 1 and 2 can be understood easily and maybe one day come with a nice management interface, items 3 and 4 require knowledge about the internals of the plugin.
>Where did the encapsulation go? This is where I'd like to improve, and I have an idea how that could be established. Need to test it though.

I have a solution that makes steps 3 and 4 obsolete.

>I would need the first to test modifications to the plugin system.
>Then with the second I would create a smb plugin that would suffer other limitations than the LGPL. ;-)

So here is the solution to the first step: the modified plugin system. It is available here; however, I am not sure how to create the pull request...
https://github.com/HiranChaudhuri/nutch/commit/dc9cbeb3da7ca021e2cce322482d2eaa1ec15b28

Next will be one example plugin and the mentioned protocol-smb.

Hiran

RE: [EXT] Re: Nutch Plugin Lifecycle broken due to lazy loading?

Yossi Tamari
Hi Hiran,

Your code calls setURLStreamHandlerFactory, the documentation for which says "This method can be called at most once in a given Java Virtual Machine". Isn't this going to be a problem?
https://docs.oracle.com/javase/8/docs/api/java/net/URL.html#setURLStreamHandlerFactory-java.net.URLStreamHandlerFactory-
Additionally, does this URLStreamHandlerFactory successfully load the standard handlers (HTTP, HTTPS...)? I would expect it to fail on these.

To be able to create a pull request, your repository needs to be a fork of the original repository, which does not seem to be the case here.

        Yossi.

-----Original Message-----
From: Hiran CHAUDHURI [mailto:[hidden email]]
Sent: 22 September 2017 11:54
To: [hidden email]
Subject: RE: [EXT] Re: Nutch Plugin Lifecycle broken due to lazy loading?

Hello all.

This time following up on my own post...

>>> When you look at the protocol-smb hook it comes with this static
>>> hook, but as it is never executed does not help.
>>
>>Yes, it has to be called.
>
>So when would Nutch call this static hook? In practice this does not happen before the plugin is required, but then it is too late as the MalformedURLException is thrown already.
>And this approach cannot cover the classpath issue.

It seems Nutch would never call this static hook. That is why I patched the PluginRepository class.

>>> - create a tutorial to add some arbitrary protocol (e.g. the
>>> foo://bar/baz url)
>>> - modify the protocol-smb plugin to make use of the smbclient binary.
>>>
>>> I'd be willing to do the latter but would like to see a less clumsy behaviour for plugins.
>>
>>Great! Nutch could not exist without voluntary work. Thanks!
>>
>>Sorry, that integration will not be that easy. The problem was indeed
>>already known since long and should have been better tested, see also [1] and [2] - the class org.apache.nutch.protocol.sftp.Handler (a dummy handler) has been lost, you'll find it in the zip file attached to NUTCH-714.
>>
>>However, encapsulation and lazy instantiation I would not call "clumsy behavior", it's useful for heavy-weight plugins (e.g., parse-tika which brings 50 MB dependencies).
>
>Both concepts, encapsulation and lazy instantiation are great. What I call clumsy is that the encapsulation does not work. Look at it from a user perspective of the protocol-smb plugin.
>It comes as a (set of) jars, together with an XML descriptor. This could be nicely wrapped in a zip file and thus is one artifact that can easily be versioned and distributed.
>
>But as soon as I want to install it, I have to
>1 - put the artifact into the plugins directory
>2 - modify Nutch configuration files to allow smb:// urls plus include
>the plugin to the loaded list
>3 - extract jcifs.jar and place it on the system classpath
>4 - run nutch with the correct system property
>
>While items 1 and 2 can be understood easily and maybe one day come with a nice management interface, items 3 and 4 require knowledge about the internals of the plugin.
>Where did the encapsulation go? This is where I'd like to improve, and I have an idea how that could be established. Need to test it though.

I have a solution that makes steps 3 and 4 obsolete.

>I would need the first to test modifications to the plugin system.
>Then with the second I would create a smb plugin that would suffer
>other limitations than the LGPL. ;-)

So here is the solution to the first step - the modified plugin system. It is available here, however I am not sure how to create the pull request...
https://github.com/HiranChaudhuri/nutch/commit/dc9cbeb3da7ca021e2cce322482d2eaa1ec15b28

Next will be one example plugin and the mentioned protocol-smb.

Hiran


RE: [EXT] Re: Nutch Plugin Lifecycle broken due to lazy loading?

Hiran Chaudhuri
>Hi Hiran,
>
>Your code calls setURLStreamHandlerFactory, the documentation for which says "This method can be called at most once in a given Java Virtual Machine". Isn't this going to be a problem?
>https://docs.oracle.com/javase/8/docs/api/java/net/URL.html#setURLStreamHandlerFactory-java.net.URLStreamHandlerFactory-

I thought of falling back to the already-installed URLStreamHandlerFactory so that the factories can be chained. However, there is no method to read the current value, so that attempt died.
When debugging the procedure I found that the URLStreamHandlerFactory was null during normal application runs, and it was the same when invoking Nutch. So for the time being I do not see a problem here. One could arise if a single plugin set the factory, but then it is wiser to do this at the application level (Nutch) than in any plugin.
To come back to your question: I believe that by making use of that feature we would reduce the risk of plugin developers getting creative. Therefore I rate it a general improvement.

>Additionally, does this URLStreamHandlerFactory successfully load the standard handlers (HTTP, HTTPS...)? I would expect it to fail on these.

Yes, it fails on these and returns null, which makes the JVM continue as if no URLStreamHandlerFactory were installed. So no harm is done to the well-known protocols unless plugins override them.
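
The fall-through behaviour can be sketched as follows. This is not the code from the commit; PluginProtocolFactory, ParseOnlyHandler and the register method are hypothetical names chosen for illustration. The factory answers only for protocols it was told about and returns null for everything else, which is the signal for the JVM to use its built-in lookup.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch of a factory that only knows protocols contributed by plugins. */
class PluginProtocolFactory implements URLStreamHandlerFactory {
    private final Map<String, URLStreamHandler> handlers = new ConcurrentHashMap<>();

    /** Hypothetical helper: associates a protocol name with a handler. */
    void register(String protocol, URLStreamHandler handler) {
        handlers.put(protocol, handler);
    }

    @Override
    public URLStreamHandler createURLStreamHandler(String protocol) {
        // null means "not mine": http, https, file etc. stay with the JVM defaults
        return handlers.get(protocol);
    }
}

/** Handler that makes a URL parseable but refuses to open connections. */
class ParseOnlyHandler extends URLStreamHandler {
    @Override
    protected URLConnection openConnection(URL u) throws IOException {
        throw new IOException("fetching is done by the protocol plugin: " + u);
    }
}

class PluginProtocolFactoryDemo {
    /** Builds a factory that knows only the example protocol "foo". */
    static PluginProtocolFactory fooFactory() {
        PluginProtocolFactory factory = new PluginProtocolFactory();
        factory.register("foo", new ParseOnlyHandler());
        return factory;
    }

    public static void main(String[] args) {
        PluginProtocolFactory factory = fooFactory();
        System.out.println("foo handled:  " + (factory.createURLStreamHandler("foo") != null));
        System.out.println("http handled: " + (factory.createURLStreamHandler("http") != null));
    }
}
```

The sketch deliberately does not install the factory, so it can be run repeatedly in one JVM.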

>To be able to create a pull request, your repository needs to be a fork of the original repository, which does not seem to be the case here.

I thought I had forked from gitbox.apache.org, but something may be broken. Do you have an idea how I could fix this?

Hiran

Re: [EXT] Re: Nutch Plugin Lifecycle broken due to lazy loading?

Sebastian Nagel
In reply to this post by Yossi Tamari
Hi Hiran, hi Yossi,

excellent idea to register the PluginRepository as URLStreamHandlerFactory! It's the only class which knows
the protocols supported by active plugins.

> Your code call setURLStreamHandlerFactory, the documentation for which says "This method can be
> called at most once in a given Java Virtual Machine". Isn't this going to be a problem?

It's not called anywhere else in Nutch. Of course, it must be made sure that
setURLStreamHandlerFactory is not called by any library which is loaded
before the PluginRepository is instantiated. But this should not happen.
Needs to be tested, of course. We should possibly also catch the exception in case the call is not the first one.
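
A guarded installation might look like the sketch below. FactoryInstaller and tryInstall are hypothetical names, not part of Nutch. Note that the JDK signals the second call with a java.lang.Error carrying the message "factory already defined", not with a checked exception, so that is what the sketch catches.

```java
import java.net.URL;
import java.net.URLStreamHandlerFactory;

class FactoryInstaller {

    /**
     * Tries to install the JVM-wide factory.
     * Returns true if this call installed it, false if a factory was already set.
     */
    static boolean tryInstall(URLStreamHandlerFactory factory) {
        try {
            URL.setURLStreamHandlerFactory(factory);
            return true;
        } catch (Error e) {
            if ("factory already defined".equals(e.getMessage())) {
                return false; // somebody else was first
            }
            throw e; // unrelated errors are not swallowed
        }
    }

    public static void main(String[] args) {
        // A factory that handles nothing itself, so all protocols fall through
        URLStreamHandlerFactory passThrough = protocol -> null;
        System.out.println("first attempt installed:  " + tryInstall(passThrough));
        System.out.println("second attempt installed: " + tryInstall(passThrough));
    }
}
```

Whether a refused installation should be logged, ignored or treated as fatal is a design decision this sketch leaves open.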

> Additionally, does this URLStreamHandlerFactory successfully load the standard handlers (HTTP,
> HTTPS...)? I would expect it to fail on these.

According to
https://docs.oracle.com/javase/8/docs/api/java/net/URL.html#URL-java.lang.String-java.lang.String-int-java.lang.String-
it should fall through and use the default handlers, right?

> To be able to create a pull request, your repository needs to be a fork of the original
> repository, which does not seem to be the case here.

Please, also open an issue on the Nutch Jira - it's required to properly track the changes and
for the release notes. A pull request on github which contains the Jira ID (NUTCH-XXXX) is then
automatically tracked on Jira.  Also, please use the Eclipse code formatting template
(https://github.com/apache/nutch/blob/master/eclipse-codeformat.xml). Thanks!

Best,
Sebastian

On 09/22/2017 11:10 AM, Yossi Tamari wrote:

> Hi Hiran,
>
> Your code calls setURLStreamHandlerFactory, the documentation for which says "This method can be called at most once in a given Java Virtual Machine". Isn't this going to be a problem?
> https://docs.oracle.com/javase/8/docs/api/java/net/URL.html#setURLStreamHandlerFactory-java.net.URLStreamHandlerFactory-
> Additionally, does this URLStreamHandlerFactory successfully load the standard handlers (HTTP, HTTPS...)? I would expect it to fail on these.
>
> To be able to create a pull request, your repository needs to be a fork of the original repository, which does not seem to be the case here.
>
> Yossi.
>
> -----Original Message-----
> From: Hiran CHAUDHURI [mailto:[hidden email]]
> Sent: 22 September 2017 11:54
> To: [hidden email]
> Subject: RE: [EXT] Re: Nutch Plugin Lifecycle broken due to lazy loading?
>
> Hello all.
>
> This time following up on my own post...
>
>>>> When you look at the protocol-smb hook it comes with this static
>>>> hook, but as it is never executed does not help.
>>>
>>> Yes, it has to be called.
>>
>> So when would Nutch call this static hook? In practice this does not happen before the plugin is required, but then it is too late as the MalformedURLException is thrown already.
>> And this approach cannot cover the classpath issue.
>
> It seems Nutch would never call this static hook. That is why I patched the PluginRepository class.
>
>>>> - create a tutorial to add some arbitrary protocol (e.g. the
>>>> foo://bar/baz url)
>>>> - modify the protocol-smb plugin to make use of the smbclient binary.
>>>>
>>>> I'd be willing to do the latter but would like to see a less clumsy behaviour for plugins.
>>>
>>> Great! Nutch could not exist without voluntary work. Thanks!
>>>
>>> Sorry, that integration will not be that easy. The problem was indeed
>>> already known since long and should have been better tested, see also [1] and [2] - the class org.apache.nutch.protocol.sftp.Handler (a dummy handler) has been lost, you'll find it in the zip file attached to NUTCH-714.
>>>
>>> However, encapsulation and lazy instantiation I would not call "clumsy behavior", it's useful for heavy-weight plugins (e.g., parse-tika which brings 50 MB dependencies).
>>
>> Both concepts, encapsulation and lazy instantiation are great. What I call clumsy is that the encapsulation does not work. Look at it from a user perspective of the protocol-smb plugin.
>> It comes as a (set of) jars, together with an XML descriptor. This could be nicely wrapped in a zip file and thus is one artifact that can easily be versioned and distributed.
>>
>> But as soon as I want to install it, I have to
>> 1 - put the artifact into the plugins directory
>> 2 - modify Nutch configuration files to allow smb:// urls plus include
>> the plugin to the loaded list
>> 3 - extract jcifs.jar and place it on the system classpath
>> 4 - run nutch with the correct system property
>>
>> While items 1 and 2 can be understood easily and maybe one day come with a nice management interface, items 3 and 4 require knowledge about the internals of the plugin.
>> Where did the encapsulation go? This is where I'd like to improve, and I have an idea how that could be established. Need to test it though.
>
> I have a solution that makes steps 3 and 4 obsolete.
>
>> I would need the first to test modifications to the plugin system.
>> Then with the second I would create a smb plugin that would suffer
>> other limitations than the LGPL. ;-)
>
> So here is the solution to the first step - the modified plugin system. It is available here, however I am not sure how to create the pull request...
> https://github.com/HiranChaudhuri/nutch/commit/dc9cbeb3da7ca021e2cce322482d2eaa1ec15b28
>
> Next will be one example plugin and the mentioned protocol-smb.
>
> Hiran
>


RE: [EXT] Re: Nutch Plugin Lifecycle broken due to lazy loading?

Yossi Tamari
In reply to this post by Hiran Chaudhuri
Fork from https://github.com/apache/nutch.

-----Original Message-----
From: Hiran CHAUDHURI [mailto:[hidden email]]
Sent: 22 September 2017 12:27
To: [hidden email]
Subject: RE: [EXT] Re: Nutch Plugin Lifecycle broken due to lazy loading?

>Hi Hiran,
>
>Your code calls setURLStreamHandlerFactory, the documentation for which says "This method can be called at most once in a given Java Virtual Machine". Isn't this going to be a problem?
>https://docs.oracle.com/javase/8/docs/api/java/net/URL.html#setURLStreamHandlerFactory-java.net.URLStreamHandlerFactory-

I thought of falling back to the already-installed URLStreamHandlerFactory so they can chain up. However there is no method to read the current value, so that attempt died.
When debugging the procedure I found out that the URLStreamHandlerFactory was null during normal application runs, and it was the same on invoking Nutch. So for the time being I do not see a problem here. It could arise if a single plugin would set the factory - but then it is wiser to do this on application level (nutch) than in any plugin.
To come back to your question: I believe by making use of that feature we would reduce the risk for plugin developers to get creative. Therefore I rate it a general improvement.

>Additionally, does this URLStreamHandlerFactory successfully load the standard handlers (HTTP, HTTPS...)? I would expect it to fail on these.

Yes, it fails on these and returns null. Which triggers the JVM to just continue as if there were no URLStreamHandlerFactory installed. So no harm done for the well-known protocols if not overridden by plugins.

>To be able to create a pull request, your repository needs to be a fork of the original repository, which does not seem to be the case here.

I thought to have forked from gitbox.apache.org but then something may be broken. Do you have an idea how I could fix this?

Hiran


Re: [EXT] Re: Nutch Plugin Lifecycle broken due to lazy loading?

Sebastian Nagel
In reply to this post by Hiran Chaudhuri
> I thought to have forked from gitbox.apache.org but then something may be broken. Do you have an
> idea how I could fix this?

You can directly fork from
   https://github.com/apache/nutch/
(gitbox.apache.org is a 1:1 mirror)

See also
   https://github.com/apache/nutch/tree/master#contributing


On 09/22/2017 11:26 AM, Hiran CHAUDHURI wrote:

>> Hi Hiran,
>>
>> Your code calls setURLStreamHandlerFactory, the documentation for which says "This method can be called at most once in a given Java Virtual Machine". Isn't this going to be a problem?
>> https://docs.oracle.com/javase/8/docs/api/java/net/URL.html#setURLStreamHandlerFactory-java.net.URLStreamHandlerFactory-
>
> I thought of falling back to the already-installed URLStreamHandlerFactory so they can chain up. However there is no method to read the current value, so that attempt died.
> When debugging the procedure I found out that the URLStreamHandlerFactory was null during normal application runs, and it was the same on invoking Nutch. So for the time being I do not see a problem here. It could arise if a single plugin would set the factory - but then it is wiser to do this on application level (nutch) than in any plugin.
> To come back to your question: I believe by making use of that feature we would reduce the risk for plugin developers to get creative. Therefore I rate it a general improvement.
>
>> Additionally, does this URLStreamHandlerFactory successfully load the standard handlers (HTTP, HTTPS...)? I would expect it to fail on these.
>
> Yes, it fails on these and returns null. Which triggers the JVM to just continue as if there were no URLStreamHandlerFactory installed. So no harm done for the well-known protocols if not overridden by plugins.
>
>> To be able to create a pull request, your repository needs to be a fork of the original repository, which does not seem to be the case here.
>
> I thought to have forked from gitbox.apache.org but then something may be broken. Do you have an idea how I could fix this?
>
> Hiran
>


RE: [EXT] Re: Nutch Plugin Lifecycle broken due to lazy loading?

Hiran Chaudhuri
>> I thought to have forked from gitbox.apache.org but then something may be broken. Do you have an
>> idea how I could fix this?
>
>You can directly fork from
>   https://github.com/apache/nutch/
>(gitbox.apache.org is a 1:1 mirror)

Done. And committed again.
I also created a Jira issue and the pull request.

:-)

Re: [EXT] Re: Nutch Plugin Lifecycle broken due to lazy loading?

Sebastian Nagel
> I also created a Jira issue and the pull request.

Great! NUTCH-2429 will be reviewed and tested during the next days.

Thanks!

On 09/22/2017 12:26 PM, Hiran CHAUDHURI wrote:

>>> I thought to have forked from gitbox.apache.org but then something may be broken. Do you have an
>>> idea how I could fix this?
>>
>> You can directly fork from
>>   https://github.com/apache/nutch/
>> (gitbox.apache.org is a 1:1 mirror)
>
> Done. And committed again.
> I also created a Jira issue and the pull request.
>
> :-)
>


RE: [EXT] Re: Nutch Plugin Lifecycle broken due to lazy loading?

Hiran Chaudhuri
In reply to this post by Sebastian Nagel
When trying to run the example protocol-foo plugin (I am writing it), I was able to pass the injector and generator phases, but it seems the fetch phase fails.

From the log I have, it seems the fetcher tries to resolve URLs before the PluginRepository is initialized. Such behaviour would of course render protocol plugins as a whole useless...

So yes, the whole construct still needs to be tested carefully.

2017-09-23 08:13:06,783 INFO  fetcher.FetchItemQueues - Using queue mode : byHost
2017-09-23 08:13:06,785 INFO  fetcher.Fetcher - Fetcher: threads: 50
2017-09-23 08:13:06,785 INFO  fetcher.Fetcher - Fetcher: time-out divisor: 2
2017-09-23 08:13:06,836 INFO  plugin.PluginRepository - Plugins: looking in: /home/hiran/dev/nutch/runtime/local/plugins
2017-09-23 08:13:06,845 WARN  fetcher.FetchItem - Cannot parse url: foo://example.com
java.net.MalformedURLException: unknown protocol: foo
        at java.net.URL.<init>(URL.java:600)
        at java.net.URL.<init>(URL.java:490)
        at java.net.URL.<init>(URL.java:439)
        at org.apache.nutch.fetcher.FetchItem.create(FetchItem.java:71)
        at org.apache.nutch.fetcher.FetchItem.create(FetchItem.java:63)
        at org.apache.nutch.fetcher.FetchItemQueues.addFetchItem(FetchItemQueues.java:87)
        at org.apache.nutch.fetcher.QueueFeeder.run(QueueFeeder.java:91)
2017-09-23 08:13:06,899 INFO  fetcher.QueueFeeder - QueueFeeder finished: total 2 records + hit by time limit :0
2017-09-23 08:13:07,508 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
2017-09-23 08:13:07,508 INFO  plugin.PluginRepository - Registered Plugins:
2017-09-23 08:13:07,508 INFO  plugin.PluginRepository -         Regex URL Filter (urlfilter-regex)
2017-09-23 08:13:07,508 INFO  plugin.PluginRepository -         Html Parse Plug-in (parse-html)
2017-09-23 08:13:07,508 INFO  plugin.PluginRepository -         HTTP Framework (lib-http)
2017-09-23 08:13:07,508 INFO  plugin.PluginRepository -         the nutch core extension points (nutch-extensionpoints)
2017-09-23 08:13:07,508 INFO  plugin.PluginRepository -         Basic Indexing Filter (index-basic)
2017-09-23 08:13:07,508 INFO  plugin.PluginRepository -         Anchor Indexing Filter (index-anchor)
2017-09-23 08:13:07,508 INFO  plugin.PluginRepository -         Tika Parser Plug-in (parse-tika)
2017-09-23 08:13:07,508 INFO  plugin.PluginRepository -         Basic URL Normalizer (urlnormalizer-basic)
2017-09-23 08:13:07,508 INFO  plugin.PluginRepository -         Regex URL Filter Framework (lib-regex-filter)
2017-09-23 08:13:07,508 INFO  plugin.PluginRepository -         Regex URL Normalizer (urlnormalizer-regex)
2017-09-23 08:13:07,508 INFO  plugin.PluginRepository -         CyberNeko HTML Parser (lib-nekohtml)
2017-09-23 08:13:07,508 INFO  plugin.PluginRepository -         OPIC Scoring Plug-in (scoring-opic)
2017-09-23 08:13:07,508 INFO  plugin.PluginRepository -         Pass-through URL Normalizer (urlnormalizer-pass)
2017-09-23 08:13:07,508 INFO  plugin.PluginRepository -         Http Protocol Plug-in (protocol-http)
2017-09-23 08:13:07,508 INFO  plugin.PluginRepository -         Foo Protocol Example Plug-in (protocol-foo)
2017-09-23 08:13:07,508 INFO  plugin.PluginRepository -         SolrIndexWriter (indexer-solr)

Re: [EXT] Re: Nutch Plugin Lifecycle broken due to lazy loading?

Sebastian Nagel
Hi Hiran,

> From the log I have it seems the fetcher tries to resolve URLs
> before the PluginRepository is initialized.

The Fetcher is highly concurrent, it may (even has to) start feeding the fetch queues
before fetching can start. The PluginRepository is initialized when the first plugin
instance is required (one of the protocol plugins).
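
To make the ordering issue concrete, here is a minimal sketch using only the JDK (no Nutch classes). The class name UnknownProtocolDemo and the dummy handler are made up for this illustration. It only shows that a URL with a custom scheme cannot be parsed until some handler for that scheme is known to the JVM, which appears to be what the QueueFeeder runs into when it is faster than the plugin initialization:

```java
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;

class UnknownProtocolDemo {

    /** Returns the parsed protocol, or the exception message if parsing fails. */
    static String tryParse(String spec) {
        try {
            return new URL(spec).getProtocol();
        } catch (MalformedURLException e) {
            return e.getMessage();
        }
    }

    /** Registers a do-nothing handler for "foo"; allowed only once per JVM. */
    static void installFooFactory() {
        URL.setURLStreamHandlerFactory(protocol -> {
            if (!"foo".equals(protocol)) {
                return null; // let the JDK use its built-in handlers
            }
            return new URLStreamHandler() {
                @Override
                protected URLConnection openConnection(URL u) throws IOException {
                    throw new IOException("dummy handler: parsing only");
                }
            };
        });
    }

    public static void main(String[] args) {
        // Too early: no handler for "foo" is known yet, so parsing fails.
        System.out.println(tryParse("foo://example.com"));
        installFooFactory();
        // The same string parses once the factory is in place.
        System.out.println(tryParse("foo://example.com"));
    }
}
```

The first call should fail with the same kind of "unknown protocol" message as in the log, the second should succeed. Nothing else changes between the two calls, only the point in time at which the factory is registered.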

We could instantiate the PluginRepository beforehand, e.g. in NutchConfiguration.create().
However, it's not guaranteed that the configuration is not changed afterwards. Indeed,
that's done sometimes, esp. in unit tests.

What's worse is that there are definitely two cases
 - in unit tests
 - in Nutch server
where more than one Configuration is used, every configuration with its own PluginRepository!
That's in contradiction with the "one and only" JVM-wide URLStreamHandlerFactory.
When running the unit tests ("ant test") we already get the exception
  Caused by: java.lang.Error: factory already defined
        at java.net.URL.setURLStreamHandlerFactory(URL.java:1112)
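
The "one and only" restriction can be reproduced outside of Nutch with a few lines of plain JDK code. This is only a sketch; the class name FactoryOnceDemo and the no-op factory are invented for the example:

```java
import java.net.URL;
import java.net.URLStreamHandlerFactory;

class FactoryOnceDemo {

    /** Tries to install a factory; returns true if the JVM accepted it. */
    static boolean tryInstall(URLStreamHandlerFactory factory) {
        try {
            URL.setURLStreamHandlerFactory(factory);
            return true;
        } catch (Error e) {
            // A second call is rejected with java.lang.Error ("factory already defined").
            return false;
        }
    }

    public static void main(String[] args) {
        // Returning null for every protocol keeps the JDK's built-in handlers in use.
        URLStreamHandlerFactory noop = protocol -> null;
        System.out.println("first install accepted:  " + tryInstall(noop));
        System.out.println("second install accepted: " + tryInstall(noop));
    }
}
```

In a fresh JVM the first install is accepted and the second is rejected, no matter which factory instance is passed. That is why several PluginRepository instances cannot each register themselves.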

I see two ways to go:

1. be pragmatic
   - instantiate PluginRepository in NutchConfiguration.create()
   - set this instance as URLStreamHandlerFactory in the static method
     PluginRepository.get(config) to make sure that the method
     URL.setURLStreamHandlerFactory(..) is called exactly once
   The default usage (one MapReduce job running in its own JVM)
   will work this way. Unit tests should be easily fixed.
   But yes, the Nutch server would require that protocol plugins
   stay the same. It's not really a problem, since it's easy to
   filter away undesired URLs using URL filters.

2. think of protocol handlers as static and more low-level,
   e.g., implement them all as org.apache.nutch.protocol.<protocol>.Handler
   and implement only the minimally required methods (eg. getDefaultPort()).
   Plugins are dynamic but URLStreamHandler-s are not - they cannot
   be changed.
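
As a rough idea of how small such a low-level handler from option 2 could be, here is a sketch with plain JDK classes. FooHandler, MinimalHandlerDemo and the port number 4711 are hypothetical. openConnection() has to be implemented because it is the one abstract method of URLStreamHandler. To keep the sketch free of JVM-wide side effects, the handler is passed to the URL constructor explicitly; that is a choice made for this example, not a proposal for how Nutch should wire it:

```java
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;

/** Hypothetical minimal handler: enough to parse foo:// URLs, not to fetch them. */
class FooHandler extends URLStreamHandler {

    @Override
    protected URLConnection openConnection(URL u) throws IOException {
        // Fetching would be left to the protocol plugin, so refuse here.
        throw new IOException("foo: URLs are only parsed by this handler");
    }

    @Override
    protected int getDefaultPort() {
        return 4711; // made-up default port for the made-up protocol
    }
}

class MinimalHandlerDemo {

    /** Parses a URL with an explicit handler, without touching JVM-wide state. */
    static URL parse(String spec) {
        try {
            return new URL(null, spec, new FooHandler());
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        URL u = parse("foo://example.com/path");
        System.out.println(u.getProtocol() + " " + u.getHost() + " " + u.getDefaultPort());
    }
}
```

A handler like this carries no configuration, so it would not matter which Configuration or PluginRepository is active when it gets created.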

What do you think?

Best,
Sebastian


On 09/23/2017 08:23 AM, Hiran CHAUDHURI wrote:

> When trying to run the example protocol-foo plugin (I am writing it), I was able to pass the injector and generator phases, but it seems the fetch phase fails.
>
> From the log I have it seems the fetcher tries to resolve URLs before the PluginRepository is initialized. Such behaviour would of course render protocol plugins as a whole useless...
>
> So yes, the whole construct still needs to be tested carefully.
>
> 2017-09-23 08:13:06,783 INFO  fetcher.FetchItemQueues - Using queue mode : byHost
> 2017-09-23 08:13:06,785 INFO  fetcher.Fetcher - Fetcher: threads: 50
> 2017-09-23 08:13:06,785 INFO  fetcher.Fetcher - Fetcher: time-out divisor: 2
> 2017-09-23 08:13:06,836 INFO  plugin.PluginRepository - Plugins: looking in: /home/hiran/dev/nutch/runtime/local/plugins
> 2017-09-23 08:13:06,845 WARN  fetcher.FetchItem - Cannot parse url: foo://example.com
> java.net.MalformedURLException: unknown protocol: foo
>         at java.net.URL.<init>(URL.java:600)
>         at java.net.URL.<init>(URL.java:490)
>         at java.net.URL.<init>(URL.java:439)
>         at org.apache.nutch.fetcher.FetchItem.create(FetchItem.java:71)
>         at org.apache.nutch.fetcher.FetchItem.create(FetchItem.java:63)
>         at org.apache.nutch.fetcher.FetchItemQueues.addFetchItem(FetchItemQueues.java:87)
>         at org.apache.nutch.fetcher.QueueFeeder.run(QueueFeeder.java:91)
> 2017-09-23 08:13:06,899 INFO  fetcher.QueueFeeder - QueueFeeder finished: total 2 records + hit by time limit :0
> 2017-09-23 08:13:07,508 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
> 2017-09-23 08:13:07,508 INFO  plugin.PluginRepository - Registered Plugins:
> 2017-09-23 08:13:07,508 INFO  plugin.PluginRepository -         Regex URL Filter (urlfilter-regex)
> 2017-09-23 08:13:07,508 INFO  plugin.PluginRepository -         Html Parse Plug-in (parse-html)
> 2017-09-23 08:13:07,508 INFO  plugin.PluginRepository -         HTTP Framework (lib-http)
> 2017-09-23 08:13:07,508 INFO  plugin.PluginRepository -         the nutch core extension points (nutch-extensionpoints)
> 2017-09-23 08:13:07,508 INFO  plugin.PluginRepository -         Basic Indexing Filter (index-basic)
> 2017-09-23 08:13:07,508 INFO  plugin.PluginRepository -         Anchor Indexing Filter (index-anchor)
> 2017-09-23 08:13:07,508 INFO  plugin.PluginRepository -         Tika Parser Plug-in (parse-tika)
> 2017-09-23 08:13:07,508 INFO  plugin.PluginRepository -         Basic URL Normalizer (urlnormalizer-basic)
> 2017-09-23 08:13:07,508 INFO  plugin.PluginRepository -         Regex URL Filter Framework (lib-regex-filter)
> 2017-09-23 08:13:07,508 INFO  plugin.PluginRepository -         Regex URL Normalizer (urlnormalizer-regex)
> 2017-09-23 08:13:07,508 INFO  plugin.PluginRepository -         CyberNeko HTML Parser (lib-nekohtml)
> 2017-09-23 08:13:07,508 INFO  plugin.PluginRepository -         OPIC Scoring Plug-in (scoring-opic)
> 2017-09-23 08:13:07,508 INFO  plugin.PluginRepository -         Pass-through URL Normalizer (urlnormalizer-pass)
> 2017-09-23 08:13:07,508 INFO  plugin.PluginRepository -         Http Protocol Plug-in (protocol-http)
> 2017-09-23 08:13:07,508 INFO  plugin.PluginRepository -         Foo Protocol Example Plug-in (protocol-foo)
> 2017-09-23 08:13:07,508 INFO  plugin.PluginRepository -         SolrIndexWriter (indexer-solr)
> 2
>


RE: [EXT] Re: Nutch Plugin Lifecycle broken due to lazy loading?

Hiran Chaudhuri
>> From the log I have it seems the fetcher tries to resolve URLs before
>> the PluginRepository is initialized.
>
>The Fetcher is highly concurrent, it may (even has to) start feeding the fetch queues before
> fetching can start. The PluginRepository is initialized when the first plugin instance is required
> (one of the protocol plugins).

Even if the fetcher is highly concurrent, at some point it needs to initialize itself from the configuration files. Since the PluginRepository only registers the plugins and instantiates them on demand, what is the value of also loading the PluginRepository itself only on demand?

>We could instantiate the PluginRepository beforehand, e.g. in NutchConfiguration.create().
> However, it's not guaranteed that the configuration is not changed afterwards. Indeed, that's
> done sometimes, esp. in unit tests.

This confuses me. What use case would justify changing the configuration without restarting Nutch (that is, the JVM)? I may be too new to Nutch to understand this completely.

>What's worse is that there are definitely two cases
> - in unit tests
> - in Nutch server
>where more than one Configuration is used, every configuration with its own PluginRepository!
>That's in contradiction with the "one and only" JVM-wide URLStreamHandlerFactory.
>When running the unit tests ("ant test") we already get the exception
>  Caused by: java.lang.Error: factory already defined
>        at java.net.URL.setURLStreamHandlerFactory(URL.java:1112)

More than one configuration, but reusing the same JVM. This sounds strange to me.
Does the configuration differ that much?

Since we register the PluginRepository in a 1:1 relationship with the JVM, this class should become a singleton I guess.

> I see two ways to go:
> 1. be pragmatic
>   - instantiate PluginRepository in NutchConfiguration.create()
>   - set this instance as URLStreamHandlerFactory in the static method
>     PluginRepository.get(config) to make sure that the method
>     URL.setURLStreamHandlerFactory(..) is called exactly once
>   The default usage (one MapReduce job running in its own JVM)
>   will work this way. Unit tests should be easily fixed.

This would be compatible with the singleton principle.

>2. think of protocol handlers as static and more low-level,
>   e.g., implement them all as org.apache.nutch.protocol.<protocol>.Handler
>   and implement only the minimally required methods (eg. getDefaultPort()).
>   Plugins are dynamic but URLStreamHandler-s are not - they cannot
>   be changed.

In the case of the SMB protocol, a protocol handler has already been implemented according to JVM standards which do not match your suggestion. Would all such protocols require re-implementation?

I think a better way could be to stay with a singleton PluginRepository instance, but to let this class reload its configuration whenever the configuration is reloaded.
Such an approach works as long as you do not need multiple PluginRepository instances in the same JVM in parallel.

To work around this limitation we could create a URLStreamHandlerFactory singleton that registers itself with the JVM, knows all instances of PluginRepository and, with some magic (I still do not understand the use case), picks the right one to look up the protocol and the handler to be used for each URL.
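
A rough sketch of such a factory, using only JDK classes, might look like the following. HandlerSource is a hypothetical stand-in for whatever a PluginRepository would have to offer, DelegatingHandlerFactory is an invented name, and "ask every registered source, first answer wins" is just a placeholder for the unsolved "magic". Holding the sources through weak references is one possible way to let unused instances be garbage collected:

```java
import java.io.IOException;
import java.lang.ref.WeakReference;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical view of a PluginRepository: can it supply a handler for a protocol? */
interface HandlerSource {
    URLStreamHandler handlerFor(String protocol); // null if the protocol is unknown
}

/** The one object that would be registered with the JVM; it only delegates. */
final class DelegatingHandlerFactory implements URLStreamHandlerFactory {

    private static final DelegatingHandlerFactory INSTANCE = new DelegatingHandlerFactory();

    // Weak references: a source that is no longer used elsewhere can be collected.
    private final List<WeakReference<HandlerSource>> sources = new CopyOnWriteArrayList<>();

    private DelegatingHandlerFactory() {
    }

    static DelegatingHandlerFactory getInstance() {
        return INSTANCE;
    }

    void register(HandlerSource source) {
        sources.add(new WeakReference<>(source));
    }

    @Override
    public URLStreamHandler createURLStreamHandler(String protocol) {
        for (WeakReference<HandlerSource> ref : sources) {
            HandlerSource source = ref.get();
            if (source == null) {
                sources.remove(ref); // the source was garbage collected
                continue;
            }
            URLStreamHandler handler = source.handlerFor(protocol);
            if (handler != null) {
                return handler; // placeholder policy: first source that knows it wins
            }
        }
        return null; // unknown here: the JDK falls back to its built-in handlers
    }
}

class DelegatingFactoryDemo {
    public static void main(String[] args) {
        URLStreamHandler dummy = new URLStreamHandler() {
            @Override
            protected URLConnection openConnection(URL u) throws IOException {
                throw new IOException("dummy handler");
            }
        };
        HandlerSource repo = protocol -> "foo".equals(protocol) ? dummy : null;
        DelegatingHandlerFactory factory = DelegatingHandlerFactory.getInstance();
        factory.register(repo);
        // Called directly here; in real use URL.setURLStreamHandlerFactory(factory) would be called once.
        System.out.println(factory.createURLStreamHandler("foo") == dummy);
        System.out.println(factory.createURLStreamHandler("bar") == null);
    }
}
```

The sketch leaves open the hard part: when two sources both know a protocol but are configured differently, the factory has no way to tell which one the caller meant.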

Hiran

Re: [EXT] Re: Nutch Plugin Lifecycle broken due to lazy loading?

Sebastian Nagel
Hi Hiran,

Great! And bringing the discussion back to @user - sorry, wrong reply button...

> Sounds as if it were not a real problem other than a convention.

Just a decision made without having the URL protocol handlers on the radar.

> I wish there were a guide on how to write protocol plugins. But that is why I am creating this
> dummy - it might help document the minimum tasks for a plugin developer.

Yes, it's not part of
  https://wiki.apache.org/nutch/WritingPluginExample-1.2
Parse and indexing filter plugins are the most common ones.

Thanks for your work,
Sebastian


On 09/26/2017 10:36 PM, Hiran CHAUDHURI wrote:

> Hello Sebastian,
>
>> The only value is that there is no need to load it explicitly (no call needed) - at the price, that
>> it's harder to control when the loading is done.
>
> Sounds as if it were not a real problem other than a convention. Maybe we can find some way to automatically have the PluginRepository initialized.
>
>> For unit tests it's just handy to test the same method/class/plugin with various
>> configurations.
>> Nutch server allows running multiple jobs in a single JVM.  It's possible but not necessarily a
>> good idea to run jobs with different sets of plugins.
>
> Oh yes, unit tests. They usually run within the same JVM.
>
>>> Does the configuration differ that much?
>> Rarely, and if it does we just shouldn't care, because there is no way around JVM-wide URL
>> handlers.
>
> This is what I see as the tricky part. If we'd like to have different PluginRepositories with different configuration, how would we find the correct one for instantiating the next URLStreamHandler?
>
>>>> We could instantiate the PluginRepository beforehand, e.g. in
>>>> NutchConfiguration.create().
>> That's the wrong place - too early, initialization must wait until command-line arguments
>> (properties set via -Dproperty=value) are processed.
>
> I will have to trust you here. Although I drilled into one part of nutch I do not have a full architecture overview.
>
>>> Since we register the PluginRepository in a 1:1 relationship with the
>>> JVM, this class should become a singleton I guess.
>> That was also my first thought, however for unit tests we need multiple PluginRepository-s
>> based on different configurations.
>>
>> I've tried it just with the first instance in
>>  https://github.com/sebastian-nagel/nutch/tree/NUTCH-2429
>>  https://github.com/apache/nutch/compare/master...sebastian-nagel:NUTCH-2429
>> (feel free to pull or cherry-pick any of my commits!)
>>
>> ... fetching now fails for foo:// URLs because the protocol-foo is (by now) only a dummy:
>
> That is good news already. :-)
>
> I fixed the issue that PluginRepository would have to be a singleton: the singleton is now a separate class called URLStreamHandlerFactory. It keeps references to the PluginRepository instances. I made them WeakReferences so that PluginRepository instances can be garbage collected when they are no longer needed.
>
> Then I applied your modification to NutchTool so that the PluginRepository gets initialized in time for the fetch phase. This seems to work: I now see the same exceptions you mention, since the protocol-foo plugin is only a dummy.
>
> I am trying to fill this gap now. I wish there were a guide on how to write protocol plugins. But that is why I am creating this dummy - it might help document the minimum tasks for a plugin developer.
>
>> Some more things to do, esp. fix all 33 classes implementing org.apache.hadoop.util.Tool but
>> not org.apache.nutch.util.NutchTool. :)
>
> Sounds like we are getting somewhere. So far I tested running the crawl script. Enough errors there that needed to be fixed....
>
> Hiran
>