java.net.URL synchronization

java.net.URL synchronization

Otis Gospodnetic-2-2
Hello,

Has anyone seen this:
http://www.supermind.org/blog/580/java-net-url-synchronization-bottleneck ?

Is this something that needs to be addressed in Nutch (and thus in Bixo, and thus in the common crawler project)?


Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch

RE: java.net.URL synchronization

Fuad Efendi
I checked java.net.URL; yes, Nutch and BIXO implicitly use a synchronized
Hashtable through the URL constructor:

    public URL(String protocol, String host, int port, String file,
               URLStreamHandler handler) throws MalformedURLException {
        ...
        if (handler == null &&
            (handler = getURLStreamHandler(protocol)) == null) {
            throw new MalformedURLException("unknown protocol: " + protocol);
        }
        ...


However, I don't think it hurts, because both architectures (at least BIXO)
run a single thread in a single JVM to process, for instance, Outlinks. Only
the "Fetch" part is multithreaded, but it doesn't use the URL class.
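
To make the constructor excerpt above concrete: the lookup through getURLStreamHandler only runs when the handler argument is null. Here is a minimal sketch, assuming a throwaway parse-only handler (PARSE_ONLY and the class name ExplicitHandlerDemo are made up for illustration and come from neither Nutch nor Bixo). It only shows which branch of the constructor is taken; it is not a measured fix for the bottleneck.

```java
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;

class ExplicitHandlerDemo {

    // Hypothetical handler, for illustration only: URL can parse and hold
    // the components, but opening a connection is refused.
    static final URLStreamHandler PARSE_ONLY = new URLStreamHandler() {
        @Override
        protected URLConnection openConnection(URL u) throws IOException {
            throw new IOException("parse-only handler cannot open " + u);
        }
    };

    // handler == null: the constructor must look a handler up by protocol,
    // and fails with MalformedURLException when none is registered.
    @SuppressWarnings("deprecation") // URL constructors are deprecated in recent JDKs
    static boolean lookupFails(String protocol) {
        try {
            new URL(protocol, "example.com", 80, "/index.html", null);
            return false;
        } catch (MalformedURLException e) {
            return true;
        }
    }

    // handler != null: the lookup branch is skipped, so even an unknown
    // protocol is accepted.
    @SuppressWarnings("deprecation")
    static boolean parsesWithExplicitHandler(String protocol) {
        try {
            URL url = new URL(protocol, "example.com", 80, "/index.html", PARSE_ONLY);
            return "example.com".equals(url.getHost()) && url.getPort() == 80;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("lookup fails:          " + lookupFails("nosuchproto"));
        System.out.println("explicit handler works: " + parsesWithExplicitHandler("nosuchproto"));
    }
}
```

Whether passing a handler explicitly is practical inside a crawler is a separate question; the point is only that the shared lookup sits behind the handler == null check.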


I'm not sure how the fetch list is generated in Nutch... if that part is
multithreaded, then a RegexUrlNormalizer "shared" between threads is an even
bigger problem...
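
On the shared-normalizer concern, here is a minimal sketch of how a regex-based normalizer could be shared between threads without a lock. This is not Nutch's RegexUrlNormalizer; the class name SharedRegexNormalizer and both rules are hypothetical. It relies on two facts about java.util.regex: a compiled Pattern is immutable and safe for concurrent use, while a Matcher is not, so a fresh Matcher is created on every call.

```java
import java.util.regex.Pattern;

class SharedRegexNormalizer {

    // Compiled once at construction; Pattern instances are immutable and
    // can be read by any number of threads at the same time.
    private final Pattern[] patterns;
    private final String[] replacements;

    SharedRegexNormalizer() {
        // Illustrative rules only (assumptions, not Nutch's rule file).
        patterns = new Pattern[] {
            Pattern.compile("(?i);jsessionid=[0-9a-z]+"), // strip session ids
            Pattern.compile("#.*$")                       // strip fragments
        };
        replacements = new String[] { "", "" };
    }

    // No synchronized keyword and no shared mutable state: matcher()
    // returns a new Matcher that is private to this call.
    String normalize(String url) {
        String result = url;
        for (int i = 0; i < patterns.length; i++) {
            result = patterns[i].matcher(result).replaceAll(replacements[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        SharedRegexNormalizer normalizer = new SharedRegexNormalizer();
        System.out.println(
            normalizer.normalize("http://example.com/a;jsessionid=ABC123?x=1#frag"));
    }
}
```

If a normalizer instead keeps a Matcher (or any other mutable scratch state) in a field, it needs either a lock or one instance per thread, and that is where contention would come from.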


Fuad Efendi
+1 416-993-2060
http://www.tokenizer.ca/
Data Mining, Vertical Search


> -----Original Message-----
> From: Otis Gospodnetic [mailto:[hidden email]]
> Sent: December-09-09 5:12 PM
> To: [hidden email]
> Subject: java.net.URL synchronization



RE: java.net.URL synchronization

Fuad Efendi
Tomcat uses its own, slightly different version of the URL class:

http://tomcat.apache.org/tomcat-5.5-doc/catalina/docs/api/index.html
URL is designed to provide public APIs for parsing and synthesizing Uniform
Resource Locators as similar as possible to the APIs of java.net.URL, but
without the ability to open a stream or connection. One of the consequences
of this is that you can construct URLs for protocols for which a
URLStreamHandler is not available (such as an "https" URL when JSSE is not
installed).



The synchronized stuff in java.net.URL is URLStreamHandler-related.
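
The JDK's own java.net.URI is similar in spirit to the Tomcat class: it parses a URL string into components without involving a URLStreamHandler at all. Here is a small sketch (the class name UriParseDemo is made up). One caveat: URI is stricter about syntax than URL, so whether it accepts everything a crawler sees in the wild is an assumption that would need checking.

```java
import java.net.URI;
import java.net.URISyntaxException;

class UriParseDemo {

    // Parses with java.net.URI only; no protocol handler is looked up,
    // so unknown schemes are fine.
    static String describe(String spec) {
        try {
            URI uri = new URI(spec);
            return uri.getScheme() + " " + uri.getHost() + " "
                    + uri.getPort() + " " + uri.getPath();
        } catch (URISyntaxException e) {
            return "unparsable";
        }
    }

    public static void main(String[] args) {
        // A scheme with no registered handler still parses.
        System.out.println(describe("nosuchproto://example.com:8080/a/b?q=1"));
        // getPort() is -1 when the URL has no explicit port.
        System.out.println(describe("https://example.com/x"));
        // Stricter than URL: an unescaped space is rejected.
        System.out.println(describe("http://example.com/a b"));
    }
}
```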


> -----Original Message-----
> From: Fuad Efendi [mailto:[hidden email]]
> Sent: December-09-09 5:40 PM
> To: [hidden email]
> Subject: RE: java.net.URL synchronization


Fuad Efendi
+1 416-993-2060
http://www.linkedin.com/in/liferay