[jira] Created: (NUTCH-471) Fix synchronization in NutchBean creation

[jira] Created: (NUTCH-471) Fix synchronization in NutchBean creation

ASF GitHub Bot (Jira)
Fix synchronization in NutchBean creation
-----------------------------------------

                 Key: NUTCH-471
                 URL: https://issues.apache.org/jira/browse/NUTCH-471
             Project: Nutch
          Issue Type: Bug
          Components: searcher
    Affects Versions: 1.0.0
            Reporter: Enis Soztutar
             Fix For: 1.0.0


NutchBean is created and then cached in the servlet context. But NutchBean.get(ServletContext app, Configuration conf) is not synchronized, so more than one instance of the bean (and of DistributedSearch$Client) can be created if the servlet container is accessed rapidly during startup.



--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (NUTCH-471) Fix synchronization in NutchBean creation

ASF GitHub Bot (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated NUTCH-471:
--------------------------------

    Attachment: NutchBeanCreationSync_v1.patch

This patch synchronizes NutchBean.get(ServletContext app, Configuration conf), using the servlet context as the mutex. (NutchBean)app.getAttribute("nutchBean") is checked twice; the first check is left unsynchronized for performance reasons.
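The check-lock-check sequence described in this comment can be sketched roughly as follows. This is not the attached patch: FakeContext and FakeBean are hypothetical stand-ins for ServletContext and NutchBean so that the example compiles without the servlet API, and the stand-in context is backed by a ConcurrentHashMap. Whether the unsynchronized first read is safe in a real container depends on how that container publishes context attributes.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for javax.servlet.ServletContext (attribute storage only).
class FakeContext {
    private final Map<String, Object> attrs = new ConcurrentHashMap<>();
    Object getAttribute(String key) { return attrs.get(key); }
    void setAttribute(String key, Object value) { attrs.put(key, value); }
}

// Hypothetical stand-in for NutchBean; counts how many instances were constructed.
class FakeBean {
    static final AtomicInteger CREATED = new AtomicInteger();
    FakeBean() { CREATED.incrementAndGet(); }
}

class DoubleCheckedGet {
    static final String KEY = "nutchBean";

    static FakeBean get(FakeContext app) {
        FakeBean bean = (FakeBean) app.getAttribute(KEY);    // first check, no lock taken
        if (bean == null) {
            synchronized (app) {                             // the context object is the mutex
                bean = (FakeBean) app.getAttribute(KEY);     // second check, under the lock
                if (bean == null) {
                    bean = new FakeBean();
                    app.setAttribute(KEY, bean);
                }
            }
        }
        return bean;
    }

    public static void main(String[] args) throws InterruptedException {
        FakeContext app = new FakeContext();
        Thread[] threads = new Thread[8];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> get(app));         // simulate concurrent first requests
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        System.out.println("instances created: " + FakeBean.CREATED.get());
    }
}
```

Only the first few callers ever reach the synchronized block; once the attribute is set, later calls return after the unlocked read.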

[jira] Commented: (NUTCH-471) Fix synchronization in NutchBean creation

ASF GitHub Bot (Jira)
In reply to this post by ASF GitHub Bot (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12491290 ]

Andrzej Bialecki  commented on NUTCH-471:
-----------------------------------------

+1. Nice trick with the unsynchronized check. :)

[jira] Commented: (NUTCH-471) Fix synchronization in NutchBean creation

ASF GitHub Bot (Jira)
In reply to this post by ASF GitHub Bot (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12491305 ]

Sami Siren commented on NUTCH-471:
----------------------------------

Isn't the DCL declared to be broken?

We could perhaps instead instantiate NutchBean in ServletContextListener once at startup?

[jira] Commented: (NUTCH-471) Fix synchronization in NutchBean creation

ASF GitHub Bot (Jira)
In reply to this post by ASF GitHub Bot (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12491313 ]

Enis Soztutar commented on NUTCH-471:
-------------------------------------

> Nice trick with the unsynchronized check. :)
Wow, indeed I have used a pattern without knowing about it :) It seemed a simple and efficient solution to me.

>Isn't the DCL declared to be broken?
After reading http://en.wikipedia.org/wiki/Double-checked_locking, I can say that this is a very subtle bug. As suggested there, we can fix it by declaring NutchBean volatile. However, I guess that in that case the servlet container would also have to be configured to use Java 1.5 instead of 1.4.
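For reference, the volatile variant mentioned here looks roughly like the sketch below. It assumes the Java 5 (JSR-133) memory model, under which a volatile write happens-before a subsequent volatile read of the same field. CostlyBean and VolatileBeanHolder are hypothetical names; in Nutch the bean is cached as a servlet-context attribute rather than in a static field, so this illustrates the idiom and is not a drop-in change.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for an expensive-to-build object such as NutchBean.
class CostlyBean {
    static final AtomicInteger BUILT = new AtomicInteger();
    CostlyBean() { BUILT.incrementAndGet(); }
}

class VolatileBeanHolder {
    // volatile is what makes the unlocked read safe on Java 5 and later.
    private static volatile CostlyBean instance;

    static CostlyBean get() {
        CostlyBean local = instance;              // single volatile read on the fast path
        if (local == null) {
            synchronized (VolatileBeanHolder.class) {
                local = instance;                 // re-check under the lock
                if (local == null) {
                    local = new CostlyBean();
                    instance = local;             // volatile write publishes the fully built object
                }
            }
        }
        return local;
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] workers = new Thread[8];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(() -> get()); // simulate concurrent first requests
            workers[i].start();
        }
        for (Thread t : workers) {
            t.join();
        }
        System.out.println("instances built: " + CostlyBean.BUILT.get());
    }
}
```

Without the volatile modifier, or on a pre-1.5 JVM, another thread may observe a non-null reference to a partially constructed object, which is the subtle bug discussed above.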



[jira] Updated: (NUTCH-471) Fix synchronization in NutchBean creation

ASF GitHub Bot (Jira)
In reply to this post by ASF GitHub Bot (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated NUTCH-471:
--------------------------------

    Attachment: NutchBeanCreationSync_v2.patch

From http://www-128.ibm.com/developerworks/java/library/j-dcl.html

The bottom line is that double-checked locking, in whatever form, should not be used because you cannot guarantee that it will work on any JVM implementation. JSR-133 is addressing issues regarding the memory model, however, double-checked locking will not be supported by the new memory model. Therefore, you have two options:
    * Accept the synchronization of a getInstance() method as shown in Listing 2.
    * Forgo synchronization and use a static field.

We don't want to compromise performance in NutchBean.get(), so synchronization is not a solution. Thus, as Sami suggested, I have written a ServletContextListener and moved the NutchBean construction code there, and modified web.xml to register the event listener class. Also, during servlet initialization the Configuration object is initialized and cached by NutchConfiguration, so we avoid the same problem in NutchConfiguration.get().

I have tested the implementation and it seems OK.
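The idea behind the listener approach can be sketched as below: the bean is built once at application startup, before any request thread runs, and the lookup becomes a plain read with no locking. This is not the v2 patch. StubContext, StubBean and StubBeanConstructor are hypothetical stand-ins so the example runs without the servlet API; a real listener would implement javax.servlet.ServletContextListener and receive a ServletContextEvent.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for javax.servlet.ServletContext (attribute storage only).
class StubContext {
    private final Map<String, Object> attrs = new ConcurrentHashMap<>();
    Object getAttribute(String key) { return attrs.get(key); }
    void setAttribute(String key, Object value) { attrs.put(key, value); }
    void removeAttribute(String key) { attrs.remove(key); }
}

// Hypothetical stand-in for NutchBean.
class StubBean { }

// Mirrors the shape of a ServletContextListener: the container would call
// contextInitialized once at startup and contextDestroyed once at shutdown.
class StubBeanConstructor {
    static final String KEY = "nutchBean";

    void contextInitialized(StubContext app) {
        app.setAttribute(KEY, new StubBean());   // eager, single-threaded construction
    }

    void contextDestroyed(StubContext app) {
        app.removeAttribute(KEY);
    }
}

class EagerLookup {
    // No creation and no locking here: returns null if the listener never ran.
    static StubBean get(StubContext app) {
        return (StubBean) app.getAttribute(StubBeanConstructor.KEY);
    }

    public static void main(String[] args) {
        StubContext app = new StubContext();
        System.out.println("before startup, bean is null: " + (get(app) == null));
        new StubBeanConstructor().contextInitialized(app);
        System.out.println("after startup, bean is null: " + (get(app) == null));
    }
}
```

The trade-off is that the lookup depends on the listener being registered: if it is not, every call returns null.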


[jira] Commented: (NUTCH-471) Fix synchronization in NutchBean creation

ASF GitHub Bot (Jira)
In reply to this post by ASF GitHub Bot (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506883 ]

Doğacan Güney commented on NUTCH-471:
-------------------------------------

We have been using this on our machines for some time, so if there are no objections, I am going to commit this one later today.

[jira] Resolved: (NUTCH-471) Fix synchronization in NutchBean creation

ASF GitHub Bot (Jira)
In reply to this post by ASF GitHub Bot (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney resolved NUTCH-471.
---------------------------------

    Resolution: Fixed
      Assignee: Doğacan Güney

Committed in rev. 549507 with minor style modifications.

[jira] Commented: (NUTCH-471) Fix synchronization in NutchBean creation

ASF GitHub Bot (Jira)
In reply to this post by ASF GitHub Bot (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507145 ]

Hudson commented on NUTCH-471:
------------------------------

Integrated in Nutch-Nightly #125 (See [http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/125/])

[jira] Reopened: (NUTCH-471) Fix synchronization in NutchBean creation

ASF GitHub Bot (Jira)
In reply to this post by ASF GitHub Bot (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes reopened NUTCH-471:
--------------------------------


This patch breaks search.jsp with a null pointer exception because the NutchBean is no longer created in the get method; it is only retrieved there after it has already been cached.

[jira] Commented: (NUTCH-471) Fix synchronization in NutchBean creation

ASF GitHub Bot (Jira)
In reply to this post by ASF GitHub Bot (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12512702 ]

Doğacan Güney commented on NUTCH-471:
-------------------------------------

Dennis, can you give some more details? AFAICS, searching works fine. I started Tomcat, then sent a query to search.jsp (with and without first requesting the home page), and I didn't see any problems in either case.

[jira] Commented: (NUTCH-471) Fix synchronization in NutchBean creation

ASF GitHub Bot (Jira)
In reply to this post by ASF GitHub Bot (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12512712 ]

Dennis Kubes commented on NUTCH-471:
------------------------------------

Ah, sorry, my configuration was the problem.  If you don't upgrade the web.xml to include the listener:

<listener>
  <listener-class>org.apache.nutch.searcher.NutchBean$NutchBeanConstructor</listener-class>
</listener>

then NutchBean returns null.  I added a comment to the search.jsp to explain how NutchBean is initialized.



[jira] Closed: (NUTCH-471) Fix synchronization in NutchBean creation

ASF GitHub Bot (Jira)
In reply to this post by ASF GitHub Bot (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes closed NUTCH-471.
------------------------------

    Resolution: Fixed

inject command fail on whole-web run

Tsengtan A Shuy
I am running the "bin/nutch inject crawl/crawldb dmoz" command on my ubuntu
OS by following the nutch-0.8.x tutorial. But I got the following error
message:

2007-07-14 11:38:35,238 WARN  mapred.LocalJobRunner (LocalJobRunner.java:run(120)) - job_ij0atx
java.lang.NoClassDefFoundError: dk/brics/automaton/RunAutomaton
        at org.apache.nutch.urlfilter.automaton.AutomatonURLFilter$Rule.<init>(AutomatonURLFilter.java:89)
        at org.apache.nutch.urlfilter.automaton.AutomatonURLFilter.createRule(AutomatonURLFilter.java:70)
        at org.apache.nutch.urlfilter.api.RegexURLFilterBase.readRulesFile(RegexURLFilterBase.java:191)
        at org.apache.nutch.urlfilter.api.RegexURLFilterBase.setConf(RegexURLFilterBase.java:140)
        at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:153)
        at org.apache.nutch.net.URLFilters.<init>(URLFilters.java:53)
        at org.apache.nutch.crawl.Injector$InjectMapper.configure(Injector.java:56)
        at org.apache.hadoop.mapred.JobConf.newInstance(JobConf.java:443)
        at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:33)
        at org.apache.hadoop.mapred.JobConf.newInstance(JobConf.java:443)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:125)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:91)
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:357)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:138)
        at org.apache.nutch.crawl.Injector.main(Injector.java:164)
adamshuy@adamshuy-desktop:~/nutch-0.8.1$
What is wrong in my ubuntu environment?
Please help!!

Adam Shuy, President
ePacific Web Design & Hosting
Professional Web/Software developer
TEL: 408-272-6946
www.epacificweb.com
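
A NoClassDefFoundError like the one in the trace above means the JVM could not load dk.brics.automaton.RunAutomaton at run time, which usually points at a jar missing from the class path that was in effect. A minimal sketch of a check follows; ClasspathProbe is a hypothetical helper, not part of Nutch. Nutch loads plugin classes through its own plugin class loaders, so a result obtained on the plain application class path may not match what the plugin sees.

```java
// Hypothetical helper: reports whether a class can be loaded by this class's loader.
class ClasspathProbe {
    static boolean isLoadable(String className) {
        try {
            // initialize=false: only locate and load the class, do not run static initializers.
            Class.forName(className, false, ClasspathProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Defaults to the class named in the stack trace; pass another name as the first argument.
        String name = args.length > 0 ? args[0] : "dk.brics.automaton.RunAutomaton";
        System.out.println(name + " loadable: " + isLoadable(name));
    }
}
```

Running it with the same class path as the failing command shows whether the class is visible there at all.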

RE: inject command fail on whole-web run

Tsengtan A Shuy
I was able to fix the problem from my last email and get through the whole-web crawl commands from the nutch-0.8.x tutorial.

But the resulting crawl folder is still very small, and when I searched for "apache" at the end, I got the "hit 0" message.  Something is still wrong.

Please give me some feedback.

Adam Shuy, President
ePacific Web Design & Hosting
Professional Web/Software developer
TEL: 408-272-6946
www.epacificweb.com