[jira] Created: (NUTCH-496) ConcurrentModificationException can be thrown when getSorted() is called.


[jira] Created: (NUTCH-496) ConcurrentModificationException can be thrown when getSorted() is called.

Sebastian Nagel (Jira)
ConcurrentModificationException can be thrown when getSorted() is called.
-------------------------------------------------------------------------

                 Key: NUTCH-496
                 URL: https://issues.apache.org/jira/browse/NUTCH-496
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: 0.9.0
         Environment: Nutch application, during fetch.
            Reporter: Marc Miller


NGramProfile (within the org.apache.nutch.analysis.lang) patckage is not threadsafe due to a ConcurrentModificationException that can occur if during iteration of the resulant List from getSorted() and another call to getSorted() is invoked from within another thread.
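To illustrate the failure mode described above, here is a minimal sketch, not the Nutch code. The class names (`SharedSortedProfile`, `CmeDemo`) are hypothetical, and it assumes a profile that rebuilds and hands out one shared list. To keep it deterministic, the second getSorted() call happens on the same thread, inside the loop, standing in for a call from another thread.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.ConcurrentModificationException;
import java.util.List;

// Hypothetical stand-in for a profile that rebuilds one shared list on every call.
class SharedSortedProfile {
    private final List<String> sorted = new ArrayList<>();

    // Rebuilds and returns the SAME list instance each time (the risky part).
    List<String> getSorted() {
        sorted.clear();
        Collections.addAll(sorted, "c", "a", "b");
        Collections.sort(sorted);
        return sorted;
    }
}

class CmeDemo {
    // Returns true if iterating the shared list fails after a second getSorted() call.
    static boolean iterationFails() {
        SharedSortedProfile profile = new SharedSortedProfile();
        try {
            for (String ngram : profile.getSorted()) {
                // Stands in for another thread calling getSorted() mid-iteration.
                profile.getSorted();
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("iteration failed: " + iterationFails());
    }
}
```

With real threads the same structural modification happens at unpredictable moments, which is why the exception shows up only occasionally.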

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (NUTCH-496) ConcurrentModificationException can be thrown when getSorted() is called.

Sebastian Nagel (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marc Miller updated NUTCH-496:
------------------------------

    Attachment: language_analyzer_ngram.patch

Updated the getSorted() method to be synchronized.
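The patch itself is not reproduced in this thread; the note only says that getSorted() was made synchronized. As a rough sketch of that idea, with hypothetical names (`SnapshotProfile`, `SnapshotDemo`) and one added assumption that goes beyond the note: the method below also returns a private sorted copy. Synchronizing alone would not protect a caller that iterates the returned list after the lock has been released.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical profile; class and field names are illustrative, not Nutch's.
class SnapshotProfile {
    private final List<String> ngrams = new ArrayList<>();

    synchronized void add(String ngram) {
        ngrams.add(ngram);
    }

    // The lock is held while copying, so no caller sees a half-built list,
    // and every caller iterates over its own private snapshot (an assumption
    // added here; the thread only mentions "synchronized").
    synchronized List<String> getSorted() {
        List<String> snapshot = new ArrayList<>(ngrams);
        Collections.sort(snapshot);
        return Collections.unmodifiableList(snapshot);
    }
}

class SnapshotDemo {
    // Iterates one snapshot while the profile is modified and re-sorted.
    static List<String> demo() {
        SnapshotProfile profile = new SnapshotProfile();
        profile.add("b");
        profile.add("a");
        List<String> first = profile.getSorted();
        profile.add("0");
        for (String ngram : first) {
            profile.getSorted(); // later calls do not disturb the earlier snapshot
        }
        return first;
    }
}
```

This only addresses the exception. It does not address several threads writing different texts into the same profile.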

> ConcurrentModificationException can be thrown when getSorted() is called.
> -------------------------------------------------------------------------
>
>                 Key: NUTCH-496
>                 URL: https://issues.apache.org/jira/browse/NUTCH-496
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.9.0
>         Environment: Nutch application, during fetch.
>            Reporter: Marc Miller
>         Attachments: language_analyzer_ngram.patch
>
>
> NGramProfile (within the org.apache.nutch.analysis.lang) patckage is not threadsafe due to a ConcurrentModificationException that can occur if during iteration of the resulant List from getSorted() and another call to getSorted() is invoked from within another thread.


[jira] Updated: (NUTCH-496) ConcurrentModificationException can be thrown when getSorted() is called.

Sebastian Nagel (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marc Miller updated NUTCH-496:
------------------------------

    Description: NGramProfile (within the org.apache.nutch.analysis.lang) package is not thread-safe due to a ConcurrentModificationException that can occur if during iteration of the resultant List from getSorted() and another call to getSorted() is invoked from within another thread.  (was: NGramProfile (within the org.apache.nutch.analysis.lang) patckage is not threadsafe due to a ConcurrentModificationException that can occur if during iteration of the resulant List from getSorted() and another call to getSorted() is invoked from within another thread.)

Fixed spelling errors.

> ConcurrentModificationException can be thrown when getSorted() is called.
> -------------------------------------------------------------------------
>
>                 Key: NUTCH-496
>                 URL: https://issues.apache.org/jira/browse/NUTCH-496
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.9.0
>         Environment: Nutch application, during fetch.
>            Reporter: Marc Miller
>         Attachments: language_analyzer_ngram.patch
>
>
> NGramProfile (within the org.apache.nutch.analysis.lang) package is not thread-safe due to a ConcurrentModificationException that can occur if during iteration of the resultant List from getSorted() and another call to getSorted() is invoked from within another thread.


[jira] Commented: (NUTCH-496) ConcurrentModificationException can be thrown when getSorted() is called.

Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12501266 ]

Sami Siren commented on NUTCH-496:
----------------------------------

I believe the problem is even more severe. Several threads currently share the NGramProfile that is used to identify a piece of text; if parallel threads have access to the same object, the results are more or less random.

This could be fixed by changing the NGramProfile (which is currently the field "suspect" in LanguageIdentifier) to be a thread local.
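A minimal sketch of the thread-local idea, assuming nothing about the real classes beyond the field name "suspect" mentioned above. `CharProfile` and `PerThreadIdentifier` are hypothetical and much simpler than NGramProfile and LanguageIdentifier; only the ThreadLocal pattern is the point.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified profile: counts the characters of the analysed text.
class CharProfile {
    private final Map<Character, Integer> counts = new HashMap<>();

    // Overwrites the profile's state with the counts for the given text.
    void analyze(String text) {
        counts.clear();
        for (char c : text.toCharArray()) {
            counts.merge(c, 1, Integer::sum);
        }
    }

    int count(char c) {
        return counts.getOrDefault(c, 0);
    }
}

// Hypothetical identifier holding one mutable profile per thread.
class PerThreadIdentifier {
    // Each thread lazily gets its own CharProfile, so no mutable state is shared.
    private final ThreadLocal<CharProfile> suspect =
            ThreadLocal.withInitial(CharProfile::new);

    int countOf(String text, char c) {
        CharProfile profile = suspect.get();
        profile.analyze(text);
        return profile.count(c);
    }
}
```

The trade-off is one profile instance per worker thread instead of a single shared one, in exchange for not needing any locking around the analysis.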

> ConcurrentModificationException can be thrown when getSorted() is called.
> -------------------------------------------------------------------------
>
>                 Key: NUTCH-496
>                 URL: https://issues.apache.org/jira/browse/NUTCH-496
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.9.0
>         Environment: Nutch application, during fetch.
>            Reporter: Briggs
>         Attachments: language_analyzer_ngram.patch
>
>
> NGramProfile (within the org.apache.nutch.analysis.lang) package is not thread-safe due to a ConcurrentModificationException that can occur if during iteration of the resultant List from getSorted() and another call to getSorted() is invoked from within another thread.


Re: [jira] Commented: (NUTCH-496) ConcurrentModificationException can be thrown when getSorted() is called.

Briggs
Yeah, you are correct there. How does this thing actually even
remotely begin to work on a predictable level?





On 6/4/07, Sami Siren (JIRA) <[hidden email]> wrote:

>
>     [ https://issues.apache.org/jira/browse/NUTCH-496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12501266 ]
>
> Sami Siren commented on NUTCH-496:
> ----------------------------------
>
> I believe the problem is even more severe. Several threads currently share the NGramProfile that is used to identify a piece of text; if parallel threads have access to the same object, the results are more or less random.
>
> This could be fixed by changing the NGramProfile (which is currently the field "suspect" in LanguageIdentifier) to be a thread local.
>
> > ConcurrentModificationException can be thrown when getSorted() is called.
> > -------------------------------------------------------------------------
> >
> >                 Key: NUTCH-496
> >                 URL: https://issues.apache.org/jira/browse/NUTCH-496
> >             Project: Nutch
> >          Issue Type: Bug
> >          Components: fetcher
> >    Affects Versions: 0.9.0
> >         Environment: Nutch application, during fetch.
> >            Reporter: Briggs
> >         Attachments: language_analyzer_ngram.patch
> >
> >
> > NGramProfile (within the org.apache.nutch.analysis.lang) package is not thread-safe due to a ConcurrentModificationException that can occur if during iteration of the resultant List from getSorted() and another call to getSorted() is invoked from within another thread.
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>


--
"Conscious decisions by conscious minds are what make reality real"

Re: [jira] Commented: (NUTCH-496) ConcurrentModificationException can be thrown when getSorted() is called.

Sami Siren-2
Briggs wrote:
> Yeah, you are correct there.  How does this thing actually even
> remotely begin to work on a  predictable level?

One crucial aspect of language identification is that the input is properly
encoded. There was a patch that added icu4j character set encoding
detection to Nutch. I believe icu4j also offers language
identification in addition to character set detection. Has anyone
checked how usable the language identification from icu4j would be?

There are severe problems with the current language identification for
CJK, for example.

--
 Sami Siren

Re: [jira] Commented: (NUTCH-496) ConcurrentModificationException can be thrown when getSorted() is called.

Doğacan Güney-3
On 6/4/07, Sami Siren <[hidden email]> wrote:

> Briggs wrote:
> > Yeah, you are correct there.  How does this thing actually even
> > remotely begin to work on a  predictable level?
>
> One crucial aspect of language identification is that the input is properly
> encoded. There was a patch that added icu4j character set encoding
> detection to Nutch. I believe icu4j also offers language
> identification in addition to character set detection. Has anyone
> checked how usable the language identification from icu4j would be?
>
> There are severe problems with the current language identification for
> CJK, for example.


Can you give a few links? I have looked at icu4j's API, but I haven't
found any info about language identification.

IBM does have something called Linguini
(http://www-306.ibm.com/software/globalization/topics/linguini/index.jsp).
It doesn't seem to be open source, though.

>
> --
>  Sami Siren
>


--
Doğacan Güney

Re: [jira] Commented: (NUTCH-496) ConcurrentModificationException can be thrown when getSorted() is called.

Sami Siren-2
2007/6/5, Doğacan Güney <[hidden email]>:
>
> Can you give a few links? I have looked at icu4j's API, but I haven't
> found any info about language identification.

I just saw this in the API and assumed it had to do with detecting the
language; I might be wrong:

http://www.icu-project.org/apiref/icu4j/com/ibm/icu/text/CharsetMatch.html#getLanguage()

--
 Sami Siren

Re: [jira] Commented: (NUTCH-496) ConcurrentModificationException can be thrown when getSorted() is called.

Doğacan Güney-3
On 6/5/07, Sami Siren <[hidden email]> wrote:
>
> I just saw this in the API and assumed it had to do with detecting the
> language; I might be wrong:
>
> http://www.icu-project.org/apiref/icu4j/com/ibm/icu/text/CharsetMatch.html#getLanguage()

I think that method returns the ISO language code associated with the
detected charset. For example, it returns "tr" for ISO-8859-9.

That being said, language identification is a very crucial feature and
if it doesn't work properly, well, someone should do something about
it :).


>
> --
>  Sami Siren
>


--
Doğacan Güney

[jira] Updated: (NUTCH-496) ConcurrentModificationException can be thrown when getSorted() is called.

Sebastian Nagel (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sami Siren updated NUTCH-496:
-----------------------------

    Attachment: nutch-496.txt

This patch changes LanguageIdentifier to keep one NGramProfile per thread instead of a single shared instance.

> ConcurrentModificationException can be thrown when getSorted() is called.
> -------------------------------------------------------------------------
>
>                 Key: NUTCH-496
>                 URL: https://issues.apache.org/jira/browse/NUTCH-496
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.9.0
>         Environment: Nutch application, during fetch.
>            Reporter: Briggs
>         Attachments: language_analyzer_ngram.patch, nutch-496.txt
>
>
> NGramProfile (within the org.apache.nutch.analysis.lang) package is not thread-safe due to a ConcurrentModificationException that can occur if during iteration of the resultant List from getSorted() and another call to getSorted() is invoked from within another thread.


Re: [jira] Commented: (NUTCH-496) ConcurrentModificationException can be thrown when getSorted() is called.

Briggs
I have already 'fixed' the concurrency issue. I did as Sami suggested
and put the NGramProfile in a ThreadLocal variable, and I am
currently testing it (though this is another difficult one to set up for
testing; nobody has reported a problem with it because it's a very quiet
bug).
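One way such a quiet bug could be exercised is a small stress check that runs many tasks in parallel and compares each result with the value expected for its own input. The sketch below is an assumption about how that might look, not the test actually used here. `ConcurrencyCheck`, `process`, and `mismatches` are hypothetical names, and the per-thread StringBuilder merely stands in for the per-thread profile.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ConcurrencyCheck {
    // Hypothetical per-thread scratch state standing in for the per-thread profile.
    private static final ThreadLocal<StringBuilder> SCRATCH =
            ThreadLocal.withInitial(StringBuilder::new);

    // Stand-in for "identify": uses mutable scratch state to reverse the input.
    static String process(String text) {
        StringBuilder scratch = SCRATCH.get();
        scratch.setLength(0);
        scratch.append(text).reverse();
        return scratch.toString();
    }

    // Runs the tasks on a thread pool and counts results that do not match
    // the value computed independently for each input.
    static int mismatches(int threads, int tasks) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Boolean>> jobs = new ArrayList<>();
            for (int i = 0; i < tasks; i++) {
                final String input = "text-" + i;
                final String expected = new StringBuilder(input).reverse().toString();
                jobs.add(() -> process(input).equals(expected));
            }
            int bad = 0;
            for (Future<Boolean> result : pool.invokeAll(jobs)) {
                if (!result.get()) {
                    bad++;
                }
            }
            return bad;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

With per-thread state the mismatch count should stay at zero. Replacing the ThreadLocal with one shared buffer would be expected to produce occasional mismatches, though a race like that is never guaranteed to show up in any single run.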

On 6/5/07, Doğacan Güney <[hidden email]> wrote:

>
> On 6/5/07, Sami Siren <[hidden email]> wrote:
> >
> > I just saw this in the API and assumed it had to do with detecting the
> > language; I might be wrong:
> >
> > http://www.icu-project.org/apiref/icu4j/com/ibm/icu/text/CharsetMatch.html#getLanguage()
>
> I think that method returns the ISO language code associated with the
> detected charset. For example, it returns "tr" for ISO-8859-9.
>
> That being said, language identification is a very crucial feature and
> if it doesn't work properly, well, someone should do something about
> it :).
>
>
> >
> > --
> >  Sami Siren
> >
>
>
> --
> Doğacan Güney
>



--
"Conscious decisions by conscious minds are what make reality real"