[jira] Created: (NUTCH-60) Bad language identifier plugin performances


[jira] Created: (NUTCH-60) Bad language identifier plugin performances

Sergey Smolyakov (Jira)
Bad language identifier plugin performances
-------------------------------------------

         Key: NUTCH-60
         URL: http://issues.apache.org/jira/browse/NUTCH-60
     Project: Nutch
        Type: Improvement
  Components: indexer  
    Reporter: Jerome Charron
    Priority: Minor


As reported by Stefan Groschupf (http://www.mail-archive.com/nutch-developers@.../msg04090.html), the language identifier plugin consumes a lot of processing time.
Some optimizations and/or configuration options are required.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


[jira] Updated: (NUTCH-60) Bad language identifier plugin performances

Sergey Smolyakov (Jira)
     [ http://issues.apache.org/jira/browse/NUTCH-60?page=all ]

Jerome Charron updated NUTCH-60:
--------------------------------

    Attachment: NUTCH-60-050526.patch

This patch brings some minor performance improvements, and adds configuration parameters that make further performance tuning possible.
See http://wiki.apache.org/nutch/LanguageIdentifierBenchs and http://wiki.apache.org/nutch/NewLanguageIdentifier (coming soon) for more details.

In short, it adds the following configuration parameters:

 * lang.ngram.min.length: The minimum size of the ngrams used to identify the language (must be between 1 and lang.ngram.max.length). The larger the range between lang.ngram.min.length and lang.ngram.max.length, the better the identification, but the slower it is.

 * lang.ngram.max.length: The maximum size of the ngrams used to identify the language (must be between lang.ngram.min.length and 4). The larger the range between lang.ngram.min.length and lang.ngram.max.length, the better the identification, but the slower it is.

 * lang.analyze.max.length: The maximum number of bytes of data used to identify the language (0 means full content analysis). The larger this value, the better the analysis, but the slower it is.
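
To illustrate why a wider ngram range and a larger analysis window cost more time, here is a minimal sketch of ngram extraction driven by three values that mirror the parameters above. It is not the code from the patch: the class and method names are hypothetical, and the truncation works on characters rather than bytes for simplicity.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; not the Nutch implementation. */
class NgramSketch {

    /**
     * Returns every character ngram of text whose length lies in
     * [minLength, maxLength]. The input is first truncated to maxAnalyze
     * characters (0 means "analyze the full content").
     * Assumption: characters are used here, whereas the plugin
     * parameter is expressed in bytes.
     */
    static List<String> extract(String text, int minLength, int maxLength, int maxAnalyze) {
        if (minLength < 1 || minLength > maxLength) {
            throw new IllegalArgumentException("need 1 <= minLength <= maxLength");
        }
        String input = (maxAnalyze > 0 && text.length() > maxAnalyze)
                ? text.substring(0, maxAnalyze)
                : text;
        List<String> ngrams = new ArrayList<>();
        // One pass over the input per ngram size: the work grows with
        // both the size range and the amount of text analyzed.
        for (int n = minLength; n <= maxLength; n++) {
            for (int i = 0; i + n <= input.length(); i++) {
                ngrams.add(input.substring(i, i + n));
            }
        }
        return ngrams;
    }
}
```

For example, `NgramSketch.extract("abcd", 1, 2, 0)` yields four unigrams and three bigrams, while widening the range to 1..4 yields ten ngrams for the same input.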

New ngram profiles have been generated for en, es, fr, nl, it, pt, da, sv, de, fi and el, because the new implementation needs more ngrams in each profile; it remains backward compatible with the old profiles.

Some unit tests have been added.





[jira] Updated: (NUTCH-60) Bad language identifier plugin performances

Sergey Smolyakov (Jira)
     [ http://issues.apache.org/jira/browse/NUTCH-60?page=all ]

Jerome Charron updated NUTCH-60:
--------------------------------

    Attachment: NUTCH-60-050605.patch

This patch keeps the improvements of the previous one (configuration) and provides some optimizations that reduce the processing time by 20% to 70%, depending on the configuration (size of the data to process), with an average gain of 25%.
I will provide more detailed results of my benchmarks on the Wiki as soon as possible (http://wiki.apache.org/nutch/LanguageIdentifierBenchs).





[jira] Commented: (NUTCH-60) Bad language identifier plugin performances

Sergey Smolyakov (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-60?page=comments#action_12312863 ]

Jerome Charron commented on NUTCH-60:
-------------------------------------

Committers, please do not apply these patches: they cause a loss of identification precision.
I have identified the problem and have just quantified it.
I am currently working on a new version of the patch that solves this issue.



[jira] Updated: (NUTCH-60) Bad language identifier plugin performances

Sergey Smolyakov (Jira)
     [ http://issues.apache.org/jira/browse/NUTCH-60?page=all ]

Jerome Charron updated NUTCH-60:
--------------------------------

    Attachment: NUTCH-60-050607.patch

Here it is: the final (?) patch. It provides around +25% performance and increases the identification precision. More details are available at http://wiki.apache.org/nutch/LanguageIdentifierBenchs



[jira] Commented: (NUTCH-60) Bad language identifier plugin performances

Sergey Smolyakov (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-60?page=comments#action_12313316 ]

Sami Siren commented on NUTCH-60:
---------------------------------

Do you have any ready-made scripts that you used to measure the performance (quality and speed)? I could use them to see whether my additional optimizations have any impact.



[jira] Commented: (NUTCH-60) Bad language identifier plugin performances

Sergey Smolyakov (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-60?page=comments#action_12313323 ]

Jerome Charron commented on NUTCH-60:
-------------------------------------

Sami,

* For speed, I simply uncomment the lines marked "used for benchs" in the main method of LanguageIdentifier. Then I launch the TestIdentifier on a large set of files using the fileset command line argument.

* For quality, I configure the language identifier plugin with the desired size of data to analyze, comment out again the lines that were uncommented for the speed test, and run the command line with the fileset argument on a large set of documents in the same language, piping the output through grep and wc to get the number of failed identifications:
java org.apache.nutch.analysis.lang.LanguageIdentifier -identifyfileset /somewhere/fr/*.txt | grep -v "identified as fr" | wc -l

 Hope this helps. But you are right, a set of scripts would be a good idea.
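
As a rough illustration of what the grep and wc pipeline above computes, the following sketch counts the output lines that do not report the expected language. The class name, the method name and the sample line format are assumptions made for the example; only the "identified as fr" marker comes from the command line above.

```java
import java.util.List;

/** Hypothetical helper; mirrors `grep -v "identified as <lang>" | wc -l`. */
class FailureCountSketch {

    /** Counts the lines that do NOT contain "identified as <expectedLang>". */
    static long countFailures(List<String> outputLines, String expectedLang) {
        String marker = "identified as " + expectedLang;
        return outputLines.stream()
                .filter(line -> !line.contains(marker))
                .count();
    }
}
```

Like the grep pattern, this is a plain substring match, so any line that is not an identification result (a log message, for instance) also counts as a failure.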



[jira] Updated: (NUTCH-60) Bad language identifier plugin performances

Sergey Smolyakov (Jira)
     [ http://issues.apache.org/jira/browse/NUTCH-60?page=all ]

Jerome Charron updated NUTCH-60:
--------------------------------

    Attachment: NUTCH-60-050627.patch

In the previous patch, the LanguageIdentifier IndexingFilter no longer had a public default constructor, because it was a singleton (which caused a RuntimeException at runtime). This new patch splits the code of the LanguageIdentifier class into a LanguageIdentifier singleton and a LanguageIndexingFilter, so that everything works.
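
The problem and the fix can be sketched as follows, assuming the plugin loader creates filters reflectively through a public no-argument constructor. All class names below are hypothetical stand-ins, not the classes from the patch, and the identification logic is a placeholder.

```java
/** Hypothetical stand-in for the shared identifier: single instance, private constructor. */
final class IdentifierSingletonSketch {
    private static final IdentifierSingletonSketch INSTANCE = new IdentifierSingletonSketch();

    private IdentifierSingletonSketch() { }

    static IdentifierSingletonSketch getInstance() {
        return INSTANCE;
    }

    /** Placeholder logic only; real identification would use ngram profiles. */
    String identify(String text) {
        return text.isEmpty() ? "unknown" : "xx";
    }
}

/** Hypothetical stand-in for the filter: public default constructor, delegates to the singleton. */
final class IndexingFilterSketch {
    public IndexingFilterSketch() { }

    String languageOf(String text) {
        return IdentifierSingletonSketch.getInstance().identify(text);
    }
}

/** Hypothetical loader that needs a public no-argument constructor. */
class PluginLoaderSketch {
    static boolean canInstantiate(Class<?> type) {
        try {
            // getConstructor() only finds public constructors.
            type.getConstructor().newInstance();
            return true;
        } catch (ReflectiveOperationException e) {
            return false;
        }
    }
}
```

In this sketch the loader fails on the singleton class but succeeds on the filter class, which is the reason for keeping the two responsibilities in separate classes.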



[jira] Closed: (NUTCH-60) Bad language identifier plugin performances

Sergey Smolyakov (Jira)
     [ http://issues.apache.org/jira/browse/NUTCH-60?page=all ]
     
Andrzej Bialecki  closed NUTCH-60:
----------------------------------

    Resolution: Fixed

Patches have been applied. Thanks!

