[jira] [Commented] (LUCENE-8462) New Arabic snowball stemmer

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-8462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16646314#comment-16646314 ]

Ryadh Dahimene commented on LUCENE-8462:
----------------------------------------

Hi Lucene team,
Just a quick summary of the state of this change. In this version, the contributed Snowball Arabic stemmer has been generated using the `ant patch-snowball` task. To achieve that, the ant task has been updated: it is now compatible with the latest version of Snowball (revision 1964ce688cbeca505263c8f77e16ed923296ce7a) and also backward compatible with the revision of the Snowball repository currently used by Lucene. In my opinion, this change is now ready and will allow users to use the new Arabic Snowball stemmer.

In the longer term, I believe it would be better to sync all the Lucene Snowball stemmers with the latest version of the Snowball stemmers (https://github.com/snowballstem/snowball). This would allow smoother integration of newly added and updated languages and would reduce the complexity of the `ant patch-snowball` task. The version currently used is based on revision 502 of the Tartarus Snowball repository (https://github.com/snowballstem/snowball/tree/e103b5c257383ee94a96e7fc58cab3c567bf079b), which is now more than 10 years old.

This is a wider change, in the sense that its impact has yet to be assessed, but if the team believes it is relevant and sees value in it, I'll be happy to invest some time in this task.

> New Arabic snowball stemmer
> ---------------------------
>
>                 Key: LUCENE-8462
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8462
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Ryadh Dahimene
>            Priority: Trivial
>              Labels: Arabic, snowball, stemmer
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Added a new Arabic snowball stemmer based on [https://github.com/snowballstem/snowball/blob/master/algorithms/arabic.sbl]
> As well as an Arabic test dataset in `TestSnowballVocabData.zip` from the -snowball-data- generated from the input file available here -[https://github.com/snowballstem/snowball-data/tree/master/arabic]-
> [https://github.com/ibnmalik/golden-corpus-arabic/blob/develop/core/words.txt]
>  
> It also updates the {{ant patch-snowball}} target to be compatible with
> the Java classes generated by the latest Snowball version (tree:
> 1964ce688cbeca505263c8f77e16ed923296ce7a). The {{ant patch-snowball}} target
> is backward compatible with the version of the Snowball stemmers used in
> Lucene 7.x and ignores already patched classes.
>  
> Link to the corresponding Github PR:
> [https://github.com/apache/lucene-solr/pull/449]
>  Edited: updated the corpus link, PR link and description
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
