[lucy-user] stemming, Lucy and Stem::Lingua::Snowball

arjan
Dear all,

In English, possession can be indicated by an apostrophe and s, as in
"this man's computer". In Dutch it is almost the same, except that in
most cases there is no apostrophe. We only use an apostrophe when the
word ends in an s or in a/o/e/i/u. For example:

Jans hoed (hat)
Jos' tas (bag)
Monica's jas (coat)

The Lingua::Stem::Snowball module does not know this. The small script
below this email demonstrates that.

The default case is stemmed correctly: Jans -> Jan. For the exceptions,
Jos' and Monica's, the stemmer leaves the apostrophe at the end. The
spelling Jan's, which is an error in Dutch for Jans, is also stemmed
wrongly.

In Lucy this leads to having Jos' and Monica' as words in the lexicon.
Messages with "Monica's" will not be found when searching for "Monica".
This is demonstrated with the word Halsema's in the second copy-paste
script below.

Is this indeed a bug? Is there a way to work around this?

Kind regards,
Arjan Widlak

United Knowledge
http://www.unitedknowledge.nl

---Lingua::Stem::Snowball--------------------------------------------------------------------------------------
use strict;
use warnings;
use 5.010;

use Encode;
use Lingua::Stem::Snowball;

my @words = qw( Jans Jos' Monica's Jan's );

my $stemmer = Lingua::Stem::Snowball->new( lang => 'nl' );
$stemmer->stem_in_place( \@words );

foreach my $word ( @words ) {
     say encode( 'utf8', $word );
}
---Lingua::Stem::Snowball--------------------------------------------------------------------------------------

---Lucy---------------------------------------------------------------------------------------------------------------
use strict;
use warnings;
use 5.010;
use Encode;

use Lucy::Plan::Schema;
use Lucy::Plan::FullTextType;
use Lucy::Index::Indexer;
use Lucy::Search::IndexSearcher;
use Lucy::Search::QueryParser;
use Lucy::Analysis::RegexTokenizer;
use Lucy::Analysis::PolyAnalyzer;
use Lucy::Analysis::CaseFolder;
use Lucy::Analysis::SnowballStemmer;
use Lucy::Index::IndexReader;
use Lucy::Index::LexiconReader;
use utf8; # the script itself contains UTF-8 data

# create an index
my $document1 = {
     searchstring    => qq|In een column schrijft hij een reactie op
Femke Halsema's voorstel om te komen tot meer samenwerking op links.|,
};

my $message_storage = "/tmp";
my $schema          = Lucy::Plan::Schema->new;
my $case_folder     = Lucy::Analysis::CaseFolder->new;
my $tokenizer       = Lucy::Analysis::RegexTokenizer->new;

my $stemmer = Lucy::Analysis::SnowballStemmer->new(
     language    => 'nl',
);
my $polyanalyzer    = Lucy::Analysis::PolyAnalyzer->new(
     language    => 'nl',
     analyzers   => [ $case_folder, $tokenizer, $stemmer ],
);

# Field Types
my $type_text    = Lucy::Plan::FullTextType->new(
     analyzer        => $polyanalyzer,
     indexed         => 1,
     stored          => 1,
     sortable        => 0
);

$schema->spec_field( name => "searchstring", type => $type_text );
my $indexer = Lucy::Index::Indexer->new(
     schema      => $schema,
     index       => $message_storage,
     create      => 1,
     truncate    => 1,
);

$indexer->add_doc( $document1 );
$indexer->commit;

# See what we find
my $query_parser = Lucy::Search::QueryParser->new(
     schema  => $schema,
     fields  => [ 'searchstring' ],
);

my $query = $query_parser->parse( qw( Halsema ) );

my $searcher = Lucy::Search::IndexSearcher->new(
     index => $message_storage,
);

my $hits = $searcher->hits(
     query       => $query,
     offset      => 0,
     num_wanted  => 10000,
);

say encode( 'utf8', "\n\tHits from the index:");
while ( my $hit = $hits->next ) {
     say encode( 'utf8', "found hit on: " . $hit->{ searchstring } );
}

# See what's in the lexicon
my $polyreader = Lucy::Index::IndexReader->open(
     index => $message_storage,
);
my $seg_readers = $polyreader->seg_readers;

say encode('utf8', "\n\tIndividual words in the lexicon:");
foreach my $seg_reader ( @$seg_readers ) {
     my $lex_reader = $seg_reader->obtain( "Lucy::Index::LexiconReader" );
     my $lexicon    = $lex_reader->lexicon( field => 'searchstring' );

     while ( $lexicon->next ) {
         say encode( 'utf8', $lexicon->get_term );
     }
}
---Lucy---------------------------------------------------------------------------------------------------------------

Re: [lucy-user] stemming, Lucy and Stem::Lingua::Snowball

Marvin Humphrey
Hello, Arjan,

Thanks for the thorough example and explanation.  Good test cases are
very helpful!

On Sat, Jul 09, 2011 at 06:07:18PM +0200, arjan wrote:

> In English, possession can be indicated by an apostrophe and s, as in
> "this man's computer". In Dutch it is almost the same, except that in
> most cases there is no apostrophe. We only use an apostrophe when the
> word ends in an s or in a/o/e/i/u. For example:
>
> Jans hoed (hat)
> Jos' tas (bag)
> Monica's jas (coat)
>
> The Lingua::Stem::Snowball module does not know this. The small script
> below this email demonstrates that.
>
> The default case is stemmed correctly: Jans -> Jan. For the exceptions,
> Jos' and Monica's, the stemmer leaves the apostrophe at the end. The
> spelling Jan's, which is an error in Dutch for Jans, is also stemmed
> wrongly.
>
> In Lucy this leads to having Jos' and Monica' as words in the lexicon.
> Messages with "Monica's" will not be found when searching for "Monica".
> This is demonstrated with the word Halsema's in the second copy-paste
> script below.
>
> Is this indeed a bug?

If I understand your explanation well enough, then I think we may want to
treat it as a Lucy bug.

It seems that the behavior of the Dutch Snowball stemmer is known and
intentional.  From the Snowball website:

    http://snowball.tartarus.org/texts/introduction.html

    The Dutch stemmer presented here assumes hyphen and apostrophe have
    already been removed from the word to be stemmed.

That means we have a bug in either Lucy::Analysis::SnowballStemmer or
Lucy::Analysis::PolyAnalyzer, depending on how independent we consider
SnowballStemmer to be.

If we believe that SnowballStemmer should compensate for the idiosyncrasies of
the Snowball library and assume responsibility for stripping apostrophes, then
SnowballStemmer has a bug.

If we believe that SnowballStemmer should be the thinnest possible wrapper
around the Snowball libraries and that it should be PolyAnalyzer's
responsibility to feed it materials with apostrophes already stripped, then
PolyAnalyzer has a bug.

I suspect that we want SnowballStemmer to assume responsibility, since that
will make it easier to use SnowballStemmer as a component.  I don't think it
would be wise for us to require that the user know about this quirk and
manually intervene to compensate for it when assembling a custom PolyAnalyzer.
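
As a rough illustration of what "assuming responsibility" could mean, here
is a minimal sketch in plain Perl of a normalization step applied to each
token before it is handed to the stemmer. The helper name strip_possessive
is hypothetical, and dropping a trailing 's as well as a bare trailing
apostrophe is an assumption about what is wanted for Dutch. It does not
describe how Lucy or Snowball currently behave.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Lucy or Snowball: removes a trailing
# possessive marker ('s or a bare apostrophe, ASCII ' or U+2019) from a
# token and returns the result. Tokens without such a marker are unchanged.
sub strip_possessive {
    my ($token) = @_;
    ( my $stripped = $token ) =~ s/['\x{2019}]s?\z//;
    return $stripped;
}

# The four test words from the original report.
my @words = qw( Jans Jos' Monica's Jan's );
print join( ' ', map { strip_possessive($_) } @words ), "\n";
```

The output of such a step would then be passed to the stemmer, so that the
stemmer never sees an apostrophe.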

Still, there's also the possibility of using a different default Tokenizer
pattern within the Dutch PolyAnalyzer.  This is the existing pattern, which is
optimized for English:

    # Matches "it's", "O'Henry's", etc...
    "\\w+(?:[\\x{2019}']\\w+)*"

Is that also well-optimized for Dutch?
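
One quick way to look at that question is to run the same pattern over the
Dutch examples from this thread. The sketch below uses a plain Perl regex
rather than Lucy itself, so it only approximates what RegexTokenizer would
emit. The helper name tokenize is hypothetical.

```perl
use strict;
use warnings;

# The pattern quoted above, written as a Perl regex instead of a C string.
my $token_re = qr/\w+(?:[\x{2019}']\w+)*/;

# Hypothetical helper: returns every match of the pattern in the text,
# in order, as a list of tokens.
sub tokenize {
    my ($text) = @_;
    return $text =~ /$token_re/g;
}

for my $phrase ( "Jans hoed", "Jos' tas", "Monica's jas" ) {
    print join( ' | ', tokenize($phrase) ), "\n";
}
```

In this sketch a trailing apostrophe, as in Jos', falls outside the match,
while an internal one, as in Monica's, stays inside the token. Only the
second kind would reach the stemmer with its apostrophe still attached.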

> Is there a way to work around this?

I believe the following PolyAnalyzer will get the job done for you:

  my $case_folder = Lucy::Analysis::CaseFolder->new;
  my $tokenizer   = Lucy::Analysis::RegexTokenizer->new;
  my $stemmer     = Lucy::Analysis::SnowballStemmer->new( language => 'nl' );
  my $apostrophe_stripper
    = Lucy::Analysis::RegexTokenizer->new( pattern => ".*[^'\\x{2019}]" );
  my $polyanalyzer = Lucy::Analysis::PolyAnalyzer->new(
    analyzers => [ $case_folder, $tokenizer, $stemmer, $apostrophe_stripper ],
  );
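
To show what the extra pattern does to a single term, here is a minimal
sketch using a plain Perl regex instead of Lucy. The helper name
strip_trailing_apostrophes is hypothetical, and the input terms are written
the way the original report says they come out of the stemmer.

```perl
use strict;
use warnings;

# The same pattern that is passed to $apostrophe_stripper above.
my $strip_re = qr/.*[^'\x{2019}]/;

# Hypothetical helper: returns every match of the pattern in one term.
# The greedy .* backs off until the last matched character is not an
# apostrophe, so trailing apostrophes are left out of the match.
sub strip_trailing_apostrophes {
    my ($term) = @_;
    return $term =~ /$strip_re/g;
}

for my $term ( "jan", "jos'", "monica'", "halsema'" ) {
    print join( ' | ', strip_trailing_apostrophes($term) ), "\n";
}
```

Note that the pattern only trims apostrophes at the end of a term. A term
such as monica's, which ends in s, would pass through unchanged, so this
approach depends on the stemmer having already removed the s.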

Best,

Marvin Humphrey