[lucy-user] Running query string thru Analyzer?

Gerald Richter
Hi,

 
I have defined a field in the following way:

 
    my $tokenizer    = Lucy::Analysis::StandardTokenizer->new;
    my $normalizer   = Lucy::Analysis::Normalizer->new (strip_accents => 1, case_fold => 1) ;
    my $field_analyzer = Lucy::Analysis::PolyAnalyzer->new
                            (
                            analyzers => [ $tokenizer, $normalizer ],
                            );
    my $field_type  = Lucy::Plan::FullTextType->new (analyzer => $field_analyzer) ;
    $schema->spec_field( name => 'option_ndx',  type => $field_type );

 
When I now run a query (either with a TermQuery or a WildcardQuery), and the indexed document was "Foo baß", it works as long as I query for "foo", but not when I query for "Foo" or "baß". So I guess I have to run the query string thru the same analyzer as the indexer does.

 
The question is how can I do this or is Lucy able to do this for me?

 
Thanks & Regards

 
Gerald

 
P.S. I am using Lucy 0.42

 
 

 
Re: [lucy-user] Running query string thru Analyzer?

Nick Wellnhofer
On 17/01/2015 15:55, Gerald Richter wrote:
> When I now run a query (either with a TermQuery or a WildcardQuery), and the indexed document was "Foo baß", it works as long as I query for "foo", but not when I query for "Foo" or "baß". So I guess I have to run the query string thru the same analyzer as the indexer does.
>
> The question is how can I do this or is Lucy able to do this for me?

Lucy's Query classes do that automatically for you. My guess is that either
your indexed document or your query term contains a "ß" character in the wrong
encoding. The most common reasons are:

- UTF-8 string in source code without "use utf8;".
- String read from UTF-8 file without setting the file encoding
   or without decoding manually.

If a search for "ba\xC3\x9F" works, then the problem is with the indexed
document. If a search for "ba\xDF" works, the problem is with your query term.
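
For reference, a minimal sketch of how the two strings mentioned above differ, using only core Perl. The variable names are illustrative and not from the original post; this shows the byte-versus-character distinction in general and is not a diagnosis of this particular index.

```perl
use strict;
use warnings;
use Encode qw( decode );

# The UTF-8 bytes for "baß", as they arrive from a file read without an
# encoding layer or from a source literal without "use utf8;".
my $bytes = "ba\xC3\x9F";

# The same text after decoding: a character string whose last character
# is U+00DF.
my $chars = decode( 'UTF-8', $bytes );

printf "undecoded length: %d\n", length $bytes;
printf "decoded length:   %d\n", length $chars;
print "decoded string equals \"ba\\xDF\"\n" if $chars eq "ba\xDF";
```

The undecoded string has one element more than the decoded one, because "ß" occupies two bytes in UTF-8. Reading input through an `:encoding(UTF-8)` layer or calling `decode` on it before indexing and before querying keeps both sides in the decoded form.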

Nick

Re: [lucy-user] Running query string thru Analyzer?

Nick Wellnhofer
On 17/01/2015 17:50, Nick Wellnhofer wrote:
> Lucy's Query classes do that automatically for you. My guess is that either
> your indexed document or your query term contains a "ß" character in the wrong
> encoding. The most common reasons are:

Oops, I just saw that queries for "Foo" don't work either, so scratch that.
Can you show us your indexing and querying code or even a self-contained test
case?

Nick

Re: [lucy-user] Running query string thru Analyzer?

Marvin Humphrey
On Sat, Jan 17, 2015 at 8:58 AM, Nick Wellnhofer <[hidden email]> wrote:

> Oops, I just saw that queries for "Foo" don't work either, so scratch that.
> Can you show us your indexing and querying code or even a self-contained
> test case?

The issue is probably that TermQuery's constructor takes exactly what you give
it, which may not match what's in the index.  In this case, `foo` is in the
index, so queries for `Foo` don't work.

A QueryParser will probably give Gerald what he wants, because it will apply
the appropriate Analyzer.

  use Lucy;
  use Data::Dumper qw( Dumper );
  my $searcher = Lucy::Search::IndexSearcher->new(index => '/path/to/index');
  my $qparser = Lucy::Search::QueryParser->new(
    schema => $searcher->get_schema,
    fields => ['content'],  # optional, try without this.
  );
  my $query = $qparser->parse("Foo");
  warn Dumper($query->dump);

Marvin Humphrey

Re: [lucy-user] Running query string thru Analyzer?

Nick Wellnhofer
On 17/01/2015 18:19, Marvin Humphrey wrote:
> On Sat, Jan 17, 2015 at 8:58 AM, Nick Wellnhofer <[hidden email]> wrote:
>
>> Oops, I just saw that queries for "Foo" don't work either, so scratch that.
>> Can you show us your indexing and querying code or even a self-contained
>> test case?
>
> The issue is probably that TermQuery's constructor takes exactly what you give
> it, which may not match what's in the index.  In this case, `foo` is in the
> index, so queries for `Foo` don't work.

Ah yes, of course. Sorry for the noise.

Another approach is to run the term through the field's analyzer manually
before building the TermQuery:

     my $type = $schema->fetch_type('option_ndx');
     # get_analyzer only works for FullTextType.
     my $analyzer = $type->get_analyzer;
     # split returns an array ref of analyzed token texts.
     my $tokens = $analyzer->split('Foo');
     # A TermQuery takes a single term, so expect exactly one token.
     die "Expected exactly one token" unless @$tokens == 1;
     my $term_query = Lucy::Search::TermQuery->new(
         field => 'option_ndx',
         term  => $tokens->[0],
     );
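
When the input may analyze to more than one token (for example "Foo baß"), a single TermQuery is not enough. The following is a rough sketch of one way to handle that, assuming Lucy is installed. The helper name `query_for_text` is hypothetical, and falling back to a PhraseQuery for several tokens is an assumption about the desired matching, not something the thread prescribes. The analyzer chain is copied from the field definition in the original post.

```perl
use strict;
use warnings;
use Lucy;

# Same analysis chain as the 'option_ndx' field in the original post.
my $analyzer = Lucy::Analysis::PolyAnalyzer->new(
    analyzers => [
        Lucy::Analysis::StandardTokenizer->new,
        Lucy::Analysis::Normalizer->new( strip_accents => 1, case_fold => 1 ),
    ],
);

# Hypothetical helper: analyze $text and pick a query class based on the
# number of tokens that come back.
sub query_for_text {
    my ( $field, $text ) = @_;
    my $tokens = $analyzer->split($text);

    # Nothing survived analysis, so nothing can match.
    return Lucy::Search::NoMatchQuery->new if !@$tokens;

    # Exactly one token: a plain TermQuery on the analyzed form.
    return Lucy::Search::TermQuery->new( field => $field, term => $tokens->[0] )
        if @$tokens == 1;

    # Several tokens: require them in sequence (an assumption).
    return Lucy::Search::PhraseQuery->new( field => $field, terms => $tokens );
}

my $query = query_for_text( 'option_ndx', 'Foo' );
print ref($query), "\n";
```

The returned object can be passed to `$searcher->hits( query => $query )` like any other Lucy query. If matching all tokens in any order is preferred over a phrase match, an ANDQuery over one TermQuery per token would be the alternative.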

Some of this is explained in the QueryObjects tutorial and the
CustomQueryParser cookbook entry:

 
https://metacpan.org/pod/distribution/Lucy/lib/Lucy/Docs/Tutorial/QueryObjects.pod
 
https://metacpan.org/pod/distribution/Lucy/lib/Lucy/Docs/Cookbook/CustomQueryParser.pod

But unfortunately, the get_analyzer method of FullTextType is undocumented. I
think this should be fixed.

Nick

Re: [lucy-user] Running query string thru Analyzer?

Peter Karman
In reply to this post by Gerald Richter
On 1/17/15 8:55 AM, Gerald Richter wrote:

> Hi,
>
>  
> I have defined a field in the following way:
>
>  
>     my $tokenizer    = Lucy::Analysis::StandardTokenizer->new;
>     my $normalizer   = Lucy::Analysis::Normalizer->new (strip_accents => 1, case_fold => 1) ;
>     my $field_analyzer = Lucy::Analysis::PolyAnalyzer->new
>                             (
>                             analyzers => [ $tokenizer, $normalizer ],
>                             );
>     my $field_type  = Lucy::Plan::FullTextType->new (analyzer => $field_analyzer) ;
>     $schema->spec_field( name => 'option_ndx',  type => $field_type );
>
>  
> When I now run a query (either with a TermQuery or a WildcardQuery), and the indexed document was "Foo baß", it works as long as I query for "foo", but not when I query for "Foo" or "baß". So I guess I have to run the query string thru the same analyzer as the indexer does.
>
>  
> The question is how can I do this or is Lucy able to do this for me?
>

In addition to the good advice elsewhere on this thread, you can use the
Search::Query Lucy dialect to parse and analyze plain strings
appropriately, with code like this:

----------------------------------
use Lucy;
use Search::Query;

my ($idx, $query) = get_index_name_and_query();

my $searcher = Lucy::Search::IndexSearcher->new( index => $idx );
my $schema   = $searcher->get_schema();

# build field mapping
my %fields;
for my $field_name ( @{ $schema->all_fields() } ) {
    $fields{$field_name} = {
        type     => $schema->fetch_type($field_name),
        analyzer => $schema->fetch_analyzer($field_name),
    };
}

my $query_parser = Search::Query->parser(
    dialect        => 'Lucy',
    croak_on_error => 1,
    default_field  => 'foo',  # applied to "bare" terms with no field
    fields         => \%fields
);

my $parsed_query = $query_parser->parse($query);
my $lucy_query   = $parsed_query->as_lucy_query();
my $hits         = $searcher->hits( query => $lucy_query );

--------------------------------



Something similar is performed in Dezi::Lucy::Searcher:
https://metacpan.org/source/KARMAN/Dezi-App-0.013/lib/Dezi/Lucy/Searcher.pm#L124

See also the Search::Query::Dialect::Lucy documentation:
https://metacpan.org/pod/Search::Query::Dialect::Lucy



--
Peter Karman  .  http://peknet.com/  .  [hidden email]