[lucy-user] synonym terms

[lucy-user] synonym terms

Anil Pachuri
Hi there,

How should one handle synonym terms in Lucy? I wonder if expanding the query (e.g. terms separated by 'OR') is the best way to do this. Is there a built-in function/sample code available in Lucy that shows how to handle synonym terms at the index level? Please advise.

TIA!
AP

Re: [lucy-user] synonym terms

Peter Karman
On 3/2/14 4:18 PM, Anil Pachuri wrote:
> Hi there,
>
> How should one handle synonym terms in Lucy? I wonder if expanding
> the query (e.g. terms separated by 'OR') is the best way to do this.
> Is there a built-in function/sample code available in Lucy that shows
> how to handle synonym terms at the index level? Please advise.
>

As you allude to, there are two ways to solve the problem: at index time,
or at search time.

There are trade-offs to both; I prefer to do as much at index time as
possible, for a couple of reasons. One, stuffing the index with extra
data at index time means the search-time code doesn't have to work
harder (running a long OR'd string, e.g.). Two, it makes debugging
easier IME, because standard searching code gets the same results as
customized searching code. E.g., you can dump a lexicon to see exactly
what is in the index, synonyms included. OTOH, see the caveats below.

I don't know of any examples in the wild for doing this at index time,
but I imagine something like this would work:

  my %doc   = get_doc_to_index();
  my @terms = get_terms_from_doc(\%doc);  # should analyze like Lucy does
  my %synonyms;
  for my $term (@terms) {
      for my $syn (get_synonyms($term)) {
          $synonyms{$syn}++;  # hash keys avoid duplicates
      }
  }
  # make sure your schema has a 'synonyms' field defined
  $doc{synonyms} = join ' ', keys %synonyms;
  add_to_indexer(\%doc);
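
The helper functions in that snippet are placeholders. Here is a minimal
sketch of what they might look like, assuming a hand-maintained synonym
table and a naive lowercase-and-split tokenizer standing in for the real
Lucy analyzer. The %SYNONYMS table, synonyms_for_doc() and both helper
bodies are hypothetical, not part of Lucy:

```perl
use strict;
use warnings;

# Hypothetical synonym table; a real one might be loaded from a thesaurus file.
my %SYNONYMS = (
    car   => [qw( automobile auto )],
    quick => [qw( fast rapid )],
);

# Naive stand-in for Lucy's analysis chain: lowercase, split on non-word chars.
sub get_terms_from_doc {
    my ($doc) = @_;
    my ( %seen, @terms );
    for my $value ( values %$doc ) {
        next unless defined $value;
        for my $term ( split /\W+/, lc $value ) {
            push @terms, $term if length($term) && !$seen{$term}++;
        }
    }
    return @terms;
}

# Look up the synonyms for one term; empty list if there are none.
sub get_synonyms {
    my ($term) = @_;
    return @{ $SYNONYMS{$term} || [] };
}

# Build the space-separated value for the 'synonyms' field.
sub synonyms_for_doc {
    my ($doc) = @_;
    my %synonyms;
    for my $term ( get_terms_from_doc($doc) ) {
        $synonyms{$_}++ for get_synonyms($term);
    }
    return join ' ', sort keys %synonyms;
}

my %doc = ( title => 'Quick car', body => 'A quick red car.' );
$doc{synonyms} = synonyms_for_doc( \%doc );
print "$doc{synonyms}\n";    # auto automobile fast rapid
```

In a real indexer the tokenizer should be replaced by whatever analyzer
the schema uses, otherwise stemmed or normalized terms will not line up
with the keys in the synonym table.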


The caveats here (and anytime you do this at index-time) include:

  * snipping/highlighting will be strange, since a match in the
'synonyms' field will have zero context.

  * you're increasing the size of your index with content that doesn't
actually exist in your document corpus. That can have unforeseen
usability impact, depending on your application.

  * the 'synonyms' field is "virtual" or "private" so you'll have to
decide whether you want to expose it as part of your public interface or
not.
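
The first and third caveats can be softened somewhat in the schema
itself. A rough sketch, assuming an English-language index with 'title'
and 'content' fields (the field names and the analyzer choice are
assumptions for illustration): give 'synonyms' its own field type that
is indexed but not stored, so it can match at search time without
showing up in hit fields or excerpts.

```perl
use strict;
use warnings;
use Lucy::Plan::Schema;
use Lucy::Plan::FullTextType;
use Lucy::Analysis::EasyAnalyzer;

my $schema   = Lucy::Plan::Schema->new;
my $analyzer = Lucy::Analysis::EasyAnalyzer->new( language => 'en' );

# Regular fields: indexed, stored, and available to the highlighter.
my $text_type = Lucy::Plan::FullTextType->new(
    analyzer      => $analyzer,
    highlightable => 1,
);

# 'synonyms' is searchable but kept private: not stored, not highlightable.
my $synonym_type = Lucy::Plan::FullTextType->new(
    analyzer      => $analyzer,
    stored        => 0,
    highlightable => 0,
);

$schema->spec_field( name => 'title',    type => $text_type );
$schema->spec_field( name => 'content',  type => $text_type );
$schema->spec_field( name => 'synonyms', type => $synonym_type );
```

This does not help with index size, and a document that matched only
via 'synonyms' will still produce an excerpt with nothing highlighted.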


Otherwise, if you do this at search-time with query expansion, I would
expect a small (maybe not measurable) performance hit and more
complicated search code. You could use the Search::Query term_expander
feature[0].

  use Search::Query;

  my $parser = Search::Query->parser(
      dialect => 'Lucy',
      term_expander => sub {
          my ($term, $field) = @_;
          return ($term) if ref $term;    # skip ranges
          # return the original term plus its synonyms
          return ( get_array_of_synonyms_for_term($term), $term );
      },
  );
  my $query      = $parser->parse($str);
  my $lucy_query = $query->as_lucy_query();
  my $hits       = $lucy_searcher->hits( query => $lucy_query );
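
To make concrete what the expansion amounts to, here is a standalone
sketch that does the same thing on a plain query string, with no
Search::Query involved. The %SYNONYMS table and expand_query_string()
are hypothetical, and the tokenizing is deliberately naive: it ignores
phrases, fields and operators, which a real parser handles for you.

```perl
use strict;
use warnings;

# Hypothetical synonym table.
my %SYNONYMS = ( car => [qw( automobile auto )] );

sub get_array_of_synonyms_for_term {
    my ($term) = @_;
    return @{ $SYNONYMS{ lc $term } || [] };
}

# Replace each whitespace-separated word with an OR group of itself
# plus its synonyms; words without synonyms pass through unchanged.
sub expand_query_string {
    my ($str) = @_;
    my @clauses;
    for my $term ( split ' ', $str ) {
        my @alts = ( $term, get_array_of_synonyms_for_term($term) );
        push @clauses,
            @alts > 1 ? '(' . join( ' OR ', @alts ) . ')' : $term;
    }
    return join ' ', @clauses;
}

print expand_query_string('red car'), "\n";
# red (car OR automobile OR auto)
```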


A third way to approach the problem, though it doesn't directly answer
the question you posed, is to treat the synonyms as 'suggestions' for
further searches, rather than searching for them automatically.
Something like LucyX::Suggester[1] could be extended to include synonyms
in addition to spellings.
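
As a rough illustration of the 'suggestions' idea, independent of
LucyX::Suggester (whose API is not used here), the sketch below takes
the user's query and returns synonyms they did not already type, which
the UI could offer as "try also" links. The %SYNONYMS table and
suggest_synonyms() are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical synonym table.
my %SYNONYMS = (
    car   => [qw( automobile auto )],
    quick => [qw( fast rapid )],
);

# Return synonyms of the query terms, skipping anything already in
# the query and anything already suggested.
sub suggest_synonyms {
    my ($str) = @_;
    my @terms = split ' ', lc $str;
    my %seen;
    $seen{$_} = 1 for @terms;
    my @suggestions;
    for my $term (@terms) {
        for my $syn ( @{ $SYNONYMS{$term} || [] } ) {
            push @suggestions, $syn unless $seen{$syn}++;
        }
    }
    return @suggestions;
}

print join( ', ', suggest_synonyms('quick auto car') ), "\n";
# fast, rapid, automobile
```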


[0] https://metacpan.org/pod/Search::Query::Parser#term_expander
[1] https://metacpan.org/pod/LucyX::Suggester

--
Peter Karman  .  http://peknet.com/  .  [hidden email]

Re: [lucy-user] synonym terms

Anil Pachuri
Very helpful and clear reply. Thanks a lot, Peter.



On Monday, March 3, 2014 2:22 PM, Peter Karman <[hidden email]> wrote:
> [...]