Highlight with Proximity search throws an exception

Juraj Jurčo
Hi guys, 
we are trying to implement search and have run into a strange situation. When our text contains an apostrophe followed by a single character AND our search query is composed of exactly two letters followed by proximity search (the ~ operator, which the classic QueryParser turns into a fuzzy query when it follows a single term) AND we use highlighting, we get an exception:

java.lang.IllegalArgumentException: boost must be a positive float, got -1.0

The problem seems to be at FuzzyTermsEnum.java:271 (float similarity = 1.0f - (float) ed / (float) minTermLength): when that line is reached with ed=2, it computes a negative similarity and sets it as the boost.

I was able to reproduce the error with the following code:
import java.io.IOException;
import java.nio.file.Path;

import org.apache.commons.io.FileUtils;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.InvalidTokenOffsetsException;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.search.highlight.TokenSources;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.junit.jupiter.api.Test;

class FindSqlHighlightTest {

   @Test
   void reproduceHighlightProblem() throws IOException, ParseException, InvalidTokenOffsetsException {
      String text = "doesn't";
      String field = "text";
      //NOK: se~, se~2 and any higher number
      //OK: sel~, s~, se~1
      String uQuery = "se~";
      int maxStartOffset = -1;
      Analyzer analyzer = new SimpleAnalyzer();

      Path indexLocation = Path.of("temp", "reproduceHighlightProblem").toAbsolutePath();
      if (indexLocation.toFile().exists()) {
         FileUtils.deleteDirectory(indexLocation.toFile());
      }
      Directory indexDir = FSDirectory.open(indexLocation);

      //Create index
      IndexWriterConfig dimsIndexWriterConfig = new IndexWriterConfig(analyzer);
      dimsIndexWriterConfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
      IndexWriter idxWriter = new IndexWriter(indexDir, dimsIndexWriterConfig);
      //add doc
      Document doc = new Document();
      doc.add(new TextField(field, text, Field.Store.NO));
      idxWriter.addDocument(doc);
      //commit
      idxWriter.commit();
      idxWriter.close();

      //search & highlight
      Query query = new QueryParser(field, analyzer).parse(uQuery);
      Highlighter highlighter = new Highlighter(new SimpleHTMLFormatter(), new QueryScorer(query));
      TokenStream tokenStream = TokenSources.getTokenStream(field, null, text, analyzer, maxStartOffset);
      String highlighted = highlighter.getBestFragment(tokenStream, text);
      System.out.println(highlighted);
   }
}
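
A side note on why the apostrophe seems to matter, as a rough sketch only: SimpleAnalyzer breaks text at every non-letter and lower-cases it, so "doesn't" should end up as the two terms "doesn" and "t". The plain-Java class below merely imitates that behaviour without any Lucene dependency; LetterTokens and tokenize are made-up names for illustration, not Lucene API.

```java
import java.util.ArrayList;
import java.util.List;

class LetterTokens {

    // Splits text into lower-cased runs of letters; any non-letter ends the current token.
    // This is an approximation of letter-based tokenization, not Lucene's own code.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isLetter(cp)) {
                current.appendCodePoint(Character.toLowerCase(cp));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("doesn't")); // prints [doesn, t]
    }
}
```

The one-letter term "t" is what a two-letter fuzzy query can reach within two edits, which would explain why the text needs a single character after the apostrophe.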

Could you please confirm whether this is a bug in Lucene or whether we are doing something that is not allowed?

Thanks a lot!
Best,
Juraj+

Re: Highlight with Proximity search throws an exception

Michael Sokolov-4
I traced this to this block in FuzzyTermsEnum:

    if (ed == 0) { // exact match
      boostAtt.setBoost(1.0F);
    } else {
      final int codePointCount = UnicodeUtil.codePointCount(term);
      int minTermLength = Math.min(codePointCount, termLength);

      float similarity = 1.0f - (float) ed / (float) minTermLength;
      boostAtt.setBoost(similarity);
    }

where in your test ed (edit distance) was 2 and minTermLength was 1,
so similarity = 1.0 - 2/1 = -1.0, i.e. a negative boost.

I don't really understand this code at all, but I wonder if it should
divide by maxTermLength instead of minTermLength?
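
To make the arithmetic concrete, here is a small stand-alone sketch, not the Lucene code itself. It assumes the two terms being compared are the query term "se" and the indexed term "t", and it uses a plain Levenshtein distance (Lucene's fuzzy matching can also count transpositions, which makes no difference for this pair). FuzzyBoostSketch, editDistance and similarity are hypothetical names.

```java
class FuzzyBoostSketch {

    // Plain dynamic-programming Levenshtein distance over UTF-16 chars.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Same shape as the formula quoted from FuzzyTermsEnum.
    static float similarity(int ed, int length) {
        return 1.0f - (float) ed / (float) length;
    }

    public static void main(String[] args) {
        String queryTerm = "se";   // assumed query term
        String indexedTerm = "t";  // assumed term left after the apostrophe
        int ed = editDistance(queryTerm, indexedTerm);
        int minLen = Math.min(queryTerm.length(), indexedTerm.length());
        int maxLen = Math.max(queryTerm.length(), indexedTerm.length());
        System.out.println("ed = " + ed);                                  // ed = 2
        System.out.println("by min length: " + similarity(ed, minLen));    // -1.0
        System.out.println("by max length: " + similarity(ed, maxLen));    // 0.0
    }
}
```

Since a Levenshtein distance can never exceed the length of the longer string, dividing by the maximum length keeps the value at or above zero. Whether that is the right fix for FuzzyTermsEnum is still an open question here.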

On Thu, Oct 1, 2020 at 9:54 AM Juraj Jurčo <[hidden email]> wrote:

>
> Hi guys,
> we are trying to implement search and we have experienced a strange situation. When our text contains an apostrophe followed by a single character AND our search query is composed of exactly two letters followed by proximity search AND we use highlighting, we get an exception:
>
>> java.lang.IllegalArgumentException: boost must be a positive float, got -1.0

Re: Highlight with Proximity search throws an exception

Michael McCandless-2
Hi Juraj+,

This indeed smells like a bug.  FuzzyTermsEnum should never try to set a negative boost!

Could you open an issue and open a PR (or attach a patch) with your test case?  Thank you for boiling this down.  This part really made me chuckle:

> When our text contains an apostrophe followed by a single character AND our search query is composed of exactly two letters followed by proximity search AND we use highlighting, we get an exception:

On Thu, Oct 1, 2020 at 12:48 PM Michael Sokolov <[hidden email]> wrote:
> I traced this to this block in FuzzyTermsEnum:
>
>     if (ed == 0) { // exact match
>       boostAtt.setBoost(1.0F);
>     } else {
>       final int codePointCount = UnicodeUtil.codePointCount(term);
>       int minTermLength = Math.min(codePointCount, termLength);
>
>       float similarity = 1.0f - (float) ed / (float) minTermLength;
>       boostAtt.setBoost(similarity);
>     }
>
> where in your test ed (edit distance) was 2 and minTermLength 1,
> leading to negative boost.
>
> I don't really understand this code at all, but I wonder if it should
> divide by maxTermLength instead of minTermLength?


Re: Highlight with Proximity search throws an exception

Michael McCandless-2

On Fri, Oct 2, 2020 at 12:03 PM Michael McCandless <[hidden email]> wrote:
> Hi Juraj+,
>
> This indeed smells like a bug.  FuzzyTermsEnum should never try to set a negative boost!
>
> Could you open an issue and open a PR (or attach a patch) with your test case?  Thank you for boiling this down.  This part really made me chuckle:
>
>> When our text contains an apostrophe followed by a single character AND our search query is composed of exactly two letters followed by proximity search AND we use highlighting, we get an exception:
