highlighting - fuzzy search


Fisheye
Is it possible to get back a highlighted text "snippet" when using fuzzy search? That is, where does Lucene store the words similar to the search query? If I knew where these words were, I could use one of them to highlight.

thx

Simon Dietschi

Re: highlighting - fuzzy search

Erik Hatcher

On Apr 4, 2006, at 8:30 AM, Fisheye wrote:
> Is it possible to get back a highlighted text "snippet" when using fuzzy
> search? That is, where does Lucene store the words similar to the search
> query? If I knew where these words were, I could use one of them to
> highlight.

You mean using a FuzzyQuery (fuzzy~ in QueryParser syntax)?  For any
query which expands to multiple terms, a rewrite of the Query is
needed before the Highlighter can do its thing.  Look at
Query.rewrite().

        Erik


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: highlighting - fuzzy search

Fisheye
OK, thanks Erik. Perhaps my code explains it better:

-------------------------------------------------------------------------------------------------------------------------------

    public void searchQuery(String q, float rel, String indexDir) {

        String excerpt = "";

        try {
            Searcher searcher = new IndexSearcher(indexDir);
            Analyzer analyzer = new StandardAnalyzer();

            Term searchTerm = new Term("text", q);
            FuzzyQuery fuzzyQuery = new FuzzyQuery(searchTerm, rel / 100);

            System.out.println("Searching for: " + q);

            Hits hits = searcher.search(fuzzyQuery);
            System.out.println(hits.length() + " total matching documents");

            for (int i = 0; i < hits.length(); i++) {

                Document doc = hits.doc(i);
                String path = doc.get("path");

                SearchTextHighlighter processText = new SearchTextHighlighter();

//              excerpt =
//                  processText.getExcerpt(doc.get("text"), q, fuzzyQuery);

                if (path != null) {
                    System.out.println(i + ".-------");
                    System.out.println("  Path:    " + path);
                    System.out.println("  Score:   " + hits.score(i));
                    System.out.println("  DocID:   " + doc.get("docID"));
                    System.out.println("  Snippet: " + excerpt);
                    System.out.println();
                } else {
                    String url = doc.get("url");

                    if (url != null) {
                        System.out.println(i + ". " + url);
                        System.out.println("   - " + doc.get("title"));
                        System.out.println("Score: " + hits.score(i));
                    } else {
                        System.out.println(i + ". " + "No path nor URL for this document");
                    }
                }
            }

            searcher.close();

        } catch (Exception e) {
            e.printStackTrace();
        }
    }

-------------------------------------------------------------------------------------------------------------------------------

Method getExcerpt does the following:

-------------------------------------------------------------------------------------------------------------------------------

    public String getExcerpt(String textToCompute, String queryText, Query query) {

        String excerpt = "";
        String vTemp = "";
        Analyzer analyzer = new StandardAnalyzer();
        Highlighter highlighter = new Highlighter(new QueryScorer(query));

        if (textToCompute != null) {

            TokenStream tokenStream = analyzer.tokenStream("text",
                    new StringReader(textToCompute));

            try {
                vTemp = highlighter.getBestFragment(tokenStream, textToCompute);
                excerpt = vTemp.replaceAll("<B>" + queryText + "</B>", queryText);
            } catch (IOException ex) {

            }
        }

        return excerpt;
    }

-------------------------------------------------------------------------------------------------------------------------------

This is the same code I use for a simple, non-fuzzy query. When I use it with a fuzzy query, I get an error.

Re: highlighting - fuzzy search

Erik Hatcher
So, just like I said.... call Query.rewrite() and pass the returned
Query to the Highlighter, not the original FuzzyQuery.  I believe the
javadocs for Highlighter even mention this?  Or at least it's an FAQ
that hopefully is on the wiki or easily findable somehow.

        Erik



Re: highlighting - fuzzy search

Fisheye
OK, thanks Erik, now it works :-)

Do you happen to know whether it is possible to get the similar words generated by the algorithm when doing a fuzzy search?

Cheers

Simon Dietschi

Re: highlighting - fuzzy search

Erik Hatcher

On Apr 4, 2006, at 11:23 AM, Fisheye wrote:
> Do you happen to know whether it is possible to get the similar words
> generated by the algorithm when doing a fuzzy search?

Well, a roundabout way is to simply create a FuzzyQuery, rewrite it,
cast the result to a BooleanQuery, and use the BooleanQuery API to
extract the TermQuery objects; the Term within each TermQuery has what
you're looking for.  That's actually not a bad way to go, but you could
also go more low-level and borrow the technique used under FuzzyQuery
itself:

        <http://svn.apache.org/repos/asf/lucene/java/trunk/src/java/org/apache/lucene/search/FuzzyTermEnum.java>

Erik
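
To make the low-level idea concrete without a Lucene dependency, here is a minimal sketch in plain Java of what "borrowing the technique" could look like: compute the Levenshtein distance between the query term and each candidate term, turn it into a similarity, and keep the candidates above a threshold. Everything here is an assumption for illustration: the class and method names are hypothetical, the vocabulary list merely stands in for the terms of an index, and the similarity formula (1 - distance / length of the shorter word) is only modelled on the general approach, not copied from FuzzyTermEnum.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; not part of Lucene.
public class FuzzyExpansionSketch {

    /** Levenshtein edit distance, computed with two rolling rows. */
    public static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Assumed similarity: 1 - distance / length of the shorter word. */
    public static float similarity(String a, String b) {
        int shorter = Math.min(a.length(), b.length());
        if (shorter == 0) {
            return a.length() == b.length() ? 1.0f : 0.0f;
        }
        return 1.0f - (float) editDistance(a, b) / shorter;
    }

    /** Keeps the vocabulary terms whose similarity exceeds the threshold. */
    public static List<String> similarTerms(String queryTerm,
                                            List<String> vocabulary,
                                            float minSimilarity) {
        List<String> result = new ArrayList<>();
        for (String term : vocabulary) {
            if (similarity(queryTerm, term) > minSimilarity) {
                result.add(term);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> vocabulary =
                Arrays.asList("lucene", "lucent", "lucane", "license", "solr");
        System.out.println(similarTerms("lucene", vocabulary, 0.7f));
    }
}
```

With the sample vocabulary and a threshold of 0.7, the sketch keeps "lucene", "lucent" and "lucane" and drops "license" (distance 2, similarity about 0.67) and "solr".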



Re: highlighting - fuzzy search

Daniel Noll-3
Erik Hatcher wrote:

>
> On Apr 4, 2006, at 11:23 AM, Fisheye wrote:
>> Probably, do you know if there is a possibility to get the similar words
>> generated by the algorithm when doing fuzzy search?
>
> Well, a roundabout way is to simply create a FuzzyQuery, rewrite it,
> cast it to a BooleanQuery and use the BooleanQuery API to extract the
> TermQuery objects and the Term within the TermQuery has what you're
> looking for.  That's actually not a bad way to go, but you could also go
> more low-level and borrow the technique used under FuzzyQuery itself:
>
>     <http://svn.apache.org/repos/asf/lucene/java/trunk/src/java/org/apache/lucene/search/FuzzyTermEnum.java>

We take an approach somewhere down the middle...

     IndexReader reader = ...;
     FuzzyQuery q = ...;

     // "enum" is a reserved word from Java 5 on, hence the variable name
     FilteredTermEnum termEnum = q.getEnum(reader);

The advantage of this method is that it's easier to generalise (works
for any subclass of MultiTermQuery, not just FuzzyQuery), while not
needing any rewriting (which may eat more memory, although I can't say
for sure.)

In fact our own code takes any query and looks at the type of it to
extract terms from it, potentially recursively if it encounters a
BooleanQuery.  It would be Really Nice [TM] if Lucene had a method on
the Query class to do this directly. :-)

Daniel
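
The recursive extraction described above can be sketched roughly as follows. This is a toy model in plain Java, not Lucene code: QueryNode, TermNode and BooleanNode are hypothetical stand-ins for Query, TermQuery and BooleanQuery, and the sketch assumes any multi-term query has already been expanded into concrete terms.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration; none of these types exist in Lucene.
public class TermCollectorSketch {

    /** Stand-in for a query of any type. */
    public interface QueryNode {}

    /** Leaf node: a single concrete term. */
    public static final class TermNode implements QueryNode {
        final String text;

        public TermNode(String text) {
            this.text = text;
        }
    }

    /** Composite node: a boolean combination of sub-queries. */
    public static final class BooleanNode implements QueryNode {
        final List<QueryNode> clauses;

        public BooleanNode(QueryNode... clauses) {
            this.clauses = Arrays.asList(clauses);
        }
    }

    /** Inspects the node type and recurses into boolean clauses. */
    public static void collectTerms(QueryNode node, Set<String> out) {
        if (node instanceof TermNode) {
            out.add(((TermNode) node).text);
        } else if (node instanceof BooleanNode) {
            for (QueryNode clause : ((BooleanNode) node).clauses) {
                collectTerms(clause, out);
            }
        } else {
            throw new IllegalArgumentException(
                    "Unsupported query type: " + node.getClass().getName());
        }
    }

    public static void main(String[] args) {
        QueryNode query = new BooleanNode(
                new TermNode("lucene"),
                new BooleanNode(new TermNode("lucent"), new TermNode("lucene")));
        Set<String> terms = new LinkedHashSet<>();
        collectTerms(query, terms);
        System.out.println(terms);
    }
}
```

Collecting into a Set means a term that occurs in several clauses is reported only once.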


--
Daniel Noll

Nuix Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, Australia    Ph: +61 2 9280 0699
Web: http://www.nuix.com.au/                        Fax: +61 2 9212 6902

This message is intended only for the named recipient. If you are not
the intended recipient you are notified that disclosing, copying,
distributing or taking any action in reliance on the contents of this
message or attachment is strictly prohibited.


Re: highlighting - fuzzy search

Chris Hostetter-3


: > Well, a roundabout way is to simply create a FuzzyQuery, rewrite it,
: > cast it to a BooleanQuery and use the BooleanQuery API to extract the
: > TermQuery objects and the Term within the TermQuery has what you're
        ...

: We take an approach somewhere down the middle...
        ...
:      FuzzyQuery q = ...
        ...
:      FilteredTermEnum enum = q.getEnum(reader);
        ...
: In fact our own code takes any query and looks at the type of it to
: extract terms from it, potentially recursively if it encounters a
: BooleanQuery.  It would be Really Nice [TM] if Lucene had a method on
: the Query class to do this directly. :-)

Isn't that what Query.extractTerms is for?  Isn't it implemented by all
primitive Queries, so that you should be able to say...

        HashSet terms = new HashSet();
        query.rewrite(reader).extractTerms(terms);

...and have yourself a list of all the terms?



-Hoss



Re: highlighting - fuzzy search

mark harwood
>Isn't that what Query.extractTerms is for?  Isn't it
>implemented by all primitive Queries?..

As of last week, yes. I changed the SpanQueries to
implement this method and then refactored the
Highlighter package's QueryTermExtractor to make use
of this (it radically simplified the code in there).
This change to rely on extractTerms also means that
the highlighter now works properly with classes like
FilteredQuery.


Cheers,
Mark


               

Re: highlighting - fuzzy search

Daniel Noll-3
mark harwood wrote:

>> Isn't that what Query.extractTerms is for?  Isn't it
>> implemented by all primitive Queries?..
>
> As of last week, yes. I changed the SpanQueries to
> implement this method and then refactored the
> Highlighter package's QueryTermExtractor to make use
> of this (it radically simplified the code in there).
> This change to rely on extractTerms also means that
> the highlighter now works properly with classes like
> FilteredQuery.

Very nice.  Yet another point I can add onto the huge list of reasons
our app should update Lucene. :-)

I'd rather not rewrite the query first, though: it feels like it would
use more memory than an extractTerms(IndexReader) method would.  Maybe
I'm wrong on this.

Daniel



Re: highlighting - fuzzy search

Fisheye
        HashSet terms = new HashSet();
        query.rewrite(reader).extractTerms(terms);

OK, but this delivers every term, not just the list of words the Levenshtein algorithm found to be similar. Judging by the posts in this thread, you guys seem to be experienced programmers, so could you post a code snippet as an example?
I'm asking because I'm probably not as experienced with Lucene as you are, and an example would help.

Cheers

Simon Dietschi

Re: highlighting - fuzzy search

Daniel Noll-3
Fisheye wrote:
>         HashSet terms = new HashSet();
>         query.rewrite(reader).extractTerms(terms);
>
> Ok, but this delivers every term, not just a list of words the Levenshtein
> algorithm produced with similarity.

I asked a similar thing in the past about term highlighting in general,
and apparently it's just as fast to get all terms and then highlight
those terms in your text, as it is to determine which terms are in the
text without looking at the text, and then highlight those.

The terms which aren't in the text just won't result in any highlights.

Daniel
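
As a rough illustration of that point, the sketch below highlights every candidate term it is given and simply leaves no mark for candidates that do not occur in the text. It is plain Java with no Lucene dependency; the class name, the word-splitting regex and the `<B>` tags are assumptions chosen for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; not part of Lucene.
public class HighlightAllTermsSketch {

    private static final Pattern WORD = Pattern.compile("\\w+");

    /**
     * Wraps every word of the text that appears in the (lower-cased)
     * term set in <B>...</B>; all other text is copied unchanged.
     */
    public static String highlight(String text, Set<String> terms) {
        Matcher m = WORD.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String word = m.group();
            boolean hit = terms.contains(word.toLowerCase(Locale.ROOT));
            String replacement = hit ? "<B>" + word + "</B>" : word;
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // "lucane" is a candidate term that never occurs in the text.
        Set<String> terms =
                new HashSet<>(Arrays.asList("lucene", "lucent", "lucane"));
        System.out.println(highlight("Lucent is not Lucene", terms));
    }
}
```

For the sample input the result is `<B>Lucent</B> is not <B>Lucene</B>`; the unused candidate "lucane" has no effect on the output.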



Re: highlighting - fuzzy search

Fisheye
Yes, that might be a way, but in my case it would not work:

The problem is that I have to return an excerpt (snippet) and the words to be highlighted as two separate strings. So now I use the Highlighter and getBestFragment to extract the excerpt, then I remove the inserted HTML tags and return the string to my application.
In normal query mode I return the matching query string entered by the user. My application then does the highlighting and displays everything I need.
So if I do a fuzzy search and want to go the same way as with a normal search, I need the similar words generated by the Levenshtein algorithm.

Another approach might be to extract the similar words that Lucene automatically highlights in the returned fragment before I remove the HTML tags...
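
That last approach could look something like the following sketch, which pulls the tagged words out of a highlighted fragment and then strips the tags, giving the excerpt and the words as separate values. It assumes the fragment uses `<B>` and `</B>` as highlight tags and that the tags are not nested; the class and method names are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; not part of Lucene.
public class HighlightedWordsSketch {

    // Assumes highlights are marked as <B>word</B> and never nested.
    private static final Pattern TAGGED = Pattern.compile("<B>(.*?)</B>");

    /** Returns the distinct highlighted words in order of appearance. */
    public static Set<String> highlightedWords(String fragment) {
        Set<String> words = new LinkedHashSet<>();
        Matcher m = TAGGED.matcher(fragment);
        while (m.find()) {
            words.add(m.group(1));
        }
        return words;
    }

    /** Returns the fragment with the highlight tags removed. */
    public static String stripTags(String fragment) {
        return TAGGED.matcher(fragment).replaceAll("$1");
    }

    public static void main(String[] args) {
        String fragment = "the <B>lucene</B> and <B>lucent</B> libraries";
        System.out.println(highlightedWords(fragment));
        System.out.println(stripTags(fragment));
    }
}
```

Extracting the words before stripping the tags matters here, because once the tags are gone the fragment no longer records which words matched.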