Q: Highlighter + Search symbols "*, ?, ~"


Stephan Spat
Hello!

I would like to highlight the search terms from the user query in my
result presentation. Therefore I use the highlighter package, following
the example published in "Lucene in Action". When I use Boolean
operators there is no problem, but with operators like "?", "*", ... the
words are no longer found.

Is it possible to highlight word fragments (without extending the
package) when I use the ?, *, ... operators? And if so, how?

Thanks a lot!

Stephan Spat

PS: The code I use:

public String cutAndHighlightText(String text,
        SearchParameterVO searchParameter) {

    QueryParser queryParser = new QueryParser(
            ConstantsRetrieval.FIELD_DOC_CONTENT, new SimpleAnalyzer());

    String formattedText = null;

    try {

        QueryScorer queryScorer = new QueryScorer(
                queryParser.parse(searchParameter.getUserQuery()));

        //logger.debug(queryParser.parse(searchParameter.getUserQuery()).toString());

        SimpleHTMLFormatter formatter = new SimpleHTMLFormatter(
                "<span class=\"highlight\">", "</span>");

        Highlighter highlighter = new Highlighter(formatter, queryScorer);
        Fragmenter fragmenter = new SimpleFragmenter(200);
        highlighter.setTextFragmenter(fragmenter);

        TokenStream tokenStream = new StandardAnalyzer().tokenStream(
                ConstantsRetrieval.FIELD_DOC_CONTENT, new StringReader(text));

        formattedText = highlighter.getBestFragments(tokenStream, text, 5, "...");

        FileWriter writer = new FileWriter(
                "D:/development/iremr/text/highlightedDoc.html");

        writer.write("<html>");
        writer.write("<style>\n" +
                ".highlight {\n" +
                " background: yellow;\n" +
                "}\n" +
                "</style>");
        writer.write("<body>");
        writer.write(formattedText);
        writer.write("</body></html>");
        writer.close();

    } catch (ParseException e) {
        logger.error("Not able to parse query\n" + e.getMessage());
        return null;
    } catch (IOException e) {
        logger.error("IO-Exception in highlighting" + e.getMessage());
        return null;
    }

    return formattedText;
}


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Q: Highlighter + Search symbols "*, ?, ~"

Stephan Spat
Stephan Spat wrote:
> I would like to highlight the search terms from the user query in
> my result presentation. Therefore I use the highlighter package,
> following the example published in "Lucene in Action". When I use
> Boolean operators there is no problem, but with operators like "?",
> "*", ... the words are no longer found.
>
> Is it possible to highlight word fragments (without extending
> the package) when I use the ?, *, ... operators? And if so, how?
I have already found a solution: the query has to be rewritten!

with kind regards

Stephan



RE: Q: Highlighter + Search symbols "*, ?, ~"

Storey, Jeff
Stephan,
 
Could you explain what you did for your solution? This is a problem I'm currently facing as well. For example, if the user searches for "head~", would you also be able to highlight "read" and "dead" if they are returned, or just "head" without the ~?
 
Thanks.
Jeff


Re: Q: Highlighter + Search symbols "*, ?, ~"

Stephan Spat
Hey Jeff!

Storey, Jeff wrote:
> Could you explain what you did for your solution? This is a problem I'm currently facing as well. For example, if the user searches for "head~", would you also be able to highlight "read" and "dead" if they are returned, or just "head" without the ~?
>
It is necessary to give the QueryScorer a "native" query (terms combined
with Boolean operators only). Therefore I just took an IndexSearcher
object and used its public method rewrite(query).

Here is the code:

QueryParser queryParser = new QueryParser(
        ConstantsRetrieval.FIELD_DOC_CONTENT, new EMRAnalyzer());

String formattedText = null;

try {

    // for the usage of highlighting with wildcards
    Query query = indexSearcher.rewrite(
            queryParser.parse(searchParameter.getUserQuery()));
    QueryScorer queryScorer = new QueryScorer(query);

    //logger.debug("User Query: " + query.toString());

    SimpleHTMLFormatter formatter = new SimpleHTMLFormatter(
            "<span class=\"highlight\">", "</span>");

    Highlighter highlighter = new Highlighter(formatter, queryScorer);
    Fragmenter fragmenter = new SimpleFragmenter(100);
    highlighter.setTextFragmenter(fragmenter);

    TokenStream tokenStream = new EMRAnalyzer().tokenStream(
            ConstantsRetrieval.FIELD_DOC_CONTENT, new StringReader(text));

    formattedText = highlighter.getBestFragments(tokenStream, text, 5, "...");

    //logger.debug("Formatted Text: \n" + formattedText);

    FileWriter writer = new FileWriter(
            "D:/development/iremr/text/highlightedDoc.html");

    writer.write("<html>");
    writer.write("<style>\n" +
            ".highlight {\n" +
            " background: yellow;\n" +
            "}\n" +
            "</style>");
    writer.write("<body>");
    writer.write(formattedText);
    writer.write("</body></html>");
    writer.close();

} catch (ParseException e) {
    logger.error("Not able to parse query\n" + e.getMessage());
    return null;
} catch (IOException e) {
    logger.error("IO-Exception in highlighting" + e.getMessage());
    return null;
}

Stephan
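
To illustrate why the rewrite in the message above makes a difference, here is a minimal plain-Java sketch. It does not use Lucene at all: the class RewriteSketch and its methods expand and highlight are hypothetical stand-ins that only model the idea. A wildcard pattern never equals a token of the text, so nothing is highlighted until the pattern has been replaced by the concrete indexed terms it matches.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical model, not Lucene code: it only mimics the effect of a query rewrite.
class RewriteSketch {

    /** Translates a wildcard pattern ('*' = any run of characters, '?' = exactly one) into a regex. */
    static Pattern wildcardToRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    /** Models the rewrite step: the pattern is replaced by the indexed terms it matches. */
    static List<String> expand(String wildcard, List<String> indexedTerms) {
        Pattern pattern = wildcardToRegex(wildcard);
        List<String> expanded = new ArrayList<>();
        for (String term : indexedTerms) {
            if (pattern.matcher(term).matches()) {
                expanded.add(term);
            }
        }
        return expanded;
    }

    /** Models the highlighter: a token is wrapped only if it equals one of the query terms. */
    static String highlight(String text, List<String> queryTerms) {
        StringBuilder out = new StringBuilder();
        for (String token : text.split(" ")) {
            if (out.length() > 0) {
                out.append(' ');
            }
            if (queryTerms.contains(token.toLowerCase())) {
                out.append("<span class=\"highlight\">").append(token).append("</span>");
            } else {
                out.append(token);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> indexedTerms = Arrays.asList("head", "heart", "health", "read");
        String text = "heart and health data";

        // Without the rewrite, the literal string "hea*" never equals a token.
        System.out.println(highlight(text, Arrays.asList("hea*")));

        // After the expansion, the concrete terms are visible and can be matched.
        System.out.println(highlight(text, expand("hea*", indexedTerms)));
    }
}
```

The first line printed is the unchanged text; in the second, "heart" and "health" are wrapped in the span. In the real code the expansion is done by rewrite(query) against the index, not by a regex over a list.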



RE: Q: Highlighter + Search symbols "*, ?, ~"

Storey, Jeff
Thanks for the quick reply. I'll be implementing this in the next couple
of days. Appreciate it!

Jeff


Re: Q: Highlighter + Search symbols "*, ?, ~"

Daniel Noll-3-2
In reply to this post by Stephan Spat
Stephan Spat wrote:
> It is necessary to give the QueryScorer a "native" query (terms combined
> with Boolean operators only). Therefore I just took an IndexSearcher
> object and used its public method rewrite(query).

How efficient is this for huge wildcard queries?  e.g. "a*"

At the moment we highlight our terms by using, for instance,
WildcardQuery#getEnum(IndexReader) and storing only the strings which
it returns, whereas I would assume Query#rewrite(IndexReader) would take
up much more memory.

Range queries are another one... try querying for a range from a to zzzz
and see what happens. :-)

Daniel


--
Daniel Noll

Nuix Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, Australia    Ph: +61 2 9280 0699
Web: http://nuix.com/                               Fax: +61 2 9212 6902


Re: Q: Highlighter + Search symbols "*, ?, ~"

mark harwood
Daniel Noll wrote:
> How efficient is this for huge wildcard queries?  e.g. "a*"
> At the moment we highlight our terms by using for instance,
> WildcardQuery#getEnum(IndexReader), and only storing the strings which
> it returns, whereas I would assume Query#rewrite(IndexReader) would
> take up much more memory.
>
There is no discernible cost in practice. Query.rewrite always happens
internally anyway - it is a necessary part of searching. All we are
doing is pre-empting this step and calling rewrite before core Lucene
does, so that we have visibility of the terms actually used for matching.
Admittedly, the Lucene search code will then needlessly call rewrite on
our rewritten query when we search with it, but this is evaluated *very*
quickly and does not add any noticeable performance overhead. (Always
remember to pass the rewritten query to the search method, not the
original.) Your suggested approach of calling WildcardQuery.getEnum adds
an extra sweep of TermEnum on top of the one that already happens
internally as part of rewrite (see base class MultiTermQuery.rewrite), so
it is slower and uses no less memory.
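
The cost argument above can be sketched with a small plain-Java model. Nothing here is Lucene API: the class RewriteCostSketch, its nested Query class and the sweeps counter are assumptions made up for illustration. The model only counts how often the term dictionary is scanned when the query is rewritten first, compared with enumerating the terms separately and then searching with the original query.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model, not Lucene code: it counts scans of a toy term dictionary.
class RewriteCostSketch {

    /** Counts how often the whole term dictionary is scanned. */
    static int sweeps = 0;

    /** Scans the term dictionary once and returns the terms starting with the prefix. */
    static List<String> sweep(String prefix, List<String> termDictionary) {
        sweeps++;
        List<String> matches = new ArrayList<>();
        for (String term : termDictionary) {
            if (term.startsWith(prefix)) {
                matches.add(term);
            }
        }
        return matches;
    }

    /** A query is modelled as either an unexpanded prefix or a list of concrete terms. */
    static final class Query {
        final String prefix;       // non-null while the query is still unexpanded
        final List<String> terms;  // non-null once the query has been rewritten

        Query(String prefix, List<String> terms) {
            this.prefix = prefix;
            this.terms = terms;
        }

        /** Rewriting an expanded query is a no-op; only an unexpanded one costs a sweep. */
        Query rewrite(List<String> termDictionary) {
            if (terms != null) {
                return this;
            }
            return new Query(null, sweep(prefix, termDictionary));
        }
    }

    /** Models the search: it always rewrites whatever query it is given. */
    static List<String> search(Query query, List<String> termDictionary) {
        return query.rewrite(termDictionary).terms;
    }

    public static void main(String[] args) {
        List<String> dict = Arrays.asList("apple", "avocado", "banana");

        // Approach 1: rewrite once, then search (and highlight) with the rewritten query.
        sweeps = 0;
        Query rewritten = new Query("a", null).rewrite(dict);
        search(rewritten, dict);
        System.out.println("rewrite first: " + sweeps + " sweep(s)");

        // Approach 2: enumerate the terms for highlighting, then search with the original.
        sweeps = 0;
        List<String> highlightTerms = sweep("a", dict);
        search(new Query("a", null), dict);
        System.out.println("separate enumeration: " + sweeps + " sweep(s)");
    }
}
```

In this model the first approach scans the dictionary once and the second scans it twice, which is the point made about the extra TermEnum sweep. Real costs depend on the index and the Lucene version.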

> Range queries are another one... try querying for a range from a to
> zzzz and see what happens. :-)
Note: QueryParser does not use RangeQuery by default any more - it uses
a filter. This means it's faster and doesn't blow up with an explosion
in terms when ranges are large. With this new default setting we lost
the ability to highlight terms in the range, but I think that is
generally an unusual requirement and on balance the benefits of filters
over queries for ranges outweigh the costs.

While we're on the subject of large wildcard queries, filtering etc., I
also recently found it useful to subclass the QueryParser when
querying all pages for a domain, i.e. a query for docs with url fields
starting with http://www.ibm.com/*.
The old "too many BooleanClause" exception was a likely scenario again,
so I used the new PrefixFilter class as opposed to a WildcardQuery to
avoid the problem.

Cheers,
Mark
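
The trade-off between expanding a prefix into query clauses and using a filter can be sketched in plain Java. This is only a model and not Lucene code: the class PrefixFilterSketch, the tiny MAX_CLAUSES cap and the sample postings are all hypothetical. The query-style expansion keeps one clause per matching term and fails once the cap is hit, whereas the filter-style approach only sets one bit per matching document, so it cannot overflow but also no longer knows which terms matched.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model, not Lucene code: a sorted term -> document-id map plays the index.
class PrefixFilterSketch {

    /** Deliberately tiny, made-up clause cap so that the failure is easy to trigger. */
    static final int MAX_CLAUSES = 3;

    /** Sample postings: each url term maps to the ids of the documents containing it. */
    static TreeMap<String, int[]> samplePostings() {
        TreeMap<String, int[]> postings = new TreeMap<>();
        postings.put("http://www.example.org/", new int[] {4});
        postings.put("http://www.ibm.com/a", new int[] {0});
        postings.put("http://www.ibm.com/b", new int[] {1});
        postings.put("http://www.ibm.com/c", new int[] {2});
        postings.put("http://www.ibm.com/d", new int[] {0, 3});
        return postings;
    }

    /** Query-style expansion: one clause per matching term, rejected beyond the cap. */
    static List<String> expandToClauses(String prefix, TreeMap<String, int[]> postings) {
        List<String> clauses = new ArrayList<>();
        // Terms are sorted, so all terms with the prefix form one contiguous run.
        for (String term : postings.tailMap(prefix, true).keySet()) {
            if (!term.startsWith(prefix)) {
                break;
            }
            if (clauses.size() == MAX_CLAUSES) {
                throw new IllegalStateException("too many clauses");
            }
            clauses.add(term);
        }
        return clauses;
    }

    /** Filter-style: one bit per matching document; the matching terms are not kept. */
    static BitSet filterBits(String prefix, TreeMap<String, int[]> postings, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        for (Map.Entry<String, int[]> entry : postings.tailMap(prefix, true).entrySet()) {
            if (!entry.getKey().startsWith(prefix)) {
                break;
            }
            for (int doc : entry.getValue()) {
                bits.set(doc);
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        TreeMap<String, int[]> postings = samplePostings();
        String prefix = "http://www.ibm.com/";

        System.out.println("filter bits: " + filterBits(prefix, postings, 5));
        try {
            expandToClauses(prefix, postings);
        } catch (IllegalStateException e) {
            System.out.println("expansion failed: " + e.getMessage());
        }
    }
}
```

With the sample data the filter yields documents 0 to 3, while the expansion fails on the fourth matching term. The lost term list is also why, in this model, a filter gives the highlighter nothing to work with.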
