highlight exception

nick19701
I have thousands of docs in my Solr instance.
The following doc (and maybe others) causes an exception every time
highlighting is turned on.

<doc>
<str name="topicTitle">
Best buy - Acer Aspire AS5610-2273 - $599. Windows vista, 1 GB RAM
</str>
</doc>

The exception is like this:

java.lang.StringIndexOutOfBoundsException: String index out of range: -52
        at java.lang.String.substring(String.java:1768)
        at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:235)
        at org.apache.solr.util.HighlightingUtils.doHighlighting(HighlightingUtils.java:252)
        at org.apache.solr.request.StandardRequestHandler.handleRequest(StandardRequestHandler.java:161)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:587)
        at org.apache.solr.servlet.SolrServlet.doGet(SolrServlet.java:92)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:178)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:126)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:105)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:107)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:148)
        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:869)
        at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:664)
        at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:527)
        at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:80)
        at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
        at java.lang.Thread.run(Thread.java:595)

This exception only occurs when highlighting is on and the above doc is in the response.
For example, these three requests all cause the exception:

hl=on&hl.fl=topicTitle&hl.fragsize=0&hl.simple.pre=<em>&hl.simple.post=</em>&q=topicTitle:best+buy;replies desc&start=40&rows=10

hl=on&hl.fl=topicTitle&hl.fragsize=0&hl.simple.pre=<em>&hl.simple.post=</em>&q=topicTitle:acer;replies desc&start=0&rows=10

hl=on&hl.fl=topicTitle&hl.fragsize=0&hl.simple.pre=<em>&hl.simple.post=</em>&q=topicTitle:vista;replies desc&start=60&rows=10
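
To replay one of these requests outside a browser, the parameter values need URL-encoding (the <em> tags, the colon, and the space and ";" in the sort clause). Below is a minimal Java sketch of that. The base URL, class name and method names are assumptions for illustration only; the parameters themselves are copied from the "acer" request above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class HighlightRequest {
    // Hypothetical base URL; adjust to the actual Solr instance.
    static final String BASE = "http://localhost:8983/solr/select";

    /** URL-encodes a single key or value as UTF-8. */
    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    /** Joins the encoded parameters into a query string, keeping their order. */
    static String buildQuery(Map<String, String> params) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(encode(e.getKey())).append('=').append(encode(e.getValue()));
        }
        return sb.toString();
    }

    /** Parameters of the "acer" request from the report. */
    static Map<String, String> acerParams() {
        Map<String, String> p = new LinkedHashMap<String, String>();
        p.put("hl", "on");
        p.put("hl.fl", "topicTitle");
        p.put("hl.fragsize", "0");
        p.put("hl.simple.pre", "<em>");
        p.put("hl.simple.post", "</em>");
        p.put("q", "topicTitle:acer;replies desc");
        p.put("start", "0");
        p.put("rows", "10");
        return p;
    }

    public static void main(String[] args) {
        System.out.println(BASE + "?" + buildQuery(acerParams()));
    }
}
```

Using a LinkedHashMap keeps the parameters in the order they were written, which makes the printed URL easy to compare with the original request.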


Below is the field definition for topicTitle. What's so special about the above doc?


<fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!--in this example, we will only use synonyms at query time-->
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!--<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>-->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldtype>

<field name="topicTitle" type="text" indexed="true" stored="true"/>

Re: highlight exception

Yonik Seeley-2
Thanks for the report Nick,
could you open a JIRA bug for this?
Thanks,
-Yonik

On 2/15/07, nick19701 <[hidden email]> wrote:

> --
> View this message in context: http://www.nabble.com/highlight-exception-tf3234528.html#a8987980
> Sent from the Solr - User mailing list archive at Nabble.com.

Re: highlight exception

Mike Klaas
In reply to this post by nick19701
On 2/15/07, nick19701 <[hidden email]> wrote:

> <doc>
> <str name="topicTitle">
> Best buy - Acer Aspire AS5610-2273 - $599. Windows vista, 1 GB RAM
> </str>
> </doc>

Doesn't look particularly out of the ordinary.

> The exception is like this:
>
> java.lang.StringIndexOutOfBoundsException: String index out of range: -52
>         at java.lang.String.substring(String.java:1768)
>         at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:235)
>         at org.apache.solr.util.HighlightingUtils.doHighlighting(HighlightingUtils.java:252)

Corresponds to:

    startOffset = tokenGroup.matchStartOffset;
    endOffset = tokenGroup.matchEndOffset;
    tokenText = text.substring(startOffset, endOffset);

where the offsets are token offsets from analysis, and should not be
-52.  Are you using term vectors?  Is the field multi-valued?  Also,
what version of Solr are you using?

Could you c&p the output of verbose analysis of this text in the solr admin?
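
For reference, the substring call quoted above can only succeed when 0 <= startOffset <= endOffset <= text.length(). The sketch below spells out that condition in Java. It is not the Lucene code: the class and the helper names isValid and tokenText are hypothetical, and returning null for a bad span is just one possible choice.

```java
public class TokenSpan {
    /** True when [start, end) is a legal substring range for text. */
    static boolean isValid(String text, int start, int end) {
        return 0 <= start && start <= end && end <= text.length();
    }

    /** Returns the token text, or null when the offsets are inconsistent. */
    static String tokenText(String text, int start, int end) {
        return isValid(text, start, end) ? text.substring(start, end) : null;
    }

    public static void main(String[] args) {
        String title = "Best buy - Acer Aspire AS5610-2273 - $599. Windows vista, 1 GB RAM";
        System.out.println(tokenText(title, 11, 15)); // prints Acer
        System.out.println(tokenText(title, 15, 11)); // end before start: prints null
    }
}
```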

thanks,
-Mike

Re: highlight exception

nick19701
Mike Klaas wrote:
> Corresponds to:
>
>     startOffset = tokenGroup.matchStartOffset;
>     endOffset = tokenGroup.matchEndOffset;
>     tokenText = text.substring(startOffset, endOffset);
>
> where the offsets are token offsets from analysis, and should not be
> -52.  Are you using term vectors?  Is the field multi-valued?  Also,
> what version of Solr are you using?
>
> Could you c&p the output of verbose analysis of this text in the solr admin?
>
> thanks,
> -Mike

As far as I know, I'm not using term vectors and this field is single-valued.
Solr version is 1.1.0 dated on 12/17/2006.

Below is the verbose analysis:

Index Analyzer

org.apache.solr.analysis.WhitespaceTokenizerFactory {}

term position     1        2     3      4       5       6            7         8      9        10      11     12         13
term text         Best     buy   -      Acer    Aspire  AS5610-2273  -         $599.  Windows  vista,  1      GB         RAM
term type         word     word  word   word    word    word         word      word   word     word    word   word       word
source start,end  0,4      5,8   9,10   11,15   16,22   23,34        35,36     37,42  43,50    51,57   58,59  60,62      63,66

org.apache.solr.analysis.SynonymFilterFactory {expand=true, ignoreCase=true, synonyms=index_synonyms.txt}

term position     1        2     3      4       5       6            7         8      9        10      11     12         13
term text         bestbuy  buy   -      Acer    Aspire  AS5610-2273  -         $599.  Windows  vista,  1      GB         RAM
                  bb                                                                                          gib
                  best                                                                                        gigabyte
                                                                                                              gigabytes
term type         word     word  word   word    word    word         word      word   word     word    word   word       word
                  word                                                                                        word
                  word                                                                                        word
                                                                                                              word
source start,end  0,8      0,8   9,10   11,15   16,22   23,34        35,36     37,42  43,50    51,57   58,59  60,8       63,66
                  0,8                                                                                         60,8
                  0,8                                                                                         60,8
                                                                                                              60,8

org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt, ignoreCase=true}

term position     1        2     3      4       5       6            7         8      9        10      11     12         13
term text         bestbuy  buy   -      Acer    Aspire  AS5610-2273  -         $599.  Windows  vista,  1      GB         RAM
                  bb                                                                                          gib
                  best                                                                                        gigabyte
                                                                                                              gigabytes
term type         word     word  word   word    word    word         word      word   word     word    word   word       word
                  word                                                                                        word
                  word                                                                                        word
                                                                                                              word
source start,end  0,8      0,8   9,10   11,15   16,22   23,34        35,36     37,42  43,50    51,57   58,59  60,8       63,66
                  0,8                                                                                         60,8
                  0,8                                                                                         60,8
                                                                                                              60,8

org.apache.solr.analysis.WordDelimiterFilterFactory {catenateWords=1, catenateNumbers=1, catenateAll=0, generateNumberParts=1, generateWordParts=1}

term position     1        2     3      4       5       6            7         8      9        10      11     12         13
term text         bestbuy  buy   Acer   Aspire  AS      5610         2273      599    Windows  vista   1      GB         RAM
                  bb                                                 56102273                                 gib
                  best                                                                                        gigabyte
                                                                                                              gigabytes
term type         word     word  word   word    word    word         word      word   word     word    word   word       word
                  word                                               word                                     word
                  word                                                                                        word
                                                                                                              word
source start,end  0,8      0,8   11,15  16,22   23,25   25,29        30,34     38,41  43,50    51,56   58,59  60,8       63,66
                  0,8                                                25,34                                    60,8
                  0,8                                                                                         60,8
                                                                                                              60,8

org.apache.solr.analysis.LowerCaseFilterFactory {}

term position     1        2     3      4       5       6            7         8      9        10      11     12         13
term text         bestbuy  buy   acer   aspire  as      5610         2273      599    windows  vista   1      gb         ram
                  bb                                                 56102273                                 gib
                  best                                                                                        gigabyte
                                                                                                              gigabytes
term type         word     word  word   word    word    word         word      word   word     word    word   word       word
                  word                                               word                                     word
                  word                                                                                        word
                                                                                                              word
source start,end  0,8      0,8   11,15  16,22   23,25   25,29        30,34     38,41  43,50    51,56   58,59  60,8       63,66
                  0,8                                                25,34                                    60,8
                  0,8                                                                                         60,8
                                                                                                              60,8

org.apache.solr.analysis.EnglishPorterFilterFactory {protected=protwords.txt}

term position     1        2     3      4       5       6            7         8      9        10      11     12         13
term text         bestbuy  buy   acer   aspir   as      5610         2273      599    window   vista   1      gb         ram
                  bb                                                 56102273                                 gib
                  best                                                                                        gigabyt
                                                                                                              gigabyt
term type         word     word  word   word    word    word         word      word   word     word    word   word       word
                  word                                               word                                     word
                  word                                                                                        word
                                                                                                              word
source start,end  0,8      0,8   11,15  16,22   23,25   25,29        30,34     38,41  43,50    51,56   58,59  60,8       63,66
                  0,8                                                25,34                                    60,8
                  0,8                                                                                         60,8
                                                                                                              60,8

org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}

term position     1        2     3      4       5       6            7         8      9        10      11     12         13
term text         bestbuy  buy   acer   aspir   as      5610         2273      599    window   vista   1      gb         ram
                  bb                                                 56102273                                 gib
                  best                                                                                        gigabyt
term type         word     word  word   word    word    word         word      word   word     word    word   word       word
                  word                                               word                                     word
                  word                                                                                        word
source start,end  0,8      0,8   11,15  16,22   23,25   25,29        30,34     38,41  43,50    51,56   58,59  60,8       63,66
                  0,8                                                25,34                                    60,8
                  0,8                                                                                         60,8

Re: highlight exception

Mike Klaas
That 60,8 produced by the synonym filter is surely a sign of a bug: the
end offset (8) is smaller than the start offset (60), and 8 - 60 is what
produces the -52.  What is your list of synonyms?
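
To make the arithmetic concrete, here is a small standalone Java sketch (not Solr or Lucene code; the class and method names are made up for illustration). It uses the title from the report with the 60,62 offsets that the tokenizer gives for "GB" and the 60,8 pair that comes out of the synonym filter. The JDK in the stack trace appears to report endIndex - beginIndex in the exception message; the wording of that message differs between JDK versions, so the sketch only checks that the call fails.

```java
public class NegativeRange {
    static final String TITLE =
        "Best buy - Acer Aspire AS5610-2273 - $599. Windows vista, 1 GB RAM";

    /** The value end - start, which is negative when the offsets are reversed. */
    static int rangeLength(int start, int end) {
        return end - start;
    }

    /** True if text.substring(start, end) throws StringIndexOutOfBoundsException. */
    static boolean substringFails(String text, int start, int end) {
        try {
            text.substring(start, end);
            return false;
        } catch (StringIndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(TITLE.substring(60, 62));      // prints GB
        System.out.println(rangeLength(60, 8));           // prints -52
        System.out.println(substringFails(TITLE, 60, 8)); // prints true
    }
}
```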

-Mike

Re: highlight exception

nick19701
Mike Klaas wrote:
> That 60, 8 produced by the synonym filter is surely signs of a bug
> (and what is producing the -52).  What is your list of synonyms?
>
> -Mike

Here it is:

# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

#-----------------------------------------------------------------------
#some test synonym mappings unlikely to appear in real input text
aaa => aaaa
bbb => bbbb1 bbbb2
ccc => cccc1,cccc2
a\=>a => b\=>b
a\,a => b\,b
fooaaa,baraaa,bazaaa

# Some synonym groups specific to this example
GB,gib,gigabyte,gigabytes
MB,mib,megabyte,megabytes
Television, Televisions, TV, TVs

bestbuy,bb,best buy
circuitcity,cc,circuit city

#notice we use "gib" instead of "GiB" so any WordDelimiterFilter coming
#after us won't split it into two words.

# Synonym mappings can be used for spelling correction too
pixima => pixma
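
As a reading aid for the comma-separated groups in this file, here is a simplified Java model of what expand=true with ignoreCase=true appears to do, judging from the analysis output earlier in the thread: every term in a group is replaced by all terms of the group, and matching ignores case. This is only an illustration with made-up names (SynonymGroup, expand); it is not Solr's implementation, and it does not handle "=>" mappings, escapes, or token offsets.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class SynonymGroup {
    /**
     * Expands one comma-separated group: each term (lowercased for matching)
     * maps to the full list of terms exactly as written in the file.
     */
    static Map<String, List<String>> expand(String line) {
        List<String> terms = new ArrayList<String>();
        for (String term : line.split(",")) {
            terms.add(term.trim());
        }
        Map<String, List<String>> mapping = new LinkedHashMap<String, List<String>>();
        for (String term : terms) {
            mapping.put(term.toLowerCase(Locale.ROOT), terms);
        }
        return mapping;
    }

    public static void main(String[] args) {
        // "GB" in the title is looked up case-insensitively.
        System.out.println(expand("GB,gib,gigabyte,gigabytes").get("gb"));
        // "Best buy" in the title matches the two-word entry "best buy".
        System.out.println(expand("bestbuy,bb,best buy").get("best buy"));
    }
}
```

The first lookup returns the four terms that show up at position 12 in the analysis, and the second returns bestbuy, bb and "best buy", which matches the tokens at positions 1 and 2.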

Re: highlight exception

Mike Klaas
On 2/19/07, nick19701 <[hidden email]> wrote:
>
>
> Mike Klaas wrote:
> >
> > That 60, 8 produced by the synonym filter is surely signs of a bug
> > (and what is producing the -52).  What is your list of synonyms?
>
> Here is:
<>

nick,

It looks as though there is a bug in the synonym filter.  Since you
are using Solr's example synonym list, perhaps it would be sufficient
to remove that from your analyzer chain (schema.xml)?  At least that
would prevent crashes until the bug is fixed.

-Mike

Re: highlight exception

nick19701
Mike Klaas wrote:
> nick,
>
> It looks as though there is a bug in the synonym filter.  Since you
> are using Solr's example synonym list, perhaps it would be sufficient
> to remove that from your analyzer chain (schema.xml)?  At least that
> would prevent crashes until the bug is fixed.
>
> -Mike

Mike,
Thanks for your advice. I also suspect there is a bug in the synonym filter.

In this thread, I used synonyms at query time:
http://www.nabble.com/question-about-synonyms-t3222067.html

When "bb" was searched, the docs that contain "bb" were not returned.
There is no space in "bb", so it is very puzzling to me that this happened.

-Nick