Missing tokens


paul.moran

Hi, I'm having a problem with certain search terms not being found when I
do a query. I'm using SolrJ to index a PDF document, adding its contents
to the 'contents' field. If I read the 'contents' field from the
SolrInputDocument doc object and count its tokens as below, I get 50k tokens.

// Count whitespace-separated tokens in the text before it is indexed.
StringTokenizer to = new StringTokenizer((String) doc.getFieldValue("contents"));
System.out.println("Tokens: " + to.countTokens());

However, once the doc is indexed and I use Luke to analyse the index, it
has only 3300 tokens in that field. Where did the other 47k go?
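
Before chasing lost tokens, it may be worth checking that the two numbers measure the same thing. The sketch below is only an illustration and rests on an assumption that is not confirmed anywhere in this thread: that the figure shown by the index viewer could be a count of distinct terms rather than of every token. The class and method names are made up for the example, and it uses plain JDK string handling, not Solr's analysis chain.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.StringTokenizer;

/** Compares a raw whitespace token count with the number of distinct lowercased tokens. */
final class TokenCountCheck {

    /** Total whitespace-separated tokens, repeats included (what StringTokenizer reports). */
    static int totalTokens(String text) {
        return new StringTokenizer(text).countTokens();
    }

    /** Distinct tokens after lowercasing; a rough stand-in for unique terms (assumption). */
    static int distinctTokens(String text) {
        Set<String> seen = new HashSet<>();
        StringTokenizer tokens = new StringTokenizer(text);
        while (tokens.hasMoreTokens()) {
            seen.add(tokens.nextToken().toLowerCase(Locale.ROOT));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        String sample = "To produce a file using a format suitable for a file";
        System.out.println("Total tokens:    " + totalTokens(sample));
        System.out.println("Distinct tokens: " + distinctTokens(sample));
    }
}
```

If the two counts differ widely on the real extracted text, the gap between 50k and 3300 may be partly a matter of what is being counted rather than of tokens going missing.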

I read in other threads that increasing maxFieldLength in solrconfig.xml
can help; my setting is below.

  <maxFieldLength>2147483647</maxFieldLength>

Any advice is appreciated,
Paul

Re: Missing tokens

Jan Høydahl / Cominvent
Hi,

Can you share with us how your schema looks for this field? What FieldType? What tokenizer and analyser?
How do you parse the PDF document? Before submitting to Solr? With what tool?
How do you do the query? Do you get the same results when doing the query from a browser, not SolrJ?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

On 18. aug. 2010, at 11.34, [hidden email] wrote:


Re: Missing tokens

paul.moran
Here's my field description. I mentioned 'contents' field in my original
post. I've changed it to a different field, 'summary'. It's using the
'text' fieldType as you can see below.

<field name="summary" type="text" indexed="true" stored="true"/>

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt"
            ignoreCase="true" expand="false"/>
    -->
    <!-- Case insensitive stop word removal.
         enablePositionIncrements=true ensures that a 'gap' is left to
         allow for accurate phrase queries.
    -->
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="0" catenateNumbers="0"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

I parsed the PDF using PDFBox. I can see my alphanumeric search term 'OB10'
in the extracted text before I add it to the index. I can also go into Luke
and see 'OB10' in the contents of the 'summary' field, even though Luke
can't find it when I do a search.

I can also use the browser to do a search in http://localhost/solr/admin
and again that search term doesn't return any results. I thought it might be
an alphanumeric word-splitting issue, but that doesn't seem to be the case,
since I can search on ME26 and it returns a doc; in fact, I can see
the 'OB10' search term in the summary field of the doc returned.

Here's a snippet of the summary field from that returned doc:

To produce a downloadable file using a format suitable
for OB10. 8-26 Profiles

I'm thinking that the text extracted by PDFBox may contain hidden characters
that Solr can't parse. However, before I go down that road, I just want to be
sure I'm not making schoolboy errors with my Solr setup.
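
The hidden-character theory can be checked directly on the extracted string before it goes anywhere near Solr. Below is a minimal sketch using only the JDK; the class and method names are hypothetical, and the choice of what counts as "suspicious" (control characters, Unicode format characters, and space characters other than a plain space) is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Scans extracted text for characters that are easy to miss when reading it by eye. */
final class HiddenCharScan {

    /** Returns one entry per suspicious code point, e.g. "U+200B at index 8". */
    static List<String> findSuspicious(String text) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            boolean ordinaryWhitespace = cp == ' ' || cp == '\n' || cp == '\r' || cp == '\t';
            // Control characters other than the usual line breaks and tabs.
            boolean control = Character.isISOControl(cp) && !ordinaryWhitespace;
            // Invisible formatting characters such as zero-width space or soft hyphen.
            boolean format = Character.getType(cp) == Character.FORMAT;
            // Unicode space characters that are not a plain space, such as no-break space.
            boolean oddSpace = Character.isSpaceChar(cp) && cp != ' ';
            if (control || format || oddSpace) {
                hits.add(String.format("U+%04X at index %d", cp, i));
            }
            i += Character.charCount(cp);
        }
        return hits;
    }

    public static void main(String[] args) {
        String clean = "for OB10. 8-26 Profiles";
        String dirty = "for OB10\u200B. 8-26 Profiles";
        System.out.println("clean: " + findSuspicious(clean));
        System.out.println("dirty: " + findSuspicious(dirty));
    }
}
```

An empty result for the real text would suggest that the problem lies in how the field is analysed rather than in the extraction.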

thanks
Paul



From: Jan Høydahl / Cominvent <[hidden email]>
To: [hidden email]
Date: 18/08/2010 11:56
Subject: Re: Missing tokens



Re: Missing tokens

Jan Høydahl / Cominvent
Cannot see anything obvious...

Try
http://localhost/solr/select?q=contents:OB10*
http://localhost/solr/select?q=contents:"OB 10"
http://localhost/solr/select?q=contents:"OB10."
http://localhost/solr/select?q=contents:ob10
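
When these test queries are sent from code rather than typed into a browser address bar, the colon, quotes and space need URL-encoding. A minimal sketch using only the JDK follows; the class and method names are hypothetical, the base URL is the one used in this thread, and URLEncoder.encode(String, Charset) assumes Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Builds URL-encoded select URLs for the four test queries. */
final class SelectUrls {

    // Base URL as written in the thread; adjust host, port and path for a real install.
    static final String BASE = "http://localhost/solr/select?q=";

    /** Form-encodes the query so that quotes, colons and spaces survive the HTTP request. */
    static String selectUrl(String query) {
        return BASE + URLEncoder.encode(query, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String[] queries = {
            "contents:OB10*",
            "contents:\"OB 10\"",
            "contents:\"OB10.\"",
            "contents:ob10"
        };
        for (String q : queries) {
            System.out.println(selectUrl(q));
        }
    }
}
```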

Also, go to the Analysis page in admin, type in your field name, enable verbose output, paste the problematic sentence into the "Index" part, then enter OB10 in the "Query" part and see how your doc and query get processed.

PS: Why don't you try this instead of doing the PDF extraction yourself: http://wiki.apache.org/solr/ExtractingRequestHandler ?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

On 18. aug. 2010, at 16.25, [hidden email] wrote:


Re: Missing tokens

paul.moran
Great! Now I'm getting somewhere, this worked! The others didn't.

http://localhost/solr/select?q=contents:"OB10."

Hope this makes sense to you. I'm still somewhat confused by the output
here. I had 'highlight matches' checked, and from what I can tell, 'OB10'
wasn't found. When I enter 'OB10.' into the query, column 11 'ob10.' became
highlighted in the 'LowerCaseFilterFactory' table.

Am I using the wrong analyser, or supplying the wrong parameters to an
analyser?

Thanks for your help so far!
Paul

Index Analyzer
org.apache.solr.analysis.WhitespaceTokenizerFactory {}
|--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|
|term position |1   |2      |3    |4           |5    |6    |7    |8     |9       |10   |11    |12   |13      |
|--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|
|  term text   |To  |produce|a    |downloadable|file |using|a    |format|suitable|for  |OB10. |8-26 |Profiles|
|--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|
|  term type   |word|word   |word |word        |word |word |word |word  |word    |word |word  |word |word    |
|--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|
|    source    |0,2 |3,10   |11,12|13,25       |26,30|31,36|37,38|39,45 |46,54   |55,58|59,64 |65,69|70,78   |
|  start,end   |    |       |     |            |     |     |     |      |        |     |      |     |        |
|--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|
|   payload    |    |       |     |            |     |     |     |      |        |     |      |     |        |
|--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|


org.apache.solr.analysis.StandardFilterFactory {}
|--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|
|term position |1   |2      |3    |4           |5    |6    |7    |8     |9       |10   |11    |12   |13      |
|--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|
|  term text   |To  |produce|a    |downloadable|file |using|a    |format|suitable|for  |OB10. |8-26 |Profiles|
|--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|
|  term type   |word|word   |word |word        |word |word |word |word  |word    |word |word  |word |word    |
|--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|
|    source    |0,2 |3,10   |11,12|13,25       |26,30|31,36|37,38|39,45 |46,54   |55,58|59,64 |65,69|70,78   |
|  start,end   |    |       |     |            |     |     |     |      |        |     |      |     |        |
|--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|
|   payload    |    |       |     |            |     |     |     |      |        |     |      |     |        |
|--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|


org.apache.solr.analysis.LowerCaseFilterFactory {}
|--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+-----+-----+--------|
|term position |1   |2      |3    |4           |5    |6    |7    |8     |9       |10   |11   |12   |13      |
|--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+-----+-----+--------|
|  term text   |to  |produce|a    |downloadable|file |using|a    |format|suitable|for  |ob10.|8-26 |profiles|
|--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+-----+-----+--------|
|  term type   |word|word   |word |word        |word |word |word |word  |word    |word |word |word |word    |
|--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+-----+-----+--------|
|    source    |0,2 |3,10   |11,12|13,25       |26,30|31,36|37,38|39,45 |46,54   |55,58|59,64|65,69|70,78   |
|  start,end   |    |       |     |            |     |     |     |      |        |     |     |     |        |
|--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+-----+-----+--------|
|   payload    |    |       |     |            |     |     |     |      |        |     |     |     |        |
|--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+-----+-----+--------|



Query Analyzer
org.apache.solr.analysis.WhitespaceTokenizerFactory {}
|--------------+-----------|
|term position |1          |
|--------------+-----------|
|  term text   |OB10       |
|--------------+-----------|
|  term type   |word       |
|--------------+-----------|
|    source    |0,4        |
|  start,end   |           |
|--------------+-----------|
|   payload    |           |
|--------------+-----------|


org.apache.solr.analysis.StandardFilterFactory {}
|--------------+----------|
|term position |1         |
|--------------+----------|
|  term text   |OB10      |
|--------------+----------|
|  term type   |word      |
|--------------+----------|
|    source    |0,4       |
|  start,end   |          |
|--------------+----------|
|   payload    |          |
|--------------+----------|


org.apache.solr.analysis.LowerCaseFilterFactory {}
|--------------+-------------|
|term position |1            |
|--------------+-------------|
|  term text   |ob10         |
|--------------+-------------|
|  term type   |word         |
|--------------+-------------|
|    source    |0,4          |
|  start,end   |             |
|--------------+-------------|
|   payload    |             |
|--------------+-------------|


I did look at ExtractingRequestHandler a while ago, but I don't think it
supported password-protected files then. I just looked at it again, and it
looks like it does now.





From: Jan Høydahl / Cominvent <[hidden email]>
To: [hidden email]
Date: 18/08/2010 23:16
Subject: Re: Missing tokens




Re: Missing tokens

Jan Høydahl / Cominvent
Hi,

Your bug is right there in the WhitespaceTokenizer output: it splits only on whitespace, so the trailing "." is NOT stripped and the term is indexed as "ob10.", which the query term "ob10" does not match.
Try StandardTokenizerFactory instead, as it removes punctuation.
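
The mismatch can be reproduced without Solr. The sketch below does not use Lucene or Solr classes; it only mimics the idea with plain JDK string handling, and the punctuation-trimming method is a crude stand-in chosen for illustration, not a reimplementation of StandardTokenizerFactory. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Shows why a term indexed as "ob10." is not found by a query for "ob10". */
final class TokenizerComparison {

    /** Splits on whitespace only and lowercases, so punctuation stays attached to the token. */
    static List<String> whitespaceTokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                out.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return out;
    }

    /** Same as above, then strips leading and trailing characters that are not letters or digits. */
    static List<String> punctuationTrimmedTokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : whitespaceTokens(text)) {
            String trimmed = t.replaceAll("^[^\\p{L}\\p{N}]+|[^\\p{L}\\p{N}]+$", "");
            if (!trimmed.isEmpty()) {
                out.add(trimmed);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String sentence = "To produce a downloadable file using a format suitable for OB10. 8-26 Profiles";
        String queryTerm = "ob10";
        System.out.println("whitespace only:     " + whitespaceTokens(sentence).contains(queryTerm));
        System.out.println("punctuation trimmed: " + punctuationTrimmedTokens(sentence).contains(queryTerm));
    }
}
```

Whichever tokenizer is chosen, the documents need to be reindexed afterwards, because the terms already stored in the index were produced by the old analysis chain.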

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

On 19. aug. 2010, at 12.16, [hidden email] wrote:

> Great! Now I'm getting somewhere, this worked! The others didn't.
>
> http://localhost/solr/select?q=contents:"OB10."
>
> Hope this makes sense to you. I'm still somewhat confused with the output
> here. I had 'highlight matches' check, and from what I can tell, 'OB10'
> wasn't found. When I enter 'OB10.' into the query, column 11 'ob10.' became
> highlighted in the 'LowerCaseFilterFactory' table.
>
> Am I using the wrong analyser, or supplying the wrong parameters to an
> analyser?
>
> Thanks for your help so far!
> Paul
>
> Index Analyzer
> org.apache.solr.analysis.WhitespaceTokenizerFactory {}
> |--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|
> |term position |1   |2      |3    |4           |5    |6    |7    |8     |9       |10   |11    |12   |13      |
> |--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|
> |  term text   |To  |produce|a    |downloadable|file |using|a    |format|suitable|for  |OB10. |8-26 |Profiles|
> |--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|
> |  term type   |word|word   |word |word        |word |word |word |word  |word    |word |word  |word |word    |
> |--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|
> |    source    |0,2 |3,10   |11,12|13,25       |26,30|31,36|37,38|39,45 |46,54   |55,58|59,64 |65,69|70,78   |
> |  start,end   |    |       |     |            |     |     |     |      |        |     |      |     |        |
> |--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|
> |   payload    |    |       |     |            |     |     |     |      |        |     |      |     |        |
> |--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|
>
>
> org.apache.solr.analysis.StandardFilterFactory {}
> |--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|
> |term position |1   |2      |3    |4           |5    |6    |7    |8     |9       |10   |11    |12   |13      |
> |--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|
> |  term text   |To  |produce|a    |downloadable|file |using|a    |format|suitable|for  |OB10. |8-26 |Profiles|
> |--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|
> |  term type   |word|word   |word |word        |word |word |word |word  |word    |word |word  |word |word    |
> |--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|
> |    source    |0,2 |3,10   |11,12|13,25       |26,30|31,36|37,38|39,45 |46,54   |55,58|59,64 |65,69|70,78   |
> |  start,end   |    |       |     |            |     |     |     |      |        |     |      |     |        |
> |--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|
> |   payload    |    |       |     |            |     |     |     |      |        |     |      |     |        |
> |--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|
>
>
> org.apache.solr.analysis.LowerCaseFilterFactory {}
> |--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+-----+-----+--------|
> |term position |1   |2      |3    |4           |5    |6    |7    |8     |9       |10   |11   |12   |13      |
> |--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+-----+-----+--------|
> |  term text   |to  |produce|a    |downloadable|file |using|a    |format|suitable|for  |ob10.|8-26 |profiles|
> |--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+-----+-----+--------|
> |  term type   |word|word   |word |word        |word |word |word |word  |word    |word |word |word |word    |
> |--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+-----+-----+--------|
> |    source    |0,2 |3,10   |11,12|13,25       |26,30|31,36|37,38|39,45 |46,54   |55,58|59,64|65,69|70,78   |
> |  start,end   |    |       |     |            |     |     |     |      |        |     |     |     |        |
> |--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+-----+-----+--------|
> |   payload    |    |       |     |            |     |     |     |      |        |     |     |     |        |
> |--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+-----+-----+--------|
>
>
>
> Query Analyzer
> org.apache.solr.analysis.WhitespaceTokenizerFactory {}
> |--------------+-----------|
> |term position |1          |
> |--------------+-----------|
> |  term text   |OB10       |
> |--------------+-----------|
> |  term type   |word       |
> |--------------+-----------|
> |    source    |0,4        |
> |  start,end   |           |
> |--------------+-----------|
> |   payload    |           |
> |--------------+-----------|
>
>
> org.apache.solr.analysis.StandardFilterFactory {}
> |--------------+----------|
> |term position |1         |
> |--------------+----------|
> |  term text   |OB10      |
> |--------------+----------|
> |  term type   |word      |
> |--------------+----------|
> |    source    |0,4       |
> |  start,end   |          |
> |--------------+----------|
> |   payload    |          |
> |--------------+----------|
>
>
> org.apache.solr.analysis.LowerCaseFilterFactory {}
> |--------------+-------------|
> |term position |1            |
> |--------------+-------------|
> |  term text   |ob10         |
> |--------------+-------------|
> |  term type   |word         |
> |--------------+-------------|
> |    source    |0,4          |
> |  start,end   |             |
> |--------------+-------------|
> |   payload    |             |
> |--------------+-------------|
>
>
> I did look at ExtractingRequestHandler a while ago, but I don't think it
> supported password protected files. Just looked at it again, and it looks
> like it does now.
>
>
>
>
>
> From: Jan Høydahl / Cominvent <[hidden email]>
> To: [hidden email]
> Date: 18/08/2010 23:16
> Subject: Re: Missing tokens
>
>
>
> Cannot see anything obvious...
>
> Try
> http://localhost/solr/select?q=contents:OB10*
> http://localhost/solr/select?q=contents:"OB 10"
> http://localhost/solr/select?q=contents:"OB10."
> http://localhost/solr/select?q=contents:ob10
>
> Also, go to the Analysis page in admin, typie in your field name, enable
> verbose output and copy paste the problematic sentence in the "Index" part
> and then enter a OB10 in the "Query" part, and see how your doc and query
> gets processed.
>
> PS: Why don't you try this instead of doing the PDF extraction yourselv:
> http://wiki.apache.org/solr/ExtractingRequestHandler ??
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Training in Europe - www.solrtraining.com
>
> On 18. aug. 2010, at 16.25, [hidden email] wrote:
>
>> Here's my field description. I mentioned 'contents' field in my original
>> post. I've changed it to a different field, 'summary'. It's using the
>> 'text' fieldType as you can see below.
>>
>>  <field name="summary" type="text" indexed="true" stored="true"/>
>>
>>
>> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>>     <analyzer type="index">
>>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>       <!-- in this example, we will only use synonyms at query time
>>       <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt"
>>               ignoreCase="true" expand="false"/>
>>       -->
>>       <!-- Case insensitive stop word removal.
>>         enablePositionIncrements=true ensures that a 'gap' is left to
>>         allow for accurate phrase queries.
>>       -->
>>       <filter class="solr.StopFilterFactory"
>>               ignoreCase="true"
>>               words="stopwords.txt"
>>               enablePositionIncrements="true"
>>               />
>>       <filter class="solr.WordDelimiterFilterFactory" generateWordParts=
>> "1" generateNumberParts="1" catenateWords="1" catenateNumbers="1"
>> catenateAll="0" splitOnCaseChange="1"/>
>>       <filter class="solr.LowerCaseFilterFactory"/>
>>       <filter class="solr.EnglishPorterFilterFactory" protected=
>> "protwords.txt"/>
>>       <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>     </analyzer>
>>     <analyzer type="query">
>>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>       <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>> ignoreCase="true" expand="true"/>
>>       <filter class="solr.StopFilterFactory" ignoreCase="true" words=
>> "stopwords.txt"/>
>>       <filter class="solr.WordDelimiterFilterFactory" generateWordParts=
>> "1" generateNumberParts="1" catenateWords="0" catenateNumbers="0"
>> catenateAll="0" splitOnCaseChange="1"/>
>>       <filter class="solr.LowerCaseFilterFactory"/>
>>       <filter class="solr.EnglishPorterFilterFactory" protected=
>> "protwords.txt"/>
>>       <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>     </analyzer>
>>   </fieldType>
>>
>> I parsed the pdf using pdfbox. I can see my alphanumeric search term 'OB10'
>> in the extracted text before I add it to the index. I can also go into Luke
>> and see the 'OB10' in the contents of the 'summary' field even though Luke
>> can't find it when I do a search.
>>
>> I can also use the browser to do a search in http://localhost/solr/admin
>> and again that search term doesn't return any results. I thought it may be
>> an alphanumeric word splitting issue, but that doesn't seem to be the case
>> since I can search on ME26, and it returns a doc, and in fact, I can see
>> the 'OB10' search term in the summary field of the doc returned.
>>
>> Here's a snippet of the summary field from that returned doc
>>
>> To produce a downloadable file using a format suitable
>> for OB10. 8-26 Profiles
>>
>> I'm thinking that the extracted text from pdfbox may have hidden chars that
>> solr can't parse. However, before I go down that road, I just want to be
>> sure I'm not making schoolboy errors with my solr setup.
>>
>> thanks
>> Paul
>>
>>
>>
>> From: Jan Høydahl / Cominvent <[hidden email]>
>> To: [hidden email]
>> Date: 18/08/2010 11:56
>> Subject: Re: Missing tokens
>>
>>
>>
>> Hi,
>>
>> Can you share with us how your schema looks for this field? What FieldType?
>> What tokenizer and analyser?
>> How do you parse the PDF document? Before submitting to Solr? With what
>> tool?
>> How do you do the query? Do you get the same results when doing the query
>> from a browser, not SolrJ?
>>
>> --
>> Jan Høydahl, search solution architect
>> Cominvent AS - www.cominvent.com
>> Training in Europe - www.solrtraining.com
>>
>> On 18. aug. 2010, at 11.34, [hidden email] wrote:
>>
>>>
>>> Hi, I'm having a problem with certain search terms not being found when I
>>> do a query. I'm using Solrj to index a pdf document, and add the contents
>>> to the 'contents' field. If I query the 'contents' field on the
>>> SolrInputDocument doc object as below, I get 50k tokens.
>>>
>>> StringTokenizer to = new StringTokenizer((String)doc.getFieldValue(
>>> "contents"));
>>> System.out.println( "Tokens:"  + to.countTokens() );
>>>
>>> However, once the doc is indexed and I use Luke to analyse the index, it
>>> has only 3300 tokens in that field. Where did the other 47k go?
>>>
>>> I read some other threads mentioning to increase the maxfieldLength in
>>> solrconfig.xml, and my setting is below.
>>>
>>> <maxFieldLength>2147483647</maxFieldLength>
>>>
>>> Any advice is appreciated,
>>> Paul
>>>
>>
>>
>>
>
>
>


Re: Missing tokens

paul.moran
I did that and it worked.

Thanks very much for your expert assistance, Jan!

Paul



From: Jan Høydahl / Cominvent <[hidden email]>
To: [hidden email]
Date: 19/08/2010 16:15
Subject: Re: Missing tokens



Hi,

Your bug is right there in the WhitespaceTokenizer, where you see that it
does NOT strip away the "." as whitespace.
Try with StandardTokenizerFactory instead, as it removes punctuation.
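
As a rough sketch of that suggestion, and assuming the 'text' fieldType quoted further down in this thread is the one actually in use, the change would be to swap the tokenizer in both analyzers and leave the filters as they are. The filter list below is copied from the quoted schema; whether that chain matches the live configuration is an assumption.

```xml
<!-- Sketch only: the 'text' type from the schema quoted in this thread,
     with the tokenizer swapped as suggested above. -->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- was solr.WhitespaceTokenizerFactory, which kept "OB10." as a single token -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- same tokenizer at query time so index and query terms line up -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="0" catenateNumbers="0"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
```

Documents indexed before the change still carry the old tokens, so they would need to be reindexed before a query for OB10 can match.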

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

On 19. aug. 2010, at 12.16, [hidden email] wrote:

> Great! Now I'm getting somewhere, this worked! The others didn't.
>
> http://localhost/solr/select?q=contents:"OB10."
>
> Hope this makes sense to you. I'm still somewhat confused by the output
> here. I had 'highlight matches' checked, and from what I can tell, 'OB10'
> wasn't found. When I enter 'OB10.' into the query, column 11 'ob10.' became
> highlighted in the 'LowerCaseFilterFactory' table.
>
> Am I using the wrong analyser, or supplying the wrong parameters to an
> analyser?
>
> Thanks for your help so far!
> Paul
>
> Index Analyzer
> org.apache.solr.analysis.WhitespaceTokenizerFactory {}
> |--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|
> |term position |1   |2      |3    |4           |5    |6    |7    |8     |9       |10   |11    |12   |13      |
> |--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|
> |  term text   |To  |produce|a    |downloadable|file |using|a    |format|suitable|for  |OB10. |8-26 |Profiles|
> |--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|
> |  term type   |word|word   |word |word        |word |word |word |word  |word    |word |word  |word |word    |
> |--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|
> |    source    |0,2 |3,10   |11,12|13,25       |26,30|31,36|37,38|39,45 |46,54   |55,58|59,64 |65,69|70,78   |
> |  start,end   |    |       |     |            |     |     |     |      |        |     |      |     |        |
> |--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|
> |   payload    |    |       |     |            |     |     |     |      |        |     |      |     |        |
> |--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|
>
>
> org.apache.solr.analysis.StandardFilterFactory {}
> |--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|
> |term position |1   |2      |3    |4           |5    |6    |7    |8     |9       |10   |11    |12   |13      |
> |--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|
> |  term text   |To  |produce|a    |downloadable|file |using|a    |format|suitable|for  |OB10. |8-26 |Profiles|
> |--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|
> |  term type   |word|word   |word |word        |word |word |word |word  |word    |word |word  |word |word    |
> |--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|
> |    source    |0,2 |3,10   |11,12|13,25       |26,30|31,36|37,38|39,45 |46,54   |55,58|59,64 |65,69|70,78   |
> |  start,end   |    |       |     |            |     |     |     |      |        |     |      |     |        |
> |--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|
> |   payload    |    |       |     |            |     |     |     |      |        |     |      |     |        |
> |--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|
>
>
> org.apache.solr.analysis.LowerCaseFilterFactory {}
> |--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+-----+-----+--------|
> |term position |1   |2      |3    |4           |5    |6    |7    |8     |9       |10   |11   |12   |13      |
> |--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+-----+-----+--------|
> |  term text   |to  |produce|a    |downloadable|file |using|a    |format|suitable|for  |ob10.|8-26 |profiles|
> |--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+-----+-----+--------|
> |  term type   |word|word   |word |word        |word |word |word |word  |word    |word |word |word |word    |
> |--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+-----+-----+--------|
> |    source    |0,2 |3,10   |11,12|13,25       |26,30|31,36|37,38|39,45 |46,54   |55,58|59,64|65,69|70,78   |
> |  start,end   |    |       |     |            |     |     |     |      |        |     |     |     |        |
> |--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+-----+-----+--------|
> |   payload    |    |       |     |            |     |     |     |      |        |     |     |     |        |
> |--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+-----+-----+--------|
>
>
>
> Query Analyzer
> org.apache.solr.analysis.WhitespaceTokenizerFactory {}
> |--------------+-----------|
> |term position |1          |
> |--------------+-----------|
> |  term text   |OB10       |
> |--------------+-----------|
> |  term type   |word       |
> |--------------+-----------|
> |    source    |0,4        |
> |  start,end   |           |
> |--------------+-----------|
> |   payload    |           |
> |--------------+-----------|
>
>
> org.apache.solr.analysis.StandardFilterFactory {}
> |--------------+----------|
> |term position |1         |
> |--------------+----------|
> |  term text   |OB10      |
> |--------------+----------|
> |  term type   |word      |
> |--------------+----------|
> |    source    |0,4       |
> |  start,end   |          |
> |--------------+----------|
> |   payload    |          |
> |--------------+----------|
>
>
> org.apache.solr.analysis.LowerCaseFilterFactory {}
> |--------------+-------------|
> |term position |1            |
> |--------------+-------------|
> |  term text   |ob10         |
> |--------------+-------------|
> |  term type   |word         |
> |--------------+-------------|
> |    source    |0,4          |
> |  start,end   |             |
> |--------------+-------------|
> |   payload    |             |
> |--------------+-------------|
>
>
> I did look at ExtractingRequestHandler a while ago, but I don't think it
> supported password protected files. Just looked at it again, and it looks
> like it does now.