Solr finding doc by one field but not by another

Theodan
Hi everyone.

Can anyone explain how this might happen?  I query by the "ID" field and get the following result:

=========================================================
<?xml version="1.0" encoding="UTF-8" ?>
<response>
        <lst name="responseHeader">
                <int name="status">0</int> 
                <int name="QTime">16</int> 
                <lst name="params">
                        <str name="q">ID:ee483237-399c-4b17-ad73-000cc54fd3e1</str> 
                </lst>
        </lst>
        <result name="response" numFound="1" start="0">
                <doc>
                        <str name="AllowedApplications">COSMEO US</str> 
                        <str name="Audiences" /> 
                        <str name="DefaultURL" /> 
                        <str name="FileType" /> 
                        <str name="HighGrade" /> 
                        <str name="ID">ee483237-399c-4b17-ad73-000cc54fd3e1</str> 
                        <str name="IsClosedCaptioned" /> 
                        <str name="Language">en-US</str> 
                        <str name="LargeIcon" /> 
                        <str name="LaunchIcon" /> 
                        <str name="LowGrade" /> 
                        <str name="MediaGroups" /> 
                        <str name="Producer" /> 
                        <str name="Provider" /> 
                        <str name="Publisher" /> 
                        <str name="SmallIcon" /> 
                        <str name="Taxonomy">Social Studies American History Historical Periods Expansion and Reform 1801-1861 Territorial Expansion</str> 
                        <str name="TitleEvent" /> 
                        <str name="TitleLength" /> 
                        <str name="TitleLocation" /> 
                        <str name="TitleParticipant" /> 
                        <str name="Type">EncyclopediaArticles</str> 
                        <str name="concepts" /> 
                        <str name="copyright">2005</str> 
                        <str name="description">Pony Express was a mail service operating between Saint Joseph, Mo., and Sacramento, Calif., inaugurated on April 3, 1860, under the direction of the Central Overland California and Pike's Peak Express Co.</str> 
                        <str name="editable">True</str> 
                        <str name="keywords" /> 
                        <str name="spanish" /> 
                        <str name="title">Pony Express</str> 
                        <str name="vocabulary">pony express</str> 
                </doc>
        </result>
</response>
=========================================================

Then I query by the "title" field from the result above (so I know the document is in the index and has been committed), and I get zero results:

=========================================================
<?xml version="1.0" encoding="UTF-8" ?>
<response>
        <lst name="responseHeader">
                <int name="status">0</int> 
                <int name="QTime">0</int> 
                <lst name="params">
                        <str name="q">title:"Pony Express"</str> 
                </lst>
        </lst>
        <result name="response" numFound="0" start="0" />
</response>
=========================================================

"ID" is not the only field that I can find the doc by.  Searching for "Type:encyclopediaarticles" finds it too.  Also, "title" is not the only field that misses the doc.  A search by "vocabulary" misses it too.  I haven't tried all the fields yet to see exhaustively which ones find it and which ones don't.  I can do that if it would help.

For what it's worth, I started with an existing Lucene index and modified Solr's schema.xml so that I could just use the Lucene index in Solr.  That Lucene index had about 230K docs.  I then used your "post.jar" to post another 10K docs to the index after starting up the server.  Those 10K docs only had 7 of the 30 fields that the original 230K docs had.  Could that be the problem?  I am noticing that the docs that I'm having problems with are from the original 230K-doc index, not from my subsequent 10K-doc post.  The 10K docs seem to be findable by any of their 7 fields.

Here are my config files:
schema.xml
solrconfig.xml

Any help is greatly appreciated.

Thanks,
-Dan

Re: Solr finding doc by one field but not by another

Mike Klaas
On 3/28/07, Theodan <[hidden email]> wrote:

> For what it's worth, I started with an existing Lucene index and modified
> Solr's schema.xml so that I could just use the Lucene index in Solr.  That
> Lucene index had about 230K docs.  I then used your "post.jar" to post
> another 10K docs to the index after starting up the server.  Those 10K docs
> only had 7 of the 30 fields that the original 230K docs had.  Could that be
> the problem?  I am noticing that the docs that I'm having problems with are
> from the original 230K-doc index, not from my subsequent 10K-doc post.  The
> 10K docs seem to be findable by any of their 7 fields.

This is almost certainly due to a mismatch between the index- and
query-time analysis of the fields.  For instance, your schema defines
the title field to be "string" (unanalyzed), but it is likely that
some tokenization (perhaps via StandardAnalyzer) occurred in the
original index.

-Mike

Re: Solr finding doc by one field but not by another

Theodan
Mike Klaas wrote:

> This is almost certainly due to a mismatch between the index- and
> query-time analysis of the fields.  For instance, your schema defines
> the title field to be "string" (unanalyzed), but it is likely that
> some tokenization (perhaps via StandardAnalyzer) occurred in the
> original index.

Yep, that was exactly the problem.  I changed all of my field types from "string" to "text", and things still didn't work right when querying.  So I asked the guy who created the Lucene index what analyzers he used, and he had used the StandardAnalyzer, whereas my Solr configuration was using the default advanced analyzer setup that Solr comes with in schema.xml.  So I changed my schema.xml to use just StandardAnalyzer, and the searches now seem to be returning expected results.

-Dan
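
For anyone landing here with the same symptom, a minimal sketch of what such a schema.xml change might look like is below. The field type name "text_standard" and the indexed/stored attributes are assumptions for illustration, not taken from the attached config files; only the "title" and "vocabulary" field names and the use of StandardAnalyzer come from this thread.

```xml
<!-- Sketch only: "text_standard" is a hypothetical name chosen for this example. -->
<!-- Older Solr releases spell the element "fieldtype"; adjust to match your schema.xml. -->
<fieldType name="text_standard" class="solr.TextField">
  <!-- One analyzer with no type attribute is used for both indexing and querying. -->
  <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
</fieldType>

<!-- The indexed/stored values are assumed; keep whatever matches the existing Lucene index. -->
<field name="title" type="text_standard" indexed="true" stored="true"/>
<field name="vocabulary" type="text_standard" indexed="true" stored="true"/>
```

With this in place, a query such as title:"Pony Express" should be tokenized and lowercased the same way the original index was built, instead of being looked up as one unanalyzed term as it would be with the "string" type.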