How does solr.StrField handle punctuation?

How does solr.StrField handle punctuation?

terhorst
I have a question about how punctuation and other special characters are handled in the Solr index when using the facets toolkit. I have an index of employees and facets based on their employer. Attempts to constrain the search based on facets work only as long as the company name doesn't contain an ampersand. So, for example, the following query works:

fl=*,score&start=0&q=division_t:"Accounting"&company_facet:"Pricewaterhousecooper"&qt=standard&wt=ruby&rows=30

(I'm using acts_as_solr/Rails, so _t fields are Solr.TextField, and _facet fields are Solr.StrField.)

However, the following query produces no hits, even though I know from the facets info that there are over 4000 matches in the index:

fl=*,score&start=0&q=division_t:"Accounting"&company_facet:"Deloitte+%26+Touche"&qt=standard&wt=ruby&rows=30

I was under the impression that Solr.StrField just indexes the literal string, so I'm confused why this won't work. What's the proper way to feed this query into Solr and get results? Thanks- Jonathan


Re: How does solr.StrField handle punctuation?

hossman

: However, the following query produces no hits, even though I know from the
: facets info that there are over 4000 matches in the index:
:
: fl=*,score&start=0&q=division_t:"Accounting"&company_facet:"Deloitte+%26+Touche"&qt=standard&wt=ruby&rows=30

That's not a "legal" URL ... note the "...&company_facet...".  You've
specified a URL param named: 'company_facet:"Deloitte+%26+Touche"' which
has no value.

I think you meant to use...

fl=*,score&start=0&q=division_t:"Accounting"&fq=company_facet:"Deloitte+%26+Touche"&qt=standard&wt=ruby&rows=30
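
To see why the first URL silently drops the constraint, here is a minimal Python sketch (not from the original thread) that splits both query strings the way a generic form decoder would. It uses only the standard library's `urllib.parse.parse_qsl`; Solr's own parameter parsing is assumed to behave similarly for this case, and the helper name `split_params` is made up for illustration.

```python
from urllib.parse import parse_qsl


def split_params(query_string):
    """Split a raw query string into (name, value) pairs, keeping empty values."""
    return parse_qsl(query_string, keep_blank_values=True)


# Query string from the original post: no "fq=" in front of the facet constraint.
broken = ('fl=*,score&start=0&q=division_t:"Accounting"'
          '&company_facet:"Deloitte+%26+Touche"&qt=standard&wt=ruby&rows=30')

# Corrected form: the constraint is the value of an fq parameter.
fixed = ('fl=*,score&start=0&q=division_t:"Accounting"'
         '&fq=company_facet:"Deloitte+%26+Touche"&qt=standard&wt=ruby&rows=30')

for label, qs in (("broken", broken), ("fixed", fixed)):
    print(label)
    for name, value in split_params(qs):
        print(f"  {name!r} = {value!r}")
```

In the broken string the whole `company_facet:"Deloitte & Touche"` text comes out as a parameter name with an empty value; in the fixed string it is the value of `fq`.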

: I was under the impression that Solr.StrField just indexes the literal
: string, so I'm confused why this won't work. What's the proper way to feed

For the record: that is in fact exactly what StrField does.


-Hoss


Re: How does solr.StrField handle punctuation?

terhorst
Thanks for the reply. I was in a hurry and made the URL up to illustrate my point. The real query string is more like what you suggest. In any case I'm certain that the actual query being used is valid (Solr would complain if it weren't) and that the ampersand is somehow affecting results. Is there any way I can get Solr to dump some information about how it stores indexes, keys, etc. for a certain record? I'm wondering if the ampersand was handled in a weird way by my application when the records were added to the index. (Although I doubt this since it shows up properly in the facets.) Thanks again for your help.

Jonathan


Re: How does solr.StrField handle punctuation?

hossman

: Thanks for the reply. I was in a hurry and made the URL up to illustrate my
: point. The real query string is more like what you suggest. In any case I'm
: certain that the actual query being used is valid (Solr would complain if it
: weren't) and that the ampersand is somehow affecting results. Is there any

no, actually it wouldn't complain in that case ... a URL param with a name
it's not expecting would just be ignored.

if you send us the exact URLs you're having problems with there may be
other nuances about it that we can spot to help figure out your problem.
(for example: are you absolutely sure the ampersand in your field value is
URL escaped?)
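
As an illustration of the escaping question (a sketch, not part of the original reply): if the ampersand inside the value is sent raw, a standard query-string parser treats it as a parameter separator and the filter is cut in half. The snippet below uses Python's `urllib.parse`; the variable names are illustrative only.

```python
from urllib.parse import parse_qsl, quote_plus

value = 'company_facet:"Deloitte & Touche"'

# Unescaped: the "&" inside the value ends the fq parameter early.
unescaped = "fq=" + value + "&rows=30"
print(parse_qsl(unescaped, keep_blank_values=True))

# Escaped: quote_plus turns "&" into %26 and spaces into "+".
escaped = "fq=" + quote_plus(value) + "&rows=30"
print(escaped)
print(parse_qsl(escaped, keep_blank_values=True))
```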

: way I can get Solr to dump some information about how it stores indexes,
: keys, etc. for a certain record? I'm wondering if the ampersand was handled
: in a weird way by my application when the records were added to the index.
: (Although I doubt this since it shows up properly in the facets.) Thanks
: again for your help.

yep, there are a couple of things you can do in general to
troubleshoot things like this...

1) debugQuery=true ... add that param into your URL and Solr will give you
some nice debugging info about how your queries are being parsed.  this is
important to post when asking followup questions.

2) analysis.jsp ... this is the "Analysis" link on the admin page, it will
show you how your analyzer is treating the fields you index ... but this
isn't really relevant to your specific problem since you are using
StrField.

3) LukeRequestHandler, in the example schema it's mapped to /admin/luke
... this will let you see the actual terms indexed for your fields ... but
as you said, this isn't going to be much help for you in this specific
case since you used facet.field to get the value in the first place -- that
means it's definitely indexed that way.

debugQuery=true is definitely your best first step ... send us the exact
URLs you're having problems with (that have debugQuery=true) along with the
full output of that URL and people can probably help spot your problem.
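
For step 1, a small sketch of adding the parameter programmatically, in Python with only the standard library. The function name `with_debug_query` and the sample URL are assumptions for illustration; `debugQuery=true` itself is the Solr parameter described above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_debug_query(url):
    """Return url with debugQuery=true set exactly once in its query string."""
    parts = urlsplit(url)
    params = [(name, value)
              for name, value in parse_qsl(parts.query, keep_blank_values=True)
              if name != "debugQuery"]
    params.append(("debugQuery", "true"))
    # urlencode re-escapes every value, so "&" inside a value stays %26.
    return urlunsplit(parts._replace(query=urlencode(params)))


url = ("/solr/select/?q=division_t:%22Accounting%22"
       "&fq=company_facet:%22Deloitte+%26+Touche%22&rows=30")
print(with_debug_query(url))
```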



-Hoss


Re: How does solr.StrField handle punctuation?

terhorst
Here are the exact query strings I'm using. The only modification I made is to change the output formatter from Ruby to XML and run the output through a pretty printer.

This is the one that returns the facet.fields I'm interested in. The problem field is the first one returned:

Query:
/solr/select/?facet=true&facet.mincount=1&facet.offset=0&facet.limit=22&wt=xml&rows=0&fl=*,score&start=0&facet.sort=true&q=division_t:%22Accounting%22;last_name_facet+asc&facet.field=company_facet&qt=standard&fq=in_redbook_b:true&debugQuery=true

Response:
<?xml version="1.0" encoding="UTF-8"?>
<response>
      <lst name="responseHeader">
            <int name="status">0</int>
            <int name="QTime">488</int>
            <lst name="params">
                  <str name="facet">true</str>
                  <str name="facet.offset">0</str>
                  <str name="facet.mincount">1</str>
                  <str name="facet.limit">22</str>
                  <str name="wt">xml</str>
                  <str name="rows">0</str>
                  <str name="fl">*,score</str>
                  <str name="debugQuery">true</str>
                  <str name="facet.sort">true</str>
                  <str name="start">0</str>
                  <str name="q">division_t:"Accounting";last_name_facet asc</str>
                  <str name="facet.field">company_facet</str>
                  <str name="qt">standard</str>
                  <str name="fq">in_redbook_b:true</str>
            </lst>
      </lst>
      <result name="response" numFound="16508" start="0" maxScore="4.144086"/>
      <lst name="facet_counts">
            <lst name="facet_queries"/>
            <lst name="facet_fields">
                  <lst name="company_facet">
                        <int name="Deloitte &amp;amp; Touche">4114</int>
                        <int name="Ernst &amp;amp; Young">1379</int>
                        <int name="PricewaterhouseCoopers">1257</int>
                        <int name="KPMG LLP">206</int>
                        <int name="Ernst &amp;amp; Young LLP">154</int>
                        <int name="Weiser LLP">134</int>
                        <int name="WithumSmith+Brown">86</int>
                        <int name="Eisner LLP">80</int>
                        <int name="Rothstein Kass">68</int>
                        <int name="Grant Thornton LLP">64</int>
                        <int name="RSM McGladrey Inc.">56</int>
                        <int name="Deloitte">49</int>
                        <int name="McGladrey &amp;amp; Pullen LLP">49</int>
                        <int name="J.H. Cohn LLP">45</int>
                        <int name="J. H. Cohn LLP">44</int>
                        <int name="Marks Paneth &amp;amp; Shron LLP">42</int>
                        <int name="Amper, Politziner &amp;amp; Mattia PC">41</int>
                        <int name="Marcum &amp;amp; Kliegman LLP">40</int>
                        <int name="Citrin Cooperman &amp;amp; Company LLP">36</int>
                        <int name="Holtz Rubenstein Reminick LLP">36</int>
                        <int name="Mahoney Cohen &amp;amp; Company CPAs P.C.">36</int>
                        <int name="D'Arcangelo &amp;amp; Company LLP">35</int>
                  </lst>
            </lst>
            <lst name="facet_dates"/>
      </lst>
      <lst name="debug">
            <str name="rawquerystring">division_t:"Accounting";last_name_facet asc</str>
            <str name="querystring">division_t:"Accounting";last_name_facet asc</str>
            <str name="parsedquery">division_t:account</str>
            <str name="parsedquery_toString">division_t:account</str>
            <lst name="explain"/>
            <arr name="filter_queries">
                  <str>in_redbook_b:true</str>
            </arr>
            <arr name="parsed_filter_queries">
                  <str>in_redbook_b:true</str>
            </arr>
            <lst name="timing">
                  <double name="time">488.0</double>
                  <lst name="prepare">
                        <double name="time">1.0</double>
                        <lst name="org.apache.solr.handler.component.QueryComponent">
                              <double name="time">1.0</double>
                        </lst>
                        <lst name="org.apache.solr.handler.component.FacetComponent">
                              <double name="time">0.0</double>
                        </lst>
                        <lst name="org.apache.solr.handler.component.MoreLikeThisComponent">
                              <double name="time">0.0</double>
                        </lst>
                        <lst name="org.apache.solr.handler.component.HighlightComponent">
                              <double name="time">0.0</double>
                        </lst>
                        <lst name="org.apache.solr.handler.component.DebugComponent">
                              <double name="time">0.0</double>
                        </lst>
                  </lst>
                  <lst name="process">
                        <double name="time">487.0</double>
                        <lst name="org.apache.solr.handler.component.QueryComponent">
                              <double name="time">1.0</double>
                        </lst>
                        <lst name="org.apache.solr.handler.component.FacetComponent">
                              <double name="time">486.0</double>
                        </lst>
                        <lst name="org.apache.solr.handler.component.MoreLikeThisComponent">
                              <double name="time">0.0</double>
                        </lst>
                        <lst name="org.apache.solr.handler.component.HighlightComponent">
                              <double name="time">0.0</double>
                        </lst>
                        <lst name="org.apache.solr.handler.component.DebugComponent">
                              <double name="time">0.0</double>
                        </lst>
                  </lst>
            </lst>
      </lst>
</response>

---------------------------------------------------------------------

And then this is the one where I select the first facet.field returned above, and attempt to pull up those results:


Query:
/solr/select/?fl=*,score&start=0&wt=xml&q=division_t:%22Accounting%22;last_name_facet+asc&qt=standard&fq=company_facet:%22Deloitte+%26+Touche%22&fq=in_redbook_b:true&rows=30&debugQuery=true

Response:
<?xml version="1.0" encoding="UTF-8"?>
<response>
      <lst name="responseHeader">
            <int name="status">0</int>
            <int name="QTime">1</int>
            <lst name="params">
                  <str name="fl">*,score</str>
                  <str name="debugQuery">true</str>
                  <str name="start">0</str>
                  <str name="q">division_t:"Accounting";last_name_facet asc</str>
                  <str name="wt">xml</str>
                  <str name="qt">standard</str>
                  <arr name="fq">
                        <str>company_facet:"Deloitte & Touche"</str>
                        <str>in_redbook_b:true</str>
                  </arr>
                  <str name="rows">30</str>
            </lst>
      </lst>
      <result name="response" numFound="0" start="0" maxScore="0.0"/>
      <lst name="debug">
            <str name="rawquerystring">division_t:"Accounting";last_name_facet asc</str>
            <str name="querystring">division_t:"Accounting";last_name_facet asc</str>
            <str name="parsedquery">division_t:account</str>
            <str name="parsedquery_toString">division_t:account</str>
            <lst name="explain"/>
            <arr name="filter_queries">
                  <str>company_facet:"Deloitte & Touche"</str>
                  <str>in_redbook_b:true</str>
            </arr>
            <arr name="parsed_filter_queries">
                  <str>company_facet:Deloitte & Touche</str>
                  <str>in_redbook_b:true</str>
            </arr>
            <lst name="timing">
                  <double name="time">1.0</double>
                  <lst name="prepare">
                        <double name="time">1.0</double>
                        <lst name="org.apache.solr.handler.component.QueryComponent">
                              <double name="time">1.0</double>
                        </lst>
                        <lst name="org.apache.solr.handler.component.FacetComponent">
                              <double name="time">0.0</double>
                        </lst>
                        <lst name="org.apache.solr.handler.component.MoreLikeThisComponent">
                              <double name="time">0.0</double>
                        </lst>
                        <lst name="org.apache.solr.handler.component.HighlightComponent">
                              <double name="time">0.0</double>
                        </lst>
                        <lst name="org.apache.solr.handler.component.DebugComponent">
                              <double name="time">0.0</double>
                        </lst>
                  </lst>
                  <lst name="process">
                        <double name="time">0.0</double>
                        <lst name="org.apache.solr.handler.component.QueryComponent">
                              <double name="time">0.0</double>
                        </lst>
                        <lst name="org.apache.solr.handler.component.FacetComponent">
                              <double name="time">0.0</double>
                        </lst>
                        <lst name="org.apache.solr.handler.component.MoreLikeThisComponent">
                              <double name="time">0.0</double>
                        </lst>
                        <lst name="org.apache.solr.handler.component.HighlightComponent">
                              <double name="time">0.0</double>
                        </lst>
                        <lst name="org.apache.solr.handler.component.DebugComponent">
                              <double name="time">0.0</double>
                        </lst>
                  </lst>
            </lst>
      </lst>
</response>

(The other filter query, in_redbook_b, is a boolean field used to partition our dataset. It shouldn't affect the results since it's in both queries.)

Thanks again for your help, I really appreciate your time.

Jonathan


Re: How does solr.StrField handle punctuation?

hossman

:                         <int name="Deloitte &amp;amp; Touche">4114</int>
:                         <int name="Ernst &amp;amp; Young">1379</int>

A-Ha! ... this is where the details really matter.  unless your email
program did something funky with the XML you sent, what this tells me is
that you don't actually have the values "Deloitte & Touche" or "Ernst &
Young" in your index.  The literal values in your index are "Deloitte
&amp; Touche" and "Ernst &amp; Young" .. most likely you are "double XML
escaping" your source data before indexing.  if i'm right, then when you
use the ruby output format, you'll see...
  ...
  'facet_fields'=>{
        'company_facet'=>[
           'Deloitte &amp; Touche',4114,
           'Ernst &amp; Young',1379,
  ...

If you change your fq to...

        fq=company_facet:%22Deloitte+%26amp%3B+Touche%22

...so that it is a URL escaping of the value you get after un-XML
escaping the response (just once!) you should start seeing the correct
results.  but your long term solution is to stop double escaping your data
before indexing it.
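
A short sketch of what double escaping does and how the workaround fq above is derived, in Python (standard library only). How the indexing application actually escaped the data is not shown in this thread, so the two `escape` calls are only an assumed reconstruction of the bug; `xml.sax.saxutils.escape`/`unescape` and `urllib.parse.quote_plus` are standard-library functions.

```python
from urllib.parse import quote_plus
from xml.sax.saxutils import escape, unescape

source = "Deloitte & Touche"

# Escaping once is what an XML update message needs.
once = escape(source)            # Deloitte &amp; Touche
# Escaping an already-escaped string is the suspected bug (assumption).
twice = escape(once)             # Deloitte &amp;amp; Touche

# Solr unescapes the update XML once, so the stored value keeps one "&amp;".
indexed = unescape(twice)

# The XML response escapes the stored value again, which is what was posted.
in_response = escape(indexed)

# Workaround: un-XML-escape the response value once, then URL-escape it.
facet_value = unescape(in_response)
fq = "fq=" + quote_plus('company_facet:"%s"' % facet_value, safe=":")
print(fq)
```

The printed parameter matches the fq suggested above. With the indexing side fixed to escape only once, `indexed` would equal `source` and the plain `%26` form of the query would match.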



-Hoss


Re: How does solr.StrField handle punctuation?

terhorst
Nailed it right on the head. That solves it. Thanks much!

Jonathan
