Indexing fieldvalues with dashes and spaces

PeterKerk
I'm having issues with indexing field values containing spaces and dashes.
For example: I'm trying to index province names of the Netherlands. Some province names contain a "-":
Zuid-Holland
Noord-Holland

My data-config has this:
            <entity name="location_province" query="select provinceid from locations where id=${location.id}">
                <entity name="provinces" query="select title from provinces where id = ${location_province.provinceid}">
                    <field name="province" column="title"  />
                </entity>
            </entity>


When I check what has been indexed, I have this:
<response>

<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>

<lst name="params">
<str name="indent">on</str>
<str name="start">0</str>
<str name="q">*:*</str>
<str name="version">2.2</str>
<str name="rows">10</str>
</lst>
</lst>

<result name="response" numFound="3" start="0">

<doc>
<str name="city">Nijmegen</str>

<arr name="features">
<str>Tuin</str>
<str>Cafe</str>
</arr>
<str name="id">1</str>
<str name="province">Gelderland</str>

<arr name="services">
<str>Fotoreportage</str>
</arr>

<arr name="theme">
<str>Gemeentehuis</str>
</arr>
<date name="timestamp">2010-08-04T19:11:51.796Z</date>
<str name="title">Gemeentehuis Nijmegen</str>
</doc>

<doc>
<str name="city">Utrecht</str>

<arr name="features">
<str>Tuin</str>
<str>Cafe</str>
<str>Danszaal</str>
</arr>
<str name="id">2</str>
<str name="province">Utrecht</str>

<arr name="services">
<str>Fotoreportage</str>
<str>Exclusieve huur</str>
</arr>

<arr name="theme">
<str>Gemeentehuis</str>
</arr>
<date name="timestamp">2010-08-04T19:11:51.796Z</date>
<str name="title">Gemeentehuis Utrecht</str>
</doc>

<doc>
<str name="city">Bloemendaal</str>

<arr name="features">
<str>Strand</str>
<str>Cafe</str>
<str>Danszaal</str>
</arr>
<str name="id">3</str>
<str name="province">Zuid-Holland</str>

<arr name="services">
<str>Exclusieve huur</str>
<str>Live muziek</str>
</arr>

<arr name="theme">
<str>Strand & Zee</str>
</arr>
<date name="timestamp">2010-08-04T19:11:51.812Z</date>
<str name="title">Beachclub Vroeger</str>
</doc>
</result>
</response>



So we see that the full field has been indexed:
<str name="province">Zuid-Holland</str>


BUT, when I check the facets via
http://localhost:8983/solr/db/select/?wt=json&indent=on&q=*:*&fl=id,title,city,score,features,official,services&facet=true&facet.field=theme&facet.field=features&facet.field=province&facet.field=services

I get this (snippet):
"facet_counts":{
  "facet_queries":{},
  "facet_fields":{
        "theme":[
         "Gemeentehuis",2,
         "&",1,               <================ a
         "Strand",1,
         "Zee",1],
        "features":[
         "cafe",3,
         "danszaal",2,
         "tuin",2,
         "strand",1],
        "province":[
         "gelderland",1,
         "holland",1,
         "utrecht",1,
         "zuid",1,         <================  b
         "zuidholland",1],
        "services":[
         "exclusiev",2,
         "fotoreportag",2, <================  c
         "huur",2,
         "live",1,          <================  d
         "muziek",1]},


Several weird things happen here, which I have marked with <===

a. the full field value is "Strand & Zee", but now one facet is "&"
b. the full field value is "Zuid-Holland", but now "zuid" is a separate facet
c. the full field value is "Fotoreportage", but somehow the last character has been truncated
d. the full field value is "Live muziek", but now "live" and "muziek" have become separate facets

What can I do about this?

RE: Indexing fieldvalues with dashes and spaces

Michael Griffiths
Your schema.xml setting for the field is probably tokenizing the punctuation. Change the field type to one that doesn't tokenize on punctuation; e.g. use "text_ws" and not "text"

-----Original Message-----
From: PeterKerk [mailto:[hidden email]]
Sent: Wednesday, August 04, 2010 3:36 PM
To: [hidden email]
Subject: Indexing fieldvalues with dashes and spaces


--
View this message in context: http://lucene.472066.n3.nabble.com/Indexing-fieldvalues-with-dashes-and-spaces-tp1023699p1023699.html
Sent from the Solr - User mailing list archive at Nabble.com.



RE: Indexing fieldvalues with dashes and spaces

PeterKerk
I changed the field types to text_ws.

Now I only seem to have problems with field values that hold spaces... see below:

   <field name="city" type="text_ws" indexed="true" stored="true"/>
   <field name="theme" type="text_ws" indexed="true" stored="true" multiValued="true" omitNorms="true" termVectors="true" />
   <field name="features" type="text_ws" indexed="true" stored="true" multiValued="true"/>
   <field name="services" type="text_ws" indexed="true" stored="true" multiValued="true"/>
   <field name="province" type="text_ws" indexed="true" stored="true"/>

It has now become:

 "facet_counts":{
  "facet_queries":{},
  "facet_fields":{
        "theme":[
         "Gemeentehuis",2,
         "&",1,   <======== still & is created as separate facet
         "Strand",1,
         "Zee",1],
        "features":[
         "Cafe",3,
         "Danszaal",2,
         "Tuin",2,
         "Strand",1],
        "province":[
         "Gelderland",1,
         "Utrecht",1,
         "Zuid-Holland",1], <======== this is now correct
        "services":[
         "Exclusieve",2,
         "Fotoreportage",2,
         "huur",2,
         "Live",1, <======== "Live muziek" is split and separate facets are created
         "muziek",1]},
  "facet_dates":{}}}

RE: Indexing fieldvalues with dashes and spaces

Markus Jelsma
You shouldn't fetch faceting results from analyzed fields; it will mess with your results. Search on analyzed fields, but don't retrieve values from them.
 
-----Original message-----
From: PeterKerk <[hidden email]>
Sent: Wed 04-08-2010 22:15
To: [hidden email];
Subject: RE: Indexing fieldvalues with dashes and spaces



RE: Indexing fieldvalues with dashes and spaces

Michael Griffiths
Echoing Markus - use the tokenized field to return results, but have a duplicate field of fieldtype="string" to show the untokenized results. E.g. facet on that field.
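
A minimal sketch of that setup, assuming the `services` field from the schema fragment posted earlier in this thread. The `services_facet` name is hypothetical; any unused field name would do. The `string` type (solr.StrField) is not analyzed, so each value is indexed as a single term.

```xml
<!-- Inside <fields>: the existing tokenized field, still used for searching. -->
<field name="services" type="text_ws" indexed="true" stored="true" multiValued="true"/>

<!-- Hypothetical untokenized twin, used only for faceting. -->
<field name="services_facet" type="string" indexed="true" stored="false" multiValued="true"/>

<!-- Outside <fields>: copy every incoming value of services into services_facet. -->
<copyField source="services" dest="services_facet"/>
```

After a full reindex, requesting facet.field=services_facet instead of facet.field=services should return whole values such as "Live muziek" rather than the separate tokens "Live" and "muziek".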

-----Original Message-----
From: Markus Jelsma [mailto:[hidden email]]
Sent: Wednesday, August 04, 2010 4:18 PM
To: [hidden email]
Subject: RE: Indexing fieldvalues with dashes and spaces


RE: Indexing fieldvalues with dashes and spaces

PeterKerk
In reply to this post by Markus Jelsma
Sorry, but I'm a newbie to Solr... how would I change my schema.xml to match your requirements?

And what do you mean by "it will mess with your results"? What will happen then?

RE: Indexing fieldvalues with dashes and spaces

Markus Jelsma
Hmm, you should first read a bit more on schema design on the wiki and learn about indexing and querying Solr.

 

The copyField directive is what is commonly used in a faceted navigation system: search on analyzed fields, and show faceting results from a field of the primitive string type. With copyField you copy the original input value from one field to another, before the source field's analysis is applied. copyField directives cannot be chained, which is good.

 

Let's say you have a city field you want to navigate with, but also search in, then you would have an analyzed field for search and a string field for displaying the navigation.

 

But, check the wiki on this subject.
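
As a rough sketch of the city example above, assuming a schema that already defines an analyzed `text` type and the unanalyzed `string` type. The field name `city_facet` is hypothetical.

```xml
<!-- Inside <fields>: analyzed field, the one users search against. -->
<field name="city" type="text" indexed="true" stored="true"/>

<!-- Hypothetical unanalyzed copy, used only to display the navigation. -->
<field name="city_facet" type="string" indexed="true" stored="false"/>

<!-- Outside <fields>: city_facet receives the original input value,
     not the tokens produced by the analyzer of city. -->
<copyField source="city" dest="city_facet"/>
```

With this in place, a text search would go against the analyzed field (for example q=city:utrecht), the navigation would be built with facet=true&facet.field=city_facet, and a clicked facet value could be applied as a filter such as fq=city_facet:"New York".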
 
-----Original message-----
From: PeterKerk <[hidden email]>
Sent: Wed 04-08-2010 22:23
To: [hidden email];
Subject: RE: Indexing fieldvalues with dashes and spaces



RE: Indexing fieldvalues with dashes and spaces

PeterKerk
Well the example you provided is 100% relevant to me :)

I've read the wiki now (SchemaXml, SolrFacetingOverview, Query Syntax, SimpleFacetParameters), but I still do not have an exact idea of what you mean.

My situation:
A city field is something that I want users to search on via text input, so let's say "New Yo" would give the results for "New York".
But a facet "Cities" is also available, in which "New York" is just one of the clickable cities.

The other facet is "theme", which in my example holds values like "Gemeentehuis" and "Strand & Zee"; it cannot be searched via manual input but IS clickable.

If you look at my schema.xml, do you see anything I'm doing that is absolutely wrong for the purpose described above? As far as I can see the documents are indexed correctly (BESIDES the spaces in the field values).

Any help is greatly appreciated! :)

Re: Indexing fieldvalues with dashes and spaces

Erick Erickson
I suspect you're running afoul of tokenizers and filters. The parts of your schema that you published aren't the ones that really count.

What you probably need to look at is the FieldType definitions, i.e. what analysis is done for, say, text_ws (see <FieldType... in your schema). There you might find things like WordDelimiterFilter with several options. LowerCaseFilter, etc. Each of these changes what's placed in your index. Here's a good place to start, although it's not exhaustive:

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

The general idea here is that the Tokenizers in general break up the incoming stream according to various rules. The Filters then (potentially) modify each token in various ways.

Until you have a firm handle on this process, facets are probably a distraction. You're better off looking at your index with the admin pages and/or Luke and/or LukeRequestHandler.

And do be aware that fields you get back from a request (i.e. a search) are the stored fields, NOT what's indexed. This may trip you up too...

HTH
Erick

On Wed, Aug 4, 2010 at 5:22 PM, PeterKerk <[hidden email]> wrote:


RE: Indexing fieldvalues with dashes and spaces

PeterKerk
In reply to this post by PeterKerk
@Michael, @Erick,

You both mention interesting things that triggered me.

@Erick:
Your referenced page is very useful. It seems the whitespace tokenizer under the text_ws is causing issues.

You do mention another interesting thing:
"And do be aware that fields you get back from a request (i.e. a search) are the stored fields, NOT what's indexed."

On the page you provided I see this under the Analyzers section: "Analyzers are components that pre-process input text at index time and/or at search time."

So I don't completely understand how that sentence is in line with your comment.


@Michael:
You say: "use the tokenized field to return results, but have a duplicate field of fieldtype="string" to show the untokenized results. E.g. facet on that field."
I think your comment applies to my requirement: "A city field is something that I want users to search on via text input, so let's say "New Yo" would give the results for "New York".
But a facet "Cities" is also available, in which "New York" is just one of the clickable cities.
The other facet is "theme", which in my example holds values like "Gemeentehuis" and "Strand & Zee"; it cannot be searched via manual input but IS clickable."

Could you please indicate (just for the above fields) what needs to be changed in my schema.xml, and if so, how that affects the way my request is built up?


Thanks so much in advance for getting me started!


This is my schema.xml


<?xml version="1.0" encoding="UTF-8" ?>

<schema name="db" version="1.1">

  <types>
    <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
    <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true" omitNorms="true"/>
    <fieldType name="integer" class="solr.IntField" omitNorms="true"/>
    <fieldType name="long" class="solr.LongField" omitNorms="true"/>
    <fieldType name="float" class="solr.FloatField" omitNorms="true"/>
    <fieldType name="double" class="solr.DoubleField" omitNorms="true"/>
    <fieldType name="sint" class="solr.SortableIntField" sortMissingLast="true" omitNorms="true"/>
    <fieldType name="slong" class="solr.SortableLongField" sortMissingLast="true" omitNorms="true"/>
    <fieldType name="sfloat" class="solr.SortableFloatField" sortMissingLast="true" omitNorms="true"/>
    <fieldType name="sdouble" class="solr.SortableDoubleField" sortMissingLast="true" omitNorms="true"/>
    <fieldType name="date" class="solr.DateField" sortMissingLast="true" omitNorms="true"/>
    <fieldType name="random" class="solr.RandomSortField" indexed="true" />
    <fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      </analyzer>
    </fieldType>
    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

    <fieldType name="textTight" class="solr.TextField" positionIncrementGap="100" >
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

    <fieldType name="alphaOnlySort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.TrimFilterFactory" />
        <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all" />
      </analyzer>
    </fieldType>
    <fieldtype name="ignored" stored="false" indexed="false" class="solr.StrField" /> 
 </types>

 <fields>
   <field name="id" type="string" indexed="true" stored="true" required="true" /> 
   <field name="title" type="text_ws" indexed="true" stored="true"/>
   <field name="city" type="text_ws" indexed="true" stored="true"/>
   <field name="official" type="integer" indexed="true" stored="true"/>
   <field name="theme" type="text_ws" indexed="true" stored="true" multiValued="true" omitNorms="true" termVectors="true" />
   <field name="features" type="text_ws" indexed="true" stored="true" multiValued="true"/>
   <field name="services" type="text_ws" indexed="true" stored="true" multiValued="true"/>
   <field name="province" type="text_ws" indexed="true" stored="true"/>
   <field name="word" type="string" indexed="true" stored="true"/>
   <field name="text" type="text" indexed="true" stored="false" multiValued="true"/>
   <field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>

   <dynamicField name="*_i"  type="sint"    indexed="true"  stored="true"/>
   <dynamicField name="*_s"  type="string"  indexed="true"  stored="true"/>
   <dynamicField name="*_l"  type="slong"   indexed="true"  stored="true"/>
   <dynamicField name="*_t"  type="text"    indexed="true"  stored="true"/>
   <dynamicField name="*_b"  type="boolean" indexed="true"  stored="true"/>
   <dynamicField name="*_f"  type="sfloat"  indexed="true"  stored="true"/>
   <dynamicField name="*_d"  type="sdouble" indexed="true"  stored="true"/>
   <dynamicField name="*_dt" type="date"    indexed="true"  stored="true"/>
   <dynamicField name="random*" type="random" />

 </fields>

 <uniqueKey>id</uniqueKey>

 <defaultSearchField>text</defaultSearchField>

 <solrQueryParser defaultOperator="OR"/>

   <copyField source="theme" dest="text"/>
   <copyField source="title" dest="text"/>
   <copyField source="city" dest="text"/>
   <copyField source="official" dest="text" />
   <copyField source="features" dest="text"/>
   <copyField source="services" dest="text"/>
</schema>

Re: Indexing fieldvalues with dashes and spaces

Erick Erickson
This confuses lots of people. When you index a field, it's analyzed ten ways from Sunday. Consider "The World is an unknown Entity". When you INDEX it, many things happen, depending upon the analyzer. Stopwords may be removed. Each token may be lowercased. Each token may be stemmed. It all depends on what's in your analyzer chain. Assume a simple chain consisting of breaking up tokens on whitespace, lowercasing, and removing stopwords. The actual tokens INDEXED would be "world", "unknown", and "entity". That is what is searched against.

However, the string, unchanged, would be STORED if you specified it so. So if a search matched on, say, "world" and you asked for the field to be returned in the search result, you would get "The World is an unknown Entity".
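
The simple chain described above could look roughly like the fragment below. This is only a sketch: the type name `text_simple` and the field name `description` are hypothetical, and it assumes that stopwords.txt contains "the", "is" and "an".

```xml
<!-- Inside <types>: split on whitespace, lowercase, then drop stopwords. -->
<fieldType name="text_simple" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  </analyzer>
</fieldType>

<!-- Inside <fields>: indexed="true" puts the analyzed tokens in the index,
     stored="true" keeps the original string for display in results. -->
<field name="description" type="text_simple" indexed="true" stored="true"/>
```

Because faceting counts the indexed terms, a facet on a field like this would list the individual tokens, while the field returned with each document would still show the original stored string.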

HTH
Erick

On Thu, Aug 5, 2010 at 4:31 AM, PeterKerk <[hidden email]> wrote:

>   <dynamicField name="random*" type="random" />
>
>  </fields>
>
>  <uniqueKey>id</uniqueKey>
>
>  <defaultSearchField>text</defaultSearchField>
>
>  <solrQueryParser defaultOperator="OR"/>
>
>   <copyField source="theme" dest="text"/>
>   <copyField source="title" dest="text"/>
>   <copyField source="city" dest="text"/>
>   <copyField source="official" dest="text" />
>   <copyField source="features" dest="text"/>
>   <copyField source="services" dest="text"/>
> </schema>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Indexing-fieldvalues-with-dashes-and-spaces-tp1023699p1025463.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>

Re: Indexing fieldvalues with dashes and spaces

PeterKerk
Ah, I'm glad it does, makes me feel a bit less stupid ;)

So to summarize and see if I understand it now:
- the analyzers allow for many different ways to index a field, these analyzers are placed in a chain
- when a field is indexed it can be searched
- a field could also be stored as-is, but I would need to indicate that
- only if a field is stored can a search query return that field when the query matches part of it

If that's the case, it's clear what it all does.

Then the question remains how to configure this in schema.xml. There's so much documentation and so many examples in so many different places that I'm lost. I've used the example schema.xml almost literally, since it has many similarities with my use case (e.g. the categories facet), but I don't know if it allows for the exact operations I require.

If you look at my schema.xml, how would you configure it to do the following:

A city field is something that I want users to search on via text input, so let's say "New Yo" would give the results for "New York".
===> so this field would need to be stored, right?

But also a facet "Cities" is available in which "New York" is just one of the cities that is selectable as a filter/facet.
===> for this I need to create a facet

The other facet is "theme", which in my example holds values like "Gemeentehuis" and "Strand & Zee". It would not be searchable via manual input, but it IS selectable as a filter/facet.
===> this field would NOT have to be stored, right?

Thanks for your time! :)

Regards,
Pete


Erick Erickson wrote
This confuses lots of people. When you index a field, it's Analyzed 10
ways from Sunday. Consider "The World is an unknown Entity". When
you INDEX it, many things happen, depending upon the analyzer.
Stopwords may be removed. Each token may be lowercased. Each token
may be stemmed. It all depends on what's in your analyzer chain. Assume
a simple chain consisting of breaking up tokens on whitespace, lowercasing,
and removing stopwords. The actual tokens INDEXED would be "world",
"unknown", and "entity". That is what is searched against.

However, the string, unchanged, would be STORED if you specified it so.
So if you asked for the field to be returned as part of a search result
that matched on, say, "world", you would get "The World is an unknown Entity".

HTH
Erick


Re: Indexing fieldvalues with dashes and spaces

Erick Erickson
See below:

On Fri, Aug 6, 2010 at 9:00 AM, PeterKerk <[hidden email]> wrote:

>
> Ah, I'm glad it does, makes me feel a bit less stupid ;)
>
> So to summarize and see if I understand it now:
> - the analyzers allow for many different ways to index a field, these
> analyzers are placed in a chain
>

Minor terminology nit. An Analyzer consists of a Tokenizer and N Filters.
The Tokenizer breaks up the input stream, then the Filters "do things"
to the tokens. Say you're using WhitespaceTokenizer on "This
time all People are Good". The tokenizer would create the tokens
This, time, all, People, are, Good
LowerCaseFilter would transform these to
this, time, all, people, are, good
then you could apply, say, a StopFilter, which could remove the
tokens this, all, and are, leaving you with
time, people, good
etc.

See http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
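
As a rough sketch, the chain described above could look like this in schema.xml. The type name text_simple is made up for illustration, and stopwords.txt is assumed to contain this, all and are:

```xml
<!-- Hypothetical field type: one tokenizer followed by two filters -->
<fieldType name="text_simple" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Split the input on whitespace: This, time, all, People, are, Good -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Lowercase every token: this, time, all, people, are, good -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Drop tokens listed in stopwords.txt (assumed to hold: this, all, are) -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  </analyzer>
</fieldType>
```

Since the analyzer element has no type attribute, the same chain is applied at both index time and query time.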


> - when a field is indexed it can be searched
> - a field could also be stored as-is, but I would need to indicate that
> - only if a field is stored, a search query returns that field if the
> search
> query matches a part of it
>
Very close. Only stored fields can be returned, and you can choose which
ones by configuring your search handler or by passing a parameter like
&fl=fieldname.
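
A minimal sketch of the handler-side option, assuming a handler named "standard" in solrconfig.xml and the field names from the schema in this thread. A request can still override the list with its own fl parameter:

```xml
<!-- solrconfig.xml: the handler name is an assumption; fl lists the stored fields returned by default -->
<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <str name="fl">id,title,city,province</str>
  </lst>
</requestHandler>
```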

>
> If thats the case its clear what it all does.
>
> Then still the question remains on how to configure this in the schema.xml.
> There's just so much documentation and examples in so many different places
> that I'm lost. I've used almost the literal example schema.xml which has
> many similarities (e.g. on categories facet) with my use case, but I dont
> know if they allow for the exact operations I require.
>
> If you look at my schema.xml, how would you configure it to do the
> following:
>
> a city field is something that I want users to search on via text input, so
> lets say "New Yo" would give the results for "New York".
> ===> so this field would need to be stored right?
>
No, you don't need to store it at all. You can search anything
that's indexed. Stored is only for returning a copy of the data
as a field. What you *would* have to do is figure out the rules you
want to apply so that "New Yo" matches "New York". You could
use NGramFilterFactory or EdgeNGramFilterFactory.
You could decide to search with wildcards. You could choose to
autocomplete what the user is entering. You could...
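
For the EdgeNGramFilterFactory route, a sketch might look like the following. The type and field names (text_prefix, city_prefix) and the gram sizes are assumptions for illustration, not something taken from the schema in this thread:

```xml
<!-- Hypothetical type: index every leading prefix of the whole value -->
<fieldType name="text_prefix" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Keep "New York" as a single token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Emit n, ne, new, ... "new yo", ... up to 25 characters -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <!-- The query is lowercased but not n-grammed, so "new yo" is looked up as-is -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Search-only copy of city; not stored because it is never displayed -->
<field name="city_prefix" type="text_prefix" indexed="true" stored="false"/>
<copyField source="city" dest="city_prefix"/>
```

With the standard query parser the input has to reach the field as one token, for example as a quoted phrase such as city_prefix:"new yo", because unquoted whitespace is split before analysis.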


> But also a facet "Cities" is available in which "New York" is just one of
> the cities that is selectable as a filter/facet.
> ===> for this I need to create a facet
>
> The other facet is "theme", which in my example holds values like
> "Gemeentehuis" and "Strand & Zee", that would not be a thing on which can
> be
> searched via manual input but IS selectable as a filter/facet
> ===> this field would NOT have to be stored right?
>
You don't have to store fields that you facet on; faceting uses the
indexed terms. See the discussion here:
http://wiki.apache.org/solr/SolrFacetingOverview

Best
Erick



Re: Indexing fieldvalues with dashes and spaces

PeterKerk
Hi Erick,

OK, it's clearer now. I do indeed have the whitespace tokenizer:

    <fieldType name="textTrue" class="solr.TextField" positionIncrementGap="100" >
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_dutch.txt" />
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.ISOLatin1AccentFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="Dutch" protected="protwords.txt"/>
      </analyzer>
    </fieldType>


I have a field value "Beach & Sea", which is a theme for a location. Because of the whitespace tokenizer, it gets split up into 2 facet values:
         "Beach",2,
         "Sea",2],
(see below)

Of course those individual facet names are NOT correct; it should be "Beach & Sea".
But if I REMOVE the whitespace tokenizer, it throws an error saying that a fieldtype should always have a tokenizer.
So which tokenizer would I need in order to get the correct facet name?
(I've been checking this page, btw: http://lucene.apache.org/solr/api/org/apache/solr/analysis/package-summary.html)


"facet_counts":{
  "facet_queries":{},
  "facet_fields":{
        "themes":[
         "Gemeentehuis",2,
         "Beach",2,
         "Sea",2],
        "province":[
         "gelderland",1,
         "utrecht",1,
         "zuidholland",1],
        "services":[
         "exclusiev",2,
         "fotoreportag",2,
         "hur",2,
         "liv",1,
         "muziek",1]},
  "facet_dates":{}}}



Re: Indexing fieldvalues with dashes and spaces

Jan Høydahl / Cominvent
Hi,

Try solr.KeywordTokenizerFactory.

However, in your case it looks as if you have certain requirements for searching that require tokenization. So you should leave the WhitespaceTokenizer as is and create a separate field specifically for faceting, with indexed=true, stored=false and type=string. I often create a dynamic field for this, e.g. <dynamicField name="*_facet"...> and then do a copyField.
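
A sketch of that setup, assuming the string type from the schema earlier in the thread and theme as the source field. The *_facet suffix is only a naming convention:

```xml
<!-- Facet-only copies: indexed verbatim (StrField does no analysis), not stored -->
<dynamicField name="*_facet" type="string" indexed="true" stored="false" multiValued="true"/>

<!-- copyField copies the raw source value before analysis, so "Beach & Sea" stays one facet value -->
<copyField source="theme" dest="theme_facet"/>
```

Faceting would then use facet.field=theme_facet, while searches keep using the tokenized theme field.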

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com



Re: Indexing fieldvalues with dashes and spaces

PeterKerk
Sorry for late reply, just back from holiday :)

I did what you mentioned:

<field name="services_raw" type="string" indexed="true" stored="true" multiValued="true"/>
<copyField source="services" dest="services_raw"/>

and then in the URL: facet.field=services_raw

It works...awesome, thanks!