WordDelimiter in extended way.


servus01
Hello,

maybe somebody can help me out. We have a lot of datasets that are always
built according to the same scheme:

Expression - Expression

as an example:

"CCF *HD - 2nd* BL 2019-2020 1st matchday VfL Osnabrück vs. 1st FC
Heidenheim 1846 | 1st HZ without WZ"

or

"Scouting Feed *mp4 - 2.* BL 2019-2020 1st matchday SV Wehen Wiesbaden vs.
Karlsruher SC"

Now Solr behaves in such a way that, on the one hand, hyphens with a blank
before and after them are not indexed, and on the other hand a search for
blank - blank returns no results.
With the WordDelimiter filter I have already covered cases like 2019-2020,
but for blank - blank I'm running out of ideas. Ideally it would tokenize
the word before the hyphen, the blanks together with the hyphen, and the
word after the hyphen as one token.

Best

Francois



--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: WordDelimiter in extended way.

Shawn Heisey-2
On 10/23/2019 7:43 AM, servus01 wrote:
> Now Solr behaves in such a way that, on the one hand, hyphens with a blank
> before and after them are not indexed, and on the other hand a search for
> blank - blank returns no results.
> With the WordDelimiter filter I have already covered cases like 2019-2020,
> but for blank - blank I'm running out of ideas. Ideally it would tokenize
> the word before the hyphen, the blanks together with the hyphen, and the
> word after the hyphen as one token.

To figure out what's happening, we will need to see the entire analysis
chain, both index and query.  In order to see those, we will need the
field definition as well as the referenced fieldType definition from
your schema.  Additional details needed:  Exact Solr version and the
schema version.  The schema version is at the top of the schema.

Thanks,
Shawn
Re: WordDelimiter in extended way.

servus01
Hey,

thank you for helping me:


<schema name="default-config" version="1.6">

 <field name="videoTitle" type="text_general" omitNorms="false" indexed="true" stored="true" termPositions="false" termVectors="false" multiValued="false"/>


 <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="German2" protected="protwords.txt"/>
      <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
      <filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" preserveOriginal="1" catenateWords="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="German2" protected="protwords.txt"/>
      <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
      <filter class="solr.SynonymGraphFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
      <filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" preserveOriginal="1" catenateWords="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>





Thanks in advance for any help, really appreciated.

<https://lucene.472066.n3.nabble.com/file/t494058/screenshot.jpg>
<https://lucene.472066.n3.nabble.com/file/t494058/screenshot3.jpg>




Re: WordDelimiter in extended way.

Shawn Heisey-2
On 10/23/2019 9:41 AM, servus01 wrote:
> Hey,
>
> thank you for helping me:
>
> Thanks in advance for any help, really appreciated.
>
> <https://lucene.472066.n3.nabble.com/file/t494058/screenshot.jpg>
> <https://lucene.472066.n3.nabble.com/file/t494058/screenshot3.jpg>

It is not the WordDelimiter filter that is affecting your punctuation.
It is the StandardTokenizer, which is the first analysis component that
runs.  You can see this in the first screenshot, where that tokenizer
outputs the terms "CCF", "HD", and "2nd"; the hyphen is already gone.

The WordDelimiter filter is capable of affecting punctuation, depending
on its settings, but in this case no punctuation is left by the time the
analysis reaches that filter.

Thanks,
Shawn
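
[Editor's note: the diagnosis above suggests one possible direction, sketched
here for illustration only; it was not proposed in this thread and would need
testing. Since StandardTokenizer strips a standalone hyphen before any filter
sees it, a whitespace-based tokenizer lets punctuation reach the
WordDelimiterGraphFilter. The field type name below is hypothetical.]

<!-- Hypothetical variant of the posted fieldType: WhitespaceTokenizer splits
     only on whitespace, so a standalone "-" (and "2019-2020") survives
     tokenization and actually reaches the WordDelimiterGraphFilter instead
     of being stripped by StandardTokenizer. -->
<fieldType name="text_whitespace_wdgf" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory"
            generateWordParts="1" preserveOriginal="1" catenateWords="1"/>
    <!-- FlattenGraphFilter is needed at index time after graph filters -->
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

[Note that WhitespaceTokenizer would emit the standalone "-" as its own token
rather than joining it with its neighbors, so whether this yields the exact
single token described earlier (word, blanks, hyphen, word) would still need
to be verified on the Analysis screen.]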

Re: WordDelimiter in extended way.

servus01