shingles + stop words

David Hastings
Hey there, I have a field type defined as such:
<fieldType name="skw2" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ManagedStopFilterFactory" managed="english"/>
    <filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="2"
            outputUnigrams="false" fillerToken=""/>
  </analyzer>
</fieldType>

but what's happening is that the shingles being returned are often
" nonstopword", with the space marking where the filler token stands in
for a removed stop word. I was hoping that the ManagedStopFilterFactory
would have removed the stop words completely before the tokens reached the
shingle factory, so that an indexed value of
"nonstopword1 stopword1 stopword2 nonstopword2" would have returned
"nonstopword1 nonstopword2", but that obviously isn't the case. Is there a
way to force that behavior?

Thanks, David
Re: shingles + stop words

Emir Arnautović
Hi David,
As you already observed, the shingle filter concatenates tokens based on their positions. A removed stopword leaves a position gap, which the shingle filter fills with the fillerToken value: an empty string in your configuration, but you can set it to something else.
You can do one of the following:
1. If you do not have too many stopwords, you could use PatternReplaceCharFilter to remove them before the text reaches the tokenizer. That way the stopwords never increment positions and you get the expected shingles. The downside is that you lose the managed part of the stopwords and have to reload cores in order to change them.
2. Customise the stopword filter so that it does not increment positions when it finds a stopword.
3. Customise the shingle filter so that it offers a flag for the desired behaviour.
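
A minimal sketch of option 1, assuming a short hard-coded stopword list. The words in the pattern are placeholders for illustration, not the contents of the managed "english" set, and the match is case-sensitive as written:

```xml
<fieldType name="skw2" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Assumed stopword list: replace the alternation with your own words.
         Stopwords are stripped from the raw text before tokenization,
         so they never occupy a token position. -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="\b(a|an|and|of|the)\b"
                replacement=""/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- No stop filter here: with no position gaps, fillerToken is never used. -->
    <filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="2"
            outputUnigrams="false"/>
  </analyzer>
</fieldType>
```

With this chain, "nonstopword1 the of nonstopword2" should reach the tokenizer as two words separated only by whitespace, giving the single shingle "nonstopword1 nonstopword2". Changing the pattern means editing the schema and reloading the core.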

HTH,
Emir



> On 7 Dec 2018, at 15:18, David Hastings <[hidden email]> wrote:
> [...]