Help with StopFilterFactory

heaven
This post was updated on .
Hi, I have the following text field:

<fieldType name="words_ngram" class="solr.TextField" omitNorms="false">
  <analyzer>
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[^\w]+" />
    <filter class="solr.StopFilterFactory" words="url_stopwords.txt" ignoreCase="true" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>

url_stopwords.txt looks like:
http
https
ftp
www

So very simple. In the index I have:
* twitter.com/testuser

All these queries do match:
* twitter.com/testuser
* com/testuser
* testuser

But none of these does:
* https://twitter.com/testuser
* https://www.twitter.com/testuser
* www.twitter.com/testuser

What am I doing wrong? Analysis makes me think something is wrong with the token positions:
<http://lucene.472066.n3.nabble.com/file/n4153839/oi7o69.jpg>
but I thought StopFilterFactory was supposed to remove the https/http/ftp/www keywords. Why do they show up there at all? That doesn't make much sense.

Regards,
Alexander
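
The analyzer chain described in this post can be approximated outside Solr. The following is only a minimal sketch in Python, assuming that the tokenizer splits on runs of non-word characters and that the stop filter compares case-insensitively; the function name `analyze` is made up for illustration and this is not Solr code. It shows which terms survive the chain, not where they end up positionally.

```python
import re

# Contents of url_stopwords.txt as listed in the post.
STOPWORDS = {"http", "https", "ftp", "www"}

def analyze(text):
    """Toy model of the field's analyzer chain (illustrative, not Solr code)."""
    # PatternTokenizerFactory with pattern="[^\w]+": split on non-word runs.
    raw = re.split(r"[^\w]+", text, flags=re.ASCII)
    tokens = [t for t in raw if t]  # drop empty strings left by separators
    # StopFilterFactory with ignoreCase="true".
    kept = [t for t in tokens if t.lower() not in STOPWORDS]
    # LowerCaseFilterFactory.
    return [t.lower() for t in kept]

for q in ("twitter.com/testuser", "https://www.twitter.com/testuser"):
    print(q, "->", analyze(q))
```

Both inputs yield the same three terms in this model, so the stop words themselves are gone; any difference between the matching and non-matching queries has to come from something other than the surviving terms.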

Re: Help with StopFilterFactory

Jack Krupansky-2
What release of Solr?

Do you have autoGeneratePhraseQueries="true" on the field?

And when you said "But any of these does", did you mean "But NONE of these
does"?

-- Jack Krupansky

-----Original Message-----
From: heaven
Sent: Tuesday, August 19, 2014 2:34 PM
To: [hidden email]
Subject: Help with StopFilterFactory

Hi, I have the next text field:

<fieldType name="words_ngram" class="solr.TextField" omitNorms="false">
  <analyzer>
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[^\w]+" />
    <filter class="solr.StopFilterFactory" words="url_stopwords.txt"
ignoreCase="true" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>

url_stopwords.txt looks like:
http
https
ftp
www

So very simple. In index I have:
* twitter.com/testuser

All these queries do match:
* twitter.com/testuser
* com/testuser
* testuser

But any of these does:
* https://twitter.com/testuser
* https://www.twitter.com/testuser
* www.twitter.com/testuser

What do I do wrong? Analysis makes me think something is wrong with token
positions:
<http://lucene.472066.n3.nabble.com/file/n4153839/oi7o69.jpg>
but I was thinking StopFilterFactory is supposed to remove
https/http/ftp/www keywords. Why do they figure there at all? That doesn't
make much sense.

Regards,
Alexander



--
View this message in context:
http://lucene.472066.n3.nabble.com/Help-with-StopFilterFactory-tp4153839.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Help with StopFilterFactory

heaven
> What release of Solr?
4.8.1.

> Do you have autoGeneratePhraseQueries="true" on the field?
No, the config I provided is exactly what I have.

> And when you said "But any of these does", did you mean "But NONE of these
does"?
Whoops, yes, fixed that.

Re: Help with StopFilterFactory

heaven
In reply to this post by Jack Krupansky-2
From this page: http://wiki.apache.org/solr/SchemaXml
>> autoGeneratePhraseQueries=true|false (in schema version 1.4 and later this now defaults to false)
I just checked: I have <schema name="sunspot" version="1.0">, so this may be true by default?

Re: Help with StopFilterFactory

heaven
Hello,

Yes, with schema version 1.5 all the examples that didn't work before now work. But the results also include records that match only on "com", "twitter", etc., which is not desirable.

It seems we do need autoGeneratePhraseQueries="true" but also need to ignore blacklisted words. Is that somehow possible?

Best,
Alexander
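
The difference described above, between matching any term and matching the terms as a phrase, can be pictured with a toy example. This is a sketch only: the two documents and the helper names `match_any` and `match_phrase` are invented for illustration and do not reflect how Lucene scores or executes queries.

```python
# Two hypothetical documents, already reduced to their analyzed terms.
DOCS = {
    "doc1": ["twitter", "com", "testuser"],
    "doc2": ["twitter", "com", "otheruser"],
}

def match_any(query_terms, doc_terms):
    """OR semantics: one shared term is enough."""
    return any(t in doc_terms for t in query_terms)

def match_phrase(query_terms, doc_terms):
    """Phrase semantics: the terms must appear adjacent and in order."""
    n = len(query_terms)
    return any(doc_terms[i:i + n] == query_terms
               for i in range(len(doc_terms) - n + 1))

query = ["twitter", "com", "testuser"]
any_hits = [d for d, terms in DOCS.items() if match_any(query, terms)]
phrase_hits = [d for d, terms in DOCS.items() if match_phrase(query, terms)]
print(any_hits)     # both documents share "twitter" and "com"
print(phrase_hits)  # only the document containing the full sequence
```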

Re: Help with StopFilterFactory

heaven
This post was updated on .
In reply to this post by Jack Krupansky-2
Any ideas? Doesn't that seem like a bug?

Re: Help with StopFilterFactory

Shawn Heisey-4
On 8/21/2014 7:25 AM, heaven wrote:
> Any ideas? Doesn't that seem like a bug?

I think it should have worked even with autoGeneratePhraseQueries
enabled by the older schema version.  The relative positions are the
same  -- it's 1,2,3 in the index and 2,3,4 in the query.  Absolute
positions don't matter, only relative.  I ran into the same behavior on
Solr 4.9.0 ... with a 1.5 schema version and your example, everything
works, but if I enable autoGeneratePhraseQueries, it stops working.

This probably needs to be filed in Jira, but let's wait for someone with
more experience to weigh in before taking that step.

Thanks,
Shawn
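
The position numbers mentioned above (1,2,3 in the index and 2,3,4 in the query) can be reproduced with a small model. This is a sketch under assumptions: positions are counted from 1 as on the Analysis page, and a removed stop word still consumes a position. The helper names `positions` and `gaps` are illustrative only.

```python
import re

STOPWORDS = {"http", "https", "ftp", "www"}

def positions(text):
    """Return (term, position) pairs; a dropped stop word still uses up a slot."""
    out = []
    pos = 0
    for tok in re.split(r"[^\w]+", text, flags=re.ASCII):
        if not tok:
            continue
        pos += 1  # every token advances the position, kept or not
        if tok.lower() in STOPWORDS:
            continue
        out.append((tok.lower(), pos))
    return out

def gaps(pairs):
    """Distances between consecutive kept terms (the relative positions)."""
    return [b[1] - a[1] for a, b in zip(pairs, pairs[1:])]

indexed = positions("twitter.com/testuser")
queried = positions("https://twitter.com/testuser")
print(indexed, gaps(indexed))
print(queried, gaps(queried))
```

In this model the absolute positions differ by one while the gaps between terms are identical, which is the basis for the expectation that the phrase should match.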


Re: Help with StopFilterFactory

Jack Krupansky-2
For the sake of completeness, please post the parsed query that you get when
you add the debug=true parameter. IOW, how Solr/Lucene actually interprets
the query itself.

-- Jack Krupansky
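
For readers unfamiliar with the debug parameter, here is a rough sketch of building such a request and reading the parsed query from the JSON response. The host, core name, and `df` value are assumptions for illustration; the sample response is hard-coded from the value reported later in this thread rather than fetched from a server.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; adjust host and core name to your installation.
BASE_URL = "http://localhost:8983/solr/collection1/select"

params = {
    "q": "twitter.com/testuser",
    "df": "url_words_ngram",  # default field, assumed from the debug output
    "wt": "json",
    "debug": "true",          # asks Solr to include the parsed query
}
url = BASE_URL + "?" + urlencode(params)
print(url)

# Shape of the relevant part of a JSON response, using the value from the thread.
response = {
    "debug": {
        "parsedquery_toString": '+(url_words_ngram:"? twitter com zer0sleep")'
    }
}
parsed = response["debug"]["parsedquery_toString"]
print(parsed)
```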

-----Original Message-----
From: Shawn Heisey
Sent: Thursday, August 21, 2014 10:03 AM
To: [hidden email]
Subject: Re: Help with StopFilterFactory

On 8/21/2014 7:25 AM, heaven wrote:
> Any ideas? Doesn't that seem like a bug?

I think it should have worked even with autoGeneratePhraseQueries
enabled by the older schema version.  The relative positions are the
same  -- it's 1,2,3 in the index and 2,3,4 in the query.  Absolute
positions don't matter, only relative.  I ran into the same behavior on
Solr 4.9.0 ... with a 1.5 schema version and your example, everything
works, but if I enable autoGeneratePhraseQueries, it stops working.

This probably needs to be filed in Jira, but let's wait for someone with
more experience to weigh in before taking that step.

Thanks,
Shawn


Re: Help with StopFilterFactory

heaven
This post was updated on .
In reply to this post by Shawn Heisey-4
With the 1.5 schema it does work, but not as expected. I am indexing twitter.com/testuser and only need exact matches, not documents that match just "twitter" or "com". So my search results should contain just one record:
* http://twitter.com/testuser

but what I see with 1.5 schema is:
* http://twitter.com/testuser
* http://twitter.com/otheruser (match by twitter and com)
* http://twitter.com/anotheruser
* etc., including all sites that match twitter and/or com (there are a lot of them, and all are unrelated).

Re: Help with StopFilterFactory

Shawn Heisey-4
On 8/21/2014 8:40 AM, heaven wrote:

> With 1.5 schema it does work but not as it is expected. I am indexing
> twitter.com/testuser and only need to get exact matches, not those that
> match "twitter" or "com". so my search results should contain just one
> record:
> * http://twitter.com/testuser
>
> but what I see with 1.5 schema is:
> * http://twitter.com/testuser
> * http://twitter.com/otheruser (match by twitter and com)
> * http://twitter.com/anotheruser
> * etc, including all sites that match twitter and/or com (and there's a lot,
> and all are unrelated).


If you set the q.op parameter to "AND", or issue a phrase query
(surrounded by quotes), that would do it.  Using the default operator
would still match if you searched for the following, but the phrase
query (same thing surrounded by quotes) would not:

testuser twittercom

Thanks,
Shawn


Re: Help with StopFilterFactory

Shawn Heisey-4
On 8/21/2014 9:52 AM, Shawn Heisey wrote:

> On 8/21/2014 8:40 AM, heaven wrote:
>> With 1.5 schema it does work but not as it is expected. I am indexing
>> twitter.com/testuser and only need to get exact matches, not those that
>> match "twitter" or "com". so my search results should contain just one
>> record:
>> * http://twitter.com/testuser
>>
>> but what I see with 1.5 schema is:
>> * http://twitter.com/testuser
>> * http://twitter.com/otheruser (match by twitter and com)
>> * http://twitter.com/anotheruser
>> * etc, including all sites that match twitter and/or com (and there's a lot,
>> and all are unrelated).
>
> If you set the q.op parameter to "AND", or issue a phrase query
> (surrounded by quotes), that would do it.  Using the default operator
> would still match if you searched for the following, but the phrase
> query (same thing surrounded by quotes) would not:
>
> testuser twittercom

There was a space between twitter and com when I wrote that.  I don't
know why it's not there in the mail on the list.

Thanks,
Shawn
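
The two options suggested above, q.op=AND or a quoted phrase, amount to different request parameters. A minimal sketch, assuming the standard `q` and `q.op` parameters; the helper names `and_query` and `phrase_query` are illustrative, and nothing is sent to a server.

```python
from urllib.parse import urlencode, parse_qs

def and_query(text):
    """Require every term to match, in any order."""
    return {"q": text, "q.op": "AND"}

def phrase_query(text):
    """Require the terms in order, by wrapping the input in double quotes."""
    escaped = text.replace('"', '\\"')
    return {"q": '"' + escaped + '"'}

# The reordered example from the message above: terms present, order wrong.
reordered = "testuser twitter com"
and_params = and_query(reordered)
phrase_params = phrase_query(reordered)
print(urlencode(and_params))
print(urlencode(phrase_params))
```

With q.op=AND the reordered input can still match, because only the presence of each term is required; the quoted form additionally constrains order and adjacency.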



Re: Help with StopFilterFactory

heaven
In reply to this post by Shawn Heisey-4
Unfortunately I can't change the operator, and a phrase query for "https://twitter.com/testuser" doesn't work either.

It does work for "twitter.com/testuser", but that makes no sense, since I could then simply use the old schema version or autoGeneratePhraseQueries=true and ask users to remove http/www from URLs manually. That raises a reasonable question: what is StopFilterFactory supposed to do if users still have to remove the blacklisted keywords themselves? It sounds like a bug to me, because the stop filter only prevents words from being added to the index, yet they still affect the search.

Phrases should be generated after solr.StopFilterFactory runs (if one is defined for the field). Or there should be another mechanism that removes blacklisted words as if they had never been there, so they simply disappear.

Re: Help with StopFilterFactory

Jack Krupansky-2
I think somehow the discussion has gotten confused, so we really need to
start over.

1. Make sure you're using the most current schema version.
2. Make sure autoGeneratePhraseQueries is set explicitly the way you want
it, based on #1 above.
3. Yes, stop filter should remove stop words. No question. If it isn't, let's
track down and see why and report a bug if necessary.
4. Restate the problem, very clearly, in plain English (after performing
steps #1 and #2). Please reread your reply carefully before clicking the
send button and make sure you are using negatives properly - you've confused
the discussion here by failing to do so on at least one occasion, and
possibly in this latest response although I can't tell for sure.
5. We'll then confirm any mistakes you've made, offer recommendations, and
determine whether there are any bugs.

Fair enough?

-- Jack Krupansky

-----Original Message-----
From: heaven
Sent: Sunday, August 24, 2014 11:02 AM
To: [hidden email]
Subject: Re: Help with StopFilterFactory

Unfortunately I can't change the operator and phrase query for
"https://twitter.com/testuser" doesn't work as well.

It does work for "twitter.com/testuser" but that makes no sense since I then
can simply use old schema version or autoGeneratePhraseQueries=true and ask
users to remove http/www from urls manually. But then I have a reasonable
question, what then the StopFilterFactory is supposed to do if users still
have to remove blacklisted keywords? It sounds like a bug to me because stop
filter factory only prevents words from being added to the index, but they
still affect search.

It should generate phrases after solr.StopFilterFactory (if one is defined
for a field). Or there should be another mechanism to remove blacklisted
words like if there were no such words at all so they simply disappear.





Re: Help with StopFilterFactory

heaven
This post was updated on .
I don't see any confusion; the problem is clearly explained in the first post. The one thing that confused me was autoGeneratePhraseQueries and my schema version: I didn't know about that attribute or that its default behavior could differ per schema version. I think we have now figured that out, and I am using the most recent schema version, 1.5, with autoGeneratePhraseQueries="true" (so the behavior should be exactly the same as with the schema version 1.0 I had before).

With autoGeneratePhraseQueries="false" I get unexpected results, e.g. all those that match only partially, like only by "twitter" and/or "com".

Following your steps:
1. Schema version is 1.5
2. autoGeneratePhraseQueries is set to true.
3. It seems it does, but that doesn't work as expected and those words still affect the search.
4. If I index twitter.com/testuser and search for https://twitter.com/testuser, I get 0 matches even though "https" should be filtered out by the StopFilterFactory.

Re: Help with StopFilterFactory

heaven
Just a guess, but it seems that automatic phrase generation and the stop filter factory don't know about each other.

Here's the current field configuration:
{code}
<fieldType name="words_ngram" class="solr.TextField" omitNorms="false" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[^\w]+" />
    <filter class="solr.StopFilterFactory" words="url_stopwords.txt" ignoreCase="true" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="40" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[^\w]+" />
    <filter class="solr.StopFilterFactory" words="url_stopwords.txt" ignoreCase="true" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>
{code}
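
The index-time EdgeNGramFilterFactory in this configuration expands each term into its prefixes. As a rough illustration of that idea only, with the hypothetical helper `edge_ngrams` and no claim about how Solr assigns positions to the generated grams:

```python
def edge_ngrams(term, min_gram=2, max_gram=40):
    """Front-anchored prefixes of `term`, from min_gram up to max_gram characters."""
    upper = min(len(term), max_gram)
    return [term[:n] for n in range(min_gram, upper + 1)]

grams = edge_ngrams("twitter")
print(grams)
```

Because the query analyzer has no n-gram filter, a query term such as "twit" would be compared as-is against these indexed prefixes.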

Re: Help with StopFilterFactory

Jack Krupansky-2
In reply to this post by heaven
If autoGeneratePhraseQueries="true" (which I endorse) is working, then
what's the problem?

I mean, the only problem you mention is with
autoGeneratePhraseQueries="false", which is clearly NOT what you want.

Once again, I have to reiterate that the situation here remains very
confused, mostly from poor use of language.

It only adds to the confusion when you say things like "doesn't work",
rather than taking a constructive attitude of telling us the expected
results vs. the actual results.

And I think I did request that you add the debug=true query parameter and
post the parsed query so that we can see what was really generated for the
query.

-- Jack Krupansky

-----Original Message-----
From: heaven
Sent: Sunday, August 24, 2014 12:04 PM
To: [hidden email]
Subject: Re: Help with StopFilterFactory

I don't see any confusions, the problem is clearly explained in the first
post. The one confusion I had was with the autoGeneratePhraseQueries and my
schema version, I didn't know about that attribute and that its behavior
could differ per schema version. I think we now figured that out and I am
using the most recent 1.5 schema version with
autoGeneratePhraseQueries="true" (so the behavior should be exactly the same
as for schema version 1 that I had before).

With autoGeneratePhraseQueries="false" I get unexpected results, e.g. all
those that match only partially, like only by "twitter" and/or "com".

Following your steps:
1. Schema version is 1.5
2. autoGeneratePhraseQueries is set to true.
3. It seems it does, but that doesn't work as expected and those words still
affect the search.
4. if I index twitter.com/testuser and search for
https://twitter.com/testuser I am getting 0 matches even though "https"
should be filtered out by the StopFilterFactory.





Re: Help with StopFilterFactory

heaven
The problem is in #4:
>> 4. if I index twitter.com/testuser and search for https://twitter.com/testuser I am getting 0 matches even though "https" should be filtered out by the StopFilterFactory.

When I said that the stop filter factory "doesn't work", I meant that blacklisted words still somehow affect the search. My guess is that when autoGeneratePhraseQueries is set to true, Solr generates phrases before the blacklisted words are removed. That's how it looks from the search results (see the first post).

My first post still describes the problem completely; all we can add to it now is that the schema version is 1.5 and autoGeneratePhraseQueries is set to true.

I haven't forgotten about the debug output; I will be able to add it tomorrow morning.

Re: Help with StopFilterFactory

Jack Krupansky-2
Just to confirm, the generated phrase query is generated using the analyzed
terms, so if the stop filter is removing the terms, they won't appear in the
generated query. It will be interesting to see what does get generated.

-- Jack Krupansky

-----Original Message-----
From: heaven
Sent: Sunday, August 24, 2014 12:47 PM
To: [hidden email]
Subject: Re: Help with StopFilterFactory

The problem is in #4:
>> 4. if I index twitter.com/testuser and search for
>> https://twitter.com/testuser I am getting 0 matches even though "https"
>> should be filtered out by the StopFilterFactory.

When I said that the stop filter factory "doesn't work" I mentioned that
blacklisted words still somehow affect the search. My guess is that when
autoGeneratePhraseQueries is set to true Solr generates phrases before
blacklisted words were removed. That's how it feels looking at search
results (see the first post).

My first post still describes the problem completely, what we can add to it
now is that schema version is 1.5 and autoGeneratePhraseQueries is set to
true.

I remember about the debug output, will be able to add it tomorrow morning.





Re: Help with StopFilterFactory

heaven
This post was updated on .
A valid search: http://pastie.org/pastes/9500661/text?key=rgqj5ivlgsbk1jxsudx9za
An invalid search: http://pastie.org/pastes/9500662/text?key=b4zlh2oaxtikd8jvo5xaww

The odd thing I found is that the valid query has:
"parsedquery_toString": "+(url_words_ngram:\"twitter com zer0sleep\")"
And the invalid one has:
"parsedquery_toString": "+(url_words_ngram:\"? twitter com zer0sleep\")"

So the "https" part was replaced with a "?".
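
One way to picture why a leading "?" could prevent a match is a toy phrase matcher in which the hole must line up with a real slot in the document. This is only a sketch of that hypothesis, not Lucene's actual matching code; the helpers `parse_phrase` and `phrase_matches` are invented for illustration.

```python
def parse_phrase(phrase):
    """Turn 'a b c' or '? a b c' into (term, offset) pairs; '?' marks a hole."""
    return [(t, i) for i, t in enumerate(phrase.split()) if t != "?"]

def phrase_matches(query, doc_terms):
    """True if some start slot in the doc puts every query term at start + offset."""
    for start in range(len(doc_terms)):
        if all(start + off < len(doc_terms) and doc_terms[start + off] == term
               for term, off in query):
            return True
    return False

doc = ["twitter", "com", "zer0sleep"]
valid = phrase_matches(parse_phrase("twitter com zer0sleep"), doc)
invalid = phrase_matches(parse_phrase("? twitter com zer0sleep"), doc)
padded = phrase_matches(parse_phrase("? twitter com zer0sleep"), ["x"] + doc)
print(valid, invalid, padded)
```

In this model the query with the hole fails against a document whose first term is "twitter", because nothing in the document sits in front of it to fill the hole, and succeeds once any term precedes it.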

Re: Help with StopFilterFactory

Jack Krupansky-2
Interesting. First, an apology for an error in my e-book - it says that the
enablePositionIncrements parameter for the stop filter defaults to "false",
but it actually defaults to "true". The question mark represents a "position
increment". In your case you don't want position increments, so add the
enablePositionIncrements="false" parameter to the stop filter, and be sure
to reindex your data. The position increment leaves a "hole" where each stop
word was removed. The question mark represents the hole. All bets are off as
to what phrase query does when the phrase starts with a hole. I think the
basic idea is that there must be some term in the index at that position
that can be "skipped".

This is actually a change in behavior, which occurred as a side effect of
LUCENE-4963 in 4.4. The default for enablePositionIncrements was false, but
that release changed it to true.

I suspect that I wrote that section of my e-book before 4.4 came out.
Unfortunately, the change is not well documented - nothing in the Javadoc,
and this is another example of where an underlying change in Lucene that
impacts Solr users is not well highlighted for Solr users. Sorry about that.

In any case, try adding enablePositionIncrements="false", reindex, and see
what happens.

-- Jack Krupansky
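
The effect of the suggested setting can be modeled with a flag on a toy stop filter. This is a sketch under assumptions: positions are counted from 1, and the Python parameter `enable_position_increments` merely mirrors the attribute name; it is not the Solr implementation.

```python
import re

STOPWORDS = {"http", "https", "ftp", "www"}

def analyze(text, enable_position_increments=True):
    """Return (term, position) pairs for the toy analyzer chain."""
    out = []
    pos = 0
    for tok in re.split(r"[^\w]+", text, flags=re.ASCII):
        if not tok:
            continue
        is_stop = tok.lower() in STOPWORDS
        if is_stop and not enable_position_increments:
            continue  # drop the stop word without leaving a hole
        pos += 1      # otherwise the token, kept or not, consumes a position
        if not is_stop:
            out.append((tok.lower(), pos))
    return out

with_holes = analyze("https://twitter.com/testuser", True)
without_holes = analyze("https://twitter.com/testuser", False)
print(with_holes)
print(without_holes)
```

With the flag off, the query terms land on the same positions as the indexed "twitter.com/testuser", so no hole appears at the start of the phrase. Whether Solr accepts the attribute may depend on the configured luceneMatchVersion, so treat the flag here purely as a model.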

-----Original Message-----
From: heaven
Sent: Monday, August 25, 2014 3:37 AM
To: [hidden email]
Subject: Re: Help with StopFilterFactory

A valid search:
http://pastie.org/pastes/9500661/text?key=rgqj5ivlgsbk1jxsudx9za
An Invalid search:
http://pastie.org/pastes/9500662/text?key=b4zlh2oaxtikd8jvo5xaww

The odd thing I found is that the valid query has:
"parsedquery_toString": "+(url_words_ngram:\"twitter com zer0sleep\")"
And the invalid one has:
"parsedquery_toString": "+(url_words_ngram:\"? twitter com zer0sleep\")"

So the "https" part was replaced with a "?".



