WordDelimiter filter, expanding to multiple words, unexpected results


Jonathan Rochkind
Hello, I'm running into a case where a query is not returning the
results I expect, and I'm hoping someone can offer some explanation that
might help me fine tune things or understand what's up.

I am running Solr 4.3.

My filter chain includes a WordDelimiterFilter and, later, a filter that
downcases everything for case-insensitive searching. It includes many
other things too, but I think these are the pertinent facts.

For query "dELALAIN", the WordDelimiterFilter splits into:

text: d
start: 0
position: 1

text: ELALAIN
start: 1
position: 2

text: dELALAIN
start: 0
position: 2

Note the duplication/overlap of the tokens -- one version with "d" and
"ELALAIN" split into two tokens, and another with just one token.

Later, all the tokens are lowercased by another filter in the chain.
(actually an ICU filter which is doing something more complicated than
just lowercasing, but I think we can consider it lowercasing for the
purposes of this discussion).

If I understand what the WordDelimiterFilter is doing here, it's
treating the lowercase "d" followed by an uppercase letter as a special
case. (I don't get this behavior with other mixed-case queries that
don't begin with 'd'.)

And what I think it's trying to do is match text indexed as "d
elalain" as well as text indexed as "delalain".

The problem is, it's not accomplishing that -- it is NOT matching text
that was indexed as "delalain" (one token).

I don't entirely understand what the "position" attribute is for -- but
I wonder if in this case, the position on "dELALAIN" is really supposed
to be 1, not 2?  Could that be responsible for the bug?  Or is position
irrelevant in this case?
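
On the position question: in Lucene, a token's position is the running sum of the position increments of the tokens emitted so far, and an increment of 0 stacks a token on top of the previous one. A minimal sketch of that bookkeeping follows. The increments are inferred from the positions listed above, not read out of Solr, and the function name is made up for illustration.

```python
from itertools import accumulate

# (term, position_increment) pairs. The increments are an assumption,
# chosen so that they reproduce the positions reported above (1, 2, 2).
tokens = [("d", 1), ("ELALAIN", 1), ("dELALAIN", 0)]


def absolute_positions(tokens):
    """Turn per-token position increments into absolute positions."""
    positions = accumulate(inc for _, inc in tokens)
    return [(term, pos) for (term, _), pos in zip(tokens, positions)]


for term, pos in absolute_positions(tokens):
    print(f"{term}: position {pos}")
```

Under this reading, "dELALAIN" at position 2 means it shares a slot with "ELALAIN" and comes after "d". Whether that matters depends on whether the query parser builds a position-sensitive (phrase-style) query from these tokens.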

If that's not it, then I'm at a loss as to what may be causing this bug
-- or even if it's a bug at all, or I'm just not understanding intended
behavior. I expect a query for "dELALAIN" to match text indexed as
"delalain" (because of the forced lowercasing in the filter chain). But
it's not doing so. Are my expectations wrong? Bug? Something else?

Thanks for any advice,

Jonathan

Re: WordDelimiter filter, expanding to multiple words, unexpected results

Michael Della Bitta-2
Hi Jonathan,

Little confused by this line:

> And, what I think it's trying to do, is match text indexed as "d elalain"
> as well as text indexed by "delalain".

In this case, I don't know how WordDelimiterFilter will help, as you're
likely tokenizing on spaces somewhere, and that input text has a space. I
could be wrong. It's probably best if you post your field definition from
your schema.

Also, is this a free-text field, or something that's more like a short
string?

Thanks,


Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>



Re: WordDelimiter filter, expanding to multiple words, unexpected results

Jonathan Rochkind
Thanks for the response.

I understand the problem a little bit better after investigating more.

Posting my full field definitions is, I think, going to be confusing, as
they are long and complicated. I can narrow it down to an isolated case
if I need to. The indexed field in question holds relatively short strings.

But it has to do with WordDelimiterFilter's default
splitOnCaseChange=1 and generateWordParts=1, and their effects.

Let's take a less confusing example, query "MacBook". With a
WordDelimiterFilter followed by something that downcases everything.

I think what the WDF (followed by case folding) is trying to do is make
query "MacBook" match both indexed text "mac book" as well as "macbook"
-- either one should be a match. Is my understanding right of what
WordDelimiterFilter with splitOnCaseChange=1 and generateWordParts=1 is
intended to do?

In my actual index, query "MacBook" is matching ONLY "mac book", and not
"macbook".  Which is unexpected. I indeed want it to match both. (I
realize I could make it match only 'macbook' by setting
splitOnCaseChange=0 and/or generateWordParts=0).
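
To make the splitting concrete, here is a rough sketch of the case-change behavior only: a new part starts where a lowercase letter is followed by an uppercase one. It is a simplified model, not the filter's implementation (the real filter also handles punctuation, digits and more). The function names and the `catenate_words` flag are hypothetical stand-ins, the latter for whichever option emits the joined token.

```python
import re


def split_on_case_change(token):
    """Split where a lowercase letter is immediately followed by an uppercase one."""
    return re.split(r"(?<=[a-z])(?=[A-Z])", token)


def word_delimiter_sketch(token, catenate_words=True):
    """Return the word parts, plus the joined form when catenate_words is set."""
    parts = split_on_case_change(token)
    output = list(parts)
    if catenate_words and len(parts) > 1:
        output.append("".join(parts))
    return output


# Lowercase afterwards, as the later folding filter in the chain would.
for query in ("MacBook", "dELALAIN", "Delalain"):
    print(query, "->", [t.lower() for t in word_delimiter_sketch(query)])
```

In this model "Delalain" is left alone because it has no lowercase-to-uppercase transition, which would be consistent with only some mixed-case queries being split.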

It's possible this is happening as a side effect of other parts of my
complex field definition, and I really do need to post the whole thing
and/or isolate it. But I wonder if there are known general problem cases
that cause this kind of failure, or any known bugs in
WordDelimiterFilter (in Solr 4.3?) that cause this kind of failure.

And I wonder if WordDelimiter filter spitting out the token "MacBook"
with position "2" rather than "1" is expected, irrelevant, or possibly a
relevant problem.

Thanks again,

Jonathan


Re: WordDelimiter filter, expanding to multiple words, unexpected results

Michael Della Bitta-2
If that's your problem, I bet all you have to do is twiddle on one of the
catenate options, either catenateWords or catenateAll.

Michael Della Bitta




Re: WordDelimiter filter, expanding to multiple words, unexpected results

Jonathan Rochkind
Yes, thanks, I realize I can twiddle those parameters, but it will
probably result in "MacBook" no longer matching "mac book" at all, but
ONLY matching "macbook".

My understanding of the default settings of WordDelimiterFilterFactory is
that they are intended to make "MacBook" match both "mac book" AND "macbook".

I will try to create an isolated reproduction that demonstrates this
while ruling out interference from other filters (or identifying the
filters responsible), to make my question clearer, I guess.

Jonathan


Re: WordDelimiter filter, expanding to multiple words, unexpected results

Erick Erickson
In reply to this post by Michael Della Bitta-2
bq: In my actual index, query "MacBook" is matching ONLY "mac book", and
not "macbook"

I suspect your query-side parameters for WordDelimiterFilterFactory don't
have catenateWords set.

What do you see when you enter these in both the index and query portions
of the admin/analysis page?

Best,
Erick



Re: WordDelimiter filter, expanding to multiple words, unexpected results

Jonathan Rochkind
On 9/2/14 1:51 PM, Erick Erickson wrote:
> bq: In my actual index, query "MacBook" is matching ONLY "mac book", and
> not "macbook"
>
> I suspect your query parameters for WordDelimiterFilterFactory doesn't have
> catenate words set.
>
> What do you see when you enter these in both the index and query portions
> of the admin/analysis page?

Thanks Erick!

Our WordDelimiterFilterFactory does have catenate words set, in both
index and query phases (is that right?):

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
generateNumberParts="1" catenateWords="1" catenateNumbers="1"
catenateAll="0" splitOnCaseChange="1"/>

It's hard to cut and paste the results of the analysis page into email
(or anywhere!), I'll give you screenshots, sorry -- and I'll give them
for our whole real world app complex field definition. I'll also paste
in our entire field definition below. But I realize my next step is
probably creating a simpler isolation/reproduction case (unless you have
a magic answer from this!).

Again, the problem is that "MacBook" seems to be matching only
indexed "mac book" and not indexed "macbook".


"MacBook" query analysis:
https://www.dropbox.com/s/b8y11usjdlc88un/mixedcasequery.png

"MacBook" index analysis:
https://www.dropbox.com/s/fwae3nz4tdtjhjv/mixedcaseindex.png

"mac book" index analysis:
https://www.dropbox.com/s/mihd58f6zs3rfu8/twowordindex.png


Our entire actual field definition:

   <fieldType name="text" class="solr.TextField"
positionIncrementGap="100" autoGeneratePhraseQueries="true">
       <analyzer>
        <!-- the rulefiles thing is to keep ICUTokenizerFactory from
stripping punctuation,
             so our synonym filter involving C++ etc can still work.
             From:
https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201305.mbox/%3C51965E70.6070409@...%3E
             the rbbi file is in our local ./conf, copied from lucene
source tree -->
        <tokenizer class="solr.ICUTokenizerFactory"
rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>

        <filter class="solr.SynonymFilterFactory"
synonyms="punctuation-whitelist.txt" ignoreCase="true"/>

         <filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="1"
catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>


         <!-- folding needs to be after WordDelimiter, so WordDelimiter
              can do its thing with full cases and such -->
         <filter class="solr.ICUFoldingFilterFactory" />


         <!-- ICUFolding already includes lowercasing, no
              need for separate lowercasing step
         <filter class="solr.LowerCaseFilterFactory"/>
         -->

         <filter class="solr.SnowballPorterFilterFactory"
language="English" protected="protwords.txt"/>
         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
       </analyzer>
     </fieldType>





Re: WordDelimiter filter, expanding to multiple words, unexpected results

Erick Erickson
What happens if you append &debug=query to your query? IOW, what does the
_parsed_ query look like?

Also note that the defaults for WDFF are _not_ identical. catenateWords and
catenateNumbers are 1 in the index portion and 0 in the query section.
Still, this shouldn't be a problem all other things being equal.

Best,
Erick



Re: WordDelimiter filter, expanding to multiple words, unexpected results

aiguofer
In reply to this post by Jonathan Rochkind
Although not a solution, this may help in trying to find the problem.
In http://solr.pl/en/2010/08/16/what-is-schema-xml/ it says:

"It is worth noting that there is an additional attribute for the text field type:

    autoGeneratePhraseQueries

This attribute is responsible for telling filters how to behave when dividing tokens. Some filters (such as WordDelimiterFilter) can divide tokens into a set of tokens. Setting the attribute to true (default value) will automatically generate phrase queries. This means that WordDelimiterFilter will divide the word “wi-fi” into two tokens “wi” and “fi”. With autoGeneratePhraseQueries set to true query sent to Lucene will look like "field:wi fi", while with set to false Lucene query will look like field:wi OR field:fi. However, please note, that this attribute only behaves well with tokenizers based on white spaces."

Since phrases are built from token positions, it is possible that the positions assigned to the other generated tokens have something to do with it. Have you tried setting autoGeneratePhraseQueries="false" to see if it'll match both? (I know that might have other unintended effects, but it might give some insight into the problem.)
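
To make that experiment concrete, here is a minimal sketch of the field type with only that one attribute flipped. The filter chain is trimmed to the tokenizer, WordDelimiterFilter and folding filter for brevity, which is an assumption; the real definition has more filters. With phrase generation off, the parsed query shown by &debug=query should become a boolean combination of the generated terms rather than a phrase, though the exact form may depend on the Solr version and the default operator.

```xml
<!-- Sketch only: same field type name as in the thread, with
     autoGeneratePhraseQueries switched to "false" for the experiment.
     The filter chain is trimmed for brevity (assumption). -->
<fieldType name="text" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="false">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```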

Diego Fernandez - 爱国
Software Engineer
US GSS Supportability - Diagnostics




Re: WordDelimiter filter, expanding to multiple words, unexpected results

Jonathan Rochkind
In reply to this post by Erick Erickson
Thanks Erick and Diego. Yes, I noticed in my last message I'm not
actually using defaults, not sure why I chose non-defaults originally.

I still need to find time to make a smaller isolation/reproduction case,
I'm getting confusing results that suggest some other part of my field
def may be pertinent.

I'll come back when I've done that (hopefully next week), and include
the _parsed_ query from &debug=query then. Thanks!

Jonathan



Re: WordDelimiter filter, expanding to multiple words, unexpected results

Erick Erickson
Jonathan:

If at all possible, delete your collection/data directory (the whole
directory, including data) between runs after you've changed
your schema (at least any of your analysis that pertains to indexing).
Mixing old and new schema definitions can add to the confusion!

Good luck!
Erick


Re: WordDelimiter filter, expanding to multiple words, unexpected results

Jonathan Rochkind
Okay, some months later I've come back to this with an isolated
reproduction case. Thanks very much for any advice or debugging help you
can give.

The WordDelimiter filter is making a mixed-case query NOT match the
single-case source, when it ought to.

I am in Solr 4.3 (sorry, that's what we run; let me know if it makes no
sense to debug here, and I need to install and try to reproduce on a
more recent version).

I have an index that includes ONE document (deleted and reindexed after
index change), with content in only one field ("text") other than 'id',
and that content is one word: "delalain".

My analysis (both index and query, I don't have different ones) for the
'text' field is simply:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100"
autoGeneratePhraseQueries="true">
       <analyzer>
         <tokenizer class="solr.ICUTokenizerFactory" />

         <filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" catenateWords="1" splitOnCaseChange="1"/>

         <filter class="solr.ICUFoldingFilterFactory" />
       </analyzer>
</fieldType>

I am querying simply with, e.g., /select?defType=lucene&q=text%3Adelalain

Querying for "delalain" finds this document, as expected. Querying for
"DELALAIN" finds this document, as expected (note the ICUFoldingFilterFactory).

However, querying for "deLALAIN" does not find this document, which is
unexpected.

INDEX analysis of the source, "delalain", is pretty straightforward, so
I'll paste only the final stage of the index analysis:

######
text delalain
raw_bytes [64 65 6c 61 6c 61 69 6e]
position 1
start 0
end 8
type <ALPHANUM>
script Latin
#######




QUERY analysis of the problematic query, "deLALAIN", looks like this:

#####
ICUT   text       deLALAIN
       raw_bytes  [64 65 4c 41 4c 41 49 4e]
       start      0
       end        8
       type       <ALPHANUM>
       script     Latin
       position   1


WDF    text       de          LALAIN               deLALAIN
       raw_bytes  [64 65]     [4c 41 4c 41 49 4e]  [64 65 4c 41 4c 41 49 4e]
       start      0           2                    0
       end        2           8                    8
       type       <ALPHANUM>  <ALPHANUM>           <ALPHANUM>
       position   1           2                    2
       script     Common      Common               Common


ICUFF  text       de          lalain               delalain
       raw_bytes  [64 65]     [6c 61 6c 61 69 6e]  [64 65 6c 61 6c 61 69 6e]
       position   1           2                    2
       start      0           2                    0
       end        2           8                    8
       type       <ALPHANUM>  <ALPHANUM>           <ALPHANUM>
       script     Common      Common               Common
#######



It's obviously the WordDelimiterFilter that is messing things up -- but
how/why, and is it a bug?

It wants to search for both "de lalain" as a phrase, as well as
alternately "delalain" as one word -- that's the intended supported
point of the WDF with this configuration, right? And should work?

The problem is that it is not successfully matching "delalain" as one word
-- so, how to figure out why not and what to do about it?

Previously, Erick and Diego asked for the info from &debug=query, so
here is that as well:

####
<lst name="debug">
   <str name="rawquerystring">text:deLALAIN</str>
   <str name="querystring">text:deLALAIN</str>
   <str name="parsedquery">MultiPhraseQuery(text:"de (lalain
delalain)")</str>
   <str name="parsedquery_toString">text:"de (lalain delalain)"</str>
   <str name="QParser">LuceneQParser</str>
</lst>
####

Hmm, that does not look quite right. If I interpret it correctly, it's
looking for "de" followed by either "lalain" or "delalain". I.e., it would
match "de delalain"? But that's not right at all.

So, what's gone wrong? Something with WDF when configured with
generateWordParts/catenateWords/splitOnCaseChange? Is it a bug? (And if it's
a bug, one that might be fixed in a more recent Solr?)

Thanks!

Jonathan





Re: WordDelimiter filter, expanding to multiple words, unexpected results

Jack Krupansky-3
WDF is powerful, but it is not magic. In general, the indexed data is
expected to be clean while the query might be sloppy. You need to separate
the index and query analyzers and they need to respect that distinction -
the index analyzer would index as you have indicated, indexing both the
unitary term and the multi-term phrase, while the query analyzer would NOT
do the split on case, so that the query could be a unitary term (possibly
with mixed case, but that would not split the term) or could be a two-word
phrase.
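
A minimal sketch of that split, assuming the reduced field type from the reproduction case. The index analyzer keeps the posted WordDelimiterFilter settings; the query-side values (splitOnCaseChange="0", catenateWords="0") are one illustrative reading of the advice above, not a tested configuration.

```xml
<fieldType name="text" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <!-- Index side: split on case change and also index the catenated form -->
  <analyzer type="index">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" catenateWords="1" splitOnCaseChange="1"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
  <!-- Query side (assumed values): no case-change split, so a query such as
       "deLALAIN" stays one token before folding -->
  <analyzer type="query">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" catenateWords="0" splitOnCaseChange="0"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```

With a setup along these lines, the query deLALAIN should analyze to the single token "delalain"; the admin/analysis page and &debug=query can confirm what the query side actually produces.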

-- Jack Krupansky



>>>>> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>>>>> generateNumberParts="1" catenateWords="1" catenateNumbers="1"
>>>>> catenateAll="0" splitOnCaseChange="1"/>
>>>>>
>>>>> It's hard to cut and paste the results of the analysis page into email
>>>>> (or
>>>>> anywhere!), I'll give you screenshots, sorry -- and I'll give them for
>>>>> our
>>>>> whole real world app complex field definition. I'll also paste in our
>>>>> entire field definition below. But I realize my next step is probably
>>>>> creating a simpler isolation/reproduction case (unless you have a magic
>>>>> answer from this!).
>>>>>
>>>>> Again, the problem is that "MacBook" seems to be only matching on
>>>>> indexed
>>>>> "macbook" and not indexed "mac book".
>>>>>
>>>>>
>>>>> "MacBook" query analysis:
>>>>> https://www.dropbox.com/s/b8y11usjdlc88un/mixedcasequery.png
>>>>>
>>>>> "MacBook" index analysis:
>>>>> https://www.dropbox.com/s/fwae3nz4tdtjhjv/mixedcaseindex.png
>>>>>
>>>>> "mac book" index analysis:
>>>>> https://www.dropbox.com/s/mihd58f6zs3rfu8/twowordindex.png
>>>>>
>>>>>
>>>>> Our entire actual field definition:
>>>>>
>>>>>     <fieldType name="text" class="solr.TextField"
>>>>> positionIncrementGap="100"
>>>>> autoGeneratePhraseQueries="true">
>>>>>         <analyzer>
>>>>>          <!-- the rulefiles thing is to keep ICUTokenizerFactory from
>>>>> stripping punctuation,
>>>>>               so our synonym filter involving C++ etc can still work.
>>>>>               From: https://mail-archives.apache.
>>>>> org/mod_mbox/lucene-solr-user/201305.mbox/%3C51965E70.
>>>>> [hidden email]%3E
>>>>>               the rbbi file is in our local ./conf, copied from lucene
>>>>> source tree -->
>>>>>          <tokenizer class="solr.ICUTokenizerFactory"
>>>>> rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
>>>>>
>>>>>          <filter class="solr.SynonymFilterFactory"
>>>>> synonyms="punctuation-whitelist.txt"
>>>>> ignoreCase="true"/>
>>>>>
>>>>>           <filter class="solr.WordDelimiterFilterFactory"
>>>>> generateWordParts="1" generateNumberParts="1" catenateWords="1"
>>>>> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>>>>>
>>>>>
>>>>>           <!-- folding needs to be after WordDelimiter, so WordDelimiter
>>>>>                can do its thing with full cases and such -->
>>>>>           <filter class="solr.ICUFoldingFilterFactory" />
>>>>>
>>>>>
>>>>>           <!-- ICUFolding already includes lowercasing, no
>>>>>                need for separate lowercasing step
>>>>>           <filter class="solr.LowerCaseFilterFactory"/>
>>>>>           -->
>>>>>
>>>>>           <filter class="solr.SnowballPorterFilterFactory"
>>>>> language="English" protected="protwords.txt"/>
>>>>>           <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>>>>         </analyzer>
>>>>>       </fieldType>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>

Re: WordDelimiter filter, expanding to multiple words, unexpected results

Alexandre Rafalovitch
In reply to this post by Jonathan Rochkind
> splitOnCaseChange="1"

So, "delalain" does not get split during indexing because there is no case
change. But "deLALAIN" does get split during search, so you are now looking
for partial tokens ("de", "lalain") against a single combined token in the
index, and they do not match.

The WordDelimiterFilterFactory is more for product IDs that have many
spelling variants. Your use case seems to be more about simple
case-insensitive matching (looking at the last email only).
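
If case-insensitive matching is really all that is needed, one way to sketch that is a field type with no WordDelimiterFilter at all. This is only an illustration, not something taken from Jonathan's schema: the fieldType name "text_folded" is made up, and it assumes the ICU analysis jars are already available, as they are in the reproduction case.

```xml
<!-- Sketch only. "text_folded" is a hypothetical name; the two factories are
     the same ones used in the reproduction case. -->
<fieldType name="text_folded" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- word-level tokens; no splitting inside a word on case changes -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- case folding, so deLALAIN and delalain produce the same token -->
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```

With a chain like this, "delalain", "DELALAIN" and "deLALAIN" should all analyze to the single token "delalain" at both index and query time. The trade-off is that nothing is split any more, so a query for "deLALAIN" would not match text indexed as "de lalain".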

Regards,
   Alex.
----
Sign up for my Solr resources newsletter at http://www.solr-start.com/


On 29 December 2014 at 17:12, Jonathan Rochkind <[hidden email]> wrote:

> Okay, some months later I've come back to this with an isolated reproduction
> case. Thanks very much for any advice or debugging help you can give.
>
> The WordDelimiter filter is making a mixed-case query NOT match the
> single-case source, when it ought to.
>
> I am in Solr 4.3 (sorry, that's what we run; let me know if it makes no
> sense to debug here, and I need to install and try to reproduce on a more
> recent version).
>
> I have an index that includes ONE document (deleted and reindexed after
> index change), with content in only one field ("text") other than 'id', and
> that content is one word: "delalain".
>
> My analysis (both index and query, I don't have different ones) for the
> 'text' field is simply:
>
> <fieldType name="text" class="solr.TextField" positionIncrementGap="100"
> autoGeneratePhraseQueries="true">
>       <analyzer>
>         <tokenizer class="solr.ICUTokenizerFactory" />
>
>         <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="1" catenateWords="1" splitOnCaseChange="1"/>
>
>         <filter class="solr.ICUFoldingFilterFactory" />
>       </analyzer>
> </fieldType>
>
> I am querying simply with eg /select?defType=lucene&q=text%3Adelalain
>
> Querying for "delalain" finds this document, as expected. Querying for
> "DELALAIN" finds this document, as expected (note the ICUFoldingFactory).
>
> However, querying for "deLALAIN" does not find this document, which is
> unexpected.
>
> INDEX analysis of the source, "delalain", ends in this in the index, which
> seems pretty straightforward, so I'll only bother pasting in the final index
> analysis:
>
> ######
> text    delalain
> raw_bytes       [64 65 6c 61 6c 61 69 6e]
> position        1
> start   0
> end     8
> type    <ALPHANUM>
> script  Latin
> #######
>
>
>
>
> QUERY analysis of the problematic query, "deLALAIN", looks like this:
>
> #####
> ICUT    text    deLALAIN
>         raw_bytes       [64 65 4c 41 4c 41 49 4e]
>         start   0
>         end     8
>         type    <ALPHANUM>
>         script  Latin
>         position        1
>
>
> WDF     text    de      LALAIN  deLALAIN
>         raw_bytes       [64 65] [4c 41 4c 41 49 4e]     [64 65 4c 41 4c 41 49 4e]
>         start   0       2       0
>         end     2       8       8
>         type    <ALPHANUM>      <ALPHANUM>      <ALPHANUM>
>         position        1       2       2
>         script  Common  Common  Common
>
>
> ICUFF   text    de      lalain  delalain
>         raw_bytes       [64 65] [6c 61 6c 61 69 6e]     [64 65 6c 61 6c 61 69 6e]
>         position        1       2       2
>         start   0       2       0
>         end     2       8       8
>         type    <ALPHANUM>      <ALPHANUM>      <ALPHANUM>
>         script  Common  Common  Common
> #######
>
>
>
> It's obviously the WordDelimiterFilter that is messing things up -- but
> how/why, and is it a bug?
>
> It wants to search for both "de lalain" as a phrase, as well as alternately
> "delalain" as one word -- that's the intended supported point of the WDF
> with this configuration, right? And should work?
>
> The problem is that it is not successfully matching "delalain" as one word --
> so, how to figure out why not and what to do about it?
>
> Previously, Erick and Diego asked for the info from &debug=query, so here is
> that as well:
>
> ####
> <lst name="debug">
>   <str name="rawquerystring">text:deLALAIN</str>
>   <str name="querystring">text:deLALAIN</str>
>   <str name="parsedquery">MultiPhraseQuery(text:"de (lalain
> delalain)")</str>
>   <str name="parsedquery_toString">text:"de (lalain delalain)"</str>
>   <str name="QParser">LuceneQParser</str>
> </lst>
> ####
>
> Hmm, that does not look quite right. If I interpret that correctly, it's
> looking for "de" followed by either "lalain" or "delalain". I.e., it would
> match "de delalain"?  But that's not right at all.
>
> So, what's gone wrong? Something with WDF with configuration to
> generateWords/catenateWords/splitOnCaseChange? Is it a bug? (And if it's a
> bug, one that might be fixed in a more recent Solr?).
>
> Thanks!
>
> Jonathan

Re: WordDelimiter filter, expanding to multiple words, unexpected results

Jonathan Rochkind
In reply to this post by Jack Krupansky-3
On 12/29/14 5:24 PM, Jack Krupansky wrote:
> WDF is powerful, but it is not magic. In general, the indexed data is
> expected to be clean while the query might be sloppy. You need to separate
> the index and query analyzers and they need to respect that distinction

I do not understand what separate query/index analysis you are
suggesting to accomplish what I wanted.

I understand the WDF, like all software, is not magic, of course. But I
thought this was an intended use case of the WDF, with those settings:

A "mixedCase" query would match "mixedCase" in the index; and the same
query "mixedCase" would also match two separate words "mixed Case" in
the index.  (Case insensitively, since I apply an ICUFoldingFilter on top
of that.)

Was I wrong, is this not an intended thing for the WDF to do? Or do I
just have the wrong configuration options for it to do it? Or is it a bug?

When I started this thread a few months ago, I think Erick Erickson
agreed this was an intended use case for the WDF, but maybe I explained
it poorly. Erick if you're around and want to at least confirm whether
WDF is supposed to do this in your understanding, that would be great!

Jonathan

Re: WordDelimiter filter, expanding to multiple words, unexpected results

Erick Erickson
Jonathan:

Well, it works if you set splitOnCaseChange="0" in just the query part
of the analysis chain. I probably misled you a bit months ago: WDFF
is intended for this case iff you expect the case change to generate
_tokens_ that are individually meaningful. And unfortunately what is
"significant" in one case will not be significant in others.
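
A minimal sketch of what that suggestion could look like, based on the three-filter reproduction case from earlier in the thread. The only change is that the single analyzer is split into index and query analyzers, with splitOnCaseChange set to "0" on the query side. Treat it as an untested illustration rather than a recommended schema.

```xml
<fieldType name="text" class="solr.TextField" positionIncrementGap="100"
           autoGeneratePhraseQueries="true">
  <!-- Index side: unchanged from the reproduction case. -->
  <analyzer type="index">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" catenateWords="1" splitOnCaseChange="1"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
  <!-- Query side: same chain, but case changes no longer split the term,
       so deLALAIN stays one token and is folded to delalain. -->
  <analyzer type="query">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" catenateWords="1" splitOnCaseChange="0"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```

With this split, a "deLALAIN" query is no longer broken into "de" and "lalain" and should reach the index as the single folded token "delalain". Because only the query analyzer changes, it should not need a reindex. The cost is that the same query would no longer match text indexed as two separate words.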

So what kinds of things do you want WDFF to handle? Case changes?
Letter/non-letter transitions? All of the above?

Best,
Erick



On Mon, Dec 29, 2014 at 3:07 PM, Jonathan Rochkind <[hidden email]> wrote:

> On 12/29/14 5:24 PM, Jack Krupansky wrote:
>>
>> WDF is powerful, but it is not magic. In general, the indexed data is
>> expected to be clean while the query might be sloppy. You need to separate
>> the index and query analyzers and they need to respect that distinction
>
>
> I do not understand what separate query/index analysis you are suggesting to
> accomplish what I wanted.
>
> I understand the WDF, like all software, is not magic, of course. But I
> thought this was an intended use case of the WDF, with those settings:
>
> A "mixedCase" query would match "mixedCase" in the index; and the same query
> "mixedCase" would also match two separate words "mixed Case" in index.
> (Case insensitively since I apply an ICUFoldingFilter on top of that).
>
> Was I wrong, is this not an intended thing for the WDF to do? Or do I just
> have the wrong configuration options for it to do it? Or is it a bug?
>
> When I started this thread a few months ago, I think Erick Erickson agreed
> this was an intended use case for the WDF, but maybe I explained it poorly.
> Erick if you're around and want to at least confirm whether WDF is supposed
> to do this in your understanding, that would be great!
>
> Jonathan

Re: WordDelimiter filter, expanding to multiple words, unexpected results

Alexandre Rafalovitch
In reply to this post by Jonathan Rochkind
On 29 December 2014 at 18:07, Jonathan Rochkind <[hidden email]> wrote:
> I do not understand what separate query/index analysis you are suggesting to
> accomplish what I wanted.

I am sure you know this already, but just in case: at the moment, you
have only one analyzer chain, so it applies at both index and query time.
You can split it and have separate treatment during indexing and
during search. This is useful with synonyms, etc. The example schema
shows both versions.

But I would start by just removing the splitOnCaseChange attribute and
reindexing. I don't think that flag means what you want it to mean.
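
One caveat when trying this: as far as I know, WordDelimiterFilterFactory treats splitOnCaseChange as enabled when the attribute is left out, so simply dropping the attribute may not change anything. A hedged sketch that switches it off explicitly, reusing the filter chain from the reproduction case:

```xml
<!-- Sketch: same single analyzer as the reproduction case, but with
     splitOnCaseChange turned off explicitly instead of omitted.
     Assumption: the factory default for splitOnCaseChange is "1". -->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100"
           autoGeneratePhraseQueries="true">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" catenateWords="1" splitOnCaseChange="0"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```

Only the case-change rule is disabled here; the other WordDelimiterFilter rules stay at whatever the remaining attributes and defaults give. Since this analyzer is also used at index time, a reindex is needed after the change.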

Regards,
    Alex.


Re: WordDelimiter filter, expanding to multiple words, unexpected results

Jonathan Rochkind
In reply to this post by Erick Erickson
Thanks Erick!

Yes, if I set splitOnCaseChange=0, then of course it'll work -- but then
query for "mixedCase" will no longer also match "mixed Case".

I think I want WDF to... kind of do all of the above.

Specifically, I had thought that it would allow a query for "mixedCase"
to match both/either "mixed Case" or "mixedCase" in the index. (with
case insensitivity on top of that via another filter).

That would support things like names like "duBois" which are sometimes
spelled "du bois" and sometimes "dubois", and allow the query "duBois"
to match both in the index.

I had somehow thought that was what WDF was intended for. But it's
actually not the usual functioning, and may not be realistic?

I'm a bit confused about what splitOnCaseChange combined with
catenateWords is meant to do at all.  It _is_ generating both the split
and single-word tokens at query time -- but not in a way that actually
allows it to match both the split and single-word tokens?  What is
supposed to be the purpose/use case for splitOnCaseChange with
catenateWords? If any?

Jonathan

On 12/29/14 7:20 PM, Erick Erickson wrote:

> Jonathan:
>
> Well, it works if you set splitOnCaseChange="0" in just the query part
> of the analysis chain. I probably misled you a bit months ago, WDFF
> is intended for this case iff you expect the case change to generate
> _tokens_ that are individually meaningful. And unfortunately
> "significant" in one case will be not-significant in others.
>
> So what kinds of things do you want WDFF to handle? Case changes?
> Letter/non-letter transitions? All of the above?
>
> Best,
> Erick

Re: WordDelimiter filter, expanding to multiple words, unexpected results

Jack Krupansky-3
Right, that's what I meant by WDF not being "magic" - you can configure it
to match any three out of four use cases as you choose, but there is no
choice that matches all of the use cases.

To be clear, this is not a "bug" in WDF, but simply a limitation.
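
To make the trade-off concrete, here is a sketch of the asymmetric setup Erick described earlier in the thread (catenateWords on at index time, off at query time), applied to the reproduction-case chain. It is an illustration under those assumptions, not a tested schema, and the comments describe what each side would be expected to produce.

```xml
<fieldType name="text" class="solr.TextField" positionIncrementGap="100"
           autoGeneratePhraseQueries="true">
  <!-- Index side: split on case change AND keep the catenated form.
       deLALAIN is indexed as de, lalain, delalain;
       delalain is indexed as delalain only. -->
  <analyzer type="index">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" catenateWords="1" splitOnCaseChange="1"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
  <!-- Query side: split only, no catenated token.
       deLALAIN becomes the phrase "de lalain";
       delalain stays the single term delalain. -->
  <analyzer type="query">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" catenateWords="0" splitOnCaseChange="1"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```

Under that configuration a "mixedCase" query should match indexed "mixedCase" and "mixed Case", and a "mixedcase" query should match indexed "mixedcase" and "mixedCase". What still fails is the case from this thread: a "mixedCase" query against text indexed only as "mixedcase".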


-- Jack Krupansky

On Tue, Dec 30, 2014 at 11:12 AM, Jonathan Rochkind <[hidden email]>
wrote:

> Thanks Erick!
>
> Yes, if I set splitOnCaseChange=0, then of course it'll work -- but then
> query for "mixedCase" will no longer also match "mixed Case".
>
> I think I want WDF to... kind of do all of the above.
>
> Specifically, I had thought that it would allow a query for "mixedCase" to
> match both/either "mixed Case" or "mixedCase" in the index. (with case
> insensitivity on top of that via another filter).
>
> That would support things like names like "duBois" which are sometimes
> spelled "du bois" and sometimes "dubois", and allow the query "duBois" to
> match both in the index.
>
> I had somehow thought that was what WDF was intended for. But it's
> actually not the usual functioning, and may not be realistic?
>
> I'm a bit confused about what splitOnCaseChange combined with
> catenateWords is meant to do at all.  It _is_ generating both the split and
> single-word tokens at query time -- but not in a way that actually allows
> it to match both the split and single-word tokens?  What is supposed to be
> the purpose/use case for splitOnCaseChange with catenateWords? If any?
>
> Jonathan

Re: WordDelimiter filter, expanding to multiple words, unexpected results

Jonathan Rochkind
I guess I don't understand what the four use cases are, or the three out
of four use cases, or what the intended uses of the WDF are.

Can you explain the intended use of setting:

generateWordParts="1" catenateWords="1" splitOnCaseChange="1"

Is that supposed to do something useful (at either query or index time),
or is that a nonsensical configuration that nobody should ever use?

I understand how analysis can be different at index vs query time. I
think what I don't fully understand is what the possibilities and
intended use case of the WDF are, with various configurations.

I thought one of the intended use cases, with appropriate configuration,
was to do what I'm talking about: allow a "mixedCase" query to match both
"mixed Case" and "mixedCase" in the index. I think you're saying I'm wrong,
and this is not something WDF can do? Can you confirm I understand you
right?

Thanks!

Jonathan

On 12/30/14 11:30 AM, Jack Krupansky wrote:

> Right, that's what I meant by WDF not being "magic" - you can configure it
> to match any three out of four use cases as you choose, but there is no
> choice that matches all of the use cases.
>
> To be clear, this is not a "bug" in WDF, but simply a limitation.
>
>
> -- Jack Krupansky