Improve OOTB behavior: English word-splitting should default to autoGeneratePhraseQueries=true


Jack Krupansky-2
Digging through the Jira and revision history, I discovered that back at the
end of May 2011, a change was made to Solr that fairly significantly
degrades the OOTB behavior for Solr queries, namely for word-splitting of
terms with embedded punctuation, so that they end up, by default, doing the
OR of the sub-terms, rather than doing the obvious phrase query of the
sub-terms.

Just a couple of examples:

CD-ROM => CD OR ROM rather than "CD ROM"
1,000 => 1 OR 000 rather than "1 000" (when using the WordDelimiterFilter)
out-of-the-box => out OR of OR the OR box rather than "out of the box"
3.6 => 3 OR 6 rather than "3 6" (when using the WordDelimiterFilter)
docid-001 => docid OR 001 rather than "docid 001"

All of those queries will give surprising and unexpected results.
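For reference, this is roughly what the attribute in question looks like in schema.xml (a stripped-down sketch of a text field type, not the full example-schema definition):

```xml
<!-- With autoGeneratePhraseQueries="true", a single whitespace-separated
     query term that analysis splits into several tokens (CD-ROM =>
     cd, rom) is searched as the phrase "cd rom"; with the current default
     of "false" the sub-terms become OR'd clauses (cd OR rom). -->
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100"
           autoGeneratePhraseQueries="true">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```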

Back to the history of the change, there was a lot of lively discussion on
SOLR-2015 - add a config hook for autoGeneratePhraseQueries:
https://issues.apache.org/jira/browse/SOLR-2015

And the actual change to default to the behavior described above was
SOLR-2519 - improve defaults for text_* field types:
https://issues.apache.org/jira/browse/SOLR-2519

I gather that the original motivation was non-European languages, and
that even some European languages might search better without auto-phrase
generation, but the decision to have English terms NOT automatically
generate phrase queries, and to generate OR queries instead, is surprising
and outright undesirable, as my examples above show.

I had been aware of the behavior for quite some time, but I had thought it
was simply a lingering bug so I paid little attention to it, until I
stumbled across this autoGeneratePhraseQueries "feature" while looking at
the query parser code. I can understand the need to disable automatic phrase
queries for SOME languages, but to disable it by default for English seems
rather bizarre, as my simple use cases above show.

I'll file this as a Jira, but I wanted to call wider attention to it in case
others were as unaware as me that what had seemed like buggy behavior was
done intentionally.

Unless there has been a change of heart since SOLR-2015/2519, I guess we are
stuck with the default TextField behavior, but at least we could improve the
example schema in several ways:

1. The English text field types should have autoGeneratePhraseQueries=true.
2. Add commentary about the impact of autoGeneratePhraseQueries=true/false -
in terms of use case examples, as above. Specifically note the ones that
will break if the feature is disabled.
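A sketch of the kind of commentary suggestion 2 has in mind (the wording is illustrative, not an actual schema.xml excerpt):

```xml
<!-- autoGeneratePhraseQueries: when "true", a query term that analysis
     splits into multiple tokens (e.g. "out-of-the-box", "CD-ROM", "1,000")
     is searched as a phrase rather than as an OR of the sub-terms. Leave
     it "false" (or remove it) for non-whitespace languages such as
     Chinese, Japanese, or Thai, where phrase semantics across adjacent
     tokens would be inappropriate. -->
```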

Another, more controversial change would be:

3. Change text_general to autoGeneratePhraseQueries=true so that English
will be treated reasonably by default. I suspect that most European
languages will be at least "okay". A comment will note that this field
attribute should be removed or set to false for non-whitespace languages, or
that an alternative field type should be used. I suspect that the first
thing any non-whitespace language application will want to do is pick the
text field type that has analysis that makes the most sense for them, so I
see no need to mess up English for no good reason.

Make no mistake, #3 is the primary and only real goal of this OOTB
improvement. Maybe "text_general" could be kept as is for reference as the
purported "general" text field type (except that it doesn't work well for
English, as shown above), and maybe there should be a "text_default" that I
would propose should be text_en with commentary to direct users to the other
choices for language.

I would note that text_ja already has autoGeneratePhraseQueries=false, so
I'm not sure why the default in the TextField code had to be changed to
false. Any languages for which automatic phrase query generation is
problematic should set the attribute similarly. But, now that it is wired into
the schema defaults, we may be stuck with it.

I was rather surprised that SOLR-2519 actually changed the default in
TextField rather than simply set the attribute as appropriate for the
various text field types.

There are probably also a couple of places in the wikis where the surprising
behavior should be noted.

And, I would propose that the 4.0 CHANGES.TXT very clearly highlight the
kinds of use cases that unsuspecting users may not realize were BROKEN by
the commit of SOLR-2519 that is masked under the innocent phrasing of
"improve defaults for text_* field types". How many users seriously
understood that queries with embedded dashes and commas would behave
differently as a result of that change?

I am contemplating whether to suggest that the WordDelimiterFilter should
also be part of the default text field type. Right now, it is hidden off in
text_en_splitting.
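For context, text_en_splitting pairs the WordDelimiterFilter with autoGeneratePhraseQueries="true"; here is a trimmed sketch of its index-time analyzer (the real definition also includes stopword, protected-word, and stemming filters):

```xml
<fieldType name="text_en_splitting" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- splits on intra-word punctuation and case changes, and also
         catenates adjacent parts so both "wi fi" and "wifi" are indexed -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```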

I'll file the Jira tomorrow. Feel free to hold off comments until the Jira
appears.

-- Jack Krupansky


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Improve OOTB behavior: English word-splitting should default to autoGeneratePhraseQueries=true

Robert Muir
Where are the actual relevance measurements showing degradation? For
every example you have, i can give you a counter-example, including
whole languages that flat out won't work at all.

Anyone who *wants* a phrase query can ask for one with double quotes.
If you force this option on, users have no way to turn it off.

I'm strongly opposed. I could care less about english.

On Wed, Aug 8, 2012 at 8:13 PM, Jack Krupansky <[hidden email]> wrote:

> [...]



--
lucidimagination.com



Re: Improve OOTB behavior: English word-splitting should default to autoGeneratePhraseQueries=true

Chris Hostetter-3

: Anyone who *wants* a phrase query can ask for one with double quotes.
: If you force this option on, users have no way to turn it off.
:
: I'm strongly opposed. I could care less about english.

Hold on a minute and think about what jack is pointing out here.

I can understand your argument against Jack's examples that refer to WDF,
because the only places WDF is used in the example schema is in fieldTypes
where we already include autoGeneratePhraseQueries="true"

But i didn't realize until Jack's email that StandardTokenizerFactory
splits on "-" .. meaning that any hyphenated word is split into multiple
tokens and, by default, produces a BooleanQuery with all SHOULD clauses.

It's easy to say users who want a phrase query can use quotes, but i
suspect most people aren't going to realize they need to explicitly ask
for a phrase query, because they have no idea that their input is going to
be split up.

Can you honestly say you think it makes sense that someone who types
in...

        fly-fishing

...will get a match on documents containing either "fly" or "fishing" using
the example "text_general" fieldType?



-Hoss



Re: Improve OOTB behavior: English word-splitting should default to autoGeneratePhraseQueries=true

Robert Muir
On Wed, Aug 8, 2012 at 11:43 PM, Chris Hostetter
<[hidden email]> wrote:

>
> Can you honestly say you think it makes sense that someone who types
> in...
>
>         fly-fishing
>
> ...will get a match on documents containing either "fly" or "fishing" using
> the example "text_general" fieldType?
>

Can you honestly generalize this rule from "how to handle hyphen" to
"if > 1 term comes out of a whitespace-separated term, it must be a
phrase query"?
It's extremely short-sighted to think "only my language matters" and
not care what breakage comes as long as fucking hyphens work the way
you think they should for english.

Even for english itself, it's debatable:

http://en.wikipedia.org/wiki/Hyphen#Varied_meanings

So I'm not sold for english, and breaking chinese text totally in
what's supposed to be a general field? hell no.

--
lucidimagination.com



Re: Improve OOTB behavior: English word-splitting should default to autoGeneratePhraseQueries=true

Michael McCandless-2
In reply to this post by Jack Krupansky-2
The text_general field type is meant to be a good default for all languages.

If you want English-specific behavior, you should use one of the
English field types (text_en, text_en_splitting,
text_en_splitting_tight).  The comments in schema.xml explain this.

Ideally we would eventually have default field types for many
different languages, not just English ... some day.

I don't think we should turn on autoGeneratePhraseQueries=true for
text_general: it's catastrophic to non-whitespace languages.

Mike McCandless

http://blog.mikemccandless.com

On Wed, Aug 8, 2012 at 8:13 PM, Jack Krupansky <[hidden email]> wrote:

> [...]



Re: Improve OOTB behavior: English word-splitting should default to autoGeneratePhraseQueries=true

Chris Hostetter-3
In reply to this post by Robert Muir

: Can you honestly generalize this rule from "how to handle hyphen" to
: "if > 1 term comes out of a whitespace-separated term, it must be a
: phrase query?".

No, which is why i never said that.  what i said was "Hold on a minute and
think about what jack is pointing out here" -- instead of dismissing the
problem out of hand because you "could care less about english"

Just because you don't like Jack's suggested solution, doesn't make the
problem magically go away.  You may not care about english, but (surprise!)
lots of people do, and we should try to figure out some ways of mitigating
confusion like this for people indexing english.

Maybe this is just a matter of better documentation, but it's at least
worth *discussing* what the possible solutions are, instead of being rude
and dismissive about the fact that the OOTB behavior is currently very
unintuitive for the english language.

Off the top of my head, i can think of several ideas (some trivial, some
hypothetical) that *might* improve the OOTB experience for new users, that
are at least worth *discussing* ...

1) better class level QueryParser javadocs and example schema.xml comments
about the significance of autoGeneratePhraseQueries and the tradeoffs of
changing it.

2) mention autoGeneratePhraseQueries and its trade-offs in the solr
tutorial

3) more configuration options in StandardTokenizer and
StandardTokenizerFactory about when/how tokens are split on things like
hyphen and comments about them in the example schema.xml

4) smarter logic/options in QueryParser for determining when to build a
phrase query automatically based on the character ranges


: Even for english itself, its debatable:
:
: http://en.wikipedia.org/wiki/Hyphen#Varied_meanings

I'm not following your argument -- that URL demonstrates various
examples where {{ foo-bar }} has extremely different semantic meaning from
{{ foo bar }} ... which actually demonstrates the point I'm making:
it's highly unintuitive that a search for a hyphenated word like {{
foo-bar }} should be interpreted as "search for either of those words"


-Hoss



Re: Improve OOTB behavior: English word-splitting should default to autoGeneratePhraseQueries=true

Jack Krupansky-2
(Feel free to add these comments to the Jira I filed this morning:
https://issues.apache.org/jira/browse/SOLR-3723)

-- Jack Krupansky

-----Original Message-----
From: Chris Hostetter
Sent: Thursday, August 09, 2012 11:22 AM
To: Lucene Dev
Subject: Re: Improve OOTB behavior: English word-splitting should default to
autoGeneratePhraseQueries=true


[...]




Re: Improve OOTB behavior: English word-splitting should default to autoGeneratePhraseQueries=true

Yonik Seeley-2-2
In reply to this post by Michael McCandless-2
On Thu, Aug 9, 2012 at 6:49 AM, Michael McCandless
<[hidden email]> wrote:
> The text_general field type is meant to be a good default for all languages.

What many of us not familiar with the tokenizing rules of the standard
tokenizer just realized is that it's not a good default for english
and probably most other european languages.

> If you want English-specific behavior, you should use one of the
> English field types (text_en, text_en_splitting,
> text_en_splitting_tight).

Seems like we should be showing best-practice and using these english
fields in our english examples.

-Yonik
http://lucidimagination.com



Re: Improve OOTB behavior: English word-splitting should default to autoGeneratePhraseQueries=true

Chris Hostetter-3

: What many of us not familiar with the tokenizing rules of the standard
: tokenizer just realized is that it's not a good default for english
: and probably most other european languages.

Jira is down for reindexing at the moment, so i can't file this suggestion
as a new Feature proposal (or comment on its relevance in SOLR-3723) and
i probably won't be online for another few days, so i wanted to get this
idea out there now for discussion instead of waiting.

        ---

Based on the link steven mentioned clarifying why exactly
StandardTokenizer works the way it does...

        http://unicode.org/reports/tr29/#Word_Boundaries

...I think it would be a good idea to add some new customization options
to StandardTokenizer (and StandardTokenizerFactory) to "tailor" the
behavior based on the various "tailored improvement" notes...

  "It is not possible to provide a uniform set of rules that resolves
   all issues across languages or that handles all ambiguous situations
   within a given language. The goal for the specification presented in
   this annex is to provide a workable default; tailored implementations
   can be more sophisticated."

1) An option to include the various "hyphen" characters in the "MidLetter"
class per this note...

  "Some or all of the following characters may be tailored to be in
   MidLetter, depending on the environment: ..."
   [\u002D\uFF0D\uFE63\u058A\u1806\u2010\u2011\u30A0\u30FB\u201B\u055A\u0F0B]

It might make sense to expand this option to also include the other
characters listed in the note below, and name the option something along
the lines of "splitIdentifiers"...

  "Characters such as hyphens, apostrophes, quotation marks, and
   colon should be taken into account when using identifiers that
   are intended to represent words of one or more natural languages.
   See Section 2.4, Specific Character Adjustments, of [UAX31].
   Treatment of hyphens, in particular, may be different in the case
   of processing identifiers than when using word break analysis for
   a Whole Word Search or query, because when handling identifiers the
   goal will be to parse maximal units corresponding to natural language
   “words,” rather than to find smaller word units within longer lexical
   units connected by hyphens."

(this point about "parse maximal units" seems particularly on point for
the use case where a user's search input consists of a single hyphenated
word)

2) an option to control if/when the following characters are included in the
"MidNum" class, per the corresponding note...

  "Some or all of the following characters may be tailored to be in
   MidNum, depending on the environment, to allow for languages that
   use spaces as thousands separators, such as €1 234,56.  ..."
   [\u0020\u00A0\u2007\u2008\u2009\u202F]

3) an option to control whether word breaking should happen between
scripts, per this note...

  "Normally word breaking does not require breaking between different
   scripts. However, adding that capability may be useful in combination
   with other extensions of word segmentation.  ..."

4) an option to control whether U+002E should be included in ExtendNumLet
per this note ...

  "To allow acronyms like “U.S.A.”, a tailoring may include U+002E FULL
   STOP in ExtendNumLet"

5) an option to control whether '"' and U+05F3 are treated as MidLetter
based on this note...

  "For Hebrew, a tailoring may include a double quotation mark between
   letters, because legacy data may contain that in place of U+05F4 ..."

6) an option to break at apostrophes when the following character is a
vowel, per this note...

  "The use of the apostrophe is ambiguous. ... In some languages,
   such as French and Italian, tailoring to break words when the
   character after the apostrophe is a vowel may yield better results
   in more cases. This can be done by adding a rule WB5a ..."




-Hoss



Re: Improve OOTB behavior: English word-splitting should default to autoGeneratePhraseQueries=true

Robert Muir
On Thu, Aug 9, 2012 at 11:43 PM, Chris Hostetter
<[hidden email]> wrote:

> [...]
>
> Based on the link steven mentioned clarifying why exactly
> StandardTokenizer works the way it does...
>
>         http://unicode.org/reports/tr29/#Word_Boundaries
>
> ...I think it would be a good idea to add some new customization options
> to StandardTokenizer (and StandardTokenizerFactory) to "tailor" the
> behavior based on the various "tailored improvement" notes...
>

Use a CharFilter.
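One hedged reading of this suggestion (PatternReplaceCharFilterFactory is a real Solr charFilter; the particular pattern here is illustrative, and whether such a rewrite is desirable is exactly what's being debated):

```xml
<!-- Rewrite the raw input before tokenization: delete a hyphen between
     two letters so that "fly-fishing" reaches StandardTokenizer as the
     single token "flyfishing" instead of being split into two SHOULD
     clauses. -->
<analyzer>
  <charFilter class="solr.PatternReplaceCharFilterFactory"
              pattern="(\p{L})-(\p{L})" replacement="$1$2"/>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```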

--
lucidimagination.com



Re: Improve OOTB behavior: English word-splitting should default to autoGeneratePhraseQueries=true

Chris Hostetter-3

: >         http://unicode.org/reports/tr29/#Word_Boundaries
: >
: > ...I think it would be a good idea to add some new customization options
: > to StandardTokenizer (and StandardTokenizerFactory) to "tailor" the
: > behavior based on the various "tailored improvement" notes...


: Use a CharFilter.

can you elaborate on how you would suggest implementing these "tailored
improvements" using a CharFilter?

I imagine #5 ('"' used when U+05F4 should be) could be solved with a
CharFilter, since it sounds like the fundamental issue is that '"' is being
used as a substitute character in these situations and that could be "fixed",
but I don't understand how any of the other examples could be dealt with
in this way.

None of them are about adding/removing/replacing any characters in the
stream; they are all about giving the ability to tailor the logic used
to decide when/where word boundaries should be found w/o changing the
content...


: 1) An option to include the various "hyphen" characters in the "MidLetter"
: class per this note...
:
:   "Some or all of the following characters may be tailored to be in
:    MidLetter, depending on the environment: ..."
:    [\u002D\uFF0D\uFE63\u058A\u1806\u2010\u2011\u30A0\u30FB\u201B\u055A\u0F0B]
:    
: It might make sense to expand this option to also include the other
: characters listed in the note below, and name the option something along
: the lines of "splitIdentifiers"...
:
:   "Characters such as hyphens, apostrophes, quotation marks, and
:    colon should be taken into account when using identifiers that
:    are intended to represent words of one or more natural languages.
:    See Section 2.4, Specific Character Adjustments, of [UAX31].
:    Treatment of hyphens, in particular, may be different in the case
:    of processing identifiers than when using word break analysis for
:    a Whole Word Search or query, because when handling identifiers the
:    goal will be to parse maximal units corresponding to natural language
:    “words,” rather than to find smaller word units within longer lexical
:    units connected by hyphens."
:    
: (this point about "parse maximal units" seems particularly on point for
: the use case where a user's search input consists of a single hyphenated
: word)
:
: 2) an option to control if/when the following characters are included in
: the "MidNum" class, per the corresponding note...
:
:   "Some or all of the following characters may be tailored to be in
:    MidNum, depending on the environment, to allow for languages that
:    use spaces as thousands separators, such as €1 234,56.  ..."
:    [\u0020\u00A0\u2007\u2008\u2009\u202F]
:    
: 3) an option to control whether word breaking should happen between
: scripts, per this note...
:
:   "Normally word breaking does not require breaking between different
:    scripts. However, adding that capability may be useful in combination
:    with other extensions of word segmentation.  ..."
:    
: 4) an option to control whether U+002E should be included in ExtendNumLet
: per this note ...
:
:   "To allow acronyms like “U.S.A.”, a tailoring may include U+002E FULL
:    STOP in ExtendNumLet"

     [...]

: 6) an option to break at apostrophes followed by vowels, per this note...
:
:   "The use of the apostrophe is ambiguous. ... In some languages,
:    such as French and Italian, tailoring to break words when the
:    character after the apostrophe is a vowel may yield better results
:    in more cases. This can be done by adding a rule WB5a ..."




-Hoss



Re: Improve OOTB behavior: English word-splitting should default to autoGeneratePhraseQueries=true

Robert Muir
On Mon, Aug 13, 2012 at 1:58 PM, Chris Hostetter
<[hidden email]> wrote:

>
> : >         http://unicode.org/reports/tr29/#Word_Boundaries
> : >
> : > ...I think it would be a good idea to add some new customization options
> : > to StandardTokenizer (and StandardTokenizerFactory) to "tailor" the
> : > behavior based on the various "tailored improvement" notes...
>
>
> : Use a CharFilter.
>
> can you elaborate on how you would suggest implementing these "tailored
> improvements" using a CharFilter?

Generally the easiest way is to replace your ambiguous character (such
as your hyphen-minus) with what your domain-specific knowledge tells
you it should be.
If you are indexing a dictionary where this ambiguous hyphen-minus is
being used to separate syllables, then replace it with \u2027
(hyphenation point), and it won't trigger word boundaries.

But it really depends on how you want your whole analysis process to
work. E.g., in the above example, if you want to treat "foo-bar" as
really equivalent to foobar, or you want to treat U.S.A as equivalent
to USA, because that's how you want your search to work, then I would
just replace with U+2060 WORD JOINER. Follow through with the NFKC_CF
Unicode normalization filter in the ICU package, which will remove
it, since it's a Format character.

So I think you can handle all of your cases there with a simple regex
CharFilter, substituting the correct 'semantics' depending on
ultimately how you want it to work, and then just apply NFKC_CF at the
end.

As far as the last example goes, there's no need for the tokenizer to be
involved. We already have ElisionFilter for this, and the Italian and French
analyzers use it to remove a default (but configurable) set of
contractions. The Solr examples for these languages are set up with
these, too.

If you really don't like these dead-simple approaches, then just use
the tokenizer in the ICU package, which is more flexible than the
JFlex implementation: it lets you supply custom grammars at runtime,
can split by script, etc.
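The substitution idea above can be sketched in plain Python -- a toy stand-in for what a Lucene CharFilter plus NFKC_CF normalization would do, not the actual Lucene/ICU API (the regexes and function names here are illustrative only):

```python
import re

def char_filter(text):
    # Replace a hyphen-minus between word characters with U+2060 WORD
    # JOINER, so a word-boundary tokenizer will not split there.
    return re.sub(r'(?<=\w)-(?=\w)', '\u2060', text)

def tokenize(text):
    # Crude stand-in for StandardTokenizer: runs of word characters,
    # with U+2060 allowed mid-token.
    return re.findall(r'[\w\u2060]+', text)

def normalize(token):
    # Stand-in for NFKC_CF, which strips U+2060 because it is a
    # Default_Ignorable Format character.
    return token.replace('\u2060', '')

tokens = [normalize(t) for t in tokenize(char_filter('treat foo-bar as foobar'))]
# tokens -> ['treat', 'foobar', 'as', 'foobar']
```

The net effect is the one described: "foo-bar" survives tokenization as a single token and comes out equivalent to "foobar" after normalization.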


--
lucidworks.com


RE: Improve OOTB behavior: English word-splitting should default to autoGeneratePhraseQueries=true

steve_rowe
Another possibility that would increase customizability via exposing information we currently throw away, proposed by Mike McCandless on LUCENE-3940[1] (though controversially[2]): in addition to tokenizing alpha/numeric char sequences, StandardTokenizer could also tokenize everything else.

Then a NonAlphaNumericStopFilter could remove tokens with types other than <NUM> or <ALPHANUM>.

As an alternative to NonAlphaNumericStopFilter, a separate WordDelimiterFilter-like filter could instead generate synonyms like "wi-fi" and "wifi" when it sees the token sequence ("wi"<ALPHANUM>, "-"<PUNCT>, "fi"<ALPHANUM>).

Positions would need to be addressed.  I assume the default behavior would be to remove position holes when non-alphanumeric tokens are stopped.  (In fact, I can't think of any use case that would benefit from position holes for stopped non-alphanumeric tokens.)

AFAICT, Robert Muir's objection to enabling this kind of thing[2] is that people would use such a tokenizer in default (don't-throw-anything-away) mode, and as a result, unwittingly put tons of junk tokens in their indexes.  Maybe this concern could be addressed by making the default behavior the same as it is today, and providing the don't-throw-anything-away behavior as a non-default option?  Standard*Analyzer* would then remain exactly as it is today, and wouldn't need to include a NonAlphaNumericStopFilter.

Steve

[1] Mike McCandless's post on LUCENE-3940 <https://issues.apache.org/jira/browse/LUCENE-3940?focusedCommentId=13243299&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13243299>

[2] Robert Muir's subsequent post on LUCENE-3940 <https://issues.apache.org/jira/browse/LUCENE-3940?focusedCommentId=13244124&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13244124>
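
The two proposed filters can be sketched over (text, type) token pairs, as a don't-throw-anything-away tokenizer might emit them -- the filter names and token types here are the hypothetical ones from this proposal, not existing Lucene classes:

```python
def non_alphanumeric_stop(tokens):
    # drop tokens whose type is not <ALPHANUM> or <NUM>
    return [(t, ty) for (t, ty) in tokens if ty in ("ALPHANUM", "NUM")]

def hyphen_synonyms(tokens):
    # when the sequence ALPHANUM, PUNCT("-"), ALPHANUM is seen, emit
    # both the "wi-fi" and "wifi" forms (positions ignored in this sketch)
    out, i = [], 0
    while i < len(tokens):
        if (i + 2 < len(tokens)
                and tokens[i][1] == "ALPHANUM"
                and tokens[i + 1] == ("-", "PUNCT")
                and tokens[i + 2][1] == "ALPHANUM"):
            a, b = tokens[i][0], tokens[i + 2][0]
            out.append((a + "-" + b, "ALPHANUM"))
            out.append((a + b, "ALPHANUM"))
            i += 3
        else:
            out.append(tokens[i])
            i += 1
    return out

stream = [("wi", "ALPHANUM"), ("-", "PUNCT"), ("fi", "ALPHANUM")]
# non_alphanumeric_stop(stream) -> [("wi", "ALPHANUM"), ("fi", "ALPHANUM")]
# hyphen_synonyms(stream)       -> [("wi-fi", "ALPHANUM"), ("wifi", "ALPHANUM")]
```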
       



Re: Improve OOTB behavior: English word-splitting should default to autoGeneratePhraseQueries=true

Michael McCandless-2
I had forgotten about this but I agree it could also be used to handle
challenging tokenizations.

In general I think our Tokenizers should throw away as little
information as possible (at least have options to do so).  Subsequent
TokenFilters can always remove things ...

I agree there's a risk of junk getting into indices ... but setting
appropriate defaults should address this.

Mike McCandless

http://blog.mikemccandless.com

On Tue, Aug 14, 2012 at 12:26 PM, Steven A Rowe <[hidden email]> wrote:

> Another possibility that would increase customizability via exposing information we currently throw away, proposed by Mike McCandless on LUCENE-3940[1] (though controversially[2]): in addition to tokenizing alpha/numeric char sequences, StandardTokenizer could also tokenize everything else.
>
> Then a NonAlphaNumericStopFilter could remove tokens with types other than <NUM> or <ALPHANUM>.
>
> As an alternative to NonAlphaNumericStopFilter, a separate WordDelimiterFilter-like filter could instead generate synonyms like "wi-fi" and "wifi" when it sees the token sequence ("wi"<ALPHANUM>, "-"<PUNCT>, "fi"<ALPHANUM>).
>
> Positions would need to be addressed.  I assume the default behavior would be to remove position holes when non-alphanumeric tokens are stopped.  (In fact, I can't think of any use case that would benefit from position holes for stopped non-alphanumeric tokens.)
>
> AFAICT, Robert Muir's objection to enabling this kind of thing[2] is that people would use such a tokenizer in default (don't-throw-anything-away) mode, and as a result, unwittingly put tons of junk tokens in their indexes.  Maybe this concern could be addressed by making the default behavior the same as it is today, and providing the don't-throw-anything-away behavior as a non-default option?  Standard*Analyzer* would then remain exactly as it is today, and wouldn't need to include a NonAlphaNumericStopFilter.
>
> Steve
>
> [1] Mike McCandless's post on LUCENE-3940 <https://issues.apache.org/jira/browse/LUCENE-3940?focusedCommentId=13243299&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13243299>
>
> [2] Robert Muir's subsequent post on LUCENE-3940 <https://issues.apache.org/jira/browse/LUCENE-3940?focusedCommentId=13244124&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13244124>



Re: Improve OOTB behavior: English word-splitting should default to autoGeneratePhraseQueries=true

Chris Hostetter-3
In reply to this post by Robert Muir

: But it really depends on how you want your whole analysis process to
: work. e.g. in the above example if you want to treat "foo-bar" as
: really equivalent to foobar, or you want to treat U.S.A as equivalent

Unless I'm misreading the Word Boundary doc, the point of these types of
tailorings is to treat "foo-bar" as a single token, "foo-bar", including the
hyphen -- i.e., do not treat the hyphen as a word-break character.

If I understand correctly, you are arguing that instead of giving users
an option to tell StandardTokenizer to treat characters like hyphen as a
word character, they can achieve a tailoring like this by using a
CharFilter to translate these to less-ambiguous characters that are
already "word" characters according to the existing rules (i.e., \u2027).

I understand how that might be a good idea in general (to normalize the
intra-word punctuation to improve matching, if one query uses one type of
hyphen and another query uses a different type), but it still seems to
violate the point of the tailoring according to the doc -- allowing people
to preserve the actual character in identifiers...

>>> Treatment of hyphens, in particular, may be different in the case of
>>> processing identifiers than when using word break analysis for a Whole
>>> Word Search or query, because when handling identifiers the goal will
>>> be to parse maximal units corresponding to natural language “words,”
>>> rather than to find smaller word units within longer lexical units
>>> connected by hyphens.

The doc even points out specifically...

>>> Some or all of the following characters may be tailored to be in
>>> MidLetter, depending on the environment:  
    ...
>>> U+002D ( - ) HYPHEN-MINUS
>>> U+058A ( ֊ ) ARMENIAN HYPHEN
>>> U+2010 ( ‐ ) HYPHEN
>>> U+2011 ( ‑ ) NON-BREAKING HYPHEN
>>> U+FE63 ( ﹣ ) SMALL HYPHEN-MINUS
>>> U+FF0D ( - ) FULLWIDTH HYPHEN-MINUS

...so seemingly, according to the word boundary docs, there should be an
option to treat those individual characters as "MidLetter" characters w/o
requiring the user to change them to \u2027 in a CharFilter
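
A toy illustration of what that MidLetter tailoring would mean -- with the hyphen variants from the note treated as MidLetter, "fly-swatter" stays one token, *with* the hyphen preserved (this is plain Python regex standing in for a tailored UAX#29 rule, not real StandardTokenizer behavior):

```python
import re

# Hyphen variants listed in the UAX#29 MidLetter note (a subset)
MIDLETTER = '\u002D\u2010\u2011\uFF0D\uFE63\u058A'
# A "word" is word chars, optionally joined by MidLetter chars
WORD = rf'\w+(?:[{MIDLETTER}]\w+)*'

def tokenize(text):
    return re.findall(WORD, text)

# tokenize('a fly-swatter test') -> ['a', 'fly-swatter', 'test']
```

Note the contrast with the CharFilter approach: the literal hyphen character survives in the indexed token rather than being rewritten to \u2027 or \u2060.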



-Hoss



Re: Improve OOTB behavior: English word-splitting should default to autoGeneratePhraseQueries=true

Robert Muir
On Tue, Aug 14, 2012 at 1:19 PM, Chris Hostetter
<[hidden email]> wrote:
> >>> U+FF0D ( - ) FULLWIDTH HYPHEN-MINUS
>
> ...so seemingly, according to the word boundary docs, there should be an
> option to treat those individual characters as "MidLetter" characters w/o
> requiring the user to change them to \u2027 in a CharFilter
>

I don't agree with that logic at all. Why doesn't
java.text.BreakIterator have such an option, then?

Because people implement the default algorithm for general purposes. Those
tailorings are not 'mandatory'.

--
lucidworks.com


Re: Improve OOTB behavior: English word-splitting should default to autoGeneratePhraseQueries=true

Chris Hostetter-3

: Because people implement the default algorithm for general purposes. Those
: tailorings are not 'mandatory'.

I didn't say they were mandatory, I said it seems like it would be a good
idea to add options for them.

The spec says: "... implementations may override (tailor) the results to
meet the requirements of different environments or particular languages.
For some languages, it may also be necessary to have different tailored
word break rules for selection versus Whole Word Search" -- and I am
suggesting that our implementation (StandardTokenizer) should have options
for these suggested tailorings, to make it easier to meet the requirements
of the various environments/languages our users will care about, so that they
can "turn on" these tailorings w/o being required to completely re-implement
the entire Tokenizer.

Or at the very least, provide recipes for people who want to achieve
those tailorings using other means -- i.e., a doc somewhere that suggests
the "breaking between different scripts" tailoring can be achieved with a
simple PatternCharFilter seems fine, since the whole point is to break
more often than the default algorithm.  But for people who want to take
advantage of tailorings that break *less* often, I don't see any easy
way for them to do that on their own, so it seems like we should have
an option to do them on the StandardTokenizer itself.
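
The "break between different scripts" recipe can indeed be sketched as a regex char filter that inserts a space at script boundaries -- the Unicode ranges below are an illustrative Latin-vs-CJK toy only (real script detection would need ICU):

```python
import re

def split_scripts(text):
    # Insert a space at each boundary between a Latin letter and a
    # CJK ideograph (both directions); zero-width lookarounds mean
    # no original characters are consumed.
    boundary = (r'(?<=[A-Za-z])(?=[\u4e00-\u9fff])'
                r'|(?<=[\u4e00-\u9fff])(?=[A-Za-z])')
    return re.sub(boundary, ' ', text)

# split_scripts('abc\u4e2d\u6587def') -> 'abc \u4e2d\u6587 def'
```

Since this only ever adds breaks, it fits the "break more often than the default" case; the harder break-*less*-often tailorings have no such easy recipe.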

(either that, or go with McCandless's idea to leave *EVERYTHING* in the
token stream, and offer TokenFilters that can re-constitute tokens in cases
where the user thinks StandardTokenizer applied breaks too often)


The hyphen situation is a prime example: if people want to index terms
that contain literal hyphen characters in the middle of them, w/o changing
those characters into something else, that seems like something that should
be possible using StandardTokenizer.  Circling back to the start of this
thread, it would also make it easier to address the crux of the concern
about using StandardTokenizer with English and if/when
autoGeneratePhraseQueries should be used...

 1) if you want the input "fly-swatter" to be treated as a single
    token, leave the default settings alone.
 2) if you want the input "fly-swatter" to be broken into two tokens,
    set the "wordBreakOnHyphens" option on the StandardTokenizer to true
    2a) if this is in a query analyzer, the "fly" and "swatter"
        tokens will be used to make a BooleanQuery by default
    2b) if you want a phrase query to be built instead, use
        autoGeneratePhraseQueries=true, but this will affect all
        cases where a word break was found.

...i.e., stop forcing users to choose between phrase queries for
hyphenated words in English vs. "sane" queries for all of the languages on
the planet that don't use whitespace between words, and instead let the
user make a choice about the hyphens directly -- and then they can still
make a choice about the phrase queries if they want.
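
Those options compose as sketched below -- note that "wordBreakOnHyphens" is the hypothetical option proposed in this thread, not a real Lucene/Solr parameter, and the tokenizer/query builder here are toy Python stand-ins:

```python
import re

def standard_tokenize(text, word_break_on_hyphens=False):
    # Toy stand-in for the proposed StandardTokenizer behavior:
    # by default keep "fly-swatter" whole; with the hypothetical
    # wordBreakOnHyphens option, split at hyphens.
    pattern = r'\w+' if word_break_on_hyphens else r'\w+(?:-\w+)*'
    return re.findall(pattern, text)

def build_query(tokens, auto_generate_phrase_queries=False):
    # Toy stand-in for query construction from one analyzed term.
    if len(tokens) == 1:
        return tokens[0]
    if auto_generate_phrase_queries:
        return '"%s"' % ' '.join(tokens)
    return ' OR '.join(tokens)

# 1)  default: single token, no splitting
q1 = build_query(standard_tokenize('fly-swatter'))
# 2a) split on hyphens, default boolean query
q2 = build_query(standard_tokenize('fly-swatter', word_break_on_hyphens=True))
# 2b) split on hyphens, phrase query
q3 = build_query(standard_tokenize('fly-swatter', word_break_on_hyphens=True),
                 auto_generate_phrase_queries=True)
# q1 -> 'fly-swatter'    q2 -> 'fly OR swatter'    q3 -> '"fly swatter"'
```

The point of the sketch is that the two decisions are independent: where word breaks happen is a tokenizer option, and what query is built from the resulting tokens is a separate autoGeneratePhraseQueries decision.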




-Hoss
