[jira] [Created] (SOLR-2477) add analyzer type="phrase"

[jira] [Created] (SOLR-2477) add analyzer type="phrase"

Michael Gibney (Jira)
add analyzer type="phrase"
--------------------------

                 Key: SOLR-2477
                 URL: https://issues.apache.org/jira/browse/SOLR-2477
             Project: Solr
          Issue Type: Improvement
            Reporter: Robert Muir
             Fix For: 4.0


This is just exposing LUCENE-2892, so you can easily configure things
so that if users put things in double quotes they get a more precise search.



--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

[jira] [Updated] (SOLR-2477) add analyzer type="phrase"

Michael Gibney (Jira)

     [ https://issues.apache.org/jira/browse/SOLR-2477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated SOLR-2477:
------------------------------

    Attachment: SOLR-2477.patch

Here's my example fieldType from the test:
{noformat}
      <analyzer type="index">
        <!--  pretty standard, except stopwords are indexed, and WDF preserves the original token -->
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory"  preserveOriginal="1" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <!--  remove stopwords, expand synonyms, WDF, etc etc. -->
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
      <analyzer type="phrase">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!--  in this case no synonyms are expanded, and the exact stopwords, punctuation, etc must be present  -->
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
    </fieldType>
{noformat}


> add analyzer type="phrase"
> --------------------------
>
>                 Key: SOLR-2477
>                 URL: https://issues.apache.org/jira/browse/SOLR-2477
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Robert Muir
>             Fix For: 4.0
>
>         Attachments: SOLR-2477.patch
>
>
> This is just exposing LUCENE-2892, so you can easily configure things
> so that if users put things in double quotes they get a more precise search.

[jira] [Commented] (SOLR-2477) add analyzer type="phrase"

Michael Gibney (Jira)
In reply to this post by Michael Gibney (Jira)

    [ https://issues.apache.org/jira/browse/SOLR-2477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13024944#comment-13024944 ]

Yonik Seeley commented on SOLR-2477:
------------------------------------

Interesting idea having a separate analyzer to expose this.
It's probably important to come up with a good example for the example schema, because I could see it being error-prone if people do it themselves.  For example, if they tried your test example (which may look reasonable to someone at first blush) they wouldn't get any matches for anything that the WDF would normally split?


[jira] [Commented] (SOLR-2477) add analyzer type="phrase"

Michael Gibney (Jira)
In reply to this post by Michael Gibney (Jira)

    [ https://issues.apache.org/jira/browse/SOLR-2477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13024946#comment-13024946 ]

Robert Muir commented on SOLR-2477:
-----------------------------------

Well, we could maybe add something to the example; I thought it was sort of an expert feature.

Well, in my example, they would get matches for things that WDF normally splits, but only if the punctuation is exactly as they entered it.
Assume doc 3 is 'foo bar' and doc 4 is 'foo-bar':
{noformat}
  /**
   * test punctuation, we preserve the original for this purpose
   */
  public void testPunctuation() {
    assertQ("normal query: ",
        req("fl", "id", "q", "foo-bar", "sort", "id asc"),
        "//*[@numFound='2']",
        "//result/doc[1]/int[@name='id'][.=3]",
        "//result/doc[2]/int[@name='id'][.=4]"
    );

    assertQ("phrase query: ",
        req("fl", "id", "q", "\"foo-bar\"", "sort", "id asc"),
        "//*[@numFound='1']",
        "//result/doc[1]/int[@name='id'][.=4]"
    );
  }
{noformat}

But this was just an example; you don't have to involve WDF to take advantage of this (stopwords, synonyms, or decompounders are probably the simplest way). I was just coming up with an example to have some unit tests.
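
To make the foo-bar case concrete, here is a rough toy model in plain Java. It is only a sketch and does not use Lucene or Solr: the class and method names (PhraseAnalyzerToy, indexTokens, queryTokens, phraseTokens, search) are made up for illustration, WDF is approximated as "split on hyphens", and positions, stemming, stopwords and synonyms are ignored entirely. Under those assumptions it shows why the unquoted query can hit both docs while the quoted one hits only the doc with the exact punctuation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model only: hypothetical names, no Lucene/Solr classes involved. */
final class PhraseAnalyzerToy {

    /** Lowercase and split on whitespace (stands in for tokenizer + lowercase filter). */
    static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        for (String w : text.trim().toLowerCase().split("\\s+")) {
            if (!w.isEmpty()) out.add(w);
        }
        return out;
    }

    /** Split one token on hyphens (a crude stand-in for WDF word parts). */
    static List<String> parts(String token) {
        List<String> out = new ArrayList<>();
        for (String p : token.split("-")) {
            if (!p.isEmpty()) out.add(p);
        }
        return out;
    }

    /** Index side: keep the original token and also its parts. */
    static Set<String> indexTokens(String text) {
        Set<String> out = new LinkedHashSet<>();
        for (String w : words(text)) {
            out.add(w);             // approximates preserveOriginal="1"
            out.addAll(parts(w));   // approximates generateWordParts="1"
        }
        return out;
    }

    /** Normal query side: only the parts, the original token is not kept. */
    static Set<String> queryTokens(String text) {
        Set<String> out = new LinkedHashSet<>();
        for (String w : words(text)) {
            out.addAll(parts(w));
        }
        return out;
    }

    /** Phrase side: no splitting at all, so punctuation must match exactly. */
    static Set<String> phraseTokens(String text) {
        return new LinkedHashSet<>(words(text));
    }

    /** Ids of docs whose indexed tokens contain every wanted token. */
    static List<Integer> search(Map<Integer, String> docs, Set<String> wanted) {
        List<Integer> hits = new ArrayList<>();
        for (Map.Entry<Integer, String> e : docs.entrySet()) {
            if (indexTokens(e.getValue()).containsAll(wanted)) hits.add(e.getKey());
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<Integer, String> docs = new LinkedHashMap<>();
        docs.put(3, "foo bar");
        docs.put(4, "foo-bar");
        System.out.println(search(docs, queryTokens("foo-bar")));   // prints [3, 4]
        System.out.println(search(docs, phraseTokens("foo-bar")));  // prints [4]
    }
}
```

The real behaviour depends on token positions and the rest of the filter chain, so treat this as intuition for the test above rather than a description of the patch.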


[jira] [Commented] (SOLR-2477) add analyzer type="phrase"

Michael Gibney (Jira)
In reply to this post by Michael Gibney (Jira)

    [ https://issues.apache.org/jira/browse/SOLR-2477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13024954#comment-13024954 ]

Yonik Seeley commented on SOLR-2477:
------------------------------------

bq. Well in my example, they would get matches for things that WDF normally splits, but only if the punctuation is exactly as they entered it

Ah, I had missed the "preserveOriginal" on the index analyzer.


[jira] [Commented] (SOLR-2477) add analyzer type="phrase"

Michael Gibney (Jira)
In reply to this post by Michael Gibney (Jira)

    [ https://issues.apache.org/jira/browse/SOLR-2477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13024958#comment-13024958 ]

Robert Muir commented on SOLR-2477:
-----------------------------------

Yeah, still even then, if we want something for the example, maybe it's enough to just exclude the SynonymFilter?


[jira] [Commented] (SOLR-2477) add analyzer type="phrase"

Michael Gibney (Jira)
In reply to this post by Michael Gibney (Jira)

    [ https://issues.apache.org/jira/browse/SOLR-2477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13050642#comment-13050642 ]

Hoss Man commented on SOLR-2477:
--------------------------------

At first glance this looks great to me ... but we should seriously consider whether FieldQParser should also be using getPhraseAnalyzer.  I think given the semantics the answer is "yes" -- but either way it should be clearly documented.

We should also make sure analysis.jsp and the Analysis RequestHandler(s?) have options for using this.



[jira] [Commented] (SOLR-2477) add analyzer type="phrase"

Michael Gibney (Jira)
In reply to this post by Michael Gibney (Jira)

    [ https://issues.apache.org/jira/browse/SOLR-2477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13050646#comment-13050646 ]

Robert Muir commented on SOLR-2477:
-----------------------------------

{quote}
but we should seriously consider whether FieldQParser should also be using getPhraseAnalyzer.
{quote}

Looking at how this is described, it seems to me it should use the phrase analyzer... we can document that it does this, and of course the change is backwards compatible (because if you don't define it, your query analyzer is used).

{quote}
we should also make sure analysis.jsp and the Analysis RequestHandler(s?) have options for using this.
{quote}

I agree... hopefully this isn't too bad.


[jira] [Commented] (SOLR-2477) add analyzer type="phrase"

Michael Gibney (Jira)
In reply to this post by Michael Gibney (Jira)

    [ https://issues.apache.org/jira/browse/SOLR-2477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067437#comment-13067437 ]

Hoss Man commented on SOLR-2477:
--------------------------------

Having just looked at this code in SOLR-2663, I'm realizing that as we add more types of analyzers, we should really clean up the semantics of how analyzers w/o "type" attributes are treated, and how each of the analyzers defaults if it isn't specified.

Consider the following (contrived) example...

{code}
<fieldType name="hoss" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
   </analyzer>
   <analyzer type="index">
     <tokenizer class="solr.KeywordTokenizerFactory"/>
   </analyzer>
</fieldType>
{code}

Right now (on trunk and with this patch) that config will result in all of the analyzers (index/query[/phrase]) using KeywordTokenizerFactory because the type-less analyzer is ignored if there is an analyzer with type="index".  I don't think that makes much sense, and as we add more types of analyzers it makes even less sense -- an analyzer w/o a type attribute should really be the "default" for each other type.

I think we should change the overall flow to be (pseudo-code) ...

{code}

// exactly what is in the config
Analyzer defaultA = readAnalyzer(xpath("./analyzer[not(@type)]"));
Analyzer indexA = readAnalyzer(xpath("./analyzer[@type='index']"));
Analyzer queryA = readAnalyzer(xpath("./analyzer[@type='query']"));
Analyzer phraseA = readAnalyzer(xpath("./analyzer[@type='phrase']"));

if (null != defaultA) {
  // we have an explicit default
  if (null == indexA) indexA = defaultA;
  if (null == queryA) queryA = defaultA;
  if (null == phraseA) phraseA = defaultA;
} else {
  // implicit defaults, either historical or common sense
  if (null == queryA) queryA = indexA;
  if (null == phraseA) phraseA = queryA;
}
{code}
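
For anyone who wants to poke at the fallback rules in the pseudo-code, here is a minimal runnable sketch of the same flow. Plain strings stand in for Analyzer instances and null means "not declared in the config"; the class name AnalyzerDefaults and the method resolve are hypothetical, and nothing here reads a real schema.

```java
import java.util.Arrays;

/** Sketch of the proposed defaulting; strings stand in for Analyzer objects. */
final class AnalyzerDefaults {

    /** Returns {index, query, phrase} after applying the proposed fallbacks. */
    static String[] resolve(String defaultA, String indexA, String queryA, String phraseA) {
        if (defaultA != null) {
            // explicit default: fills in every type that was not declared
            if (indexA == null) indexA = defaultA;
            if (queryA == null) queryA = defaultA;
            if (phraseA == null) phraseA = defaultA;
        } else {
            // implicit defaults: query falls back to index, phrase to query
            if (queryA == null) queryA = indexA;
            if (phraseA == null) phraseA = queryA;
        }
        return new String[] {indexA, queryA, phraseA};
    }

    public static void main(String[] args) {
        // The contrived "hoss" fieldType: a type-less whitespace analyzer
        // plus an explicit type="index" keyword analyzer.
        String[] r = resolve("whitespace", "keyword", null, null);
        System.out.println(Arrays.toString(r)); // prints [keyword, whitespace, whitespace]
    }
}
```

With these rules the contrived example would index with the keyword analyzer but query (and phrase-query) with the whitespace one, instead of using the keyword analyzer everywhere.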

[jira] [Commented] (SOLR-2477) add analyzer type="phrase"

Michael Gibney (Jira)
In reply to this post by Michael Gibney (Jira)

    [ https://issues.apache.org/jira/browse/SOLR-2477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067441#comment-13067441 ]

Robert Muir commented on SOLR-2477:
-----------------------------------

+1

If we decide to implement this or SOLR-219 via 'types of analyzers', I don't want to think of all the combinations if we do it any other way.

I would even go so far as to say: don't call it defaultA, but instead globalA, and if you declare this thing and then also declare some specific analyzer, we throw an exception.
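
A rough sketch of what that stricter rule might look like, again with strings standing in for Analyzer instances and null meaning "not declared". The names GlobalAnalyzerRule and resolve are hypothetical, the exception type is my own choice, and the fallbacks for the no-global case are simply carried over from the pseudo-code earlier in the thread as an assumption.

```java
import java.util.Arrays;

/** Sketch of the "global analyzer excludes typed analyzers" idea. */
final class GlobalAnalyzerRule {

    /** Returns {index, query, phrase}, or throws if global and typed analyzers are mixed. */
    static String[] resolve(String globalA, String indexA, String queryA, String phraseA) {
        boolean anyTyped = indexA != null || queryA != null || phraseA != null;
        if (globalA != null) {
            if (anyTyped) {
                // mixing a type-less analyzer with typed ones is rejected outright
                throw new IllegalArgumentException(
                    "a type-less (global) analyzer cannot be combined with typed analyzers");
            }
            return new String[] {globalA, globalA, globalA};
        }
        // no global analyzer: assumed fallbacks, query -> index, phrase -> query
        if (queryA == null) queryA = indexA;
        if (phraseA == null) phraseA = queryA;
        return new String[] {indexA, queryA, phraseA};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(resolve("whitespace", null, null, null)));
        try {
            resolve("whitespace", "keyword", null, null);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The upside is that there is only one combination to reason about per field type: either a single global analyzer, or a set of typed ones with the usual fallbacks.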
