Is "solr.AnalyzerName" expansion supposed to work for Analyzers?

Alexandre Rafalovitch
I can't see a reason why it should be different, but:

This works:
    <fieldType name="text_basic" class="solr.TextField">
        <analyzer>
            <tokenizer class="solr.LowerCaseTokenizerFactory" />
        </analyzer>
    </fieldType>

This does not:
    <fieldType name="text_basic" class="solr.TextField">
        <analyzer class="solr.SimpleAnalyzer"/>
    </fieldType>

This does work again:
    <fieldType name="text_basic" class="solr.TextField">
        <analyzer class="org.apache.lucene.analysis.core.SimpleAnalyzer"/>
    </fieldType>

Both LowerCaseTokenizerFactory and SimpleAnalyzer are in the same package.

Is this a bug or some sort of legacy decision?

Regards,
   Alex.
----
Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

RE: Is "solr.AnalyzerName" expansion supposed to work for Analyzers?

Uwe Schindler
Hello Alexandre,

> I can't see a reason why it should be different, but:
>
> This works
>     <fieldType name="text_basic" class="solr.TextField">
>         <analyzer>
>             <tokenizer class="solr.LowerCaseTokenizerFactory" />
>         </analyzer>
>    </fieldType>
>
> This does not:
>     <fieldType name="text_basic" class="solr.TextField">
>         <analyzer class="solr.SimpleAnalyzer"/>
>     </fieldType>
>
> This does work again:
>     <fieldType name="text_basic" class="solr.TextField">
>         <analyzer class="org.apache.lucene.analysis.core.SimpleAnalyzer"/>
>     </fieldType>
>
> Both LowerCaseTokenizerFactory and SimpleAnalyzer are in the same
> package.
>
> Is this a bug or some sort of legacy decision?

There is a long history behind that and there is also a *fundamental* difference between the factories used for building custom analyzers in XML code and just referring to an Analyzer!

Let me start with some history: From the very beginning Solr had the concept of factories, so implementation classes are initialized from a map of properties given in the XML. Those factories were specified by their Java binary class name ("org.apache.solr.foo.bar.MyFactory"). This is used in many places in Solr. The problem is that those class names could be quite long, so the SolrResourceLoader has a "hack" to allow short names (which was, IMHO, a horrible decision). When it sees a class name starting with "solr.", it tries to look it up in several candidate packages. See code here: https://goo.gl/P24ZU3 (subpackages is generally a list like "o.a.solr.something",...).
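
The short-name lookup described above can be sketched roughly as follows. This is a minimal illustration, not the actual SolrResourceLoader code: the class `ShortNameResolver`, its `resolve` method and its parameters are hypothetical, and JDK packages stand in for the Solr package list so that the sketch runs without Solr on the classpath.

```java
import java.util.List;

/** Sketch of a "solr."-style short-name lookup; NOT the real SolrResourceLoader. */
final class ShortNameResolver {

    /**
     * Resolves a class name. Names starting with shortPrefix are tried against
     * basePackage + each sub-package in order; all other names are loaded as-is.
     */
    static Class<?> resolve(String name, String shortPrefix, String basePackage,
                            List<String> subPackages) throws ClassNotFoundException {
        if (!name.startsWith(shortPrefix)) {
            return Class.forName(name); // fully qualified name: no guessing
        }
        String simpleName = name.substring(shortPrefix.length());
        for (String sub : subPackages) {
            try {
                return Class.forName(basePackage + sub + simpleName);
            } catch (ClassNotFoundException e) {
                // not in this sub-package, try the next candidate
            }
        }
        throw new ClassNotFoundException("No candidate package contains " + name);
    }

    public static void main(String[] args) throws Exception {
        // Assumption: JDK packages stand in for "org.apache.solr." and its sub-packages.
        List<String> subs = List.of("lang.", "util.", "io.");
        System.out.println(resolve("solr.ArrayList", "solr.", "java.", subs).getName());
        System.out.println(resolve("java.util.HashMap", "solr.", "java.", subs).getName());
    }
}
```

The point of the sketch is that the short name only resolves if the class lives in one of the listed packages, which is why a class outside that list (such as a Lucene Analyzer) needs its full name.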

In the early days (before Lucene/Solr 4.0), those factories were *all* part of Solr, so the lookup with the "solr." short name prefix was easy and the subpackages list was short. So it "just worked" and many people had those class names in their config files.

The Analyzers (2nd example) were always referred to by their full name, because they were part of Lucene and not Solr. Using a "solr." short name was never possible because of that.

Now a change in 4.0 comes into play: To make building "custom" analyzers easier for non-Solr users, and to make the whole concept easier to maintain, the factories for tokenstream components were moved out of Solr into Lucene (https://issues.apache.org/jira/browse/LUCENE-2510). The analysis parts got new package names below the Lucene namespace. The effect of this would have been that everyone had to change their config files, because the "solr." shortcut won't work with Lucene classes.

Now you might ask why the "solr." prefix still works? The reason is a second fundamental change in Lucene 4. We no longer use class names in Lucene to refer to things like Codecs and PostingsFormats; we use the Java concept of SPI (service provider interface). All components get a name, and the implementation class is not exposed to the outside. Like with Codecs, where you use Codec.forName("Lucene70") to instantiate one, the same was done for TokenStream components. This makes it possible to create a StandardTokenizerFactory using the following code: TokenizerFactory.forName("standard"). Or a LowerCaseFilterFactory with TokenFilterFactory.forName("lowercase"). There is no such concept for Analyzers (no SPI) [this explains your original question].
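
To illustrate the idea of looking components up by symbolic name rather than by class name, here is a small self-contained sketch. It is not Lucene code: `NamedFactoryRegistry`, `TokenizerFactorySketch` and the explicit `register` calls are hypothetical stand-ins. To my knowledge, Lucene discovers its real factories through Java's service-provider files under META-INF/services rather than through manual registration.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.function.Supplier;

/** Minimal name-to-implementation registry; a hypothetical sketch, not a Lucene class. */
final class NamedFactoryRegistry {

    /** Hypothetical stand-in for a tokenizer factory. */
    interface TokenizerFactorySketch {
        String describe();
    }

    private final Map<String, Supplier<TokenizerFactorySketch>> byName = new TreeMap<>();

    /** Registers an implementation under a case-insensitive symbolic name. */
    void register(String name, Supplier<TokenizerFactorySketch> supplier) {
        byName.put(name.toLowerCase(Locale.ROOT), supplier);
    }

    /** Looks up a factory by symbolic name; callers never see the class name. */
    TokenizerFactorySketch forName(String name) {
        Supplier<TokenizerFactorySketch> s = byName.get(name.toLowerCase(Locale.ROOT));
        if (s == null) {
            throw new IllegalArgumentException(
                "No factory named '" + name + "'; available: " + byName.keySet());
        }
        return s.get();
    }

    /** Read-only view of all registered symbolic names. */
    Set<String> availableNames() {
        return Collections.unmodifiableSet(byName.keySet());
    }

    public static void main(String[] args) {
        NamedFactoryRegistry registry = new NamedFactoryRegistry();
        // The outer lambda is the Supplier, the inner lambda implements describe().
        registry.register("standard", () -> () -> "standard tokenizer factory");
        registry.register("whitespace", () -> () -> "whitespace tokenizer factory");
        System.out.println(registry.forName("whitespace").describe());
        System.out.println(registry.availableNames());
    }
}
```

Since nothing equivalent is registered for Analyzers, a lookup by name has nothing to find there, and only the full class name works.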

Now we have the two pieces to put together: the refactoring of class names and the addition of the SPI concept. The "correct" fix in Solr would have been to remove the "class=" attribute from the analysis components in the fieldType and replace it with something called "name" or "type", so the XML would look like this (https://goo.gl/Dr3gpO):

<fieldType name="something" class="solr.TextField">
   <analyzer>
      <tokenizer name="whitespace" />
   </analyzer>
</fieldType>

This is similar to the examples in the documentation of CustomAnalyzer, the Lucene class that builds Analyzers from those SPI names: https://lucene.apache.org/core/6_2_0/analyzers-common/org/apache/lucene/analysis/custom/CustomAnalyzer.html

The above syntax is wonderful, but again this caused lots of complaints from Solr developers that people would be unable to understand this WTF :-) It may also have to do with those short names looking more like <add competitor's name here> analysis component names... (no idea, although it's completely unrelated). The issue with more history is here: https://issues.apache.org/jira/browse/LUCENE-4044

Because of that, a second hack was added so that all schema.xml files kept working as before (in LUCENE-4044). This hack is the only way to configure tokenstream components to this day, which is a disaster, IMHO! The hack is a fancy regular expression that tries to convert the old "solr.FoobarTokenFilterFactory" to the nicely readable "names" like above: https://goo.gl/mtWmjm
The factory is then loaded using SPI: https://goo.gl/EwDtQr
IMHO, the hack should be deprecated and removed, and the new syntax, as described above, should be introduced.
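
As a rough illustration of the kind of conversion this regular expression performs, here is a simplified sketch. The pattern, the class `LegacyNameMapper` and the method `toSymbolicName` are assumptions made for this example and are not copied from the Solr source; the real expression also handles other prefixes.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: map legacy "solr.XxxFactory" class names to symbolic SPI-style names. */
final class LegacyNameMapper {

    // Assumed shape: "solr." + component name + a known factory suffix.
    private static final Pattern LEGACY = Pattern.compile(
        "solr\\.([\\p{L}_$][\\p{L}\\p{N}_$]*?)(TokenFilter|Filter|Tokenizer|CharFilter)Factory");

    /** Returns the lowercased component name, or empty if the name is not a legacy factory name. */
    static Optional<String> toSymbolicName(String className) {
        Matcher m = LEGACY.matcher(className);
        return m.matches()
            ? Optional.of(m.group(1).toLowerCase(Locale.ROOT))
            : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(toSymbolicName("solr.LowerCaseTokenizerFactory"));
        System.out.println(toSymbolicName("solr.FoobarTokenFilterFactory"));
        // An Analyzer name has no factory suffix, so nothing is derived from it.
        System.out.println(toSymbolicName("solr.SimpleAnalyzer"));
    }
}
```

In this sketch "solr.SimpleAnalyzer" simply does not match, which mirrors why the short form from the original question cannot be resolved this way.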

Analyzer class names would still be *full* class names (and will surely stay that way, as they are seldom used in Solr). There is no way to change that!

Now you have a bit of history and you might see that there is absolutely no relationship between the class name / package name and the configured "class" in schema.xml. In fact, the thing above cannot be fixed. Instead, the issue mentioned before should finally be fixed and the "class" attribute in token stream components be deprecated and removed and the above "name" (or maybe "type") syntax be used.

Uwe


Re: Is "solr.AnalyzerName" expansion supposed to work for Analyzers?

Alexandre Rafalovitch
Wow Uwe,

Thanks for the treatise. That's an interesting discussion, but I
wonder if anything has changed since?

In terms of user-confusion/migration, we now have managed schema and
can probably rewrite from 'solr.x' to symbolic names on first use. That,
of course, requires some sort of registry of those names, which I am
not sure exists (apart from my own solr-start.com hacks). But
then the registry may well align with some other configuration
reporting by the components. And with plugins/library jars.

I am also wondering if the objection is still valid that other
components in Solr (such as search components) are still not able to
move to SPI? I am especially curious if any of that was affected by
Noble's work on having libraries loaded into Solr's special
collection. What is the mechanism used there to load things?

But yes, I can see it is a big topic. I may just update the
documentation and examples to mention that Analyzers have to use
full class names when I get to it.

Regards,
   Alex.

Re: Is "solr.AnalyzerName" expansion supposed to work for Analyzers?

Uwe Schindler
Hi,

The registry is there. To get all symbolic names of analyzer components on the classpath, use the static XxxFactory.availableXxx() methods.

I don't think it makes sense to replace all factories in Solr with named SPIs. But I'd suggest adding the type or name attribute to analysis components and promoting it. The class attribute could still be used like now, but would log a warning if it was misused to load an SPI. If it refers to a real class, all is fine.

Uwe

--
Uwe Schindler
H.-H.-Meier-Allee 63, 28213 Bremen
http://www.thetaphi.de

RE: Is "solr.AnalyzerName" expansion supposed to work for Analyzers?

Uwe Schindler
To add,

the managed schema really makes it easy to "rewrite". My plan would be:

- Add a new "type" or "name" attribute to schema.xml, as an alternative to the "class" attribute
- When a managed schema is loaded, classes are resolved using the hack as now. Warnings are printed as said before.
- The managed schema is then changed to switch to the new attribute (there is a getter to get the symbolic name from the factory, so rewriting is easy)
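
The rewrite step in this plan could look roughly like the following sketch. Everything in it is an assumption for illustration: the class `SchemaAttributeRewriter`, the choice of "name" as the new attribute, and deriving the symbolic name with a regular expression instead of the factory getter mentioned above. A real implementation would more likely work on the parsed schema than on raw XML text.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: rewrite legacy class="solr.XxxFactory" attributes to name="xxx". */
final class SchemaAttributeRewriter {

    // Only analysis component elements are touched; fieldType and analyzer are left alone.
    private static final Pattern LEGACY_ATTR = Pattern.compile(
        "(<(?:tokenizer|filter|charFilter)\\s+)class=\"solr\\."
        + "([\\p{L}_$][\\p{L}\\p{N}_$]*?)(?:TokenFilter|Filter|Tokenizer|CharFilter)Factory\"");

    static String rewrite(String schemaXml) {
        Matcher m = LEGACY_ATTR.matcher(schemaXml);
        // quoteReplacement guards against '$' or '\' being read as replacement syntax.
        return m.replaceAll(r -> Matcher.quoteReplacement(
            r.group(1) + "name=\"" + r.group(2).toLowerCase(Locale.ROOT) + "\""));
    }

    public static void main(String[] args) {
        String legacy =
            "<fieldType name=\"text_basic\" class=\"solr.TextField\">\n"
            + "  <analyzer>\n"
            + "    <tokenizer class=\"solr.LowerCaseTokenizerFactory\" />\n"
            + "  </analyzer>\n"
            + "</fieldType>";
        System.out.println(rewrite(legacy));
    }
}
```

Full Analyzer class names and the fieldType's own class attribute pass through unchanged, in line with the point that Analyzers keep using full class names.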

In addition, this simplifies usage: a GUI could show a drop-down list for clicking together an analyzer. We just need to add a schema REST endpoint to get all names.

Maybe open an issue targeted for 6.x / 7.0. I'd be happy to help to fix this, although I could only do the SolrResourceLoader and SolrAnalyzer stuff.

Uwe

Re: Is "solr.AnalyzerName" expansion supposed to work for Analyzers?

david.w.smiley@gmail.com
In reply to this post by Uwe Schindler
Thanks for this detailed answer.

On Sat, Sep 10, 2016 at 3:24 AM Uwe Schindler <[hidden email]> wrote:
Hallo Alexandre,

> I can't see a reason why it should be different, but:
>
> This works
>     <fieldType name="text_basic" class="solr.TextField">
>         <analyzer>
>             <tokenizer class="solr.LowerCaseTokenizerFactory" />
>         </analyzer>
>    </fieldType>
>
> This does not:
>     <fieldType name="text_basic" class="solr.TextField">
>         <analyzer class="solr.SimpleAnalyzer"/>
>     </fieldType>
>
> This does work again:
>     <fieldType name="text_basic" class="solr.TextField">
>         <analyzer class="org.apache.lucene.analysis.core.SimpleAnalyzer"/>
>     </fieldType>
>
> Both LowerCaseTokenizerFactory and SimpleAnalyzer are in the same
> package.
>
> Is this a bug or some sort of legacy decision?

There is a long history behind that and there is also a *fundamental* difference between the factories used for building custom analyzers in XML code and just referring to an Analyzer!

Let me start with some history: From the very beginning there was the concept of factories in Solr, so implementation classes are initialized from a map of properties given in the XML. Those factories were specified by Java binary class name ("org.apache.solr.foo.bar.MyFactory"). This is used in many places in Solr. The problem is that those class names can be quite long, so the SolrResourceLoader has a "hack" to allow short names (which, IMHO, was a horrible decision). When it sees a class name starting with "solr.", it tries to look it up in several candidate packages. See code here: https://goo.gl/P24ZU3 (subpackages is generally a list like "o.a.solr.something",...).
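
To make that lookup concrete, here is a minimal sketch of this kind of prefix expansion. It is not the SolrResourceLoader code: the class name ShortNameSketch, the resolve helper, and the use of JDK packages as stand-ins for the subpackages list are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; not the actual SolrResourceLoader implementation.
class ShortNameSketch {
    static final String PREFIX = "solr.";

    // Expands a "solr."-prefixed short name by trying each candidate package in order.
    // Names without the prefix are treated as full class names.
    static Class<?> resolve(String name, List<String> subPackages)
            throws ClassNotFoundException {
        if (!name.startsWith(PREFIX)) {
            return Class.forName(name);
        }
        String simpleName = name.substring(PREFIX.length());
        for (String pkg : subPackages) {
            try {
                return Class.forName(pkg + simpleName);
            } catch (ClassNotFoundException e) {
                // Not in this package; try the next candidate.
            }
        }
        throw new ClassNotFoundException("Cannot expand short name: " + name);
    }

    public static void main(String[] args) throws Exception {
        // JDK packages stand in for the "o.a.solr.something" list (assumption).
        List<String> subPackages = Arrays.asList("java.lang.", "java.util.");
        System.out.println(resolve("solr.ArrayList", subPackages).getName());    // java.util.ArrayList
        System.out.println(resolve("java.util.HashMap", subPackages).getName()); // java.util.HashMap
    }
}
```

A short name only resolves if the class lives in one of the listed packages, which is why a class outside that list needs its full name.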

In the early days (before Lucene/Solr 4.0), those factories were *all* part of Solr, so the lookup with the "solr." short name prefix was easy and the subpackages list was short. So it "just worked" and many people had those class names in their config files.

The Analyzers (2nd example) were always referred to by their full name, because they were part of Lucene and not Solr. Using a "solr." short name was never possible because of that.

Now a change in 4.0 comes into play: To make the concept of building "custom" analyzers easier to use for non-Solr users, and to make the whole concept easier to maintain, the factories for tokenstream components were moved out of Solr into Lucene (https://issues.apache.org/jira/browse/LUCENE-2510). The analysis parts got new package names below the Lucene namespace. The effect of this would have been that everyone had to change their config files, because the "solr." shortcut won't work with Lucene classes.

Now you might ask why the "solr." prefix still works. The reason is a second fundamental change in Lucene 4: we no longer use class names in Lucene to refer to things like Codecs and PostingsFormats; we use the Java concept of SPI. Every component gets a name, and the implementation class is not exposed to the outside. Just as you use Codec.forName("Lucene70") to instantiate a codec, the same was done for TokenStream components. This makes it possible to create a StandardTokenizerFactory with TokenizerFactory.forName("standard"), or the factory for LowerCaseFilter with TokenFilterFactory.forName("lowercase"). There is no such concept for Analyzers (no SPI) [this explains your original question].

Now we have the two pieces to put together: the refactoring of class names and the addition of the SPI concept. The "correct" fix in Solr would have been to remove the "class=" attribute from the analysis components in the fieldType and replace it with something called "name" or "type", so the XML would look like this (https://goo.gl/Dr3gpO):

<fieldType name="something" class="solr.TextField">
   <analyzer>
      <tokenizer name="whitespace" />
   </analyzer>
</fieldType>

Similar to those examples of the corresponding class to build Analyzers from those SPI names in Lucene: https://lucene.apache.org/core/6_2_0/analyzers-common/org/apache/lucene/analysis/custom/CustomAnalyzer.html

The above syntax is wonderful, but again this caused lots of complaints from Solr developers that people would be unable to understand this WTF :-) It may also have to do with the fact that those short names look more like <add competitor's name here> analysis component names... (no idea, although it's completely unrelated). The issue with more history is here: https://issues.apache.org/jira/browse/LUCENE-4044

Because of that, a second hack was added so that all schema.xml files worked like before (in LUCENE-4044). This hack is the only way to configure tokenstream components up to this day - which is a disaster, IMHO! The hack is a fancy regular expression that tries to convert the old "solr.FoobarTokenFilterFactory" to the nicely readable "names" like above: https://goo.gl/mtWmjm
The factory is then loaded using SPI: https://goo.gl/EwDtQr
IMHO, the hack should be deprecated and removed and the new syntax, as described above, should be introduced.
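
As a rough illustration of that conversion, the following sketch maps a legacy "solr." factory class name to a lowercase SPI-style name. The pattern is a simplified assumption and not the regular expression behind the link; LegacyNameSketch and toSpiName are hypothetical names.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the legacy-name conversion; the real pattern in Solr differs.
class LegacyNameSketch {
    // "solr." + component name + one of the factory suffixes.
    // The reluctant quantifier keeps the captured name as short as possible,
    // so "FoobarTokenFilterFactory" yields "Foobar" rather than "FoobarToken".
    private static final Pattern LEGACY = Pattern.compile(
            "solr\\.([A-Za-z_$][A-Za-z0-9_$]*?)"
            + "(TokenFilter|Filter|Tokenizer|CharFilter)Factory");

    // Returns the lowercase SPI-style name, or empty if the name is not a legacy factory name.
    static Optional<String> toSpiName(String className) {
        Matcher m = LEGACY.matcher(className);
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(m.group(1).toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(toSpiName("solr.LowerCaseTokenizerFactory")); // Optional[lowercase]
        System.out.println(toSpiName("solr.FoobarTokenFilterFactory"));  // Optional[foobar]
        System.out.println(toSpiName("solr.SimpleAnalyzer"));            // Optional.empty
    }
}
```

Note that "solr.SimpleAnalyzer" does not match at all: it has no factory suffix, so there is no SPI name to derive from it.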

Analyzer class names would still be *full* class names (and will for sure stay like that, as they are seldom used in Solr). There is no way to change that!

Now you have a bit of history and you might see that there is absolutely no relationship between the class name / package name and the configured "class" in schema.xml. In fact, the thing above cannot be fixed. Instead, the issue mentioned before should finally be fixed and the "class" attribute in token stream components be deprecated and removed and the above "name" (or maybe "type") syntax be used.

Uwe



--
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker

Re: Is "solr.AnalyzerName" expansion supposed to work for Analyzers?

Malcolm Upayavira Holmes
In reply to this post by Uwe Schindler
On Sat, 10 Sep 2016, at 04:03 PM, Uwe Schindler wrote:

> To add,
>
> the managed schema really makes it easy to "rewrite". My plan would be:
>
> - Add a new "type" or "name" attribute to schema.xml, which is contrary
> to "class" attribute usage
> - When a managed schema is loaded, the resolving of classes using the
> hack is done as it is now. Warnings are printed as said before.
> - The managed schema is then changed to switch to the new attribute
> (there is a getter to get the symbolic name from the factory, so
> rewriting is easy)
>
> In addition, this simplifies usage: Some GUI could show a dropdown list
> for clicking together the analyzer. We just need to add a schema-REST
> endpoint to get all names.
>
> Maybe open an issue targeted for 6.x / 7.0. I'd be happy to help to fix
> this, although I could only do the SolrResourceLoader and SolrAnalyzer
> stuff.

 Not knowing how to get a list of acceptable components was the thing
 that stopped me adding that part of the schema API to the admin UI. An
 API to tell you which components exist would be extremely helpful.

Upayavira


RE: Is "solr.AnalyzerName" expansion supposed to work for Analyzers?

Uwe Schindler
Let's open an issue to do what I proposed! After that you could add the schema editor GUI.

I think Robert already proposed back then to add an abstract method to each factory that returns the acceptable parameter names. One could then select the component from the set of SPI names and, once the component is chosen, retrieve its acceptable configuration parameters from the instance.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [hidden email]


RE: Is "solr.AnalyzerName" expansion supposed to work for Analyzers?

Uwe Schindler
Hi,

In addition, this change (to "name" or "type" in the components) would make it possible to remove Steve Rowe's hack in AbstractAnalysisFactory that keeps the class name in the parameter map for serialization, which is Solr-specific and should not be there! With the "official" names, this is no longer needed and Solr could simply serialize the name. This hack has hurt me several times already!

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [hidden email]


Re: Is "solr.AnalyzerName" expansion supposed to work for Analyzers?

Alexandre Rafalovitch
I feel the whole issue might be somewhat above my current understanding
of the code, but I would be happy to do the grunt work for the
factories to self-describe their parameters. I think that would be
useful in multiple ways. I was already looking at perhaps using MBean
describers for that, as that allows specifying types, acceptable
values, etc.

Regards,
   Alex.
----
Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/


RE: Is "solr.AnalyzerName" expansion supposed to work for Analyzers?

Uwe Schindler
Hi,

The analysis factories are pure Lucene code, please don't add any MBean stuff!

In fact it's enough to return the names of the parameters; the types are simple: always String, because the arguments are a Map<String,String>.
This may not be what you had in mind, but the values are converted to their internal types in the constructor when the factory is created.
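
A minimal sketch of what such a self-describing factory could look like, assuming a hypothetical acceptedParameterNames() method. None of the types below exist in Lucene; they only illustrate string arguments being converted in the constructor and the parameter names being exposed afterwards.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; none of these types exist in Lucene.
class ParamSketch {
    abstract static class SketchFactory {
        // Proposed addition: every factory reports the parameter names it accepts.
        abstract Set<String> acceptedParameterNames();
    }

    static final class LengthFactory extends SketchFactory {
        private static final Set<String> NAMES =
                new LinkedHashSet<>(Arrays.asList("min", "max"));

        final int min;
        final int max;

        // All arguments arrive as strings and are converted here, in the constructor.
        LengthFactory(Map<String, String> args) {
            Map<String, String> remaining = new HashMap<>(args);
            String minValue = remaining.remove("min");
            String maxValue = remaining.remove("max");
            this.min = minValue == null ? 0 : Integer.parseInt(minValue);
            this.max = maxValue == null ? Integer.MAX_VALUE : Integer.parseInt(maxValue);
            if (!remaining.isEmpty()) {
                throw new IllegalArgumentException("Unknown parameters: " + remaining.keySet());
            }
        }

        @Override
        Set<String> acceptedParameterNames() {
            return NAMES;
        }
    }

    public static void main(String[] args) {
        Map<String, String> config = new HashMap<>();
        config.put("min", "2");
        config.put("max", "10");
        LengthFactory factory = new LengthFactory(config);
        System.out.println(factory.acceptedParameterNames()); // [min, max]
        System.out.println(factory.min + ".." + factory.max); // 2..10
    }
}
```

A GUI or REST endpoint would only need the names from such a method, since every value is entered as a string anyway.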

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [hidden email]
