PreAnalyzed URP and SchemaRequest API

PreAnalyzed URP and SchemaRequest API

Markus Jelsma-2
Hello,

We intend to move to the PreAnalyzed URP for analysis offloading. Browsing the Javadocs, I came across the SchemaRequest API while looking for a way to get a Field object remotely, which I seem to need for JsonPreAnalyzedParser.toFormattedString(Field f). But all I can get from the SchemaRequest API is a FieldTypeRepresentation, which offers me getIndexAnalyzer() but won't allow me to construct a Field object.

So, to analyze remotely I do need an index-time analyzer. I can get it, but not turn it into a Field object, which the PreAnalyzedParser for some reason wants.

Any hints here? I must be looking the wrong way.

Many thanks!
Markus

Re: PreAnalyzed URP and SchemaRequest API

david.w.smiley@gmail.com
Is this really a problem when you could easily enough create a TextField
and call setTokenStream?

Does your remote client have solr-core and all its dependencies on the
classpath? That's one way to do it, and presumably the direction you are
going, because you're asking how to work with PreAnalyzedParser, which is
in solr-core. *Alternatively*, only bring in Lucene core and construct
things yourself in the right format. You could copy PreAnalyzedParser into
your codebase so that you don't have to reinvent any wheels, even though
that's awkward. Perhaps that ought to be in SolrJ? But no, we don't want
SolrJ depending on lucene-core, though it'd make a fine "optional"
dependency.
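
A minimal sketch of the "construct things yourself in the right format" route, using only the JDK. It assumes the JSON pre-analyzed layout documented for PreAnalyzedField ("v" for the version, "str" for the stored text, and a "tokens" array whose entries carry "t", "s", "e" and "i" for term, start offset, end offset and position increment). The class `PreAnalyzedJsonSketch`, its `Token` type and the whitespace-splitting `analyze` method are hypothetical stand-ins; a real client would run its own analysis chain there.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration; not part of Solr, SolrJ or Lucene.
final class PreAnalyzedJsonSketch {

    /** One analyzed token: term text, character offsets, position increment. */
    static final class Token {
        final String term;
        final int start;
        final int end;
        final int posInc;

        Token(String term, int start, int end, int posInc) {
            this.term = term;
            this.start = start;
            this.end = end;
            this.posInc = posInc;
        }
    }

    /** Stand-in for the real analysis chain: whitespace split plus lower-casing. */
    static List<Token> analyze(String text) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        int n = text.length();
        while (i < n) {
            while (i < n && Character.isWhitespace(text.charAt(i))) i++;
            int start = i;
            while (i < n && !Character.isWhitespace(text.charAt(i))) i++;
            if (i > start) {
                // Offsets refer to the original text, not the lower-cased term.
                String term = text.substring(start, i).toLowerCase(Locale.ROOT);
                tokens.add(new Token(term, start, i, 1));
            }
        }
        return tokens;
    }

    /** Quotes a string as a JSON string literal. */
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    /** Serializes the stored text and its tokens into one JSON document. */
    static String toJson(String stored, List<Token> tokens) {
        StringBuilder sb = new StringBuilder();
        sb.append("{\"v\":\"1\",\"str\":").append(quote(stored)).append(",\"tokens\":[");
        for (int k = 0; k < tokens.size(); k++) {
            Token t = tokens.get(k);
            if (k > 0) sb.append(',');
            sb.append("{\"t\":").append(quote(t.term))
              .append(",\"s\":").append(t.start)
              .append(",\"e\":").append(t.end)
              .append(",\"i\":").append(t.posInc)
              .append('}');
        }
        return sb.append("]}").toString();
    }

    public static void main(String[] args) {
        String text = "Hello World";
        System.out.println(toJson(text, analyze(text)));
    }
}
```

The resulting string would be sent as the field value, with the receiving field or update processor configured for the JSON parser. Payloads, flags and token types are left out of this sketch.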

Replying to Markus Jelsma's message of Wed, Apr 4, 2018 at 4:53 AM.
--
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com

RE: PreAnalyzed URP and SchemaRequest API

Markus Jelsma-2
Hello David,

The remote client has everything on the classpath, but just calling setTokenStream is not going to work. Remotely, all I get from the SchemaRequest API is an AnalyzerDefinition. I haven't found any Solr code that allows me to transform that directly into an analyzer. If I had that, it would make things easy.

As far as I can see, I need to reconstruct a real Analyzer using the AnalyzerDefinition's information. It won't be a problem, but it is cumbersome.
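
A rough sketch of the first step of that reconstruction, using only the JDK. It assumes each tokenizer, filter and char filter arrives as a `Map<String, Object>` whose "class" entry names the factory (for example "solr.StopFilterFactory") and whose other entries are the factory arguments, as in Solr's schema JSON. The class `AnalysisComponentSpec` is hypothetical. The lower-cased short name it derives follows the naming convention Lucene's factory lookup used for the stock factories around the 7.x line; treat that as an assumption to verify against your Lucene version, and expect it not to hold for custom factories.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration; not part of Solr, SolrJ or Lucene.
final class AnalysisComponentSpec {

    // Longer suffixes first, so "CharFilterFactory" is not cut at "FilterFactory".
    private static final String[] SUFFIXES = {
        "CharFilterFactory", "TokenFilterFactory", "TokenizerFactory", "FilterFactory"
    };

    final String factoryClass;        // as found in the schema, e.g. "solr.StopFilterFactory"
    final String spiName;             // derived short name, e.g. "stop" (assumed convention)
    final Map<String, String> params; // remaining attributes, stringified

    private AnalysisComponentSpec(String factoryClass, String spiName, Map<String, String> params) {
        this.factoryClass = factoryClass;
        this.spiName = spiName;
        this.params = params;
    }

    /** Splits one definition map into the factory name and its string arguments. */
    static AnalysisComponentSpec fromDefinition(Map<String, Object> definition) {
        Object cls = definition.get("class");
        if (cls == null) {
            throw new IllegalArgumentException("definition has no 'class' attribute: " + definition);
        }
        String factoryClass = cls.toString();
        String simpleName = factoryClass.substring(factoryClass.lastIndexOf('.') + 1);
        String shortName = simpleName;
        for (String suffix : SUFFIXES) {
            if (simpleName.endsWith(suffix)) {
                shortName = simpleName.substring(0, simpleName.length() - suffix.length());
                break;
            }
        }
        // A mutable map of strings, since analysis factories take string arguments.
        Map<String, String> params = new HashMap<>();
        for (Map.Entry<String, Object> e : definition.entrySet()) {
            if (!"class".equals(e.getKey())) {
                params.put(e.getKey(), String.valueOf(e.getValue()));
            }
        }
        return new AnalysisComponentSpec(factoryClass, shortName.toLowerCase(Locale.ROOT), params);
    }

    public static void main(String[] args) {
        Map<String, Object> stop = new LinkedHashMap<>();
        stop.put("class", "solr.StopFilterFactory");
        stop.put("ignoreCase", Boolean.TRUE);
        stop.put("words", "stopwords.txt");
        AnalysisComponentSpec spec = fromDefinition(stop);
        System.out.println(spec.spiName + " " + spec.params);
    }
}
```

With the factory name and a `Map<String, String>` of arguments in hand, each component could be passed to something like the builder of Lucene's CustomAnalyzer (in the analysis-common module), which accepts exactly that pair for tokenizers, token filters and char filters. Resource files such as stopwords.txt would still have to be available on the client.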

Thanks anyway,
Markus
 
Replying to David Smiley's message sent Thursday 5th April 2018 19:38.

Re: PreAnalyzed URP and SchemaRequest API

david.w.smiley@gmail.com
Ah ok.
I've wondered how much value there is in pre-analysis.  The serialization
of the analyzed form in JSON is bulky.  If you can share any results, I'd
be interested to hear how it went.  It's an optimization so you should be
able to know how much better it is.  Of course it isn't for everybody --
only when the analysis chain is sufficiently complex.

Replying to Markus Jelsma's message of Mon, Apr 9, 2018 at 9:45 AM.

RE: PreAnalyzed URP and SchemaRequest API

Markus Jelsma-2
Hello David,

If JSON serialization is too bulky, we could also opt for SimplePreAnalyzed, right? At least as a FieldType it is possible; if not with the URP, it just needs some work.
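
For comparison, a small JDK-only sketch of the more compact simple format, assuming the `version =stored text= token,attr=value ...` layout described in the SimplePreAnalyzedParser Javadoc, with "s", "e" and "i" as the start offset, end offset and position increment attributes. The class `SimplePreAnalyzedSketch` and its `Token` type are hypothetical. The format's backslash escaping is deliberately not implemented: input containing a character that would need escaping is rejected instead.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of Solr, SolrJ or Lucene.
final class SimplePreAnalyzedSketch {

    /** One analyzed token: term text, character offsets, position increment. */
    static final class Token {
        final String term;
        final int start;
        final int end;
        final int posInc;

        Token(String term, int start, int end, int posInc) {
            this.term = term;
            this.start = start;
            this.end = end;
            this.posInc = posInc;
        }
    }

    /** Rejects stored text that would need escaping, which this sketch does not do. */
    private static void checkStored(String stored) {
        if (stored.indexOf('=') >= 0 || stored.indexOf('\\') >= 0) {
            throw new IllegalArgumentException("stored text needs escaping: " + stored);
        }
    }

    /** Rejects terms containing separators or the escape character. */
    private static void checkTerm(String term) {
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (Character.isWhitespace(c) || c == ',' || c == '=' || c == '\\') {
                throw new IllegalArgumentException("term needs escaping: " + term);
            }
        }
    }

    /** Serializes as: 1 =stored= term,s=..,e=..,i=.. term,s=..,e=..,i=.. */
    static String toSimple(String stored, List<Token> tokens) {
        checkStored(stored);
        StringBuilder sb = new StringBuilder("1 ");
        sb.append('=').append(stored).append('=');
        for (Token t : tokens) {
            checkTerm(t.term);
            sb.append(' ').append(t.term)
              .append(",s=").append(t.start)
              .append(",e=").append(t.end)
              .append(",i=").append(t.posInc);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Token> tokens = Arrays.asList(
            new Token("hello", 0, 5, 1),
            new Token("world", 6, 11, 1));
        System.out.println(toSimple("Hello World", tokens));
    }
}
```

For the same two tokens, this form is shorter than the equivalent JSON document because it drops the quoting and the repeated key names.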

Regarding results: we haven't done it yet, and won't for some time, but we will when we reintroduce OpenNLP in the analysis chain. We tried to introduce POS tagging on our own two years ago, but it wasn't suited for production because it was too heavy on the CPU. Indexing data suddenly took eight to ten times longer in a SolrCloud environment with three replicas.

If we offload our current chains without OpenNLP, it will only benefit us when large fields pass through a regex, and for decompounding the Germanic languages we ingest. Offloading just this cost is a micro-optimization; offloading the various OpenNLP char and token filters would be really beneficial.

Regarding a dependency on Lucene core and analysis-common, it would be helpful, but we'll manage.

Thanks again,
Markus
 
Replying to David Smiley's message sent Thursday 12th April 2018 19:16.

Re: PreAnalyzed URP and SchemaRequest API

david.w.smiley@gmail.com
Yes I could imagine big gains from this strategy if OpenNLP is in the
analysis chain ;-)

Replying to Markus Jelsma's message of Fri, Apr 13, 2018 at 5:01 PM.