Time-Routed Alias Not Distributing Wrongly Placed Docs

Time-Routed Alias Not Distributing Wrongly Placed Docs

John Nashorn
Hello Everyone,
I'm using "hive-solr" from Lucidworks to index my data into Solr (v7.5, cloud mode). As written in the Solr Manual, a time-routed alias (TRA) expects documents to be indexed through its alias name, not directly into the collections under it. Unfortunately, hive-solr doesn't accept a TRA name as an indexing target. So what I do is index the data into the first collection created by the TRA and expect Solr to distribute it to the appropriate collections under the hood. This works to some extent, but a large portion of the data stays where it was indexed, i.e. in the first collection of the TRA. For example (approximate numbers):

* coll_2018-07-01 => 800.000.000 docs
* coll_2018-08-01 => 0 docs
* coll_2018-09-01 => 0 docs
* coll_2018-10-01 => 150.000.000 docs
* coll_2018-11-01 => 0 docs

Here, coll_2018-07-01 contains data that should normally be in the other four collections.

Is there a way to make the TRA scan for (somewhat intentionally) misplaced data and move it to the correct collections?
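
To make "misplaced" concrete, here is a minimal sketch that tallies which documents found in one collection carry a timestamp belonging to a different monthly collection. It is only an illustration: the alias name `coll` is inferred from the collection names above, the sample timestamps are made up, and `expected_collection` and `tally_misplaced` are hypothetical helpers, not part of Solr or hive-solr.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable

ALIAS = "coll"  # assumed alias name, inferred from the collection names above


def expected_collection(ts: datetime) -> str:
    """Monthly collection that a document stamped `ts` would belong to."""
    return f"{ALIAS}_{ts:%Y-%m}-01"


def tally_misplaced(current: str, timestamps: Iterable[datetime]) -> Counter:
    """Count documents in `current` whose timestamp belongs to another collection."""
    wanted = (expected_collection(ts) for ts in timestamps)
    return Counter(name for name in wanted if name != current)


# Made-up sample standing in for timestamps read from coll_2018-07-01.
sample = [
    datetime(2018, 7, 3),
    datetime(2018, 8, 14),
    datetime(2018, 8, 30),
    datetime(2018, 11, 2),
]
misplaced = tally_misplaced("coll_2018-07-01", sample)
print(dict(misplaced))
```

In a real setup the timestamps would come from a query against the collection in question; that part is deliberately left out here.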

Re: Time-Routed Alias Not Distributing Wrongly Placed Docs

Jason Gerlowski
Hi John,

I'm not an expert on TRA, but I don't think so.  The TRA functionality
I'm familiar with involves creating and deleting underlying
collections and then routing documents based on that information.  As
far as I know that happens at the UpdateRequestProcessor level - once
your data is indexed there's nothing available to move it around.

Best,

Jason
On Tue, Nov 27, 2018 at 12:42 PM John Nashorn <[hidden email]> wrote:

>
> Hello Everyone,
> I'm using "hive-solr" from Lucidworks to index my data into Solr (v:7.5, cloud mode). As written in the Solr Manual, TRA expects documents to be indexed using its alias name, and not directly into the collections under it. Unfortunately, hive-solr doesn't allow using TRA names as indexing targets. So what I do is: I index data using the first collection created by TRA and expect Solr to distribute my data into its respective collection under the hood. This works to some extent, but a big portion of data stays in where they were indexed, ie. the first collection of the TRA. For example (approximate numbers):
>
> * coll_2018-07-01 => 800.000.000 docs
> * coll_2018-08-01 => 0 docs
> * coll_2018-09-01 => 0 docs
> * coll_2018-10-01 => 150.000.000 docs
> * coll_2018-11-01 => 0 docs
>
> Here, coll_2018-07-01 contains data that should normally be in the other four collections.
>
> Is there a way to make TRA scan (somehow intentionally) misplaced data and send them to their correct places?

Re: Time-Routed Alias Not Distributing Wrongly Placed Docs

Gus Heck
In reply to this post by John Nashorn
Hi John,

TRAs really do require that you index via the alias. Internally, the code
wraps the Distributed Update Processor with an additional processor that
handles the time routing when (and only when) the TRA alias is detected.
If the alias is not used, none of the TRA code runs (by design, for
performance). TRAs have no capability at all to re-assign docs once they
are indexed, since the process is data-driven during update only, with
no internal maintenance threads (again by design). It is not even
supported at this time to change, via atomic updates for example, the date
on which the document was routed. One would have to delete and re-index the
document (in that order, waiting for the delete to complete first!). Adding
some sort of "fixer thread" is not something that would make much sense,
since we don't ever want TRAs storing documents in the wrong place to
begin with.
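
The behavior described above can be sketched as a toy model: routing is decided only when the request names the alias, and relocating a document means deleting it first and then re-indexing it through the alias. This is an in-memory illustration under stated assumptions, not Solr code. The alias name `coll`, the monthly naming scheme, and the helpers `resolve_target` and `relocate` are all hypothetical.

```python
from datetime import datetime

ALIAS = "coll"  # assumed alias name; collections are assumed to be monthly


def monthly_collection(ts: datetime) -> str:
    """Name of the monthly collection for a document stamped `ts`."""
    return f"{ALIAS}_{ts:%Y-%m}-01"


def resolve_target(request_target: str, ts: datetime) -> str:
    """Toy model: time routing runs only when the request names the alias."""
    if request_target == ALIAS:
        return monthly_collection(ts)
    return request_target  # direct writes stay where they were sent


def relocate(store: dict, source: str, doc: dict) -> str:
    """Move one document: delete from `source` first, then re-index via the alias."""
    del store[source][doc["id"]]
    assert doc["id"] not in store[source]  # delete must be complete before re-adding
    target = resolve_target(ALIAS, doc["ts"])
    store.setdefault(target, {})[doc["id"]] = doc
    return target


doc = {"id": "a1", "ts": datetime(2018, 10, 15)}
store = {}  # in-memory stand-in for the collections: name -> {id: doc}

# Indexing straight into the first collection bypasses routing.
first = resolve_target("coll_2018-07-01", doc["ts"])
store.setdefault(first, {})[doc["id"]] = doc

moved_to = relocate(store, first, doc)
print(first, "->", moved_to)
```

Against a real cluster the delete and the re-index would be separate update requests, with the second issued only after the first has completed.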

TRAs are targeted at systems where new data items arrive regularly, can be
placed in the right place up front, and have an immutable timestamp
(typical for IoT readings, logs, or event-based data, for example).

I think you will probably need to follow up with Lucidworks to get them to
add a feature to allow TRAs as targets, if TRAs still sound like they fit
your use case (or pursue another solution without limitations on the
indexing target).

Frankly, it's a mystery to me how you even got any docs in the October
collection you list in your question. For anything to have been
distributed, it would have had to go through the alias. Also, how you have
more than one collection is a mystery, unless perhaps you manually inserted
a doc at some point to cause collection creation?

It's also worth noting that without the routing and maintenance features
tied to the alias, TRAs give very little benefit, and there are other ways
of solving this problem with external solutions. Dave, my co-presenter at
Activate 2018, talks about a couple of other options in the middle section
of our talk:
https://www.youtube.com/watch?v=RB1-7Y5NQeI&index=59&list=PLU6n9Voqu_1HW8-VavVMa9lP8-oF8Oh5t&t=0s

The part describing TRAs in detail starts at 14 min, and 17 to 23 min
discusses predecessors and alternatives.

-Gus

--
http://www.the111shift.com

Re: Time-Routed Alias Not Distributing Wrongly Placed Docs

John Nashorn
Hi Gus, thanks for the detailed answer. I've added my replies inline, between the quoted parts of your post.

On 2018/11/30 05:15:10, Gus Heck <[hidden email]> wrote:

> Hi John,
>
> TRA's really do require that you index via the alias. Internally the code
> is wrapping the Distributed Update Processor with an additional processor
> to handle the time routing when (and only when) the TRA alias is detected.
> If the alias is not used, none of the TRA code runs (by design, for
> performance). TRA's have no capability at all to re-assign docs once they
> are implemented since the process is data driven during update only, with
> no internal maintenance threads (again by design).  It is not even
> supported at this time to update the date on which the document was routed
> via atomic updates for example. One would have to delete and re-index the
> document (in that order, waiting for one to complete!) Adding some sort of
> "fixer thread" is not something that would make much sense, since we don't
> want to ever have the TRA's storing documents in the wrong place to
> begin with.
>
> TRA's are targeted at systems where new data items arrive regularly, can be
> placed in the right place correctly up front and the timestamp is immutable
> (typical for IOT readings, log or event based types of data for example).
>
> I think you will probably need to follow up with Lucidworks to get them to
> add a feature to allow TRA's as targets if TRA's still sound like they fit
> your use case. (or pursue another solution without limitations on the
> indexing target)
>

I know that I'm using TRA in a way it wasn't designed for, though my scenario would fit TRA perfectly if I were able to use the alias name with "hive-solr". I have reported the issue to the hive-solr devs: https://github.com/lucidworks/hive-solr/issues/63

>
> Frankly, it's a mystery to me how you even got any docs in the October
> collection you list in your question. For anything to have been
> distributed, it would have had to go through the alias. Also, how you have
> more than one collection is a mystery unless you manually inserted a doc at
> some point to cause collection creation perhaps?
>

Maybe my example confused you; I may have oversimplified it while trying to trim it down. Let me clarify things a little: my data ranges from 2013-01-01 to NOW and continues to grow. I created a TRA starting at 2013-01-01 that adds a new collection every month. I began indexing the data from newest to oldest. Since hive-solr threw an NPE when used against the TRA name, I was sending data to an external table created for the 2013-01-01 collection. When the first document was indexed, I saw that all the collections between 2013-01-01 and 2018-10-01 were created, and the docs were indexed into 2018-10-01, then 2018-09-01, then 2018-08-01... But after some point, say 2017-02-01, this routing stopped and all documents went into the 2013-01-01 collection.
I didn't manually insert any documents to cause the creation of collections.
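
For a sense of scale, the monthly collections covering 2013-01-01 through 2018-10-01 can be enumerated with a short sketch. The alias name `coll` is assumed from the first post, and `monthly_collections` is a hypothetical helper that only mirrors the naming pattern shown there.

```python
from datetime import date


def monthly_collections(alias: str, start: date, end: date) -> list:
    """Collection names for every month from `start` to `end`, inclusive."""
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(f"{alias}_{year:04d}-{month:02d}-01")
        month += 1
        if month > 12:  # roll over into January of the next year
            year, month = year + 1, 1
    return names


names = monthly_collections("coll", date(2013, 1, 1), date(2018, 10, 1))
print(len(names), names[0], names[-1])
```

With these dates the list holds 70 names, from coll_2013-01-01 to coll_2018-10-01.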

>
> It's also worth noting that without the routing and maintenance features
> tied to the alias TRA's give very little benefit, and there are other ways
> of solving this problem with external solutions. Dave, my co-presenter at
> Activate 2018 talks about a couple of other options in the middle section
> of our talk
> https://www.youtube.com/watch?v=RB1-7Y5NQeI&index=59&list=PLU6n9Voqu_1HW8-VavVMa9lP8-oF8Oh5t&t=0s
>
>
> The part describing TRA's in detail starts at 14 min and 17 to 23 min
> discusses predecessors and alternatives
>
> -Gus