Alias Id conundrum


Gus Heck
It seems that the real-time get handler (/get) doesn't play nicely with aliases. The current (and past) behavior seems to be that it only works for the first collection listed in the alias. This looks pretty clearly like a bug: one would certainly expect a /get executed against an alias either to refuse to work with aliases or to work across all collections in the alias, rather than silently working only on the first collection.

However, this has opened another can of worms after some discussion with Erick on Slack. What's the expected behavior for this handler in the event that the same ID shows up in both collections?

My first impulse was that it should return both. Then I looked at /select to see what it does, and found that /select on an alias whose collections contain duplicate IDs is not in a happy state either: it seems to randomly return one document or the other, but not both (probably depending on the order in which the docs come back from the sub-requests, which is not deterministic).

So from a user perspective I can see arguments for either of two behaviors (in both cases), but no reason to like the current behaviors, which silently give results that hide the situation and do not return all documents.

Reasonable Behavior 1: Throw an error if a second document with the same ID is encountered.
Reasonable Behavior 2: Return all documents, including both (or more) documents that have colliding IDs.
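
The two behaviors can be sketched as a merge step over per-collection results. This is only an illustration in Python: `merge_alias_results`, `DuplicateIdError`, and the `on_duplicate` switch are hypothetical names, not existing Solr APIs, and documents are assumed to be plain dicts with an `id` field.

```python
from typing import Dict, Iterable, List, Tuple

Doc = Dict[str, object]


class DuplicateIdError(Exception):
    """Raised when the same ID turns up in more than one collection."""


def merge_alias_results(
    per_collection: Iterable[Tuple[str, List[Doc]]],
    on_duplicate: str = "error",
) -> List[Tuple[str, Doc]]:
    """Merge docs fetched from each collection behind an alias.

    on_duplicate="error" models Behavior 1 (fail on a colliding ID);
    on_duplicate="all" models Behavior 2 (return every document).
    Both mode names are assumptions made for this sketch.
    """
    if on_duplicate not in ("error", "all"):
        raise ValueError(f"unknown on_duplicate mode: {on_duplicate!r}")

    first_seen_in: Dict[object, str] = {}  # doc ID -> first collection holding it
    merged: List[Tuple[str, Doc]] = []
    for collection, docs in per_collection:
        for doc in docs:
            doc_id = doc["id"]
            owner = first_seen_in.setdefault(doc_id, collection)
            if on_duplicate == "error" and owner != collection:
                raise DuplicateIdError(
                    f"id {doc_id!r} found in both {owner} and {collection}"
                )
            # Keep the source collection so callers can tell duplicates apart.
            merged.append((collection, doc))
    return merged
```

Returning the collection name alongside each document is one way to keep Behavior 2 useful to callers, since two documents with the same ID are otherwise indistinguishable by key.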

I can think of scenarios where either would be desirable, so I would think that we want to make the behavior choice something that can be selected by users. For this I see two possible points at which the user might express their preference: 
  1. At Configuration time with an Alias Property
  2. At query time with a query parameter. 
This also implies a downside to routed aliases: it's probably possible to index the same ID multiple times if it repeats less often than the collection creation interval (for time routed aliases) or doesn't repeat within the same category (for category routed aliases). The responses to queries may then hide the duplicates in a non-deterministic fashion, which is clearly bad.

I am possibly OK with just documenting that aliases require users to provide their own guarantees about ID uniqueness, too, though part of me really wants to have a mode that detects this problem for the user somehow (&facet.mincount=2&facet.field=id seems to work, but requires active checking). In any case, the behavior where /get does not return docs from any collection but the first probably needs to be fixed.
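
For reference, the facet-based check mentioned above could be scripted roughly like this. It is a sketch under assumptions: the helper names are made up, the unique key field is assumed to be `id`, and the parser assumes Solr's default flat JSON layout for facet fields (`[value, count, value, count, ...]`). The `q`, `rows`, `facet`, `facet.limit`, and `wt` parameters are additions needed to make a complete request; only `facet.mincount` and `facet.field` come from the message.

```python
from typing import Dict
from urllib.parse import urlencode


def duplicate_id_query(base_url: str, alias: str, limit: int = 100) -> str:
    """Build a /select URL that facets on `id` and keeps only counts >= 2."""
    params = [
        ("q", "*:*"),
        ("rows", "0"),             # only the facet counts are of interest
        ("facet", "true"),
        ("facet.field", "id"),
        ("facet.mincount", "2"),   # an ID counted twice or more is a duplicate
        ("facet.limit", str(limit)),
        ("wt", "json"),
    ]
    return f"{base_url}/{alias}/select?{urlencode(params)}"


def parse_duplicate_ids(response: dict) -> Dict[str, int]:
    """Turn the flat facet list for `id` into a {doc_id: count} mapping."""
    flat = response.get("facet_counts", {}).get("facet_fields", {}).get("id", [])
    return dict(zip(flat[0::2], flat[1::2]))
```

Faceting on a unique key across large collections can be expensive, so this is probably better suited to an occasional audit than to a check on every request.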

Thoughts?

-Gus
Re: Alias Id conundrum

david.w.smiley@gmail.com
On Wed, Sep 4, 2019 at 11:26 PM Gus Heck <[hidden email]> wrote:
It seems that the real time get handler doesn't play nice with aliases. The current (and past) behavior seems to be that it only works for the first collection listed in the alias. This seems to be pretty clearly a bug, as one certainly would expect the /get executed against an alias to either refuse to work with aliases or work across all collections in the alias rather than silently working only on the first collection. 

I think it should just refuse to work (throw an exception) if there are multiple collections in the alias -- simple.  It's okay for components to have a limitation.  
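
A client-side sketch of that "refuse" rule, for illustration only. `resolve_single_collection` is a hypothetical helper, and the alias map is assumed to look like the `aliases` section of a LISTALIASES response (alias name mapped to a comma-separated collection list); the actual server-side change would live in the handler itself.

```python
from typing import Dict


def resolve_single_collection(alias: str, aliases: Dict[str, str]) -> str:
    """Return the one collection behind `alias`, or refuse if there are several."""
    collections = [c.strip() for c in aliases[alias].split(",") if c.strip()]
    if len(collections) != 1:
        raise ValueError(
            f"/get is not supported on alias {alias!r} "
            f"with {len(collections)} collections"
        )
    return collections[0]
```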

Solr's internal use of RTG isn't affected by this scenario. I believe few users even use RTG, but of course some do, and I know of at least one. In the one case where I saw RTG used, it was a nice optimization that replaced its former mode of operation, which had worked fine.

~ David
Re: Alias Id conundrum

Gus Heck
That's certainly an option, but I was leaning the other way (making it work). I know of a user who divides their data into frequently and less frequently (re)indexed portions, which are normally accessed through an alias. Because of the current behavior, they presently have to query for the list of collections in the alias and then /get on each collection independently. This works, and if we start producing an error they can of course continue to do that, but it feels clumsy and inelegant to have to do that (to me, at least). Also, it might not be all bad if it worked with routed aliases.
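
For concreteness, the workaround described here looks roughly like the following. This is a sketch, not the user's actual code: the function names are invented, `fetch_json` is an assumed callable that performs an HTTP GET and returns the parsed JSON body, and the response shapes (`aliases` mapping alias to a comma-separated list, `doc` holding the document or null) are what I understand LISTALIASES and single-ID /get to return.

```python
from typing import Callable, Dict, List, Tuple
from urllib.parse import urlencode

FetchJson = Callable[[str], dict]  # assumed: GET the URL, return decoded JSON


def collections_in_alias(base_url: str, alias: str, fetch_json: FetchJson) -> List[str]:
    """Ask the Collections API which collections the alias points at."""
    query = urlencode({"action": "LISTALIASES", "wt": "json"})
    aliases: Dict[str, str] = fetch_json(f"{base_url}/admin/collections?{query}").get("aliases", {})
    return [c.strip() for c in aliases.get(alias, "").split(",") if c.strip()]


def get_from_alias(
    base_url: str, alias: str, doc_id: str, fetch_json: FetchJson
) -> List[Tuple[str, dict]]:
    """Run /get against every collection in the alias and collect the hits."""
    hits: List[Tuple[str, dict]] = []
    for collection in collections_in_alias(base_url, alias, fetch_json):
        query = urlencode({"id": doc_id, "wt": "json"})
        doc = fetch_json(f"{base_url}/{collection}/get?{query}").get("doc")
        if doc is not None:  # "doc" is null when the collection lacks the ID
            hits.append((collection, doc))
    return hits
```

A real `fetch_json` could be a thin wrapper around `urllib.request.urlopen` plus `json.load`. Note that this costs one extra round trip for the alias lookup plus one request per collection, which is part of why it feels clumsy.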
