[jira] [Commented] (SOLR-13709) Race condition on core reload while core is still loading?


Chris Mattmann (Jira)

    [ https://issues.apache.org/jira/browse/SOLR-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16914948#comment-16914948 ]

Erick Erickson commented on SOLR-13709:

At a quick glance, that doc hasn't been accurate since 2015, so I don't trust it in the least. Also, in some testing I was doing last night there were many (apparently) legitimate times when getCoreDescriptor was called and returned null, so blocking forever would stop at least the tests cold, particularly lookups for things like the ".system" collection. The comment is totally bogus; I'll change it if I can figure out a fix.

Your hypothesis is that CoreContainer.load() is on one thread and the watcher is on another, right? And loading isn't done yet, hence the race. Loading could easily take a long time if there are a lot of cores, especially if only a limited number of threads are loading them.

Off the top of my head, it'd be OK to block until CoreContainer.load is finished. The {{status}} is there specifically so a transient plugin can detect this state; there's no reason we can't use it in other places. At that point, all core _descriptors_ will be available to getCoreDescriptor, whether or not the core itself is actually loaded (i.e. transient or lazy). In that case null should not be returned from getCoreDescriptor. I'll give that a whirl.
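
For illustration only, here is a minimal sketch of the "block until load is finished" idea. It is not Solr code: DescriptorRegistry, markLoadFinished and the String-valued descriptors are hypothetical stand-ins for CoreContainer, its status flag and CoreDescriptor, and the timeout is an assumption added so a caller can't hang forever.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a container that registers descriptors while loading.
class DescriptorRegistry {
    private final Map<String, String> descriptors = new ConcurrentHashMap<>();
    // Opened exactly once, when the (hypothetical) load phase completes.
    private final CountDownLatch loadFinished = new CountDownLatch(1);

    // Called by the loading thread for every core it discovers.
    void register(String coreName, String descriptor) {
        descriptors.put(coreName, descriptor);
    }

    // Called by the loading thread once all descriptors are registered.
    void markLoadFinished() {
        loadFinished.countDown();
    }

    // Waits (bounded) for load to finish, then looks the descriptor up.
    // After load, null means "no such core" rather than "not loaded yet".
    String getDescriptor(String coreName, long timeoutMs) {
        try {
            if (!loadFinished.await(timeoutMs, TimeUnit.MILLISECONDS)) {
                throw new IllegalStateException(
                    "load did not finish within " + timeoutMs + " ms");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for callers
            throw new IllegalStateException("interrupted while waiting for load", e);
        }
        return descriptors.get(coreName);
    }
}
```

With a bounded wait, a lookup for something that never shows up (".system", say) still returns null once load is done instead of blocking forever.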

But there's one other thing that occurred to me. When a core is created there's a period during which the core descriptor is not available to getCoreDescriptor for an indeterminate amount of time. Do you think that'd also be a problem?

I'll try blocking until CoreContainer.load is finished and add some logging in both cases, to see if we actually hit the state where CoreContainer.load() isn't finished, we can't find the descriptor, and it isn't the .system collection (which seems to be looked up a lot).

It'd actually be easier to debug if we can fail in this case. Is there an easy way for Solr code to know whether it's being run from a test? I'd like getCoreDescriptor to throw an error _only when testing_ for a while if it gets into this situation. I'd make this JIRA a blocker in that case so we'd be sure to clean that up before release.
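
To make the "fail only when testing" idea concrete, here's a rough sketch. Everything in it is an assumption for illustration: MissingDescriptorPolicy and the example.tests.strictDescriptors system property are made-up names, not existing Solr code, and a test harness would have to set that property itself.

```java
// Hypothetical helper: log in production, throw when a test opts in.
class MissingDescriptorPolicy {
    // Made-up property name; a test runner would pass -Dexample.tests.strictDescriptors=true.
    static final String STRICT_PROPERTY = "example.tests.strictDescriptors";

    // Boolean.getBoolean reads the named system property and is true only for "true".
    static boolean strictMode() {
        return Boolean.getBoolean(STRICT_PROPERTY);
    }

    // Returns the descriptor unchanged; complains only in the suspicious state:
    // load not finished, descriptor missing, and not the ".system" lookup.
    static String check(String coreName, String descriptor, boolean loadFinished) {
        boolean suspicious =
            descriptor == null && !loadFinished && !".system".equals(coreName);
        if (suspicious) {
            String msg = "No descriptor for '" + coreName + "' while load is unfinished";
            if (strictMode()) {
                throw new IllegalStateException(msg); // tests fail fast
            }
            System.err.println("WARN " + msg); // production only logs
        }
        return descriptor;
    }
}
```

The point of routing both behaviours through one check is that the logging and the test-only failure trigger on exactly the same condition.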

> Race condition on core reload while core is still loading?
> ----------------------------------------------------------
>                 Key: SOLR-13709
>                 URL: https://issues.apache.org/jira/browse/SOLR-13709
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public)
>            Reporter: Hoss Man
>            Assignee: Erick Erickson
>            Priority: Major
>         Attachments: apache_Lucene-Solr-Tests-8.x_449.log.txt
> A recent jenkins failure from {{TestSolrCLIRunExample}} seems to suggest that there may be a race condition when attempting to re-load a SolrCore while the core is currently in the process of (re)loading that can leave the SolrCore in an unusable state.
> Details to follow...

This message was sent by Atlassian Jira
