Re: [jira] [Commented] (SOLR-14923) Indexing performance is unacceptable when child documents are involved

Thomas Wöckinger


On Tue, Oct 13, 2020 at 2:43 PM David Smiley (Jira) <[hidden email]> wrote:

    [ https://issues.apache.org/jira/browse/SOLR-14923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213093#comment-17213093 ]

David Smiley commented on SOLR-14923:
-------------------------------------

I am responsible for this bug, along with [~moshebla], the contributor of SOLR-12638.  Perhaps the code I've most regretted committing on behalf of another is the few lines you have found, Thomas.  I expressed my reservations at the time:

https://issues.apache.org/jira/browse/SOLR-12638?focusedCommentId=16872898&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16872898

bq. What gnaws at me is that this "UpdateLog.openRealtimeSearcher" is being called optimistically on a new doc because maaaayyyybeee some future atomic update will need to see it. And not just any type of atomic update; one that is directly to a nested child doc (something I consider highly experimental). It's as if we're optimizing for making that future atomic update faster by doing work in advance that will, I think, very rarely actually be used. It's a tragedy, if I'm understanding this right.

There's a bit of earlier conversation in the issue about it as well.  It's difficult for me to say at the moment what the fix is, because that's fairly complex low-level Solr code that I think few people understand well.  Nonetheless I'll look into it further this week.
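
To make the trade-off in the quoted comment concrete, here is a minimal sketch of eager versus deferred refreshing. It is not Solr code: LazyView, addEager, addLazy, read and refreshCount are hypothetical names, and copying a map merely stands in for an expensive reopen. It only illustrates the general idea of paying for a refresh when a reader actually needs it, not the fix that was eventually chosen for this issue.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical view over stored docs; names are illustrative, not Solr APIs. */
class LazyView {
    private final Map<String, String> store = new HashMap<>();
    private Map<String, String> snapshot = new HashMap<>();
    private boolean stale = false;
    private int refreshCount = 0;

    /** Eager: pay for a refresh on every add, whether or not anyone reads. */
    synchronized void addEager(String id, String doc) {
        store.put(id, doc);
        refresh();
    }

    /** Lazy: only record that the snapshot is out of date. */
    synchronized void addLazy(String id, String doc) {
        store.put(id, doc);
        stale = true;
    }

    /** Readers trigger the refresh only when one is actually needed. */
    synchronized String read(String id) {
        if (stale) {
            refresh();
        }
        return snapshot.get(id);
    }

    private void refresh() {
        snapshot = new HashMap<>(store); // stand-in for an expensive reopen
        stale = false;
        refreshCount++;
    }

    synchronized int refreshCount() {
        return refreshCount;
    }
}
```

With this sketch, a thousand addLazy calls followed by a single read perform one refresh, while a thousand addEager calls perform a thousand.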

I was writing an additional comment just as you commented on the issue. An additional performance test shows that moving the openNewSearcher code out of the synchronized block yields nearly the same performance boost.
But removing this critical code section entirely, if it is not needed, would be better, since that would have no side effects on the UpdateLog.
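
The restructuring described above can be sketched as follows. This is a toy class, not Solr's UpdateLog: NarrowLockLog, addHoldingLock, addNarrowLock and the expensiveRefresh callback are hypothetical names, and the sketch assumes the expensive call is safe to run without the monitor, which is exactly what would need checking in the real code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical log-like class; names are illustrative, not Solr's UpdateLog API. */
class NarrowLockLog {
    private final List<String> entries = new ArrayList<>();
    private final Runnable expensiveRefresh; // stand-in for an expensive reopen

    NarrowLockLog(Runnable expensiveRefresh) {
        this.expensiveRefresh = expensiveRefresh;
    }

    /** Before: the expensive call runs while the monitor is held. */
    void addHoldingLock(String entry) {
        synchronized (this) {
            entries.add(entry);
            expensiveRefresh.run();
        }
    }

    /** After: only the shared-state mutation is guarded; the refresh runs unlocked. */
    void addNarrowLock(String entry) {
        synchronized (this) {
            entries.add(entry);
        }
        expensiveRefresh.run(); // assumption: the callback is itself thread-safe
    }

    synchronized int size() {
        return entries.size();
    }
}

class NarrowLockDemo {
    public static void main(String[] args) {
        NarrowLockLog[] ref = new NarrowLockLog[1];
        List<Boolean> heldDuringRefresh = new ArrayList<>();
        // The callback records whether the caller holds the log's monitor.
        ref[0] = new NarrowLockLog(() -> heldDuringRefresh.add(Thread.holdsLock(ref[0])));
        ref[0].addHoldingLock("a");
        ref[0].addNarrowLock("b");
        System.out.println(heldDuringRefresh); // prints [true, false]
    }
}
```

In the second variant other threads can enter the log's synchronized methods while the refresh is running, which is where the performance gain would come from.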



> Indexing performance is unacceptable when child documents are involved
> ----------------------------------------------------------------------
>
>                 Key: SOLR-14923
>                 URL: https://issues.apache.org/jira/browse/SOLR-14923
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public)
>          Components: update, UpdateRequestProcessors
>    Affects Versions: master (9.0), 8.3, 8.4, 8.5, 8.6
>            Reporter: Thomas Wöckinger
>            Priority: Critical
>              Labels: performance
>
> Parallel indexing does not make sense at the moment when child documents are used.
> The org.apache.solr.update.processor.DistributedUpdateProcessor checks at the end of the method doVersionAdd whether the Ulog caches should be refreshed.
> This check returns true if any child document is included in the AddUpdateCommand.
> If so, ulog.openRealtimeSearcher() is called. This call is very expensive and is executed in a block synchronized on the UpdateLog instance, so all other operations on the UpdateLog are blocked too.
> Because every important UpdateLog method (add, delete, ...) runs in a synchronized block, almost every operation is blocked.
> This reduces multi-threaded index updates to single-threaded behavior.
> The described behavior does not depend on any option of the UpdateRequest, so it makes no difference whether 'waitFlush', 'waitSearcher' or 'softCommit' is true or false.
> The described behavior makes ChildDocuments unusable in practice, because the performance is unacceptable.
>  
>  
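
The serialization effect described in the report can be reproduced with a small, self-contained sketch. Nothing here is Solr code: ToyLog, addWithRefresh and ContentionDemo are hypothetical names, and Thread.sleep stands in for the expensive call made while the monitor is held. The sketch only shows that a thread pool gains nothing when every task spends its time inside the same monitor.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Toy stand-in for a log whose methods share one monitor (not Solr's UpdateLog). */
class ToyLog {
    private final AtomicInteger inside = new AtomicInteger();
    private final AtomicInteger maxInside = new AtomicInteger();

    /** Simulates an expensive call made while holding the monitor. */
    synchronized void addWithRefresh(long refreshMillis) {
        int now = inside.incrementAndGet();
        maxInside.accumulateAndGet(now, Math::max);
        try {
            Thread.sleep(refreshMillis); // sleeping does not release the monitor
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            inside.decrementAndGet();
        }
    }

    /** Highest number of threads ever observed inside addWithRefresh at once. */
    int maxConcurrent() {
        return maxInside.get();
    }
}

class ContentionDemo {
    /** Runs the given number of adds on a thread pool; returns elapsed milliseconds. */
    static long run(ToyLog log, int threads, int tasks, long refreshMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        long start = System.nanoTime();
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> log.addWithRefresh(refreshMillis));
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) {
        ToyLog log = new ToyLog();
        long elapsed = run(log, 4, 8, 25);
        System.out.println("elapsed ms: " + elapsed
            + ", max threads inside: " + log.maxConcurrent());
    }
}
```

With four threads and eight tasks of 25 ms each, the elapsed time stays close to the sequential total of 200 ms rather than dropping toward 50 ms, and at most one thread is ever inside the method.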



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: [jira] [Commented] (SOLR-14923) Indexing performance is unacceptable when child documents are involved

Thomas Wöckinger
Did you have time to look at this?



Re: [jira] [Commented] (SOLR-14923) Indexing performance is unacceptable when child documents are involved

Ishan Chattopadhyaya
This is a classic example of why *every change to default code paths* for core components must be accompanied by performance benchmarks.

On Tue, 20 Oct, 2020, 1:35 pm Thomas Wöckinger, <[hidden email]> wrote:
Did you have time to look at this?
