Document Update Performance Improvement


Document Update Performance Improvement

Nicolas Paris-2
Hi

I am looking for a way to speed up document updates.

In my context, an update replaces one of the many existing indexed
fields and keeps the others as they are.

Right now, I am building the whole document and replacing the existing
one by id.

I am wondering if the **atomic update feature** would speed up the process.

On the one hand, using this feature would save network bandwidth because
only a small subset of the document would be sent from the client to the
server.
On the other hand, the server would have to collect the values from disk
and reindex them. In addition, this implies storing the values of every
field (I am not storing every field), which uses more space.
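To make the network argument concrete, here is a rough sketch (not from the thread) of the two payload shapes, assuming Solr's JSON update format; the field names and sizes are made up:

```python
import json

# Full-document replacement: every field is resent and reindexed.
full_doc = {
    "id": "doc-42",
    "title": "a large textual field " * 50,
    "body": "another large textual field " * 200,
    "tags": ["a", "b"],
}

# Atomic update: only the id and the changed field cross the wire.
# "set" replaces the field's value; the other fields are rebuilt
# server-side, which is why they must be stored (or have docValues).
atomic_doc = {
    "id": "doc-42",
    "tags": {"set": ["a", "b", "c"]},
}

print(len(json.dumps(full_doc)), "bytes vs", len(json.dumps(atomic_doc)), "bytes")
```

The atomic payload stays tiny no matter how large the untouched text fields are, which is exactly the trade-off described above.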

I have also read that the ConcurrentUpdateSolrServer class might be an
optimized way of updating documents.

I am using the spark-solr library to work with SolrCloud. If something
exists to speed up the process, I would be glad to implement it in that
library.
Also, I have split the collection over multiple shards, and I admit this
speeds up the update process, but who knows?

Thoughts?

--
nicolas

Re: Document Update Performance Improvement

Nicolas Paris-2
Hi community,

Any advice on speeding up updates?
Any tips on commits, memory, docValues, stored fields, or anything else
to make things faster?

Thanks


On Wed, Oct 16, 2019 at 12:47:47AM +0200, Nicolas Paris wrote:


--
nicolas

Re: Document Update Performance Improvement

Jörn Franke
Maybe you need to give more details. I always recommend trying and
testing yourself, as you know your own solution best. Depending on your
Spark process, atomic updates could be faster.

Spark-Solr adds additional complexity. You could have too many executors
for your Solr instance(s), i.e. too high a parallelism.

Probably the most important question is:
What performance does your use case need, and what is your current
performance?

Once this is clear, further architecture aspects can be derived, such as
the number of Spark executors, number of Solr instances, sharding,
replication, commit timing, etc.

> On 19.10.2019 at 21:52, Nicolas Paris <[hidden email]> wrote:

Re: Document Update Performance Improvement

Nicolas Paris-2
> Maybe you need to give more details. I recommend always to try and
> test yourself as you know your own solution best. What performance
> does your use case need and what is your current performance?

I have 10 collections on 4 shards (no replication). The collections are
quite large, ranging from 2 GB to 60 GB per shard. In every case, the
update process only adds several values to an indexed array field on a
subset of documents of each collection. The proportion of the subset
ranges from 0 to 100%, and is below 20% 95% of the time. The array field
is 1 of 20 fields, which are mainly unstored fields with some large
textual fields.

The 4 Solr instances are collocated with Spark. Right now I tested with
40 Spark executors. Commit time and commit document count are both set
to 20,000. Each shard has 20 GB of memory.
Loading/replacing the largest collection takes about 2 hours, which is
quite fast I guess. Updating 5% of the documents of each collection
takes about half an hour.

Because my need is "only" to append several values to an array, I
suspect there is some trick to make things faster.
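As a sketch of that append-only case (my assumption of the wire format, using Solr's JSON atomic-update modifiers; the id and field names are made up), "add" appends to a multivalued field without resending the other 19 fields:

```python
import json

# Hypothetical batch that appends values to an indexed array field
# on a subset of documents; the large unstored text fields never
# leave the client.
updates = [
    {"id": "doc-1", "array_field": {"add": ["v1", "v2"]}},
    {"id": "doc-2", "array_field": {"add": ["v3"]}},
]

# This is the JSON body one would POST to the collection's /update handler.
payload = json.dumps(updates)
print(payload)
```

Note that atomic updates still require the untouched fields to be stored (or have docValues), so this only helps if that precondition already holds.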



On Sat, Oct 19, 2019 at 10:10:36PM +0200, Jörn Franke wrote:


--
nicolas

Re: Document Update Performance Improvement

Paras Lehana
Hi Nicolas,

Have you tried playing with the values in *IndexConfig*
<https://lucene.apache.org/solr/guide/6_6/indexconfig-in-solrconfig.html>
(merge factor, segment size, maxBufferedDocs, merge policies)? We, at
Auto-Suggest, also do atomic updates daily, and specifically changing
the merge factor gave us a boost of ~4x during indexing. With the
current configuration, our core atomically updates ~423 documents per
second.

On Sun, 20 Oct 2019 at 02:07, Nicolas Paris <[hidden email]>
wrote:



--
Regards,

*Paras Lehana* [65871]
Software Programmer, Auto-Suggest,
IndiaMART Intermesh Ltd.

8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
Noida, UP, IN - 201303

Mob.: +91-9560911996
Work: 01203916600 | Extn:  *8173*

--
IMPORTANT: 
NEVER share your IndiaMART OTP/ Password with anyone.

Re: Document Update Performance Improvement

Nicolas Paris-2
> We, at Auto-Suggest, also do atomic updates daily and specifically
> changing merge factor gave us a boost of ~4x

Interesting. What kind of change exactly did you make on the merge factor side?


> At current configuration, our core atomically updates ~423 documents
> per second.

Would you say atomic updates are faster than regular replacement of
documents? (considering my first thought on this below)

> > I am wondering if the **atomic update feature** would speed up the process.
> > On the one hand, using this feature would save network bandwidth because
> > only a small subset of the document would be sent from the client to the
> > server.
> > On the other hand, the server would have to collect the values from disk
> > and reindex them. In addition, this implies storing the values of every
> > field (I am not storing every field), which uses more space.


Thanks Paras



On Tue, Oct 22, 2019 at 01:00:10PM +0530, Paras Lehana wrote:


--
nicolas

Re: Document Update Performance Improvement

Paras Lehana
Hi Nicolas,

> What kind of change exactly did you make on the merge factor side?


We increased maxMergeAtOnce and segmentsPerTier from 5 to 50. This makes
Solr merge segments less frequently after many index updates. Yes, you
need to find the sweet spot here, but do try increasing these values
from the defaults. I strongly recommend giving this
<https://lucene.apache.org/solr/guide/6_6/indexconfig-in-solrconfig.html#merge-factors>
a 2-minute read. Do note that increasing these values will require
larger physical storage until segments merge.

Besides this, do review your autoCommit config
<https://lucene.apache.org/solr/guide/6_6/updatehandlers-in-solrconfig.html#UpdateHandlersinSolrConfig-autoCommit>
or the frequency of your hard commits. In our case, we don't want
real-time updates, so we can always commit less frequently. This makes
indexing faster. How often do you commit? Are you committing after each
XML is indexed? If yes, what is your batch (XML) size? Review the
default settings of autoCommit and consider increasing them. Do you want
real-time reflection of updates? If not, you can compromise on commits
and merge factors and do faster indexing. Don't do soft commits then.

In our case, I have set autoCommit to commit after 50,000 documents are
indexed. After EdgeNGram tokenization, during full indexing, we have
seen the index grow over 60 GB. Once we are done with full indexing, I
optimize the index and the index size comes down below 13 GB! Since we
can trade off space temporarily for increased indexing speed, we are
still trying to find sweeter spots for faster indexing. For statistics,
we have over 250 million documents for indexing that converge to 60
million unique documents after atomic updates (full indexing).
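A sketch of what such an autoCommit block looks like in solrconfig.xml (the 50,000-document threshold is the one mentioned above; openSearcher=false is my assumption, keeping hard commits cheap and leaving visibility to soft commits):

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <!-- hard commit after 50,000 docs without opening a new searcher -->
    <maxDocs>50000</maxDocs>
    <openSearcher>false</openSearcher>
  </autoCommit>
</updateHandler>
```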



> Would you say atomical update is faster than regular replacement of
> documents?


No, I don't say that. Either of the two configs (autoCommit, merge
policy) will impact regular indexing too. In our case, non-atomic
indexing is out of the question.

On Wed, 23 Oct 2019 at 00:43, Nicolas Paris <[hidden email]>
wrote:




Re: Document Update Performance Improvement

Shawn Heisey-2
On 10/22/2019 1:12 PM, Nicolas Paris wrote:
>> We, at Auto-Suggest, also do atomic updates daily and specifically
>> changing merge factor gave us a boost of ~4x
>
> Interesting. What kind of change exactly on the merge factor side ?

The mergeFactor setting is deprecated. Instead, use maxMergeAtOnce,
segmentsPerTier, and a setting that is not mentioned in the ref guide:
maxMergeAtOnceExplicit.

Set the first two to the same number, and the third to a minimum of
three times what you set the other two to.

The default setting for maxMergeAtOnce and segmentsPerTier is 10, with
30 for maxMergeAtOnceExplicit.  When you're trying to increase indexing
speed and you think segment merging is interfering, you want to increase
these values to something larger.  Note that increasing these values
will increase the number of files that your Solr install keeps open.

https://lucene.apache.org/solr/guide/8_1/indexconfig-in-solrconfig.html#mergepolicyfactory

When I built a Solr setup, I increased maxMergeAtOnce and
segmentsPerTier to 35, and maxMergeAtOnceExplicit to 105.  This made
merging happen a lot less frequently.
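As a sketch, those numbers translate to something like this in solrconfig.xml (element names are my reading of the ref guide page linked above; TieredMergePolicy is the default policy):

```xml
<indexConfig>
  <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
    <int name="maxMergeAtOnce">35</int>
    <int name="segmentsPerTier">35</int>
    <!-- not documented in the ref guide; roughly 3x the values above -->
    <int name="maxMergeAtOnceExplicit">105</int>
  </mergePolicyFactory>
</indexConfig>
```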

> Would you say atomical update is faster than regular replacement of
> documents ? (considering my first thought on this below)

On the Solr side, atomic updates will be slightly slower than indexing a
whole document provided to Solr. When an atomic update is done, Solr
will find the existing document, combine what's in that document with
the changes you specify in the atomic update, and then index the whole
combined document as a new document that replaces the original.

Whether or not atomic updates are faster or slower in practice than
indexing the whole document will depend on how your source systems work,
and that is not something we can know.  If Solr can access the previous
document faster than you can get the document from your source system,
then atomic updates might be faster.

Thanks,
Shawn

Re: Document Update Performance Improvement

Nicolas Paris-2
> <https://lucene.apache.org/solr/guide/6_6/indexconfig-in-solrconfig.html#merge-factors>.
> <https://lucene.apache.org/solr/guide/6_6/updatehandlers-in-solrconfig.html#UpdateHandlersinSolrConfig-autoCommit>

Thanks for those relevant pointers and the explanation.

> How often do you commit? Are you committing after each XML is
> indexed? If yes, what is your batch (XML) size? Review default settings of
> autoCommit and considering increasing it.

I guess I do not use any XML under the hood: spark-solr uses SolrJ,
which serializes the documents as Java binary objects. However, the
commit strategy still applies; I have set 20,000 documents or 20,000 ms.

> Do you want real time reflection
> of updates? If no, you can compromise on commits and merge factors and do
> faster indexing. Don't so soft commits then.

Indeed I'd like the documents to be accessible sooner. That being said,
a 5-minute delay is acceptable.
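In solrconfig.xml terms, that tolerance could map to a soft-commit interval like this (my sketch; 300,000 ms = 5 minutes):

```xml
<autoSoftCommit>
  <!-- new documents become searchable within roughly 5 minutes -->
  <maxTime>300000</maxTime>
</autoSoftCommit>
```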

> In our case, I have set autoCommit to commit after 50,000 documents are
> indexed. After EdgeNGrams tokenization, while full indexing, we have seen
> index to get over 60 GBs. Once we are done with full indexing, I optimize
> the index and the index size comes below 13 GB!

I guess I get the idea: "put the dollars in the bag as fast as possible,
we will clean up when back home".

Thanks

On Wed, Oct 23, 2019 at 11:34:44AM +0530, Paras Lehana wrote:

> Hi Nicolas,
>
> What kind of change exactly on the merge factor side ?
>
>
> We increased maxMergeAtOnce and segmentsPerTier from 5 to 50. This will
> make Solr to merge segments less frequently after many index updates. Yes,
> you need to find the sweet spot here but do try increasing these values
> from the default ones. I strongly recommend you to give a 2 min read to this
> <https://lucene.apache.org/solr/guide/6_6/indexconfig-in-solrconfig.html#merge-factors>.
> Do note that increasing these values will require you to have larger
> physical storage until segments merge.
>
> Besides this, do review your autoCommit config
> <https://lucene.apache.org/solr/guide/6_6/updatehandlers-in-solrconfig.html#UpdateHandlersinSolrConfig-autoCommit>
> or the frequency of your hard commits. In our case, we don't want real time
> updates - so we can always commit less frequently. This makes indexing
> faster. How often do you commit? Are you committing after each XML is
> indexed? If yes, what is your batch (XML) size? Review default settings of
> autoCommit and considering increasing it. Do you want real time reflection
> of updates? If no, you can compromise on commits and merge factors and do
> faster indexing. Don't so soft commits then.
>
> In our case, I have set autoCommit to commit after 50,000 documents are
> indexed. After EdgeNGrams tokenization, while full indexing, we have seen
> index to get over 60 GBs. Once we are done with full indexing, I optimize
> the index and the index size comes below 13 GB! Since we can trade off
> space temporarily for increased indexing speed, we are still committed to
> find sweeter spots for faster indexing. For statistics purpose, we have
> over 250 million documents for indexing that converges to 60 million unique
> documents after atomic updates (full indexing).
>
>
>
> > Would you say atomic updates are faster than regular replacement of
> > documents?
>
>
> No, I don't say that. Either of the two configs (autoCommit, Merge Policy)
> will impact regular indexing too. In our case, non-atomic indexing is out
> of question.
>
> On Wed, 23 Oct 2019 at 00:43, Nicolas Paris <[hidden email]>
> wrote:
>
> > > We, at Auto-Suggest, also do atomic updates daily and specifically
> > > changing merge factor gave us a boost of ~4x
> >
> > Interesting. What kind of change exactly on the merge factor side ?
> >
> >
> > > At current configuration, our core atomically updates ~423 documents
> > > per second.
> >
> > Would you say atomic updates are faster than regular replacement of
> > documents ? (considering my first thought on this below)
> >
> > > > I am wondering if **atomic update feature** would faster the process.
> > > > From one hand, using this feature would save network because only a
> > > > small subset of the document would be send from the client to the
> > > > server.
> > > > On the other hand, the server will have to collect the values from the
> > > > disk and reindex them. In addition, this implies to store the values for
> > > > every field (I am not storing every field) and use more space.
> >
> >
> > Thanks Paras
> >
> >
> >
> > On Tue, Oct 22, 2019 at 01:00:10PM +0530, Paras Lehana wrote:
> > > Hi Nicolas,
> > >
> > > Have you tried playing with values of *IndexConfig*
> > > <https://lucene.apache.org/solr/guide/6_6/indexconfig-in-solrconfig.html
> > >
> > > (merge factor, segment size, maxBufferedDocs, Merge Policies)? We, at
> > > Auto-Suggest, also do atomic updates daily and specifically changing
> > merge
> > > factor gave us a boost of ~4x during indexing. At current configuration,
> > > our core atomically updates ~423 documents per second.
> > >
> > > On Sun, 20 Oct 2019 at 02:07, Nicolas Paris <[hidden email]>
> > > wrote:
> > >
> > > > > Maybe you need to give more details. I recommend always to try and
> > > > > test yourself as you know your own solution best. What performance does
> > > > > your use case need and what is your current performance?
> > > >
> > > > I have 10 collections on 4 shards (no replications). The collections
> > are
> > > > quite large ranging from 2GB to 60 GB per shard. In every case, the
> > > > update process only add several values to an indexed array field on a
> > > > document subset of each collection. The proportion of the subset is
> > from
> > > > 0 to 100%, and 95% of time below 20%. The array field represents 1 over
> > > > 20 fields which are mainly unstored fields with some large textual
> > > > fields.
> > > >
> > > > The 4 solr instance collocate with the spark. Right now I tested with
> > 40
> > > > spark executors. Commit timing and commit number document are both set
> > > > to 20000. Each shard has 20g of memory.
> > > > Loading/replacing the largest collection is about 2 hours - which is
> > > > quite fast I guess. Updating 5% percent of documents of each
> > > > collections, is about half an hour.
> > > >
> > > > Because my need is "only" to append several values to an array I
> > suspect
> > > > there is some trick to make things faster.
> > > >
> > > >
> > > >
> > > > On Sat, Oct 19, 2019 at 10:10:36PM +0200, Jörn Franke wrote:
> > > > > Maybe you need to give more details. I recommend always to try and
> > test
> > > > yourself as you know your own solution best. Depending on your spark
> > > > process atomic updates  could be faster.
> > > > >
> > > > > With Spark-Solr additional complexity comes. You could have too many
> > > > executors for your Solr instance(s), ie a too high parallelism.
> > > > >
> > > > > Probably the most important question is:
> > > > > What performance does your use case need and what is your current
> > > > performance?
> > > > >
> > > > > Once this is clear further architecture aspects can be derived, such
> > as
> > > > number of spark executors, number of Solr instances, sharding,
> > replication,
> > > > commit timing etc.
> > > > >
> > > > > > Am 19.10.2019 um 21:52 schrieb Nicolas Paris <
> > [hidden email]
> > > > >:
> > > > > >
> > > > > > Hi community,
> > > > > >
> > > > > > Any advice to speed-up updates ?
> > > > > > Is there any advice on commit, memory, docvalues, stored or any
> > tips to
> > > > > > faster things ?
> > > > > >
> > > > > > Thanks
> > > > > >
> > > > > >
> > > > > >> On Wed, Oct 16, 2019 at 12:47:47AM +0200, Nicolas Paris wrote:
> > > > > >> Hi
> > > > > >>
> > > > > >> I am looking for a way to faster the update of documents.
> > > > > >>
> > > > > >> In my context, the update replaces one of the many existing
> > indexed
> > > > > >> fields, and keep the others as is.
> > > > > >>
> > > > > >> Right now, I am building the whole document, and replacing the
> > > > existing
> > > > > >> one by id.
> > > > > >>
> > > > > >> I am wondering if **atomic update feature** would faster the
> > process.
> > > > > >>
> > > > > >> From one hand, using this feature would save network because only
> > a
> > > > > >> small subset of the document would be send from the client to the
> > > > > >> server.
> > > > > >> On the other hand, the server will have to collect the values
> > from the
> > > > > >> disk and reindex them. In addition, this implies to store the
> > values
> > > > for
> > > > > >> every fields (I am not storing every fields) and use more space.
> > > > > >>
> > > > > >> Also I have read about the ConcurrentUpdateSolrServer class might
> > be
> > > > an
> > > > > >> optimized way of updating documents.
> > > > > >>
> > > > > >> I am using spark-solr library to deal with solr-cloud. If
> > something
> > > > > >> exist to faster the process, I would be glad to implement it in
> > that
> > > > > >> library.
> > > > > >> Also, I have split the collection over multiple shard, and I admit
> > > > this
> > > > > >> faster the update process, but who knows ?
> > > > > >>
> > > > > >> Thoughts ?
> > > > > >>
> > > > > >> --
> > > > > >> nicolas
> > > > > >>
> > > > > >
> > > > > > --
> > > > > > nicolas
> > > > >
> > > >
> > > > --
> > > > nicolas
> > > >
> > >
> > >
> > > --
> > > --
> > > Regards,
> > >
> > > *Paras Lehana* [65871]
> > > Software Programmer, Auto-Suggest,
> > > IndiaMART Intermesh Ltd.
> > >
> > > 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
> > > Noida, UP, IN - 201303
> > >
> > > Mob.: +91-9560911996
> > > Work: 01203916600 | Extn:  *8173*
> > >
> > > --
> > > IMPORTANT:
> > > NEVER share your IndiaMART OTP/ Password with anyone.
> >
> > --
> > nicolas
> >
>
>
> --
> --
> Regards,
>
> *Paras Lehana* [65871]
> Development Engineer, Auto-Suggest,
> IndiaMART Intermesh Ltd.
>
> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
> Noida, UP, IN - 201303
>
> Mob.: +91-9560911996
> Work: 01203916600 | Extn:  *8173*
>
> --
> IMPORTANT: 
> NEVER share your IndiaMART OTP/ Password with anyone.

--
nicolas

Re: Document Update performances Improvement

Nicolas Paris-2
In reply to this post by Shawn Heisey-2
> Set the first two to the same number, and the third to a minimum of three
> times what you set the other two.
> When I built a Solr setup, I increased maxMergeAtOnce and segmentsPerTier to
> 35, and maxMergeAtOnceExplicit to 105.  This made merging happen a lot less
> frequently.

Good to know the key recipes.
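As a concrete sketch, Shawn's numbers would land in solrconfig.xml roughly
like this (values taken from his setup, not a universal recommendation; tune
for your own index and keep an eye on the open-file count):

```xml
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
  <int name="maxMergeAtOnce">35</int>          <!-- default is 10 -->
  <int name="segmentsPerTier">35</int>         <!-- keep equal to maxMergeAtOnce -->
  <int name="maxMergeAtOnceExplicit">105</int> <!-- at least 3x the other two -->
</mergePolicyFactory>
```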

> On the Solr side, atomic updates will be slightly slower than indexing the
> whole document provided to Solr.

This makes sense.

> If Solr can access the previous document faster than you can get the
> document from your source system, then atomic updates might be faster.

The documents are stored in parquet files and need no further processing.
In this case, atomic updates are not likely to speed things up.
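For anyone weighing the two wire formats, here is a minimal sketch of what
each request body looks like (field names are hypothetical; either JSON array
would be POSTed to the collection's /update handler):

```python
import json

# Full replacement: the client rebuilds and sends the entire document.
full_doc = {"id": "doc-42", "title": "some large text field", "tags": ["a", "b", "c"]}

# Atomic update: only the id and the changed field cross the network;
# Solr rebuilds the rest from stored fields (hence the storage requirement).
# "add-distinct" appends to a multivalued field, skipping duplicates.
atomic_update = {"id": "doc-42", "tags": {"add-distinct": ["d", "e"]}}

full_body = json.dumps([full_doc])
atomic_body = json.dumps([atomic_update])
print(len(atomic_body) < len(full_body))  # the atomic body is smaller on the wire
```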


Thanks

On Wed, Oct 23, 2019 at 07:49:44AM -0600, Shawn Heisey wrote:

> On 10/22/2019 1:12 PM, Nicolas Paris wrote:
> > > We, at Auto-Suggest, also do atomic updates daily and specifically
> > > changing merge factor gave us a boost of ~4x
> >
> > Interesting. What kind of change exactly on the merge factor side ?
>
> The mergeFactor setting is deprecated.  Instead, use maxMergeAtOnce,
> segmentsPerTier, and a setting that is not mentioned in the ref guide --
> maxMergeAtOnceExplicit.
>
> Set the first two to the same number, and the third to a minimum of three
> times what you set the other two.
>
> The default setting for maxMergeAtOnce and segmentsPerTier is 10, with 30
> for maxMergeAtOnceExplicit.  When you're trying to increase indexing speed
> and you think segment merging is interfering, you want to increase these
> values to something larger.  Note that increasing these values will increase
> the number of files that your Solr install keeps open.
>
> https://lucene.apache.org/solr/guide/8_1/indexconfig-in-solrconfig.html#mergepolicyfactory
>
> When I built a Solr setup, I increased maxMergeAtOnce and segmentsPerTier to
> 35, and maxMergeAtOnceExplicit to 105.  This made merging happen a lot less
> frequently.
>
> > Would you say atomic updates are faster than regular replacement of
> > documents ? (considering my first thought on this below)
>
> On the Solr side, atomic updates will be slightly slower than indexing the
> whole document provided to Solr.  When an atomic update is done, Solr will
> find the existing document, then combine what's in that document with the
> changes you specify using the atomic update, and then index the whole
> combined document as a new document that replaces the original.
>
> Whether or not atomic updates are faster or slower in practice than indexing
> the whole document will depend on how your source systems work, and that is
> not something we can know.  If Solr can access the previous document faster
> than you can get the document from your source system, then atomic updates
> might be faster.
>
> Thanks,
> Shawn
>

--
nicolas

Re: Document Update performances Improvement

Nicolas Paris-2
In reply to this post by Jörn Franke
> With Spark-Solr additional complexity comes. You could have too many
> executors for your Solr instance(s), ie a too high parallelism.

I have reduced the parallelism of the spark-solr part by a factor of 5: I
had 40 executors loading 4 shards, and now only 8 executors load the 4
shards. As a result, I can see a 10x update improvement, and I suspect the
update process had been overwhelmed by spark.

I have been able to keep 40 executors for document preprocessing while
reducing to 8 executors within the same spark job, by using the
"dataframe.coalesce" feature, which does not shuffle the data at all and
keeps both the spark cluster and solr quiet in terms of network.
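The pattern looks roughly like this in pyspark (a sketch only, untested here;
paths and connection settings are hypothetical, though `zkhost`, `collection`
and `commit_within` are real spark-solr write options):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("solr-update").getOrCreate()

# Heavy preprocessing keeps the full parallelism (e.g. 40 partitions)...
docs = spark.read.parquet("/data/docs.parquet").repartition(40)
prepared = docs.selectExpr("id", "array_field")

# ...then coalesce down before the write so only 8 tasks hit the 4 shards.
# coalesce() narrows partitions without a full shuffle, unlike repartition().
(prepared.coalesce(8)
    .write.format("solr")
    .option("zkhost", "zk1:2181/solr")
    .option("collection", "mycollection")
    .option("commit_within", "20000")
    .save())
```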

Thanks

On Sat, Oct 19, 2019 at 10:10:36PM +0200, Jörn Franke wrote:

> Maybe you need to give more details. I recommend always to try and test yourself as you know your own solution best. Depending on your spark process atomic updates  could be faster.
>
> With Spark-Solr additional complexity comes. You could have too many executors for your Solr instance(s), ie a too high parallelism.
>
> Probably the most important question is:
> What performance does your use case need and what is your current performance?
>
> Once this is clear further architecture aspects can be derived, such as number of spark executors, number of Solr instances, sharding, replication, commit timing etc.
>
> > Am 19.10.2019 um 21:52 schrieb Nicolas Paris <[hidden email]>:
> >
> > Hi community,
> >
> > Any advice to speed-up updates ?
> > Is there any advice on commit, memory, docvalues, stored or any tips to
> > faster things ?
> >
> > Thanks
> >
> >
> >> On Wed, Oct 16, 2019 at 12:47:47AM +0200, Nicolas Paris wrote:
> >> Hi
> >>
> >> I am looking for a way to faster the update of documents.
> >>
> >> In my context, the update replaces one of the many existing indexed
> >> fields, and keep the others as is.
> >>
> >> Right now, I am building the whole document, and replacing the existing
> >> one by id.
> >>
> >> I am wondering if **atomic update feature** would faster the process.
> >>
> >> From one hand, using this feature would save network because only a
> >> small subset of the document would be send from the client to the
> >> server.
> >> On the other hand, the server will have to collect the values from the
> >> disk and reindex them. In addition, this implies to store the values for
> >> every fields (I am not storing every fields) and use more space.
> >>
> >> Also I have read about the ConcurrentUpdateSolrServer class might be an
> >> optimized way of updating documents.
> >>
> >> I am using spark-solr library to deal with solr-cloud. If something
> >> exist to faster the process, I would be glad to implement it in that
> >> library.
> >> Also, I have split the collection over multiple shard, and I admit this
> >> faster the update process, but who knows ?
> >>
> >> Thoughts ?
> >>
> >> --
> >> nicolas
> >>
> >
> > --
> > nicolas
>

--
nicolas

Re: Document Update performances Improvement

Erick Erickson
In reply to this post by Shawn Heisey-2
My first question is always “what’s the bottleneck”? Unless you’re driving your CPUs and/or I/O hard on Solr, the bottleneck is in the acquisition of the docs not on the Solr side.

Also, be sure and batch in groups of at least 10x the number of shards, see: https://lucidworks.com/post/really-batch-updates-solr-2/

Although it sounds like you’ve figured this out already…. And yeah, I’ve seen Solr indexing degrade when it’s being overwhelmed, so that might be the total issue.
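That batching rule of thumb can be sketched as a tiny helper (illustrative
only; `batched` and its parameters are hypothetical names, and the actual
send would go through SolrJ or whichever client library you use):

```python
def batched(docs, num_shards, factor=10):
    """Group docs into batches of at least `factor` x the shard count."""
    size = num_shards * factor
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch  # one update request per batch
            batch = []
    if batch:
        yield batch  # flush the remainder

# 4 shards -> update requests of 40 documents each
sizes = [len(b) for b in batched(({"id": i} for i in range(100)), num_shards=4)]
print(sizes)  # [40, 40, 20]
```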

Best,
Erick

> On Oct 23, 2019, at 9:49 AM, Shawn Heisey <[hidden email]> wrote:
>
> On 10/22/2019 1:12 PM, Nicolas Paris wrote:
>>> We, at Auto-Suggest, also do atomic updates daily and specifically
>>> changing merge factor gave us a boost of ~4x
>> Interesting. What kind of change exactly on the merge factor side ?
>
> The mergeFactor setting is deprecated.  Instead, use maxMergeAtOnce, segmentsPerTier, and a setting that is not mentioned in the ref guide -- maxMergeAtOnceExplicit.
>
> Set the first two to the same number, and the third to a minumum of three times what you set the other two.
>
> The default setting for maxMergeAtOnce and segmentsPerTier is 10, with 30 for maxMergeAtOnceExplicit.  When you're trying to increase indexing speed and you think segment merging is interfering, you want to increase these values to something larger.  Note that increasing these values will increase the number of files that your Solr install keeps open.
>
> https://lucene.apache.org/solr/guide/8_1/indexconfig-in-solrconfig.html#mergepolicyfactory
>
> When I built a Solr setup, I increased maxMergeAtOnce and segmentsPerTier to 35, and maxMergeAtOnceExplicit to 105.  This made merging happen a lot less frequently.
>
>> Would you say atomic updates are faster than regular replacement of
>> documents ? (considering my first thought on this below)
>
> On the Solr side, atomic updates will be slightly slower than indexing the whole document provided to Solr.  When an atomic update is done, Solr will find the existing document, then combine what's in that document with the changes you specify using the atomic update, and then index the whole combined document as a new document that replaces the original.
>
> Whether or not atomic updates are faster or slower in practice than indexing the whole document will depend on how your source systems work, and that is not something we can know.  If Solr can access the previous document faster than you can get the document from your source system, then atomic updates might be faster.
>
> Thanks,
> Shawn


Re: Document Update performances Improvement

Jörn Franke
In reply to this post by Nicolas Paris-2
Well, coalesce does require shuffling and network traffic; however, in most cases it is less than repartition, as it moves the data (through the network) to already existing executors.
However, as you see and others confirm: for high performance you don't need high parallelism on the ingestion side; you can load the data in batches with low parallelism. Tuning some parameters (commit interval, merge segment size) can, if needed, deliver even more performance. If you then still need more performance, you can increase the number of Solr nodes and shards.

> Am 23.10.2019 um 22:01 schrieb Nicolas Paris <[hidden email]>:
>
> 
>>
>> With Spark-Solr additional complexity comes. You could have too many
>> executors for your Solr instance(s), ie a too high parallelism.
>
> I have been reducing the parallelism of spark-solr part by 5. I had 40
> executors loading 4 shards. Right now only 8 executors loading 4 shards.
> As a result, I can see a 10 times update improvement, and I suspect the
> update process had been overwhelmed by spark.
>
> I have been able to keep 40 executor for document preprocessing and
> reducing to 8 executors within the same spark job by using the
> "dataframe.coalesce" feature which does not shuffle the data at all and
> keeps both spark cluster and solr quiet in term of network.
>
> Thanks
>
>> On Sat, Oct 19, 2019 at 10:10:36PM +0200, Jörn Franke wrote:
>> Maybe you need to give more details. I recommend always to try and test yourself as you know your own solution best. Depending on your spark process atomic updates  could be faster.
>>
>> With Spark-Solr additional complexity comes. You could have too many executors for your Solr instance(s), ie a too high parallelism.
>>
>> Probably the most important question is:
>> What performance does your use case need and what is your current performance?
>>
>> Once this is clear further architecture aspects can be derived, such as number of spark executors, number of Solr instances, sharding, replication, commit timing etc.
>>
>>>> Am 19.10.2019 um 21:52 schrieb Nicolas Paris <[hidden email]>:
>>>
>>> Hi community,
>>>
>>> Any advice to speed-up updates ?
>>> Is there any advice on commit, memory, docvalues, stored or any tips to
>>> faster things ?
>>>
>>> Thanks
>>>
>>>
>>>> On Wed, Oct 16, 2019 at 12:47:47AM +0200, Nicolas Paris wrote:
>>>> Hi
>>>>
>>>> I am looking for a way to faster the update of documents.
>>>>
>>>> In my context, the update replaces one of the many existing indexed
>>>> fields, and keep the others as is.
>>>>
>>>> Right now, I am building the whole document, and replacing the existing
>>>> one by id.
>>>>
>>>> I am wondering if **atomic update feature** would faster the process.
>>>>
>>>> From one hand, using this feature would save network because only a
>>>> small subset of the document would be send from the client to the
>>>> server.
>>>> On the other hand, the server will have to collect the values from the
>>>> disk and reindex them. In addition, this implies to store the values for
>>>> every fields (I am not storing every fields) and use more space.
>>>>
>>>> Also I have read about the ConcurrentUpdateSolrServer class might be an
>>>> optimized way of updating documents.
>>>>
>>>> I am using spark-solr library to deal with solr-cloud. If something
>>>> exist to faster the process, I would be glad to implement it in that
>>>> library.
>>>> Also, I have split the collection over multiple shard, and I admit this
>>>> faster the update process, but who knows ?
>>>>
>>>> Thoughts ?
>>>>
>>>> --
>>>> nicolas
>>>>
>>>
>>> --
>>> nicolas
>>
>
> --
> nicolas