Slower indexing speed in Solr 8.0.0

Zheng Lin Edwin Yeo
Hi,

I am setting up the latest Solr 8.0.0 and re-indexing the data from
scratch.

However, I found that indexing is slower in Solr 8.0.0 than in an earlier
version, Solr 7.7.1. I have not changed schema.xml or solrconfig.xml yet,
apart from setting luceneMatchVersion in solrconfig.xml to 8.0.0:
<luceneMatchVersion>8.0.0</luceneMatchVersion>

On average, indexing takes about 40% to 50% longer. For example, indexing
took about 17 mins in Solr 7.7.1, but now takes about 25 mins for the
same set of data.
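
As a quick check of those figures, here is a small Python sketch. The only inputs are the two durations quoted above; the variable names are illustrative. It shows that going from 17 to 25 minutes is about 47% more wall-clock time, which corresponds to roughly 32% lower throughput, so it helps to say which of the two is meant when comparing reports.

```python
# Durations reported above, in minutes.
t_old = 17.0  # Solr 7.7.1
t_new = 25.0  # Solr 8.0.0

# Extra wall-clock time relative to the old run.
extra_time_pct = (t_new - t_old) / t_old * 100

# Drop in throughput (documents per minute) for the same data set.
throughput_drop_pct = (1 - t_old / t_new) * 100

print(f"{extra_time_pct:.0f}% more time, {throughput_drop_pct:.0f}% lower throughput")
# prints "47% more time, 32% lower throughput"
```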

What could be the reason that causes the indexing to be slower in Solr
8.0.0?

Regards,
Edwin

Re: Slower indexing speed in Solr 8.0.0

Zheng Lin Edwin Yeo
For additional info, I am still using the same versions of the major
components, such as ZooKeeper, Tika, Carrot2 and Jetty.

Regards,
Edwin

Re: Slower indexing speed in Solr 8.0.0

Aroop Ganguly
Indexing speed is a function of a lot of variables, in my experience.

What is your setup like?
What kind of cluster do you have, how many shards did you create, how many machines, etc.?
Where is your input data coming from? What technology do you use for indexing (simple Java threads or something more robust like Flink/Spark)?
How many documents do you index at a time?

How many times have you run the indexer job on the new 8.0 setup before concluding it's slower?
Make a matrix of all these variables and test over at least 5 runs before forming an opinion.
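
The repeated-runs advice can be sketched in Python as follows. This is a minimal illustration, not a benchmark harness: `time_runs`, `summarize` and `placeholder_job` are hypothetical names, and the placeholder has to be replaced with the real indexing run against one Solr version. Comparing the mean and spread per configuration is one simple way to fill in the matrix.

```python
import statistics
import time
from typing import Callable, List, Tuple


def time_runs(job: Callable[[], None], runs: int = 5) -> List[float]:
    """Run `job` `runs` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        job()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations: List[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of the durations."""
    mean = statistics.mean(durations)
    spread = statistics.stdev(durations) if len(durations) > 1 else 0.0
    return mean, spread


def placeholder_job() -> None:
    # Hypothetical stand-in: replace with the real indexing run
    # (for example, posting the CSV file to one Solr version).
    time.sleep(0.01)


if __name__ == "__main__":
    mean, spread = summarize(time_runs(placeholder_job, runs=5))
    print(f"mean {mean:.3f}s, stdev {spread:.3f}s over 5 runs")
```

If the spread within one version is comparable to the gap between versions, more runs are needed before drawing a conclusion.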

I’d love to hear more.

Re: Slower indexing speed in Solr 8.0.0

Zheng Lin Edwin Yeo
I'm using an external ZooKeeper and running SolrCloud with one shard and two
replicas. This is a testing setup, so there is only one machine.
The input data comes from CSV files. I am indexing one CSV file at a
time, and each CSV file contains 3 million records.
I'm indexing using the code from SimplePostTool.

I have already tried it more than 10 times, and every time the indexing
in 8.0 was at least 40% slower than in 7.7.1.

Regards,
Edwin




Re: Slower indexing speed in Solr 8.0.0

Toke Eskildsen-2
In reply to this post by Zheng Lin Edwin Yeo
On Wed, 2019-04-03 at 10:17 +0800, Zheng Lin Edwin Yeo wrote:
> What could be the reason that causes the indexing to be slower in
> Solr 8.0.0?

As Aroop states, there can be multiple explanations. One of them is the
change to how DocValues are handled in 8.0.0. The indexing impact
should be tiny, but mistakes happen. With that in mind, do you have
DocValues enabled for a lot of your fields?

Performance issues like this one are notoriously hard to debug remotely.
Is it possible for you to share your setup and your test data?

- Toke Eskildsen, Royal Danish Library



Re: Slower indexing speed in Solr 8.0.0

Zheng Lin Edwin Yeo
Yes, I am using DocValues for most of my fields.

    <fieldType name="string" class="solr.StrField" sortMissingLast="true" docValues="true" />
    <fieldType name="int" class="solr.TrieIntField" docValues="true" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="float" class="solr.TrieFloatField" docValues="true" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="long" class="solr.TrieLongField" docValues="true" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="double" class="solr.TrieDoubleField" docValues="true" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="tint" class="solr.TrieIntField" docValues="true" precisionStep="8" positionIncrementGap="0"/>
    <fieldType name="tfloat" class="solr.TrieFloatField" docValues="true" precisionStep="8" positionIncrementGap="0"/>
    <fieldType name="tlong" class="solr.TrieLongField" docValues="true" precisionStep="8" positionIncrementGap="0"/>
    <fieldType name="tdouble" class="solr.TrieDoubleField" docValues="true" precisionStep="8" positionIncrementGap="0"/>
    <fieldType name="date" class="solr.TrieDateField" docValues="true" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="tdate" class="solr.TrieDateField" docValues="true" precisionStep="6" positionIncrementGap="0"/>

I am using dynamicField: the field names in the CSV file carry suffixes
such as _s, _i, etc.

   <dynamicField name="*_i"  type="int"    indexed="true"  stored="true"/>
   <dynamicField name="*_is" type="int"    indexed="true"  stored="true" multiValued="true"/>
   <dynamicField name="*_s"  type="string_lower"  indexed="true" stored="true" />
   <dynamicField name="*_ss" type="string_lower"  indexed="true" stored="true" multiValued="true"/>


Currently we can't share the test data, as some of the records are
sensitive. Do you have any CSV data of your own that you can test with?
If not, we will have to remove all the sensitive data before I can share it.

Regards,
Edwin



Re: Slower indexing speed in Solr 8.0.0

Toke Eskildsen-2
On Wed, 2019-04-03 at 15:24 +0800, Zheng Lin Edwin Yeo wrote:
> Yes, I am using DocValues for most of my fields.

So that's a culprit. Thank you.

> Currently we can't share the test data yet as some of the records are
> sensitive. Do you have any data from CSV file that you can test?

Not really. I asked because it was a relatively easy way to do testing
(replicate your indexing flow with both Solr 7 & 8 as end-points,
attach JVisualVM to the Solrs and compare the profiles).


I'll put it on my to-do list to create a test or two with the scenario
"indexing from CSV with many DocValues fields". I'll try to generate
some test data and see if I can reproduce the problem with it. If this
is to be a JIRA, that's needed anyway. Can't promise when I'll get to it,
sorry.

If this does turn out to be the cause of your performance regression,
the fix (if possible) will be for a later Solr version. Currently it is
not possible to tweak the docValues indexing parameters outside of code
changes.


Do note that we're still operating on guesses here. The cause for your
regression might easily be elsewhere.

- Toke Eskildsen, Royal Danish Library



Re: Slower indexing speed in Solr 8.0.0

caomanhdat
Hi guys,

I'm seeing the same problem in Shalin's nightly indexing benchmark. It happened around this period:
git log --before=2018-12-07 --after=2018-11-21


--
Best regards,
Cao Mạnh Đạt
D.O.B : 31-07-1991
Cell: (+84) 946.328.329
E-mail: [hidden email]


Re: Slower indexing speed in Solr 8.0.0

Zheng Lin Edwin Yeo
In reply to this post by Toke Eskildsen-2
Hi Toke,

I have set docValues to false everywhere in my schema.xml and done the
indexing again.
There isn't any difference in the indexing speed compared to when
docValues were enabled.
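
For anyone wanting to repeat this experiment on a copy of their own schema, the change can be scripted. The following is a minimal Python sketch under stated assumptions: the function name `disable_doc_values` and the sample schema (including its `name` and `version` attributes) are illustrative and not taken from this thread. It only rewrites attributes explicitly set to "true", and ElementTree drops XML comments when re-serializing, so the output is best treated as a throwaway test copy.

```python
import xml.etree.ElementTree as ET


def disable_doc_values(schema_xml: str) -> str:
    """Return the schema with every explicit docValues="true" set to "false"."""
    root = ET.fromstring(schema_xml)
    for element in root.iter():
        if element.get("docValues") == "true":
            element.set("docValues", "false")
    return ET.tostring(root, encoding="unicode")


# Illustrative sample only; a real schema.xml is much larger.
sample = """<schema name="example" version="1.6">
  <fieldType name="string" class="solr.StrField" sortMissingLast="true" docValues="true"/>
  <fieldType name="int" class="solr.TrieIntField" docValues="true" precisionStep="0" positionIncrementGap="0"/>
  <dynamicField name="*_i" type="int" indexed="true" stored="true"/>
</schema>"""

print(disable_doc_values(sample))
```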

Seems like the cause of the regression might be somewhere else?

Regards,
Edwin

Re: Slower indexing speed in Solr 8.0.0

Toke Eskildsen-2
On Wed, 2019-04-03 at 18:04 +0800, Zheng Lin Edwin Yeo wrote:
> I have tried to set all the docValues in my schema.xml to false and
> do the indexing again.
> There isn't any difference with the indexing speed as compared to
> when we have enabled the docValues.

Thank you for sparing me the work.

- Toke Eskildsen, Royal Danish Library



Re: Slower indexing speed in Solr 8.0.0

david.w.smiley@gmail.com
In reply to this post by caomanhdat
What/where is this benchmark?  I recall Ishan once working with a
volunteer to set up something like what Lucene has, but sadly it was not
successful.

Re: Slower indexing speed in Solr 8.0.0

david.w.smiley@gmail.com
In reply to this post by Zheng Lin Edwin Yeo
Hi Edwin,

I'd like to rule something out.  Does your schema define a field "_root_"?
If you don't have nested documents then remove it.  Its presence adds
indexing weight in 8.0 that was not there previously.  I'm not sure how
much though; I had hoped it was small, but who knows.
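
One way to check for the field before editing the schema by hand, as a minimal Python sketch. The function name `defines_root_field` and both sample schemas are made up for illustration; only the field name "_root_" comes from the question above.

```python
import xml.etree.ElementTree as ET


def defines_root_field(schema_xml: str) -> bool:
    """Return True if the schema declares a <field> named "_root_"."""
    root = ET.fromstring(schema_xml)
    return any(field.get("name") == "_root_" for field in root.iter("field"))


# Illustrative samples only.
with_root = """<schema name="example" version="1.6">
  <field name="id" type="string" indexed="true" stored="true"/>
  <field name="_root_" type="string" indexed="true" stored="false"/>
</schema>"""

without_root = """<schema name="example" version="1.6">
  <field name="id" type="string" indexed="true" stored="true"/>
</schema>"""

print(defines_root_field(with_root), defines_root_field(without_root))
# prints "True False"
```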

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


Re: Slower indexing speed in Solr 8.0.0

Zheng Lin Edwin Yeo
Hi David,

Yes, I do have the field "_root_" in the schema.

   <field name="_root_" type="string" indexed="true" stored="false" docValues="false" />

However, I don't think I use the field, and there was no difference in
the indexing speed after I removed it.

Regards,
Edwin
