indexing performance 6.6 vs 7.1

Rob Audenaerde
Hi all,

We recently upgraded from Lucene 6.6 to 7.1. We see a significant drop in
indexing performance.

Our use of Lucene is atypical, as we (also) index some database tables
and add all the values as AssociationFacetFields as well. This allows us to
create pivot tables on search results really fast.

These tables have some overlapping columns, but also disjoint ones.

We anticipated a decrease in index size because of the sparse docvalues. We
see this happening, with indexes shrinking to ~50%-80% of their original size.
But we did not expect a drop in indexing performance (indexing time on client
systems increased by 50% to 250%).

(Our indexing speed used to be bound mainly by the speed at which the Taxonomy
could deliver new ordinals for new values. We are currently investigating
whether this is still the case and will report back once a profiler run has
been done.)

Does anyone know if this increase in indexing time is to be expected as a
result of the sparse docvalues change?

Kind regards,

Rob Audenaerde

Re: indexing performance 6.6 vs 7.1

Erick Erickson
My first question is always "are you running the Solr CPUs flat out?".
My guess in this case is that the indexing client is the same and the
problem is in Solr, but it's worth checking whether the clients are
just somehow not delivering docs as fast as they were before.

My suspicion is that the indexing client hasn't changed, but it's
worth checking.

Best,
Erick

On Thu, Jan 18, 2018 at 2:23 AM, Rob Audenaerde
<[hidden email]> wrote:

> Hi all,
>
> We recently upgraded from Lucene 6.6 to 7.1. We see a significant drop in
> indexing performance.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: indexing performance 6.6 vs 7.1

Robert Muir
Erick, I don't think Solr was mentioned here.

On Thu, Jan 18, 2018 at 8:03 AM, Erick Erickson <[hidden email]> wrote:

> My first question is always "are you running the Solr CPUs flat out?".
> My guess in this case is that the indexing client is the same and the
> problem is in Solr, but it's worth checking whether the clients are
> just somehow not delivering docs as fast as they were before.

Re: indexing performance 6.6 vs 7.1

Adrien Grand
In reply to this post by Rob Audenaerde
If you have sparse data, I would have expected index time to *decrease*,
not increase.

Can you enable the IndexWriter (IW) info stream and share flush + merge
times to see where indexing time goes?

If you can run with a profiler, this might also give useful information.
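
For readers who have not used it, enabling the info stream might look roughly like the sketch below. The index path and the choice of StandardAnalyzer are placeholders and not details from this thread; the relevant call is `IndexWriterConfig.setInfoStream(PrintStream)`, and the snippet needs Lucene on the classpath.

```java
import java.io.IOException;
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class InfoStreamExample {
    public static void main(String[] args) throws IOException {
        // "/tmp/infostream-index" is a placeholder path, not taken from the thread.
        try (Directory dir = FSDirectory.open(Paths.get("/tmp/infostream-index"))) {
            IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
            // Send IndexWriter's diagnostic messages (flushes, merges, commits) to stdout.
            iwc.setInfoStream(System.out);
            try (IndexWriter writer = new IndexWriter(dir, iwc)) {
                // ... add documents here as the application normally does ...
                writer.commit();
            }
        }
    }
}
```

The output is verbose, so redirecting it to a file per run makes it easier to compare a 6.6 run against a 7.1 run.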


Re: indexing performance 6.6 vs 7.1

Erick Erickson
Robert:

Ah, right. I keep confusing my gmail lists
"lucene dev"
and
"lucene list"....

Siiigggghhhhh.



On Thu, Jan 18, 2018 at 9:18 AM, Adrien Grand <[hidden email]> wrote:

> If you have sparse data, I would have expected index time to *decrease*,
> not increase.

Re: indexing performance 6.6 vs 7.1

Rob Audenaerde
Hi all,

Some follow-up (sorry for the delay).

We built a benchmark in our application, and profiled it (on a smallish
data set). What we currently see in the profiler is that in Lucene 7.1 the
calls to `commit()` take much longer.

The self-time of committing in 6.6: 3,215 ms
The self-time of committing in 7.1: 10,187 ms
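
As a rough check on these two figures (simple arithmetic on the numbers above, not an additional measurement):

```latex
\frac{10{,}187\ \text{ms}}{3{,}215\ \text{ms}} \approx 3.17
\quad\Longrightarrow\quad
\text{commit self-time increased by about } 217\%
```

That is of the same order as the upper end of the 50% to 250% increase in overall indexing time reported at the start of the thread.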

We will try a larger data set and, later, also a run with the IW info
stream enabled.

-Rob

On Thu, Jan 18, 2018 at 7:03 PM, Erick Erickson <[hidden email]>
wrote:

> Robert:
>
> Ah, right. I keep confusing my gmail lists
> "lucene dev"
> and
> "lucene list"....

RE: indexing performance 6.6 vs 7.1

Uwe Schindler
Hi,

How often do you commit? If you index the data initially (that's the case where indexing needs to be fast), one would call commit at the end of the whole job, so the actual time it takes is not so important.

If you have a system where the index is updated all the time, then of course committing is also something you have to take into account. Systems like Solr or Elasticsearch use a transaction log in parallel to indexing, so they commit very rarely. If the system crashes, the changes since the last commit are replayed from the translog.
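
To illustrate the transaction-log idea, here is a minimal in-memory sketch. It is not how Solr or Elasticsearch implement their logs: the class `TranslogSketch` and all of its methods are hypothetical, and plain lists stand in for the durable index and the on-disk log.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of "log every operation, commit rarely, replay after a crash". */
public class TranslogSketch {
    private final List<String> committed = new ArrayList<>(); // stands in for the durable index
    private final List<String> buffered = new ArrayList<>();  // in-memory changes, lost on a crash
    private final List<String> translog = new ArrayList<>();  // stands in for a durable append-only log

    public void index(String doc) {
        translog.add(doc);  // cheap append, recorded before anything else
        buffered.add(doc);  // indexed in memory, but not yet committed
    }

    public void commit() {  // the expensive step, called rarely
        committed.addAll(buffered);
        buffered.clear();
        translog.clear();   // everything logged so far is now covered by the commit
    }

    public void crash() {   // simulated crash: uncommitted in-memory state disappears
        buffered.clear();
    }

    public void recover() { // replay the operations logged since the last commit
        buffered.addAll(translog);
    }

    public int docCount() {
        return committed.size() + buffered.size();
    }

    public static void main(String[] args) {
        TranslogSketch index = new TranslogSketch();
        index.index("doc-1");
        index.commit();
        index.index("doc-2");
        index.index("doc-3");
        index.crash();
        System.out.println("after crash:   " + index.docCount() + " docs");
        index.recover();
        System.out.println("after replay:  " + index.docCount() + " docs");
    }
}
```

The point of the design is that appending to the log is much cheaper than a commit, so durability no longer depends on committing frequently.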

Uwe

-----
Uwe Schindler
Achterdiek 19, D-28357 Bremen
http://www.thetaphi.de
eMail: [hidden email]

> -----Original Message-----
> From: Rob Audenaerde [mailto:[hidden email]]
> Sent: Monday, January 29, 2018 11:29 AM
> To: [hidden email]
> Subject: Re: indexing performance 6.6 vs 7.1
>
> We built a benchmark in our application, and profiled it (on a smallish
> data set). What we currently see in the profiler is that in Lucene 7.1 the
> calls to `commit()` take much longer.
>
> The self-time committing in 6.6: 3,215 ms
> The self-time committing in 7.1: 10,187 ms.

Re: indexing performance 6.6 vs 7.1

Rob Audenaerde
Hi Uwe,

Thanks for the reply. We commit often. Actually, in the benchmark, we
commit every 60 documents (but we will run a larger set with fewer commits).
The number of commits we call does not change between 6.6 and 7.1. In our
production systems we commit every 5000 documents.

We dug deeper into the commit methods, and the main difference currently
seems to be in the calls to java.util.zip.Checksum.update(). The number of
calls to that method is around 11M in 6.6 and 21M in 7.1, so almost twice
as many.
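
One way to double-check such numbers outside a profiler is to wrap a `java.util.zip.Checksum` and count both calls and bytes. The sketch below is an illustration only: `CountingChecksum` is a hypothetical helper, not part of Lucene, and it simply wraps a plain `CRC32`. It also shows that the number of `update()` calls depends on how the data is fed in (byte by byte or in bulk), so call count and checksummed bytes are worth tracking separately.

```java
import java.util.zip.CRC32;
import java.util.zip.Checksum;

/** Hypothetical helper: wraps a CRC32 and counts update() calls and bytes. */
public class CountingChecksum implements Checksum {
    private final Checksum delegate = new CRC32();
    private long calls;
    private long bytes;

    @Override
    public void update(int b) {
        delegate.update(b);
        calls++;
        bytes++;
    }

    @Override
    public void update(byte[] b, int off, int len) {
        delegate.update(b, off, len);
        calls++;
        bytes += len;
    }

    @Override
    public long getValue() {
        return delegate.getValue();
    }

    @Override
    public void reset() {
        delegate.reset(); // resets the checksum value, not the counters
    }

    public long calls() {
        return calls;
    }

    public long bytes() {
        return bytes;
    }

    public static void main(String[] args) {
        byte[] data = new byte[4096];

        CountingChecksum bulk = new CountingChecksum();
        bulk.update(data, 0, data.length); // one call for the whole buffer

        CountingChecksum perByte = new CountingChecksum();
        for (byte b : data) {
            perByte.update(b);             // one call per byte
        }

        System.out.println("bulk:     " + bulk.calls() + " calls, " + bulk.bytes() + " bytes");
        System.out.println("per byte: " + perByte.calls() + " calls, " + perByte.bytes() + " bytes");
        System.out.println("same checksum: " + (bulk.getValue() == perByte.getValue()));
    }
}
```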

-Rob

On Mon, Jan 29, 2018 at 12:18 PM, Uwe Schindler <[hidden email]> wrote:

> Hi,
>
> How often do you commit? If you index the data initially (that's the case
> where indexing needs to be fast), one would call commit at the end of the
> whole job, so the actual time it takes is not so important.

Re: indexing performance 6.6 vs 7.1

Rob Audenaerde
Hi all,

We ran the benchmarks (6.6 vs 7.1) with the IW info stream enabled and, as
attachments cannot be too large, I uploaded the logs to Google Drive. They can
be found here:

https://drive.google.com/open?id=1-nAHgpPO3qZ78lnvvlQ0_lF4uHJ-cWLh

Thanks in advance,
-Rob

On Mon, Jan 29, 2018 at 1:08 PM, Rob Audenaerde <[hidden email]>
wrote:

> Hi Uwe,
>
> Thanks for the reply. We commit often. Actually, in the benchmark, we
> commit every 60 documents (but we will run a larger set with fewer commits).
> The number of commits we call does not change between 6.6 and 7.1. In our
> production systems we commit every 5000 documents.

Re: indexing performance 6.6 vs 7.1

Adrien Grand
Hi Rob,

I don't think your benchmark is good. If I read it correctly, it only
indexes between 21k and 22k documents, which is tiny. Plus it should try to
better replicate production workload, otherwise we will draw wrong
conclusions.

I also suspect something is not quite right in your indexing code. When I
look at the IW logs, 562 out of the 642 flushes write only 1 document. I'm
not surprised that this exacerbates the cost of checksums, which are cheaper
to compute on one large file than on many tiny files. For the record, even
committing every 5k documents still sounds too frequent to me for an
application that is indexing heavily. Maybe you should consider moving to a
time-based policy, e.g. commit every 10 minutes?
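
A time-based policy could be sketched as below. This is only one possible shape, under the assumption that the indexing loop asks the policy after each added document whether a commit is due; `TimeBasedCommitPolicy` and `shouldCommit` are hypothetical names, and the actual writer commit call is left to the application.

```java
import java.time.Duration;

/** Hypothetical helper: signals a commit once a fixed interval has elapsed. */
public class TimeBasedCommitPolicy {
    private final long intervalNanos;
    private long lastCommitNanos;

    public TimeBasedCommitPolicy(Duration interval, long startNanos) {
        this.intervalNanos = interval.toNanos();
        this.lastCommitNanos = startNanos;
    }

    /** Returns true, and restarts the interval, once at least `interval` has passed since the last commit. */
    public boolean shouldCommit(long nowNanos) {
        if (nowNanos - lastCommitNanos >= intervalNanos) {
            lastCommitNanos = nowNanos;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        TimeBasedCommitPolicy policy =
                new TimeBasedCommitPolicy(Duration.ofMinutes(10), System.nanoTime());

        // In the indexing loop, after each document has been added:
        if (policy.shouldCommit(System.nanoTime())) {
            System.out.println("commit now"); // real code would call the writer's commit here
        } else {
            System.out.println("not yet");
        }
    }
}
```

Passing the current time in as a parameter keeps the policy easy to test without waiting for real time to pass.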

On Wed, Jan 31, 2018 at 10:25, Rob Audenaerde <[hidden email]> wrote:

> Hi all,
>
> We ran the benchmarks (6.6 vs 7.1) with IW info stream and (as attachment
> cannot be too large) I uploaded them to google drive. They can be found
> here:
>
> https://drive.google.com/open?id=1-nAHgpPO3qZ78lnvvlQ0_lF4uHJ-cWLh

Re: indexing performance 6.6 vs 7.1

Rob Audenaerde
Hi Adrien,

Thanks for the response. Good points too!

We actually went with a smallish benchmark to be able to profile the
application within reasonable time.

We will do a larger benchmark (say, 1M documents, without profiling) and I
will revisit the commit code as well. (IIRC we actually increased the
commit frequency a while back because of issues, maybe out-of-memory
issues; that was in the Lucene 4.x days, so it might no longer be relevant.)

What I don't understand yet is how this difference (between 6 and 7) came
to be. I was reading the change log but could not really pinpoint it. Sure,
the commits are far from optimal, but we use the same commit strategy
in 6.6 and 7.1.

-Rob




On Wed, Jan 31, 2018 at 1:56 PM, Adrien Grand <[hidden email]> wrote:

> Hi Rob,
>
> I don't think your benchmark is good. If I read it correctly, it only
> indexes between 21k and 22k documents, which is tiny. Plus it should try to
> better replicate production workload, otherwise we will draw wrong
> conclusions.
>
> I also suspect something is not quite right in your indexing code. When I
> look at the IW logs, 562 out of the 642 flushes only write 1 document. I'm
> not surprised that it exacerbates the cost of checksums, which are cheaper
> to compute on one large file than on many tiny files. For the record, even
> committing every 5k documents still sounds too frequent to me for an
> application that is heavily indexing. Maybe you should consider moving to a
> time-based policy? eg. commit every 10 minutes?
>
> Le mer. 31 janv. 2018 à 10:25, Rob Audenaerde <[hidden email]> a
> écrit :
>
> > Hi all,
> >
> > We ran the benchmarks (6.6 vs 7.1) with the IW info stream and, as
> > attachments cannot be too large, I uploaded the logs to Google Drive.
> > They can be found here:
> >
> > https://drive.google.com/open?id=1-nAHgpPO3qZ78lnvvlQ0_lF4uHJ-cWLh
> >
> > Thanks in advance,
> > -Rob
> >
> > On Mon, Jan 29, 2018 at 1:08 PM, Rob Audenaerde <[hidden email]> wrote:
> >
> > > Hi Uwe,
> > >
> > > Thanks for the reply. We commit often. Actually, in the benchmark, we
> > > commit every 60 documents (but we will run a larger set with fewer
> > > commits). The number of commits we call does not change between 6.6 and
> > > 7.1. In our production systems we commit every 5000 documents.
> > >
> > > We dug deeper into the commit methods, and currently the main
> > > difference seems to be the calls to java.util.zip.Checksum.update().
> > > The number of calls to that method is around 11M in 6.6 and 21M in 7.1,
> > > so almost twice the calls.
> > >
> > > -Rob
> > >
> > > On Mon, Jan 29, 2018 at 12:18 PM, Uwe Schindler <[hidden email]> wrote:
> > >
> > >> Hi,
> > >>
> > >> How often do you commit? If you index the data initially (that's the
> > >> case where indexing needs to be fast), one would call commit at the end
> > >> of the whole job, so the actual time it takes is not so important.
> > >>
> > >> If you have a system where the index is updated all the time, then of
> > >> course committing is also something you have to take into account.
> > >> Systems like Solr or Elasticsearch use a transaction log in parallel to
> > >> indexing, so they commit very seldom. If the system crashes, the changes
> > >> are replayed from the transaction log since the last commit.
> > >>
> > >> Uwe
> > >>
> > >> -----
> > >> Uwe Schindler
> > >> Achterdiek 19, D-28357 Bremen
> > >> http://www.thetaphi.de
> > >> eMail: [hidden email]
> > >>
> > >> > -----Original Message-----
> > >> > From: Rob Audenaerde [mailto:[hidden email]]
> > >> > Sent: Monday, January 29, 2018 11:29 AM
> > >> > To: [hidden email]
> > >> > Subject: Re: indexing performance 6.6 vs 7.1
> > >> >
> > >> > Hi all,
> > >> >
> > >> > Some follow up (sorry for the delay).
> > >> >
> > >> > We built a benchmark in our application, and profiled it (on a
> > >> > smallish data set). What we currently see in the profiler is that in
> > >> > Lucene 7.1 the calls to `commit()` take much longer.
> > >> >
> > >> > The self-time committing in 6.6: 3,215 ms
> > >> > The self-time committing in 7.1: 10,187 ms.
> > >> >
> > >> > We will try to run a larger data set and also later with the IW info
> > >> > stream.
> > >> >
> > >> > -Rob
> > >> >
> > >> > On Thu, Jan 18, 2018 at 7:03 PM, Erick Erickson <[hidden email]> wrote:
> > >> >
> > >> > > Robert:
> > >> > >
> > >> > > Ah, right. I keep confusing my gmail lists
> > >> > > "lucene dev"
> > >> > > and
> > >> > > "lucene list"....
> > >> > >
> > >> > > Siiigggghhhhh.
> > >> > >
> > >> > >
> > >> > >
> > >> > > On Thu, Jan 18, 2018 at 9:18 AM, Adrien Grand <[hidden email]> wrote:
> > >> > > > If you have sparse data, I would have expected index time to
> > >> > > > *decrease*, not increase.
> > >> > > >
> > >> > > > Can you enable the IW info stream and share flush + merge times to
> > >> > > > see where indexing time goes?
> > >> > > >
> > >> > > > If you can run with a profiler, this might also give useful
> > >> > > > information.
> > >> > > >
> > >> > > > On Thu, Jan 18, 2018 at 11:23 AM, Rob Audenaerde <[hidden email]> wrote:
> > >> > > >
> > >> > > >> Hi all,
> > >> > > >>
> > >> > > >> We recently upgraded from Lucene 6.6 to 7.1. We see a significant
> > >> > > >> drop in indexing performance.
> > >> > > >>
> > >> > > >> We have an atypical use of Lucene, as we (also) index some database
> > >> > > >> tables and add all the values as AssociatedFacetFields as well.
> > >> > > >> This allows us to create pivot tables on search results really fast.
> > >> > > >>
> > >> > > >> These tables have some overlapping columns, but also disjoint ones.
> > >> > > >>
> > >> > > >> We anticipated a decrease in index size because of the sparse
> > >> > > >> docvalues. We see this happening, with decreases to ~50%-80% of the
> > >> > > >> original index size. But we did not expect a drop in indexing
> > >> > > >> performance (client systems' indexing time increased by +50% to
> > >> > > >> +250%).
> > >> > > >>
> > >> > > >> (Our indexing speed used to be mainly bound by the speed at which
> > >> > > >> the Taxonomy could deliver new ordinals for new values. We are
> > >> > > >> currently investigating whether this is still the case and will
> > >> > > >> report later when a profiler run has been done.)
> > >> > > >>
> > >> > > >> Does anyone know if this increase in indexing time is to be expected
> > >> > > >> as a result of the sparse docvalues change?
> > >> > > >>
> > >> > > >> Kind regards,
> > >> > > >>
> > >> > > >> Rob Audenaerde
> > >> > > >>