Provide suggestion on indexing performance

Aman Tandon
Hi,

We want to compare indexing performance in the two scenarios below. Assume
10 string fields in total and 10 million documents.

1) indexed=true, stored=true
2) indexed=true, docValues=true

Which one should we prefer for indexing performance? Please share
your experience.

With regards,
Aman Tandon

Re: Provide suggestion on indexing performance

Tom Evans
On Tue, Sep 12, 2017 at 4:06 AM, Aman Tandon <[hidden email]> wrote:

> Hi,
>
> We want to know about the indexing performance in the below mentioned
> scenarios, consider the total number of 10 string fields and total number
> of documents are 10 million.
>
> 1) indexed=true, stored=true
> 2) indexed=true, docValues=true
>
> Which one should we prefer in terms of indexing performance, please share
> your experience.
>
> With regards,
> Aman Tandon

Your question doesn't make much sense. You turn on stored when you
need to retrieve the original contents of the fields after searching,
and you use docValues to speed up faceting, sorting and grouping.
Using docValues to retrieve values during search is more expensive
than simply using stored values, so if your primary aim is retrieving
stored values, use stored=true.
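
For reference, the two configurations under discussion can be written as Solr Schema API `add-field` commands. The sketch below only builds the JSON payloads and sends nothing. The field name `title_s` and the choice to set the unused flag explicitly to false are assumptions for illustration, not something stated in this thread.

```python
import json


def add_field_command(name, stored, doc_values):
    """Build a Schema API 'add-field' command for an indexed string field."""
    return {
        "add-field": {
            "name": name,
            "type": "string",
            "indexed": True,
            "stored": stored,
            "docValues": doc_values,
        }
    }


# Option 1: indexed + stored (retrieve the original value after searching).
option_stored = add_field_command("title_s", stored=True, doc_values=False)

# Option 2: indexed + docValues (faceting, sorting, grouping).
option_doc_values = add_field_command("title_s", stored=False, doc_values=True)

if __name__ == "__main__":
    print(json.dumps(option_stored, indent=2))
    print(json.dumps(option_doc_values, indent=2))
```

Either payload would be POSTed as JSON to the collection's `/schema` endpoint. Which flags you actually need depends on whether the field has to be returned, sorted on, or faceted on.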

Secondly, the only way to answer performance questions for your schema
and data is to try it out. Generate 10 million docs, store them in a
file (e.g. as CSV), and then use the post tool to try different schema
and query options.
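
A minimal sketch of generating such a test file, assuming random lowercase strings are an acceptable stand-in for real data (they usually are not representative of real value distributions). The column names `field1_s` through `field10_s` and the file name `docs.csv` are hypothetical and would need to match the schema being tested.

```python
import csv
import random
import string


def write_test_docs(path, num_docs, num_fields=10, seed=42):
    """Write num_docs rows: an id column plus num_fields random string columns."""
    rng = random.Random(seed)  # fixed seed so every run produces the same data
    header = ["id"] + [f"field{i}_s" for i in range(1, num_fields + 1)]
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        for doc_id in range(num_docs):
            values = [
                "".join(rng.choices(string.ascii_lowercase, k=12))
                for _ in range(num_fields)
            ]
            writer.writerow([doc_id] + values)
    return header


if __name__ == "__main__":
    # Use 10_000_000 for the full test; a small number is handy for a dry run.
    write_test_docs("docs.csv", num_docs=1000)
```

The resulting file can then be sent with the post tool; in Solr releases of this era that is something like `bin/post -c <collection> docs.csv`, once per schema variant against a fresh index.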

Cheers

Tom

Re: Provide suggestion on indexing performance

Sreenivas.T
I agree with Tom. Doc values and stored fields exist for different
reasons. Doc values are a separate, column-oriented structure that gets
built at index time for faster sorting and faceting.
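
To make the difference concrete, here is a toy model in Python. It is only an illustration of row-oriented versus column-oriented access, not Lucene's actual on-disk format, and the sample documents are made up.

```python
from collections import Counter

# Row-oriented, like stored fields: one record per document.
stored = [
    {"id": "1", "color": "red"},
    {"id": "2", "color": "blue"},
    {"id": "3", "color": "red"},
]

# Column-oriented, like doc values: one array per field, indexed by document number.
doc_values = {"color": ["red", "blue", "red"]}


def facet_counts(column):
    """Count values by scanning a single column, without loading whole records."""
    return Counter(column)


def fetch_document(doc_number):
    """Return every field of one document in a single lookup."""
    return stored[doc_number]
```

Faceting or sorting touches one field across many documents, which suits the column layout. Returning search results touches many fields of a few documents, which suits the row layout.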

On Wed, Sep 13, 2017 at 11:30 PM Tom Evans <[hidden email]> wrote:

> Your question doesn't make much sense. You turn on stored when you
> need to retrieve the original contents of the fields after searching,
> and you use docvalues to speed up faceting, sorting and grouping.
> Using docvalues to retrieve values during search is more expensive
> than simply using stored values, so if your primary aim is retrieving
> stored values, use stored=true.
>
> Secondly, the only way to answer performance questions for your schema
> and data is to try it out. Generate 10 million docs, store them in a
> doc (eg as CSV), and then use the post tool to try different schema
> and query options.
>
> Cheers
>
> Tom

Re: Provide suggestion on indexing performance

Shawn Heisey-2
In reply to this post by Aman Tandon
On 9/11/2017 9:06 PM, Aman Tandon wrote:
> We want to know about the indexing performance in the below mentioned
> scenarios, consider the total number of 10 string fields and total number
> of documents are 10 million.
>
> 1) indexed=true, stored=true
> 2) indexed=true, docValues=true
>
> Which one should we prefer in terms of indexing performance, please share
> your experience.

There are several settings in the schema for each field, things like
indexed, stored, docValues, multiValued, and others.  You should base
your choices on what you need Solr to do.  Choosing these settings based
purely on desired indexing speed may result in Solr not doing what you
want it to do.

When the indexing system sends data to Solr with several threads or
processes, Solr is *usually* capable of indexing data faster than most
systems can supply it.  The more settings you disable on a field, the
faster Solr will be able to index.
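
A rough sketch of what "several threads" can look like on the client side: split the documents into batches and send them from a small thread pool. `send_batch` is a hypothetical placeholder for whatever actually posts a batch to Solr (an HTTP client, SolrJ, and so on), and the batch size and worker count are arbitrary starting points, not recommendations.

```python
from concurrent.futures import ThreadPoolExecutor


def batches(docs, size):
    """Yield consecutive slices of at most `size` documents."""
    for start in range(0, len(docs), size):
        yield docs[start:start + size]


def index_in_parallel(docs, send_batch, batch_size=1000, workers=4):
    """Send batches concurrently and return the number of documents sent."""

    def send_and_count(batch):
        send_batch(batch)  # placeholder: posts one batch to Solr
        return len(batch)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # An exception raised by send_batch propagates when results are consumed.
        return sum(pool.map(send_and_count, batches(docs, batch_size)))
```

If throughput does not improve as workers are added, the bottleneck is more likely the data source or the network than Solr itself.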

It is not possible to provide precise numbers, because performance
depends on many factors, some of which you may not even know until you
build a production system.

https://lucidworks.com/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

All that said ... docValues MIGHT be a little bit faster than stored,
because stored data is compressed, and the compression takes CPU time. 
On a fully populated production system, that statement might turn out to
be wrong.  There may be factors that result in stored fields working
better.  The best way to decide is to try it both ways with all your data.
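
One way to compare the two runs is to time each one and look at documents per second. The helper below is a sketch under the assumption that each run is wrapped in a zero-argument callable (hypothetical here) that indexes the full data set into a fresh index and commits before returning.

```python
import time


def docs_per_second(num_docs, run_indexing, clock=time.perf_counter):
    """Time one indexing run and return its throughput in documents per second."""
    start = clock()
    run_indexing()
    elapsed = clock() - start
    return num_docs / elapsed if elapsed > 0 else float("inf")


def compare(num_docs, runs):
    """`runs` maps a label (e.g. 'stored', 'docValues') to a zero-argument callable."""
    return {label: docs_per_second(num_docs, run) for label, run in runs.items()}
```

Repeating each run a few times and comparing the spread is safer than trusting a single measurement, since caching and background merges can skew one run.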

Thanks,
Shawn


Re: Provide suggestion on indexing performance

Aman Tandon
In reply to this post by Sreenivas.T
Hi Tom,

Thanks for your suggestion and the information.

I will test this and share the results.

On Sep 14, 2017 2:32 PM, "Sreenivas.T" <[hidden email]> wrote:

> I agree with Tom. Doc values and stored fields are present for different
> reasons. Doc values is another index that gets build for faster
> sorting/faceting.

Re: Provide suggestion on indexing performance

Aman Tandon
In reply to this post by Shawn Heisey-2
Hi Shawn,

Thanks for your reply; this is really helpful. I will try this out to see
how docValues performs.

With regards,
Aman Tandon

On Sep 15, 2017 9:10 PM, "Shawn Heisey" <[hidden email]> wrote:

> There are several settings in the schema for each field, things like
> indexed, stored, docValues, multiValued, and others.  You should base
> your choices on what you need Solr to do.  Choosing these settings based
> purely on desired indexing speed may result in Solr not doing what you
> want it to do.
>
> When the indexing system sends data to Solr with several threads or
> processes, Solr is *usually* capable of indexing data faster than most
> systems can supply it.  The more settings you disable on a field, the
> faster Solr will be able to index.
>
> It is not possible to provide precise numbers, because performance
> depends on many factors, some of which you may not even know until you
> build a production system.
>
> https://lucidworks.com/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>
> All that said ... docValues MIGHT be a little bit faster than stored,
> because stored data is compressed, and the compression takes CPU time.
> On a fully populated production system, that statement might turn out to
> be wrong.  There may be factors that result in stored fields working
> better.  The best way to decide is to try it both ways with all your data.
>
> Thanks,
> Shawn