Questions about stored fields and updates.

Ash Ramesh
Hi everyone,

My company currently uses SOLR to completely hydrate client objects by
storing all fields (stored=true). Therefore we have 2 types of fields:

   1. indexed=true | stored=true : For fields that will be used for
   searching, sorting, etc.
   2. indexed=false | stored=true: For fields that only need hydrating for
   clients

We are re-architecting this so that we will eventually only get the id from
SOLR (fl=id) and hydrate from another data source. This means we can
obviously delete all the indexed=false | stored=true fields to reduce our
index size.
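As a rough illustration of the "fetch only the id, hydrate elsewhere" flow described above, here is a minimal Python sketch. The host, the collection name `designs`, the `hydrate` helper, and the in-memory dict standing in for the faster store are all assumptions made for illustration; only the `q`, `fl`, `rows`, and `wt` request parameters are standard Solr query parameters.

```python
from urllib.parse import urlencode


def build_id_only_query(base_url, collection, query, rows=10):
    """Build a select URL that asks Solr to return only the id field."""
    params = {"q": query, "fl": "id", "rows": rows, "wt": "json"}
    return f"{base_url}/{collection}/select?{urlencode(params)}"


def hydrate(ids, store):
    """Look up full objects by id in a separate store (hypothetical: a dict here)."""
    return [store[doc_id] for doc_id in ids if doc_id in store]


# Assumed host and collection name, for illustration only.
url = build_id_only_query("http://localhost:8983/solr", "designs", "title:poster")
print(url)

# Pretend Solr returned these ids; the dict stands in for the faster db store.
store = {"1": {"id": "1", "title": "Poster A"}, "2": {"id": "2", "title": "Poster B"}}
print(hydrate(["2", "1", "missing"], store))
```

Ids that the external store does not know about are simply skipped here; a real system would need to decide how to handle that mismatch.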

However, when it comes to the indexed=true | stored=true fields, we are not
sure whether to also set them to be stored=false and perform in-place
updates or leave it as is and perform atomic updates. We've done a fair bit
of research on the archives of this mailing list, but are still a bit
confused:

1. Will having the fields be converted from indexed=true | stored=true ->
indexed=true | stored=false cause our index size to reduce? Will it also
mean that indexing will be less compute expensive due to the compression of
stored field logic?
2. Are atomic updates preferred to in-place updates? Obviously if we move
to index only fields, then we have to do in-place updates all the time.
This isn't an issue for us, but we are a bit concerned about how SOLR's
indexing speed will suffer & deleted docs increase. Currently we perform
both.

Some points about our SOLR use case:
- 40-60M docs with 8 shards (PULL/TLOG structure) Solr 7.4
- No need for extremely fast indexing
- Need for high query throughput (which is why we only want to retrieve the id
field and hydrate with a faster db store)

Thanks everyone, always appreciate the good information being shared here
daily :)

Regards,

Ash

--
P.S. We've launched a new blog to share the latest ideas and case studies
from our team. Check it out here: product.canva.com
<http://product.canva.com/>.
<https://canva.com> Empowering the world to design
Also, we're hiring. Apply here! <https://about.canva.com/careers/>
<https://twitter.com/canva> <https://facebook.com/canva>
<https://au.linkedin.com/company/canva> <https://instagram.com/canva>






Re: Questions about stored fields and updates.

Shawn Heisey-2
On 11/3/2018 9:45 PM, Ash Ramesh wrote:

> My company currently uses SOLR to completely hydrate client objects by
> storing all fields (stored=true). Therefore we have 2 types of fields:
>
>     1. indexed=true | stored=true : For fields that will be used for
>     searching, sorting, etc.
>     2. indexed=false | stored=true: For fields that only need hydrating for
>     clients
>
> We are re-architecting this so that we will eventually only get the id from
> SOLR (fl=id) and hydrate from another data source. This means we can
> obviously delete all the indexed=false | stored=true fields to reduce our
> index size.
>
> However, when it comes to the indexed=true | stored=true fields, we are not
> sure whether to also set them to be stored=false and perform in-place
> updates or leave it as is and perform atomic updates. We've done a fair bit
> of research on the archives of this mailing list, but are still a bit
> confused:
>
> 1. Will having the fields be converted from indexed=true | stored=true ->
> indexed=true | stored=false cause our index size to reduce? Will it also
> mean that indexing will be less compute expensive due to the compression of
> stored field logic?

Pretty much anything you change from true to false in the schema will
reduce index size.

Removal of stored data will not *directly* improve query speed -- stored
data is not used during the query phase.  It might *indirectly* increase
query speed by removing data from the OS disk cache, leaving more room
for inverted index data.

The direct improvement from removing stored data will be during data
retrieval (after the query itself).  It will also mean there is less
data to compress, which means that indexing speed might increase.

> 2. Are atomic updates preferred to in-place updates? Obviously if we move
> to index only fields, then we have to do in-place updates all the time.
> This isn't an issue for us, but we are a bit concerned about how SOLR's
> indexing speed will suffer & deleted docs increase. Currently we perform
> both.

If you change stored to false, you will most likely not be able to do
atomic updates.  Atomic update functionality has very specific requirements:

https://lucene.apache.org/solr/guide/7_5/updating-parts-of-documents.html#field-storage

In-place updates have requirements that are even more strict than atomic
updates -- the field cannot be indexed or stored, and it must be a
single-valued numeric docValues field:

https://lucene.apache.org/solr/guide/7_5/updating-parts-of-documents.html#in-place-updates
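The two sets of requirements can be summarised in a small Python sketch. This is a simplified reading of the reference guide pages linked above, not Solr code: the `FieldDef` class and both helper functions are hypothetical names, and the guide lists further conditions (for example on the `_version_` field and on copyField targets of in-place-updated fields) that are not modelled here.

```python
from dataclasses import dataclass


@dataclass
class FieldDef:
    """Hypothetical, simplified view of one schema field."""
    name: str
    indexed: bool
    stored: bool
    doc_values: bool = False
    multi_valued: bool = False
    numeric: bool = False
    copy_field_dest: bool = False


def supports_in_place_update(field):
    """In-place: non-indexed, non-stored, single-valued numeric docValues field."""
    return (
        not field.indexed
        and not field.stored
        and not field.multi_valued
        and field.numeric
        and field.doc_values
    )


def schema_supports_atomic_updates(fields):
    """Atomic: every field stored or docValues, except copyField
    destinations, which must not be stored."""
    for field in fields:
        if field.copy_field_dest:
            if field.stored:
                return False
        elif not (field.stored or field.doc_values):
            return False
    return True


fields = [
    FieldDef("id", indexed=True, stored=True),
    FieldDef("title", indexed=True, stored=False),  # the proposed change
    FieldDef("popularity", indexed=False, stored=False, doc_values=True, numeric=True),
]
print(schema_supports_atomic_updates(fields))
print([f.name for f in fields if supports_in_place_update(f)])
```

With `title` switched to indexed-only, the schema check fails, which matches the point above that setting stored to false will most likely rule out atomic updates.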

Thanks,
Shawn


Re: Questions about stored fields and updates.

Ash Ramesh
Sorry Shawn,

I seem to have gotten my wording wrong. I meant that we wanted to move away
from atomic updates to replacing/reindexing the entire document whenever
changes are made.
https://lucene.apache.org/solr/guide/7_5/uploading-data-with-index-handlers.html#adding-documents
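For comparison, the request bodies for the two approaches might look roughly like the Python sketch below. The `set` modifier is the documented atomic-update syntax for Solr's JSON update format; the helper names and the sample field names are made up for illustration, and the sketch only builds the JSON without sending it anywhere.

```python
import json


def atomic_update_payload(doc_id, changes):
    """Atomic update: send the id plus a {"set": value} modifier per changed field."""
    doc = {"id": doc_id}
    for field, value in changes.items():
        doc[field] = {"set": value}
    return json.dumps([doc])


def full_replace_payload(document):
    """Full replace: send the whole document again; it overwrites the old one by id."""
    return json.dumps([document])


print(atomic_update_payload("42", {"title": "New title"}))
print(full_replace_payload({"id": "42", "title": "New title", "tags": ["poster", "a4"]}))
```

The full-replace body has to carry every field, so the client needs access to the complete document from the system of record.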

Regards,

Ash


Re: Questions about stored fields and updates.

Ash Ramesh
Also thanks for the information Shawn! :)


Re: Questions about stored fields and updates.

Erick Erickson
Ash:

Atomic updates are really a reindex of all the original fields. What happens is:
1> Solr gets all the stored fields from the disk
2> Solr overlays the new data
3> Solr re-indexes the entire document just as though it came from outside.

For step <3>, there's no difference at all between an atomic update
and the client having resent the entire document. There's still a doc
marked as deleted in the old segment and an entirely new document
being indexed into the current segment.
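The steps above can be sketched with a toy in-memory model in Python. This is not how Solr or Lucene is implemented; `ToyIndex` and its methods are hypothetical and only mirror the control flow: read the stored fields, overlay the new data, then add the result as a brand-new document while the old one is marked deleted.

```python
class ToyIndex:
    """Toy append-only index, for illustration only."""

    def __init__(self, stored_fields):
        # Only these fields are kept as stored data; the id is always stored.
        self.stored_fields = set(stored_fields) | {"id"}
        self.entries = []

    def _live_entry(self, doc_id):
        for entry in self.entries:
            if entry["id"] == doc_id and not entry["deleted"]:
                return entry
        return None

    def add(self, doc):
        """Full add: mark any old version deleted, then append a new entry."""
        old = self._live_entry(doc["id"])
        if old is not None:
            old["deleted"] = True
        stored = {k: v for k, v in doc.items() if k in self.stored_fields}
        self.entries.append({
            "id": doc["id"],
            "indexed_from": sorted(doc),  # field names the indexer saw
            "stored": stored,
            "deleted": False,
        })

    def atomic_update(self, doc_id, changes):
        """Step 1: read stored fields. Step 2: overlay. Step 3: re-add."""
        old = self._live_entry(doc_id)
        if old is None:
            raise KeyError(doc_id)
        merged = dict(old["stored"])
        merged.update(changes)
        self.add(merged)


index = ToyIndex(stored_fields=["title"])
index.add({"id": "1", "title": "old", "tags": "a b"})  # tags is indexed only
index.atomic_update("1", {"title": "new"})
for entry in index.entries:
    print(entry)
```

In this toy model the `tags` field is missing from the re-added document because it was never stored, which is the reason atomic updates need the original field values to be retrievable.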

As for efficiency, in the atomic update case you have to
1> seek/read the stored data off disk
2> decompress a 16K block (minimum)

vs. the case of re-indexing the whole doc from outside, where you

1> read the entire document off the wire and deserialize it

From there, everything's the same.

I haven't actually measured, but I'd guess that atomic updates are
actually more work than simply re-sending the doc from the client.
Now, all that said, and even assuming I'm right, unless you have a
pretty high indexing rate I doubt you'd notice.

But in general I strongly prefer re-indexing from my system of record
if at all possible, if for no other reason than that you'll have to do
it sometime anyway when you need to change your schema to support
different use cases.

Best,
Erick
