Index size increases disproportionately to size of added field when indexed=false

Howe, David

Hi,

We are using Solr 7.1.0 to index a database of addresses.  We have found that our index size increases massively when we add one extra field to the index, even though that field is stored and not indexed, and doesn’t contain a lot of data.  When this occurs, we also observe a significant increase in response times and CPU usage on the Solr server.

When we run an index load without the problematic field present, the Solr index size is 5.5GB.  When we add the field into the index, the size grows to 13.3GB.  The field itself is a maximum of 46 characters in length and on average is 19 characters. We have ~14,000,000 rows in total to index of which only ~200,000 have this field present at all (i.e. not null in database).  Given that we don’t want to index the field, only store it I would have thought (perhaps naively) that the storage increase would be approximately 200,000 * 19 = 3.8M bytes = 3.6MB rather than the 7.5GB we are seeing.

Some further background on what we are doing:

- We are using the Solr 7.1.0 docker image for our Solr server
- We are importing the data from an Oracle table using JDBC and the standard dataimport request handler
- As we want to push the docker image to AWS ECR which only accepts docker layers of a maximum of 10GB, we load the index in four separate imports, stopping Solr gracefully in between each load
- Our index contains 48 fields in total
- The problematic field is created through the API as follows:

  curl -X POST -H 'Content-type:application/json' --data-binary '{
    "add-field":{
      "name":"buildingName",
      "type":"string",
      "stored":true,
      "indexed":false
    }
  }' http://localhost:8983/solr/address/schema

I have also tried using SolrText instead of string, but that doesn't make a noticeable difference.

It also makes a difference how many records are loaded.  If I only load 1,000,000 records (that have a proportionate number of building names) then the size of the index with and without buildingName is about the same (~1GB).

Is there some sort of limit that I'm not aware of that we are hitting, either number of fields or size of data?  Is there some kind of corrupt data that I need to look for in the buildingName field that could cause this (it's just a varchar2(46) field in Oracle)?

Thanks for your assistance,

David

David Howe
Java Domain Architect
Postal Systems
Level 16, 111 Bourke Street Melbourne VIC 3000

T  0391067904

M  0424036591

E  [hidden email]

W  auspost.com.au
W  startrack.com.au

Australia Post is committed to providing our customers with excellent service. If we can assist you in any way please telephone 13 13 18 or visit our website.

The information contained in this email communication may be proprietary, confidential or legally professionally privileged. It is intended exclusively for the individual or entity to which it is addressed. You should only read, disclose, re-transmit, copy, distribute, act in reliance on or commercialise the information if you are authorised to do so. Australia Post does not represent, warrant or guarantee that the integrity of this email communication has been maintained nor that the communication is free of errors, virus or interference.

If you are not the addressee or intended recipient please notify us by replying direct to the sender and then destroy any electronic or paper copy of this message. Any views expressed in this email communication are taken to be those of the individual sender, except where the sender specifically attributes those views to Australia Post and is authorised to do so.

Please consider the environment before printing this email.

Re: Index size increases disproportionately to size of added field when indexed=false

Alessandro Benedetti
I assume you re-index in full, right?
My shot in the dark is that this increase is temporary.
When you re-index, you effectively delete and re-add all documents (this means
that even if the new field is only stored, the entire index is rebuilt for all
the fields).
New segments are created and the old docs are marked as deleted.
Until the background merge happens, the index could reach those sizes.

The weird thing is why the merge didn't kick in...
Have you configured any special approach to segment merging?

What happens if you explicitly optimize?
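
For reference, an explicit optimize (forced merge) can be requested over HTTP instead of through the admin console. This is a minimal sketch, assuming the core is called `address` and Solr listens on localhost:8983, as in the curl example from the original post; the function names are made up for illustration.

```shell
#!/bin/sh
# Base URL and core name are assumptions taken from the curl example in the original post.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"
CORE="${CORE:-address}"

# Build the update URL that asks Solr to merge the index down to at most $1 segments (default 1).
optimize_url() {
  printf '%s/%s/update?optimize=true&maxSegments=%s\n' "$SOLR_URL" "$CORE" "${1:-1}"
}

# Send the request. This rewrites the whole index and can take a long time.
optimize_index() {
  curl -s "$(optimize_url "$1")"
}

# Usage: optimize_index 1
```

An optimize only reclaims space held by deleted documents, so on an index that has never had deletes it is not expected to change the size much.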

Let us know ...




-----
---------------
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io

RE: Index size increases disproportionately to size of added field when indexed=false

Howe, David

Hi Alessandro,

Thanks for responding.  We rebuild the index every time starting from a fresh installation of Solr.  Because we are running at AWS, we have automated our deployment so we start with the base docker image, configure Solr and then import our data every time the data changes (it only changes once a fortnight).  Once the import finishes we save the docker image in the AWS docker repository.  We then build our cluster using that image as the base.  So we never re-index an existing index, we just build another one from scratch.

We haven't configured anything special for segments and merges.

When I look in the console, the index is shown as being optimized.  There doesn't seem to be an option in the console anymore to optimize an index.  If I have only ever inserted new documents, should I need to optimize?  I will try an optimize when I am back in the office tomorrow.

Regards,

David


RE: Index size increases disproportionately to size of added field when indexed=false

Alessandro Benedetti
Hi David,
given that you are actually building a new index from scratch, my
shot in the dark didn't hit any target.
When you say: "Once the import finishes we save the docker image in the
AWS docker repository.  We then build our cluster using that image as the
base"

Do you mean just configuration-wise?
Will the new cluster have any starting index on disk?
If I understood your latest statements correctly, I expect a NO here.

So you are building a completely new index and, comparing it to the old index
(which is completely separate), you notice such a big difference in size.
This is extremely suspicious.
Optimizing in the end is just a huge merge to force 1 (or N) final
segments.
Given the additional information you gave me, it's not going to make much
difference.

I would recommend checking how the index space is divided among the different
file formats [1]
(i.e. list how much space is dedicated to each extension).

Stored content is in the .fdt files.


[1]
https://lucene.apache.org/core/6_4_0/core/org/apache/lucene/codecs/lucene62/package-summary.html#file-names




Re: Index size increases disproportionately to size of added field when indexed=false

Erick Erickson
David:

Right, Optimize Is Evil. Well, actually in your case it's not. In your
specific case you can optimize every time you build your index and be
OK, gory details here:
https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/

But that's just for background. The key is how many deleted docs you
have, which you can see from the admin UI screen. If you have 0
deleted docs, you have 0 space that would be reclaimed by an optimize.
My bet is that you have no deleted docs, if so just forget the whole
optimize question as it's a red herring.
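
The deleted-document count shown in the admin UI can also be read from the command line through the Luke request handler. A minimal sketch, assuming the `address` core on localhost:8983 from the original post and an indented JSON response; the helper names are illustrative only.

```shell
#!/bin/sh
# Base URL and core name are assumptions taken from the original post.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"
CORE="${CORE:-address}"

# URL of the Luke handler; numTerms=0 skips the per-field term statistics.
luke_url() {
  printf '%s/%s/admin/luke?numTerms=0&wt=json&indent=on\n' "$SOLR_URL" "$CORE"
}

# Keep only the document counters from an indented Luke JSON response on stdin.
filter_doc_counts() {
  grep -E '"(numDocs|maxDoc|deletedDocs)"'
}

# Fetch the index summary and print the three counters.
show_doc_counts() {
  curl -s "$(luke_url)" | filter_doc_counts
}

# Usage: show_doc_counts
```

If `deletedDocs` is 0 (equivalently, `numDocs` equals `maxDoc`), there is nothing for an optimize to reclaim.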

"...storage increase would be approximately 200,000 * 19 = 3.8M bytes
= 3.6MB rather than the 7.5GB..."

Actually I'd expect it to only be half that  (1.9M). Stored fields are
compressed on disk and we usually see about a 2:1 compression ratio.
There'll be a little bit of fudge for metadata, but not enough to
measure probably.
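
The back-of-the-envelope numbers above can be reproduced with a few lines of shell arithmetic. This sketch uses only the figures quoted in the thread and assumes one byte per character and the rule-of-thumb 2:1 compression ratio; the variable names are illustrative.

```shell
#!/bin/sh
docs_with_value=200000   # rows that have a building name (figure from the original post)
avg_chars=19             # average length, assuming one byte per character

raw_bytes=$((docs_with_value * avg_chars))
compressed_bytes=$((raw_bytes / 2))   # rule-of-thumb 2:1 stored-field compression

echo "raw:        $raw_bytes bytes"
echo "compressed: $compressed_bytes bytes"

# Convert the raw estimate to MiB for comparison with on-disk index sizes.
awk -v b="$raw_bytes" 'BEGIN { printf "raw MiB:    %.1f\n", b / 1048576 }'
```

Either way the expected growth is a few megabytes, three orders of magnitude below the observed increase.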

So yes, this is totally weird. I think you'll also find that docValues
is set to true by default. This _still_ shouldn't be adding that much
to this index, but if you turn docValues off for that field what
happens?

Stored data is held in your *.fdt and *.fdx files. What's the total
index space used in your index by these two extensions with and
without your field?

*.dvd files contain the docValues data. Again, what's the before/after
size of all these files with and without your field?

These are two specific places to look, but in general I'm asking what
the total size is by extension in your index directory with and
without your field, on the guess that one extension will be massively
bigger. That would be totally surprising, but it'd give us a clue where
to look.

Here are the file extensions and what they contain BTW:
https://lucene.apache.org/core/7_1_0/core/org/apache/lucene/codecs/lucene70/package-summary.html
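
One way to get the per-extension totals is a small shell function run against the core's index directory. This is a sketch under assumptions: the directory path is passed in by the caller (its location depends on the installation), and the function name is invented for illustration.

```shell
#!/bin/sh
# Sum file sizes in an index directory, grouped by file extension, largest first.
# $1 is the path to the index directory (installation-specific, supplied by the caller).
size_by_extension() {
  dir="$1"
  find "$dir" -type f | while IFS= read -r f; do
    name=${f##*/}          # strip the directory part
    ext=${name##*.}        # text after the last dot
    # Files without a dot (e.g. segments_N) are grouped under "(none)".
    [ "$ext" = "$name" ] && ext="(none)"
    printf '%s %s\n' "$ext" "$(wc -c < "$f")"
  done |
  awk '{ total[$1] += $2 } END { for (e in total) printf "%12d  %s\n", total[e], e }' |
  sort -rn
}

# Usage: size_by_extension /path/to/core/data/index
```

Running it once against the index built without the field and once against the index built with it, then comparing the two listings, should show which extension accounts for the growth.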

Best,
Erick


Re: Index size increases disproportionately to size of added field when indexed=false

Chris Hostetter-3
In reply to this post by Howe, David

: We are using Solr 7.1.0 to index a database of addresses.  We have found
: that our index size increases massively when we add one extra field to
: the index, even though that field is stored and not indexed, and doesn’t

what about docValues?

: When we run an index load without the problematic field present, the
: Solr index size is 5.5GB.  When we add the field into the index, the
: size grows to 13.3GB.  The field itself is a maximum of 46 characters in
: length and on average is 19 characters. We have ~14,000,000 rows in
: total to index of which only ~200,000 have this field present at all
: (i.e. not null in database).  Given that we don’t want to index the
: field, only store it I would have thought (perhaps naively) that the
: storage increase would be approximately 200,000 * 19 = 3.8M bytes =
: 3.6MB rather than the 7.5GB we are seeing.

if the field has docValues enabled, then there will be some overhead for
every doc in the index -- even the ones that don't have a value in this
field.  (although I'd still be very surprised if it accounted for 7G)

: - The problematic field is created through the API as follows:
:
:   curl -X POST -H 'Content-type:application/json' --data-binary '{
:     "add-field":{
:       "name":"buildingName",
:       "type":"string",
:       "stored":true,
:       "indexed":false
:     }
:   }' http://localhost:8983/solr/address/schema

...that's going to cause the field to inherit any (non-overridden)
settings from the fieldType "string" -- in the 7.1 _default configset,
"string" is defined with docValues="true"

You can see *all* properties set on a field -- regardless of whether they
are set on the fieldType, or are implicit hardcoded defaults in the
implementation of the fieldType -- via the 'showDefaults=true' Schema API
option.

Consider these API examples from the techproducts demo...

$ curl 'http://localhost:8983/solr/techproducts/schema/fields/cat'
{
  "responseHeader":{
    "status":0,
    "QTime":0},
  "field":{
    "name":"cat",
    "type":"string",
    "multiValued":true,
    "indexed":true,
    "stored":true}}

$ curl 'http://localhost:8983/solr/techproducts/schema/fields/cat?showDefaults=true'
{
  "responseHeader":{
    "status":0,
    "QTime":0},
  "field":{
    "name":"cat",
    "type":"string",
    "indexed":true,
    "stored":true,
    "docValues":false,
    "termVectors":false,
    "termPositions":false,
    "termOffsets":false,
    "termPayloads":false,
    "omitNorms":true,
    "omitTermFreqAndPositions":true,
    "omitPositions":false,
    "storeOffsetsWithPositions":false,
    "multiValued":true,
    "large":false,
    "sortMissingLast":true,
    "required":false,
    "tokenized":false,
    "useDocValuesAsStored":true}}
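
Applied to the field from the original post, the same check and an add-field call that overrides the inherited docValues setting might look like the sketch below. It assumes the `address` core on localhost:8983 and the `buildingName` field exactly as posted; the function names are illustrative.

```shell
#!/bin/sh
# Base URL and core name are assumptions taken from the original post.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"
CORE="${CORE:-address}"

# Print the effective settings of one field, including values inherited from its fieldType.
show_field_defaults() {
  curl -s "$SOLR_URL/$CORE/schema/fields/$1?showDefaults=true"
}

# JSON for the add-field call from the original post, with docValues switched off explicitly.
add_field_payload() {
  cat <<'EOF'
{
  "add-field":{
    "name":"buildingName",
    "type":"string",
    "stored":true,
    "indexed":false,
    "docValues":false
  }
}
EOF
}

# POST the payload to the Schema API (reads the JSON from stdin via @-).
add_building_name_field() {
  add_field_payload |
    curl -s -X POST -H 'Content-type:application/json' --data-binary @- "$SOLR_URL/$CORE/schema"
}

# Usage: add_building_name_field && show_field_defaults buildingName
```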







-Hoss
http://www.lucidworks.com/

Re: Index size increases disproportionately to size of added field when indexed=false

David Hastings
To piggy back on this, what would be the right scenarios to use
docvalues='true'?


RE: Index size increases disproportionately to size of added field when indexed=false

Howe, David
In reply to this post by Alessandro Benedetti

Hi Alessandro,

The docker image is like a disk image of the entire server, so it includes the operating system, the Solr installation and the data.  Because we run in the cloud and our index isn't that big, this is an easy and fast way for us to scale our Solr cluster without having to configure Solr clusters, replication etc.  When we create a new server and "run" the docker image, the server comes up all ready to go, with Solr installed and the data already in the index.

I will check out the different file extensions and how much space they are using.

Thanks,

David


RE: Index size increases disproportionately to size of added field when indexed=false

Howe, David
In reply to this post by Erick Erickson

Hi Erick,

Thanks for responding.  You are correct that we don't have any deleted docs.  When we want to re-index (once a fortnight), we build a brand new installation of Solr from scratch and re-import the new data into an empty index.

I will try setting docValues to false and see if that makes a difference.  It sounds like we shouldn't have it on anyway, as we only ever want to be able to retrieve this field.  In what situation would it make sense to have indexed=false and docValues=true?

I will re-index and get a sizing for all of the different file extensions both with and without the problematic field.

Regards,

David


RE: Index size increases disproportionately to size of added field when indexed=false

Howe, David
In reply to this post by Chris Hostetter-3

Thanks Hoss.  I will try setting docValues to false, as we only ever want to be able to retrieve the value of this field.

Regards,

David


RE: Index size increases disproportionately to size of added field when indexed=false

Howe, David

I have set docValues=false on all of the string fields in our index that have indexed=false and stored=true.  This gave a small improvement in the index size from 13.3GB to 12.82GB.

I have also tried running an optimize, which then reduced the index to 12.6GB.

Next step is to dump the sizes of the Solr index files for the index version that is the correct size and the version that has the large size.

Regards,

David



Re: Index size increases disproportionately to size of added field when indexed=false

Pratik Patel
I had a similar issue with index size after upgrading to version 6.4.1 from
5.x. The issue for me was that the field which caused the index size to
increase disproportionately had a field type ("text_general") for which
the default value of omitNorms was not true. Setting omitNorms=true
explicitly on the field fixed the problem. The link to my related question
follows. You can check the value of omitNorms for your fields to see whether
this applies in your case.
http://search-lucene.com/m/Solr/eHNlagIB7209f1w1?subj=Fwd+Solr+dynamic+field+blowing+up+the+index+size
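
To check omitNorms (and docValues) for every field in one pass, the Schema API field listing can be requested with showDefaults=true and filtered. A minimal sketch, assuming the `address` core on localhost:8983 from the original post and an indented JSON response; the helper names are illustrative.

```shell
#!/bin/sh
# Base URL and core name are assumptions taken from the original post.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"
CORE="${CORE:-address}"

# URL that lists all explicitly defined fields with their effective settings.
fields_url() {
  printf '%s/%s/schema/fields?showDefaults=true&indent=on\n' "$SOLR_URL" "$CORE"
}

# Keep only the lines of interest from an indented field listing on stdin.
filter_field_settings() {
  grep -E '"(name|type|omitNorms|docValues)"'
}

# Fetch the listing and print name, type, omitNorms and docValues for each field.
show_field_settings() {
  curl -s "$(fields_url)" | filter_field_settings
}

# Usage: show_field_settings
```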


Re: Index size increases disproportionately to size of added field when indexed=false

Erick Erickson
Pratik may have jumped right to the difference. We'd have gotten there
eventually by looking at file extensions, but just checking his
recommendation would be the first thing to do!

bq:  what would be the right scenarios to use docvalues='true'?

Whenever you want to facet, group or sort on the field. This _will_
increase the index size on disk, but it's almost always a good
tradeoff, here's why:

To facet, group or sort you need to "uninvert" the field. If you have
docValues=false, this uninversion is done at run-time into Java's heap.
If you have docValues=true, the uninversion is done at _index_ time
and the result stored on disk. Now when it's required, it can be
loaded in from disk efficiently (essentially de-serialized) and is
held in OS memory due to the magic of MMapDirectory, see:
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
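
To make "uninverting" concrete, here is a toy sketch in Python. It is not how Lucene implements it; the city values and doc ids are made up. It only shows the change of shape from term -> docs (what the inverted index holds) to doc -> values (what sorting and faceting need).

```python
from collections import defaultdict

# Toy inverted index: term -> list of doc ids containing that term.
# The values and doc ids are invented for illustration.
inverted = {
    "Melbourne": [0, 2],
    "Sydney": [1],
    "Perth": [3],
}

def uninvert(inverted_index):
    """Turn term -> docs into doc -> terms (a column-style view)."""
    by_doc = defaultdict(list)
    for term, doc_ids in inverted_index.items():
        for doc_id in doc_ids:
            by_doc[doc_id].append(term)
    return dict(by_doc)

column = uninvert(inverted)

# With the column view, sorting docs by field value is a direct lookup per doc.
docs_sorted_by_city = sorted(column, key=lambda doc_id: column[doc_id])
print(docs_sorted_by_city)
```

With docValues=false this reshaping happens on the heap at query time; with docValues=true the equivalent structure is written once at index time.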

bq:  In what situation would it make sense to have indexed=false and
docValues=true?

When you want to return _only_ fields that have docValues=true. If you
return fields with stored=true and docValues=false, Solr/Lucene has to
1> read the stored values from disk (minimum 16K block)
2> decompress it
3> extract the field

With docValues, since they're only simple field types, all that you
have to do is read the value from the docValues structure, which is
much more efficient. HOWEVER, there are two caveats:
1> The entire docValues field will be MMapped, so there's a time/space tradeoff.
2> docValues are stored in a sorted_set. This is relevant for
multiValued field because:
2a> values are returned in sorted order, not the order they were in the document
2b> identical values are collapsed.

So if the input values for a particular doc were 4, 3, 6, 4, 5, 2, 6,
5, 6, 5, 4, 3, 2 you'd get back 2, 3, 4, 5, 6
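
The same collapse can be reproduced in a couple of lines of Python. This only mimics the visible effect (sorted, de-duplicated values), not the on-disk docValues format.

```python
# Values in the order they appeared in the source document.
input_values = [4, 3, 6, 4, 5, 2, 6, 5, 6, 5, 4, 3, 2]

# A sorted set drops duplicates and forgets the original order.
returned = sorted(set(input_values))
print(returned)  # [2, 3, 4, 5, 6]
```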

If you can live with those caveats, then returning field values would
involve much less work (both I/O and CPU), especially in
high-throughput situations. NOTE: there are a couple of JIRAs IIRC
that have to do with not storing the <uniqueKey> though.
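
As a rough sketch of what such a field definition might look like, the snippet below builds a Schema API "add-field" body for a field that is neither indexed nor stored and is returned from docValues. The field name "suburb_dv" is hypothetical, and whether useDocValuesAsStored needs to be set explicitly depends on your schema version, so treat this as a starting point rather than a recommended configuration.

```python
import json

def docvalues_only_field(name, field_type="string"):
    """Build a Schema API 'add-field' body for a field served from docValues."""
    return {
        "add-field": {
            "name": name,            # hypothetical field name
            "type": field_type,
            "indexed": False,        # not searchable
            "stored": False,         # no stored-fields entry to decompress
            "docValues": True,       # column-style storage
            "useDocValuesAsStored": True,  # allow returning it in results
        }
    }

payload = json.dumps(docvalues_only_field("suburb_dv"), indent=2)
print(payload)
```

The printed JSON could be POSTed to the core's /schema endpoint in the same way as the curl command at the start of this thread.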

Best,
Erick

On Wed, Feb 14, 2018 at 7:01 AM, Pratik Patel <[hidden email]> wrote:

> I had a similar issue with index size after upgrading to version 6.4.1 from
> 5.x. The issue for me was that the field which caused index size to be
> increased disproportionately had a field type("text_general") for which
> default value of omitNorms was not true. Turning it on explicitly on field
> fixed the problem. Following is the link to my related question.  You can
> verify value of omitNorms for your fields to check whether this is
> applicable in your case or not.
> http://search-lucene.com/m/Solr/eHNlagIB7209f1w1?subj=Fwd+Solr+dynamic+field+blowing+up+the+index+size
>
> On Tue, Feb 13, 2018 at 8:48 PM, Howe, David <[hidden email]>
> wrote:
>
>>
>> I have set docValues=false on all of the string fields in our index that
>> have indexed=false and stored=true.  This gave a small improvement in the
>> index size from 13.3GB to 12.82GB.
>>
>> I have also tried running an optimize, which then reduced the index to
>> 12.6GB.
>>
>> Next step is to dump the sizes of the Solr index files for the index
>> version that is the correct size and the version that has the large size.
>>
>> Regards,
>>
>> David
>>
>> -----Original Message-----
>> From: Howe, David [mailto:[hidden email]]
>> Sent: Wednesday, 14 February 2018 7:26 AM
>> To: [hidden email]
>> Subject: RE: Index size increases disproportionately to size of added
>> field when indexed=false
>>
>>
>> Thanks Hoss.  I will try setting docValues to false, as we only ever want
>> to be able to retrieve the value of this field.
>>
>> Regards,
>>
>> David

Re: Index size increases disproportionately to size of added field when indexed=false

Alessandro Benedetti
In reply to this post by Pratik Patel
Hi Pratik,
how is it possible that the norms for a single field alone caused such
a massive increase in index size in your case?

I think in your case the field type was used by multiple fields, but
it still looks suspicious to me:
norms shouldn't be that big.
If I remember correctly, in old versions of Solr, before index-time
boosts were dropped, norms contained both an approximation of the field
length and the index-time boost.
According to your mailing list post, you went from 10 GB to 300 GB.
It can't be just the norms. Are you sure you didn't hit a bug?

Regards



-----
---------------
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html

Re: Index size increases disproportionately to size of added field when indexed=false

Pratik Patel
You are right, in my case this field type was applied to many text fields.
These include many copy fields and dynamic fields as well. In my case,
just specifying omitNorms=true for field type "text_general" fixed the
issue. I didn't do anything else, nor did I hit any other bug.


RE: Index size increases disproportionately to size of added field when indexed=false

Howe, David
In reply to this post by Howe, David

I have re-run both scenarios and captured the total size of each type of index file.  The MB (1) column is the baseline scenario, which has the smaller index and acceptable performance.  The MB (2) column is the scenario after I added the extra field to the index.

Ext     MB (1)          MB (2)
.cfe    0.00            0.01
.cfs    335.01          3612.09
.dii    0.00            0.00
.dim    324.38          319.07
.doc    1094.68         2767.53
.dvd    1211.84         625.44
.dvm    0.14            0.08
.fdt    1633.21         5387.92
.fdx    2.12            1.44
.fnm    0.11            0.12
.loc    0.00            0.00
.nvd    127.84          110.67
.nvm    0.01            0.01
.pos    809.23          1272.70
.si     0.02            0.03
.tim    137.94          156.82
.tip    2.52            3.04
Total   5679.06         14256.98
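
To see at a glance which extensions account for the growth, the table can be turned into per-extension deltas. The numbers below are copied from the table; the helper name is just for illustration.

```python
# Sizes in MB per file extension, copied from the table above:
# (baseline, with the extra field).
sizes_mb = {
    ".cfe": (0.00, 0.01),
    ".cfs": (335.01, 3612.09),
    ".dii": (0.00, 0.00),
    ".dim": (324.38, 319.07),
    ".doc": (1094.68, 2767.53),
    ".dvd": (1211.84, 625.44),
    ".dvm": (0.14, 0.08),
    ".fdt": (1633.21, 5387.92),
    ".fdx": (2.12, 1.44),
    ".fnm": (0.11, 0.12),
    ".loc": (0.00, 0.00),
    ".nvd": (127.84, 110.67),
    ".nvm": (0.01, 0.01),
    ".pos": (809.23, 1272.70),
    ".si": (0.02, 0.03),
    ".tim": (137.94, 156.82),
    ".tip": (2.52, 3.04),
}

def growth(sizes):
    """Return (extension, delta_mb) pairs, largest increase first."""
    deltas = [(ext, round(after - before, 2)) for ext, (before, after) in sizes.items()]
    return sorted(deltas, key=lambda pair: pair[1], reverse=True)

for ext, delta in growth(sizes_mb):
    print(f"{ext:5s} {delta:+10.2f} MB")

total_before = sum(before for before, _ in sizes_mb.values())
total_after = sum(after for _, after in sizes_mb.values())
print(f"total {total_after - total_before:+10.2f} MB")
```

Sorted this way, .fdt, .cfs and .doc carry most of the increase, while .dvd is the only large file type that shrinks.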


David Howe
Java Domain Architect
Postal Systems
Level 16, 111 Bourke Street Melbourne VIC 3000

T  0391067904

M  0424036591

E  [hidden email]

W  auspost.com.au
W  startrack.com.au


RE: Index size increases disproportionately to size of added field when indexed=false

Alessandro Benedetti
@Pratik: it would have been worth investigating further. I understand that
omitting norms solved your issue, but if you had needed norms, it makes no
sense for them to make your index grow by a factor of 30. You must have hit
a nasty bug if it was just the norms.

@Howe :

Compound File (.cfs, .cfe): an optional "virtual" file consisting of all the
other index files, for systems that frequently run out of file handles.

Frequencies (.doc): contains the list of docs which contain each term, along
with the frequency.

Field Data (.fdt): the stored fields for documents.

Positions (.pos): stores position information about where a term occurs in
the index.

Term Index (.tip): the index into the Term Dictionary.

So, David, can you confirm that those two indexes have:

1) same number of documents
2) identical documents ( + 1 new field each not indexed)
3) same number of deleted documents
4) they both were born from scratch ( an empty index)

The matter is still suspicious:
- The .cfs growth seems to point to some sort of malfunction during
indexing/committing, possibly related to the OS. How were you committing?

- .doc, .pos, .tip -> these shouldn't change. Assuming both indexes are
optimised and you are only adding a non-indexed field, those data structures
shouldn't be affected.

- The stored content has also grown far too much.

Can you send us the full configuration for the new field?
You don't want norms, positions or frequencies for it.
But even if they are the issue, you may have found a real edge case,
because even with all of them enabled you shouldn't incur such a penalty for
just one additional tiny field.



-----
---------------
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html

RE: Index size increases disproportionately to size of added field when indexed=false

Howe, David

Hi Alessandro,

Some interesting testing today that seems to have gotten me closer to what the issue is.  When I run the version of the index that is working correctly against my database table that has the extra field in it, the index suddenly increases in size.  This is even though the data importer is running the same SELECT as before (which doesn't include the extra column) and loads the same number of rows.

After scratching my head for a bit and browsing through both versions of the table I am loading from (with and without the extra field), I noticed that the natural ordering of the tables is different.  These tables are "staging" tables that I populate with another set of queries and inserts to get the data into a format that is easy to ingest into Solr.  When I add the extra field to these queries, it changes the Oracle query plan, as the field is contained in a different table that I need to join to.  As I don't specify an "ORDER BY" on the query (I didn't think it would make a difference, and it would slow the query down), Oracle is free to choose how it orders the result set.  Adding the extra field changes that natural ordering, which affects the order things go into my staging table.  As I don't specify an "ORDER BY" when I select things out of the staging table either, the data in the scenario that works is loaded in a different order from the scenario that doesn't.

I am currently running full loads to verify this under each scenario, as I have now forced the data in the scenario that doesn't work to be in the same order as the scenario that does.  Will see how this load goes overnight.

This leads to the question of what difference does it make to Solr what order I load the data in?
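
One possible, unconfirmed way load order could matter is block compression: stored fields are read and decompressed in blocks (the "16K block" mentioned earlier in the thread), and a block of similar neighbouring rows compresses better than a block of unrelated ones. The sketch below only demonstrates that general effect on synthetic address-like rows. It uses zlib as a stand-in compressor and made-up street names, and it does not establish that this is what happened to the index discussed here.

```python
import random
import zlib

BLOCK_SIZE = 16 * 1024  # illustrative block size only

def compressed_size(records, block_size=BLOCK_SIZE):
    """Compress records in independent fixed-size blocks and sum the sizes."""
    data = "\n".join(records).encode("utf-8")
    return sum(
        len(zlib.compress(data[i:i + block_size]))
        for i in range(0, len(data), block_size)
    )

rng = random.Random(42)  # fixed seed so the run is repeatable
streets = ["".join(rng.choices("abcdefghijklmnopqrstuvwxyz", k=10)) for _ in range(200)]

# Synthetic rows grouped by street, so similar rows sit next to each other.
grouped = [f"{n} {street} Street, VIC 3000" for street in streets for n in range(1, 51)]

# The same rows in random order.
shuffled = grouped[:]
rng.shuffle(shuffled)

print("grouped :", compressed_size(grouped))
print("shuffled:", compressed_size(shuffled))
```

If the grouped load comes out markedly smaller on real data too, comparing the .fdt sizes of two loads that differ only in ORDER BY would be a cheap way to test the idea.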

I also noticed that the .cfs file is quite large in the second scenario, even though this is supposed to be disabled by default in Solr.  I checked my Solr config and there is no override of the default.

In answer to your questions:

1) same number of documents - YES ~14,000,000 documents
2) identical documents ( + 1 new field each not indexed) - YES, the second scenario has one extra field that is stored but not indexed
3) same number of deleted documents - YES, there are zero deleted documents in both scenarios
4) they both were born from scratch ( an empty index) - YES, both start from a brand new virtual server with a brand new installation of Solr

I am using the default auto commit, which I think is 15000.

Thanks again for your assistance.

Regards,

David


Re: Index size increases disproportionately to size of added field when indexed=false

Pratik Patel
@Alessandro I will see if I can reproduce the same issue just by turning
off omitNorms on field type. I'll open another mail thread if required.
Thanks.


Re: Index size increases disproportionately to size of added field when indexed=false

Erick Erickson
David:

Rats, the cfs files make everything I'd hoped to understand with the
sizes ambiguous, since they conceal the underlying sizes of every other
extension. We can approach it a bit differently though. Take one
segment that's _not_ in cfs format where the total size of all files
making up that segment is near 5GB (the default max segment size) and
compare the individual files for that segment only. What I'm hoping
to find out, of course, is which extensions vary dramatically. But
let's assume for the nonce that the numbers you already have are
comparable if we ignore the .cfs files.
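
A small script can do the per-segment breakdown. This is a sketch that assumes the usual Lucene file naming, where every file of a segment starts with the segment name (for example _0.fdt or _0_Lucene50_0.doc) and files such as segments_N and write.lock belong to no segment. The function names are made up.

```python
import os
from collections import defaultdict

def parse_segment_name(filename):
    """Return the segment prefix (e.g. '_3') or None for non-segment files."""
    if not filename.startswith("_"):
        return None  # e.g. segments_N or write.lock
    cut = filename.find("_", 1)
    if cut == -1:
        cut = filename.find(".")
    return filename if cut == -1 else filename[:cut]

def sizes_by_segment(index_dir):
    """Map segment -> {extension: total bytes} for the files in index_dir."""
    result = defaultdict(lambda: defaultdict(int))
    for entry in os.scandir(index_dir):
        segment = parse_segment_name(entry.name)
        if segment is None or not entry.is_file():
            continue
        ext = os.path.splitext(entry.name)[1]
        result[segment][ext] += entry.stat().st_size
    return {seg: dict(exts) for seg, exts in result.items()}

def non_compound_segments(sizes):
    """Keep only segments that have no .cfs file."""
    return {seg: exts for seg, exts in sizes.items() if ".cfs" not in exts}
```

Running sizes_by_segment on each index directory and picking one large entry from non_compound_segments in each gives the like-for-like comparison described above.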

.doc    1094.68         2767.53 - term frequencies
.fdt    1633.21         5387.92 - stored data
.pos    809.23          1272.70 - position information

So the file difference (if borne out) indicates the following:

- .doc: you have more documents or more terms or different options on
your terms [1]
- .fdt: you're storing more fields than you used to. [1]
- .pos: you have more docs or more terms or have position information
turned on where you didn't before. [1]

[1] or lots of deleted docs that haven't been merged away. This
information should be on the admin page for any particular core. I
think this unlikely, but who knows? NOTE, just because you get 14M from
querying *:* does _not_ say anything about the deleted docs, which
take up space. This is highly unlikely to be your problem, but let's
eliminate the easy stuff ;)
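
The check itself is simple arithmetic on the two counters shown on the core's admin page (Num Docs and Max Doc). The figures below are hypothetical and only show the calculation.

```python
def deleted_docs(max_doc, num_docs):
    """Deleted-but-not-yet-merged docs that still occupy space in the index."""
    return max_doc - num_docs

def deleted_ratio(max_doc, num_docs):
    """Fraction of doc slots taken up by deleted docs (0.0 for an empty index)."""
    return deleted_docs(max_doc, num_docs) / max_doc if max_doc else 0.0

# Hypothetical figures: 14M live docs, 15.4M doc slots across all segments.
print(deleted_docs(15_400_000, 14_000_000))
print(round(deleted_ratio(15_400_000, 14_000_000), 3))
```

If both indexes report the same Max Doc and Num Docs, deleted docs can be ruled out.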

Where I'd go from here after checking that these ratios are true for a
single like-sized segment in both cases....

1> the LukeRequestHandler can tell you information about exactly how
the index is defined, and using Luke itself can provide you a much
more detailed look at what's actually _in_ your index. You could also
have Luke reconstruct the same doc from your index in each case and
compare. Perhaps your SQL is doing something really unexpected. This
_should_ show you the realized meta-data for each field and let you
pinpoint any different options that have been enabled.

2> compare your Oracle intermediate tables, are they _really_
identical? The ordering shouldn't make any difference at all to Solr
assuming the same docs are being indexed (plus any expected delta).
There's an edge case I can imagine if you hit a "perfect storm" and
one version has a lot more deleted docs than the other that's possibly
the result of reordering, but that's unlikely. The edge case I'm
imagining would be easily verifiable by the two versions having a
radically different number of deleted docs....
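
For point 1>, a sketch of building the LukeRequestHandler URLs to compare between the two installations. It assumes the handler is at its usual /admin/luke path, and reuses the core name "address" and field "buildingName" from the start of this thread. The helper name is made up.

```python
from urllib.parse import urlencode

def luke_url(base_url, core, field=None):
    """Build a LukeRequestHandler URL for one field, or for the whole schema."""
    params = {"wt": "json"}
    if field is None:
        params["show"] = "schema"   # realized schema for every field
    else:
        params["fl"] = field        # details for a single field
    return f"{base_url}/{core}/admin/luke?{urlencode(params)}"

print(luke_url("http://localhost:8983/solr", "address", "buildingName"))
print(luke_url("http://localhost:8983/solr", "address"))
```

Fetching the same URL from both servers and diffing the responses should show any field whose realized options differ.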

Best,
Erick



