Sort index by size


Sort index by size

Srinivas Kashyap-2
Hello,

I have a Solr core with some 20 fields in it (all are stored and indexed). For one environment, the number of documents is around 0.29 million. When I run a full import through DIH, indexing completes successfully, but it occupies around 5 GB of disk space. Is there a way to check which documents are consuming the most space? Put another way, can I sort the index based on size?

Thanks and Regards,
Srinivas Kashyap

  ________________________________
DISCLAIMER:
E-mails and attachments from Bamboo Rose, LLC are confidential.
If you are not the intended recipient, please notify the sender immediately by replying to the e-mail, and then delete it without making copies or using it in any way.
No representation is made that this email or any attachments are free of viruses. Virus scanning is recommended and is the responsibility of the recipient.

Re: Sort index by size

Shawn Heisey-2

I am not aware of any way to do that.  There might be one that I don't
know about, but if there were a way, I would expect to have come across
it before.

It is not very likely that the large index size is due to a single
document or a handful of documents.  It is more likely that most
documents are relatively large.  I could be wrong about that, though.

If you have 290000 documents (which is how I interpreted 0.29 million)
and the total index size is about 5 GB, then the average size per
document in the index is about 18 kilobytes.  This is in my view pretty
large.  Typically I think that most documents are 1-2 kilobytes.
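While there's no per-document size report, you can get a rough sense of where the bytes go by grouping the Lucene segment files in the core's data/index directory by extension: stored fields live in .fdt files, term dictionaries in .tim, doc values in .dvd, and so on. A minimal sketch (the example index path is an assumption; point it at your own core):

```shell
# Sum Lucene segment file sizes by extension to see which part of the
# index (stored fields, terms, doc values, ...) dominates on disk.
index_size_by_ext() {
  # $1: path to the core's data/index directory
  ls -l "$1" | awk '
    NF >= 9 {
      n = split($NF, p, ".")
      ext = (n > 1) ? p[n] : "(none)"
      bytes[ext] += $5
    }
    END { for (e in bytes) printf "%-6s %12d bytes\n", e, bytes[e] }
  ' | sort -k2 -rn
}

# Example (path is hypothetical; adjust for your install):
#   index_size_by_ext /var/solr/data/mycore/data/index
```

If .fdt dominates, the space is going to stored fields rather than the inverted index, which points at the stored="true" settings.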

Can we get your Solr version, a copy of your schema, and exactly what
Solr returns in search results for a typically sized document?  You'll
need to use a paste website or a file-sharing website ... if you try to
attach these things to a message, the mailing list will most likely eat
them, and we'll never see them. If you need to redact the information in
search results ... please do it in a way that we can still see the exact
size of the text -- don't just remove information, replace it with
information that's the same length.

Thanks,
Shawn


Re: Sort index by size

David Hastings
Also, a full import, assuming the documents were already indexed, will just
double your index size until a merge/optimize is run: each reindexed document
only marks the old copy as deleted, which doesn't reclaim any space, and then
adds a completely new document on top of it.
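To actually reclaim that space after a full reimport, you can trigger an explicit optimize through the core's update handler. A minimal sketch (host, port, and core name here are assumptions; substitute your own):

```shell
# Build the update-handler URL that triggers an explicit optimize: a
# forced merge that drops deleted documents and reclaims their disk space.
optimize_url() {
  # $1: Solr base URL, $2: core name
  printf '%s/solr/%s/update?optimize=true' "$1" "$2"
}

# Then run it with curl, e.g.:
#   curl "$(optimize_url http://localhost:8983 mycore)"
# Note: optimizing rewrites the whole index, so it is I/O-heavy and
# temporarily needs extra disk space while the merged segment is written.
```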


Re: Sort index by size

Walter Underwood
Worst case is 3X. That happens when there are no merges until the commit.

With tlogs, worst case is more than that. I’ve seen humongous tlogs with a batch load and no hard commit until the end. If you do that several times, then you have a few old humongous tlogs. Bleah.

wunder
Walter Underwood
[hidden email]
http://observer.wunderwood.org/  (my blog)
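One common way to keep tlogs bounded during a batch load is a periodic hard commit that doesn't open a new searcher, configured in solrconfig.xml. A sketch (the interval values are illustrative, not recommendations):

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Hard commit: flushes segments to disk and rolls over the
       transaction log, without the cost of opening a new searcher. -->
  <autoCommit>
    <maxTime>60000</maxTime>   <!-- at most 60s between hard commits -->
    <openSearcher>false</openSearcher>
  </autoCommit>
</updateHandler>
```

With this in place, no single tlog grows to cover the whole load, and old tlogs become eligible for removal as indexing proceeds.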



Re: Sort index by size

Edward Ribeiro
One more tidbit: are you sure you need all 20 fields to be both indexed
and stored? For that matter, do you need all 20 fields at all?

See this blog post, for example:
https://www.garysieling.com/blog/tuning-solr-lucene-disk-usage
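The split matters in schema.xml: indexed="true" costs term/postings space, while stored="true" costs stored-field space. A sketch with hypothetical field names:

```xml
<!-- Searched or filtered on, but never returned in results: drop stored -->
<field name="description_t" type="text_general" indexed="true" stored="false"/>
<!-- Returned in results, but never searched: drop indexed -->
<field name="internal_note_s" type="string" indexed="false" stored="true"/>
```

Either change requires a full reindex to take effect.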


Re: Sort index by size

Gus Heck
Just as a sanity check: is this index getting replicated many times, or
otherwise scaled up? 5 GB is about $3.50/mo of disk space on AWS, and it
should all fit in RAM on any decent-sized server (i.e. any server that
looks like half or a quarter of a decent laptop).

As a question it's interesting, but it doesn't yet sound like a problem
worth sweating.


FW: Sort index by size

Srinivas Kashyap-2
In reply to this post by Shawn Heisey-2
Hi Shawn and everyone who replied to the thread,

The Solr version is 5.2.1, and each document returns multi-valued fields for the majority of the fields defined in schema.xml. I'm in the process of pasting the contents of my files to a paste website and will update soon.

Thanks,
Srinivas


