Indexed Data Size

Indexed Data Size

Moyer, Brett
In data/solr/<shard_replica>/data/index on the filesystem, we have files that go back 1 year. I don’t understand why, and I doubt they are in use. They have extensions like fdx, cfe, doc, pos, tip, dvm, etc. Some of these are very large and are running us out of server space. Our search indexes themselves are not large; in total we might have 50k documents. How can I reduce this /data/solr space? Is this what the Solr Optimize command is for? Thanks!

Brett

*************************************************************************
This e-mail may contain confidential or privileged information.
If you are not the intended recipient, please notify the sender immediately and then delete it.

TIAA
*************************************************************************

Re: Indexed Data Size

Erick Erickson
On the surface, this makes no sense at all, so there’s something I don’t understand here ;).

How often do you update your index? Having files from a long time ago is perfectly reasonable if you’re not updating regularly.

But your statement that some of these are huge for just a 50K document index is odd unless they’re _huge_ documents.

I wouldn’t optimize unless you’re on Solr 7.5+, since on earlier versions that’ll create a single segment, see:
https://lucidworks.com/post/segment-merging-deleted-documents-optimize-may-bad/
and
https://lucidworks.com/post/solr-and-optimizing-your-index-take-ii/

The extensions you mentioned are perfectly reasonable. Each segment is made up of multiple files. .fdt for instance contains stored data. See: https://lucene.apache.org/core/6_6_0/core/org/apache/lucene/codecs/lucene62/package-summary.html

Can you give us a long listing of one of your index directories?

Best,
Erick
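
As a rough illustration of how such a listing could be summarized per segment, here is a minimal Python sketch. It assumes Lucene's usual file naming, where all files of one segment share a prefix such as _a (for example _a.fdt or _a_Lucene50_0.doc); the function name summarize_index and the "(other)" bucket are hypothetical and only for illustration.

```python
import os
import re
from collections import defaultdict

# Assumption: segment files start with "_" plus a lowercase alphanumeric
# segment name, followed by "." or "_". Files such as "segments_N" or
# "write.lock" do not match and are grouped under "(other)".
SEGMENT_RE = re.compile(r"^(_[0-9a-z]+)[._]")


def summarize_index(index_dir):
    """Return {segment_name: {"bytes": total_size, "files": file_count}}."""
    summary = defaultdict(lambda: {"bytes": 0, "files": 0})
    for entry in os.scandir(index_dir):
        if not entry.is_file():
            continue
        match = SEGMENT_RE.match(entry.name)
        segment = match.group(1) if match else "(other)"
        summary[segment]["bytes"] += entry.stat().st_size
        summary[segment]["files"] += 1
    return dict(summary)
```

Calling summarize_index on a path like data/solr/<shard_replica>/data/index and sorting the result by "bytes" would show which segments account for most of the disk usage.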


Re: Indexed Data Size

Shawn Heisey-2
In reply to this post by Moyer, Brett
On 8/8/2019 3:17 PM, Moyer, Brett wrote:
> In data/solr/<shard_replica>/data/index on the filesystem, we have files that go back 1 year. I don’t understand why, and I doubt they are in use. They have extensions like fdx, cfe, doc, pos, tip, dvm, etc. Some of these are very large and are running us out of server space. Our search indexes themselves are not large; in total we might have 50k documents. How can I reduce this /data/solr space? Is this what the Solr Optimize command is for? Thanks!

+1 to everything Erick said.

Another piece of information that could be helpful is a screenshot of
the core overview in the admin UI.  It would look something like this:

https://www.dropbox.com/s/mbh6ll1v8ghloko/solr-core-overview.png?dl=0

To get that, just go to the admin UI and choose one of the big cores
from the core dropdown.  That should put you on the overview tab for the
core.  Then grab a screenshot and use a file sharing site to share it.

Thanks,
Shawn

RE: Indexed Data Size

Moyer, Brett
In reply to this post by Erick Erickson
Thanks! We update each index nightly. We don’t clear it; we bring in new documents and deltas and delete expired/404 pages. All our data are basically webpages, so none are very large. There are some PDFs, but again not too large. We are running Solr 7.5. Hopefully you can access the links.

https://www.dropbox.com/s/lzd6hkoikhagujs/CoreOne.png?dl=0
https://www.dropbox.com/s/ae6rayb38q39u9c/CoreTwo.png?dl=0

Brett

Re: Indexed Data Size

Shawn Heisey-2
On 8/9/2019 6:12 AM, Moyer, Brett wrote:
> Thanks! We update each index nightly. We don’t clear it; we bring in new documents and deltas and delete expired/404 pages. All our data are basically webpages, so none are very large. There are some PDFs, but again not too large. We are running Solr 7.5. Hopefully you can access the links.

Solr is saying that the entire size of the index directory is 95 MB for
one of those indexes and the other is 30 MB.  Those sound to me like
very small indexes, not very large like you indicated.  You were saying
that the large files were in data/index, and did not mention anything
about index.<timestamp> directories.

If you do have a bunch of index.<timestamp> directories in the "Data"
directory mentioned on the Core overview page, you can safely delete all
of the index and/or index.* directories under that directory EXCEPT the
one that is indicated as the "Index" directory.  If you delete that one,
you're deleting the actual live index ... and since you're not on
Windows, the OS will let you delete it without complaining.
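
A cautious way to apply this is to list the candidate directories first and delete nothing automatically. The following is a minimal Python sketch under that assumption; the function name stale_index_dirs is hypothetical, and the live directory must be supplied by hand from the "Index" value on the Core overview page.

```python
import os


def stale_index_dirs(data_dir, live_index_dir):
    """List index / index.* directories under data_dir except the live one.

    live_index_dir is the path shown as "Index" on the core overview page.
    Nothing is deleted here; the result is meant to be reviewed by a person.
    """
    live = os.path.realpath(live_index_dir)
    stale = []
    for entry in sorted(os.scandir(data_dir), key=lambda e: e.name):
        is_index_name = entry.name == "index" or entry.name.startswith("index.")
        # Plain files whose names start with "index." are skipped by is_dir().
        if entry.is_dir() and is_index_name and os.path.realpath(entry.path) != live:
            stale.append(entry.path)
    return stale
```

Only after checking the returned list against the overview page would it make sense to remove anything.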

The directory locations are cut off on both screenshots, so I can't
confirm anything there.

The larger core has about 2000 deleted docs and the smaller one has 40.
Doing an optimize will not save much disk space or take very long.

Thanks,
Shawn
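
The point that an optimize would not save much space can be put into rough numbers. The sketch below is only a back-of-the-envelope estimate: it assumes deleted documents are about the same size as live ones, pairs the 95 MB figure with the 2000 deleted docs (an assumption), and uses a hypothetical live document count of 25,000 that does not come from the thread.

```python
def estimated_reclaimable_mb(index_size_mb, live_docs, deleted_docs):
    """Rough guess at the space held by deleted docs.

    Assumes deleted documents are about as large as live ones,
    which is only an approximation.
    """
    max_doc = live_docs + deleted_docs
    if max_doc == 0:
        return 0.0
    return index_size_mb * deleted_docs / max_doc


# 95 MB and 2000 deleted docs are taken from the thread; the live
# document count of 25,000 is a hypothetical value for illustration.
estimate = estimated_reclaimable_mb(95, 25_000, 2_000)
print(f"roughly {estimate:.1f} MB might be reclaimed")
```

Under those assumptions the estimate is on the order of 7 MB, which is negligible next to files of several gigabytes.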

RE: Indexed Data Size

Moyer, Brett
Correct, our indexes are small document-wise, but for some reason we have a year's worth of files in the data/solr folders. There are no index.<timestamp> files.

The biggest is /data/solr/system_logs_shard1_replica_n1/data/index, with files that have the extensions I stated previously. Each is 5 GB and there are a few hundred. They date back to the last 3 months. I don’t understand why there are so many files with such small indexes. I'm not sure how to clean them up.


Re: Indexed Data Size

Shawn Heisey-2
On 8/9/2019 12:17 PM, Moyer, Brett wrote:
> The biggest is /data/solr/system_logs_shard1_replica_n1/data/index, with files that have the extensions I stated previously. Each is 5 GB and there are a few hundred. They date back to the last 3 months. I don’t understand why there are so many files with such small indexes. I'm not sure how to clean them up.

Can you get a screenshot of the core overview for that particular core?
Solr should correctly calculate the size on the overview based on what
files are actually in the index directory.

Thanks,
Shawn

RE: Indexed Data Size

Moyer, Brett
Turns out this is due to a job that indexes logs. We were able to clear some with another job. We are working through the value of these indexed logs. Thanks for all your help!

Brett Moyer
Manager, Sr. Technical Lead | TFS Technology
  Public Production Support
  Digital Search & Discovery

8625 Andrew Carnegie Blvd | 4th floor
Charlotte, NC 28263
Tel: 704.988.4508
Fax: 704.988.4907
[hidden email]


Re: Indexed Data Size

Greg Harris-2
Brett, it’s probably because you hit the 5 GB default maximum segment size in
Solr. Once a segment reaches that size, a huge number of the docs within it
must be marked as deleted before it can be merged. So even if a large share
of the docs within the segment are deleted, the segment is still there,
happily taking up space. That could theoretically be a reason for an
optimize, but you’d want to specify maxSegments with the goal of not merging
to a single segment for the entire index. Ideally you should just keep as
many of the logs as you actually use (which is hopefully fewer than what you
are keeping). Since the segments will be somewhat time-based, they would
eventually disappear or merge over time, hopefully negating any reason to
consider an optimize.

Greg
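
Both suggestions can be sketched as requests to Solr's update handler. The Python below only builds the URL and the JSON body and sends nothing; the collection name system_logs, the date field timestamp_dt, and both function names are hypothetical. It relies on the update handler's optimize and maxSegments parameters and on the JSON delete-by-query form; the guard against fewer than two segments simply encodes the advice not to merge down to a single segment.

```python
import json
from urllib.parse import urlencode


def optimize_url(base_url, collection, max_segments):
    """Build an update URL asking for an optimize down to max_segments."""
    if max_segments < 2:
        # Follows the advice above: avoid merging to a single segment.
        raise ValueError("use max_segments >= 2 to avoid a single-segment merge")
    params = urlencode({"optimize": "true", "maxSegments": max_segments})
    return f"{base_url.rstrip('/')}/{collection}/update?{params}"


def delete_older_than(date_field, days):
    """Build a JSON delete-by-query body for docs older than `days` days."""
    query = f"{date_field}:[* TO NOW-{days}DAYS]"
    return json.dumps({"delete": {"query": query}})


# Hypothetical names: adjust the host, collection and field to the real setup.
print(optimize_url("http://localhost:8983/solr", "system_logs", 10))
print(delete_older_than("timestamp_dt", 90))
```

The delete body would be posted to the same /update endpoint with a JSON content type and followed by a commit. Regularly deleting logs that are no longer needed lets normal merging reclaim the space over time, which is the preferred route described above.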
