Question on how index works - runs out of disk space!

sundar shankar
Hi All,
We have a cluster of 4 servers for the application and just one server for Solr. We have only about 2 million docs to index, and we never bothered to cluster the Solr environment because Solr was delivering adequate performance with the current setup. Of late we discovered a problem, and I am not sure what the right way to handle it would be.

We have a cron job that runs from the application and does a nightly index of data added from another enterprise application. The index job re-indexes all courses, whether or not they are already indexed. We observed that the job was starting up on all 4 servers at about the same time. All 4 servers point to the same Solr box, so the same data is apparently added to the Solr box 4 times. There is an update command for every 10,000 records fetched from the database and a commit at the end of the full job.

The surprising thing I noticed is that even though a primary key is defined in the Solr schema, the size of the data folder keeps increasing and is causing the Solr server to run out of disk space. I upgraded to version 1.3 about a month ago, and I suspect the problem started after that upgrade.

The index size for our millions of docs on a clustered dev environment used to be about 520 MB, and it is about that size the first time we index all the courses. The current size for the same number of docs (taken from the stats page) is 6.5 GB.

I am not sure what has changed, or whether there is any config change I could use. The write lock is disabled on dev with lock-type = single; I am not sure if this matters.

-Sundar


Re: Question on how index works - runs out of disk space!

Jason Rennie-2
Have you tried performing an "optimize"?  Solr doesn't seem to fully
integrate all updates into a single index until an optimize is performed.

Jason

On Wed, Sep 10, 2008 at 1:05 PM, sundar shankar <[hidden email]> wrote:

> [...]



--
Jason Rennie
Head of Machine Learning Technologies, StyleFeeder
http://www.stylefeeder.com/
Samantha's blog & pictures: http://samanthalyrarennie.blogspot.com/

RE: Question on how index works - runs out of disk space!

sundar shankar
I had an optimize earlier but removed it, as it was too grueling and very time-consuming. Is there a way to configure auto-optimize in Solr: a setting that optimizes the index after some time or after some number of records, similar to what we have for commit?



> Date: Wed, 10 Sep 2008 14:37:11 -0400
> From: [hidden email]
> To: [hidden email]
> Subject: Re: Question on how index works - runs out of disk space!
>
> [...]

RE: Question on how index works - runs out of disk space!

sundar shankar
In reply to this post by Jason Rennie-2
Optimize solved it. Thanks, Jason. I am surprised, though: why does Solr do this?




RE: Question on how index works - runs out of disk space!

hossman

: Optimize solved it. Thanks Jason. I am surprised on why solr does this?

This gets into some complicated discussions about the underlying Lucene
index format, which is discussed at a very low level in the Lucene docs...

        http://lucene.apache.org/java/2_3_2/fileformats.html

...but at a slightly higher level the issue comes from the basic nature of
an inverted index.  Even though you have a uniqueKey, and are "replacing"
an existing document, there is no easy way to reclaim the space used by
the previous version of the document in realtime -- instead a single bit
records that the old version was deleted, and the new version is added to
the end.

The space used by those deleted docs is reclaimed when "segments" get
"merged".  All segments are merged into one compact segment when you do an
optimize -- but an optimize isn't actually necessary to ensure that the
deleted docs are *eventually* purged: as documents are added, incremental
merges are constantly taking place.  How often they take place (as a
function of docs added) can be controlled with various settings in
solrconfig.xml (mergeFactor, for example).

That is the root of why you can see an index grow even though you only
"replace" existing docs without adding new docs ... it will grow and then
it will shrink again once merging happens.
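
To make the behaviour described above concrete, here is a toy Python model. It is not Lucene's code or file format, only a sketch of the idea: each batch of re-added documents flags the older copies as deleted and appends a new segment, and only a merge rewrites the segments without the deleted copies. The `Segment` and `ToyIndex` classes and the `simulate` function are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """A toy write-once segment: stored docs plus a set of deleted slots."""
    docs: list                                  # (key, payload) pairs
    deleted: set = field(default_factory=set)   # positions flagged as deleted

    def live_docs(self):
        return [d for i, d in enumerate(self.docs) if i not in self.deleted]


class ToyIndex:
    def __init__(self):
        self.segments = []

    def add_batch(self, batch):
        """Replace-by-key: flag old copies as deleted, append a new segment."""
        keys = {key for key, _ in batch}
        for seg in self.segments:
            for pos, (key, _) in enumerate(seg.docs):
                if key in keys:
                    seg.deleted.add(pos)
        self.segments.append(Segment(docs=list(batch)))

    def stored_docs(self):
        """Docs still occupying space, including the deleted ones."""
        return sum(len(seg.docs) for seg in self.segments)

    def live_count(self):
        return sum(len(seg.live_docs()) for seg in self.segments)

    def merge_all(self):
        """Rewrite every segment into one, dropping the deleted docs."""
        live = [d for seg in self.segments for d in seg.live_docs()]
        self.segments = [Segment(docs=live)]


def simulate(num_docs=1000, passes=4):
    """Index the same docs `passes` times, then merge; report sizes."""
    index = ToyIndex()
    batch = [(i, "course-%d" % i) for i in range(num_docs)]
    for _ in range(passes):
        index.add_batch(batch)
    before = (index.stored_docs(), index.live_count())
    index.merge_all()
    after = (index.stored_docs(), index.live_count())
    return before, after
```

With 1,000 documents indexed four times, `simulate(1000, 4)` returns `((4000, 1000), (1000, 1000))`: four copies of every document occupy space until the merge, while the number of live documents stays at 1,000 throughout.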

On a slightly related topic: if you really want to explicitly force some
segment merging, but a full optimize takes longer than you are willing to
wait, there is a new option in Solr 1.3 to support partial
optimization...

  <optimize maxSegments="5" />
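
One way to send that message is an HTTP POST to Solr's XML update handler. The sketch below uses only the Python standard library. The URL `http://localhost:8983/solr/update` is an assumption based on the stock example setup and will differ per install, and `optimize_message` and `send_update` are hypothetical helper names.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Assumed location of the update handler; adjust host, port and path.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"


def optimize_message(max_segments=None):
    """Build an <optimize/> message, optionally with maxSegments set."""
    attrs = {}
    if max_segments is not None:
        attrs["maxSegments"] = str(max_segments)
    return ET.tostring(ET.Element("optimize", attrs), encoding="unicode")


def send_update(message, url=SOLR_UPDATE_URL, timeout=600):
    """POST one XML update message and return the raw response body."""
    request = urllib.request.Request(
        url,
        data=message.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Usage (needs a running Solr instance at the assumed URL):
#   send_update(optimize_message(max_segments=5))
```

The generous timeout is a guess: an optimize can run for a long time on a large index, so a short client timeout may give up before Solr responds.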


-Hoss


RE: Question on how index works - runs out of disk space!

sundar shankar
That's brilliant. I am starting to wonder if there is anything at all
that you guys haven't thought about ;) Thanks, that setting should be
really useful.


> Date: Wed, 10 Sep 2008 15:26:57 -0700
> From: [hidden email]
> To: [hidden email]
> Subject: RE: Question on how index works - runs out of disk space!
>
> [...]

Re: Question on how index works - runs out of disk space!

Jason Rennie-2
In reply to this post by sundar shankar
Optimize can be a very expensive operation since it copies the entire index
to new data files.  Not sure if Solr has an auto-optimize feature, though I
doubt it would be used much.  Our policy is to run commits every few
thousand documents and run an optimize once every day or so.  These commands
are easy to issue via the SolrJ client we use.  For one of our indexes,
though, we perform all of the updates offline and run an optimize before
putting the index into production.  Hope this helps.
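
The commit-every-few-thousand-documents policy described above might look roughly like this when driving Solr's XML update format directly. Jason uses the SolrJ client; this Python sketch is a stand-in, not his code. The batch size of 5,000 and the function names are assumptions, and each yielded string would still have to be POSTed to the update handler.

```python
import xml.etree.ElementTree as ET


def add_message(docs):
    """Build one <add> update message from a list of field dicts."""
    add = ET.Element("add")
    for fields in docs:
        doc = ET.SubElement(add, "doc")
        for name, value in fields.items():
            field = ET.SubElement(doc, "field", {"name": name})
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")


def update_messages(docs, commit_every=5000, optimize_at_end=False):
    """Yield update messages: an <add> followed by <commit/> per batch."""
    batch = []
    for fields in docs:
        batch.append(fields)
        if len(batch) >= commit_every:
            yield add_message(batch)
            yield "<commit/>"
            batch = []
    if batch:
        # Flush the final, partially filled batch.
        yield add_message(batch)
        yield "<commit/>"
    if optimize_at_end:
        # Intended for a once-a-day or offline run, not every batch.
        yield "<optimize/>"
```

Keeping the optimize behind a flag reflects the policy in the post: commits are frequent and cheap, while the optimize is left to a separate daily or offline run.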

Cheers,

Jason


RE: Question on how index works - runs out of disk space!

sundar shankar
It totally helps. Thanks, Jason!
Hoss,
Are the parameters you mentioned available in the sample solrconfig.xml that comes with the nightly build? My schema and config files are about a year old (from the 1.2.x version), and I am not sure whether the 1.3 versions of those files have default options like these that improve performance. Do you think it would be good to upgrade the XML files too, and copy the changes I made to the 1.2 defaults into the newer ones? I am already using the 1.3 archives, though!

-Sundar


> Date: Thu, 11 Sep 2008 10:51:49 -0400
> From: [hidden email]
> To: [hidden email]
> Subject: Re: Question on how index works - runs out of disk space!
>
> [...]

RE: Question on how index works - runs out of disk space!

hossman
:        Are the parameters you mentioned, available in the sample
: solrconfig.xml that comes with the nightly build? My schema and config

The options for influencing when merging happens have always been in the
same solrconfig.xml ... but there are new ones in 1.3 to reflect the new
options in Lucene (I think they're new in 1.3; they might have been in
1.2).

: performance better. Do you guys feel that it will be good to upgrade the
: xmls too and copy my changes made with the defaults of 1.2 to the newer
: one? I am using the 1.3 archives already though!

I don't recommend people throw out their configs when upgrading Solr. I do
recommend that when upgrading, people compare the example configs in the
version they were using with the example configs in the new version, and
ask themselves if the changes made there make sense for them.  Sometimes
new options are added that you may want to take advantage of, sometimes
old options are removed because they are deprecated -- either way you
should actually compare and decide if it's right for you.
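
The comparison suggested above can be done with any diff tool. As one sketch, Python's standard `difflib` module produces a unified diff of two config files. The function names and the file paths in the usage comment are placeholders, not real locations from either release.

```python
import difflib
from pathlib import Path


def config_diff(old_text, new_text, old_name="old/solrconfig.xml",
                new_name="new/solrconfig.xml"):
    """Return unified-diff lines showing how new_text differs from old_text."""
    return list(difflib.unified_diff(
        old_text.splitlines(keepends=True),
        new_text.splitlines(keepends=True),
        fromfile=old_name,
        tofile=new_name,
    ))


def diff_files(old_path, new_path):
    """Read two config files from disk and diff them."""
    old_text = Path(old_path).read_text(encoding="utf-8")
    new_text = Path(new_path).read_text(encoding="utf-8")
    return config_diff(old_text, new_text, str(old_path), str(new_path))


# Usage (placeholder paths):
#   print("".join(diff_files("old-example/solrconfig.xml",
#                            "new-example/solrconfig.xml")))
```

Diffing the two example configs first, and only then looking at your own file, keeps the release's changes separate from your local customisations.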


-Hoss