best way to ensure IndexWriter won't corrupt the index?

best way to ensure IndexWriter won't corrupt the index?

Istvan Soos
Hi,

What are the typical scenarios in which the index can become corrupt? E.g.
can a simple JVM crash during indexing cause it?
What is the best way to minimize the possibility of a corrupt index?
Copy the directory before indexing, then flip the pointers?

I'm using Lucene 2.9.

Thanks,
    Istvan

Re: best way to ensure IndexWriter won't corrupt the index?

Ian Lea
> What are the typical scenarios in which the index can become corrupt?

Dodgy disks.

> E.g. can a simple JVM crash during indexing cause it?

No.  See the javadocs for IndexWriter.

> What is the best way to minimize the possibility of a corrupt index?

Don't use dodgy disks.

> Copy the directory before indexing, then flip the pointers?

Yes, that's good.  Or just make sure you've got a backup, or can
easily recreate your index from scratch.
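
If you do copy the directory, do it while no IndexWriter is open on the
index.  A rough, untested sketch against the 2.9 API (the paths are just
placeholders; check the javadocs for the exact signatures):

  import java.io.File;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.FSDirectory;

  // Copy the live index to a backup directory while no writer is open on it.
  Directory src = FSDirectory.open(new File("/path/to/index"));       // placeholder path
  Directory dest = FSDirectory.open(new File("/path/to/index.bak"));  // placeholder path
  Directory.copy(src, dest, false);   // false = don't close src when done
  dest.close();
  src.close();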


--
Ian.

Re: best way to ensure IndexWriter won't corrupt the index?

Max Lynch
On Wed, Nov 25, 2009 at 9:31 AM, Ian Lea <[hidden email]> wrote:

> > What are the typical scenarios in which the index can become corrupt?
>
> Dodgy disks.
>

I also have had index corruption on two occasions.  It is not a big deal for
me since my data is fairly real-time, so the old documents aren't as
important.

However, I'm running this on a VPS with Slicehost, so whether or not they
use dodgy disks is not something I can confirm or even deal with.

I do need to upgrade to 2.9 from 2.4, but I think one of the reasons for my
index corruption is deleting the write.lock file by hand rather than removing
the lock through the Lucene APIs.  This seems like it could be a cause of
corruption, correct?

Re: best way to ensure IndexWriter won't corrupt the index?

Ian Lea
Yes, good point.  Messing around with Lucene's locking may well be a way
to get corrupt indexes.  Any others?
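
If you ever do have to clear a stale lock, at least go through the API
instead of deleting the file by hand, and only when you're certain no
other writer is running.  A rough sketch (2.9 API; the path is a
placeholder):

  import java.io.File;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.FSDirectory;

  Directory dir = FSDirectory.open(new File("/path/to/index"));  // placeholder path
  if (IndexWriter.isLocked(dir)) {
      // Only safe when you are sure no other IndexWriter is still running.
      IndexWriter.unlock(dir);
  }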


--
Ian.


Re: best way to ensure IndexWriter won't corrupt the index?

Michael McCandless-2
In reply to this post by Max Lynch
Before 2.4 it was possible that a crash of the OS, or sudden power
loss to the machine, could corrupt the index.  But that's been fixed
with 2.4.

The only known sources of corruption are hardware faults (bad RAM, bad
disk, etc.), and, accidentally allowing 2 writers to write to the same
index at once (this will very quickly cause corruption).  Lucene's
write lock normally prevents this from happening.
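
The write lock means a second writer fails fast instead of silently
corrupting things.  Roughly (a sketch only; the path is a placeholder and
the exact constructor you use may differ):

  import java.io.File;
  import java.io.IOException;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.store.FSDirectory;
  import org.apache.lucene.store.LockObtainFailedException;
  import org.apache.lucene.util.Version;

  try {
      IndexWriter writer = new IndexWriter(
          FSDirectory.open(new File("/path/to/index")),   // placeholder path
          new StandardAnalyzer(Version.LUCENE_29),
          false,                                          // append to an existing index
          IndexWriter.MaxFieldLength.UNLIMITED);
      // ... add / delete documents ...
      writer.close();
  } catch (LockObtainFailedException e) {
      // Another IndexWriter already holds write.lock on this index --
      // back off instead of deleting the lock file.
  } catch (IOException e) {
      // Other I/O problems.
  }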

kill -9, JRE crashing, OS crashing, power loss, etc., should not cause
corruption for Lucene >= 2.4.

Backing up is definitely a good idea -- Lucene's
SnapshotDeletionPolicy makes it easy to do a hot backup (backup even
though IndexWriter is still changing the index).  There's a paper on
this (NOTE: I'm the author!) available at
http://www.manning.com/hatcher3/ that gives details (look for "Hot
backups with Lucene (green paper - html)").
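
The gist of it is something like this (rough sketch; dir and analyzer are
whatever you normally use, and check the javadocs / the paper for the
exact constructor signatures):

  import java.util.Collection;
  import org.apache.lucene.index.IndexCommit;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy;
  import org.apache.lucene.index.SnapshotDeletionPolicy;

  SnapshotDeletionPolicy snapshotter =
      new SnapshotDeletionPolicy(new KeepOnlyLastCommitDeletionPolicy());
  IndexWriter writer = new IndexWriter(dir, analyzer, snapshotter,
      IndexWriter.MaxFieldLength.UNLIMITED);
  // ... keep indexing / committing as usual ...
  try {
      IndexCommit commit = snapshotter.snapshot();   // pins the current commit point
      Collection<String> files = commit.getFileNames();
      // copy each file in 'files' from the index directory to the backup location
  } finally {
      snapshotter.release();   // lets IndexWriter delete those files again when needed
  }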

Mike

Re: best way to ensure IndexWriter won't corrupt the index?

Max Lynch
On Wed, Nov 25, 2009 at 9:49 AM, Michael McCandless <
[hidden email]> wrote:

> Before 2.4 it was possible that a crash of the OS, or sudden power
> loss to the machine, could corrupt the index.  But that's been fixed
> with 2.4.
>
> The only known sources of corruption are hardware faults (bad RAM, bad
> disk, etc.), and, accidentally allowing 2 writers to write to the same
> index at once (this will very quickly cause corruption).  Lucene's
> write lock normally prevents this from happening.
>

So in my case I have an indexer running in the background, and if it had
been running for more than 8 hours, I would remove the write lock and start
another indexer, which would cause corruption if the first one was still
writing.  It's a bad process I'm using, and I know it's not how Lucene
usually does things, so I need to improve my system.

Re: best way to ensure IndexWriter won't corrupt the index?

Erick Erickson
Why do you want to kill your indexer anyway? Just because it had
been running "too long"? Or was it behaving poorly?

But yeah, you need to change your process; you're almost guaranteeing
that you'll corrupt your index.  Perhaps, if you really need to stop and
restart, you could have your indexer voluntarily stop after 8 hours.  That
would let you close your IndexWriter, thus ensuring that all the
documents in memory are written to the index.  Killing the process
from outside does NOT guarantee this.  Your index won't be corrupt, but
it also won't have all your documents.
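
Something along these lines (just a sketch; BackgroundIndexer, stopRequested
and the document-feeding methods are made-up names, not anything from your
code):

  import java.io.IOException;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.index.IndexWriter;

  // Cooperative stop: set the flag from another thread (or a timer) instead
  // of killing the JVM, so the writer gets closed cleanly.
  public class BackgroundIndexer {
      private volatile boolean stopRequested = false;

      public void requestStop() { stopRequested = true; }

      public void run(IndexWriter writer) throws IOException {
          try {
              while (!stopRequested && hasMoreDocuments()) {
                  writer.addDocument(nextDocument());
              }
          } finally {
              writer.close();   // flushes buffered docs and releases write.lock
          }
      }

      // hypothetical document source
      private boolean hasMoreDocuments() { return false; }
      private Document nextDocument() { return null; }
  }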

Best
Erick

Re: best way to ensure IndexWriter won't corrupt the index?

Max Lynch
On Wed, Nov 25, 2009 at 11:18 AM, Erick Erickson <[hidden email]> wrote:

> Why do you want to kill your indexer anyway? Just because it had
> been running "too long"? Or was it behaving poorly?
>
> But yeah, you need to change your process; you're almost guaranteeing
> that you'll corrupt your index.


I've learned a lot more about Lucene since I wrote the first indexer, so
going back over it and keeping this stuff in mind is going to make it much
better.

Thanks.

Re: best way to ensure IndexWriter won't corrupt the index?

khosro asgharifard
In reply to this post by Michael McCandless-2
Hello,
>>Before 2.4 it was possible that a crash of the OS, or sudden power
>>loss to the machine, could corrupt the index.  But that's been fixed
>>with 2.4.

Did you mean that Lucene does not have this issue in 2.4.1?
We are running a program that indexes some data, and sometimes we must shut down Tomcat,
and in some cases the index gets corrupted.  This is a problem for us: our data is huge and
we cannot reindex all of it again.  We use Lucene 2.4.0 with Compass.

Best Regards
Khosro.  

Re: best way to ensure IndexWriter won't corrupt the index?

Michael McCandless-2
Right, in 2.4, if you kill -9, pull power, OS crashes, etc., it should
not corrupt the index.

Can you share details on what corruption you see?  Is it possible
there are two IndexWriters open on the index at once?

Mike

Re: best way to ensure IndexWriter won't corrupt the index?

khosro asgharifard
>>Right, in 2.4, if you kill -9, pull power, OS crashes, etc., it should
>>not corrupt the index.

>>Can you share details on what corruption you see?  Is it possible
>>there are two IndexWriters open on the index at once?

Our app is multithreaded, and sometimes when we shut down Tomcat the
write.lock file remains and does not disappear.  On the subsequent run the
write.lock does not allow us to index data, so I must delete it by hand,
and after that, when I check the index with Luke, it says "There is no
valid Lucene index in this directory".
In another case, after shutting down Tomcat and restarting it more than two
times, the write.lock file again blocks indexing, I must delete it by hand,
and again Luke reports "There is no valid Lucene index in this directory".

Khosro.



Re: best way to ensure IndexWriter won't corrupt the index?

Michael McCandless-2
The leftover write.lock is expected on ungraceful shutdown; you just
have to remove it.  Or, switch to NativeFSLockFactory, which correctly
detects when the lock is no longer in use.
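
Roughly (2.9-style API; the path is a placeholder, and check the javadocs
for the exact FSDirectory.open overloads):

  import java.io.File;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.FSDirectory;
  import org.apache.lucene.store.NativeFSLockFactory;

  // With native OS locks, a write.lock left behind by a killed JVM is not
  // treated as held, so you never have to delete it by hand.
  Directory dir = FSDirectory.open(new File("/path/to/index"),   // placeholder path
                                   new NativeFSLockFactory());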

Next time this happens, try running CheckIndex on the index.
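
For example (it can also be run from the command line as
java org.apache.lucene.index.CheckIndex <indexDir>):

  import org.apache.lucene.index.CheckIndex;
  import org.apache.lucene.store.Directory;

  CheckIndex checker = new CheckIndex(dir);            // dir = your index Directory
  CheckIndex.Status status = checker.checkIndex();
  if (!status.clean) {
      // The index has problems; checker.fixIndex(status) can drop the
      // unreadable segments (losing the documents they contain).
  }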

Are you sure the Luke version you're using is new enough to read 2.4.1 indexes?

Mike
