How to rebuild a corrupted index?

11 messages

How to rebuild a corrupted index?

Cristian Lorenzetto
Can Lucene rebuild a corrupted index using its internal info, and if so, how? Or do I have to reinsert everything some other way?

Re: How to rebuild a corrupted index?

Marco Reis
I'm afraid it's not possible to rebuild the index. That's why it's important to maintain a backup policy.
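
For reference, Lucene supports taking consistent backups of a live index with SnapshotDeletionPolicy. A minimal sketch (assuming Lucene 5.x/6.x; the file copy itself is elided):

```java
import java.io.IOException;
import org.apache.lucene.index.IndexCommit;
import org.apache.lucene.index.SnapshotDeletionPolicy;

class IndexBackup {
  // The policy must be the instance installed via
  // IndexWriterConfig.setIndexDeletionPolicy(...) before the writer opened,
  // e.g. new SnapshotDeletionPolicy(new KeepOnlyLastCommitDeletionPolicy()).
  static void backup(SnapshotDeletionPolicy snapshotter) throws IOException {
    IndexCommit commit = snapshotter.snapshot(); // pin this commit's files
    try {
      for (String file : commit.getFileNames()) {
        // copy `file` from the index directory to the backup location
      }
    } finally {
      snapshotter.release(commit); // unpin so merges can delete old files again
    }
  }
}
```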


--
Marco Reis
Software Architect
http://marcoreis.net
https://github.com/masreis
+55 61 9 81194620

Re: How to rebuild a corrupted index?

Michael McCandless-2
You can use Lucene's CheckIndex tool with the -exorcise option, but this is quite brutal: it simply drops any segment in which it detects corruption.

Mike McCandless

http://blog.mikemccandless.com
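
For concreteness, a minimal sketch of running the same check programmatically (assuming Lucene 5.x/6.x). The command-line equivalent is `java org.apache.lucene.index.CheckIndex <indexDir> -exorcise` with lucene-core on the classpath:

```java
import java.nio.file.Paths;
import org.apache.lucene.index.CheckIndex;
import org.apache.lucene.store.FSDirectory;

public class Exorcise {
  public static void main(String[] args) throws Exception {
    try (FSDirectory dir = FSDirectory.open(Paths.get(args[0]));
         CheckIndex checker = new CheckIndex(dir)) {
      CheckIndex.Status status = checker.checkIndex(); // full integrity scan
      if (!status.clean) {
        // Destructive: rewrites the segments file without the corrupt
        // segments, permanently losing the documents they contained.
        checker.exorciseIndex(status);
      }
    }
  }
}
```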


Re: How to rebuild a corrupted index?

Cristian Lorenzetto
Correction and follow-up to the questions in my previous post.

I studied these Lucene classes a bit to understand the following:
1) setCommitData is designed for versioning the index, not for passing a transaction log. However, if the user data is different for every transaction id, it is equivalent.
2) NRT automatically refreshes the searcher/reader; it doesn't call commit. I based my NRT implementation on
http://stackoverflow.com/questions/17993960/lucene-4-4-0-new-controlledrealtimereopenthread-sample-usage.
In that example a commit is executed synchronously for every CRUD operation, but in general it is advised to use a batch thread, because a commit is a long operation. *So it is not clear how to do the commit in a near-real-time system with an index of indefinite size.*
     2.a If the commit is synchronous, I can use the user data, since it is set before a commit: every commit has different user data, and I can trace the transaction changes. But in general a commit can take minutes to complete, so this doesn't seem a real option for a near-real-time system.
     2.b If the commit is async, it is executed every X seconds (or better, when memory is full). The commit then cannot be used for tracing the transactions, but I can pass a transaction id associated with each Lucene commit. I can add a mutex around CRUD operations (while I load the uncommitted data), so I am sure the last uncommitted index is aligned to the last transaction id X; then there is no overlapping, and the CRUD block is very fast when it happens. But how do I guarantee that the commit corresponds to the last commit index I loaded? Maybe by introducing that mutex in a custom MergePolicy?
Is what I have written so far correct? Is 2.b the best solution? In that case, how do I guarantee the commit is done based on the uncommitted data loaded at a specific commit index?
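
For point 1 and case 2.a, a minimal sketch of stamping each commit with the last transaction id via commit user data (assuming Lucene 6.x, where IndexWriter.setCommitData is available; later releases rename it to setLiveCommitData):

```java
import java.io.IOException;
import java.util.Collections;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;

class TxnStamp {
  // Stamp the commit with the highest transaction id it contains.
  static void commitUpTo(IndexWriter writer, long lastTxnId) throws IOException {
    writer.setCommitData(
        Collections.singletonMap("lastTxnId", Long.toString(lastTxnId)));
    writer.commit();
  }

  // On startup, read back how far the index got before the crash.
  static long lastCommittedTxn(DirectoryReader reader) throws IOException {
    String v = reader.getIndexCommit().getUserData().get("lastTxnId");
    return v == null ? -1L : Long.parseLong(v);
  }
}
```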





2017-03-22 15:32 GMT+01:00 Michael McCandless <[hidden email]>:

> Hi, I think you forgot to CC the Lucene users list ([hidden email]) in your reply? Can you resend?
>
> Thanks.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Wed, Mar 22, 2017 at 9:02 AM, Cristian Lorenzetto <[hidden email]> wrote:
>
>> Hi, I am thinking about what you told me in the previous message, and about how to solve both the corruption problem and the problem of the commit operation being executed asynchronously.
>>
>> I am thinking of creating a simple transaction log in a file, using an atomic long sequence as an orderable transaction id.
>>
>> When I perform a new operation:
>> 1) Generate a new incremental transaction id.
>> 2) Save the operation's abstract info in the transaction log, associated with that id:
>>     2.a for insert/update, a serialized version of the object to save;
>>     2.b for delete, the serialized query the delete applies to.
>> 3) Execute the same operation in Lucene, first adding a transactionId property (executed in RAM).
>> 4) The commit is executed asynchronously. After the commit, the transaction log up to the last transaction id is deleted. (I don't know how to insert a block after the commit when using a near-real-time reader and SearcherManager.) I might introduce some logic into the way a commit is done. The order is similar to a queue, so it follows the transactionId order. Is there an example of committing a specific set of uncommitted operations?
>> 5) I need the guarantee that, after a CRUD operation, the data is available in memory for a possibly imminent search, so I think I should execute flush/refreshReader after every CUD operation (see the sketch after this message).
>>
>> If there is a failure, the transaction log will not be empty, and after restart I can re-execute the operations that were not executed. Maybe it could also be useful for fixing a corruption, but is it certain that the corruption doesn't also touch segments that were already completely committed in the past? Or, for a stable solution, should I save the data in a secondary repository anyway?
>>
>> In your opinion, will this solution be sufficient? Does it look good to you, or am I forgetting some aspects?
>>
>> PS: Another interesting aspect could be associating a segment with a transaction. That way, if a segment is missing, I can apply it again without rebuilding the whole index from scratch.
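
A minimal sketch of point 5 above — making changes searchable after every CUD operation without committing — assuming Lucene 6.x's SearcherManager (the class around it is illustrative):

```java
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.SearcherFactory;
import org.apache.lucene.search.SearcherManager;

class NrtIndex {
  private final IndexWriter writer;
  private final SearcherManager manager;

  NrtIndex(IndexWriter writer) throws IOException {
    this.writer = writer;
    this.manager = new SearcherManager(writer, new SearcherFactory());
  }

  void add(Document doc) throws IOException {
    writer.addDocument(doc);
    manager.maybeRefresh(); // change becomes searchable; no commit involved
  }

  // Callers must acquire/release around every search.
  IndexSearcher acquire() throws IOException { return manager.acquire(); }
  void release(IndexSearcher s) throws IOException { manager.release(s); }
}
```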

Re: How to rebuild a corrupted index?

Cristian Lorenzetto
Following the same line of thinking, I am adding an explanation to avoid misunderstanding. I use the transactionId not to introduce transactions into Lucene (an async commit rules out a traditional transaction system) but to sign segments with an external key (the transaction id). So if, because of a corruption error in the index, I can't find segment 5, I can inspect segments 4 and 6 to work out the range of foreign keys (transaction ids) to reload into Lucene. That way I can reload all the missing documents into Lucene, for example from a database.





Re: How to rebuild a corrupted index?

Michael McCandless-2
You should be able to use the sequence numbers returned by IndexWriter
operations to "know" which operations made it into the commit and which did
not, and then on disaster recovery replay only those operations that didn't
make it?

Mike McCandless

http://blog.mikemccandless.com
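
A minimal sketch of this recovery scheme (assuming Lucene 6.2+, where IndexWriter operations and commit() return long sequence numbers; the transaction-log interface here is hypothetical):

```java
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

class SeqNoRecovery {
  interface TxnLog { // hypothetical write-ahead log
    void append(long seqNo, byte[] serializedOp);
    void truncateUpTo(long seqNo);
  }

  static void update(IndexWriter writer, TxnLog log, Term id,
                     Document doc, byte[] serializedOp) throws IOException {
    long seqNo = writer.updateDocument(id, doc); // per-operation sequence number
    log.append(seqNo, serializedOp);
  }

  static void checkpoint(IndexWriter writer, TxnLog log) throws IOException {
    long commitSeqNo = writer.commit(); // every op <= this seqNo is now durable
    log.truncateUpTo(commitSeqNo);
    // After a crash, replay only log entries with seqNo > commitSeqNo
    // of the last successful commit.
  }
}
```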


Re: How to rebuild a corrupted index?

Cristian Lorenzetto
Yes, exactly. Working in the past on systems that use Lucene (for example, Alfresco projects), I saw that Lucene corruption does happen sometimes, and each time the rebuild takes a long time, so I thought of a way to speed up fixing a corrupted index. There is also a rare case not described here: if Lucene throws an exception after a database commit (for example, because the disk is full), the database and the Lucene index can become misaligned. With this system those problems could be fixed automatically. In the database every row has a transaction-id property, so if I know that segment 6 is missing from Lucene and it corresponds to the transaction range [1000, 1050], I can reload just the corresponding rows with a database query.


Re: How to rebuild a corrupted index?

Michael McCandless-2
Lucene corruption should be rare and only due to bad hardware; if you are
seeing otherwise we really should get to the root cause.

Mapping documents to each segment will not be easy in general, especially
if that segment is now corrupted so you can't search it.

Documents lost because of power loss / OS crash while indexing can be more common, and it's for that use case that the sequence numbers / transaction log should be helpful.

Mike McCandless

http://blog.mikemccandless.com


Re: How to rebuild a corrupted index?

Cristian Lorenzetto
I deduce the transaction range not from the corrupted segment but from the intact segments. The transaction id is incremental, and I assume segments are saved sequentially, so if segment 5 is missing, reading the intact segment 4 gives me the maximum transaction id A, and reading segment 6 gives me the minimum transaction id B. From those I can deduce the hole: the range is [A+1, B-1]. Then, with a database query, I reload the corresponding rows and add the missing documents back into Lucene.
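
A sketch of that deduction, assuming every document stores its transaction id in a NumericDocValuesField named "txnId" and Lucene 6.x's random-access doc-values API:

```java
import java.io.IOException;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.NumericDocValues;

class SegmentTxnRanges {
  // Print the [min, max] transaction-id range of every intact segment;
  // a gap between consecutive ranges is the hole [A+1, B-1] to reload.
  // For brevity this ignores deleted documents.
  static void print(DirectoryReader reader) throws IOException {
    for (LeafReaderContext ctx : reader.leaves()) {
      NumericDocValues txn = ctx.reader().getNumericDocValues("txnId");
      if (txn == null) continue; // segment has no txnId field
      long min = Long.MAX_VALUE, max = Long.MIN_VALUE;
      for (int doc = 0; doc < ctx.reader().maxDoc(); doc++) {
        long v = txn.get(doc); // Lucene 6.x random-access doc-values API
        min = Math.min(min, v);
        max = Math.max(max, v);
      }
      System.out.println("segment " + ctx.ord + ": [" + min + ", " + max + "]");
    }
  }
}
```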



Re: How to rebuild a corrupted index?

Michael McCandless-2
If you use a single thread then, yes, segments are sequential.

But if e.g. you are updating documents, then deletions (because a document
was replaced) are recorded against different segments, so merely dropping
the corrupted segment will mean you don't drop the deletions.

Mike McCandless

http://blog.mikemccandless.com


Re: How to rebuild a corrupted index?

Cristian Lorenzetto
You are right, but maybe it is possible to solve this problem. I can try :) I'm not sure, but in NRT, using a single committer, there is a single batch thread executing the commits, so they might be sequential.

I think your case is when two segments have not been merged yet and contain changes to the same entities. I imagine this case can occur until the segments are merged into a single segment. So, practically, if I add extra information about deletions, there is no risk of consuming too much disk, because it is a temporary state, not accumulative. In the database I have a special deletion table:

    deletion_table(transactionId, entityId)

For example:

- document A inserted in segment 1 with transaction 5
- document B inserted in segment 1 with transaction 5
- document C inserted in segment 1 with transaction 6
- document A deleted in segment 2 with transaction 7, and I save 7 -> A in the deletion table until the segments are merged

Now segment 2 is corrupted, so I search the range [7, *].

In the database I look up transaction 7 but don't find it in the document tables, so I search the deletion table and find 7 -> A. I check that entity A is not present in the document tables, so I can deduce that I have to re-apply the delete of entity A in Lucene, removing that document.
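
A sketch of that recovery pass (the database helpers here are hypothetical):

```java
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

class HoleReplay {
  interface Db { // hypothetical database access
    Document findDocumentByTxn(long txnId);      // null if not in document tables
    String findDeletedEntityIdByTxn(long txnId); // from deletion_table, else null
  }

  static void replay(IndexWriter writer, Db db,
                     long holeStart, long holeEnd) throws IOException {
    for (long txn = holeStart; txn <= holeEnd; txn++) {
      Document doc = db.findDocumentByTxn(txn);
      if (doc != null) {
        // insert/update: re-add, replacing any stale copy by entity id
        writer.updateDocument(new Term("entityId", doc.get("entityId")), doc);
      } else {
        String entityId = db.findDeletedEntityIdByTxn(txn);
        if (entityId != null) {
          writer.deleteDocuments(new Term("entityId", entityId)); // re-apply delete
        }
      }
    }
  }
}
```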










