How can we know if 2 lucene indexes are same?

How can we know if 2 lucene indexes are same?

Noble Paul നോബിള്‍  नोब्ळ्
Hi,
I would like to know whether two indexes contain the same data.
Will all the files be exactly the same if I add the same set of documents to both?
--Noble

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: How can we know if 2 lucene indexes are same?

Karl Wettin

If you insert the documents in the same order with the same settings
and both indices are optimized, then the files ought to be
identical. I'm not sure, however.
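
A minimal sketch of that file-level check, assuming both index directories are reachable as local paths. It uses only the JDK, not Lucene; the class name IndexDirCompare and its methods are made up for illustration. Matching checksums mean the files are byte-identical. A mismatch does not prove the documents differ, since file names and bytes can vary with how each index was built.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Lucene.
public class IndexDirCompare {

    // Hex-encoded SHA-256 of one file, streamed so large files are not loaded whole.
    static String checksum(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // File name -> checksum for every regular file directly inside dir.
    static Map<String, String> fingerprint(Path dir) throws IOException, NoSuchAlgorithmException {
        Map<String, String> result = new TreeMap<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            for (Path f : files) {
                if (Files.isRegularFile(f)) {
                    result.put(f.getFileName().toString(), checksum(f));
                }
            }
        }
        return result;
    }

    // True when both directories hold the same file names with identical bytes.
    static boolean sameFiles(Path a, Path b) throws IOException, NoSuchAlgorithmException {
        return fingerprint(a).equals(fingerprint(b));
    }

    public static void main(String[] args) throws Exception {
        // Stand-in directories with a made-up file name, just to exercise the check.
        Path a = Files.createTempDirectory("indexA");
        Path b = Files.createTempDirectory("indexB");
        Files.write(a.resolve("_0.cfs"), new byte[] {1, 2, 3});
        Files.write(b.resolve("_0.cfs"), new byte[] {1, 2, 3});
        System.out.println(sameFiles(a, b)); // prints "true"
        Files.write(b.resolve("_0.cfs"), new byte[] {1, 2, 4});
        System.out.println(sameFiles(a, b)); // prints "false"
    }
}
```

Only the per-file checksums need to cross the network, so this is much cheaper than transferring the index itself.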

The instantiated index contrib module contains a test that asserts that
two index readers are identical. You could use this to be really sure,
but it is a rather long-running process for a large index:

http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/instantiated/src/test/org/apache/lucene/store/instantiated/TestIndicesEquals.java


Perhaps you should explain why you need to do this.


           karl

Re: How can we know if 2 lucene indexes are same?

Noble Paul നോബിള്‍  नोब्ळ्
The use case is as follows.

I have two indexes, one on the master and one on the slave. The user
commits on the master from time to time, and the delta is replicated
every time. But when an optimize happens, the transfer size can be
really large. So I am thinking of running the optimize separately on
the master and the slave.

So far, so good. But how can I really know that, after the optimize,
the indexes are indeed the same and that no documents were added in
between?




Re: How can we know if 2 lucene indexes are same?

叶双明
No documents can be added to an index while it is being optimized, and
an optimize can't run while documents are being added to the index.
So, barring other errors, I think we can believe the two indexes are
indeed the same.

:)


Re: How can we know if 2 lucene indexes are same?

Michael McCandless-2

Actually, as of 2.3, this is no longer true: merges and optimize run
in the background, and documents can be added, updated, or deleted at
the same time.

I think it's probably best to use application logic (outside of
Lucene) to keep track of what updates happened to the master while the
slave was optimizing.
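
One way such application-level tracking could look, as a rough sketch only: the master gives every change a sequence number, and the slave remembers the last number it has applied. The class UpdateLog, its Entry type, the "add"/"delete" strings, and the docId key are all hypothetical names for illustration. The log here lives in memory, whereas a real setup would presumably persist it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical application-side log, kept outside of Lucene.
public class UpdateLog {

    // One logged change; op is "add" or "delete", docId is an application-level key.
    static final class Entry {
        final long seq;
        final String op;
        final String docId;

        Entry(long seq, String op, String docId) {
            this.seq = seq;
            this.op = op;
            this.docId = docId;
        }
    }

    private final List<Entry> entries = new ArrayList<>();
    private long lastSeq = 0;

    // Called on the master for every change it applies to its index.
    synchronized long record(String op, String docId) {
        lastSeq++;
        entries.add(new Entry(lastSeq, op, docId));
        return lastSeq;
    }

    // Sequence number of the most recent change on the master.
    synchronized long lastSeq() {
        return lastSeq;
    }

    // Changes with a sequence number above the one the slave last applied.
    synchronized List<Entry> since(long slaveSeq) {
        List<Entry> missed = new ArrayList<>();
        for (Entry e : entries) {
            if (e.seq > slaveSeq) {
                missed.add(e);
            }
        }
        return missed;
    }

    public static void main(String[] args) {
        UpdateLog master = new UpdateLog();
        master.record("add", "doc-1");
        master.record("add", "doc-2");
        long slaveSeq = master.lastSeq();   // slave is in sync and starts optimizing
        master.record("delete", "doc-1");   // arrives while the slave optimizes
        List<Entry> missed = master.since(slaveSeq);
        System.out.println(missed.size() + " update(s) to replay"); // prints "1 update(s) to replay"
    }
}
```

If since() returns an empty list after both sides have optimized, nothing changed in between and the slave can skip the download; otherwise it replays just the missed entries.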

Mike


Re: How can we know if 2 lucene indexes are same?

叶双明
I don't agree with Michael McCandless. :)

I know that since 2.3, adds and deletes can run on one IndexWriter at
the same time, and Lucene also has an update method that deletes
documents by term and then adds the new document.

In my test I get either a LockObtainFailedException, with the
Thread.sleep statement:

org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: SimpleFSLock@E:\index\write.lock
 at org.apache.lucene.store.Lock.obtain(Lock.java:85)
 at org.apache.lucene.index.DirectoryIndexReader.acquireWriteLock(DirectoryIndexReader.java:298)
 at org.apache.lucene.index.IndexReader.deleteDocument(IndexReader.java:750)
 at org.apache.lucene.index.IndexReader.deleteDocuments(IndexReader.java:786)
 at org.test.IndexThread.run(IndexThread.java:33)

or a StaleReaderException, without the Thread.sleep statement:

org.apache.lucene.index.StaleReaderException: IndexReader out of date and no longer valid for delete, undelete, or setNorm operations
 at org.apache.lucene.index.DirectoryIndexReader.acquireWriteLock(DirectoryIndexReader.java:308)
 at org.apache.lucene.index.IndexReader.deleteDocument(IndexReader.java:750)
 at org.apache.lucene.index.IndexReader.deleteDocuments(IndexReader.java:786)
 at org.test.IndexThread.run(IndexThread.java:31)

My test code:


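// Main.java
import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class Main {

 public static void main(String[] args) throws IOException {
  // Open a writer on an existing index (create == false) and add one document.
  Directory directory = FSDirectory.getDirectory("e:/index");
  IndexWriter writer = new IndexWriter(directory, null, false);
  Document document = new Document();
  document.add(new Field("bbb", "bbb", Store.YES, Index.UN_TOKENIZED));
  writer.addDocument(document);

  // Start a second thread that deletes through its own IndexReader
  // while this writer is still open.
  Thread t = new IndexThread();
  t.start();

  try {
   Thread.sleep(1000);
  } catch (InterruptedException e) {
   // Only report the interruption; the test carries on regardless.
   e.printStackTrace();
  }

  writer.optimize();
  writer.close();
  System.out.println("out");
 }
}

// IndexThread.java
import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class IndexThread extends Thread {

 @Override
 public void run() {
  Directory directory;
  try {
   try {
    Thread.sleep(10);
   } catch (InterruptedException e) {
    // Only report the interruption; the test carries on regardless.
    e.printStackTrace();
   }

   directory = FSDirectory.getDirectory("e:/index");
   System.out.println("thread begin");
   //IndexWriter reader = new IndexWriter(directory, null, false);
   // Delete by term through a separate IndexReader on the same index
   // directory that Main's IndexWriter has open.
   IndexReader reader = IndexReader.open(directory);
   Term term = new Term("bbb", "bbb");
   reader.deleteDocuments(term);
   reader.close();
   System.out.println("thread end");
  } catch (IOException e) {
   // This is where the exceptions shown above surface.
   e.printStackTrace();
  }
 }
}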




Re: How can we know if 2 lucene indexes are same?

Michael McCandless-2

Sorry, I should have said: you must always use the same writer. That
is, as of 2.3, while IndexWriter.optimize (or normal segment merging)
is running in one thread, another thread can use that *same* writer
to add/delete/update documents, and both are free to make changes to
the index.

Before 2.3, optimize() was fully synchronized and blocked
add/update/delete operations from changing the index until the
optimize() call completed.

So, your test is expected to fail: you're not allowed to open 2
"writers" on a single index at the same time, where "writer" includes
an IndexReader that deletes documents; so those exceptions
(LockObtainFailed, StaleReader) are expected.
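
To illustrate why a second "writer" is turned away, here is a small JDK-only model of a lock file, loosely patterned on the write.lock seen in the stack trace. It is not Lucene's implementation: WriteLockDemo, obtain and release are invented names, and unlike the Lock.obtain in the trace, which times out, this version does not wait and retry.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative model only; not Lucene code.
public class WriteLockDemo {

    // Take the lock by atomically creating the lock file; false if it already exists.
    static boolean obtain(Path lockFile) throws IOException {
        try {
            Files.createFile(lockFile);
            return true;
        } catch (FileAlreadyExistsException e) {
            return false;
        }
    }

    // Give the lock back by removing the lock file.
    static void release(Path lockFile) throws IOException {
        Files.deleteIfExists(lockFile);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("index");
        Path lock = dir.resolve("write.lock");
        System.out.println("first writer obtains lock: " + obtain(lock));  // true
        System.out.println("second writer obtains lock: " + obtain(lock)); // false
        release(lock);
        System.out.println("after release: " + obtain(lock));              // true
    }
}
```

In the test above, the IndexWriter in Main plays the first writer and the deleting IndexReader in IndexThread plays the second.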

Mike


Re: How can we know if 2 lucene indexes are same?

叶双明
I see now. Thanks, Michael McCandless, good explanation!


Re: How can we know if 2 lucene indexes are same?

Noble Paul നോബിള്‍  नोब्ळ्
In reply to this post by Michael McCandless-2
I am not using the same index with different writers. These are two
separate indexes, each with its own reader/writer.

I just want to minimize the network load by avoiding the download of
an optimized index if the contents are indeed the same.
--noble

On Thu, Sep 4, 2008 at 7:36 PM, Michael McCandless
<[hidden email]> wrote:

>
> Sorry, I should have said: you must always use the same writer, ie as of
> 2.3, while IndexWriter.optimize (or normal segment merging) is running,
> under one thread, another thread can use that *same* writer to
> add/delete/update documents, and both are free to make changes to the index.
>
> Before 2.3, optimize() was fully synchronized and blocked add/update/delete
> documents from changing the index until the optimize() call completed.
>
> So, your test is expected to fail: you're not allowed to open 2 "writers" on
> a single index at the same time, where "writer" includes an IndexReader that
> deletes documents; so those exceptions (LockObtainFailed, StaleReader) are
> expected.
>
> Mike
>
> 叶双明 wrote:
>
>> I don't agreed with Michael McCandless. :)
>>
>> I konw that after 2.3, add and delete can run in one IndexWriter at one
>> time, and also lucene has a update method which delete documents by term
>> then add the new document.
>>
>> In my test, either LockObtainFailedException with thread sleep sentence:
>>
>> org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out:
>> SimpleFSLock@E:\index\write.lock
>> at org.apache.lucene.store.Lock.obtain(Lock.java:85)
>> at
>>
>> org.apache.lucene.index.DirectoryIndexReader.acquireWriteLock(DirectoryIndexReader.java:298)
>> at
>> org.apache.lucene.index.IndexReader.deleteDocument(IndexReader.java:750)
>> at
>> org.apache.lucene.index.IndexReader.deleteDocuments(IndexReader.java:786)
>> at org.test.IndexThread.run(IndexThread.java:33)
>>
>> or StaleReaderException without thread sleep sentence:
>>
>> org.apache.lucene.index.StaleReaderException: IndexReader out of date and
>> no
>> longer valid for delete, undelete, or setNorm operations
>> at
>>
>> org.apache.lucene.index.DirectoryIndexReader.acquireWriteLock(DirectoryIndexReader.java:308)
>> at
>> org.apache.lucene.index.IndexReader.deleteDocument(IndexReader.java:750)
>> at
>> org.apache.lucene.index.IndexReader.deleteDocuments(IndexReader.java:786)
>> at org.test.IndexThread.run(IndexThread.java:31)
>>
>> My test code:
>>
>>
>> public class Main {
>>
>> public static void main(String[] args) throws IOException {
>>  Directory directory = FSDirectory.getDirectory("e:/index");
>>  IndexWriter writer = new IndexWriter(directory, null, false);
>>  Document document = new Document();
>>  document.add(new Field("bbb", "bbb", Store.YES, Index.UN_TOKENIZED));
>>  writer.addDocument(document);
>>
>>  Thread t = new IndexThread();
>>  t.start();
>>
>>  try {
>>  Thread.sleep(1000);
>>  } catch (InterruptedException e) {
>>  // TODO Auto-generated catch block
>>  e.printStackTrace();
>>  }
>>
>>  writer.optimize();
>>  writer.close();
>>  System.out.println("out");
>> }
>> }
>>
>> public class IndexThread extends Thread {
>>
>> @Override
>> public void run() {
>>  Directory directory;
>>  try {
>>  try {
>>   Thread.sleep(10);
>>  } catch (InterruptedException e) {
>>   // TODO Auto-generated catch block
>>   e.printStackTrace();
>>  }
>>
>>  directory = FSDirectory.getDirectory("e:/index");
>>  System.out.println("thread begin");
>>  //IndexWriter reader = new IndexWriter(directory, null, false);
>>  IndexReader reader = IndexReader.open(directory);
>>  Term term = new Term("bbb", "bbb");
>>  reader.deleteDocuments(term);
>>  reader.close();
>>  System.out.println("thread end");
>>  } catch (IOException e) {
>>  // TODO Auto-generated catch block
>>  e.printStackTrace();
>>  }
>> }
>> }



--
--Noble Paul

Re: How can we know if 2 lucene indexes are same?

叶双明
Do you use the index at the slave as a backup for the index at the master,
so that if the master breaks down you can redirect queries to the slave?

When you add a Document to the master, do you also add it to the slave?

Sorry, I am not clear about what your problem is. Can you give more detail
about what you are worried about?

2008/9/5 Noble Paul നോബിള്‍ नोब्ळ् <[hidden email]>

> I am not using the same index with different writers. These are two
> separate indexes both have their own reader/writer
>
> I just wanted to minimize the network load by avoiding the download of
> an optimized index if the contents are indeed same.
> --noble

Re: How can we know if 2 lucene indexes are same?

Shalin Shekhar Mangar
Let me try to explain.

I have a master where indexing is done. I have multiple slaves for querying.

If I commit+optimize on the master and then rsync the index, the data
transferred on the network is huge. An alternate way is to commit on master,
transfer the delta to the slave and issue an optimize on the slave. This is
very fast because less data is transferred on the network.

However, we need to ensure that the index on the slave is actually in sync
with the master, so that on the next commit we can blindly transfer the
delta to the slave.
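
A minimal sketch of how the delta mentioned above could be computed, assuming each side can list its index files with their lengths. The class and method names (`IndexDelta`, `toFetch`, `toDelete`) are hypothetical, and comparing by name and length relies on the assumption that Lucene writes each segment file once rather than modifying it in place.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Lucene or Solr.
final class IndexDelta {

    /** Files the slave must fetch: on the master, and missing or of a different length on the slave. */
    static Set<String> toFetch(Map<String, Long> master, Map<String, Long> slave) {
        Set<String> out = new TreeSet<>();
        for (Map.Entry<String, Long> e : master.entrySet()) {
            Long have = slave.get(e.getKey());
            if (have == null || !have.equals(e.getValue())) {
                out.add(e.getKey());
            }
        }
        return out;
    }

    /** Files the slave may drop: present on the slave but no longer on the master. */
    static Set<String> toDelete(Map<String, Long> master, Map<String, Long> slave) {
        Set<String> out = new TreeSet<>(slave.keySet());
        out.removeAll(master.keySet());
        return out;
    }
}
```

After an optimize on the master, every segment file is new, so `toFetch` would return almost the whole index. That is the large transfer described above.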

On Fri, Sep 5, 2008 at 1:03 PM, 叶双明 <[hidden email]> wrote:

> Do you use index at the slave as a backup for index at the master??
> And in case the master break down, you can turn the query to the slave??
>
> When add a Document to master, also add it to the slave?
>
> Sorry, I don't clear about what your problem, can you show more detail
> about
> what do you worry about?



--
Regards,
Shalin Shekhar Mangar.

Re: How can we know if 2 lucene indexes are same?

Michael McCandless-2

Shalin Shekhar Mangar wrote:

> Let me try to explain.
>
> I have a master where indexing is done. I have multiple slaves for  
> querying.
>
> If I commit+optimize on the master and then rsync the index, the data
> transferred on the network is huge. An alternate way is to commit on  
> master,
> transfer the delta to the slave and issue an optimize on the slave.  
> This is
> very fast because less data is transferred on the network.

Large segment merges will also send huge traffic.  You may just want  
to send all updates (document adds/deletes) to all slaves directly?  
It'd be nice if you could somehow NOT sync the effects of segment  
merging, but do sync doc add/deletes... not sure how to do that.

> However, we need to ensure that the index on the slave is actually  
> in sync
> with the master. So that on another commit, we can blindly transfer  
> the
> delta to the slave.

I assume your app ensures that no deltas arrive at the slave while
it's running optimize?  So then the question becomes (I think) "if two
indices are identical to begin with, and I separately run optimize on  
each, will the resulting two optimized indices be identical?".

By "in sync" you also require the final segment name (after optimize)  
is identical right?

I think the answer is yes, but I'm not certain unless I think more  
about it.  Also this behavior is not "promised" in Lucene's API.

Merges for optimize are now allowed to run concurrently (by default,  
with ConcurrentMergeScheduler), except for the final (< mergeFactor  
segments) merge, which waits until others have finished.  So if there  
are 7 obvious merges necessary to optimize, 3 will run concurrently,  
while 4 wait.  Those 4 then run as the merges finish over time, which  
may happen in different orders for each index and so different merges  
may run.  Then the final merge will run and I *think* the net number  
of merges that ran should always be the same and so the final segment  
name should be the same.
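
Since identical output is not promised, one way to test it empirically is to fingerprint both index directories and compare the results. The following is only a sketch under stated assumptions: `IndexFingerprint` and `of` are hypothetical names, the index is assumed to be a flat directory of files, and `write.lock` is skipped as a transient lock file. Byte equality is a stricter condition than "same documents", so a mismatch does not prove the contents differ.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical helper, not part of Lucene.
final class IndexFingerprint {

    /** Returns a hex SHA-256 over the names and bytes of all regular files in indexDir. */
    static String of(Path indexDir) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            List<Path> files = new ArrayList<>();
            try (Stream<Path> listing = Files.list(indexDir)) {
                listing.filter(Files::isRegularFile)
                       // write.lock is a lock file, not index data
                       .filter(p -> !p.getFileName().toString().equals("write.lock"))
                       .forEach(files::add);
            }
            // Fixed order by file name, so equal directories hash equally.
            files.sort(Comparator.comparing((Path p) -> p.getFileName().toString()));
            byte[] buf = new byte[8192];
            for (Path file : files) {
                md.update(file.getFileName().toString().getBytes(StandardCharsets.UTF_8));
                md.update((byte) 0); // separator between name and content
                try (InputStream in = Files.newInputStream(file)) {
                    for (int n; (n = in.read(buf)) != -1; ) {
                        md.update(buf, 0, n);
                    }
                }
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 should be available on every Java platform", e);
        }
    }
}
```

The master would publish its fingerprint after optimizing, and the slave would compare it with its own before deciding whether a full download can be skipped.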

Mike


Re: How can we know if 2 lucene indexes are same?

Jason Rutherglen
In Ocean I had to use a transaction log and execute every update through
it, much like SQL database replication, and then let each node handle its
own merging process.  Syncing the indexes is used only to get a new node
up to speed; otherwise it's avoided for the reasons mentioned in the
previous email.
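
A rough sketch of the transaction-log idea, not Ocean's actual code: all names here are hypothetical, and a plain map stands in for each node's index. The master appends every add or delete to the log, and each node replays the entries it has not yet applied, so nodes converge on the same documents regardless of how each one merges its segments.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; a Map stands in for a real index.
final class TransactionLogSketch {

    /** One logged update: a null value means "delete the document with this id". */
    static final class Op {
        final String id;
        final String value;
        Op(String id, String value) { this.id = id; this.value = value; }
    }

    /** Append-only log kept on the master; a position is an index into it. */
    static final class Log {
        private final List<Op> ops = new ArrayList<>();

        synchronized int append(Op op) {
            ops.add(op);
            return ops.size(); // position just past the appended entry
        }

        synchronized List<Op> since(int position) {
            return new ArrayList<>(ops.subList(position, ops.size()));
        }
    }

    /** A node that replays log entries in order and remembers how far it got. */
    static final class Node {
        final Map<String, String> docs = new LinkedHashMap<>();
        int applied = 0;

        void catchUp(Log log) {
            for (Op op : log.since(applied)) {
                if (op.value == null) {
                    docs.remove(op.id);
                } else {
                    docs.put(op.id, op.value);
                }
                applied++;
            }
        }
    }
}
```

Two nodes with the same `applied` position hold the same documents, which gives a cheap reference point for comparing them.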


Re: How can we know if 2 lucene indexes are same?

Noble Paul നോബിള്‍  नोब्ळ्
On Fri, Sep 5, 2008 at 6:20 PM, Jason Rutherglen
<[hidden email]> wrote:
> In Ocean I had to use a transaction log and execute everything that
> way like SQL database replication.  Then let each node handle it's own
> merging process.  Syncing the indexes is used to get a new node up to
> speed, otherwise it's avoided for the reasons mentioned in the
> previous email.
Propagating the logs is OK, but most Solr users may not need that
feature. They usually update the master in a single indexing operation,
which may add millions of docs, and commit after that.
Just think about the cost of indexing that many documents on each
slave. It may slow down the responses from live slaves.
Anyway, we have not ruled out that approach. It can be added as
another option, and we are planning to do that too.




--
--Noble Paul


Re: How can we know if 2 lucene indexes are same?

叶双明
In reply to this post by Michael McCandless-2
This is getting more and more complex. I actually have a small index system
that can be configured with multiple index servers for querying.

In my opinion, because index update operations are synchronized between the
different threads that update the index:

For indexing new data: process the data to be indexed at the master and,
once the documents are built, add them to the index at the master and to
every slave.

For deleting data: delete at the master and at every slave at the same time.

I think we can then trust that the index is the same at the master and at
all slaves, barring unexpected errors on an individual node.

I have also heard that there are frameworks for syncing data between
computers, but I have only heard of them.

Sorry for my English. :)





Re: How can we know if 2 lucene indexes are same?

Shalin Shekhar Mangar
In reply to this post by Michael McCandless-2
On Fri, Sep 5, 2008 at 6:03 PM, Michael McCandless <
[hidden email]> wrote:

>
> Large segment merges will also send huge traffic.  You may just want to
> send all updates (document adds/deletes) to all slaves directly?  It'd be
> nice if you could somehow NOT sync the effects of segment merging, but do
> sync doc add/deletes... not sure how to do that.
>

As Noble said, that is another option we can consider.

> I assume your app ensures that no deltas arrive to the slave while it's
> running optimize?  So then the question becomes (I think) "if two indices
> are identical to begin with, and I separately run optimize on each, will the
> resulting two optimized indices be identical?".


Yes


> By "in sync" you also require the final segment name (after optimize) is
> identical right?


Yes

> I think the answer is yes, but I'm not certain unless I think more about it.
>  Also this behavior is not "promised" in Lucene's API.
>
> Merges for optimize are now allowed to run concurrently (by default, with
> ConcurrentMergeScheduler), except for the final (< mergeFactor segments)
> merge, which waits until others have finished.  So if there are 7 obvious
> merges necessary to optimize, 3 will run concurrently, while 4 wait.  Those
> 4 then run as the merges finish over time, which may happen in different
> orders for each index and so different merges may run.  Then the final merge
> will run and I *think* the net number of merges that ran should always be
> the same and so the final segment name should be the same.


Thanks for the explanation Mike. The core problem is to make sure both
indices are in sync. The log replication helps us because we compare the
master and slave index with a reference point (log position). If it becomes
possible for us to specify a version number during a commit, we can use the
master's version number on the slave. This can help us compare the two
indices. Not sure if that API change will be generally useful. Thoughts?



--
Regards,
Shalin Shekhar Mangar.
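
For the narrower question of whether two index directories ended up byte-for-byte identical (including the final segment file names), a minimal sketch in plain Java could look like the following. It is an illustration only and uses no Lucene API: `IndexDirCompare`, `fingerprint` and `sameFiles` are hypothetical names, and it assumes a flat index directory with no writer active on either side. Identical files mean identical indices, but two logically equal indices need not have identical files, since Lucene does not promise that behavior.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of Lucene.
final class IndexDirCompare {

    /** Maps every regular file name in dir to the hex SHA-256 of its contents. */
    static Map<String, String> fingerprint(Path dir)
            throws IOException, NoSuchAlgorithmException {
        List<Path> files;
        try (Stream<Path> listing = Files.list(dir)) {
            files = listing.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        Map<String, String> result = new TreeMap<>();
        for (Path file : files) {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            // Stream the file: index files can be far larger than memory.
            try (InputStream in = Files.newInputStream(file)) {
                byte[] buffer = new byte[8192];
                int read;
                while ((read = in.read(buffer)) != -1) {
                    sha.update(buffer, 0, read);
                }
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : sha.digest()) {
                hex.append(String.format("%02x", b));
            }
            result.put(file.getFileName().toString(), hex.toString());
        }
        return result;
    }

    /** True when both directories hold the same file names with the same contents. */
    static boolean sameFiles(Path master, Path slave)
            throws IOException, NoSuchAlgorithmException {
        return fingerprint(master).equals(fingerprint(slave));
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: IndexDirCompare <masterDir> <slaveDir>");
            return;
        }
        System.out.println(sameFiles(Paths.get(args[0]), Paths.get(args[1])));
    }
}
```

A check like this reads every byte on both sides, so it suits an occasional sanity test better than a check on every replication cycle.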

Re: How can we know if 2 lucene indexes are same?

叶双明
In reply to this post by 叶双明
Just think about the cost of indexing that many documents on each
slave. It may slow down the responses from live slaves.

I think there must be something like a search service at the slaves that
includes an IndexSearcher or an equivalent object. If that many documents
are indexed by an IndexWriter, isn't the IndexSearcher affected by the
indexing process? After the indexing, reopen the IndexSearcher to load the
new data.




2008/9/5, 叶双明 <[hidden email]>:

>
> This is getting more and more complex. Actually, I have a small index
> system that can be configured with multiple index servers for querying.
>
> In my opinion, because index update operations are synchronized between
> the different threads that update the index:
>
> for indexing new data: process the data to be indexed at the master; once
> the documents are ready, add them to the index at the master and to every
> slave,
>
> for deleting data: delete at the master and at every slave at the same
> time.
>
> I think we can trust that the index is indeed the same at the master and
> at all slaves, barring unexpected errors on an individual node.
>
> I have also heard that there are frameworks for syncing data between
> computers, but I have only heard of them.
>
> Sorry for my English. :)
>
>
>
>
> 2008/9/5, Michael McCandless <[hidden email]>:
>>
>>
>> Shalin Shekhar Mangar wrote:
>>
>> Let me try to explain.
>>>
>>> I have a master where indexing is done. I have multiple slaves for
>>> querying.
>>>
>>> If I commit+optimize on the master and then rsync the index, the data
>>> transferred on the network is huge. An alternate way is to commit on
>>> master,
>>> transfer the delta to the slave and issue an optimize on the slave. This
>>> is
>>> very fast because less data is transferred on the network.
>>>
>>
>> Large segment merges will also send huge traffic.  You may just want to
>> send all updates (document adds/deletes) to all slaves directly?  It'd be
>> nice if you could somehow NOT sync the effects of segment merging, but do
>> sync doc add/deletes... not sure how to do that.
>>
>> However, we need to ensure that the index on the slave is actually in sync
>>> with the master. So that on another commit, we can blindly transfer the
>>> delta to the slave.
>>>
>>
>> I assume your app ensures that no deltas arrive to the slave while it's
>> running optimize?  So then the question becomes (I think) "if two indices
>> are identical to begin with, and I separately run optimize on each, will the
>> resulting two optimized indices be identical?".
>>
>> By "in sync" you also require the final segment name (after optimize) is
>> identical right?
>>
>> I think the answer is yes, but I'm not certain unless I think more about
>> it.  Also this behavior is not "promised" in Lucene's API.
>>
>> Merges for optimize are now allowed to run concurrently (by default, with
>> ConcurrentMergeScheduler), except for the final (< mergeFactor segments)
>> merge, which waits until others have finished.  So if there are 7 obvious
>> merges necessary to optimize, 3 will run concurrently, while 4 wait.  Those
>> 4 then run as the merges finish over time, which may happen in different
>> orders for each index and so different merges may run.  Then the final merge
>> will run and I *think* the net number of merges that ran should always be
>> the same and so the final segment name should be the same.
>>
>> Mike

Re: How can we know if 2 lucene indexes are same?

Michael McCandless-2
In reply to this post by Shalin Shekhar Mangar

Shalin Shekhar Mangar wrote:

> On Fri, Sep 5, 2008 at 6:03 PM, Michael McCandless <
> [hidden email]> wrote:
>
>>
>> Large segment merges will also send huge traffic.  You may just  
>> want to
>> send all updates (document adds/deletes) to all slaves directly?  
>> It'd be
>> nice if you could somehow NOT sync the effects of segment merging,  
>> but do
>> sync doc add/deletes... not sure how to do that.
>>
>
> As Noble said, that is another option we can consider.

Well this is certainly a nice challenging problem :)

> Thanks for the explanation Mike. The core problem is to make sure both
> indices are in sync. The log replication helps us because we compare  
> the
> master and slave index with a reference point (log position). If it  
> becomes
> possible for us to specify a version number during a commit, we can  
> use the
> master's version number on the slave. This can help us compare the two
> indices. Not sure if that API change will be generally useful.  
> Thoughts?

I think this could be a generally useful feature?

So you're thinking IndexWriter.commit() would take an optional opaque  
argument (maybe a String for generality?) that's recorded into the  
segments_N and could then later be retrieved by IndexReader and  
IndexWriter?

After a merge completes, should it just carry forward whatever was  
stored on the last segments_N?

We should call it something other than version, which already exists  
-- maybe "commitDetails", "commitComment", "commitUserData" or  
something?

Mike
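
A minimal sketch of the semantics being proposed, modeled in plain Java rather than against Lucene itself. Everything here is hypothetical (`ToyIndex`, `CommitPoint`, `commit(String)`, `commitAfterMerge()`, `inSync`); none of it is Lucene's API. It only illustrates the two points above: an opaque string is recorded with each commit, and a commit written after a merge carries the previous string forward.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Toy model only; all names are hypothetical and not Lucene classes.
final class CommitUserDataSketch {

    /** Stand-in for one segments_N file: a generation plus the opaque string. */
    static final class CommitPoint {
        final long generation;
        final String userData;

        CommitPoint(long generation, String userData) {
            this.generation = generation;
            this.userData = userData;
        }
    }

    /** Stand-in for an index that remembers its commit points. */
    static final class ToyIndex {
        private final List<CommitPoint> commits = new ArrayList<>();

        /** Application commit: records the caller's opaque string. */
        void commit(String userData) {
            commits.add(new CommitPoint(commits.size() + 1, userData));
        }

        /** Commit written after a merge: carries the previous string forward. */
        void commitAfterMerge() {
            commit(latestUserData());
        }

        /** The string stored with the newest commit, or null if none exists. */
        String latestUserData() {
            return commits.isEmpty() ? null : commits.get(commits.size() - 1).userData;
        }

        long latestGeneration() {
            return commits.size();
        }
    }

    /** Master and slave are treated as in sync when their newest strings match. */
    static boolean inSync(ToyIndex master, ToyIndex slave) {
        return Objects.equals(master.latestUserData(), slave.latestUserData());
    }

    public static void main(String[] args) {
        ToyIndex master = new ToyIndex();
        ToyIndex slave = new ToyIndex();

        master.commit("logPosition=1042");
        slave.commit("logPosition=1042"); // delta replicated with the same marker
        slave.commitAfterMerge();         // slave optimizes locally

        // The generations now differ, but the carried-forward marker still matches.
        System.out.println(master.latestGeneration() + " " + slave.latestGeneration()
                + " " + inSync(master, slave));
    }
}
```

The point of the model is that the comparison does not depend on generation numbers or segment names, only on the string the application chose to store.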


Re: How can we know if 2 lucene indexes are same?

Shalin Shekhar Mangar
On Fri, Sep 5, 2008 at 9:52 PM, Michael McCandless <
[hidden email]> wrote:

>
> Well this is certainly a nice challenging problem :)


Yes it is :-)

> I think this could be a generally useful feature?
>
> So you're thinking IndexWriter.commit() would take an optional opaque
> argument (maybe a String for generality?) that's recorded into the
> segments_N and could then later be retrieved by IndexReader and IndexWriter?
>
> After a merge completes, should it just carry forward whatever was stored
> on the last segments_N?
>
> We should call it something other than version, which already exists --
> maybe "commitDetails", "commitComment", "commitUserData" or something?
>
>
Thinking more on this, we may not need to modify the index format at all for
this use-case. This is easily achieved in the current system by adding a
dummy document which Solr can read/write -- not very elegant but it can work
:-)

Using the version came to my mind because I didn't see it as very useful by
itself. It is just the current time stamp as a long, incremented for every
commit.

--
Regards,
Shalin Shekhar Mangar.

Re: How can we know if 2 lucene indexes are same?

mark harwood

>
> I think this could be a generally useful feature?
>  
+1. I could definitely use a "commitUserData" option for the same reasons.

> Thinking more on this, we may not need to modify the index format at all for
> this use-case. This is easily achieved in the current system by adding a
> dummy document which Solr can read/write -- not very elegant but it can work
>  

I thought about this but was uncomfortable with the idea of adding an
extra doc - some use cases that become troublesome for any application
logic are:
1) IndexReader.numDocs/IndexReader.maxDoc will give "incorrect" values
2) Any queries of the type "all documents *without* a value for field X"
will return the commit.userdata document.
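
Both problems can be seen in a toy model. The sketch below uses plain Java collections as a stand-in for an index, so `MarkerDocPitfalls`, `buildIndex` and `withoutField` are hypothetical names and nothing here calls Lucene. The list size plays the role of numDocs, and `withoutField` plays the role of a "no value for field X" query.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only; a document is just a field-name -> value map.
final class MarkerDocPitfalls {

    /** Builds a document from alternating field names and values. */
    static Map<String, String> doc(String... fieldsAndValues) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (int i = 0; i + 1 < fieldsAndValues.length; i += 2) {
            fields.put(fieldsAndValues[i], fieldsAndValues[i + 1]);
        }
        return fields;
    }

    /** Two real documents plus the dummy commit-marker document. */
    static List<Map<String, String>> buildIndex() {
        List<Map<String, String>> docs = new ArrayList<>();
        docs.add(doc("title", "first", "category", "news"));
        docs.add(doc("title", "second"));                    // real doc, no category
        docs.add(doc("commitUserData", "logPosition=1042")); // dummy marker doc
        return docs;
    }

    /** Models a query for all documents without a value for the given field. */
    static List<Map<String, String>> withoutField(List<Map<String, String>> docs,
                                                  String field) {
        List<Map<String, String>> hits = new ArrayList<>();
        for (Map<String, String> d : docs) {
            if (!d.containsKey(field)) {
                hits.add(d);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Map<String, String>> index = buildIndex();
        // The count includes the marker, although only two documents are real.
        System.out.println("numDocs = " + index.size());
        // The marker matches too, because it has no "category" field either.
        System.out.println("without category = " + withoutField(index, "category").size());
    }
}
```

Every count and every negative query would need an extra filter to exclude the marker, which is the application-logic burden described above.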

I was toying with the idea of maintaining my own commit.userdata file
which I would manage in my framework when calling IndexWriter.commit but
this does not feel as clean as Lucene core code holding the user data in
the segments file.

Cheers
Mark
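
A minimal sketch of the sidecar-file idea, assuming a single key/value pair stored in a `commit.userdata` properties file next to the index files. `CommitUserDataFile`, `write`, `read` and the `logPosition` key are hypothetical names chosen for illustration; the sketch uses only the Java standard library and does not call Lucene.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Properties;

// Hypothetical helper managed by the application, not by Lucene.
final class CommitUserDataFile {

    static final String FILE_NAME = "commit.userdata";

    /** Stores one key/value pair in the sidecar file inside indexDir. */
    static void write(Path indexDir, String key, String value) throws IOException {
        Properties props = new Properties();
        props.setProperty(key, value);
        // Write to a temporary file first, then move it into place, so that a
        // reader is unlikely to see a half-written file.
        Path tmp = indexDir.resolve(FILE_NAME + ".tmp");
        try (OutputStream out = Files.newOutputStream(tmp)) {
            props.store(out, null);
        }
        Files.move(tmp, indexDir.resolve(FILE_NAME), StandardCopyOption.REPLACE_EXISTING);
    }

    /** Returns the stored value, or null if the file or the key is missing. */
    static String read(Path indexDir, String key) throws IOException {
        Path file = indexDir.resolve(FILE_NAME);
        if (!Files.exists(file)) {
            return null;
        }
        Properties props = new Properties();
        try (InputStream in = Files.newInputStream(file)) {
            props.load(in);
        }
        return props.getProperty(key);
    }

    public static void main(String[] args) throws IOException {
        Path master = Files.createTempDirectory("master-index");
        Path slave = Files.createTempDirectory("slave-index");
        // In a real framework these calls would follow the index commit.
        write(master, "logPosition", "1042");
        write(slave, "logPosition", "1042");
        System.out.println(read(master, "logPosition").equals(read(slave, "logPosition")));
    }
}
```

The file and the index commit remain two separate writes, so a crash between them can leave the two out of step. That is part of why this approach feels less clean than having the user data stored in the segments file.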

