To clone or have a pluggable docidbitset for IndexReader


Jason Rutherglen
Hello,

I'm trying to figure out the best way to build a realtime system in which
the deletedDocs do not need to be saved. There are two possible methods:
1) setting the DocIdBitSet manually (which breaks saving, among other things,
but does not require cloning the norms), or 2) implementing IndexReader.clone,
which requires "copy on write" for deletedDocs and norms.

The discussion about reopen (https://issues.apache.org/jira/browse/LUCENE-743)
was lengthy, and I can see from the code and the discussion why no one wants to
revisit IndexReader.reopen in the form of IndexReader.clone and possibly mess things up.

Is there some easier alternative API that I'm missing?

-J

Re: To clone or have a pluggable docidbitset for IndexReader

Michael McCandless-2

Jason,

Is your need for IndexReader.clone entirely driven by needing a fast  
way to swap in your own deleted docs?

Meaning, if you could plug in your own deleted docs to a reader  
(somehow), would you not use clone anymore?

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: To clone or have a pluggable docidbitset for IndexReader

Jason Rutherglen
Mike,

> needing a fast way to swap in your own deleted docs?

Yes, however it is also necessary to get a new IndexReader from a "reopened" reader, so clone seems the best approach (unless there's a way I'm not seeing).  The clone code is coming along, and the norms test seems to pass.  This works as long as rules similar to those for reopen are followed; from the javadoc: "The re-opened reader instance and the old instance might share the same resources. For this reason no index modification operations (e. g. deleteDocument(int), setNorm(int, String, byte)) should be performed using one of the readers until the old reader instance is closed. Otherwise, the behavior of the readers is undefined."

I think the clone method javadoc should read "After cloning a reader, the original reader will throw exceptions on index modification operations (e. g. deleteDocument(int), setNorm(int, String, byte))".  This way one may read from the original, but the cloned reader (new reader) may accept updates.  This happens by automatically releasing the original's lock on clone (does this cause any unforeseen problems?).

Jason


Re: To clone or have a pluggable docidbitset for IndexReader

Michael McCandless-2

So it seems like a cloned reader would share everything with the
previous reader, but these rules would be enforced:

   * If the old reader had pending changes (held the write lock) when
     it was cloned, it 1) transfers the write lock to the clone, 2)
     refuses any further changes to itself (freezes itself), 3)
     continues to reflect the pending changes, and 4) will not commit
     its changes to disk when it's closed.  I.e., it freezes itself into
     a "point in time" snapshot, just not via an on-disk index.

   * If any changes (to deletions or norms) are done with the new
     reader, it then makes a private copy ("copy on write").  This
     would apply to reopen too, since clone & reopen share the same
     code; so this is an "improvement" over the current reopen
     semantics and we should fix the javadocs saying so.

It seems like the only reason to clone would be if you intend to
[further] change deletions or norms but still want to use the previous
reader with the unchanged deletions and norms, i.e. "snapshot" the
previous reader without going through disk as intermediary, right?
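
As a rough illustration of the two rules above, here is a minimal sketch. It is not the LUCENE-1314 patch and does not use Lucene's API: SketchReader, cloneReader and the boolean "write lock" are made-up names for this example, only deletions are modeled (norms would follow the same pattern), and the choice of UnsupportedOperationException is an assumption. Lock contention between two unfrozen readers is not modeled either.

```java
import java.util.BitSet;

/** Hypothetical stand-in for a reader; this is not Lucene's IndexReader. */
final class SketchReader {
    private BitSet deletedDocs;   // may be shared with another reader
    private boolean shared;       // true while deletedDocs is shared
    private boolean hasWriteLock; // true once this reader has pending changes
    private boolean frozen;       // true once this reader must refuse changes

    SketchReader(int maxDoc) {
        this.deletedDocs = new BitSet(maxDoc);
    }

    private SketchReader(BitSet deletedDocs, boolean hasWriteLock) {
        this.deletedDocs = deletedDocs;
        this.shared = true;
        this.hasWriteLock = hasWriteLock;
    }

    /** Rule 1: a reader with pending changes hands its lock to the clone and freezes. */
    synchronized SketchReader cloneReader() {
        SketchReader clone = new SketchReader(deletedDocs, hasWriteLock);
        shared = true;
        if (hasWriteLock) {
            hasWriteLock = false;
            frozen = true;
        }
        return clone;
    }

    /** Rule 2: the first change after sharing takes a private copy. */
    synchronized void deleteDocument(int docId) {
        if (frozen) {
            throw new UnsupportedOperationException("reader was cloned and is frozen");
        }
        if (shared) {
            deletedDocs = (BitSet) deletedDocs.clone();
            shared = false;
        }
        hasWriteLock = true; // stands in for acquiring the real write lock
        deletedDocs.set(docId);
    }

    synchronized boolean isDeleted(int docId) {
        return deletedDocs.get(docId);
    }

    /** Returns true if closing would commit pending changes; a frozen reader never does. */
    synchronized boolean close() {
        return hasWriteLock;
    }
}

final class CloneSketchDemo {
    public static void main(String[] args) {
        SketchReader original = new SketchReader(8);
        original.deleteDocument(1);                  // original now has pending changes
        SketchReader clone = original.cloneReader(); // original becomes a frozen snapshot
        clone.deleteDocument(2);                     // clone copies the bits, then deletes
        System.out.println(original.isDeleted(2));   // false: the snapshot is unchanged
        System.out.println(clone.isDeleted(1));      // true: earlier deletion carried over
    }
}
```

In this sketch both readers simply mark the bit set as shared, so each one copies on its first write. A real implementation would probably track sharing more carefully (for example with reference counts) to avoid a redundant copy.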

I think this is a reasonable use case.  Since an IndexReader can still
make changes (something I think we should eventually move away from,
but cannot, yet, because of the immediacy of deletions that
IndexReader offers), cloning is an important tool to let you make an
efficient "point in time" snapshot (without having to go through the
Directory).

If this makes sense, can you update the patch on LUCENE-1314 to
enforce these semantics?  I think we should get this in for 2.9?

Mike


Re: To clone or have a pluggable docidbitset for IndexReader

Jason Rutherglen
> ie "snapshot" the previous reader without going through disk as intermediary, right?

Yes. 

>  refuses any further changes to itself (freezes itself)

Should I create a new variable for "refuse updates/freeze" or reuse readOnly?  If the variable is true, should doClose throw an exception?  If someone clones a cloned reader that holds no lock, then the resulting reader is frozen as well.  Let's define more precisely how a frozen reader behaves, which exceptions it throws, and from where.

Copy on write for norms and deletedDocs is implemented.  Now that we've more or less agreed on the rules, I can make sure the unit tests exercise them.

> If this makes sense, can you update the patch on LUCENE-1314 to enforce these semantics? 

Yes.

> I think we should get this in for 2.9?

Should be doable!


Re: To clone or have a pluggable docidbitset for IndexReader

Michael McCandless-2

Jason Rutherglen wrote:

> > ie "snapshot" the previous reader without going through disk as
> > intermediary, right?
>
> Yes.
>
> > refuses any further changes to itself (freezes itself)
>
> Should I create a new variable for "refuse updates/freeze" or use
> readOnly?  If the variable is true, should doClose throw an
> exception?  If someone tries to clone a cloned reader that has no
> lock then the resulting reader is frozen as well.  Let's define more
> about how frozen behaves and the exceptions thrown and from where
> for a frozen reader.

Setting readOnly seems reasonable, on quick thought.  You should add a  
testcase that asserts an exception is hit on trying to make changes to  
a cloned reader.
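
A minimal sketch of what such a check could look like, assuming a single read-only flag guards every mutator. FlaggedReader, markReadOnly and the refuses helper are hypothetical names for this example, not Lucene classes or the actual test in the patch, and the exception type is an assumption.

```java
/** Hypothetical reader with a read-only flag; names are illustrative, not Lucene's. */
final class FlaggedReader {
    private final byte[] norms = new byte[4];
    private final boolean[] deleted = new boolean[4];
    private boolean readOnly;

    void markReadOnly() {
        readOnly = true;
    }

    void deleteDocument(int docId) {
        ensureWritable();
        deleted[docId] = true;
    }

    void setNorm(int docId, byte value) {
        ensureWritable();
        norms[docId] = value;
    }

    boolean isDeleted(int docId) {
        return deleted[docId];
    }

    /** Single guard shared by all mutators, so no change path can skip the check. */
    private void ensureWritable() {
        if (readOnly) {
            throw new UnsupportedOperationException("reader is read-only");
        }
    }
}

final class FrozenReaderCheck {
    /** Returns true if the change is refused with UnsupportedOperationException. */
    static boolean refuses(Runnable change) {
        try {
            change.run();
            return false;
        } catch (UnsupportedOperationException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        FlaggedReader reader = new FlaggedReader();
        reader.deleteDocument(0); // allowed: not yet read-only
        reader.markReadOnly();    // stands in for "this reader was cloned"
        if (!refuses(() -> reader.deleteDocument(1))) {
            throw new AssertionError("deleteDocument should be refused");
        }
        if (!refuses(() -> reader.setNorm(1, (byte) 7))) {
            throw new AssertionError("setNorm should be refused");
        }
        if (!reader.isDeleted(0) || reader.isDeleted(1)) {
            throw new AssertionError("state changed after freezing");
        }
        System.out.println("frozen reader refused all changes");
    }
}
```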

> Copy on write for norms and deletedDocs is implemented; now that
> we've more or less agreed on the rules, I can make sure the unit
> tests reflect testing the rules.
>
> > If this makes sense, can you update the patch on LUCENE-1314 to
> > enforce these semantics?
>
> Yes.
>
> > I think we should get this in for 2.9?
>
> Should be doable!

OK thanks!

Mike


Re: To clone or have a pluggable docidbitset for IndexReader

Yonik Seeley
In reply to this post by Michael McCandless-2
On Tue, Dec 16, 2008 at 10:00 AM, Michael McCandless
<[hidden email]> wrote:
> Is your need for IndexReader.clone entirely driven by needing a fast way to
> swap in your own deleted docs?

Could this be done with a FilteredIndexReader subclass that keeps
track of additional deletions?
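
To make the idea concrete, here is a small sketch of a wrapper that layers extra in-memory deletions over an unchanged base reader. It is written against a made-up DeletionView interface rather than Lucene's FilterIndexReader, so every name in it is an assumption for illustration only.

```java
import java.util.BitSet;

/** Hypothetical minimal reader view; Lucene's real reader API is much larger. */
interface DeletionView {
    int maxDoc();
    boolean isDeleted(int docId);
}

/** Base view whose deletions never change, standing in for the wrapped reader. */
final class FixedDeletions implements DeletionView {
    private final int maxDoc;
    private final BitSet deleted;

    FixedDeletions(int maxDoc, int... deletedDocIds) {
        this.maxDoc = maxDoc;
        this.deleted = new BitSet(maxDoc);
        for (int docId : deletedDocIds) {
            deleted.set(docId);
        }
    }

    public int maxDoc() { return maxDoc; }
    public boolean isDeleted(int docId) { return deleted.get(docId); }
}

/** Wraps a base view and keeps additional deletions without touching the base. */
final class ExtraDeletionsView implements DeletionView {
    private final DeletionView base;
    private final BitSet extraDeletions;

    ExtraDeletionsView(DeletionView base) {
        this.base = base;
        this.extraDeletions = new BitSet(base.maxDoc());
    }

    public int maxDoc() { return base.maxDoc(); }

    /** A document is deleted if either the base or this wrapper deleted it. */
    public synchronized boolean isDeleted(int docId) {
        return extraDeletions.get(docId) || base.isDeleted(docId);
    }

    synchronized void deleteDocument(int docId) {
        if (docId < 0 || docId >= base.maxDoc()) {
            throw new IndexOutOfBoundsException("docId out of range: " + docId);
        }
        extraDeletions.set(docId);
    }

    /** Counts documents that neither layer has deleted. */
    synchronized int numDocs() {
        int live = 0;
        for (int docId = 0; docId < base.maxDoc(); docId++) {
            if (!isDeleted(docId)) {
                live++;
            }
        }
        return live;
    }
}

final class ExtraDeletionsDemo {
    public static void main(String[] args) {
        DeletionView base = new FixedDeletions(4, 0); // doc 0 already deleted
        ExtraDeletionsView view = new ExtraDeletionsView(base);
        view.deleteDocument(2);                       // recorded only in the wrapper
        System.out.println(base.isDeleted(2));        // false: the base is untouched
        System.out.println(view.numDocs());           // 2: docs 1 and 3 remain live
    }
}
```

With a real Lucene reader, a wrapper like this would presumably also need to skip the extra deleted documents when enumerating term docs and positions, since overriding the deleted check alone does not filter them out of search.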

-Yonik
