Unique doc ids


Unique doc ids

Michael Busch
Hi Team,

The question of how to delete by doc id with IndexWriter is currently
being discussed on java-user
(http://www.gossamer-threads.com/lists/lucene/java-user/57228), so I
thought this would be a good time to mention an idea I recently had. I'm
planning to work on column-stored fields soon (I used to call them
per-document payloads). Then we'll have the ability to store metadata
for each document very efficiently in the index.

This new data structure could be used to store a unique ID for each doc
in the index. The IndexReader would then get an API that provides a
mapping from the dynamic doc ids to the new unique ones. We would also
have to store a reverse mapping (UID -> ID) in the index - we could use
a VInt list + skip list for that.

Then we should be able to make IndexReaders "read-only" (LUCENE-1030)
and provide a new "delete by UID" API in IndexWriter. This would make
"delete by query" possible as well. The disadvantage is that the index
would become bigger, but that should still be OK: 8 bytes per doc for
the ID->UID map (assuming we use long for the UID, which I'd suggest).
The UID->ID map might even be a bit smaller initially (using VInts and
VLongs), but could grow when the index has lots of deleted docs, because
then the delta encoding wouldn't be as efficient anymore for the UIDs.
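A minimal sketch of the delta + VInt/VLong idea for the UID list
(writeVLong is Lucene's real IndexOutput API; the surrounding helper is
hypothetical):

   // Hypothetical helper: write a sorted UID list as deltas. Small deltas
   // encode to few bytes; many deleted docs mean larger gaps, hence the
   // note above about the encoding degrading.
   void writeUidList(IndexOutput out, long[] sortedUids) throws IOException {
     long prev = 0;
     for (int i = 0; i < sortedUids.length; i++) {
       out.writeVLong(sortedUids[i] - prev);
       prev = sortedUids[i];
     }
   }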

If RAM permits, the maps could also be cached in memory (optional,
configurable). The FieldCache overhaul (LUCENE-831) with column fields
as source can help here.

After all this is implemented (column fields, UIDs, "read-only"
IndexReaders, FieldCache overhaul) I'd like to make the column fields
(and norms) updateable via IndexWriter.

OK, lots of food for thought.

-Michael


Re: Unique doc ids

Terry Yang
Hi Michael,
Your idea is good, but I have a question. Thanks for your help!

How do you plan to store a unique ID for each doc? My understanding is
that we would add a field (e.g. "uniqueid") to each doc, where the field
has a single, identical token value. We could add the unique ID as a
payload on that token before indexing, so we can use
IndexReader.termPositions() to get all the UIDs and IDs.
Can you explain more about how you store the reverse UID->ID mapping?
How do you guarantee that a UID maps to the correct dynamic ID? I mean,
if a doc id is 5 and then for some reason changes to 60, wouldn't you
still have UID->5 stored in a file/memory?

On 1/22/08, Michael Busch <[hidden email]> wrote:

> [...]

Re: Unique doc ids

Paul Elschot
In reply to this post by Michael Busch
Michael,

How would IndexWriter.addIndexes() work with unique doc ids?

Regards,
Paul Elschot


On Tuesday 22 January 2008 12:07:16, Michael Busch wrote:

> [...]




Re: Unique doc ids

Michael Busch
In reply to this post by Terry Yang
Terry Yang wrote:
> Hi Michael,
> Your idea is good, but I have a question. Thanks for your help!
>

Hi Terry,

> Can you explain more about how you store the reverse UID->ID mapping?
> How do you guarantee that a UID maps to the correct dynamic ID? I mean,
> if a doc id is 5 and then for some reason changes to 60, wouldn't you
> still have UID->5 stored in a file/memory?
>
>

Good question!

You can think of a UID as a special, unique term that every document
has. Let's say we have the following segment:

S1:
UID -> ID
  0 ->  0
  1 ->  1
  2 ->  2

Now we flush the segment, add two docs, update the document with UID=1,
and add another doc. Then we'll have these two segments:

S1:
UID -> ID
  0 ->  0
  1 ->  1 (deleted)
  2 ->  2

S2:
UID -> ID
  1 ->  2
  3 ->  0
  4 ->  1
  5 ->  3

You can view the UIDs as terms with a posting list, each list containing
just one posting. Now we want to find the ID for UID=1: in the example
we have two segments with the same UID=1. However, we know that the doc
in S1 with ID=1 is deleted, so we keep looking in the other segment(s)
for the UID until we find one whose corresponding ID is not deleted.
There can only be one valid entry at any time for one UID.
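A rough sketch of that lookup, assuming a hypothetical per-segment
lookupUid method (isDeleted is the real IndexReader API):

   // Hypothetical sketch: resolve a UID to its current dynamic doc id by
   // checking each segment and skipping entries whose doc is deleted.
   int lookupDocId(SegmentReader[] segments, long uid) {
     for (int i = 0; i < segments.length; i++) {
       int docId = segments[i].lookupUid(uid); // hypothetical UID->ID map
       if (docId != -1 && !segments[i].isDeleted(docId)) {
         return docId; // at most one live entry exists per UID
       }
     }
     return -1; // UID not in the index (or only in deleted docs)
   }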

Of course we shouldn't really use a term + posting list for the UIDs,
because this would be quite inefficient with the data structures we
currently have: we wouldn't want to store the UIDs as Strings, and we
wouldn't need to store e.g. freq or positions. Also, we might be able to
implement some heuristics to optimize the order in which we iterate the
segments for the UID lookup.

I believe this should work?

-Michael


Re: Unique doc ids

Michael Busch
In reply to this post by Paul Elschot
Paul Elschot wrote:
> Michael,
>
> How would IndexWriter.addIndexes() work with unique doc ids?

Hi Paul,

it would probably be a limitation of this design. The only way I can
think of right now to ensure that the UIDs don't change during an
addIndexes() is an API in IndexWriter like setMinUID(long). When you
create an index and know that you'll add it to another one via
addIndexes(), you could use this method to set the minimum UID value in
that index to the maximum number of add/update operations you'd expect
in the other index.
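A hypothetical usage sketch (setMinUID doesn't exist; this just
illustrates the idea):

   // Hypothetical: reserve UID space in the index that will later be
   // merged into the target, so the two indexes hand out disjoint UIDs.
   IndexWriter small = new IndexWriter(smallDir, analyzer, true);
   small.setMinUID(100000000L); // hypothetical API: max ops expected in target
   // ... add documents, close ...
   target.addIndexes(new Directory[] { smallDir }); // UIDs stay disjoint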

Please note that the UIDs I'm thinking about here would not affect the
index order. All postings would still be stored in (dynamic) doc id
order. This means that with this design the search results would not be
returned in UID order, so the UIDs couldn't be used efficiently for e.g.
a join operation with an external data structure (e.g. a database). I
think in this regard my proposed UID design differs from what was
discussed here some time ago.

The main use case here is to get rid of readers that do write
operations. I think that would be very desirable when we implement
updateable column fields. Then you could use the UIDs that an
IndexReader returned to delete or update docs or the column
fields/norms, and you wouldn't have to worry about IndexReaders being
"in sync" with the IndexWriters.

Maybe the UID design I'm thinking out loud about here is total overkill
for the mentioned use cases. I'm open to and interested in alternative
ideas!

-Michael



Re: Unique doc ids

Michael McCandless-2

Michael,

Couldn't we add deleteByQuery to IndexWriter without adding the UID
field?

Would that be "enough" to make IndexReader read-only (i.e., do we still
really need to delete by docID from IndexWriter)?

If we still need that ... maybe we could extend IndexWriter so that you
can hold a lock on docIDs changing while you do your stuff, e.g.:

   writer.freezeDocIDs();
   try {
     // get docIDs from somewhere, then:
     writer.deleteByDocID(docID);
   } finally {
     writer.unfreezeDocIDs();
   }

If we went that route, we'd need to expose methods in IndexWriter to
let you get reader(s), and to then delete by docID.

I'm not certain this will work :)  I'm just throwing alternative ideas
out...

I do like the idea of a UID field, but I'm a bit nervous about having
the "core" maintain it and then having things in the core that depend on
its presence.  At first it might be optional, but I could see us over
time adding more and more functionality that requires the UID to be
present, to the point where it's eventually not really optional...

Mike

Michael Busch wrote:

> [...]



Re: Unique doc ids

Grant Ingersoll-2

On Jan 23, 2008, at 6:34 AM, Michael McCandless wrote:

>
> At first it might be optional,

+1

There are still applications that don't require a UID, or are static  
for long enough periods of time that the Lucene internal id is  
sufficient, so I would hate to impose this on those apps.

I think the "per doc payloads" is a good idea, but I don't know if we  
need to provide explicit UID functionality on top of that.  Or, if we  
do, it could be an optional layer on top of the existing functionality.

-Grant


Re: Unique doc ids

Yonik Seeley-2
In reply to this post by Michael McCandless-2
On Jan 23, 2008 6:34 AM, Michael McCandless <[hidden email]> wrote:
>    writer.freezeDocIDs();
>    try {
>      // get docIDs from somewhere, then:
>      writer.deleteByDocID(docID);
>    } finally {
>      writer.unfreezeDocIDs();
>    }

Interesting idea, but it would require the IndexWriter to flush the
buffered docs so an IndexReader could be created for them (or it would
require the existence of an UnflushedDocumentsIndexReader).

> If we went that route, we'd need to expose methods in IndexWriter to
> let you get reader(s), and, to then delete by docID.

Right... I had envisioned a callback that is called after a new segment
is created/flushed and is passed an IndexReader[].  In an environment of
mixed deletes and adds, it would avoid slowing down the indexing part by
limiting where the deletes happen.
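A sketch of what such a callback might look like (entirely hypothetical
API):

   // Hypothetical: IndexWriter invokes this after flushing a new segment,
   // handing the app readers it can use to find doc ids and delete by them.
   public interface FlushedSegmentListener {
     void segmentFlushed(IndexReader[] segmentReaders) throws IOException;
   }
   // writer.setFlushedSegmentListener(listener); // hypothetical registration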

It does put a little more burden on the user, but a slightly harder
(but more powerful / more efficient) API is preferable since easier
APIs can always be built on top (but not vice-versa).

> I do like the idea of a UID field, but I'm a bit nervous about having
> the "core" maintain it

+1

-Yonik


Re: Unique doc ids

Nadav Har'El
In reply to this post by Michael Busch
Hi Michael,

On Tue, Jan 22, 2008, Michael Busch wrote about "Unique doc ids":
> The question of how to delete by doc id with IndexWriter is
>...
> mapping from the dynamic doc ids to the new unique ones. We would also
> have to store a reverse mapping (UID -> ID) in the index - we could use
> a VInt list + skip list for that.
> Then we should be able to make IndexReaders "read-only" (LUCENE-1030)
> and provide a new "delete by UID" API in IndexWriter.

It sounds to me that this list would be split according to segments,
right? In that case, whoever wants to read this list will need to behave
like an IndexReader (which opens all segments), not an IndexWriter
(which writes to only one segment). So it still makes some sort of
twisted sense to have "delete by UID" in the IndexReader (as
deleteDocument was originally).

In any case, I'm afraid I don't understand how your proposal to add
special "UIDs" differs from the existing situation, where you can put
your UIDs in a certain field (e.g., call that field "UID" if you want)
and then use IndexWriter.deleteDocuments(term) to delete the documents
(in your case, just one) with that term. How is your new suggestion
better, or more efficient?
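For reference, the existing approach described here, using real Lucene
2.x API:

   // Existing approach: index an application-level UID as an untokenized
   // field, then delete by term -- no new index-level UID concept needed.
   Document doc = new Document();
   doc.add(new Field("uid", "12345", Field.Store.NO, Field.Index.UN_TOKENIZED));
   writer.addDocument(doc);
   // later:
   writer.deleteDocuments(new Term("uid", "12345"));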

> This would allow
> to "delete by query" as well.

Again, I don't understand how this makes a difference. In the existing
Lucene you can also theoretically run a query, get a list of doc ids,
and then delete them all. I say "theoretically" because unfortunately
the current IndexWriter interface doesn't support the necessary calls
(either a deleteDocuments(Query) or a deleteDocuments(int docid) call),
but I don't see why this couldn't be fixed without adding new concepts
(like UIDs) to the index. Or maybe I'm missing something?


--
Nadav Har'El                        |   Wednesday, Jan 23 2008, 17 Shevat 5768
IBM Haifa Research Lab              |-----------------------------------------
                                    |War doesn't determine who's right but
http://nadav.harel.org.il           |who's left.


Re: Unique doc ids

Michael McCandless-2
In reply to this post by Yonik Seeley-2

Yonik Seeley wrote:

> On Jan 23, 2008 6:34 AM, Michael McCandless  
> <[hidden email]> wrote:
>>    writer.freezeDocIDs();
>>    try {
>>      // get docIDs from somewhere, then:
>>      writer.deleteByDocID(docID);
>>    } finally {
>>      writer.unfreezeDocIDs();
>>    }
>
> Interesting idea, but it would require the IndexWriter to flush the
> buffered docs so an IndexReader could be created for them (or it would
> require the existence of an UnflushedDocumentsIndexReader)

True.

Actually, an UnflushedDocumentsIndexReader would not be hard!

DocumentsWriter already has an IndexInput (ByteSliceReader) that can
read the postings for a single term from the RAM buffer (this is used
when flushing the segment).  I think it'd be straightforward to get
TermEnum/TermDocs/TermPositions iterators over the buffered docs.
Norms are already stored as byte arrays in memory.  FieldInfos is
already available.  The stored fields & term vectors are already
flushed to the directory, so they could be read normally.

Hmm, buffered delete terms are tricky.  I guess freezeDocIDs would have
to flush buffered delete terms (and queries, if we add those) before
making a reader accessible; though the cost is shared, because the
readers need to be opened anyway (so the app can find docIDs).

So maybe this approach becomes this:

   // Returns a "point in time" frozen view of the index...
   IndexReader reader = writer.getReader();
   try {
     // get docIDs from reader, delete by docID
   } finally {
     writer.releaseReader();
   }

?

We may even be able to implement this w/o actually freezing the writer,
i.e., still allowing add/updateDocument calls to proceed.  Merging could
certainly still proceed.  This way you could at any time ask a writer
for a "point in time" reader, independent of what else you are doing
with the writer.  This would require, on flushing, that the writer go
and swap in a "real" segment reader, limited to a specified docID, for
any point-in-time readers that are open.

>> If we went that route, we'd need to expose methods in IndexWriter to
>> let you get reader(s), and, to then delete by docID.
>
> Right... I had envisioned a callback that is called after a new
> segment is created/flushed and is passed an IndexReader[].  In an
> environment of mixed deletes and adds, it would avoid slowing down the
> indexing part by limiting where the deletes happen.

This would certainly be less work :)  I guess the question is how
severely we'd be limiting the application by requiring that you can
only do deletes when IW decides to flush, or by forcing the
application to flush when it wants to do deletes.

> It does put a little more burden on the user, but a slightly harder
> (but more powerful / more efficient) API is preferable since easier
> APIs can always be built on top (but not vice-versa).

True, though emulating the easier API on top of the "you get to  
delete only when IW flushes" means you are forcing a flush, right?

Mike


Re: Unique doc ids

Yonik Seeley-2
On Jan 24, 2008 5:47 AM, Michael McCandless <[hidden email]> wrote:

> [...]
>
> Hmm, buffered delete terms are tricky.  I guess freezeDocIDs would
> have to flush buffered delete terms (and queries, if we add those)
> before making a reader accessible,

If we buffer queries, that would seem to take care of 99% of the use
cases that need an IndexReader, right?  A custom query could get ids
from an index however it wanted.
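A rough sketch of buffering delete-by-query (hypothetical IndexWriter
internals, 2.x-era search API):

   // Hypothetical sketch: queries are buffered at delete time and applied
   // against a reader when the segment is flushed.
   List bufferedDeleteQueries = new ArrayList();

   void deleteDocuments(Query q) {
     bufferedDeleteQueries.add(q);
   }

   void applyDeletesOnFlush(IndexReader segmentReader) throws IOException {
     IndexSearcher searcher = new IndexSearcher(segmentReader);
     for (Iterator it = bufferedDeleteQueries.iterator(); it.hasNext();) {
       Hits hits = searcher.search((Query) it.next());
       for (int i = 0; i < hits.length(); i++) {
         segmentReader.deleteDocument(hits.id(i));
       }
     }
     bufferedDeleteQueries.clear();
   }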

> [...]
>
> We may even be able to implement this w/o actually freezing the
> writer, i.e., still allowing add/updateDocument calls to proceed.
> Merging could certainly still proceed.  This way you could at any
> time ask a writer for a "point in time" reader, independent of what
> else you are doing with the writer.  This would require, on flushing,
> that the writer go and swap in a "real" segment reader, limited to a
> specified docID, for any point-in-time readers that are open.

Wow... sounds complex.

> [...]
>
> This would certainly be less work :)  I guess the question is how
> severely we'd be limiting the application by requiring that you can
> only do deletes when IW decides to flush, or by forcing the
> application to flush when it wants to do deletes.

Seems like more work, rather than limiting... "when" really isn't as
important as long as it's before a new external IndexReader is opened
for searching.

> > It does put a little more burden on the user, but a slightly harder
> > (but more powerful / more efficient) API is preferable since easier
> > APIs can always be built on top (but not vice-versa).
>
> True, though emulating the easier API on top of the "you get to
> delete only when IW flushes" means you are forcing a flush, right?

I was thinking via buffering (the same way term deletes are handled now).
You keep track of maxDoc() at the time of the delete and defer it until later.
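A sketch of that bookkeeping (hypothetical):

   // Hypothetical sketch: remember maxDoc() at delete time so a buffered
   // doc id delete can be applied safely after more docs are added --
   // ids assigned later (>= maxDocAtDeleteTime) are not affected.
   class BufferedDocIdDelete {
     final int docId;
     final int maxDocAtDeleteTime;
     BufferedDocIdDelete(int docId, int maxDoc) {
       this.docId = docId;
       this.maxDocAtDeleteTime = maxDoc;
     }
   }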

-Yonik


JBoss Cache as a store

Manik Surtani-2
Hi guys

I've just written a plugin for Lucene to use JBoss Cache as an index
store.  The benefits of something like this are:

1.  Faster access to indexes, as they will be in memory
2.  Indexes replicated across a cluster of servers
3.  Indexes "persisted" in clustered memory - faster than persistence
to disk

The implementation I have is pretty basic for now.

Is there a set of tests in the Lucene sources I could use to test the
"JBCDirectory", as I call it?  Perhaps some way I could change the
"index store provider" and re-run some existing tests, and perhaps add
some clustered tests specific to my plugin?

Finally, regarding hosting, I am happy to contribute this to Lucene
(alongside the JEDirectory, etc.), but if licensing (JBoss Cache is
LGPL, although the plugin code can be ASL if need be) or language levels
(the plugin depends on JBoss Cache 2.x, which requires JDK 5) are a
problem, then I'm happy to host the plugin externally.

Cheers,
--
Manik Surtani
Lead, JBoss Cache
[hidden email]








Re: Unique doc ids

Michael McCandless-2
In reply to this post by Yonik Seeley-2

Yonik Seeley wrote:

> On Jan 24, 2008 5:47 AM, Michael McCandless  
> <[hidden email]> wrote:
>> [...]
>>
>> Hmm, buffered delete terms are tricky.  I guess freezeDocIDs would
>> have to flush buffered delete terms (and queries, if we add those)
>> before making a reader accessible,
>
> If we buffer queries, that would seem to take care of 99% of the use
> cases that need an IndexReader, right?  A custom query could get ids
> from an index however it wanted.

I think so?

So, if we add only buffered "deleteByQuery" (and setNorm) to
IndexWriter, is that enough to deprecate deleteDocument and setNorm in
IndexReader?

>> [...]
>>
>> We may even be able to implement this w/o actually freezing the
>> writer, i.e., still allowing add/updateDocument calls to proceed.
>> Merging could certainly still proceed.  This way you could at any
>> time ask a writer for a "point in time" reader, independent of what
>> else you are doing with the writer.  This would require, on flushing,
>> that the writer go and swap in a "real" segment reader, limited to a
>> specified docID, for any point-in-time readers that are open.
>
> Wow... sounds complex.

I think it may not be so bad ... the raw ingredients are already done  
(like ByteSliceReader) ... need to ponder it some more.

I think one very powerful side effect of doing this would be that you
could have extremely low-latency indexing ("highly interactive
indexing").  You would add/delete docs using the writer, then quickly
re-open the reader, and be able to search the buffered docs without the
cost of flushing a new segment, assuming it's all within one JVM.
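Sketched with the hypothetical getReader/releaseReader API from above:

   // Hypothetical low-latency pattern: mutate via the writer, then grab a
   // cheap point-in-time reader that also sees the still-buffered docs.
   writer.addDocument(doc);
   IndexReader reader = writer.getReader(); // hypothetical, no segment flush
   try {
     Hits hits = new IndexSearcher(reader).search(query);
     // ... use hits ...
   } finally {
     writer.releaseReader(); // hypothetical
   }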

This reader (which searches both the on-disk segments and the writer's
buffered docs) would do reopen extremely efficiently.  In the [distant?]
future, it could even search "live", meaning the full buffer is always
searched rather than a point-in-time snapshot.  But we couldn't really
do that until we rework the FieldCache API to belong to each segment and
be incrementally updateable, such that if a new doc is added to the
writer we could efficiently update the FieldCache, if present.  That
would be a big change :)

Lots to think through ....

>>> [...]
>>
>> This would certainly be less work :)  I guess the question is how
>> severely we'd be limiting the application by requiring that you can
>> only do deletes when IW decides to flush, or by forcing the
>> application to flush when it wants to do deletes.
>
> Seems like more work, rather than limiting... "when" really isn't as
> important as long as it's before a new external IndexReader is opened
> for searching.

Right, but if you want very low-latency indexing (or even essentially
zero latency), then you can't really afford to buffer deletes (or adds)
for that long...

>>> It does put a little more burden on the user, but a slightly harder
>>> (but more powerful / more efficient) API is preferable since easier
>>> APIs can always be built on top (but not vice-versa).
>>
>> True, though emulating the easier API on top of the "you get to
>> delete only when IW flushes" means you are forcing a flush, right?
>
> I was thinking via buffering (the same way term deletes are handled  
> now).
> You keep track of maxDoc() at the time of the delete and defer it  
> until later.

Oh, right, OK.

Mike


Re: JBoss Cache as a store

Manik Surtani-2
In reply to this post by Manik Surtani-2
Bump.  Anyone?


On 24 Jan 2008, at 14:07, Manik Surtani wrote:

> [...]

--
Manik Surtani
Lead, JBoss Cache
[hidden email]








Re: JBoss Cache as a store

mark harwood
In reply to this post by Manik Surtani-2
Hi Manik,



>>> Is there a set of tests in the Lucene sources I could use to test
>>> the "JBCDirectory", as I call it?



You would probably need to adapt existing JUnit tests in
contrib/benchmark and src/test for performance and functionality
testing, respectively.

They use the existing RAMDirectory and FSDirectory Directory
implementations, so you'll need to change the test code to use your
JBCDirectory instead.



Cheers,

Mark



----- Original Message ----
From: Manik Surtani <[hidden email]>
To: [hidden email]
Sent: Tuesday, 29 January, 2008 3:38:17 PM
Subject: Re: JBoss Cache as a store

[...]


Re: JBoss Cache as a store

hossman
In reply to this post by Manik Surtani-2

: Is there a set of tests in the Lucene sources I could use to test the
: "JBCDirectory", as I call it?  Perhaps some way I could change the "index
: store provider" and re-run some existing tests, and perhaps add some clustered
: tests specific to my plugin?

I think most of the existing tests have the Directory impl hardcoded in
them ... the best thing to do might be to refactor the existing tests so
Directory creation comes from an overridable function in a subclass ...
come to think of it, Karl may have already done this as part of his
InstantiatedIndex patch (check jira) but i'm not sure ... the
conversation sounds familiar, but i think he was looking at facading the
entire IndexReader impl, not just the Directory, so any refactoring
approach he might have taken may not have gone far enough to work in
this case.

It would certainly be nice if there was an easy way to run every test
in the test suite against an arbitrary Directory implementation.
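A sketch of that refactoring (hypothetical test base class;
JBCDirectory's constructor is a placeholder):

   // Hypothetical: tests obtain their Directory from an overridable
   // factory method, so a subclass can swap in any Directory impl.
   public class DirectoryTestCase extends TestCase {
     protected Directory newDirectory() {
       return new RAMDirectory(); // default used by the stock tests
     }
   }

   public class JBCDirectoryTest extends DirectoryTestCase {
     protected Directory newDirectory() {
       return new JBCDirectory(); // hypothetical constructor
     }
   }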

: Finally, regarding hosting, I am happy to contribute this to Lucene (alongside
: the JEDirectory, etc.), but if licensing (JBoss Cache is LGPL, although the
: plugin code can be ASL if need be) or language levels (the plugin depends on
: JBoss Cache 2.x, which requires JDK 5) are a problem, then I'm happy to host
: the plugin externally.

contribs can already require 1.5 ... and soon the trunk will move to
1.5, so that's not really an issue. the licensing may be, but it depends
on how the integration with JBoss winds up working (ie: i don't know if
having the build scripts download JBoss at build time to compile against
it is allowed or not)




-Hoss



Re: JBoss Cache as a store

Manik Surtani-2

On 29 Jan 2008, at 22:30, Chris Hostetter wrote:

> [...]
>
> It would certainly be nice if there was an easy way to run every test in
> the test suite against an arbitrary Directory implementation.

Cool.  Well, for now, I'll follow Mark Harwood's recommendation to  
copy the relevant tests that use RAMDirectory and change the directory  
implementation.

>
> [...]
>
> the licensing may be, but it depends on how the integration with JBoss
> winds up working (ie: i don't know if having the build scripts download
> JBoss at build time to compile against it is allowed or not)
>

Who would the best person be to contact about this?  I'm assuming this
is not a problem, since the JEDirectory pulls down BDBJE stuff which
certainly isn't Apache-licensed.

Cheers,
--
Manik Surtani
Lead, JBoss Cache
[hidden email]








Re: JBoss Cache as a store

Karl Wettin
In reply to this post by hossman

On 29 Jan 2008, at 23:30, Chris Hostetter wrote:

> I think most of the existing tests have the Directory impl hardcoded
> in them ... the best thing to do might be to refactor the existing
> tests so Directory creation comes from an overridable function in a
> subclass ... come to think of it, Karl may have already done this as
> part of his

I did, but the patch was for old code and was removed as an artifact
when I came up with a simpler scheme: populate my store with the
contents of an FSDirectory and then assert the behaviour of two index
readers.

See TestCompareIndices.java in LUCENE-550.
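A minimal sketch of that kind of comparison (real IndexReader/TermEnum
API; how the custom store gets populated is up to the Directory impl):

   // Sketch: after copying an FSDirectory's contents into the custom
   // store, assert that readers over both see the same index
   // (spot-checking maxDoc and the term dictionary).
   IndexReader a = IndexReader.open(fsDirectory);
   IndexReader b = IndexReader.open(customDirectory);
   assertEquals(a.maxDoc(), b.maxDoc());
   TermEnum ta = a.terms();
   TermEnum tb = b.terms();
   while (ta.next()) {
     assertTrue(tb.next());
     assertEquals(ta.term(), tb.term());
   }
   assertFalse(tb.next());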


   karl


Re: JBoss Cache as a store

Otis Gospodnetic-2
In reply to this post by Manik Surtani-2
I believe download-and-build is OK (just like with JEDirectory), so that contrib would be great to have.  Only the code/libs *distributed* by ASF have to be ASL.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Manik Surtani <[hidden email]>
To: [hidden email]
Sent: Wednesday, January 30, 2008 8:38:48 AM
Subject: Re: JBoss Cache as a store


[...]


Re: JBoss Cache as a store

Manik Surtani-2
Ok, sweet.  I'll create a JIRA issue and submit a patch (presuming this
is the best approach).


On 31 Jan 2008, at 05:50, Otis Gospodnetic wrote:

> I believe download-and-build is OK (just like with JEDirectory), so  
> that contrib would be great to have.  Only the code/libs  
> *distributed* by ASF have to be ASL.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> [...]

--
Manik Surtani
Lead, JBoss Cache
[hidden email]






