[jira] Created: (LUCENE-2575) Concurrent byte and int block implementations

classic Classic list List threaded Threaded
51 messages Options
123
Reply | Threaded
Open this post in threaded view
|

[jira] Created: (LUCENE-2575) Concurrent byte and int block implementations

JIRA jira@apache.org
Concurrent byte and int block implementations
---------------------------------------------

                 Key: LUCENE-2575
                 URL: https://issues.apache.org/jira/browse/LUCENE-2575
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Index
    Affects Versions: Realtime Branch
            Reporter: Jason Rutherglen
             Fix For: Realtime Branch


The current *BlockPool implementations aren't quite concurrent.
We really need something that has a locking flush method, where
flush is called at the end of adding a document. Once flushed,
the newly written data would be available to all other reading
threads (ie, postings etc). I'm not sure I understand the slices
concept, it seems like it'd be easier to implement a seekable
random access file like API. One'd seek to a given position,
then read or write from there. The underlying management of byte
arrays could then be hidden?

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-2575) Concurrent byte and int block implementations

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893564#action_12893564 ]

Michael McCandless commented on LUCENE-2575:
--------------------------------------------

Logically, every term has its own open IndexOutput, where it can write any number of bytes.  During indexing, when we hit a given term, we init its IndexOutput (two of of them -- one frq, one prx) and write a few bytes as appropriate.

It's that abstraction that the interleaved byte slices API provides -- the ability to hold open a great many IndexOutputs.

We should then be able to init IndexInputs against these slices as well, but they can only sequentially scan.

To handle skipping, I think we can write to another ByteBlockPool?  That skip data would be similar to the multi-level skip data we now record, except instead of indexing into a single frq or prx file, it indexes into positions in the primary ByteBlockPool.

Where is there a concurrency problem?  Is it a JMM visibility issue of writes from one thread vs reads, in a shared byte[]?

> Concurrent byte and int block implementations
> ---------------------------------------------
>
>                 Key: LUCENE-2575
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2575
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch
>            Reporter: Jason Rutherglen
>             Fix For: Realtime Branch
>
>
> The current *BlockPool implementations aren't quite concurrent.
> We really need something that has a locking flush method, where
> flush is called at the end of adding a document. Once flushed,
> the newly written data would be available to all other reading
> threads (ie, postings etc). I'm not sure I understand the slices
> concept, it seems like it'd be easier to implement a seekable
> random access file like API. One'd seek to a given position,
> then read or write from there. The underlying management of byte
> arrays could then be hidden?

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-2575) Concurrent byte and int block implementations

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893615#action_12893615 ]

Jason Rutherglen commented on LUCENE-2575:
------------------------------------------

{quote}Where is there a concurrency problem? Is it a JMM
visibility issue of writes from one thread vs reads, in a shared
byte[]?{quote}

For example if the 1st dimension of the byte array increases,
how is that made visible while it's also being read? In
addition, there's the JMM issue of the actual byte[]s that hold
the data (ie, if we're writing to a byte[] there's no guarantee
as to when those bytes will be available to reading threads).

We're writing segments to the Directory API today, it seems we
should be able to also write to a virtual ram filesystem (a
concurrent RAMDir?) with some basic block level locking
semantics (encapsulated in a flush method). Then we could safely
and concurrently read from this new RAMDr. Couldn't we also use
the RAMDir to interleave IndexOutputs, or provide an abstraction
that performs actual the interleaving?

> Concurrent byte and int block implementations
> ---------------------------------------------
>
>                 Key: LUCENE-2575
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2575
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch
>            Reporter: Jason Rutherglen
>             Fix For: Realtime Branch
>
>
> The current *BlockPool implementations aren't quite concurrent.
> We really need something that has a locking flush method, where
> flush is called at the end of adding a document. Once flushed,
> the newly written data would be available to all other reading
> threads (ie, postings etc). I'm not sure I understand the slices
> concept, it seems like it'd be easier to implement a seekable
> random access file like API. One'd seek to a given position,
> then read or write from there. The underlying management of byte
> arrays could then be hidden?

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-2575) Concurrent byte and int block implementations

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907069#action_12907069 ]

Jason Rutherglen commented on LUCENE-2575:
------------------------------------------

bq. every term has its own open IndexOutput

I'm not seeing IndexOutput in use with the RAM buffer, do you
mean the the write* (writeVInt, writeBytes, writeByte) methods
of TermsHashPerField?

Included in this patch will need to be a way to concurrently
grow other arrays such as ParallelPostingsArray. PPA is used to
store pointers to data stored in the block pools. Maybe we need
a class that concurrently manages growing arrays and block
pools.

Or we may need to slightly re-architect how we're storing the
RAM buffer data so that concurrency can be guaranteed, ie, I
think we'll need to write to temporary arrays, which are then
flushed to primary readable arrays. The flush would occur after
adding a document, or probably for better efficiency, only when
getReader is called.

> Concurrent byte and int block implementations
> ---------------------------------------------
>
>                 Key: LUCENE-2575
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2575
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch
>            Reporter: Jason Rutherglen
>             Fix For: Realtime Branch
>
>
> The current *BlockPool implementations aren't quite concurrent.
> We really need something that has a locking flush method, where
> flush is called at the end of adding a document. Once flushed,
> the newly written data would be available to all other reading
> threads (ie, postings etc). I'm not sure I understand the slices
> concept, it seems like it'd be easier to implement a seekable
> random access file like API. One'd seek to a given position,
> then read or write from there. The underlying management of byte
> arrays could then be hidden?

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-2575) Concurrent byte and int block implementations

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907493#action_12907493 ]

Jason Rutherglen commented on LUCENE-2575:
------------------------------------------

I'm finally understanding the slice concept, basically we're
over-allocating space within the ByteBlockPool byte[]s for more
postings for a particular term, hence the levelSizeArray which
determines the length of each "slice" of a byte[] the postings
will use. They're probably not always filled in completely?

It's a bit tricky to follow by reading the code, which makes
figuring out how to make the RAM buffer concurrent challenging.
Especially in the newSlice method which rewrites the end of the
last slice with the forwarding index/address of the next slice.
It's very clever however maybe we can encapsulate it better with
methods delineating the various operations which right now are
operations directly on the assortment of arrays. In general we
can possibly get away with using copy-on-write to achieve
performant single-threaded write and multi-threaded reader
concurrency.



> Concurrent byte and int block implementations
> ---------------------------------------------
>
>                 Key: LUCENE-2575
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2575
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch
>            Reporter: Jason Rutherglen
>             Fix For: Realtime Branch
>
>
> The current *BlockPool implementations aren't quite concurrent.
> We really need something that has a locking flush method, where
> flush is called at the end of adding a document. Once flushed,
> the newly written data would be available to all other reading
> threads (ie, postings etc). I'm not sure I understand the slices
> concept, it seems like it'd be easier to implement a seekable
> random access file like API. One'd seek to a given position,
> then read or write from there. The underlying management of byte
> arrays could then be hidden?

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-2575) Concurrent byte and int block implementations

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907587#action_12907587 ]

Michael McCandless commented on LUCENE-2575:
--------------------------------------------

Right, we over-allocate each slice according to the sizes in levelSizeArray.

For a given "stream", all of its slices but the last one will be filled in.  The interleaved slices "logically" encode many streams (one per unique term), and all of these streams will have a "tip" slice, where any bytes written to that stream will go.

How to make this properly concurrent is a challenge.  Each reader knows its max docID, so, it can stop reading a given stream when it hits a docID over the max.  But, because the write can go back and overwrite the last 4 bytes of a slice (w/ the address of the next slice), we have to guard for that case (when a reader is trying to read that slice at the same time).

> Concurrent byte and int block implementations
> ---------------------------------------------
>
>                 Key: LUCENE-2575
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2575
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch
>            Reporter: Jason Rutherglen
>             Fix For: Realtime Branch
>
>
> The current *BlockPool implementations aren't quite concurrent.
> We really need something that has a locking flush method, where
> flush is called at the end of adding a document. Once flushed,
> the newly written data would be available to all other reading
> threads (ie, postings etc). I'm not sure I understand the slices
> concept, it seems like it'd be easier to implement a seekable
> random access file like API. One'd seek to a given position,
> then read or write from there. The underlying management of byte
> arrays could then be hidden?

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-2575) Concurrent byte and int block implementations

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907624#action_12907624 ]

Jason Rutherglen commented on LUCENE-2575:
------------------------------------------

I see the stream as an argument to the writeByte method, however I only see 0 or 1 being passed in for freq and prox respectively.  

I'm not sure how we can implement the linked slices concept concurrently without pre-allocating 4 bytes at the end of each slice for the forwarding address.  We would be able to go back to byte[]s that are readonly, copy, rewrite the forwarding address(es), and then synchronously flush the rewritten byte[]s back to the readonly list.  We could have a switch that turns the auto-address writing on or off if a user does not plan on using realtime search.

> Concurrent byte and int block implementations
> ---------------------------------------------
>
>                 Key: LUCENE-2575
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2575
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch
>            Reporter: Jason Rutherglen
>             Fix For: Realtime Branch
>
>
> The current *BlockPool implementations aren't quite concurrent.
> We really need something that has a locking flush method, where
> flush is called at the end of adding a document. Once flushed,
> the newly written data would be available to all other reading
> threads (ie, postings etc). I'm not sure I understand the slices
> concept, it seems like it'd be easier to implement a seekable
> random access file like API. One'd seek to a given position,
> then read or write from there. The underlying management of byte
> arrays could then be hidden?

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-2575) Concurrent byte and int block implementations

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907628#action_12907628 ]

Jason Rutherglen commented on LUCENE-2575:
------------------------------------------

Actually, have said that, we don't need to pre-allocate the forwarding address because we're copy-on-writing the byte[]s.  So I guess we're good!

> Concurrent byte and int block implementations
> ---------------------------------------------
>
>                 Key: LUCENE-2575
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2575
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch
>            Reporter: Jason Rutherglen
>             Fix For: Realtime Branch
>
>
> The current *BlockPool implementations aren't quite concurrent.
> We really need something that has a locking flush method, where
> flush is called at the end of adding a document. Once flushed,
> the newly written data would be available to all other reading
> threads (ie, postings etc). I'm not sure I understand the slices
> concept, it seems like it'd be easier to implement a seekable
> random access file like API. One'd seek to a given position,
> then read or write from there. The underlying management of byte
> arrays could then be hidden?

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Updated: (LUCENE-2575) Concurrent byte and int block implementations

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-2575:
-------------------------------------

    Attachment: LUCENE-2575.patch

Here's a start at concurrency, the terms dictionary, and
iterating over doc ids.

* It needs concurrency unit tests

* At an as yet undetermined interval, we need to conglomerate
the existing terms into a sorted int[] rather than continue to
use the ConcurrentSkipListMap, which consumes a far greater
amount of RAM. The tradeoff and reason for using the CSLM is the
level of concurrency gained by using it at the cost of greater
memory consumption when compared with the sorted int[] of term
ids.

* An int[] based term enum needs to be implemented. In addition,
a multi term enum, maybe there's one we can use, I'm not
familiar enough with the new flex code base.

* Copy on write is used to obtain a read-only version of the
ByteBlockPool and IntBlockPool. In the case of the byte blocks,
a boolean[] marks which elements need to be copied prior to
writing by the DocumentsWriterPerThread on byte slice forwarding
address rewrite.

* A write lock on each DWPT guarantees that as reference copies
are made, arrays being copied will not be altered in flight.
There shouldn't be an issue even though to get a complete
IndexReader[], we need to wait for each document to finish
flushing, we're not blocking indexing, only the obtaining of the
IRs. I can't see this being an issue for most use cases.

* Similarly, a reference is copied of the ParallelPostingsArray
(rather than a full copy) for use by the RAM Buffer based
IndexReader. It is OK for the PPA to be changed during future doc
adds, as the only the elements greater than the IRs max term id
will be altered, ie, we're not going to run into JMM thread
issues because the writing and read-only array reference copies
occur in a reentrant lock.

* Recycling of byte[]s becomes a bit more complex as RAM IRs will
likely hold references to them. When the RAM IR is closed, however,
the byte[]s can be recycled. The user could experience unusual
RAM usage spikes if IRs are not closed properly.



> Concurrent byte and int block implementations
> ---------------------------------------------
>
>                 Key: LUCENE-2575
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2575
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch
>            Reporter: Jason Rutherglen
>             Fix For: Realtime Branch
>
>         Attachments: LUCENE-2575.patch
>
>
> The current *BlockPool implementations aren't quite concurrent.
> We really need something that has a locking flush method, where
> flush is called at the end of adding a document. Once flushed,
> the newly written data would be available to all other reading
> threads (ie, postings etc). I'm not sure I understand the slices
> concept, it seems like it'd be easier to implement a seekable
> random access file like API. One'd seek to a given position,
> then read or write from there. The underlying management of byte
> arrays could then be hidden?

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Updated: (LUCENE-2575) Concurrent byte and int block implementations

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-2575:
-------------------------------------

    Attachment: LUCENE-2575.patch

This includes a basic implementation of the sorted term id based
term enum. We'll want to over-allocate the sorted term id array
so that future merges of new term ids will not require
allocating a new array for growth. I think overall the ram
buffer based searching will not require too much more of a RAM
outlay. The merging of new term ids could occur in a background
thread if we think it's expensive, however for now we can simply
merge them in on demand as new RAM readers are created.

Seek is implemented as a binary search of the sorted term ids.
If this is not efficient enough, we can implement a terms index
ala the current system.

For now the conversion from CSLM to sorted term id array can be
a percentage of the total number of terms, which I'll default to
10%. We may want to make this a function (eg, percentage) of RAM
consumption in the future.

> Concurrent byte and int block implementations
> ---------------------------------------------
>
>                 Key: LUCENE-2575
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2575
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch
>            Reporter: Jason Rutherglen
>             Fix For: Realtime Branch
>
>         Attachments: LUCENE-2575.patch, LUCENE-2575.patch
>
>
> The current *BlockPool implementations aren't quite concurrent.
> We really need something that has a locking flush method, where
> flush is called at the end of adding a document. Once flushed,
> the newly written data would be available to all other reading
> threads (ie, postings etc). I'm not sure I understand the slices
> concept, it seems like it'd be easier to implement a seekable
> random access file like API. One'd seek to a given position,
> then read or write from there. The underlying management of byte
> arrays could then be hidden?

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-2575) Concurrent byte and int block implementations

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12908849#action_12908849 ]

Jason Rutherglen commented on LUCENE-2575:
------------------------------------------

One thing I noticed, correct me if I'm wrong, is the term doc
frequency (the one stored per term, ie, TermsEnum.docFreq)
doesn't seem to be currently recorded in the ram buffer code
tree. It will be easy to add, though if we make it accurate per
RAM index reader then we could be allocating a unique array, the
length of the number of terms, per reader. I'll implement it
this way to start and we can change it later if necessary.
Actually, to save RAM this could be another use case where a 2
dimensional copy-on-write array is practical.

> Concurrent byte and int block implementations
> ---------------------------------------------
>
>                 Key: LUCENE-2575
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2575
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch
>            Reporter: Jason Rutherglen
>             Fix For: Realtime Branch
>
>         Attachments: LUCENE-2575.patch, LUCENE-2575.patch
>
>
> The current *BlockPool implementations aren't quite concurrent.
> We really need something that has a locking flush method, where
> flush is called at the end of adding a document. Once flushed,
> the newly written data would be available to all other reading
> threads (ie, postings etc). I'm not sure I understand the slices
> concept, it seems like it'd be easier to implement a seekable
> random access file like API. One'd seek to a given position,
> then read or write from there. The underlying management of byte
> arrays could then be hidden?

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Updated: (LUCENE-2575) Concurrent byte and int block implementations

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-2575:
-------------------------------------

    Attachment: LUCENE-2575.patch

Term frequency is recorded and returned.  There are Terms, TermsEnum, DocsEnum implementations.  Needs the term vectors, doc stores exposed via the RAM reader, concurrency unit tests, and a payload unit test.  Still quite rough.

> Concurrent byte and int block implementations
> ---------------------------------------------
>
>                 Key: LUCENE-2575
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2575
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch
>            Reporter: Jason Rutherglen
>             Fix For: Realtime Branch
>
>         Attachments: LUCENE-2575.patch, LUCENE-2575.patch, LUCENE-2575.patch
>
>
> The current *BlockPool implementations aren't quite concurrent.
> We really need something that has a locking flush method, where
> flush is called at the end of adding a document. Once flushed,
> the newly written data would be available to all other reading
> threads (ie, postings etc). I'm not sure I understand the slices
> concept, it seems like it'd be easier to implement a seekable
> random access file like API. One'd seek to a given position,
> then read or write from there. The underlying management of byte
> arrays could then be hidden?

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Updated: (LUCENE-2575) Concurrent byte and int block implementations

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-2575:
-------------------------------------

    Attachment: LUCENE-2575.patch

Added a unit test for payloads, term vectors, and doc stores.  The reader flushes term vectors and doc stores on demand, once per reader.  Also, little things are getting cleaned up in the realtime branch.

> Concurrent byte and int block implementations
> ---------------------------------------------
>
>                 Key: LUCENE-2575
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2575
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch
>            Reporter: Jason Rutherglen
>             Fix For: Realtime Branch
>
>         Attachments: LUCENE-2575.patch, LUCENE-2575.patch, LUCENE-2575.patch, LUCENE-2575.patch
>
>
> The current *BlockPool implementations aren't quite concurrent.
> We really need something that has a locking flush method, where
> flush is called at the end of adding a document. Once flushed,
> the newly written data would be available to all other reading
> threads (ie, postings etc). I'm not sure I understand the slices
> concept, it seems like it'd be easier to implement a seekable
> random access file like API. One'd seek to a given position,
> then read or write from there. The underlying management of byte
> arrays could then be hidden?

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-2575) Concurrent byte and int block implementations

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12909580#action_12909580 ]

Jason Rutherglen commented on LUCENE-2575:
------------------------------------------

For the posting skip list we need to implement seek on the
ByteSliceReader. However if we're rewriting a portion of a
slice, then I guess we could have a problem... Meaning we'd be
storing an absolute position in the skip list, and we could go
to look up the value, however that byte(s) could have been
altered to not be delta encoded doc ids anymore, but instead
is/are the forwarding address to the next slice.

Do we need an intelligent mechanism that interacts with the byte
slice writer to not point at byte array elements (ie the end of
slices) that could later be converted into forwarding addresses?

> Concurrent byte and int block implementations
> ---------------------------------------------
>
>                 Key: LUCENE-2575
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2575
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch
>            Reporter: Jason Rutherglen
>             Fix For: Realtime Branch
>
>         Attachments: LUCENE-2575.patch, LUCENE-2575.patch, LUCENE-2575.patch, LUCENE-2575.patch
>
>
> The current *BlockPool implementations aren't quite concurrent.
> We really need something that has a locking flush method, where
> flush is called at the end of adding a document. Once flushed,
> the newly written data would be available to all other reading
> threads (ie, postings etc). I'm not sure I understand the slices
> concept, it seems like it'd be easier to implement a seekable
> random access file like API. One'd seek to a given position,
> then read or write from there. The underlying management of byte
> arrays could then be hidden?

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-2575) Concurrent byte and int block implementations

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12909771#action_12909771 ]

Jason Rutherglen commented on LUCENE-2575:
------------------------------------------

Because of the way byte slices work, eg, they need to pre-know
the size of the slice before iterating on it, we can't simply
point to the middle of a slice and read without probably
iterating over the forwarding address.

It seems the skip list will need to point to the beginning of a
slice. This'll make the interval iteration in the RAM buffer
skip list writer a little more complicated than today in that
it'll need to store positions that are the start of byte slices.
In other words, the intervals will be slightly uneven at times.

> Concurrent byte and int block implementations
> ---------------------------------------------
>
>                 Key: LUCENE-2575
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2575
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch
>            Reporter: Jason Rutherglen
>             Fix For: Realtime Branch
>
>         Attachments: LUCENE-2575.patch, LUCENE-2575.patch, LUCENE-2575.patch, LUCENE-2575.patch
>
>
> The current *BlockPool implementations aren't quite concurrent.
> We really need something that has a locking flush method, where
> flush is called at the end of adding a document. Once flushed,
> the newly written data would be available to all other reading
> threads (ie, postings etc). I'm not sure I understand the slices
> concept, it seems like it'd be easier to implement a seekable
> random access file like API. One'd seek to a given position,
> then read or write from there. The underlying management of byte
> arrays could then be hidden?

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-2575) Concurrent byte and int block implementations

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12909839#action_12909839 ]

Jason Rutherglen commented on LUCENE-2575:
------------------------------------------

Is there a way to know the level of a slice given only the forwarding address/position?  It doesn't look like it.  Hmm... This could mean encoding the level or the size of the slice into the slice, which would elongate slices in general, I suppose though that the level index would only add one byte and that would be okay.

> Concurrent byte and int block implementations
> ---------------------------------------------
>
>                 Key: LUCENE-2575
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2575
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch
>            Reporter: Jason Rutherglen
>             Fix For: Realtime Branch
>
>         Attachments: LUCENE-2575.patch, LUCENE-2575.patch, LUCENE-2575.patch, LUCENE-2575.patch
>
>
> The current *BlockPool implementations aren't quite concurrent.
> We really need something that has a locking flush method, where
> flush is called at the end of adding a document. Once flushed,
> the newly written data would be available to all other reading
> threads (ie, postings etc). I'm not sure I understand the slices
> concept, it seems like it'd be easier to implement a seekable
> random access file like API. One'd seek to a given position,
> then read or write from there. The underlying management of byte
> arrays could then be hidden?

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-2575) Concurrent byte and int block implementations

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12909852#action_12909852 ]

Jason Rutherglen commented on LUCENE-2575:
------------------------------------------

In the following line of ByteBlockPool.allocSlice we're recording the slice level, however it's at the end of the slice rather than the beginning, which is where we'll need to write the level in order to implement slice seek.  I'm not immediately sure what's reading the level at this end position of the byte[].

{code}
buffer[byteUpto-1] = (byte) (16|newLevel);
{code}

> Concurrent byte and int block implementations
> ---------------------------------------------
>
>                 Key: LUCENE-2575
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2575
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch
>            Reporter: Jason Rutherglen
>             Fix For: Realtime Branch
>
>         Attachments: LUCENE-2575.patch, LUCENE-2575.patch, LUCENE-2575.patch, LUCENE-2575.patch
>
>
> The current *BlockPool implementations aren't quite concurrent.
> We really need something that has a locking flush method, where
> flush is called at the end of adding a document. Once flushed,
> the newly written data would be available to all other reading
> threads (ie, postings etc). I'm not sure I understand the slices
> concept, it seems like it'd be easier to implement a seekable
> random access file like API. One'd seek to a given position,
> then read or write from there. The underlying management of byte
> arrays could then be hidden?

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-2575) Concurrent byte and int block implementations

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910123#action_12910123 ]

Michael McCandless commented on LUCENE-2575:
--------------------------------------------

bq. I'm not immediately sure what's reading the level at this end position of the byte[].

This is so that once we exhaust the slice and must allocate the next one we know what size (level + 1, ceiling'd) to make the next slice.

> Concurrent byte and int block implementations
> ---------------------------------------------
>
>                 Key: LUCENE-2575
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2575
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch
>            Reporter: Jason Rutherglen
>             Fix For: Realtime Branch
>
>         Attachments: LUCENE-2575.patch, LUCENE-2575.patch, LUCENE-2575.patch, LUCENE-2575.patch
>
>
> The current *BlockPool implementations aren't quite concurrent.
> We really need something that has a locking flush method, where
> flush is called at the end of adding a document. Once flushed,
> the newly written data would be available to all other reading
> threads (ie, postings etc). I'm not sure I understand the slices
> concept, it seems like it'd be easier to implement a seekable
> random access file like API. One'd seek to a given position,
> then read or write from there. The underlying management of byte
> arrays could then be hidden?

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-2575) Concurrent byte and int block implementations

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910201#action_12910201 ]

Jason Rutherglen commented on LUCENE-2575:
------------------------------------------

bq. we know what size (level + 1, ceiling'd) to make the next slice.

Thanks.  In the midst of debugging last night I realized this.  The next question is whether to remove it.

> Concurrent byte and int block implementations
> ---------------------------------------------
>
>                 Key: LUCENE-2575
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2575
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch
>            Reporter: Jason Rutherglen
>             Fix For: Realtime Branch
>
>         Attachments: LUCENE-2575.patch, LUCENE-2575.patch, LUCENE-2575.patch, LUCENE-2575.patch
>
>
> The current *BlockPool implementations aren't quite concurrent.
> We really need something that has a locking flush method, where
> flush is called at the end of adding a document. Once flushed,
> the newly written data would be available to all other reading
> threads (ie, postings etc). I'm not sure I understand the slices
> concept, it seems like it'd be easier to implement a seekable
> random access file like API. One'd seek to a given position,
> then read or write from there. The underlying management of byte
> arrays could then be hidden?

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-2575) Concurrent byte and int block implementations

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913383#action_12913383 ]

Jason Rutherglen commented on LUCENE-2575:
------------------------------------------

This issue is blocked because the change made to ByteBlockPool to add the level of the slice, to the beginning of the slice, moves all of the positions forward by one.  This has caused TestByteSlices to fail an assertion.  I'm not sure if the test needs to be changed, or there's a bug in the new BBP implementation.  Either way it's a bit of a challenge to debug.

> Concurrent byte and int block implementations
> ---------------------------------------------
>
>                 Key: LUCENE-2575
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2575
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: Realtime Branch
>            Reporter: Jason Rutherglen
>             Fix For: Realtime Branch
>
>         Attachments: LUCENE-2575.patch, LUCENE-2575.patch, LUCENE-2575.patch, LUCENE-2575.patch
>
>
> The current *BlockPool implementations aren't quite concurrent.
> We really need something that has a locking flush method, where
> flush is called at the end of adding a document. Once flushed,
> the newly written data would be available to all other reading
> threads (ie, postings etc). I'm not sure I understand the slices
> concept, it seems like it'd be easier to implement a seekable
> random access file like API. One'd seek to a given position,
> then read or write from there. The underlying management of byte
> arrays could then be hidden?

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

123