Realtime Search for Social Networks Collaboration


Re: Realtime Search for Social Networks Collaboration

Michael McCandless-2

This would just tap into the live hashtable that DocumentsWriter*  
maintain for the posting lists... except the docFreq will need to be  
copied away on reopen, I think.

Mike

Jason Rutherglen wrote:

> Term dictionary?  I'm curious how that would be solved?
>
> On Mon, Sep 8, 2008 at 3:04 PM, Michael McCandless
> <[hidden email]> wrote:
>>
>> Yonik Seeley wrote:
>>
>>>> I think it's quite feasible, but, it'd still have a "reopen" cost in
>>>> that any buffered delete by term or query would have to be
>>>> "materialized" into docIDs on reopen.  Though, if this somehow turns
>>>> out to be a problem, in the future we could do this materializing
>>>> immediately, instead of buffering, if we already have a reader open.
>>>
>>> Right... it seems like re-using readers internally is something we
>>> could already be doing in IndexWriter.
>>
>> True.
>>
>>>> Flushing is somewhat tricky because any open RAM readers would then
>>>> have to cutover to the newly flushed segment once the flush completes,
>>>> so that the RAM buffer can be recycled for the next segment.
>>>
>>> Re-use of a RAM buffer doesn't seem like such a big deal.
>>>
>>> But, how would you maintain a static view of an index...?
>>>
>>> IndexReader r1 = indexWriter.getCurrentIndex()
>>> indexWriter.addDocument(...)
>>> IndexReader r2 = indexWriter.getCurrentIndex()
>>>
>>> I assume r1 will have a view of the index before the document was
>>> added, and r2 after?
>>
>> Right, getCurrentIndex would return a MultiReader that includes a
>> SegmentReader for each segment in the index, plus a "RAMReader" that
>> searches the RAM buffer.  That RAMReader is a tiny shell class that
>> would basically just record the max docID it's allowed to go up to
>> (the docID as of when it was opened), and stop enumerating docIDs
>> (eg in the TermDocs) when it hits a docID beyond that limit.
>>
>> For reading stored fields and term vectors, which are now flushed
>> immediately to disk, we need to somehow get an IndexInput from the
>> IndexOutputs that IndexWriter holds open on these files.  Or, maybe,
>> just open new IndexInputs?
>>
>>> Another thing that will help is if users could get their hands on the
>>> sub-readers of a multi-segment reader.  Right now that is hidden in
>>> MultiSegmentReader and makes updating anything incrementally difficult.
>>
>> Besides what's handled by MultiSegmentReader.reopen already, what else
>> do you need to incrementally update?
>>
>> Mike
>>



Re: Realtime Search for Social Networks Collaboration

Yonik Seeley-2
In reply to this post by Michael McCandless-2
On Tue, Sep 9, 2008 at 5:28 AM, Michael McCandless
<[hidden email]> wrote:
> Yonik Seeley wrote:
>> What about something like term freq?  Would it need to count the
>> number of docs after the local maxDoc or is there a better way?
>
> Good question...
>
> I think we'd have to take a full copy of the term -> termFreq on reopen?  I
> don't see how else to do it (I don't understand your suggestion above).  So,
> this will clearly add to the cost of reopen.

One could adjust the freq by iterating over the term's documents...
skipTo(localMaxDoc) and count how many are after that, then subtract
from the freq.  I didn't say it was a *good* idea :-)
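
Spelled out, just as an illustration (adjustedDocFreq and localMaxDoc are
invented names, not real API), the adjustment would be something like:

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

class SnapshotDocFreq {
  // Derive a point-in-time docFreq by discounting postings that fall at or
  // beyond the reader's snapshot boundary (localMaxDoc).
  static int adjustedDocFreq(IndexReader reader, Term term, int localMaxDoc)
      throws IOException {
    int freq = reader.docFreq(term);    // live freq, may include newer docs
    TermDocs td = reader.termDocs(term);
    try {
      if (td.skipTo(localMaxDoc)) {     // first doc at or beyond the boundary
        int beyond = 1;
        while (td.next()) {
          beyond++;
        }
        freq -= beyond;                 // subtract everything past the snapshot
      }
      return freq;
    } finally {
      td.close();
    }
  }
}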

>>> For reading stored fields and term vectors, which are now flushed
>>> immediately to disk, we need to somehow get an IndexInput from the
>>> IndexOutputs that IndexWriter holds open on these files.  Or, maybe, just
>>> open new IndexInputs?
>>
>> Hmmm, seems like a case of our nice and simple Directory model not
>> having quite enough features in this case.
>
> I think we can simply open IndexInputs on these files.  I believe Java does
> the right thing on windows, such that if we are already writing to the file,
> it does not prevent another file handle from opening the file for reading.

Yeah, I think the underlying RandomAccessFile might do the right
thing, but IndexInput isn't required to see any changes on the fly
(and current implementations don't) so at a minimum it would be a
change of IndexInput semantics.  Maybe there would need to be a
refresh() function added, or we would need to require a specific
Directory impl?

OR, if all writes are append-only, perhaps we don't ever need to
invalidate the read buffer and would just need to remove the current
logic that caches the file length and then let the underlying
RandomAccessFile do the EOF checking.
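
For example (purely hypothetical -- nothing like RefreshableIndexInput or
refresh() exists today, and it assumes the wrapped input can physically read
bytes appended after it was opened):

import java.io.IOException;

import org.apache.lucene.store.IndexInput;

// Sketch of an IndexInput whose visible length can be advanced after it has
// been opened, so an append-only file may grow underneath it.
class RefreshableIndexInput extends IndexInput {
  private final IndexInput delegate;  // input over the same underlying file
  private long visibleLength;         // how far reads are currently allowed

  RefreshableIndexInput(IndexInput delegate) {
    this.delegate = delegate;
    this.visibleLength = delegate.length();
  }

  // advance the visible length to what the writer has appended (and flushed)
  void refresh(long newLength) {
    if (newLength > visibleLength) {
      visibleLength = newLength;
    }
  }

  public byte readByte() throws IOException {
    if (getFilePointer() >= visibleLength) {
      throw new IOException("read past visible EOF");
    }
    return delegate.readByte();
  }

  public void readBytes(byte[] b, int offset, int len) throws IOException {
    if (getFilePointer() + len > visibleLength) {
      throw new IOException("read past visible EOF");
    }
    delegate.readBytes(b, offset, len);
  }

  public long getFilePointer() { return delegate.getFilePointer(); }
  public void seek(long pos) throws IOException { delegate.seek(pos); }
  public long length() { return visibleLength; }
  public void close() throws IOException { delegate.close(); }
}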

-Yonik


Re: Realtime Search for Social Networks Collaboration

Ning Li-3
On Mon, Sep 8, 2008 at 4:23 PM, Yonik Seeley <[hidden email]> wrote:
>> I thought an index reader which supports real-time search no longer
>> maintains a static view of an index?
>
> It seems advantageous to just make it really cheap to get a new view
> of the index (if you do it for every search, it amounts to the same
> thing, right?)

Sounds like these lightweight views of the index are backed by
something dynamic, right?


> Quite a bit of code in Lucene assumes a static view of
> the Index I think (even IndexSearcher), and it's nice to have a stable
> index view for the duration of a single request.

Agree.


On Tue, Sep 9, 2008 at 10:02 AM, Yonik Seeley <[hidden email]> wrote:

> Yeah, I think the underlying RandomAccessFile might do the right
> thing, but IndexInput isn't required to see any changes on the fly
> (and current implementations don't) so at a minimum it would be a
> change of IndexInput semantics.  Maybe there would need to be a
> refresh() function added, or we would need to require a specific
> Directory impl?
>
> OR, if all writes are append-only, perhaps we don't ever need to
> invalidate the read buffer and would just need to remove the current
> logic that caches the file length and then let the underlying
> RandomAccessFile do the EOF checking.

We cannot assume it's always RandomAccessFile, can we?
So we may have to flush after writing each document. Even so,
this may not be sufficient for some FS such as HDFS... Is it
reasonable in this case to keep in memory everything including
stored fields and term vectors?


Cheers,
Ning


Re: Realtime Search for Social Networks Collaboration

Yonik Seeley-2
On Tue, Sep 9, 2008 at 11:42 AM, Ning Li <[hidden email]> wrote:

> On Tue, Sep 9, 2008 at 10:02 AM, Yonik Seeley <[hidden email]> wrote:
>> Yeah, I think the underlying RandomAccessFile might do the right
>> thing, but IndexInput isn't required to see any changes on the fly
>> (and current implementations don't) so at a minimum it would be a
>> change of IndexInput semantics.  Maybe there would need to be a
>> refresh() function added, or we would need to require a specific
>> Directory impl?
>>
>> OR, if all writes are append-only, perhaps we don't ever need to
>> invalidate the read buffer and would just need to remove the current
>> logic that caches the file length and then let the underlying
>> RandomAccessFile do the EOF checking.
>
> We cannot assume it's always RandomAccessFile, can we?

No, it would essentially be a change in the semantics that all
implementations would need to support.

> So we may have to flush after writing each document.

Flush when creating a new index view (which could possibly be after
every document is added, but doesn't have to be).

> Even so,
> this may not be sufficient for some FS such as HDFS... Is it
> reasonable in this case to keep in memory everything including
> stored fields and term vectors?

We could maybe do something like a proxy IndexInput/IndexOutput that
would allow updating the read buffer from the writer buffer.
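
Something along these lines, maybe (all names invented; the writer would keep
its unflushed tail bytes in a small shared buffer and the paired reader would
serve reads past the flushed length from there instead of from the file):

import java.io.IOException;

import org.apache.lucene.store.IndexInput;

// Writer-owned shared state: how much of the file is safely readable, plus
// the bytes written since the last flush.
class SharedTail {
  private long flushedLength;
  private byte[] tail = new byte[0];

  synchronized long flushedLength() { return flushedLength; }
  synchronized long totalLength() { return flushedLength + tail.length; }
  synchronized byte byteAt(long pos) { return tail[(int) (pos - flushedLength)]; }
  // the writer side would update flushedLength/tail as it writes and flushes
}

class ProxyIndexInput extends IndexInput {
  private final IndexInput file;    // reads the flushed portion
  private final SharedTail shared;  // reads the still-buffered portion
  private long pos;

  ProxyIndexInput(IndexInput file, SharedTail shared) {
    this.file = file;
    this.shared = shared;
  }

  public byte readByte() throws IOException {
    byte b;
    if (pos < shared.flushedLength()) {
      file.seek(pos);
      b = file.readByte();
    } else {
      b = shared.byteAt(pos);         // still sitting in the writer's buffer
    }
    pos++;
    return b;
  }

  public void readBytes(byte[] b, int offset, int len) throws IOException {
    for (int i = 0; i < len; i++) {   // simple, unoptimized
      b[offset + i] = readByte();
    }
  }

  public long getFilePointer() { return pos; }
  public void seek(long p) { pos = p; }
  public long length() { return shared.totalLength(); }
  public void close() throws IOException { file.close(); }
}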

-Yonik


Re: Realtime Search for Social Networks Collaboration

Michael McCandless-2
In reply to this post by Yonik Seeley-2

Yonik Seeley wrote:

> On Tue, Sep 9, 2008 at 5:28 AM, Michael McCandless
> <[hidden email]> wrote:
>> Yonik Seeley wrote:
>>> What about something like term freq?  Would it need to count the
>>> number of docs after the local maxDoc or is there a better way?
>>
>> Good question...
>>
>> I think we'd have to take a full copy of the term -> termFreq on  
>> reopen?  I
>> don't see how else to do it (I don't understand your suggestion  
>> above).  So,
>> this will clearly add to the cost of reopen.
>
> One could adjust the freq by iterating over the terms documents...
> skipTo(localMaxDoc) and count how many are after that, then subtract
> from the freq.  I didn't say it was a *good* idea :-)

Ahh, OK :)

>>>> For reading stored fields and term vectors, which are now flushed
>>>> immediately to disk, we need to somehow get an IndexInput from the
>>>> IndexOutputs that IndexWriter holds open on these files.  Or,  
>>>> maybe, just
>>>> open new IndexInputs?
>>>
>>> Hmmm, seems like a case of our nice and simple Directory model not
>>> having quite enough features in this case.
>>
>> I think we can simply open IndexInputs on these files.  I believe  
>> Java does
>> the right thing on windows, such that if we are already writing to  
>> the file,
>> it does not prevent another file handle from opening the file for  
>> reading.
>
> Yeah, I think the underlying RandomAccessFile might do the right
> thing, but IndexInput isn't required to see any changes on the fly
> (and current implementations don't) so at a minimum it would be a
> change of IndexInput semantics.  Maybe there would need to be a
> refresh() function added, or we would need to require a specific
> Directory impl?
>
> OR, if all writes are append-only, perhaps we don't ever need to
> invalidate the read buffer and would just need to remove the current
> logic that caches the file length and then let the underlying
> RandomAccessFile do the EOF checking.

All writes to these files are append-only, and, when we open the
IndexInput we would never read beyond its current length (once we
flush our IndexOutput) because that's the local maxDocID limit.

Mike


Re: Realtime Search for Social Networks Collaboration

Michael McCandless-2
In reply to this post by Yonik Seeley-2

Yonik Seeley wrote:

> On Tue, Sep 9, 2008 at 11:42 AM, Ning Li <[hidden email]> wrote:
>> On Tue, Sep 9, 2008 at 10:02 AM, Yonik Seeley <[hidden email]>  
>> wrote:
>>> Yeah, I think the underlying RandomAccessFile might do the right
>>> thing, but IndexInput isn't required to see any changes on the fly
>>> (and current implementations don't) so at a minimum it would be a
>>> change of IndexInput semantics.  Maybe there would need to be a
>>> refresh() function added, or we would need to require a specific
>>> Directory impl?
>>>
>>> OR, if all writes are append-only, perhaps we don't ever need to
>>> invalidate the read buffer and would just need to remove the current
>>> logic that caches the file length and then let the underlying
>>> RandomAccessFile do the EOF checking.
>>
>> We cannot assume it's always RandomAccessFile, can we?
>
> No, it would essentially be a change in the semantics that all
> implementations would need to support.

Right, which means you are allowed to open an IndexInput on a file while
an IndexOutput has that same file open and is still appending to it.
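
For example, something like this would have to be legal under the changed
semantics (RAMDirectory is used only to keep the snippet self-contained; the
interesting case is a filesystem Directory):

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.IndexOutput;
import org.apache.lucene.store.RAMDirectory;

public class ConcurrentOpenSketch {
  public static void main(String[] args) throws Exception {
    Directory dir = new RAMDirectory();
    IndexOutput out = dir.createOutput("segment.dat");
    out.writeInt(42);
    out.flush();                                   // make the bytes visible

    IndexInput in = dir.openInput("segment.dat");  // writer still has it open
    System.out.println(in.readInt());              // expect 42

    out.writeInt(43);                              // writer keeps appending...
    out.flush();
    // ...whether 'in' may ever see this second int is exactly the open
    // question: today IndexInput is not required to.

    in.close();
    out.close();
  }
}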

>> So we may have to flush after writing each document.
>
> Flush when creating a new index view (which could possibly be after
> every document is added, but doesn't have to be).

Assuming we can make the above semantics requirement change to  
IndexInput, we don't need to flush on opening a new RAM reader?

>> Even so,
>> this may not be sufficient for some FS such as HDFS... Is it
>> reasonable in this case to keep in memory everything including
>> stored fields and term vectors?
>
> We could maybe do something like a proxy IndexInput/IndexOutput that
> would allow updating the read buffer from the writer buffer.

Does HDFS disallow a reader from reading a file that's still open for  
append?

Mike


Re: Realtime Search for Social Networks Collaboration

Yonik Seeley-2
In reply to this post by Michael McCandless-2
On Tue, Sep 9, 2008 at 12:41 PM, Michael McCandless
<[hidden email]> wrote:
> Yonik Seeley wrote:
>> OR, if all writes are append-only, perhaps we don't ever need to
>> invalidate the read buffer and would just need to remove the current
>> logic that caches the file length and then let the underlying
>> RandomAccessFile do the EOF checking.
>
> All writes to these files are append-only, and, when we open the IndexInput
> we would never read beyond its current length (once we flush our
> IndexOutput) because that's the local maxDocID limit.

Right, but it would be nice to not have to open a new IndexInput for
each snapshot... opening a file is not a quick operation.

-Yonik


Re: Realtime Search for Social Networks Collaboration

Yonik Seeley-2
In reply to this post by Michael McCandless-2
On Tue, Sep 9, 2008 at 12:45 PM, Michael McCandless
<[hidden email]> wrote:
> Yonik Seeley wrote:
>> No, it would essentially be a change in the semantics that all
>> implementations would need to support.
>
> Right, which is you are allowed to open an IndexInput on a file when an
> IndexOutput has that same file open and is still appending to it.

Not just that, but that the size can actually grow after the
IndexInput has been opened, and that should be visible.  That would
seem necessary for sharing the IndexInput (via a clone).

>>> So we may have to flush after writing each document.
>>
>> Flush when creating a new index view (which could possibly be after
>> every document is added, but doesn't have to be).
>
> Assuming we can make the above semantics requirement change to IndexInput,
> we don't need to flush on opening a new RAM reader?

Yes, we would need to flush... I was just pointing out that you don't
necessarily need a new RAM reader for every document added (but that
is the worst case scenario).

-Yonik


Re: Realtime Search for Social Networks Collaboration

Ning Li-3
In reply to this post by Michael McCandless-2
>>> Even so,
>>> this may not be sufficient for some FS such as HDFS... Is it
>>> reasonable in this case to keep in memory everything including
>>> stored fields and term vectors?
>>
>> We could maybe do something like a proxy IndexInput/IndexOutput that
>> would allow updating the read buffer from the writer buffer.
>
> Does HDFS disallow a reader from reading a file that's still open for
> append?

HDFS allows that. "A reader is guaranteed to be able to read data that
was 'flushed' before the reader opened the file." However, it may not
see the latest appends (after open) even if they are flushed. Yonik's
comments below also apply in this case.

> Right, but it would be nice to not have to open a new IndexInput for
> each snapshot... opening a file is not a quick operation.

Cheers,
Ning


Re: Realtime Search for Social Networks Collaboration

Jason Rutherglen
In reply to this post by Michael McCandless-2
Hi Mike,

There would be a new sorted list or something to replace the
hashtable?  Seems like an issue that is not solved.

Jason

On Tue, Sep 9, 2008 at 5:29 AM, Michael McCandless
<[hidden email]> wrote:

>
> This would just tap into the live hashtable that DocumentsWriter* maintain
> for the posting lists... except the docFreq will need to be copied away on
> reopen, I think.
>
> Mike

Re: Realtime Search for Social Networks Collaboration

Michael McCandless-2

Right, there would need to be a snapshot taken of all terms when  
IndexWriter.getReader() is called.

This snapshot would 1) hold a frozen int docFreq per term, and 2) sort  
the terms so TermEnum can just step through them.  (We might be able  
to delay this sorting until the first time something asks for it).  
Also, it must merge this data from all threads, since each thread  
holds its hash per field.  I've got a rough start at coding this up...
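
Very roughly, the merge-and-freeze step could look something like the sketch
below (per-field handling is omitted, and the per-thread hashes here stand in
for DocumentsWriter's real byte-block structures):

import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;

// Frozen, point-in-time view of term -> docFreq for one field.
class FrozenTermSnapshot {
  private final TreeMap<String, Integer> terms = new TreeMap<String, Integer>();

  // merge and freeze; call while holding whatever lock protects the live hashes
  void freeze(Iterable<Map<String, Integer>> perThreadTermFreqs) {
    for (Map<String, Integer> threadTerms : perThreadTermFreqs) {
      for (Map.Entry<String, Integer> e : threadTerms.entrySet()) {
        Integer prev = terms.get(e.getKey());
        int merged = (prev == null ? 0 : prev.intValue()) + e.getValue().intValue();
        terms.put(e.getKey(), Integer.valueOf(merged));  // immune to later adds
      }
    }
  }

  int docFreq(String term) {
    Integer f = terms.get(term);
    return f == null ? 0 : f.intValue();
  }

  // terms in sorted order, as a TermEnum-style iteration would need
  Iterator<String> termIterator() {
    return terms.keySet().iterator();
  }
}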

The costs are clearly growing, in order to keep the "point in time"  
feature of this RAMIndexReader, but I think are still well contained  
unless you have a really huge RAM buffer.

Flushing is still tricky because we cannot recycle the byte block  
buffers until all running TermDocs/TermPositions iterations are  
"finished".  Alternatively, I may just allocate new byte blocks and  
allow the old ones to be GC'd on their own once running iterations are  
finished.

Mike

Jason Rutherglen wrote:

> Hi Mike,
>
> There would be a new sorted list or something to replace the
> hashtable?  Seems like an issue that is not solved.
>
> Jason

Re: Realtime Search for Social Networks Collaboration

Jason Rutherglen
Mike,

The other issue that will come up, and that I have addressed, is the field
caches.  The underlying smaller IndexReaders will need to be exposed because
of the field caching.  Currently in Ocean realtime search the individual
readers are searched using a MultiSearcher in order to search in parallel
and reuse the field caches.  How will field caching work with the
IndexWriter approach?  It seems like it would need a dynamically growing
field cache array?  That is a bit tricky.  By doing in-memory merging in
Ocean, the field caches last longer and do not require growing arrays.  How
do you plan to handle rapidly deleting the docs of the disk segments?  Can
the SegmentReader clone patch be used for this?
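
Roughly, the pattern Ocean uses looks like the sketch below (simplified;
getting at the sub-readers is the awkward part today because
MultiSegmentReader hides them):

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Searchable;

class PerSegmentSearcherFactory {
  // Wrap each sub-reader in its own IndexSearcher so FieldCache entries are
  // keyed per sub-reader and survive reopens that only add new segments.
  static MultiSearcher build(IndexReader[] subReaders, String sortField)
      throws IOException {
    Searchable[] searchers = new Searchable[subReaders.length];
    for (int i = 0; i < subReaders.length; i++) {
      FieldCache.DEFAULT.getInts(subReaders[i], sortField);  // warm per segment
      searchers[i] = new IndexSearcher(subReaders[i]);
    }
    return new MultiSearcher(searchers);
  }
}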

Jason

On Thu, Sep 11, 2008 at 8:29 AM, Michael McCandless
<[hidden email]> wrote:

>
> Right, there would need to be a snapshot taken of all terms when
> IndexWriter.getReader() is called.
>
> This snapshot would 1) hold a frozen int docFreq per term, and 2) sort the
> terms so TermEnum can just step through them.  (We might be able to delay
> this sorting until the first time something asks for it).  Also, it must
> merge this data from all threads, since each thread holds its hash per
> field.  I've got a rough start at coding this up...
>
> The costs are clearly growing, in order to keep the "point in time" feature
> of this RAMIndexReader, but I think are still well contained unless you have
> a really huge RAM buffer.
>
> Flushing is still tricky because we cannot recycle the byte block buffers
> until all running TermDocs/TermPositions iterations are "finished".
>  Alternatively, I may just allocate new byte blocks and allow the old ones
> to be GC'd on their own once running iterations are finished.
>
> Mike

Re: Realtime Search for Social Networks Collaboration

Michael McCandless-2

Jason Rutherglen wrote:

> Mike,
>
> The other issue that will occur that I addressed is the field caches.
> The underlying smaller IndexReaders will need to be exposed because of
> the field caching.  Currently in ocean realtime search the individual
> readers are searched on using a MultiSearcher in order to search in
> parallel and reuse the field caches. How will field caching work with
> the IndexWriter approach?  It seems like it would need a dynamically
> growing field cache array?  That is a bit tricky.  By doing in memory
> merging in ocean, the field caches last longer and do not require
> growing arrays.

First off, I think the combination of LUCENE-1231 and LUCENE-831,  
which should result in FieldCache that is "distributed" down to each  
SegmentReader and much faster to initialize, should make incrementally  
updating the FieldCache much more efficient (ie, on calling  
IndexReader.reopen, it should only be the new segments that need to  
populate their FieldCache).

Hopefully these land before real-time search, because then I have more  
API flexibility to expose column-stride fields on the in-RAM  
documents.  There is still some trickiness, because an "ordinary"  
IndexWriter would never hold the column-stride fields in RAM.  They'd  
be flushed to the Directory, immediately per document, just like
stored fields and term vectors are today.  So, maybe, the first  
RAMReader you get from the IndexWriter would load back in these  
fields, triggering IndexWriter to add to them as documents are added  
(maybe using exponentially growing arrays as the underlying store, or,  
perhaps separate array fragments, to prevent synchronization when  
reading from them), such that subsequent reopens simply resync their  
max docID.
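
For instance, the "separate array fragments" idea for an in-RAM int field
might look roughly like this (names invented; the real thing would live
inside DocumentsWriter, and the reopen hand-off is assumed to give readers
the memory barrier they need):

// Append-only column of ints stored in fixed-size blocks.  Published blocks
// are never resized, and a reader only trusts docIDs below its own maxDoc
// snapshot, so reads need no synchronization.
class GrowingIntColumn {
  private static final int BLOCK_SIZE = 4096;
  private volatile int[][] blocks = new int[0][];
  private int size;  // docs written so far (writer thread only)

  // writer side: called as each document is added
  synchronized void add(int value) {
    int block = size / BLOCK_SIZE;
    if (block == blocks.length) {
      int[][] newBlocks = new int[blocks.length + 1][];
      System.arraycopy(blocks, 0, newBlocks, 0, blocks.length);
      newBlocks[block] = new int[BLOCK_SIZE];
      blocks = newBlocks;              // publish the grown block list
    }
    blocks[block][size % BLOCK_SIZE] = value;
    size++;
  }

  // reader side: safe for any docID below the reader's maxDoc snapshot
  int get(int docID) {
    return blocks[docID / BLOCK_SIZE][docID % BLOCK_SIZE];
  }
}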

> How do you plan to handle rapidly delete the docs of
> the disk segments?  Can the SegmentReader clone patch be used for
> this?

I was thinking we'd flush new .del files every time a reopen is  
called, but that could very well be costly.  Instead, we can keep the  
deletes pending in the SegmentReaders we're holding open, and then go  
back to flushing on IndexWriter's normal schedule.  Reopen then must  
only "materialize" any buffered deletes by Term & Query, unless we  
decide to move up that materialization into the actual delete call,
since we will have SegmentReaders open anyway.  I think I'm leaning  
towards that approach... best to pay the cost as you go, instead of  
aggregated cost on reopen?

Mike


Re: Realtime Search for Social Networks Collaboration

Jason Rutherglen
Hi Mike,

How do column-stride fields work for StringIndex field caching?  I
have been working on the tag index, which may be more suitable for
field caching and makes range queries faster.  It is something that
would be good to integrate into core Lucene as well.  It may be more
suitable for many situations.  Perhaps the column-stride fields and tag
index can be merged?  What is the progress on column-stride fields?

> Reopen then must only "materialize" any
> buffered deletes by Term & Query, unless we decide to move up that
> materialization into the actual delete call, since we will have
> SegmentReaders open anyway.  I think I'm leaning towards that approach...
> best to pay the cost as you go, instead of aggregated cost on reopen?

I don't follow this part.  There is an IndexReader exposed from
IndexWriter.  I think the individual SegmentReaders should be exposed
as well; I don't see any reason not to, and there are many cases where
it has been frustrating that SegmentReaders are package protected.  I
am not sure from what you mentioned how the deletedDocs bitvector is
handled.


Re: Realtime Search for Social Networks Collaboration

Michael McCandless-2

Jason Rutherglen wrote:

> How do column-stride fields work for StringIndex field caching?

I'm not sure -- Michael Busch is working on column-stride fields.

> I
> have been working on the tag index which may be more suitable for
> field caching and makes range queries faster.  It is something that
> would be good to integrate into core Lucene as well.  It may be more
> suitable for many situations.  Perhaps the column stride and tag index
> can be merged?  What is the progress on cs?

Michael can you answer on the progress of column-stride fields / how  
Jason's Tag index would apply?

>> Reopen then must only "materialize" any
>> buffered deletes by Term & Query, unless we decide to move up that
>> materialization into the actual delete call, since we will have
>> SegmentReaders open anyway.  I think I'm leaning towards that  
>> approach...
>> best to pay the cost as you go, instead of aggregated cost on reopen?
>
> I don't follow this part.  There is an IndexReader exposed from
> IndexWriter.  I think the individual SegmentReaders should be exposed
> as well, I don't see any reason not to and there are many cases where
> it has been frustrating that SegmentReaders are package protected.

Well, you ask IndexWriter for a reader.  It returns to you an  
IndexReader impl that under the hood is basically a MultiReader over a
bunch of SegmentReaders (already flushed to the index), plus the  
RAMReader.  We may want to expose access these sub-readers, but that's  
orthogonal I think?

> I am not sure from what you mentioned how the deletedDocs bitvector is
> handled.


I'm now thinking each SegmentReader holds its own deletedDocs as well  
as a pending deletedDocs (deletes that happened since the last  
reopen).  As deletes are done (by Query, Term or doc ID) in  
IndexWriter, they are synchronously materialized & recorded against  
the pending deletedDocs for each SegmentReader as well as the RAM  
deletedDocs (that apply to docs buffered in RAM).  When you reopen,  
the pending deletions are merged with the deletedDocs.
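
In very rough sketch form (java.util.BitSet standing in for our internal
BitVector, and with invented names), the bookkeeping per SegmentReader would
be something like:

import java.util.BitSet;

class SegmentDeletes {
  private BitSet deletedDocs = new BitSet();           // visible to current readers
  private final BitSet pendingDeletes = new BitSet();  // deletes since last reopen

  // IndexWriter materializes a delete-by-Term/Query into docIDs synchronously
  // and records each one here
  synchronized void deletePending(int docID) {
    pendingDeletes.set(docID);
  }

  // called at reopen: fold pending deletes in, producing the point-in-time
  // view the new reader will use
  synchronized BitSet reopen() {
    BitSet merged = (BitSet) deletedDocs.clone();
    merged.or(pendingDeletes);
    deletedDocs = merged;
    pendingDeletes.clear();
    return merged;
  }

  synchronized boolean isDeleted(int docID) {
    return deletedDocs.get(docID);
  }
}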

Mike


Re: Realtime Search for Social Networks Collaboration

Noble Paul നോബിള്‍  नोब्ळ्
In reply to this post by Jason Rutherglen
Moving back to the RDBMS model will be a big step backwards where we miss
multivalued fields and arbitrary fields.

On Tue, Sep 9, 2008 at 4:17 AM, Jason Rutherglen
<[hidden email]> wrote:

> Cool.  I mention H2 because it does have some Lucene code in it, yes.
> Also, according to some benchmarks, it's the fastest of the open source
> databases.  I think it's possible to integrate realtime search for H2.
> I suppose there is no need to store the data in Lucene in this case?
> One loses the multiple values per field Lucene offers, and the schema
> becomes static.  Perhaps it's a trade-off?
>
> On Mon, Sep 8, 2008 at 6:17 PM, J. Delgado <[hidden email]> wrote:
>> Yes, both Marcelo and I would be interested.
>>
>> We looked into H2 and it looks like something similar to Oracle's ODCI can
>> be implemented. Plus the primitive full-text implementation is based on
>> Lucene.
>> I say primitive because looking at the code I saw that one cannot define an
>> Analyzer, and for each scan corresponding to a where clause a searcher is
>> opened and closed, instead of having a pool; plus it does not have any way to
>> queue changes to reduce the use of the IndexWriter, etc.
>>
>> But its open source and that is a great starting point!
>>
>> -- Joaquin
>>
>> On Mon, Sep 8, 2008 at 2:05 PM, Jason Rutherglen
>> <[hidden email]> wrote:
>>>
>>> Perhaps an interesting project would be to integrate Ocean with H2
>>> www.h2database.com to take advantage of both models.  I'm not sure how
>>> exactly that would work, but it seems like it would not be too
>>> difficult.  Perhaps this would solve being able to perform faster
>>> hierarchical queries and perhaps other types of queries that Lucene is
>>> not capable of.
>>>
>>> Is this something Joaquin you are interested in collaborating on?  I
>>> am definitely interested in it.
>>>
>>> On Sun, Sep 7, 2008 at 4:04 AM, J. Delgado <[hidden email]>
>>> wrote:
>>> > On Sat, Sep 6, 2008 at 1:36 AM, Otis Gospodnetic
>>> > <[hidden email]> wrote:
>>> >>
>>> >> Regarding real-time search and Solr, my feeling is the focus should be
>>> >> on
>>> >> first adding real-time search to Lucene, and then we'll figure out how
>>> >> to
>>> >> incorporate that into Solr later.
>>> >
>>> >
>>> > Otis, what do you mean exactly by "adding real-time search to Lucene"?
>>> > Note that Lucene, being an indexing/search library (and not a full
>>> > blown search engine), is by definition "real-time": once you add/write
>>> > a document to the index it becomes immediately searchable, and if a
>>> > document is logically deleted it is no longer returned in a search,
>>> > though physical deletion happens during an index optimization.
>>> >
>>> > Now, the problem of adding/deleting documents in bulk, as part of a
>>> > transaction, and making these documents available for search
>>> > immediately after the transaction is committed sounds more like a
>>> > search engine problem (i.e. SOLR, Nutch, Ocean), especially if these
>>> > transactions are known to be I/O expensive and thus are usually
>>> > implemented as batched processes with some kind of sync mechanism,
>>> > which makes them non real-time.
>>> >
>>> > For example, in my previous life, I designed and helped implement a
>>> > quasi-realtime enterprise search engine using Lucene, having a set of
>>> > multi-threaded indexers hitting a set of multiple indexes allocated
>>> > across different search services which powered a broker-based
>>> > distributed search interface. The most recent documents provided to
>>> > the indexers were always added to the smaller in-memory (RAM) indexes,
>>> > which usually could absorb the load of a bulk "add" transaction and
>>> > later would be merged into larger disk-based indexes and then flushed
>>> > to make them ready to absorb new fresh docs. We even had further
>>> > partitioning of the indexes that reflected time periods, with caps on
>>> > size for them to be merged into older, more archive-based indexes
>>> > which were used less (yes, the search engine default search was on
>>> > data no more than 1 month old, though the user could open the time
>>> > window by including archives).
>>> >
>>> > As for SOLR and OCEAN, I would argue that these semi-structured search
>>> > engines are becoming more and more like relational databases with
>>> > full-text search capabilities (without the benefit of full relational
>>> > algebra -- for example, joins are not possible using SOLR). Notice that
>>> > "real-time" CRUD operations and transactionality are core DB concepts
>>> > and have been studied and developed by database communities for quite a
>>> > long time. There have been recent efforts on how to efficiently
>>> > integrate Lucene into relational databases (see Lucene JVM ORACLE
>>> > integration, see
>>> >
>>> > http://marceloochoa.blogspot.com/2007/09/running-lucene-inside-your-oracle-jvm.html)
>>> >
>>> > I think we should seriously look at joining efforts with open-source
>>> > database engine projects, written in Java (see
>>> > http://java-source.net/open-source/database-engines) in order to blend
>>> > IR and ORM for once and for all.
>>> >
>>> > -- Joaquin
>>> >
>>> >
>>> >>
>>> >> I've read Jason's Wiki as well.  Actually, I had to read it a number of
>>> >> times to understand bits and pieces of it.  I have to admit there is
>>> >> still
>>> >> some fuzziness about the whole thing in my head - is "Ocean" something
>>> >> that
>>> >> already works, a separate project on googlecode.com?  I think so.  If
>>> >> so,
>>> >> and if you are working on getting it integrated into Lucene, would it
>>> >> make
>>> >> it less confusing to just refer to it as "real-time search", so there
>>> >> is no
>>> >> confusion?
>>> >>
>>> >> If this is to be initially integrated into Lucene, why are things like
>>> >> replication, crowding/field collapsing, locallucene, name service, tag
>>> >> index, etc. all mentioned there on the Wiki and bundled with
>>> >> description of
>>> >> how real-time search works and is to be implemented?  I suppose
>>> >> mentioning
>>> >> replication kind-of makes sense because the replication approach is
>>> >> closely
>>> >> tied to real-time search - all query nodes need to see index changes
>>> >> fast.
>>> >>  But Lucene itself offers no replication mechanism, so maybe the
>>> >> replication
>>> >> is something to figure out separately, say on the Solr level, later on
>>> >> "once
>>> >> we get there".  I think even just the essential real-time search
>>> >> requires
>>> >> substantial changes to Lucene (I remember seeing large patches in
>>> >> JIRA),
>>> >> which makes it hard to digest, understand, comment on, and ultimately
>>> >> commit
>>> >> (hence the lukewarm response, I think).  Bringing other non-essential
>>> >> elements into discussion at the same time makes it more difficult to
>>> >> process all this new stuff, at least for me.  Am I the only one who
>>> >> finds this hard?
>>> >>
>>> >> That said, it sounds like we have some discussion going (Karl...), so I
>>> >> look forward to understanding more! :)
>>> >>
>>> >>
>>> >> Otis
>>> >> --
>>> >> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>> >>
>>> >>
>>> >>
>>> >> ----- Original Message ----
>>> >> > From: Yonik Seeley <[hidden email]>
>>> >> > To: [hidden email]
>>> >> > Sent: Thursday, September 4, 2008 10:13:32 AM
>>> >> > Subject: Re: Realtime Search for Social Networks Collaboration
>>> >> >
>>> >> > On Wed, Sep 3, 2008 at 6:50 PM, Jason Rutherglen
>>> >> > wrote:
>>> >> > > I also think it's got a
>>> >> > > lot of things now which makes integration difficult to do properly.
>>> >> >
>>> >> > I agree, and that's why the major bump in version number rather than
>>> >> > minor - we recognize that some features will need some amount of
>>> >> > rearchitecture.
>>> >> >
>>> >> > > I think the problem with integration with SOLR is it was designed
>>> >> > > with
>>> >> > > a different problem set in mind than Ocean, originally the CNET
>>> >> > > shopping application.
>>> >> >
>>> >> > That was the first use of Solr, but it actually existed before that
>>> >> > w/o any defined use other than to be a "plan B" alternative to MySQL
>>> >> > based search servers (that's actually where some of the parameter
>>> >> > names come from... the default /select URL instead of /search, the
>>> >> > "rows" parameter, etc).
>>> >> >
>>> >> > But you're right... some things like the replication strategy were
>>> >> > designed (well, borrowed from Doug to be exact) with the idea that it
>>> >> > would be OK to have slightly "stale" views of the data in the range
>>> >> > of
>>> >> > minutes.  It just made things easier/possible at the time.  But tons
>>> >> > of Solr and Lucene users want almost instantaneous visibility of
>>> >> > added
>>> >> > documents, if they can get it.  It's hardly restricted to social
>>> >> > network applications.
>>> >> >
>>> >> > Bottom line is that Solr aims to be a general enterprise search
>>> >> > platform, and getting as real-time as we can get, and as scalable as
>>> >> > we can get are some of the top priorities going forward.
>>> >> >
>>> >> > -Yonik
>>> >> >
>>> >> > ---------------------------------------------------------------------
>>> >> > To unsubscribe, e-mail: [hidden email]
>>> >> > For additional commands, e-mail: [hidden email]
>>> >>
>>> >>
>>> >> ---------------------------------------------------------------------
>>> >> To unsubscribe, e-mail: [hidden email]
>>> >> For additional commands, e-mail: [hidden email]
>>> >>
>>> >
>>> >
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [hidden email]
>>> For additional commands, e-mail: [hidden email]
>>>
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>



--
--Noble Paul


Re: Realtime Search for Social Networks Collaboration

Jason Rutherglen
Agreed, it's a system that is of value to a subset of cases.

On Sat, Sep 20, 2008 at 4:04 PM, Noble Paul നോബിള്‍ नोब्ळ्
<[hidden email]> wrote:

> Moving back to the RDBMS model will be a big step backwards where we miss
> multivalued fields and arbitrary fields.
>
> On Tue, Sep 9, 2008 at 4:17 AM, Jason Rutherglen
> <[hidden email]> wrote:
>> Cool.  I mention H2 because it does have some Lucene code in it yes.
>> Also according to some benchmarks it's the fastest of the open source
>> databases.  I think it's possible to integrate realtime search for H2.
>>  I suppose there is no need to store the data in Lucene in this case?
>> One loses the multiple values per field Lucene offers, and the schema
>> become static.  Perhaps it's a trade off?
>>
>> On Mon, Sep 8, 2008 at 6:17 PM, J. Delgado <[hidden email]> wrote:
>>> Yes, both Marcelo and I would be interested.
>>>
>>> We looked into H2 and it looks like something similar to Oracle's ODCI can
>>> be implemented. Plus the primitive full-text implementación is based on
>>> Lucene.
>>> I say primitive because looking at the code I saw that one cannot define an
>>> Analyzer and for each scan corresponding to a where clause a searcher is
>>> open and closed, instead of having a pool, plus it does not have any way to
>>> queue changes to reduce the use of the IndexWriter, etc.
>>>
>>> But its open source and that is a great starting point!
>>>
>>> -- Joaquin
>>>
>>> On Mon, Sep 8, 2008 at 2:05 PM, Jason Rutherglen
>>> <[hidden email]> wrote:
>>>>
>>>> Perhaps an interesting project would be to integrate Ocean with H2
>>>> www.h2database.com to take advantage of both models.  I'm not sure how
>>>> exactly that would work, but it seems like it would not be too
>>>> difficult.  Perhaps this would solve being able to perform faster
>>>> hierarchical queries and perhaps other types of queries that Lucene is
>>>> not capable of.
>>>>
>>>> Is this something Joaquin you are interested in collaborating on?  I
>>>> am definitely interested in it.
>>>>
>>>> On Sun, Sep 7, 2008 at 4:04 AM, J. Delgado <[hidden email]>
>>>> wrote:
>>>> > On Sat, Sep 6, 2008 at 1:36 AM, Otis Gospodnetic
>>>> > <[hidden email]> wrote:
>>>> >>
>>>> >> Regarding real-time search and Solr, my feeling is the focus should be
>>>> >> on
>>>> >> first adding real-time search to Lucene, and then we'll figure out how
>>>> >> to
>>>> >> incorporate that into Solr later.
>>>> >
>>>> >
>>>> > Otis, what do you mean exactly by "adding real-time search to Lucene"?
>>>> >  Note
>>>> > that Lucene, being a indexing/search library (and not a full blown
>>>> > search
>>>> > engine), is by definition "real-time": once you add/write a document to
>>>> > the
>>>> > index it becomes immediately searchable and if a document is logically
>>>> > deleted and no longer returned in a search, though physical deletion
>>>> > happens
>>>> > during an index optimization.
>>>> >
>>>> > Now, the problem of adding/deleting documents in bulk, as part of a
>>>> > transaction and making these documents available for search immediately
>>>> > after the transaction is commited sounds more like a search engine
>>>> > problem
>>>> > (i.e. SOLR, Nutch, Ocean), specially if these transactions are known to
>>>> > be
>>>> > I/O expensive and thus are usually implemented bached proceeses with
>>>> > some
>>>> > kind of sync mechanism, which makes them non real-time.
>>>> >
>>>> > For example, in my previous life, I designed and help implement a
>>>> > quasi-realtime enterprise search engine using Lucene, having a set of
>>>> > multi-threaded indexers hitting a set of multiple indexes alocatted
>>>> > accross
>>>> > different search services which powered a broker based distributed
>>>> > search
>>>> > interface. The most recent documents provided to the indexers were
>>>> > always
>>>> > added to the smaller in-memory (RAM) indexes which usually could absorbe
>>>> > the
>>>> > load of a bulk "add" transaction and later would be merged into larger
>>>> > disk
>>>> > based indexes and then flushed to make them ready to absorbe new fresh
>>>> > docs.
>>>> > We even had further partitioning of the indexes that reflected time
>>>> > periods
>>>> > with caps on size for them to be merged into older more archive based
>>>> > indexes which were used less (yes the search engine default search was
>>>> > on
>>>> > data no more than 1 month old, though user could open the time window by
>>>> > including archives).
>>>> >
>>>> > As for SOLR and OCEAN,  I would argue that these semi-structured search
>>>> > engines are becomming more and more like relational databases with
>>>> > full-text
>>>> > search capablities (without the benefit of full reletional algebra --
>>>> > for
>>>> > example joins are not possible using SOLR). Notice that "real-time" CRUD
>>>> > operations and transactionality are core DB concepts adn have been
>>>> > studied
>>>> > and developed by database communities for aquite long time. There has
>>>> > been
>>>> > recent efforts on how to effeciently integrate Lucene into releational
>>>> > databases (see Lucene JVM ORACLE integration, see
>>>> >
>>>> > http://marceloochoa.blogspot.com/2007/09/running-lucene-inside-your-oracle-jvm.html)
>>>> >
>>>> > I think we should seriously look at joining efforts with open-source
>>>> > Database engine projects, written in Java (see
>>>> > http://java-source.net/open-source/database-engines) in order to blend
>>>> > IR
>>>> > and ORM for once and for all.
>>>> >
>>>> > -- Joaquin
>>>> >
>>>> >
>>>> >>
>>>> >> I've read Jason's Wiki as well.  Actually, I had to read it a number of
>>>> >> times to understand bits and pieces of it.  I have to admit there is
>>>> >> still
>>>> >> some fuzziness about the whole things in my head - is "Ocean" something
>>>> >> that
>>>> >> already works, a separate project on googlecode.com?  I think so.  If
>>>> >> so,
>>>> >> and if you are working on getting it integrated into Lucene, would it
>>>> >> make
>>>> >> it less confusing to just refer to it as "real-time search", so there
>>>> >> is no
>>>> >> confusion?
>>>> >>
>>>> >> If this is to be initially integrated into Lucene, why are things like
>>>> >> replication, crowding/field collapsing, locallucene, name service, tag
>>>> >> index, etc. all mentioned there on the Wiki and bundled with
>>>> >> the description of
>>>> >> how real-time search works and is to be implemented?  I suppose
>>>> >> mentioning
>>>> >> replication kind-of makes sense because the replication approach is
>>>> >> closely
>>>> >> tied to real-time search - all query nodes need to see index changes
>>>> >> fast.
>>>> >>  But Lucene itself offers no replication mechanism, so maybe the
>>>> >> replication
>>>> >> is something to figure out separately, say on the Solr level, later on
>>>> >> "once
>>>> >> we get there".  I think even just the essential real-time search
>>>> >> requires
>>>> >> substantial changes to Lucene (I remember seeing large patches in
>>>> >> JIRA),
>>>> >> which makes it hard to digest, understand, comment on, and ultimately
>>>> >> commit
>>>> >> (hence the lukewarm response, I think).  Bringing other non-essential
>>>> >> elements into the discussion at the same time makes it more difficult to
>>>> >> process all this new stuff, at least for me.  Am I the only one who
>>>> >> finds
>>>> >> this hard?
>>>> >>
>>>> >> That said, it sounds like we have some discussion going (Karl...), so I
>>>> >> look forward to understanding more! :)
>>>> >>
>>>> >>
>>>> >> Otis
>>>> >> --
>>>> >> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>>> >>
>>>> >>
>>>> >>
>>>> >> ----- Original Message ----
>>>> >> > From: Yonik Seeley <[hidden email]>
>>>> >> > To: [hidden email]
>>>> >> > Sent: Thursday, September 4, 2008 10:13:32 AM
>>>> >> > Subject: Re: Realtime Search for Social Networks Collaboration
>>>> >> >
>>>> >> > On Wed, Sep 3, 2008 at 6:50 PM, Jason Rutherglen
>>>> >> > wrote:
>>>> >> > > I also think it's got a
>>>> >> > > lot of things now which makes integration difficult to do properly.
>>>> >> >
>>>> >> > I agree, and that's why the major bump in version number rather than
>>>> >> > minor - we recognize that some features will need some amount of
>>>> >> > rearchitecture.
>>>> >> >
>>>> >> > > I think the problem with integration with SOLR is it was designed
>>>> >> > > with
>>>> >> > > a different problem set in mind than Ocean, originally the CNET
>>>> >> > > shopping application.
>>>> >> >
>>>> >> > That was the first use of Solr, but it actually existed before that
>>>> >> > w/o any defined use other than to be a "plan B" alternative to MySQL
>>>> >> > based search servers (that's actually where some of the parameter
>>>> >> > names come from... the default /select URL instead of /search, the
>>>> >> > "rows" parameter, etc).
>>>> >> >
>>>> >> > But you're right... some things like the replication strategy were
>>>> >> > designed (well, borrowed from Doug to be exact) with the idea that it
>>>> >> > would be OK to have slightly "stale" views of the data in the range
>>>> >> > of
>>>> >> > minutes.  It just made things easier/possible at the time.  But tons
>>>> >> > of Solr and Lucene users want almost instantaneous visibility of
>>>> >> > added
>>>> >> > documents, if they can get it.  It's hardly restricted to social
>>>> >> > network applications.
>>>> >> >
>>>> >> > Bottom line is that Solr aims to be a general enterprise search
>>>> >> > platform, and getting as real-time as we can get, and as scalable as
>>>> >> > we can get are some of the top priorities going forward.
>>>> >> >
>>>> >> > -Yonik
>>>> >> >
>>>> >> > ---------------------------------------------------------------------
>>>> >> > To unsubscribe, e-mail: [hidden email]
>>>> >> > For additional commands, e-mail: [hidden email]
>>>> >>
>>>> >>
>>>> >> ---------------------------------------------------------------------
>>>> >> To unsubscribe, e-mail: [hidden email]
>>>> >> For additional commands, e-mail: [hidden email]
>>>> >>
>>>> >
>>>> >
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: [hidden email]
>>>> For additional commands, e-mail: [hidden email]
>>>>
>>>
>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [hidden email]
>> For additional commands, e-mail: [hidden email]
>>
>>
>
>
>
> --
> --Noble Paul
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>
Reply | Threaded
Open this post in threaded view
|

Re: Realtime Search for Social Networks Collaboration

J. Delgado
In reply to this post by Noble Paul നോബിള്‍ नोब्ळ्
On Sat, Sep 20, 2008 at 1:04 PM, Noble Paul നോബിള്‍ नोब्ळ् <[hidden email]> wrote:
Moving back to the RDBMS model will be a big step backwards where we miss
multivalued fields and arbitrary fields.

No one is suggesting to "lose" any of the virtues of the field-based indexing that Lucene provides. Quite the contrary: by extending the RDBMS model with Lucene-based indexes one can map relational rows to documents and columns to fields. Note that one relational column can be mapped to one or more text-based fields, and multi-valued fields will still be allowed.
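
To make that mapping concrete, here is a minimal, purely illustrative sketch (assuming Lucene 2.4-style APIs; the column names, the comma-separated "tags" column, and the indexRow helper are inventions for the example, not part of any existing integration):

import java.sql.ResultSet;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

// One relational row becomes one Lucene Document; columns become fields.
void indexRow(ResultSet rs, IndexWriter writer) throws Exception {
  Document doc = new Document();
  doc.add(new Field("id", rs.getString("id"), Field.Store.YES, Field.Index.NOT_ANALYZED));
  doc.add(new Field("title", rs.getString("title"), Field.Store.YES, Field.Index.ANALYZED));
  // A single relational column can still map to a multi-valued Lucene field:
  for (String tag : rs.getString("tags").split(",")) {
    doc.add(new Field("tag", tag.trim(), Field.Store.YES, Field.Index.NOT_ANALYZED));
  }
  writer.addDocument(doc);
}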

Please check the Lucene OJVM implementation for details on the implementation and philosophy of the RDBMS-Lucene converged model:

http://docs.google.com/Doc?id=ddgw7sjp_54fgj9kg

There is more discussion on Marcelo's blog; he will be presenting at Oracle World 2008 this week.
http://marceloochoa.blogspot.com/

BTW, it just happens that this was implemented using Oracle, but a similar implementation in H2 seems not only feasible but desirable.

-- Joaquin




On Tue, Sep 9, 2008 at 4:17 AM, Jason Rutherglen
<[hidden email]> wrote:
> Cool.  I mention H2 because it does have some Lucene code in it, yes.
> Also, according to some benchmarks it's the fastest of the open-source
> databases.  I think it's possible to integrate realtime search for H2.
>  I suppose there is no need to store the data in Lucene in this case?
> One loses the multiple values per field Lucene offers, and the schema
> becomes static.  Perhaps it's a trade-off?
>
> On Mon, Sep 8, 2008 at 6:17 PM, J. Delgado <[hidden email]> wrote:
>> Yes, both Marcelo and I would be interested.
>>
>> We looked into H2 and it looks like something similar to Oracle's ODCI can
>> be implemented. Plus, the primitive full-text implementation is based on
>> Lucene.
>> I say primitive because, looking at the code, I saw that one cannot define an
>> Analyzer, and for each scan corresponding to a WHERE clause a searcher is
>> opened and closed instead of having a pool; plus it does not have any way to
>> queue changes to reduce the use of the IndexWriter, etc.
>>
>> But it's open source and that is a great starting point!
>>
>> -- Joaquin
>>
>> On Mon, Sep 8, 2008 at 2:05 PM, Jason Rutherglen
>> <[hidden email]> wrote:
>>>
>>> Perhaps an interesting project would be to integrate Ocean with H2
>>> www.h2database.com to take advantage of both models.  I'm not sure how
>>> exactly that would work, but it seems like it would not be too
>>> difficult.  Perhaps this would make it possible to perform faster
>>> hierarchical queries, and perhaps other types of queries that Lucene is
>>> not capable of.
>>>
>>> Is this something, Joaquin, that you are interested in collaborating on?  I
>>> am definitely interested in it.
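
Since the searcher lifecycle came up in the quoted messages above (a searcher opened and closed per WHERE-clause scan, and no queuing of changes for the IndexWriter), here is a rough sketch of the kind of pooled/queued handling being described, again assuming Lucene 2.4-era APIs; none of this is H2's actual code, and the class and method names are made up:

import java.io.IOException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;

// One long-lived writer per index, queued adds flushed in batches, and a shared
// searcher that is reopened after a flush instead of once per scan.
class BatchedLuceneIndex {
  private final Directory dir;
  private final IndexWriter writer;
  private volatile IndexSearcher searcher;
  private final BlockingQueue<Document> pending = new LinkedBlockingQueue<Document>();

  BatchedLuceneIndex(Directory dir, Analyzer analyzer) throws IOException {
    this.dir = dir;
    this.writer = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
    this.writer.commit();                      // make sure an index exists to search
    this.searcher = new IndexSearcher(dir);
  }

  void enqueue(Document doc) { pending.add(doc); }   // cheap, called per changed row

  // Called periodically or when the queue grows large, not once per statement.
  synchronized void flush() throws IOException {
    Document doc;
    while ((doc = pending.poll()) != null) writer.addDocument(doc);
    writer.commit();
    IndexSearcher old = searcher;
    searcher = new IndexSearcher(dir);         // refresh the shared view once per flush
    old.close();
  }

  IndexSearcher searcher() { return searcher; }      // reused across scans
}
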

Reply | Threaded
Open this post in threaded view
|

Re: Realtime Search for Social Networks Collaboration

J. Delgado
Sorry, I meant "loose" (replacing "lose")



Reply | Threaded
Open this post in threaded view
|

Re: Realtime Search for Social Networks Collaboration

J. Delgado
Please ignore the correction... "lose" is fine:-)


