Realtime Search for Social Networks Collaboration


Re: Realtime Search for Social Networks Collaboration

Jason Rutherglen
Hi Joaquin,

Using HBase with realtime Lucene would be in line with what Google
does.  However, the question is whether this is really necessary or
the simplest approach.  That can probably only be answered by doing a
live comparison of the two!  Unfortunately that would probably require
quite a bit of work and resources.  For now, Ocean stores the data in
the Lucene indexes because it works and it's easy to implement.  I
have looked at other options, however they need to be prioritized in
terms of need vs. cost.  I would put the HBase solution at the high
end of the resource scale.  I think it's usually best to keep things
as simple and as cheap as possible.  More complexity in a scalable
realtime search solution would mean more people, more expertise, and
more possibilities for breakage.  It would need to be clear what HBase
or other solutions for storing the data bring to the table, which,
because I don't have time to look at them, I cannot answer.
Nonetheless it is somewhat interesting.

Cheers,
Jason Rutherglen

On Sun, Sep 7, 2008 at 11:16 AM, J. Delgado <[hidden email]> wrote:

> On Sun, Sep 7, 2008 at 2:41 AM, mark harwood <[hidden email]>
> wrote:
>
>> >>for example joins are not possible using SOLR).
>>
>> It's largely *because* Lucene doesn't do joins that it can be made to
>> scale out. I've replaced two large-scale database systems this year with
>> distributed Lucene solutions because this scale-out architecture provided
>> significantly better performance. These were "semi-structured" systems too.
>> Lucene's comparatively simplistic data model/query model is both a weakness
>> and a strength in this regard.
>
>  Hey, maybe the right way to go for a truly scalable and high performance
> semi-structured database is to marry HBase (Big-table like data storage)
> with SOLR/Lucene. I concur with you in the sense that simplistic data models
> coupled with high performance are the killer combination.
>
> Let me quote this from the original Bigtable paper from Google:
>
> " Bigtable does not support a full relational data model; instead, it
> provides clients with a simple data model that supports dynamic control over
> data layout and format, and allows clients to reason about the locality
> properties of the data represented in the underlying storage. Data is
> indexed using row and column names that can be arbitrary strings. Bigtable
> also treats data as uninterpreted strings, although clients often serialize
> various forms of structured and semi-structured data into these strings.
> Clients can control the locality of their data through careful choices in
> their schemas. Finally, Bigtable schema parameters let clients dynamically
> control whether to serve data out of memory or from disk."
>



Re: Realtime Search for Social Networks Collaboration

Ning Li-3
Hi,

We experimented using HBase's scalable infrastructure to scale out Lucene:
http://www.mail-archive.com/hbase-user@.../msg01143.html

There is a concern about the impact of HDFS's random read performance
on Lucene search performance. And we can discuss whether HBase's
architecture is best for scaling out Lucene. But to me, the general idea
of reusing a scalable infrastructure (if a suitable one exists) is
appealing - such an infrastructure already handles repartitioning for
scalability, fault tolerance, etc.

I agree with Otis that the first step for Lucene is probably to
support real-time search. The instantiated index in contrib seems to
be something close...

Cheers,
Ning



Re: Realtime Search for Social Networks Collaboration

Mark Miller-3
Ning Li wrote:
>
> I agree with Otis that the first step for Lucene is probably to
> support real-time
> search. The instantiated index in contrib seems to be something close..
Maybe we should start fleshing out what we want in realtime search on
the wiki?

Could it be as simple as making InstantiatedIndex realtime (allowing
writes/reads at the same time)? Then you could search over your IndexReader
as well as the InstantiatedIndex. Writes go to both the Writer and the
InstantiatedIndex. Nothing is actually permanent until the true commit,
but stuff is visible pretty fast...a new IndexReader view starts a fresh
InstantiatedIndex...

Jason's realtime patch is still pretty large...it would be nice if we could
accomplish this with as few changes as possible...
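
For illustration, here is a minimal sketch of that dual-write idea, assuming
Lucene 2.x core plus the contrib instantiated package (InstantiatedIndex,
InstantiatedIndexWriter, InstantiatedIndexReader); exact constructors and
method names may differ slightly from what contrib actually exposes:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.store.instantiated.InstantiatedIndex;
import org.apache.lucene.store.instantiated.InstantiatedIndexReader;
import org.apache.lucene.store.instantiated.InstantiatedIndexWriter;

public class DualWriteSketch {
  public static void main(String[] args) throws Exception {
    Directory dir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);

    // The realtime side: an in-memory index that absorbs the same writes.
    InstantiatedIndex ramIndex = new InstantiatedIndex();
    InstantiatedIndexWriter ramWriter = new InstantiatedIndexWriter(ramIndex);

    Document doc = new Document();
    // ... add fields ...

    // Every write goes to both; nothing is permanent until the true commit.
    writer.addDocument(doc);
    ramWriter.addDocument(doc);
    ramWriter.commit();  // visible to newly opened RAM readers

    // Search the committed segments and the uncommitted RAM docs together.
    IndexReader committed = IndexReader.open(dir);
    IndexReader ram = new InstantiatedIndexReader(ramIndex);
    IndexSearcher searcher = new IndexSearcher(
        new MultiReader(new IndexReader[] { committed, ram }));

    // After the true commit, a new IndexReader view would start a fresh
    // InstantiatedIndex, as described above, so docs aren't seen twice.
  }
}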



Re: Realtime Search for Social Networks Collaboration

Jason Rutherglen
InstantiatedIndex isn't quite realtime.  Instead a new
InstantiatedIndex is created per transaction in Ocean and managed
thereafter.  This, however, is fairly easy to build and could offer
realtime in Lucene without adding the transaction logging.  It would
be good to find out what scope is acceptable for a Lucene core version
of realtime.  Perhaps this basic feature set is good enough.



Re: Realtime Search for Social Networks Collaboration

Michael McCandless-2

I'd also like to try to make time to explore the approach of creating an
IndexReader impl. that searches IndexWriter's RAM buffer.

I think it's quite feasible, but it'd still have a "reopen" cost in
that any buffered delete by term or query would have to be
"materialized" into docIDs on reopen.  Though, if this somehow turns
out to be a problem, in the future we could do this materializing
immediately, instead of buffering, if we already have a reader open.

Flushing is somewhat tricky because any open RAM readers would then
have to cut over to the newly flushed segment once the flush completes,
so that the RAM buffer can be recycled for the next segment.

Mike



Re: Realtime Search for Social Networks Collaboration

Yonik Seeley-2
On Mon, Sep 8, 2008 at 12:33 PM, Michael McCandless
<[hidden email]> wrote:
> I'd also trying to make time to explore the approach of creating an
> IndexReader impl. that searches IndexWriter's RAM buffer.

That seems like it could possibly be the best performing approach in
the long run.

> I think it's quite feasible, but it'd still have a "reopen" cost in that
> any buffered delete by term or query would have to be "materialized" into
> docIDs on reopen.  Though, if this somehow turns out to be a problem, in the
> future we could do this materializing immediately, instead of buffering, if
> we already have a reader open.

Right... it seems like re-using readers internally is something we
could already be doing in IndexWriter.


> Flushing is somewhat tricky because any open RAM readers would then have to
> cut over to the newly flushed segment once the flush completes, so that the
> RAM buffer can be recycled for the next segment.

Re-use of a RAM buffer doesn't seem like such a big deal.

But, how would you maintain a static view of an index...?

IndexReader r1 = indexWriter.getCurrentIndex();
indexWriter.addDocument(...);
IndexReader r2 = indexWriter.getCurrentIndex();

I assume r1 will have a view of the index before the document was
added, and r2 after?

Another thing that will help is if users could get their hands on the
sub-readers of a multi-segment reader.  Right now that is hidden in
MultiSegmentReader and makes updating anything incrementally
difficult.

-Yonik



Re: Realtime Search for Social Networks Collaboration

Karl Wettin
In reply to this post by Jason Rutherglen
I need to point out that the only thing I know InstantiatedIndex to be
great at is read access to the inverted index. It consumes a lot more
heap than RAMDirectory, and InstantiatedIndexWriter is slightly less
efficient than IndexWriter.

Please let me know if your experience differs from the above statement.



Re: Realtime Search for Social Networks Collaboration

Michael McCandless-2
In reply to this post by Yonik Seeley-2

Yonik Seeley wrote:

>> I think it's quite feasible, but it'd still have a "reopen" cost in that
>> any buffered delete by term or query would have to be "materialized" into
>> docIDs on reopen.  Though, if this somehow turns out to be a problem, in the
>> future we could do this materializing immediately, instead of buffering, if
>> we already have a reader open.
>
> Right... it seems like re-using readers internally is something we
> could already be doing in IndexWriter.

True.

>> Flushing is somewhat tricky because any open RAM readers would then have to
>> cut over to the newly flushed segment once the flush completes, so that the
>> RAM buffer can be recycled for the next segment.
>
> Re-use of a RAM buffer doesn't seem like such a big deal.
>
> But, how would you maintain a static view of an index...?
>
> IndexReader r1 = indexWriter.getCurrentIndex();
> indexWriter.addDocument(...);
> IndexReader r2 = indexWriter.getCurrentIndex();
>
> I assume r1 will have a view of the index before the document was
> added, and r2 after?

Right, getCurrentIndex would return a MultiReader that includes a
SegmentReader for each segment in the index, plus a "RAMReader" that
searches the RAM buffer.  That RAMReader is a tiny shell class that
would basically just record the max docID it's allowed to go up to
(the docID as of when it was opened), and stop enumerating docIDs
(e.g. in the TermDocs) when it hits a docID beyond that limit.
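
As a rough sketch of that capping trick (a hypothetical wrapper, not part
of Lucene; it assumes the 2.x TermDocs interface):

import java.io.IOException;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;

/**
 * Wraps a real TermDocs and stops enumerating once it passes the docID
 * snapshot recorded when the reader was opened.
 */
public class CappedTermDocs implements TermDocs {
  private final TermDocs in;
  private final int maxDoc;  // docID limit as of when the reader was opened

  public CappedTermDocs(TermDocs in, int maxDoc) {
    this.in = in;
    this.maxDoc = maxDoc;
  }

  // DocIDs are enumerated in increasing order, so anything >= maxDoc is
  // "after" the snapshot and must not be returned.
  public boolean next() throws IOException {
    return in.next() && in.doc() < maxDoc;
  }

  public boolean skipTo(int target) throws IOException {
    return in.skipTo(target) && in.doc() < maxDoc;
  }

  public int read(int[] docs, int[] freqs) throws IOException {
    int n = in.read(docs, freqs);
    int keep = 0;
    while (keep < n && docs[keep] < maxDoc) keep++;  // truncate past snapshot
    return keep;
  }

  public int doc() { return in.doc(); }
  public int freq() { return in.freq(); }
  public void seek(Term term) throws IOException { in.seek(term); }
  public void seek(TermEnum termEnum) throws IOException { in.seek(termEnum); }
  public void close() throws IOException { in.close(); }
}

Aggregate statistics such as docFreq would need similar treatment, as Yonik
asks further down-thread.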

For reading stored fields and term vectors, which are now flushed  
immediately to disk, we need to somehow get an IndexInput from the  
IndexOutputs that IndexWriter holds open on these files.  Or, maybe,  
just open new IndexInputs?

> Another thing that will help is if users could get their hands on the
> sub-readers of a multi-segment reader.  Right now that is hidden in
> MultiSegmentReader and makes updating anything incrementally
> difficult.

Besides what's handled by MultiSegmentReader.reopen already, what else  
do you need to incrementally update?

Mike



Re: Realtime Search for Social Networks Collaboration

Ning Li-3
In reply to this post by Yonik Seeley-2
On Mon, Sep 8, 2008 at 2:43 PM, Yonik Seeley <[hidden email]> wrote:
> But, how would you maintain a static view of an index...?
>
> IndexReader r1 = indexWriter.getCurrentIndex();
> indexWriter.addDocument(...);
> IndexReader r2 = indexWriter.getCurrentIndex();
>
> I assume r1 will have a view of the index before the document was
> added, and r2 after?

I thought an index reader which supports real-time search no longer
maintains a static view of an index? Similar to InstantiatedIndexReader,
it will be in sync with an index writer.

IndexReader r = indexWriter.getIndexReader();

getIndexReader() (i.e. the real-time index reader) returns the same
reader instance for a given writer instance.

On Mon, Sep 8, 2008 at 12:33 PM, Michael McCandless
<[hidden email]> wrote:
> Flushing is somewhat tricky because any open RAM readers would then have to
> cut over to the newly flushed segment once the flush completes, so that the
> RAM buffer can be recycled for the next segment.

Now this won't be a problem any more.

Cheers,
Ning



Re: Realtime Search for Social Networks Collaboration

Yonik Seeley-2
On Mon, Sep 8, 2008 at 3:56 PM, Ning Li <[hidden email]> wrote:

> On Mon, Sep 8, 2008 at 2:43 PM, Yonik Seeley <[hidden email]> wrote:
>> But, how would you maintain a static view of an index...?
>>
>> IndexReader r1 = indexWriter.getCurrentIndex();
>> indexWriter.addDocument(...);
>> IndexReader r2 = indexWriter.getCurrentIndex();
>>
>> I assume r1 will have a view of the index before the document was
>> added, and r2 after?
>
> I thought an index reader which supports real-time search no longer
> maintains a static view of an index?

It seems advantageous to just make it really cheap to get a new view
of the index (if you do it for every search, it amounts to the same
thing, right?).  Quite a bit of code in Lucene assumes a static view of
the index, I think (even IndexSearcher), and it's nice to have a stable
index view for the duration of a single request.

> Similar to InstantiatedIndexReader,
> it will be in sync with an index writer.

Right... that's why I was clarifying.  You can still make stable views
of the index with multiple InstantiatedIndex instances, but it doesn't
seem as efficient.

-Yonik



Re: Realtime Search for Social Networks Collaboration

Jason Rutherglen
In reply to this post by Karl Wettin
That sounds about right, and I don't think it matters much.  By
default I cap the number of documents stored in each InstantiatedIndex
at 100, so the heap size doesn't become a problem.
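
A tiny sketch of that rollover policy (a hypothetical helper, not Ocean's
actual code; the contrib instantiated method names are assumed):

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.store.instantiated.InstantiatedIndex;
import org.apache.lucene.store.instantiated.InstantiatedIndexWriter;

/** Caps each in-memory index at a fixed doc count so heap stays bounded. */
public class RollingRamIndex {
  private static final int MAX_DOCS = 100;  // the default cap mentioned above
  private InstantiatedIndex index;
  private InstantiatedIndexWriter writer;
  private int docCount;

  public RollingRamIndex() throws IOException { roll(); }

  private void roll() throws IOException {
    index = new InstantiatedIndex();              // fresh index; the old one
    writer = new InstantiatedIndexWriter(index);  // stays searchable until dropped
    docCount = 0;
  }

  public synchronized InstantiatedIndex add(Document doc) throws IOException {
    if (docCount >= MAX_DOCS) roll();  // start a new index past the cap
    writer.addDocument(doc);
    writer.commit();                   // visible to newly opened readers
    docCount++;
    return index;  // caller opens an InstantiatedIndexReader over this
  }
}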



Re: Realtime Search for Social Networks Collaboration

Jason Rutherglen
In reply to this post by Michael McCandless-2
What about the term dictionary?  I'm curious how that would be solved.



Re: Realtime Search for Social Networks Collaboration

Yonik Seeley-2
In reply to this post by Michael McCandless-2
On Mon, Sep 8, 2008 at 3:04 PM, Michael McCandless
<[hidden email]> wrote:
> Right, getCurrentIndex would return a MultiReader that includes
> SegmentReader for each segment in the index, plus a "RAMReader" that
> searches the RAM buffer.  That RAMReader is a tiny shell class that would
> basically just record the max docID it's allowed to go up to (the docID as
> of when it was opened), and stop enumerating docIDs (eg in the TermDocs)
> when it hits a docID beyond that limit.

What about something like term freq?  Would it need to count the
number of docs after the local maxDoc or is there a better way?

> For reading stored fields and term vectors, which are now flushed
> immediately to disk, we need to somehow get an IndexInput from the
> IndexOutputs that IndexWriter holds open on these files.  Or, maybe, just
> open new IndexInputs?

Hmmm, seems like a case of our nice and simple Directory model not
having quite enough features in this case.

>> Another thing that will help is if users could get their hands on the
>> sub-readers of a multi-segment reader.  Right now that is hidden in
>> MultiSegmentReader and makes updating anything incrementally
>> difficult.
>
> Besides what's handled by MultiSegmentReader.reopen already, what else do
> you need to incrementally update?

Anything that you want to incrementally update and that uses an
IndexReader as a key.  Mostly caches, I would think... Solr has
user-level (application-specific) caches, faceting caches, etc.
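
For illustration, a sketch of that kind of incremental cache, assuming the
sub-readers were exposed: key entries per (sub-)reader so that reopening a
multi-segment reader only recomputes entries for the segments that actually
changed (hypothetical class, not Solr's actual cache code):

import java.io.IOException;
import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;
import org.apache.lucene.index.IndexReader;

/** Caches one value per (sub-)reader; entries die with their readers. */
public class PerReaderCache<V> {
  public interface Loader<V> {
    V load(IndexReader reader) throws IOException;
  }

  private final Map<IndexReader, V> cache =
      Collections.synchronizedMap(new WeakHashMap<IndexReader, V>());

  public V get(IndexReader reader, Loader<V> loader) throws IOException {
    V value = cache.get(reader);
    if (value == null) {  // only new or changed segments pay the load cost
      value = loader.load(reader);
      cache.put(reader, value);
    }
    return value;
  }
}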

-Yonik



Re: Realtime Search for Social Networks Collaboration

Jason Rutherglen
In reply to this post by J. Delgado
Perhaps an interesting project would be to integrate Ocean with H2
(www.h2database.com) to take advantage of both models.  I'm not sure how
exactly that would work, but it seems like it would not be too
difficult.  Perhaps this would make it possible to perform faster
hierarchical queries, and other types of queries that Lucene is
not capable of.

Joaquin, is this something you are interested in collaborating on?  I
am definitely interested in it.

On Sun, Sep 7, 2008 at 4:04 AM, J. Delgado <[hidden email]> wrote:

> On Sat, Sep 6, 2008 at 1:36 AM, Otis Gospodnetic
> <[hidden email]> wrote:
>>
>> Regarding real-time search and Solr, my feeling is the focus should be on
>> first adding real-time search to Lucene, and then we'll figure out how to
>> incorporate that into Solr later.
>
>
> Otis, what do you mean exactly by "adding real-time search to Lucene"?  Note
> that Lucene, being an indexing/search library (and not a full-blown search
> engine), is by definition "real-time": once you add/write a document to the
> index it becomes immediately searchable, and a deleted document is logically
> removed and no longer returned in searches, though physical deletion happens
> during an index optimization.
>
> Now, the problem of adding/deleting documents in bulk, as part of a
> transaction, and making these documents available for search immediately
> after the transaction is committed sounds more like a search engine problem
> (i.e. SOLR, Nutch, Ocean), especially if these transactions are known to be
> I/O expensive and thus are usually implemented as batched processes with
> some kind of sync mechanism, which makes them non-real-time.
>
> For example, in my previous life, I designed and helped implement a
> quasi-realtime enterprise search engine using Lucene, having a set of
> multi-threaded indexers hitting a set of multiple indexes allocated across
> different search services, which powered a broker-based distributed search
> interface. The most recent documents provided to the indexers were always
> added to the smaller in-memory (RAM) indexes, which usually could absorb
> the load of a bulk "add" transaction and later would be merged into larger
> disk-based indexes and then flushed to make them ready to absorb fresh new
> docs. We even had further partitioning of the indexes that reflected time
> periods, with caps on size, for them to be merged into older, more
> archival indexes which were used less (yes, the search engine's default
> search was on data no more than 1 month old, though the user could widen
> the time window by including archives).
>
> As for SOLR and OCEAN, I would argue that these semi-structured search
> engines are becoming more and more like relational databases with full-text
> search capabilities (without the benefit of full relational algebra -- for
> example, joins are not possible using SOLR). Notice that "real-time" CRUD
> operations and transactionality are core DB concepts and have been studied
> and developed by database communities for quite a long time. There have
> been recent efforts to efficiently integrate Lucene into relational
> databases (see the Lucene JVM Oracle integration:
> http://marceloochoa.blogspot.com/2007/09/running-lucene-inside-your-oracle-jvm.html)
>
> I think we should seriously look at joining efforts with open-source
> database engine projects written in Java (see
> http://java-source.net/open-source/database-engines) in order to blend IR
> and ORM once and for all.
>
> -- Joaquin
>
>
>>
>> I've read Jason's Wiki as well.  Actually, I had to read it a number of
>> times to understand bits and pieces of it.  I have to admit there is still
>> some fuzziness about the whole thing in my head - is "Ocean" something that
>> already works, a separate project on googlecode.com?  I think so.  If so,
>> and if you are working on getting it integrated into Lucene, would it make
>> it less confusing to just refer to it as "real-time search", so there is no
>> confusion?
>>
>> If this is to be initially integrated into Lucene, why are things like
>> replication, crowding/field collapsing, locallucene, name service, tag
>> index, etc. all mentioned there on the Wiki and bundled with description of
>> how real-time search works and is to be implemented?  I suppose mentioning
>> replication kind-of makes sense because the replication approach is closely
>> tied to real-time search - all query nodes need to see index changes fast.
>>  But Lucene itself offers no replication mechanism, so maybe the replication
>> is something to figure out separately, say on the Solr level, later on "once
>> we get there".  I think even just the essential real-time search requires
>> substantial changes to Lucene (I remember seeing large patches in JIRA),
>> which makes it hard to digest, understand, comment on, and ultimately commit
>> (hence the lukewarm response, I think).  Bringing other non-essential
>> elements into discussion at the same time makes it more difficult to
>> process all this new stuff, at least for me.  Am I the only one who finds
>> this hard?
>>
>> That said, it sounds like we have some discussion going (Karl...), so I
>> look forward to understanding more! :)
>>
>>
>> Otis
>> --
>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>
>>
>>
>> ----- Original Message ----
>> > From: Yonik Seeley <[hidden email]>
>> > To: [hidden email]
>> > Sent: Thursday, September 4, 2008 10:13:32 AM
>> > Subject: Re: Realtime Search for Social Networks Collaboration
>> >
>> > On Wed, Sep 3, 2008 at 6:50 PM, Jason Rutherglen
>> > wrote:
>> > > I also think it's got a
>> > > lot of things now which makes integration difficult to do properly.
>> >
>> > I agree, and that's why the major bump in version number rather than
>> > minor - we recognize that some features will need some amount of
>> > rearchitecture.
>> >
>> > > I think the problem with integration with SOLR is it was designed with
>> > > a different problem set in mind than Ocean, originally the CNET
>> > > shopping application.
>> >
>> > That was the first use of Solr, but it actually existed before that
>> > w/o any defined use other than to be a "plan B" alternative to MySQL
>> > based search servers (that's actually where some of the parameter
>> > names come from... the default /select URL instead of /search, the
>> > "rows" parameter, etc).
>> >
>> > But you're right... some things like the replication strategy were
>> > designed (well, borrowed from Doug to be exact) with the idea that it
>> > would be OK to have slightly "stale" views of the data in the range of
>> > minutes.  It just made things easier/possible at the time.  But tons
>> > of Solr and Lucene users want almost instantaneous visibility of added
>> > documents, if they can get it.  It's hardly restricted to social
>> > network applications.
>> >
>> > Bottom line is that Solr aims to be a general enterprise search
>> > platform, and getting as real-time as we can get, and as scalable as
>> > we can get are some of the top priorities going forward.
>> >
>> > -Yonik
>> >


Re: Realtime Search for Social Networks Collaboration

J. Delgado
Yes, both Marcelo and I would be interested.

We looked into H2, and it looks like something similar to Oracle's ODCI can be implemented. Plus, the primitive full-text implementation is based on Lucene.
I say primitive because, looking at the code, I saw that one cannot define an Analyzer, and for each scan corresponding to a WHERE clause a searcher is opened and closed instead of being taken from a pool; it also has no way to queue changes to reduce the use of the IndexWriter, etc.

But it's open source, and that is a great starting point!

-- Joaquin
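
For illustration, a minimal sketch of the two fixes Joaquin describes, in
plain Lucene 2.x (the class and method names here are hypothetical, not H2's
actual fulltext code): share one searcher across scans, and queue changes so
the IndexWriter is touched in batches rather than per row.

import java.io.IOException;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;

public class PooledFulltext {
  private final Directory dir;
  private final IndexWriter writer;
  private final Queue<Document> pending = new ConcurrentLinkedQueue<Document>();
  private IndexSearcher searcher;

  public PooledFulltext(Directory dir, IndexWriter writer) {
    this.dir = dir;
    this.writer = writer;
  }

  /** One shared searcher for all WHERE-clause scans, not one per scan. */
  public synchronized IndexSearcher searcher() throws IOException {
    if (searcher == null) {
      searcher = new IndexSearcher(IndexReader.open(dir));
    }
    return searcher;
  }

  /** Row changes queue up here instead of hitting the writer directly. */
  public void enqueue(Document doc) {
    pending.add(doc);
  }

  /** Drain the queue in one batch, then refresh the shared searcher. */
  public synchronized void drain() throws IOException {
    Document doc;
    while ((doc = pending.poll()) != null) {
      writer.addDocument(doc);
    }
    // a flush/commit appropriate to the Lucene version would go here
    if (searcher != null) searcher.close();
    searcher = new IndexSearcher(IndexReader.open(dir));
  }
}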




Re: Realtime Search for Social Networks Collaboration

Jason Rutherglen
Cool.  I mention H2 because it does have some Lucene code in it, yes.
Also, according to some benchmarks, it's the fastest of the open source
databases.  I think it's possible to integrate realtime search with H2.
I suppose there is no need to store the data in Lucene in this case?
One loses the multiple values per field that Lucene offers, and the
schema becomes static.  Perhaps it's a trade-off?



Re: Realtime Search for Social Networks Collaboration

Marcelo F. Ochoa
Hi:
Integrating Lucene into an RDBMS has two separate concerns:
  - Integrating it as an index, so that it receives notifications when a
row changes and the optimizer can choose the right execution plan based
on the index statistics.
  - Replacing Lucene's file system store, to align database changes with
Lucene changes; that is, both should be part of one transaction.
For H2, the first point seems viable to implement with more or less
effort; for the second, I don't know how H2 manages BLOB storage.
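
For the first concern on H2, a minimal sketch of what the notification side
might look like, assuming H2's org.h2.api.Trigger interface (init/fire, plus
close/remove in later versions) and plain Lucene 2.x; the table layout
(primary key in column 0, text in column 1) and the writerFor() helper are
hypothetical:

import java.sql.Connection;
import java.sql.SQLException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.h2.api.Trigger;

public class LuceneIndexTrigger implements Trigger {
  private String tableName;

  public void init(Connection conn, String schemaName, String triggerName,
                   String tableName, boolean before, int type) {
    this.tableName = tableName;  // say, one Lucene index per indexed table
  }

  public void fire(Connection conn, Object[] oldRow, Object[] newRow)
      throws SQLException {
    if (newRow != null) {               // insert or update: (re)index the row
      Document doc = new Document();
      doc.add(new Field("rowid", String.valueOf(newRow[0]),
                        Field.Store.YES, Field.Index.UN_TOKENIZED));
      doc.add(new Field("body", String.valueOf(newRow[1]),
                        Field.Store.NO, Field.Index.TOKENIZED));
      // writerFor(tableName).updateDocument(
      //     new Term("rowid", String.valueOf(newRow[0])), doc);
    } else if (oldRow != null) {        // delete: remove by row id
      // writerFor(tableName).deleteDocuments(
      //     new Term("rowid", String.valueOf(oldRow[0])));
    }
  }

  public void close() {}
  public void remove() {}
}

It would be registered per table with something like: CREATE TRIGGER FT
AFTER INSERT, UPDATE, DELETE ON MYTABLE FOR EACH ROW CALL
"LuceneIndexTrigger". A real version would go through a shared, batching
IndexWriter rather than writing from inside the trigger.
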
My experience with the Oracle-Lucene integration is that replacing the
file-system store with BLOBs does not impose a big overhead, and we get
rollback, replication and fault-tolerance functionality for free :)
Best regards, Marcelo.

PS: Sure, the Lucene index is small inside a database; we only need to
store the rowid as UN_TOKENIZED, and for the content of the other indexed
fields the database has faster access than Lucene.

> Cool.  I mention H2 because it does have some Lucene code in it yes.
> Also according to some benchmarks it's the fastest of the open source
> databases.  I think it's possible to integrate realtime search for H2.
>  I suppose there is no need to store the data in Lucene in this case?
> One loses the multiple values per field Lucene offers, and the schema
> become static.  Perhaps it's a trade off?
>
> On Mon, Sep 8, 2008 at 6:17 PM, J. Delgado <[hidden email]> wrote:
>> Yes, both Marcelo and I would be interested.
>>
>> We looked into H2 and it looks like something similar to Oracle's ODCI can
>> be implemented. Plus the primitive full-text implementación is based on
>> Lucene.
>> I say primitive because looking at the code I saw that one cannot define an
>> Analyzer and for each scan corresponding to a where clause a searcher is
>> open and closed, instead of having a pool, plus it does not have any way to
>> queue changes to reduce the use of the IndexWriter, etc.
>>
>> But its open source and that is a great starting point!
>>
>> -- Joaquin
>>
>> On Mon, Sep 8, 2008 at 2:05 PM, Jason Rutherglen
>> <[hidden email]> wrote:
>>>
>>> Perhaps an interesting project would be to integrate Ocean with H2
>>> www.h2database.com to take advantage of both models.  I'm not sure how
>>> exactly that would work, but it seems like it would not be too
>>> difficult.  Perhaps this would solve being able to perform faster
>>> hierarchical queries and perhaps other types of queries that Lucene is
>>> not capable of.
>>>
>>> Is this something Joaquin you are interested in collaborating on?  I
>>> am definitely interested in it.
>>>
>>> On Sun, Sep 7, 2008 at 4:04 AM, J. Delgado <[hidden email]>
>>> wrote:
>>> > On Sat, Sep 6, 2008 at 1:36 AM, Otis Gospodnetic
>>> > <[hidden email]> wrote:
>>> >>
>>> >> Regarding real-time search and Solr, my feeling is the focus should
>>> >> be on first adding real-time search to Lucene, and then we'll figure
>>> >> out how to incorporate that into Solr later.
>>> >
>>> >
>>> > Otis, what do you mean exactly by "adding real-time search to Lucene"?
>>> > Note that Lucene, being an indexing/search library (and not a full
>>> > blown search engine), is by definition "real-time": once you add/write
>>> > a document to the index it becomes immediately searchable, and once a
>>> > document is logically deleted it is no longer returned in a search,
>>> > though physical deletion happens during an index optimization.
>>> >
>>> > Now, the problem of adding/deleting documents in bulk, as part of a
>>> > transaction, and making these documents available for search
>>> > immediately after the transaction is committed sounds more like a
>>> > search engine problem (i.e. SOLR, Nutch, Ocean), especially if these
>>> > transactions are known to be I/O expensive and thus are usually
>>> > implemented as batched processes with some kind of sync mechanism,
>>> > which makes them non real-time.
>>> >
>>> > For example, in my previous life, I designed and helped implement a
>>> > quasi-realtime enterprise search engine using Lucene, having a set of
>>> > multi-threaded indexers hitting a set of multiple indexes allocated
>>> > across different search services, which powered a broker-based
>>> > distributed search interface. The most recent documents provided to
>>> > the indexers were always added to the smaller in-memory (RAM) indexes,
>>> > which usually could absorb the load of a bulk "add" transaction, and
>>> > later would be merged into larger disk-based indexes and then flushed
>>> > to make them ready to absorb fresh new docs. We even had further
>>> > partitioning of the indexes reflecting time periods, with caps on size
>>> > for them to be merged into older, more archive-like indexes which were
>>> > used less (yes, the search engine's default search was on data no more
>>> > than one month old, though the user could open the time window by
>>> > including archives).
>>> >
>>> > As for SOLR and OCEAN, I would argue that these semi-structured search
>>> > engines are becoming more and more like relational databases with
>>> > full-text search capabilities (without the benefit of full relational
>>> > algebra -- for example, joins are not possible using SOLR). Notice
>>> > that "real-time" CRUD operations and transactionality are core DB
>>> > concepts and have been studied and developed by database communities
>>> > for quite a long time. There have been recent efforts on how to
>>> > efficiently integrate Lucene into relational databases (see the Lucene
>>> > JVM Oracle integration:
>>> > http://marceloochoa.blogspot.com/2007/09/running-lucene-inside-your-oracle-jvm.html)
>>> >
>>> > I think we should seriously look at joining efforts with open-source
>>> > database engine projects written in Java (see
>>> > http://java-source.net/open-source/database-engines) in order to blend
>>> > IR and ORM once and for all.
>>> >
>>> > -- Joaquin
>>> >
>>> >
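As a rough illustration of the RAM-buffer-plus-merge pattern Joaquin describes above, here is a bare-bones Java sketch using Lucene's RAMDirectory and the 2.x-era addIndexesNoOptimize call. The names are invented, and a real system would also need locking and a searcher that spans both directories.

    // Invented names; the "search both indexes" side is omitted.
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;

    class RamBufferedIndex {
      private final Directory disk;
      private final Analyzer analyzer;
      private RAMDirectory ram = new RAMDirectory();
      private IndexWriter ramWriter;

      RamBufferedIndex(Directory disk, Analyzer analyzer) throws Exception {
        this.disk = disk;
        this.analyzer = analyzer;
        ramWriter = new IndexWriter(ram, analyzer, true,
                                    IndexWriter.MaxFieldLength.UNLIMITED);
      }

      // Fresh docs hit the small in-memory index, which absorbs bulk adds.
      void add(Document doc) throws Exception {
        ramWriter.addDocument(doc);
      }

      // Periodically merge the RAM index into the big disk index, then
      // start a fresh RAM buffer to absorb new docs.
      synchronized void mergeToDisk() throws Exception {
        ramWriter.close();
        IndexWriter diskWriter = new IndexWriter(disk, analyzer, false,
            IndexWriter.MaxFieldLength.UNLIMITED);
        diskWriter.addIndexesNoOptimize(new Directory[] { ram });
        diskWriter.close();
        ram = new RAMDirectory();
        ramWriter = new IndexWriter(ram, analyzer, true,
                                    IndexWriter.MaxFieldLength.UNLIMITED);
      }
    }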
>>> >>
>>> >> I've read Jason's Wiki as well.  Actually, I had to read it a number
>>> >> of times to understand bits and pieces of it.  I have to admit there
>>> >> is still some fuzziness about the whole thing in my head - is "Ocean"
>>> >> something that already works, a separate project on googlecode.com?
>>> >> I think so.  If so, and if you are working on getting it integrated
>>> >> into Lucene, would it be less confusing to just refer to it as
>>> >> "real-time search"?
>>> >>
>>> >> If this is to be initially integrated into Lucene, why are things
>>> >> like replication, crowding/field collapsing, locallucene, name
>>> >> service, tag index, etc. all mentioned there on the Wiki and bundled
>>> >> with the description of how real-time search works and is to be
>>> >> implemented?  I suppose mentioning replication kind of makes sense
>>> >> because the replication approach is closely tied to real-time search
>>> >> - all query nodes need to see index changes fast.  But Lucene itself
>>> >> offers no replication mechanism, so maybe replication is something to
>>> >> figure out separately, say on the Solr level, later on "once we get
>>> >> there".  I think even just the essential real-time search requires
>>> >> substantial changes to Lucene (I remember seeing large patches in
>>> >> JIRA), which makes it hard to digest, understand, comment on, and
>>> >> ultimately commit (hence the lukewarm response, I think).  Bringing
>>> >> other non-essential elements into the discussion at the same time
>>> >> makes it more difficult to process all this new stuff, at least for
>>> >> me.  Am I the only one who finds this hard?
>>> >>
>>> >> That said, it sounds like we have some discussion going (Karl...), so
>>> >> I look forward to understanding more! :)
>>> >>
>>> >>
>>> >> Otis
>>> >> --
>>> >> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>> >>
>>> >>
>>> >>
>>> >> ----- Original Message ----
>>> >> > From: Yonik Seeley <[hidden email]>
>>> >> > To: [hidden email]
>>> >> > Sent: Thursday, September 4, 2008 10:13:32 AM
>>> >> > Subject: Re: Realtime Search for Social Networks Collaboration
>>> >> >
>>> >> > On Wed, Sep 3, 2008 at 6:50 PM, Jason Rutherglen
>>> >> > wrote:
>>> >> > > I also think it's got a
>>> >> > > lot of things now which makes integration difficult to do properly.
>>> >> >
>>> >> > I agree, and that's why the major bump in version number rather than
>>> >> > minor - we recognize that some features will need some amount of
>>> >> > rearchitecture.
>>> >> >
>>> >> > > I think the problem with integration with SOLR is it was designed
>>> >> > > with
>>> >> > > a different problem set in mind than Ocean, originally the CNET
>>> >> > > shopping application.
>>> >> >
>>> >> > That was the first use of Solr, but it actually existed before that
>>> >> > w/o any defined use other than to be a "plan B" alternative to MySQL
>>> >> > based search servers (that's actually where some of the parameter
>>> >> > names come from... the default /select URL instead of /search, the
>>> >> > "rows" parameter, etc).
>>> >> >
>>> >> > But you're right... some things like the replication strategy were
>>> >> > designed (well, borrowed from Doug to be exact) with the idea that it
>>> >> > would be OK to have slightly "stale" views of the data in the range
>>> >> > of
>>> >> > minutes.  It just made things easier/possible at the time.  But tons
>>> >> > of Solr and Lucene users want almost instantaneous visibility of
>>> >> > added
>>> >> > documents, if they can get it.  It's hardly restricted to social
>>> >> > network applications.
>>> >> >
>>> >> > Bottom line is that Solr aims to be a general enterprise search
>>> >> > platform, and getting as real-time as we can get, and as scalable as
>>> >> > we can get are some of the top priorities going forward.
>>> >> >
>>> >> > -Yonik
>>> >> >



--
Marcelo F. Ochoa
http://marceloochoa.blogspot.com/
http://marcelo.ochoa.googlepages.com/home
______________
Do you Know DBPrism? Look @ DB Prism's Web Site
http://www.dbprism.com.ar/index.html
More info?
Chapter 17 of the book "Programming the Oracle Database using Java &
Web Services"
http://www.amazon.com/gp/product/1555583296/
Chapter 21 of the book "Professional XML Databases" - Wrox Press
http://www.amazon.com/gp/product/1861003587/
Chapter 8 of the book "Oracle & Open Source" - O'Reilly
http://www.oreilly.com/catalog/oracleopen/

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Realtime Search for Social Networks Collaboration

Jason Rutherglen
I am wondering whether, in an integrated solution, things like sorting
would still require the field cache.  What if untokenized fields could be
stored in H2 and normal tokenized fields in Lucene, and the query somehow
made to work properly across both?  Yes, the rowid would need to be
stored.  Currently Lucene range queries are slower than SQL-based btree
queries.

Are you saying store the Lucene segments as BLOBs?

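One hedged sketch of how the split query Jason wonders about could look: the range predicate runs against H2's btree index, the text predicate runs in Lucene, and the two are joined on a stored rowid. The table, column, and field names are invented; H2's _ROWID_ pseudo-column stands in for a real key.

    // Invented schema: table "item" with a btree-indexed "price" column
    // in H2, and a Lucene index whose docs store the matching _ROWID_
    // in a stored field named "rowid".
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.util.HashSet;
    import java.util.Set;

    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;

    class HybridQuery {
      static Set<String> search(Connection h2, IndexSearcher searcher,
                                Query textQuery, long min, long max)
          throws Exception {
        // 1. Range predicate via H2's btree index (faster than a Lucene
        //    range query, per the observation above).
        Set<String> inRange = new HashSet<String>();
        PreparedStatement ps = h2.prepareStatement(
            "SELECT _ROWID_ FROM item WHERE price BETWEEN ? AND ?");
        ps.setLong(1, min);
        ps.setLong(2, max);
        ResultSet rs = ps.executeQuery();
        while (rs.next()) {
          inRange.add(rs.getString(1));
        }

        // 2. Text predicate via Lucene, joined on the stored rowid.
        Set<String> result = new HashSet<String>();
        TopDocs hits = searcher.search(textQuery, null, 1000);
        for (ScoreDoc sd : hits.scoreDocs) {
          String rowid = searcher.doc(sd.doc).get("rowid");
          if (inRange.contains(rowid)) {
            result.add(rowid);
          }
        }
        return result;
      }
    }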
On Mon, Sep 8, 2008 at 7:13 PM, Marcelo Ochoa <[hidden email]> wrote:

> Hi:
> Integrating Lucene in an RDBMS has two separate concerns:
>  - Integrate it as an index, so that it receives notifications when a
> row changes and the optimizer can choose the right execution plan based
> on the index statistics.
>  - Replace Lucene's file system store to align database changes with
> Lucene changes, meaning both should be part of one transaction.
> For H2, the first point seems viable to implement with more or less
> effort; for the second, I don't know how H2 manages BLOB storage.
> My experience with the Oracle-Lucene integration is that replacing the
> file-system store with BLOBs does not impose a big overhead, and we get
> rollback, replication and fault tolerance functionality for free :)
> Best regards, Marcelo.
>
> PS: Of course the Lucene index is small inside a database; we only need
> to store the rowid as UN_TOKENIZED, since for the content of the other
> indexed fields the database has faster access than Lucene.
>> Cool.  I mention H2 because it does have some Lucene code in it, yes.
>> Also, according to some benchmarks it's the fastest of the open source
>> databases.  I think it's possible to integrate realtime search for H2.
>> I suppose there is no need to store the data in Lucene in this case?
>> One loses the multiple values per field Lucene offers, and the schema
>> becomes static.  Perhaps it's a trade-off?
>>
>> On Mon, Sep 8, 2008 at 6:17 PM, J. Delgado <[hidden email]> wrote:
>>> Yes, both Marcelo and I would be interested.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Realtime Search for Social Networks Collaboration

Marcelo F. Ochoa
> I am wondering whether, in an integrated solution, things like sorting
> would still require the field cache.  What if untokenized fields could
> be stored in H2 and normal tokenized fields in Lucene, and the query
> somehow made to work properly across both?  Yes, the rowid would need
> to be stored.  Currently Lucene range queries are slower than SQL-based
> btree queries.
  I am using an approach similar to the field cache to maintain the
association between Lucene docids and Oracle rowids; this cache is
automatically invalidated once an IndexWriter commits changes.
>
> Are you saying store the Lucene segments as BLOBs?
  Yep, I am using Oracle Secure Files (BLOBs) to store Lucene segments;
this is done through OJVMDirectory, which handles the operations.
   As I said in my previous email, BLOBs have performance similar to an
NFS file system. Obviously that is not faster than native IO, but we
compensate for this small overhead by eliminating the network
transfer/marshalling when indexing database information; remember that
the Oracle-Lucene integration runs in the same memory space as the data,
so it has local access.
   We can replace the BLOB with any other fault-tolerant, scalable
storage (like HBase), but I don't know whether it can be aligned with a
database transaction.
  Best regards, Marcelo.

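A loose sketch of such a docid-to-rowid cache, keyed by IndexReader the way the field cache is, so a reader reopened after a commit naturally gets a fresh entry. The names and structure are illustrative, not OJVMDirectory's actual code.

    // Invented sketch: one rowid per Lucene docid, rebuilt per reader.
    import java.util.Map;
    import java.util.WeakHashMap;

    import org.apache.lucene.index.IndexReader;

    class RowidCache {
      // Keyed by reader, like the field cache; a stale reader's entry
      // disappears with the reader itself.
      private final Map<IndexReader, String[]> cache =
          new WeakHashMap<IndexReader, String[]>();

      synchronized String[] rowids(IndexReader reader) throws Exception {
        String[] ids = cache.get(reader);
        if (ids == null) {
          ids = new String[reader.maxDoc()];
          for (int doc = 0; doc < ids.length; doc++) {
            if (!reader.isDeleted(doc)) {
              ids[doc] = reader.document(doc).get("rowid"); // stored UN_TOKENIZED
            }
          }
          cache.put(reader, ids);
        }
        return ids;
      }
    }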



--
Marcelo F. Ochoa
http://marceloochoa.blogspot.com/
http://marcelo.ochoa.googlepages.com/home
______________
Do you Know DBPrism? Look @ DB Prism's Web Site
http://www.dbprism.com.ar/index.html
More info?
Chapter 17 of the book "Programming the Oracle Database using Java &
Web Services"
http://www.amazon.com/gp/product/1555583296/
Chapter 21 of the book "Professional XML Databases" - Wrox Press
http://www.amazon.com/gp/product/1861003587/
Chapter 8 of the book "Oracle & Open Source" - O'Reilly
http://www.oreilly.com/catalog/oracleopen/

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Realtime Search for Social Networks Collaboration

Michael McCandless-2
In reply to this post by Yonik Seeley-2

Yonik Seeley wrote:

> On Mon, Sep 8, 2008 at 3:04 PM, Michael McCandless
> <[hidden email]> wrote:
>> Right, getCurrentIndex would return a MultiReader that includes a
>> SegmentReader for each segment in the index, plus a "RAMReader" that
>> searches the RAM buffer.  That RAMReader is a tiny shell class that
>> would basically just record the max docID it's allowed to go up to
>> (the docID as of when it was opened), and stop enumerating docIDs
>> (eg in the TermDocs) when it hits a docID beyond that limit.
>
> What about something like term freq?  Would it need to count the
> number of docs after the local maxDoc or is there a better way?

Good question...

I think we'd have to take a full copy of the term -> termFreq map on
reopen?  I don't see how else to do it (I don't understand your
suggestion above).  So, this will clearly add to the cost of reopen.

>> For reading stored fields and term vectors, which are now flushed
>> immediately to disk, we need to somehow get an IndexInput from the
>> IndexOutputs that IndexWriter holds open on these files.  Or, maybe,
>> just open new IndexInputs?
>
> Hmmm, seems like a case of our nice and simple Directory model not
> having quite enough features in this case.

I think we can simply open IndexInputs on these files.  I believe Java
does the right thing on Windows, such that if we are already writing to
the file, it does not prevent another file handle from opening the file
for reading.

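A tiny probe of that assumption, written against the 2.x-era store API (FSDirectory.getDirectory): open an IndexInput on a file whose IndexOutput is still open and read back the flushed bytes.

    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.IndexInput;
    import org.apache.lucene.store.IndexOutput;

    class ConcurrentOpenCheck {
      public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.getDirectory("/tmp/probe-idx");
        IndexOutput out = dir.createOutput("probe");
        out.writeInt(42);
        out.flush();
        // The file is still held open for writing here; on a platform
        // where this works, the read below sees the flushed bytes.
        IndexInput in = dir.openInput("probe");
        System.out.println("read back: " + in.readInt());
        in.close();
        out.close();
      }
    }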
>>> Another thing that will help is if users could get their hands on
>>> the sub-readers of a multi-segment reader.  Right now that is hidden
>>> in MultiSegmentReader and makes updating anything incrementally
>>> difficult.
>>
>> Besides what's handled by MultiSegmentReader.reopen already, what  
>> else do
>> you need to incrementally update?
>
> Anything that you want to incrementally update that uses an
> IndexReader as a key.
> Mostly caches, I would think... Solr has user-level (application
> specific) caches, faceting caches, etc.

Ahh ok.  We should just open up access and mark this as advanced?

Mike

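For readers following along, here is a loose sketch of the "tiny shell class" idea from earlier in this message: a FilterIndexReader that snapshots maxDoc at open time and stops TermDocs enumeration beyond it. This uses the 2.x/3.x FilterIndexReader extension point; skipTo, numDocs, and the other methods a complete implementation must also cap are elided.

    import java.io.IOException;

    import org.apache.lucene.index.FilterIndexReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;

    class CappedReader extends FilterIndexReader {
      private final int maxDocAtOpen;

      CappedReader(IndexReader in) {
        super(in);
        maxDocAtOpen = in.maxDoc();  // snapshot at open time
      }

      public int maxDoc() {
        return maxDocAtOpen;
      }

      public TermDocs termDocs(Term term) throws IOException {
        final TermDocs td = in.termDocs(term);
        return new FilterTermDocs(td) {
          public boolean next() throws IOException {
            // Stop enumerating at the first docID beyond the snapshot.
            return td.next() && td.doc() < maxDocAtOpen;
          }
        };
      }
    }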
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]
