Realtime Search for Social Networks Collaboration


Realtime Search for Social Networks Collaboration

Jason Rutherglen
Hello all,

I don't mean this to sound like a solicitation.  I've been working on
realtime search and created some Lucene patches etc.  I am wondering
if there are social networks (or anyone else) out there who would be
interested in collaborating with Apache on realtime search to get it
to the point it can be used in production.  It is a challenging
problem that only Google has solved and made to scale.  I've been
working on the problem for a while and though a lot has been
completed, there is still a lot more to do and collaboration amongst
the most probable users (social networks) seems like a good thing to
try to do at this point.  I guess I'm saying it seems like a hard
enough problem that perhaps it's best to work together on it rather
than each company try to complete their own.  However I could be
wrong.

Realtime search benefits social networks by providing a scalable,
searchable alternative to large MySQL implementations.  MySQL, I have
heard, becomes difficult to scale past a certain point.  Apparently
Google has created things like BigTable (a large database) and an
online service called GData (Google has not published any whitepapers
on the technology underneath it) to address scaling large database
systems.  BigTable does not offer search.  GData does, and is used by
all of Google's web services instead of something like MySQL (this is
at least how I understand it).  Social networks usually grow, so
scaling is continually an issue.  It is possible to build a realtime
search system that scales linearly, something that I have heard
becomes difficult with MySQL.  There is an article that discusses some
of these issues:
http://acmqueue.com/modules.php?name=Content&pa=showpage&pid=337
I don't think the current GData implementation is perfect, and there is
a lot that can be improved on.  It might be useful to figure out
together which features should be added.

If this sounds like something of interest to anyone feel free to send
your input.

Take care,
Jason

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Realtime Search for Social Networks Collaboration

Yonik Seeley-2
On Wed, Sep 3, 2008 at 3:20 PM, Jason Rutherglen
<[hidden email]> wrote:
> I am wondering
> if there are social networks (or anyone else) out there who would be
> interested in collaborating with Apache on realtime search to get it
> to the point it can be used in production.

Good timing Jason, I think you'll find some other people right here
at Apache (solr-dev) that want to collaborate in this area:

http://www.nabble.com/solr2%3A-Onward-and-Upward-td19224805.html

I've looked at your wiki briefly, and all the high level goals/features seem
to really be synergistic with where we are going with Solr2.

-Yonik


Re: Realtime Search for Social Networks Collaboration

Jason Rutherglen
Hi Yonik,

The SOLR 2 list looks good.  The question is, who is going to do the
work?  I tried to simplify the scope of Ocean as much as possible so
that I could eventually (if slowly) finish what is mentioned on the
wiki.  I think SOLR is very cool and was a major step forward when it
came out.  I also think it's got a lot of things now which makes
integration difficult to do properly.  I did try to integrate and
received a lukewarm response, and so decided to just move ahead
separately until folks have time to collaborate.  We probably should
try to integrate SOLR and Ocean somehow; however, we may want to
simply reduce the scope a bit and figure out what is needed most, with
the main use case being social networks.

I think the problem with integration with SOLR is it was designed with
a different problem set in mind than Ocean, originally the CNET
shopping application.  Facets were important, realtime was not needed
because pricing doesn't change very often.  I designed Ocean for
social networks and actually further into the future realtime
messaging based mobile applications.

SOLR needs to be backward compatible and support its existing user
base.  How do you plan on doing this for a SOLR 2 if the architecture
is changed dramatically?  SOLR solves a problem set that is very
common, making SOLR very useful in many situations.  However, I wanted
Ocean to be like GData.  So I wanted the scalability of Google, which
SOLR doesn't quite have yet, and the realtime, and then I figured the
other stuff could be added later, stuff people seem to spend a lot of
time on in the SOLR community currently (spellchecker, db imports,
many others).  I did use some of the SOLR terminology in building
Ocean, like snapshots!  But most of it is a digression.  I tried to
use schemas, but they just make the system harder to use.  For
distributed search I prefer serialized objects, as this enables things
like SpanQueries and payloads without writing request handlers and
such.  There is also no need to write new request handlers and deploy
them (an expensive operation for systems in the 100s of servers),
as any new classes are simply loaded dynamically by the server from
the client.
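
As a rough illustration of the serialized-object idea described above
(this is not Ocean's code), the sketch below sends a query as a plain
Java object using standard java.io serialization and reads it back on
the other side.  The class names SerializedQuerySketch and TermSpec are
hypothetical stand-ins, and the sketch assumes both sides already have
the class on their classpath; loading new classes from the client is
not shown.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical sketch: ship a query to a search node as a serialized object.
class SerializedQuerySketch {

    // Stand-in for a real query class; this is not a Lucene type.
    static final class TermSpec implements Serializable {
        private static final long serialVersionUID = 1L;
        final String field;
        final String term;

        TermSpec(String field, String term) {
            this.field = field;
            this.term = term;
        }
    }

    // Client side: turn the query object into bytes for the wire.
    static byte[] toBytes(Serializable query) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(query);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Server side: rebuild the query object from the received bytes.
    static Object fromBytes(byte[] wire) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(wire))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            // In this sketch the server must already know the class.
            throw new IllegalStateException("query class not available on server", e);
        }
    }

    public static void main(String[] args) {
        byte[] wire = toBytes(new TermSpec("body", "ocean"));
        TermSpec received = (TermSpec) fromBytes(wire);
        System.out.println(received.field + ":" + received.term);
    }
}
```

The point of the design is that a new query type only needs a new
class, with no request handler or wire format to write for it.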

A lot is now outlined on the wiki site
http://wiki.apache.org/lucene-java/OceanRealtimeSearch and there
will be a lot more javadocs in the forthcoming patch.  The latest code
is also always available at
http://oceansearch.googlecode.com/svn/trunk/trunk/oceanlucene

I do welcome more discussion and if there are Solr developers who wish
to work on Ocean feel free to drop me a line.  Most of all though I
think it would be useful for social networks interested in realtime
search to get involved as it may be something that is difficult for
one company to have enough resources to implement to a production
level.  I think this is where open source collaboration is
particularly useful.

Cheers,

Jason Rutherglen
[hidden email]


Re: Realtime Search for Social Networks Collaboration

Yonik Seeley-2
On Wed, Sep 3, 2008 at 6:50 PM, Jason Rutherglen
<[hidden email]> wrote:
> I also think it's got a
> lot of things now which makes integration difficult to do properly.

I agree, and that's why the major bump in version number rather than
minor - we recognize that some features will need some amount of
rearchitecture.

> I think the problem with integration with SOLR is it was designed with
> a different problem set in mind than Ocean, originally the CNET
> shopping application.

That was the first use of Solr, but it actually existed before that
w/o any defined use other than to be a "plan B" alternative to MySQL
based search servers (that's actually where some of the parameter
names come from... the default /select URL instead of /search, the
"rows" parameter, etc).

But you're right... some things like the replication strategy were
designed (well, borrowed from Doug to be exact) with the idea that it
would be OK to have slightly "stale" views of the data in the range of
minutes.  It just made things easier/possible at the time.  But tons
of Solr and Lucene users want almost instantaneous visibility of added
documents, if they can get it.  It's hardly restricted to social
network applications.

Bottom line is that Solr aims to be a general enterprise search
platform, and getting as real-time as we can get, and as scalable as
we can get are some of the top priorities going forward.

-Yonik


Re: Realtime Search for Social Networks Collaboration

Jason Rutherglen
Hi Yonik,

I found the basic integration of SOLR and Ocean to be fairly
straightforward; the https://issues.apache.org/jira/browse/SOLR-567
patch is key to that.  SOLR just needs an optimistic concurrency
update handler and most of the functionality would work.  I guess the
problem would be removing the ability to do things that are
unnecessary in realtime, like commit, optimize, and others.  It's
doable; let me know if you need some help making decisions about
things.
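
For readers unfamiliar with the term, optimistic concurrency means an
update is applied only if the document is still at the version the
caller last saw.  The sketch below shows that check in isolation.  It
is not the SOLR-567 patch or any Solr or Ocean API; OptimisticStore,
Versioned, create and update are hypothetical names chosen for the
example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a version-checked (optimistic) update.
class OptimisticStore {

    // A document body together with the version it was written at.
    static final class Versioned {
        final long version;
        final String body;

        Versioned(long version, String body) {
            this.version = version;
            this.body = body;
        }
    }

    private final Map<String, Versioned> docs = new ConcurrentHashMap<>();

    // Adds a new document at version 1; returns false if the id already exists.
    boolean create(String id, String body) {
        return docs.putIfAbsent(id, new Versioned(1, body)) == null;
    }

    // Applies the update only if the caller saw the latest version.
    boolean update(String id, long expectedVersion, String newBody) {
        Versioned current = docs.get(id);
        if (current == null || current.version != expectedVersion) {
            return false; // stale or missing: caller must re-read and retry
        }
        // replace() is atomic; Versioned does not override equals, so the
        // comparison is by identity and fails if another writer got in first.
        return docs.replace(id, current, new Versioned(expectedVersion + 1, newBody));
    }

    Versioned get(String id) {
        return docs.get(id);
    }

    public static void main(String[] args) {
        OptimisticStore store = new OptimisticStore();
        store.create("doc1", "first draft");
        System.out.println(store.update("doc1", 1, "second draft")); // accepted
        System.out.println(store.update("doc1", 1, "stale write"));  // rejected
    }
}
```

A rejected update is not an error in this scheme; the writer simply
re-reads the document and tries again with the new version number.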

Take care,
Jason


Re: Realtime Search for Social Networks Collaboration

Otis Gospodnetic-2
In reply to this post by Jason Rutherglen
Regarding real-time search and Solr, my feeling is the focus should be on first adding real-time search to Lucene, and then we'll figure out how to incorporate that into Solr later.

I've read Jason's Wiki as well.  Actually, I had to read it a number of times to understand bits and pieces of it.  I have to admit there is still some fuzziness about the whole thing in my head - is "Ocean" something that already works, a separate project on googlecode.com?  I think so.  If so, and if you are working on getting it integrated into Lucene, would it be less confusing to just refer to it as "real-time search"?

If this is to be initially integrated into Lucene, why are things like replication, crowding/field collapsing, locallucene, name service, tag index, etc. all mentioned there on the Wiki and bundled with the description of how real-time search works and is to be implemented?  I suppose mentioning replication kind-of makes sense because the replication approach is closely tied to real-time search - all query nodes need to see index changes fast.  But Lucene itself offers no replication mechanism, so maybe the replication is something to figure out separately, say on the Solr level, later on "once we get there".  I think even just the essential real-time search requires substantial changes to Lucene (I remember seeing large patches in JIRA), which makes it hard to digest, understand, comment on, and ultimately commit (hence the lukewarm response, I think).  Bringing other non-essential elements into the discussion at the same time makes it more difficult to process all this new stuff, at least for me.  Am I the only one who finds this hard?

That said, it sounds like we have some discussion going (Karl...), so I look forward to understanding more! :)


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch




Re: Realtime Search for Social Networks Collaboration

Jason Rutherglen
Hi Otis,

LUCENE-1313 is realtime search.  The Ocean name should be removed from
it, but at the time I was not sure "realtime search" was what the
technical name should be.  I have now seen it used elsewhere (at
Summize, the search company Twitter recently purchased, as well as at
Bebo and LinkedIn) and so believe it is an accepted proper name.  The
question, and this is for folks like Michael McCandless, is what
features it should have, what version of Lucene it should target,
whether it needs to be in core or contrib, and when.  I will leave
those discussions to others.

The wiki site has become more or less a dumping ground for the many
components of a next generation search database system hence the name
Ocean Realtime Search.  I prefer to work at the non-linear system
level rather than at the class component level and the documentation
reflects this.  I believe there is no comparable solution to Google's
GData in open source.  In that regard Ocean is more like Nutch in that
it solves a common problem (Nutch solves web indexing, Ocean solves
realtime search databases, and they are both based more or less on
paths Google paved).  Nutch also works above the Lucene level, just
like Ocean.  This is to minimize impact on Lucene and provide a
solution that works today rather than 1-2 years from now when
integration with SOLR and core Lucene may take place.  This simply
reflects my preference for working at the systems level and getting
the entire system working so that the Ocean system may be used in
production applications.

The feedback is helpful and I will start to divide up the
documentation into more discrete pieces, like the code itself.  I found
SOLR to be incomplete as a system, at least for the system I wanted,
which is more in line with how Hadoop and Nutch operate.  Hadoop and
Nutch implement distributed objects, which makes coding much simpler
and faster, and they are designed for scalability to 1000s of servers
and always-on operation.  In SOLR, when the master fails or the master
index is corrupted, the corrupted index is replicated to the slaves,
which causes the entire system to fail immediately (this has happened
in production).  When I tried to address these things in SOLR it
became a coding nightmare because of the RequestHandlers and things
like this requiring XML, which requires writing a custom client.
Whereas in Nutch, Hadoop, and Ocean one simply writes the Java code
for the operation and it's completed (minutes compared to hours or
days).

While replication is not necessary in the Lucene core realtime search
(it is not included in LUCENE-1313), it is required for the search
systems I have worked on in the past and so I addressed it in the
Ocean search database system.  This way it would not need to be bolted
on later, and perhaps require a major rewrite of the realtime search
component.  I prefer this sort of advanced planning so that later on,
I do not have to rewrite core code which destroys valuable testing and
software contributed over time.  The TagIndex is another example of
something that I started on to see how it would work, then stopped
once I understood how it would fit in with the overall system.  This
way, again, I do not have to go back and rewrite core code that needs
to be retested again potentially over several months.

It is unfortunate that I cannot explain the system well enough for
folks to understand it.  It would help to go over it with someone who
does not know too much about it who can format the documentation in a
way that is easily digested by the Lucene community.

Have a nice weekend,
Jason



Re: Realtime Search for Social Networks Collaboration

Yonik Seeley-2
There's a good percent of the Solr community that is looking to add
everything you are (from a functional point of view).  Some of the
other little things that we haven't considered (like a remote Java
API) sound cool... no reason not to add that also.  We're also
planning on adding alternatives to some of the things you don't
currently like about Solr (HTTP, XML config, etc).

Apache has always emphasized "community over code"... and it's a large
part of what open source is about here.  It's not always easier and
faster to work in an open community, making compromises and trying to
reach general consensus, but it tends to be good for projects in the
long term.

-Yonik


Re: Realtime Search for Social Networks Collaboration

Jason Rutherglen
Hi Yonik,

I fully agree with "good for projects in the long term".  I just
figured it would be best if someone went ahead and built the things
so they could be integrated later into other projects; that's why I
checked them into Apache as patches.  Sounds like a few folks like
Shalin and Noble would like to build a SOLR-specific realtime search.
I think that's a good idea that I may be able to offer some help on.
Realtime is relative anyway; for many projects database-like updates
are probably not necessary, and neither is replication, or perhaps
even 100% uptime and scalability.  I just want the features, and if
someone would like to work with me to get them into the core Lucene
and SOLR projects that would be cool.  If not, at least the code is
out there to get ideas from.  These discussions are a good starting
point.

Cheers,
Jason


Re: Realtime Search for Social Networks Collaboration

Shalin Shekhar Mangar
Hi Jason,

I think this is a misunderstanding. I only want to add these features incrementally so that users can use them as soon as possible, rather than delay them to a later release by re-architecting (which may take more time and shift our focus from our users).

The features are more important than the code, but the code will of course help a lot too. I think a good starting point for us (Lucene/Solr folks) would be to study Ocean's source and any documentation that you can provide so that we can also suggest an optimal integration strategy or alternate implementation ideas. Until now the bulk of such work has been on your shoulders. I appreciate your patience and the amount of work you have put in. These features will be a huge value proposition for our users and a collaboration will be good for the community in the long term.


--
Regards,
Shalin Shekhar Mangar.

Re: Realtime Search for Social Networks Collaboration

Grant Ingersoll-2
In reply to this post by Otis Gospodnetic-2

On Sep 6, 2008, at 4:36 AM, Otis Gospodnetic wrote:

> Regarding real-time search and Solr, my feeling is the focus should  
> be on first adding real-time search to Lucene, and then we'll figure  
> out how to incorporate that into Solr later.
>
> I've read Jason's Wiki as well.  Actually, I had to read it a number  
> of times to understand bits and pieces of it.  I have to admit there  
> is still some fuzziness about the whole things in my head - is  
> "Ocean" something that already works, a separate project on  
> googlecode.com?  I think so.  If so, and if you are working on  
> getting it integrated into Lucene, would it make it less confusing  
> to just refer to it as "real-time search", so there is no confusion?
>
> If this is to be initially integrated into Lucene, why are things  
> like replication, crowding/field collapsing, locallucene, name  
> service, tag index, etc. all mentioned there on the Wiki and bundled  
> with description of how real-time search works and is to be  
> implemented?  I suppose mentioning replication kind-of makes sense  
> because the replication approach is closely tied to real-time search  
> - all query nodes need to see index changes fast.  But Lucene itself  
> offers no replication mechanism, so maybe the replication is  
> something to figure out separately, say on the Solr level, later on  
> "once we get there".  I think even just the essential real-time  
> search requires substantial changes to Lucene (I remember seeing  
> large patches in JIRA), which makes it hard to digest, understand,  
> comment on, and ultimately commit (hence the luke warm response, I  
> think).  Bringing other non-essential elements into discussion at  
> the same time makes it more difficult to
> process all this new stuff, at least for me.  Am I the only one who  
> finds this hard?

Yeah, I agree.  There's a place for RT search in Lucene, but it seems
to me we have a pretty good search server in Solr; it needs some
things going forward, but those are reasonable to work on there.  It
makes sense to me not to duplicate efforts on all of those fronts and
have two projects/communities that share > 80-90% of their
functionality (either existing or planned).  As Yonik says, it may
take longer than just doing it by oneself, but in the long run, the
outcome is usually better.

My two cents,
Grant


Re: Realtime Search for Social Networks Collaboration

Paul Elschot
In reply to this post by Shalin Shekhar Mangar
On Saturday 06 September 2008 18:53:39, Shalin Shekhar Mangar wrote:
...

>
> The features are more important than the code but it will of course
> help a lot too. I think a good starting point for us (Lucene/Solr
> folks) would be to study Ocean's source and any documentation that
> you can provide so that we can also suggest an optimal integration
> strategy or alternate implementation ideas. Until now the bulk of
> such work has been on your shoulders. I appreciate your patience and
> the amount of work you have put in. These features will be a huge
> value proposition for our users and a collaboration will be good
> for the community in the long term.

Some experience from larger patches:
- stepwise is good,
- so plan for steps, in which
- each step is an improvement on its own.

Then:
- try to keep the first step as small as possible,
- with some luck, someone else will improve the first step,
- learn from the improvement,
- repeat, and never hurry.


Some comments on the current patch at LUCENE-1313:
- Copyright is assigned to individual authors; better to assign it to
  the ASF.
- Individual authors are mentioned in the code; that's not Lucene
  policy at the moment.
- Some files do not contain an ASF license header; not a real problem.
- The directory structure could also use contrib/ocean as the
  top directory.
- There is a whole logging package in there, but there's no logging
  in Lucene at the moment.
- There is at least one empty class, SearcherPolicy.
- Unseen so far:
   - the second half of the patch,
   - the Java code within the class {...} statements (sorry).


Even though the patch is down to 25% of its first size,
it's still 474 KB, which is large by any standard. So the
question is: is there a first step to be taken from this
patch that would be an improvement on its own?

Regards,
Paul Elschot


Re: Realtime Search for Social Networks Collaboration

Jason Rutherglen
In reply to this post by Shalin Shekhar Mangar
Hello Shalin,

When I tried to integrate before, it seemed fairly simple.  However,
the Ocean core code wasn't quite up to par yet, so that needed work.
It will help to work directly with SOLR people, such as yourself, who
can figure out how they want to integrate.  Right now I'm finishing up
the OceanDatabase portion (sorry for all the Ocean names and things;
these can be changed, it doesn't matter, but it should be something we
agree on).  The methods on TransactionSystem are like those on
IndexWriter.  The update method for OceanDatabase is perform(Action
action).  There are 3 actions: Insert, Update, and Delete.  To execute
queries, the whole thing is abstracted out as a Task.  The method is
Object run(Task task), where the task gets a reference to the
TransactionSystem.  I implemented a MultiThreadSearchTask that, as the
name suggests, executes a query in multiple threads over the latest
Snapshot.  The reason for the Task abstraction is to give the client
complete access to the server via a potentially dynamically loaded
subclass of Task.  OceanDatabase should be the main class for most
uses of the realtime system because it implements optimistic
concurrency.  I prefer the simplicity of the main entry point into the
search server being only two methods, with the run method offering
unlimited functionality without recompiling, building, and deploying
the server for each new piece of functionality required.
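
To make the two-method idea concrete, here is a minimal in-memory
sketch.  It is not the Ocean code: every type name below
(DatabaseSketch, ActionSketch, TaskSketch, and so on) is hypothetical,
the "index" is just a map, run() is generic rather than returning
Object, and the optimistic concurrency is modeled with a
compare-and-set on the latest snapshot.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: these are NOT the real Ocean classes.

/** Immutable view of the index at one version. */
final class SnapshotSketch {
    final long version;
    final Map<String, String> docs; // doc id -> text

    SnapshotSketch(long version, Map<String, String> docs) {
        this.version = version;
        this.docs = Collections.unmodifiableMap(new HashMap<>(docs));
    }
}

/** One of the three update kinds: insert, update, delete. */
interface ActionSketch {
    void applyTo(Map<String, String> docs);
}

final class InsertSketch implements ActionSketch {
    private final String id;
    private final String text;
    InsertSketch(String id, String text) { this.id = id; this.text = text; }
    public void applyTo(Map<String, String> docs) {
        if (docs.putIfAbsent(id, text) != null) {
            throw new IllegalStateException("already exists: " + id);
        }
    }
}

final class UpdateSketch implements ActionSketch {
    private final String id;
    private final String text;
    UpdateSketch(String id, String text) { this.id = id; this.text = text; }
    public void applyTo(Map<String, String> docs) {
        if (docs.replace(id, text) == null) {
            throw new IllegalStateException("missing: " + id);
        }
    }
}

final class DeleteSketch implements ActionSketch {
    private final String id;
    DeleteSketch(String id) { this.id = id; }
    public void applyTo(Map<String, String> docs) { docs.remove(id); }
}

/** Arbitrary client code that runs inside the server. */
interface TaskSketch<T> {
    T run(DatabaseSketch db);
}

/** The whole entry point: perform(...) for writes, run(...) for the rest. */
final class DatabaseSketch {
    private final AtomicReference<SnapshotSketch> latest =
        new AtomicReference<>(new SnapshotSketch(0, new HashMap<>()));

    SnapshotSketch latestSnapshot() { return latest.get(); }

    /** Optimistic update: build a new snapshot, publish it only if nobody else won. */
    long perform(ActionSketch action) {
        while (true) {
            SnapshotSketch current = latest.get();
            Map<String, String> copy = new HashMap<>(current.docs);
            action.applyTo(copy);
            SnapshotSketch next = new SnapshotSketch(current.version + 1, copy);
            if (latest.compareAndSet(current, next)) {
                return next.version;
            }
            // Another writer published first; retry against the newer snapshot.
        }
    }

    <T> T run(TaskSketch<T> task) { return task.run(this); }
}

/** Example task: ids of documents whose text contains a term, sorted. */
final class TermSearchTaskSketch implements TaskSketch<List<String>> {
    private final String term;
    TermSearchTaskSketch(String term) { this.term = term; }
    public List<String> run(DatabaseSketch db) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> e : db.latestSnapshot().docs.entrySet()) {
            if (e.getValue().contains(term)) hits.add(e.getKey());
        }
        Collections.sort(hits);
        return hits;
    }
}
```

A client would call db.perform(new InsertSketch("1", "some text")) to
write and db.run(new TermSearchTaskSketch("text")) to query; new query
behavior only needs a new task class, not a new server method.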

Regards,
Jason

On Sat, Sep 6, 2008 at 12:53 PM, Shalin Shekhar Mangar
<[hidden email]> wrote:

> Hi Jason,
>
> I think this is a misunderstanding. I only want to add these features
> incrementally so that users can use them as soon as possible, rather than
> delay them to a later release by re-architecting (which may take more time
> and shift our focus from our users).
>
> The features are more important than the code but it will of course help a
> lot too. I think a good starting point for us (Lucene/Solr folks) would be
> to study Ocean's source and any documentation that you can provide so that
> we can also suggest an optimal integration strategy or alternate
> implementation ideas. Until now the bulk of such work has been on your
> shoulders. I appreciate your patience and the amount of work you have put
> in. These features will be a huge value proposition for our users and a
> collaboration will be good for the community in the long term.
>
> On Sat, Sep 6, 2008 at 9:11 PM, Jason Rutherglen
> <[hidden email]> wrote:
>>
>> Hi Yonik,
>>
>> I fully agree with "good for projects in the long term".  I just
>> figured it would be best if someone went ahead and built the things
>> and they could be integrated later into other projects, that's why I
>> checked them into Apache as patches.  Sounds like a few folks like
>> Shalin and Noble would like to build a SOLR specific realtime search.
>> I think that's a good idea that I may be able to offer some help on.
>> Realtime is relative anyways, for many projects database like updates
>> are probably not necessary, neither is replication, or perhaps even
>> 100% uptime and scalability.  I just want the features, and if someone
>> would like to work with me to get them into core Lucene and SOLR
>> projects that would be cool.  If not at least the code is out there to
>> get ideas from.  These discussions are a good starting point.
>>
>> Cheers,
>> Jason
>>
>> On Sat, Sep 6, 2008 at 11:21 AM, Yonik Seeley <[hidden email]> wrote:
>> > There's a good percent of the Solr community that is looking to add
>> > everything you are (from a functional point of view).  Some of the
>> > other little things that we haven't considered (like a remote Java
>> > API) sound cool... no reason not to add that also.  We're also
>> > planning on adding alternatives to some of the things you don't
>> > currently like about Solr (HTTP, XML config, etc).
>> >
>> > Apache has always emphasized "community over code"... and it's a large
>> > part of what open source is about here.  It's not always easier and
>> > faster to work in an open community, making compromises and trying to
>> > reach general consensus, but it tends to be good for projects in the
>> > long term.
>> >
>> > -Yonik
>
>
>
> --
> Regards,
> Shalin Shekhar Mangar.
>


Re: Realtime Search for Social Networks Collaboration

Jason Rutherglen
In reply to this post by Grant Ingersoll-2
Hi Grant,

I think the way to integrate with SOLR and Lucene is for committers
on the respective projects to work with me (if they want) on the
integration, which should make it fairly straightforward, as it was
designed and intended to be.

Cheers,
Jason



Re: Realtime Search for Social Networks Collaboration

Jason Rutherglen
In reply to this post by Paul Elschot
Hi Paul,

It's unfortunate the code is larger than most contribs.  The libraries
can be factored out.  The next patch includes OceanDatabase.  Could
the Ocean package and class names be dropped in favor of "realtime"?

> - There is a whole package of logging in there, but there's no logging
>  in lucene at the moment.

It can be removed in favor of IndexWriter-style logging.  Is this
really the best way to go?  It makes debugging more painful, since the
method and class are not inserted into the log entries automatically.
I can do it; I'm just thinking of other folks who work on it.

The locking and such uses JDK 1.5.  I can downgrade it, but given that
kind of locking, and with 3.0 possibly coming out soon, is that best?

> SearcherPolicy

It's a marker class, like MergePolicy or Serializable.

> - Individual authors are mentioned in the code, that's not lucene
>  policy at the moment.

Agreed.  Eclipse throws them in and I delete them, but maybe some made
it in.  Maybe the @author tags should also be removed from
FieldCacheImpl, FieldDoc, and FieldCache.



Re: Realtime Search for Social Networks Collaboration

J. Delgado
In reply to this post by Otis Gospodnetic-2
On Sat, Sep 6, 2008 at 1:36 AM, Otis Gospodnetic <[hidden email]> wrote:
> Regarding real-time search and Solr, my feeling is the focus should be on first adding real-time search to Lucene, and then we'll figure out how to incorporate that into Solr later.
 
Otis, what do you mean exactly by "adding real-time search to Lucene"?  Note that Lucene, being an indexing/search library (and not a full-blown search engine), is by definition "real-time": once you add/write a document to the index it becomes immediately searchable, and once a document is logically deleted it is no longer returned in a search, though physical deletion happens during an index optimization.

Now, the problem of adding/deleting documents in bulk, as part of a transaction, and making these documents available for search immediately after the transaction is committed sounds more like a search engine problem (i.e. SOLR, Nutch, Ocean), especially if these transactions are known to be I/O expensive and thus are usually implemented as batched processes with some kind of sync mechanism, which makes them non-real-time.

For example, in my previous life, I designed and helped implement a quasi-realtime enterprise search engine using Lucene, with a set of multi-threaded indexers hitting multiple indexes allocated across different search services, which powered a broker-based distributed search interface. The most recent documents provided to the indexers were always added to the smaller in-memory (RAM) indexes, which could usually absorb the load of a bulk "add" transaction. These were later merged into larger disk-based indexes and then flushed to make them ready to absorb new fresh docs. We even had further partitioning of the indexes that reflected time periods, with caps on size, so that they could be merged into older, more archive-oriented indexes which were used less (yes, the search engine's default search was on data no more than 1 month old, though users could open the time window by including archives).
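
The RAM-then-disk tiering described above can be sketched roughly as follows. This is only an illustration under loud assumptions: it does not use Lucene, the class name TieredIndexSketch and its methods are hypothetical, both tiers are plain in-memory maps (the "disk" map merely stands in for a larger on-disk index), and the merge is triggered by a simple document-count cap.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: a small "RAM" inverted index absorbs new documents
// and is periodically merged into a larger index standing in for disk.
final class TieredIndexSketch {
    private final Map<String, Set<Integer>> ram = new HashMap<>();
    private final Map<String, Set<Integer>> disk = new HashMap<>();
    private final int maxRamDocs;
    private int ramDocs = 0;
    private int nextDocId = 0;
    private int merges = 0;

    TieredIndexSketch(int maxRamDocs) {
        this.maxRamDocs = maxRamDocs;
    }

    /** Adds a document to the RAM tier; it is searchable as soon as this returns. */
    int add(String text) {
        int id = nextDocId++;
        for (String term : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!term.isEmpty()) {
                ram.computeIfAbsent(term, t -> new TreeSet<>()).add(id);
            }
        }
        ramDocs++;
        if (ramDocs >= maxRamDocs) {
            mergeRamIntoDisk();
        }
        return id;
    }

    /** Folds the RAM postings into the big index and empties the RAM tier. */
    void mergeRamIntoDisk() {
        for (Map.Entry<String, Set<Integer>> e : ram.entrySet()) {
            disk.computeIfAbsent(e.getKey(), t -> new TreeSet<>()).addAll(e.getValue());
        }
        ram.clear();
        ramDocs = 0;
        merges++;
    }

    /** A query consults both tiers, so fresh and merged documents are both visible. */
    Set<Integer> search(String term) {
        String key = term.toLowerCase(Locale.ROOT);
        Set<Integer> hits = new TreeSet<>();
        hits.addAll(disk.getOrDefault(key, Collections.emptySet()));
        hits.addAll(ram.getOrDefault(key, Collections.emptySet()));
        return hits;
    }

    int mergeCount() { return merges; }

    int ramDocCount() { return ramDocs; }
}
```

The point of the design is that writes only ever touch the small tier, while the expensive merge into the large tier happens in bulk and off the critical path of an individual add.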

As for SOLR and OCEAN, I would argue that these semi-structured search engines are becoming more and more like relational databases with full-text search capabilities (without the benefit of full relational algebra -- for example, joins are not possible using SOLR). Notice that "real-time" CRUD operations and transactionality are core DB concepts and have been studied and developed by database communities for quite a long time. There have been recent efforts on how to efficiently integrate Lucene into relational databases (see the Lucene/Oracle JVM integration: http://marceloochoa.blogspot.com/2007/09/running-lucene-inside-your-oracle-jvm.html).


I think we should seriously look at joining efforts with open-source database engine projects written in Java (see http://java-source.net/open-source/database-engines) in order to blend IR and ORM once and for all.

-- Joaquin

 
 


I've read Jason's Wiki as well.  Actually, I had to read it a number of times to understand bits and pieces of it.  I have to admit there is still some fuzziness about the whole thing in my head - is "Ocean" something that already works, a separate project on googlecode.com?  I think so.  If so, and if you are working on getting it integrated into Lucene, would it make it less confusing to just refer to it as "real-time search", so there is no confusion?

If this is to be initially integrated into Lucene, why are things like replication, crowding/field collapsing, locallucene, name service, tag index, etc. all mentioned there on the Wiki and bundled with description of how real-time search works and is to be implemented?  I suppose mentioning replication kind-of makes sense because the replication approach is closely tied to real-time search - all query nodes need to see index changes fast.  But Lucene itself offers no replication mechanism, so maybe the replication is something to figure out separately, say on the Solr level, later on "once we get there".  I think even just the essential real-time search requires substantial changes to Lucene (I remember seeing large patches in JIRA), which makes it hard to digest, understand, comment on, and ultimately commit (hence the lukewarm response, I think).  Bringing other non-essential elements into discussion at the same time makes it more difficult to process all this new stuff, at least for me.  Am I the only one who finds this hard?

That said, it sounds like we have some discussion going (Karl...), so I look forward to understanding more! :)


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: Yonik Seeley <[hidden email]>
> To: [hidden email]
> Sent: Thursday, September 4, 2008 10:13:32 AM
> Subject: Re: Realtime Search for Social Networks Collaboration
>
> On Wed, Sep 3, 2008 at 6:50 PM, Jason Rutherglen
> wrote:
> > I also think it's got a
> > lot of things now which makes integration difficult to do properly.
>
> I agree, and that's why the major bump in version number rather than
> minor - we recognize that some features will need some amount of
> rearchitecture.
>
> > I think the problem with integration with SOLR is it was designed with
> > a different problem set in mind than Ocean, originally the CNET
> > shopping application.
>
> That was the first use of Solr, but it actually existed before that
> w/o any defined use other than to be a "plan B" alternative to MySQL
> based search servers (that's actually where some of the parameter
> names come from... the default /select URL instead of /search, the
> "rows" parameter, etc).
>
> But you're right... some things like the replication strategy were
> designed (well, borrowed from Doug to be exact) with the idea that it
> would be OK to have slightly "stale" views of the data in the range of
> minutes.  It just made things easier/possible at the time.  But tons
> of Solr and Lucene users want almost instantaneous visibility of added
> documents, if they can get it.  It's hardly restricted to social
> network applications.
>
> Bottom line is that Solr aims to be a general enterprise search
> platform, and getting as real-time as we can get, and as scalable as
> we can get are some of the top priorities going forward.
>
> -Yonik

Re: Realtime Search for Social Networks Collaboration

mark harwood
In reply to this post by Jason Rutherglen


Interesting discussion.

>> I think we should seriously look at joining efforts with open-source Database engine projects

I posted some initial dabblings here with a couple of the databases on your list: http://markmail.org/message/3bu5klzzc5i6uhl7 but this is not really a scalable solution (which is what Jason and others need).


>> for example joins are not possible using SOLR).

It's largely *because* Lucene doesn't do joins that it can be made to scale out. I've replaced two large-scale database systems this year with distributed Lucene solutions because this scale-out architecture provided significantly better performance. These were "semi-structured" systems too. Lucene's comparatively simplistic data model/query model is both a weakness and a strength in this regard.
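
A rough sketch of why the join-free model scales out: if every document is self-contained, each shard can answer a query on its own and a broker only has to merge the per-shard top hits by score. The code below is an illustration of that merge step only, not Lucene or Solr code; ScoredHitSketch and ShardMergeSketch are hypothetical names, and scores are assumed to be comparable across shards.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the broker side of a scatter-gather search.
final class ScoredHitSketch {
    final String shard;
    final int doc;
    final double score;

    ScoredHitSketch(String shard, int doc, double score) {
        this.shard = shard;
        this.doc = doc;
        this.score = score;
    }
}

final class ShardMergeSketch {
    /**
     * Merges the hit lists returned independently by each shard and keeps
     * the k best by score. No shard needs data from another shard, which
     * is what makes adding shards straightforward.
     */
    static List<ScoredHitSketch> mergeTopK(List<List<ScoredHitSketch>> perShard, int k) {
        List<ScoredHitSketch> all = new ArrayList<>();
        for (List<ScoredHitSketch> hits : perShard) {
            all.addAll(hits);
        }
        // Highest score first; the sort is stable, so ties keep shard order.
        all.sort(Comparator.comparingDouble((ScoredHitSketch h) -> h.score).reversed());
        return new ArrayList<>(all.subList(0, Math.min(k, all.size())));
    }
}
```

A join, by contrast, would require rows from different shards to meet somewhere before scoring, which is exactly the cross-node traffic this layout avoids.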


Cheers,
Mark.




Re: Realtime Search for Social Networks Collaboration

J. Delgado
On Sun, Sep 7, 2008 at 2:41 AM, mark harwood <[hidden email]> wrote:

>> for example joins are not possible using SOLR).

> It's largely *because* Lucene doesn't do joins that it can be made to scale out. I've replaced two large-scale database systems this year with distributed Lucene solutions because this scale-out architecture provided significantly better performance. These were "semi-structured" systems too. Lucene's comparatively simplistic data model/query model is both a weakness and a strength in this regard.

Hey, maybe the right way to go for a truly scalable and high-performance semi-structured database is to marry HBase (Bigtable-like data storage) with SOLR/Lucene. I concur with you in the sense that simplistic data models coupled with high performance are the killer.

Let me quote this from the original Bigtable paper from Google:

"Bigtable does not support a full relational data model; instead, it provides clients with a simple data model that supports dynamic control over data layout and format, and allows clients to reason about the locality properties of the data represented in the underlying storage. Data is indexed using row and column names that can be arbitrary strings. Bigtable also treats data as uninterpreted strings, although clients often serialize various forms of structured and semi-structured data into these strings. Clients can control the locality of their data through careful choices in their schemas. Finally, Bigtable schema parameters let clients dynamically control whether to serve data out of memory or from disk."


Re: Realtime Search for Social Networks Collaboration

J. Delgado
BTW, quoting Marcelo Ochoa (the developer behind the Oracle/Lucene implementation), the three minimal features a transactional DB should support for Lucene integration are:

  1) The ability to define new functions (e.g. lcontains(), lscore()) that bind queries to Lucene and return document/row scores.
  2) An API that allows DML intercepts, like Oracle's ODCI.
  3) The ability to extend and/or implement new types of "domain" indexes that the engine's query evaluation and execution/optimization planner can use efficiently.

Thanks Marcelo.

-- Joaquin




Re: Realtime Search for Social Networks Collaboration

Otis Gospodnetic-2
In reply to this post by Jason Rutherglen
Hi,

----- Original Message ----
From: J. Delgado <[hidden email]>
To: [hidden email]
Sent: Sunday, September 7, 2008 4:04:58 AM
Subject: Re: Realtime Search for Social Networks Collaboration

On Sat, Sep 6, 2008 at 1:36 AM, Otis Gospodnetic <[hidden email]> wrote:
>> Regarding real-time search and Solr, my feeling is the focus should be on first adding real-time search to Lucene, and then we'll figure out how to incorporate that into Solr later.

> Otis, what do you mean exactly by "adding real-time search to Lucene"?  Note that Lucene, being an indexing/search library (and not a full-blown search engine), is by definition "real-time": once you add/write a document to the index it becomes immediately searchable, and once a document is logically deleted it is no longer returned in a search, though physical deletion happens during an index optimization.

OG: When I think about real-time search I see it as: "Make the newly added document show up in search results without closing and reopening the whole index with IndexWriter.  In other words, minimize re-reading of the old/unchanged data just to be able to see the newly added data."

I believe this is similar to what IndexReader.reopen does, and Jason does make use of it.
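
To illustrate "minimize re-reading of the old/unchanged data", here is a small model of the idea. It deliberately does not call Lucene: ReaderSketch and SegmentSketch are hypothetical names, a "segment" is just a named object, and "loading" a segment is represented by constructing it. The reopen step reuses every segment the old reader already holds and loads only the ones that are new.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of segment sharing on reopen; not Lucene's API.
final class SegmentSketch {
    final String name;

    SegmentSketch(String name) {
        this.name = name; // a real reader would read index files here
    }
}

final class ReaderSketch {
    private final Map<String, SegmentSketch> segments;
    /** How many segments had to be loaded to build this reader. */
    final int segmentsLoaded;

    private ReaderSketch(Map<String, SegmentSketch> segments, int segmentsLoaded) {
        this.segments = segments;
        this.segmentsLoaded = segmentsLoaded;
    }

    /** Cold open: every listed segment is loaded. */
    static ReaderSketch open(List<String> segmentNames) {
        return build(new LinkedHashMap<String, SegmentSketch>(), segmentNames);
    }

    /** Reopen: unchanged segments are shared with this reader, new ones are loaded. */
    ReaderSketch reopen(List<String> currentSegmentNames) {
        return build(this.segments, currentSegmentNames);
    }

    SegmentSketch segment(String name) {
        return segments.get(name);
    }

    private static ReaderSketch build(Map<String, SegmentSketch> previous, List<String> names) {
        Map<String, SegmentSketch> next = new LinkedHashMap<>();
        int loaded = 0;
        for (String name : names) {
            SegmentSketch seg = previous.get(name);
            if (seg == null) {
                seg = new SegmentSketch(name);
                loaded++;
            }
            next.put(name, seg);
        }
        return new ReaderSketch(next, loaded);
    }
}
```

With this shape, the cost of seeing a newly added document is proportional to the size of the new segment rather than to the size of the whole index.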

Otis



