Realtime Search for Social Networks Collaboration

Jason Rutherglen
Hello all,

I don't mean this to sound like a solicitation.  I've been working on
realtime search and created some Lucene patches etc.  I am wondering
if there are social networks (or anyone else) out there who would be
interested in collaborating with Apache on realtime search to get it
to the point it can be used in production.  It is a challenging
problem that only Google has solved and made to scale.  I've been
working on the problem for a while and though a lot has been
completed, there is still a lot more to do and collaboration amongst
the most probable users (social networks) seems like a good thing to
try to do at this point.  I guess I'm saying it seems like a hard
enough problem that perhaps it's best to work together on it rather
than have each company try to complete its own.  However, I could be
wrong.

Realtime search benefits social networks by providing a scalable,
searchable alternative to large MySQL implementations.  MySQL, I have
heard, is difficult to scale past a certain point.  Apparently Google has
created things like BigTable (a large database) and an online service
called GData (Google has not published any whitepapers on the
underlying technology) to address scaling large database systems.
BigTable does not offer search.  GData does, and is used by all of
Google's web services instead of something like MySQL (this is at
least how I understand it).  Social networks usually grow, and so
scaling is continually an issue.  It is possible to build a realtime
search system that scales linearly, something that I have heard
becomes difficult with MySQL.  There is an article that discusses some
of these issues:
http://acmqueue.com/modules.php?name=Content&pa=showpage&pid=337
I don't think the current GData implementation is perfect, and there is a
lot that can be improved on.  It might be useful to figure out
together what can be added.

If this sounds like something of interest to anyone feel free to send
your input.

Take care,
Jason

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: Realtime Search for Social Networks Collaboration

Yonik Seeley-2
On Wed, Sep 3, 2008 at 3:20 PM, Jason Rutherglen
<[hidden email]> wrote:
> I am wondering
> if there are social networks (or anyone else) out there who would be
> interested in collaborating with Apache on realtime search to get it
> to the point it can be used in production.

Good timing Jason, I think you'll find some other people right here
at Apache (solr-dev) that want to collaborate in this area:

http://www.nabble.com/solr2%3A-Onward-and-Upward-td19224805.html

I've looked at your wiki briefly, and all the high level goals/features seem
to really be synergistic with where we are going with Solr2.

-Yonik

Re: Realtime Search for Social Networks Collaboration

Jason Rutherglen
Hi Yonik,

The SOLR 2 list looks good.  The question is, who is going to do the
work?  I tried to simplify the scope of Ocean as much as possible to
make it possible (slowly, over time) for me to eventually
finish what is mentioned on the wiki.  I think SOLR is very cool and
was a major step forward when it came out.  I also think it's got a
lot of things now which make integration difficult to do properly.  I
did try to integrate and received a lukewarm response, and so decided
to just move ahead separately until folks have time to collaborate.
We probably should try to integrate SOLR and Ocean somehow; however, we
may want to simply reduce the scope a bit and figure out what is needed
most, with the main use case being social networks.

I think the problem with integration with SOLR is that it was designed with
a different problem set in mind than Ocean, originally the CNET
shopping application.  Facets were important; realtime was not needed
because pricing doesn't change very often.  I designed Ocean for
social networks and, further into the future, realtime
messaging-based mobile applications.

SOLR needs to be backward compatible and support its existing user
base.  How do you plan on doing this for a SOLR 2 if the architecture
is changed dramatically?  SOLR solves a problem set that is very
common, making SOLR very useful in many situations.  However, I wanted
Ocean to be like GData.  So I wanted the scalability of Google, which
SOLR doesn't quite have yet, and the realtime, and then I figured the
other stuff could be added later, stuff people seem to spend a lot of
time on in the SOLR community currently (spellchecker, db imports,
many others).  I did use some of the SOLR terminology in building
Ocean, like snapshots!  But most of it is a digression.  I tried to
use schemas, but they just made the system harder to use.  For
distributed search I prefer serialized objects, as this enables things
like SpanQueries and payloads without writing request handlers and
such.  Also, there is no need to write new request handlers and deploy
them (an expensive operation for systems running on 100s of servers),
as any new classes are simply loaded dynamically by the server
from the client.
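
To illustrate the serialized-object approach described above, here is a
minimal sketch using only JDK serialization.  It is an assumption about
the general shape, not Ocean's actual code: TermSearchRequest and
SerializedQueryDemo are hypothetical names, the byte array stands in for
the network hop, and dynamic class loading from the client is not shown.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical query type; not a Lucene or Ocean class.
final class TermSearchRequest implements Serializable {
    private static final long serialVersionUID = 1L;
    final String field;
    final String term;
    final int maxHits;

    TermSearchRequest(String field, String term, int maxHits) {
        this.field = field;
        this.term = term;
        this.maxHits = maxHits;
    }
}

final class SerializedQueryDemo {
    // Client side: turn the request object into bytes for the wire.
    static byte[] toBytes(Serializable request) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(request);
        }
        return buffer.toByteArray();
    }

    // Server side: rebuild the object; its class must be loadable here.
    static Object fromBytes(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] wire = toBytes(new TermSearchRequest("body", "realtime", 10));
        TermSearchRequest received = (TermSearchRequest) fromBytes(wire);
        System.out.println(received.field + ":" + received.term + " top " + received.maxHits);
    }
}
```

The server receives a typed object rather than a string of request
parameters, so a new query type needs no new parsing code on the server
side.  Deserializing objects from untrusted clients is risky, so a
sketch like this only fits a trusted internal cluster.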

A lot is now outlined on the wiki site
http://wiki.apache.org/lucene-java/OceanRealtimeSearch and there
will be a lot more javadocs in the forthcoming patch.  The latest code
is also available all the time at
http://oceansearch.googlecode.com/svn/trunk/trunk/oceanlucene

I do welcome more discussion, and if there are Solr developers who wish
to work on Ocean, feel free to drop me a line.  Most of all, though, I
think it would be useful for social networks interested in realtime
search to get involved, as it may be difficult for any one company
to have enough resources to implement it to a production
level.  I think this is where open source collaboration is
particularly useful.

Cheers,

Jason Rutherglen
[hidden email]


Re: Realtime Search for Social Networks Collaboration

Cam Bazz
In reply to this post by Jason Rutherglen
Hello Jason,
I have been trying to do this for a long time on my own.  Keep up the good
work.

What I tried was a document cache using Apache Collections.  Before an
index write/delete I would sync the cache with the index.

I am waiting for Lucene 2.4 to proceed (delete by query).

Best.
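
One possible reading of the cache-then-sync approach described above, as
a minimal sketch.  Everything here is an assumption for illustration:
plain java.util maps replace Apache Collections, FakeIndex is a
hypothetical stand-in for the real Lucene index, and matching is a
simple substring test.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the on-disk index; a real system would call Lucene here.
final class FakeIndex {
    final Map<String, String> docs = new LinkedHashMap<>();
    int commits = 0;

    void addAll(Map<String, String> batch) {
        docs.putAll(batch);
        commits++;
    }

    void delete(String id) {
        docs.remove(id);
        commits++;
    }
}

final class CachedIndex {
    private final FakeIndex index = new FakeIndex();
    private final Map<String, String> cache = new LinkedHashMap<>();

    // New documents land in the cache first, so they are searchable immediately.
    void add(String id, String text) {
        cache.put(id, text);
    }

    // Sync step: push cached documents into the index, then clear the cache.
    void sync() {
        if (!cache.isEmpty()) {
            index.addAll(cache);
            cache.clear();
        }
    }

    // Deletes go to the index, so the cache is synced first to keep both views consistent.
    void delete(String id) {
        sync();
        index.delete(id);
    }

    // Search looks at both the committed index and the not-yet-synced cache.
    List<String> search(String word) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> e : index.docs.entrySet()) {
            if (e.getValue().contains(word)) hits.add(e.getKey());
        }
        for (Map.Entry<String, String> e : cache.entrySet()) {
            if (e.getValue().contains(word)) hits.add(e.getKey());
        }
        return hits;
    }

    int indexCommits() {
        return index.commits;
    }
}

final class CachedIndexDemo {
    public static void main(String[] args) {
        CachedIndex ci = new CachedIndex();
        ci.add("1", "realtime search");
        ci.add("2", "batch search");
        System.out.println(ci.search("search")); // both found before any index write
        ci.delete("1");
        System.out.println(ci.search("search")); // only document 2 remains
    }
}
```

The point of the sketch is that additions cost nothing on the index side
until a sync is forced, while a delete always pays for a sync first.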


Re: Realtime Search for Social Networks Collaboration

Jason Rutherglen
Hi Cam,

Thanks!  It has not been easy; it has probably taken 3 years or so to get
this far.  At first I thought the new reopen code would be the
solution.  I used it, but then needed to modify it to clone the
deleted docs instead of referencing the old ones.  Then, as I iterated,
I realized that just using reopen on a RAMDirectory would not be quite
fast enough because of the merging.  Then I started using
InstantiatedIndex, which provides an in-memory version of the document
without the overhead of merging during the transaction.  There are
other complexities as well.  The basic code works if you are
interested in trying it out.
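
The clone-instead-of-reference change mentioned above can be sketched
roughly as follows.  This is an assumption about the idea, not the
actual patch: SnapshotReader is a hypothetical class (not Lucene's
IndexReader), and a java.util.BitSet stands in for the deleted-docs
structure.

```java
import java.util.BitSet;

// Hypothetical reader snapshot; not Lucene's IndexReader.
final class SnapshotReader {
    private final String[] docs;   // shared, immutable segment data
    private final BitSet deleted;  // private to this snapshot

    SnapshotReader(String[] docs, BitSet deleted) {
        this.docs = docs;
        this.deleted = deleted;
    }

    // Reopen with one more deletion: clone the bitset instead of sharing it,
    // so searches running on the old snapshot never see the new delete.
    SnapshotReader reopenWithDelete(int docId) {
        BitSet copy = (BitSet) deleted.clone();
        copy.set(docId);
        return new SnapshotReader(docs, copy);
    }

    boolean isDeleted(int docId) {
        return deleted.get(docId);
    }

    int liveCount() {
        return docs.length - deleted.cardinality();
    }
}

final class SnapshotReaderDemo {
    public static void main(String[] args) {
        SnapshotReader r1 = new SnapshotReader(new String[] {"a", "b", "c"}, new BitSet());
        SnapshotReader r2 = r1.reopenWithDelete(1);
        System.out.println(r1.liveCount() + " " + r2.liveCount()); // prints "3 2"
    }
}
```

The document data is shared between snapshots and only the small
deletion bitset is copied, which keeps each reopen cheap while giving
every reader a stable view.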

Take care,
Jason
