GData


GData

jason rutherglen-2
http://jeremy.zawodny.com/blog/archives/006687.html

Here is a good blog entry with a talk on GData from someone who worked on it.  The only thing I think Solr needs is faster replication, which perhaps could be done with a direct replication model, preferably transferring the segment files over HTTP instead of rsync, reserving rsync for the optimized index sync.  The only other thing GData does is versioning of the documents.


Re: GData

Yonik Seeley
On 4/25/06, jason rutherglen <[hidden email]> wrote:
> Here is a good blog entry with a talk on GData from someone who worked on it.  The only thing I think Solr needs is faster replication, which perhaps can be done faster using a direct replication model, preferably over HTTP of the segments files instead of rsync?

rsync should be very fast if you configure it to not checksum the
files, and just go by timestamp and size.  It will only transfer the
changed segments.  We get very good performance with this model.
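The timestamp-and-size comparison rsync makes when checksums are off can be sketched in plain Java. This is a hypothetical illustration (class and method names are made up, not Solr's actual snappuller scripts) of copying only the segment files whose size or modification time differ:

```java
import java.io.IOException;
import java.nio.file.*;

// Sketch: copy only files whose size or mtime differ between src and
// dst -- the comparison rsync uses when checksumming is disabled.
// Hypothetical illustration, not Solr's actual replication code.
public class SegmentSync {
    public static int sync(Path src, Path dst) throws IOException {
        int copied = 0;
        Files.createDirectories(dst);
        try (DirectoryStream<Path> files = Files.newDirectoryStream(src)) {
            for (Path f : files) {
                Path target = dst.resolve(f.getFileName());
                if (changed(f, target)) {
                    Files.copy(f, target, StandardCopyOption.REPLACE_EXISTING,
                               StandardCopyOption.COPY_ATTRIBUTES);
                    copied++;
                }
            }
        }
        return copied;
    }

    private static boolean changed(Path from, Path to) throws IOException {
        if (!Files.exists(to)) return true;
        return Files.size(from) != Files.size(to)
            || !Files.getLastModifiedTime(from).equals(Files.getLastModifiedTime(to));
    }
}
```

Since Lucene segment files are write-once, unchanged segments are skipped entirely on each sync, which is why this model transfers so little data between commits.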

>  Reserving rsync for the optimized index sync.  The only other thing GData does is
> versioning of the documents.

Hmmm, that might require some thought...  I guess it depends on what
GData allows you to do with the different versions.

-Yonik

Re: GData

jason rutherglen-2
Also, they have created what look like fine-grained date-based queries for use with the Calendar application.  Perhaps a predefined, out-of-the-box way of handling date-range queries in Solr would be useful.


Re: GData

Erik Hatcher
Anyone here an old timer Apple Newton user?

I've been really getting jazzed on the ideas I'm getting thanks to  
Solr and contemplating Ruby integration.  I've been re-reading my  
dusty "Programming for the Newton" (using Windows!) book.  The  
discussion of the Newton "soup" data storage mechanism is very much  
on track with what I'd like to implement from the Ruby side of things  
using Solr as the "soups" storage.   I think more needs to be done  
with Solr than just faster replication to enable a flexible schema  
scenario.  Back to the Newton analogy, each application registers its  
own schema but everything fits into a common storage system allowing  
a unified querying mechanism.  Merging queries/data across soups is  
not done except at the application level, but I can see in the Solr  
case that custom handlers can facilitate this sort of thing to free  
the client from having to deal with the massive amount of data.

I've been mulling over the idea of having a single Solr instance
morph into a system that can handle multiple client-defined schemas
(why not?  Lucene itself can handle it) rather than a static XML file,
and allow the schemas themselves to be retrievable (yes, I know they
already are).  I'm still talking about a single Lucene index, but with
each Document given a "soup" name field and filters automatically
available to single out a specific soup.

Make sense?  I think the GData thing fits with the loosely defined  
schema scenario as well.

Thoughts?

I was going to wait until my thoughts were more gelled on this topic,  
but the GData thread brought me out of my cave earlier.

        Erik




Re: GData

Chris Hostetter-3

: I've been mulling over the idea of having a single Solr instance
: morph into system that can handle multiple client-defined schemas
: (why not?  Lucene itself can handle it) rather than a static XML file
: and allow the schemas themselves to be retrievable (yes, I know it
: already is).  I'm still talking about a single Lucene index, but with
: each Document given a "soup" name field and filters automatically
: available to single out a specific soup.

Given the flexibility of dynamicFields, I think we're 99% of the way there
-- all we'd need is support for <copyField source="*_t"  dest="foo"/> and
then you could define a "soup" schema with nothing but dynamic fields
(one per datatype/stored/indexed triple you care about) and a few common
fields for partitioning and generic text searching.


-Hoss


Re: GData

jason rutherglen-2
OK, if Google is using the GData architecture to store the GCalendar data (assuming they are), how long do you think a write takes to show up on the GCalendar web site?  I think in this case something other than rsync may be a better option.


Re: GData

Ian Holsman-5
I noticed you guys have created a 'gdata-lucene' server in the SoC project.
Are you planning on doing this via Solr, or is it something brand new?

--i



--
[hidden email] -- blog: http://feh.holsman.net/ -- PH: ++61-3-9818-0132

If everything seems under control, you're not going fast enough. -
Mario Andretti

Re: GData

Yonik Seeley
On 4/25/06, jason rutherglen <[hidden email]> wrote:
> Ok, if Google is using the GData architecture to store the GCalendar data, assuming they are, how long do you think a write takes to show up on the GCalendar web site?  I think in this case something other than rsync may be a better option.

rsync is just used as a replication transport, and I don't think it's
the limiting factor.

Opening a new IndexSearcher in Lucene is a relatively expensive
operation, especially when you factor in populating the FieldCache and
field norms.  You shouldn't be doing it too often (once a minute, maybe).

If updates need to be immediately visible in conjunction with a high
update rate, a database is a better solution.

For Solr, I'd solve GData for the single-server case first, then go
about figuring out replication requirements.
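The "don't reopen too often" advice can be sketched as a holder that hands out a cached searcher and rebuilds it at most once per refresh interval. A minimal, hypothetical sketch -- the class and the `Supplier` opener are illustrative, not Solr's actual code:

```java
import java.util.function.Supplier;

// Sketch: serve a cached searcher, rebuilding it at most once per
// refresh interval -- the "reopen once a minute, maybe" pattern.
// SearcherHolder and its Supplier are hypothetical, not Solr API.
public class SearcherHolder<T> {
    private final Supplier<T> opener;   // opens a fresh searcher (expensive)
    private final long intervalMillis;  // minimum time between reopens
    private volatile T current;
    private volatile long lastOpened;

    public SearcherHolder(Supplier<T> opener, long intervalMillis) {
        this.opener = opener;
        this.intervalMillis = intervalMillis;
    }

    public synchronized T get() {
        long now = System.currentTimeMillis();
        if (current == null || now - lastOpened >= intervalMillis) {
            current = opener.get();  // pays the cache/norms warm-up cost
            lastOpened = now;
        }
        return current;
    }
}
```

Requests between reopens all see the same index view, which is what makes the aggressive caching discussed later in this thread possible.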




--
-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server

Re: GData

Doug Cutting
Ian Holsman wrote:
> I noticed you guys have created a 'gdata-lucene' server in the SoC project.
> are you planning on doing this via SoLR? or is it something brand new?

We decided that doing this via Solr would probably make it more
complicated.  A simple, standalone GData server built using just
Lucene is what we had in mind for the SoC project.  This could then
become a Lucene contrib module.

Doug

Re: GData

jason rutherglen-2
I would be curious then how the Google architecture works, given that it seems to combine search and database concepts, and the Adam Bosworth talk seems to imply a replicated, redundant architecture like Solr's.  Is a faster method of loading or updating the IndexSearcher something that makes sense for Lucene?  Or should we just assume the Google architecture is a lot more complex?


Re: GData

Doug Cutting
jason rutherglen wrote:
> Is a faster method of loading or updating the IndexSearcher something that makes sense for Lucene?

Yes.  Folks have developed incrementally updateable IndexSearchers
before, but none is yet part of Lucene.

>  Or just assume the Google architecture is a lot more complex.

That's probably a safe assumption.  Their architecture is designed to
support real-time things like calendars, email, etc.  Search engines,
Lucene's normal domain, are not usually real-time, but have indexing delays.

Doug

Re: GData

jason rutherglen-2
I tried to find the answer to this Nutch question in the docs and mailing list; sorry if it's a bit naive.  Assuming Nutch distributes the index over many machines, does it use the NutchFS as the Directory for the IndexSearcher, or does it use RemoteMultiSearcher?

> support real-time things like calendars, email, etc.  Search engines, Lucene's normal domain, are not usually real-time, but have indexing delays.

True, however it may be an interesting direction to go in.  They seem to make the information nearly immediately searchable.  Surely we can do the same.


Re: GData

jason rutherglen-2
Ah OK, I think I found it: org.apache.nutch.indexer.FsDirectory, no?

Couldn't this be used in Solr to distribute all the data rather than master/slave it?


Re: GData

Doug Cutting
jason rutherglen wrote:
> Ah ok, think I found it: org.apache.nutch.indexer.FsDirectory no?
>
> Couldn't this be used in Solr and distribute all the data rather than master/slave it?

It's possible to search a Lucene index that lives in Hadoop's DFS, but
not recommended.  It's very slow.  It's much faster to copy the index to
a local drive.

The rsync approach, of only transmitting index diffs, is a very
efficient way to distribute an index.  In particular, it supports
scaling the number of *readers* well.

For read/write stuff (e.g. a calendar) such scaling might not be
paramount.  Rather, you might be happy to route all requests for a
particular calendar to a particular server.  The index/database could
still be somehow replicated/synced, in case that server dies, but a
single server can probably handle all requests for a particular
index/database.  And keeping things coherent is much simpler in this case.

Doug
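Doug's suggestion of routing all requests for a particular calendar to a particular server is essentially partitioning by key. A toy sketch of the idea, with hypothetical server names and a simple hash-based assignment (a real deployment would also need the replication/failover he mentions):

```java
import java.util.List;

// Sketch: pin each calendar to one server by hashing its id, so all
// reads and writes for that calendar hit the same index/database.
// Server names and the class itself are hypothetical.
public class CalendarRouter {
    private final List<String> servers;

    public CalendarRouter(List<String> servers) {
        this.servers = servers;
    }

    public String serverFor(String calendarId) {
        // floorMod keeps the index non-negative for any hashCode value
        return servers.get(Math.floorMod(calendarId.hashCode(), servers.size()));
    }
}
```

Because every request for a given calendar lands on one server, that server's index can accept writes and serve reads without any cross-node coherence protocol.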

Re: GData

gsingers


Erik Hatcher <[hidden email]> wrote:
I've been mulling over the idea of having a single Solr instance  
morph into system that can handle multiple client-defined schemas  
(why not?  Lucene itself can handle it) rather than a static XML file  
and allow the schemas themselves to be retrievable (yes, I know it  
already is).  I'm still talking about a single Lucene index, but with  
each Document given a "soup" name field and filters automatically  
available to single out a specific soup.


I agree.

I also don't think it is going to work well to have one webapp per index/schema for those of us who want multiple indexes.  I think Solr needs to be able to support multiple Lucene indexes per WAR deployment as well (although your soup idea would work well too), and then allow requests to specify which index to use, perhaps with some defaults.  With this idea, you could even imagine an application uploading a schema to Solr and having it create the index, upon which you could then add documents.  No need to do anything on the command line at all.


----------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com
               

Re: GData

Yonik Seeley
> I think SOLR needs to be able to support multiple Lucene indexes per WAR
> deployment as well

Is this because single requests need to query across multiple indexes?
Or do you have different document types that you don't want to stick
in the same physical Lucene index?

> With this idea, you could even imagine an application could upload a schema to
> SOLR and have it create the index, etc. upon which you could then add documents, etc.

It's a slightly foreign use-case for me since our search collections
are used to power websites (both content generation and
relevancy-search).  No one is going to share Solr servers, or even
change much of anything on-the-fly.  Any major changes require
validation (performance and otherwise).

What types of applications do you see this useful for?

-Yonik


Re: GData, updateable IndexSearcher

jason rutherglen-2
Hi Doug,

Thanks for the info, makes sense.

> In particular, it supports scaling the number of *readers* well.

Yes, this is very true and a good architecture.  In fact, because Java comes in 64-bit flavors, it allows for a smaller number of machines than 32-bit C systems with memory limitations, like the current Google architecture.

> Yes.  Folks have developed incrementally updateable IndexSearchers before, but none is yet part of Lucene.

Interesting, does this mean there is a plan for incrementally updateable IndexSearchers to become part of Lucene?  Are there any negatives to updateable IndexSearchers?  

Thanks,

Jason




Re: GData, updateable IndexSearcher

Doug Cutting
jason rutherglen wrote:
> Interesting, does this mean there is a plan for incrementally updateable IndexSearchers to become part of Lucene?

In general, there is no plan for Lucene.  If someone implements a
generally useful, efficient, feature in a back-compatible, easy to use,
manner, and submits it as a patch, then it becomes a part of Lucene.
That's the way Lucene changes.  Since we don't pay anyone, we can't make
plans and assign tasks.  So if you're particularly interested in this
feature, you might search the archives to find past efforts, or simply
try to implement it yourself.

I think a good approach would be to create a new IndexSearcher instance
based on an existing one, that shares IndexReaders.  Similarly, one
should be able to create a new IndexReader based on an existing one.
This would be a MultiReader that shares many of the same SegmentReaders.

Things get a little tricky after this.

Lucene caches filters based on the IndexReader.  So filters would need
to be re-created.  Ideally these could be incrementally re-created, but
that might be difficult.  What might be simpler would be to use a
MultiSearcher constructed with an IndexSearcher per SegmentReader,
avoiding the use of MultiReader.  Then the caches would still work.
This would require making a few things public that are not at present.
Perhaps adding a 'MultiReader.getSubReaders()' method, combined with an
'static IndexReader.reopen(IndexReader)' method.  The latter would
return a new MultiReader that shared SegmentReaders with the old
version.  Then one could use getSubReaders() on the new multi reader to
extract the current set to use when constructing a MultiSearcher.
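The reader-sharing idea here might look roughly like the following sketch. The classes are simplified stand-ins for Lucene's SegmentReader and MultiReader (not the real API), just to show a reopen that reuses sub-readers for segments that still exist and opens new readers only for new segments:

```java
import java.util.*;

// Sketch of the proposed IndexReader.reopen(): build a new multi-reader
// that reuses the old reader's sub-readers for unchanged segments and
// opens readers only for segments that are new.  Simplified stand-ins
// for Lucene's SegmentReader/MultiReader, not the real classes.
public class ReopenSketch {
    static class SegmentReader {
        final String segment;
        SegmentReader(String segment) { this.segment = segment; }
    }

    static class MultiReader {
        private final List<SegmentReader> subReaders;
        MultiReader(List<SegmentReader> subReaders) { this.subReaders = subReaders; }
        // the proposed MultiReader.getSubReaders() accessor
        List<SegmentReader> getSubReaders() { return subReaders; }
    }

    // the proposed 'static IndexReader.reopen(IndexReader)', sketched
    static MultiReader reopen(MultiReader old, List<String> currentSegments) {
        Map<String, SegmentReader> existing = new HashMap<>();
        for (SegmentReader r : old.getSubReaders()) existing.put(r.segment, r);
        List<SegmentReader> subs = new ArrayList<>();
        for (String seg : currentSegments) {
            SegmentReader shared = existing.get(seg);
            // reuse the old SegmentReader when the segment is unchanged
            subs.add(shared != null ? shared : new SegmentReader(seg));
        }
        return new MultiReader(subs);
    }
}
```

Because the shared sub-readers are the same object instances, per-SegmentReader caches would survive the reopen, which is the point of building the MultiSearcher from the individual sub-readers.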

Another tricky bit is figuring out when to close readers.

Does this make sense?  This discussion should probably move to the
lucene-dev list.

> Are there any negatives to updateable IndexSearchers?  

Not if implemented well!

Doug

Re: GData, updateable IndexSearcher

jason rutherglen-2
This originated on the Solr mailing list.

> That's the way Lucene changes.

I thought you were implying that you knew of someone who had built their own, but as a closed-source solution, and if so, that you would know how that project fared.

It definitely sounds like an interesting project; it will take me several days to digest the design you described.  As this would be used with Solr, I wonder if there would be a good way to also update the Solr caches.  Wouldn't there also need to be a hack on the IndexWriter to keep track of new segments?


Re: GData, updateable IndexSearcher

Yonik Seeley
On 4/26/06, jason rutherglen <[hidden email]> wrote:
> As this would be used with Solr I wonder if there would be a good way to also update the Solr caches.

Other than re-executing queries that generated the results? Probably not.
The nice thing about knowing exactly when the view of an index changes
(and having it only change once in a while) is that you can do very
aggressive caching.

If you want an IndexSearcher whose view of the index changed every
second (for example), I don't think Solr's type of caching would be
useful at all (or even possible, if you have big caches and
autowarming).

-Yonik
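The coupling Yonik describes between cache validity and a fixed index view can be sketched as a cache keyed on the reader instance that produced each entry. This is a hypothetical illustration, not Solr's actual cache implementation:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: cache results keyed by the reader instance that produced
// them.  A reopened searcher presents a new reader, so its per-reader
// map starts empty, and entries for the old view can be evicted.
// Hypothetical illustration, not Solr's cache implementation.
public class ReaderKeyedCache<R, K, V> {
    private final Map<R, Map<K, V>> byReader = new HashMap<>();

    public V get(R reader, K key) {
        Map<K, V> m = byReader.get(reader);
        return m == null ? null : m.get(key);
    }

    public void put(R reader, K key, V value) {
        byReader.computeIfAbsent(reader, r -> new HashMap<>()).put(key, value);
    }

    // on reopen, drop entries tied to the old view of the index
    public void evict(R oldReader) {
        byReader.remove(oldReader);
    }
}
```

This is why a view that changes every second defeats the scheme: each new reader starts with a cold cache, and autowarming the new view would never keep up.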