scaling / sharding questions

scaling / sharding questions

Jeremy Hinegardner
Hi all,

This may be a bit rambling, but let's see how it goes.  I'm not a Lucene or Solr
guru by any means; I have been prototyping with Solr to understand how all
the pieces and parts fit together.

We are migrating our current document storage infrastructure to a decent-sized
Solr cluster, using 1.3 snapshots right now.  Eventually this will hold a
billion+ documents, with about 1M new documents added per day.

Our main sticking point right now is that a significant number of our documents
will be updated at least once, and possibly more than once.  The volatility of
a document decreases over time.

With this in mind, we've been considering using a cascading series of shard
clusters.  That is:

 1) A cluster of shards holding recent data (the most recent week or two):
    smaller indexes that take little time to commit updates and optimise,
    since these will hold the most volatile documents.

 2) Following that, another cluster of shards holding relatively recent
    (3-6 months?) but not especially volatile documents; these are items that
    could potentially receive updates, but generally will not.

 3) A final set of 'archive' shards as the final resting place for
    documents.  These would not receive updates, and would stay online for
    searching and analysis "forever".

We are not sure if this is the best way to go, but it is the approach we are
leaning toward right now.  I would like some feedback from the folks here on
whether you think that is a reasonable approach.
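
For illustration, here is a minimal sketch of how a single query might fan out
across those three tiers using the shards parameter of Solr 1.3's distributed
search.  The host names, ports and query field are made up; only the shards
parameter itself is real Solr syntax.

    // Sketch only: one query fanned out over all three tiers with Solr 1.3
    // distributed search.  Host names, ports and the query are placeholders.
    public class TieredShardsQuery {
        public static void main(String[] args) throws Exception {
            String recent  = "solr-recent1:8983/solr,solr-recent2:8983/solr";
            String stable  = "solr-stable1:8983/solr,solr-stable2:8983/solr";
            String archive = "solr-archive1:8983/solr,solr-archive2:8983/solr";

            // Any node can act as the aggregator; it forwards the query to every
            // shard listed in the shards parameter and merges the results.
            String url = "http://solr-recent1:8983/solr/select"
                       + "?q=" + java.net.URLEncoder.encode("body:migration", "UTF-8")
                       + "&shards=" + recent + "," + stable + "," + archive;
            System.out.println(url);
        }
    }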

One of the other things I'm wondering about is how to manipulate indexes.
We'll need to roll documents around between indexes over time, or at least
migrate indexes from one set of shards to another as the documents 'age', and
merge/aggregate them with more 'stable' indexes.  I know about merging complete
indexes together, but what about migrating a subset of documents from one index
into another index?
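
Merging complete indexes is the easy half.  A rough sketch against a recent
Lucene API looks like the block below (the 1.3-era signatures differ, and the
paths are made up); migrating a subset has no equivalent one-liner, which is
what the question above is getting at.

    import java.nio.file.Paths;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    public class MergeWholeIndexes {
        public static void main(String[] args) throws Exception {
            // Open the destination 'stable' index and fold two complete
            // 'recent' indexes into it wholesale.
            try (IndexWriter writer = new IndexWriter(
                    FSDirectory.open(Paths.get("/data/stable/index")),
                    new IndexWriterConfig())) {
                writer.addIndexes(
                        FSDirectory.open(Paths.get("/data/recent-a/index")),
                        FSDirectory.open(Paths.get("/data/recent-b/index")));
                writer.commit();
            }
        }
    }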

In addition, what is generally considered a 'manageable' size for a large
index?  I was attempting to find some information on the relationship between
search response times, the amount of memory used for a search, and the number
of documents in an index, but I wasn't having much luck.

I'm not sure if I'm making sense here, but I just thought I would throw this out
there and see what people think.  There is the distinct possibility that I am
not asking the right questions or considering the right parameters, so feel free
to correct me, or ask questions as you see fit.

And yes, I will report how we are doing things when we get this all figured out,
and if there are items that we can contribute back to Solr we will.  If nothing
else there will be a nice article on how we manage terabytes of data with Solr.

enjoy,

-jeremy

--
========================================================================
 Jeremy Hinegardner                              [hidden email]

Re: scaling / sharding questions

Marcus Herou
Cool sharding technique.

We are also thinking about how to "move" docs from one index to another,
because we need to re-balance the docs when we add new nodes to the cluster.
We only store ids in the index, otherwise we could have moved stuff
around with IndexReader.document(x) or so.  Luke (http://www.getopt.org/luke/)
is able to reconstruct the indexed Document data, so it should be doable.
However, I'm thinking of actually just deleting the docs from the old index and
adding new Documents to the new node.  It would be cool not to waste CPU cycles
reindexing already-indexed stuff, but...

We will also have data amounts in the range you are talking about; perhaps we
could share ideas?

How do you plan to store where each document is located?  I mean, you probably
need to store info about the Document and its location somewhere, perhaps in a
clustered DB?  We will probably go with HBase for this.
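
To make the idea concrete, the lookup could be as small as the interface below;
the HBase (or other DB) backing is deliberately left out, and nothing here is a
real Solr or HBase API, just a hypothetical sketch.

    public interface DocumentLocator {
        /** Returns the shard URL currently holding this document id, or null if unknown. */
        String lookupShard(String docId);

        /** Records (or updates) the shard a document lives on after (re)indexing. */
        void recordLocation(String docId, String shardUrl);
    }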

I think the number of documents is less important than the actual data size
(just speculating).  We currently search 10M indexed blog entries (this will get
much, much larger) on one machine where the JVM has a 1G heap; the index size is
3G and response times are still quite fast.  This is a read-only node, though,
and it is updated every morning with a freshly optimized index.  Someone told me
that you probably need twice the RAM if you plan to both index and search at the
same time.  If I were you, I would just test it: index X entries of your data,
then search the index with a lower JVM heap setting each round; when response
times get too slow or you hit an OOM, you have a rough estimate of the bare
minimum RAM needed for those X entries.

I think we will do fine with something like 2G per 50M docs, but I will need to
test it out.
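
That heap-sizing experiment can be scripted with a tiny probe like this sketch
(written against a recent Lucene API; index path, field name and queries are
placeholders): run it repeatedly with a smaller -Xmx each time until queries
slow down or it dies with an OutOfMemoryError.

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.FSDirectory;

    public class HeapProbe {
        public static void main(String[] args) throws Exception {
            System.out.println("max heap MB: " + Runtime.getRuntime().maxMemory() / (1024 * 1024));
            try (DirectoryReader reader = DirectoryReader.open(
                    FSDirectory.open(Paths.get("/data/test-index")))) {
                IndexSearcher searcher = new IndexSearcher(reader);
                QueryParser parser = new QueryParser("body", new StandardAnalyzer());
                long start = System.nanoTime();
                for (String q : new String[] {"solr", "sharding", "lucene scaling"}) {
                    searcher.search(parser.parse(q), 10);
                }
                System.out.println("elapsed ms: " + (System.nanoTime() - start) / 1_000_000);
            }
        }
    }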

If you get an answer on this matter, please let me know.

Kindly

//Marcus


--
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
[hidden email]
http://www.tailsweep.com/
http://blogg.tailsweep.com/

Re: scaling / sharding questions

Otis Gospodnetic-2
In reply to this post by Jeremy Hinegardner
Hola,

That's a pretty big and open question, but here is some info.

Jeremy's sharding approach sounds OK.  We did something similar at Technorati, where a document/blog timestamp was the main sharding factor.  You can't really move individual docs without reindexing (i.e. delete docX from shard1 and index docX to shard2), unless all your fields are stored, which you will not want to do with the data volumes you are describing.


As for how much can be handled by a single machine, this is a FAQ and we really need to put it on the Lucene/Solr FAQ wiki page if it's not there already.  The answer is that it depends on many factors (size of index, # of concurrent searches, complexity of queries, number of searchers, type of disk, amount of RAM, cache settings, # of CPUs...).

The questions are right, it's just that there is no single non-generic answer.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


Re: scaling / sharding questions

Jeremy Hinegardner
In reply to this post by Marcus Herou
Sorry for not keeping this thread alive; let's see what we can do...

One option I've thought of for 'resharding' would be splitting an index into two
by just copying it, then deleting half the documents from one copy, doing a
commit, and deleting the other half from the other copy and committing.  That is:

  1) Take the original index
  2) Copy it to b1 and b2
  3) Delete docs from b1 that match a particular query A
  4) Delete docs from b2 that do not match query A
  5) Commit b1 and b2

Has anyone tried something like that?
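
As a sketch, steps 3-5 would map onto plain delete-by-query and commit calls
against each copy's update handler.  The query, host names and timestamp field
below are placeholders; b1 and b2 start out as byte-for-byte copies of the
original index.

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class SplitByDelete {
        // POST an XML update message (delete-by-query, commit, ...) to a core.
        static void post(String coreUrl, String xml) throws Exception {
            HttpURLConnection con =
                    (HttpURLConnection) new URL(coreUrl + "/update").openConnection();
            con.setDoOutput(true);
            con.setRequestMethod("POST");
            con.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
            try (OutputStream out = con.getOutputStream()) {
                out.write(xml.getBytes("UTF-8"));
            }
            if (con.getResponseCode() != 200) {
                throw new IllegalStateException("update failed: " + con.getResponseCode());
            }
        }

        public static void main(String[] args) throws Exception {
            String queryA = "timestamp:[* TO 2008-06-01T00:00:00Z]";
            post("http://solr-b1:8983/solr", "<delete><query>" + queryA + "</query></delete>");
            post("http://solr-b2:8983/solr",
                 "<delete><query>*:* AND NOT (" + queryA + ")</query></delete>");
            post("http://solr-b1:8983/solr", "<commit/>");
            post("http://solr-b2:8983/solr", "<commit/>");
            // An <optimize/> afterwards reclaims the space left behind by the deletes.
        }
    }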

As for how to know where each document is stored, generally we're considering
unique_document_id % N.  If we rebalance, we change N and redistribute, but that
would probably take too much time.  That pushes us more towards a staggered,
age-based approach where the most recent docs filter down to "permanent" indexes
over time.
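
A minimal sketch of that unique_document_id % N routing (the shard URLs are
placeholders):

    import java.util.List;

    public class ModuloRouter {
        private final List<String> shards;

        public ModuloRouter(List<String> shards) {
            this.shards = shards;
        }

        // Changing the number of shards (N) changes the result for most ids,
        // which is exactly the redistribution problem mentioned above.
        public String shardFor(long uniqueDocumentId) {
            return shards.get((int) Math.floorMod(uniqueDocumentId, (long) shards.size()));
        }
    }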

Another thought we've had recently is to have many, many physical shards on the
indexing/writer side, but then merge groups of them into logical shards which
are snapshotted to reader Solrs on a frequent basis.  I haven't done any testing
along these lines, but logically it seems like an idea worth pursuing.

enjoy,

-jeremy

--
========================================================================
 Jeremy Hinegardner                              [hidden email]

Re: scaling / sharding questions

Jeremy Hinegardner
In reply to this post by Otis Gospodnetic-2
Hi,

I agree, there is definitely no generic answer.  The best sources I can find
so far relating to performance are:

  http://wiki.apache.org/solr/SolrPerformanceData
  http://wiki.apache.org/solr/SolrPerformanceFactors
  http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
  http://wiki.apache.org/lucene-java/ImproveSearchingSpeed
  http://lucene.apache.org/java/docs/benchmarks.html

Although most of the items discussed on these pages relate directly to speeding
up searching and indexing, the relationship I am looking for is how index size
relates to search and indexing performance, and that particular question doesn't
appear to be answered.

If no one has any information on that front I guess I'll just have to dive in
and figure it out :-).

As for storing the fields, our initial testing shows that we get better
performance overall by storing the data in Solr and returning it with the
results, instead of using the results to look up the original documents
elsewhere.

Is there something I am missing here?

enjoy,

-jeremy

RE: scaling / sharding questions

Lance Norskog-2
In reply to this post by Jeremy Hinegardner
Yes, I've done this split-by-delete several times. The halved index still
uses as much disk space until you optimize it.

As to splitting policy: we use an MD5 signature as our unique ID.  This has
the lovely property that we can wildcard: 'contentid:f*' denotes 1/16 of
the whole index, and that 1/16 is a very random sample of the whole index.  We
use this for several things.  If we use it for shards, we have a query that
matches a shard's contents.

The Solr/Lucene syntax does not support modular arithmetic, so it will not let
you query a subset that matches one of your shards.
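
As a sketch, computing the MD5 id and the prefix query for its 1/16 slice looks
like this; the document key is just an example, and 'contentid' is the unique ID
field from our schema.

    import java.math.BigInteger;
    import java.security.MessageDigest;

    public class Md5Slice {
        // Hex MD5 of the document key, used as the unique 'contentid'.
        static String md5Hex(String key) throws Exception {
            byte[] digest = MessageDigest.getInstance("MD5").digest(key.getBytes("UTF-8"));
            return String.format("%032x", new BigInteger(1, digest));
        }

        public static void main(String[] args) throws Exception {
            String id = md5Hex("http://example.com/some/document");
            char bucket = id.charAt(0);                 // one of 0-f: a 1/16 slice
            String sliceQuery = "contentid:" + bucket + "*";
            System.out.println(id + " -> " + sliceQuery);
        }
    }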

We also found that searching a few smaller indexes via the Solr 1.3
Distributed Search feature is actually faster than searching one large index
(YMMV).  So for us a large pile of shards will be optimal anyway, and we have
no need to "rebalance".

It sounds like you're not storing the data in a backing store, but are
storing all data in the index itself. We have found this "challenging".

Cheers,

Lance Norskog

Re: scaling / sharding questions

Marcus Herou
Hi.

We use MD5 as the uid as well.

I guess each slice is 1/16th because the MD5 is hex, right? (0-f)
Thinking about MD5 sharding:

 1 shard:  0-f
 2 shards: 0-7, 8-f
 3 shards: problem!
 4 shards: 0-3, ...

This technique would require that you double the number of shards each time
you split, right?

Split-by-delete sounds really smart; damn that I didn't think of that :)

Anyway, over time the technique of moving the whole index to a new shard and
then deleting would probably be more than challenging.




I will never store the data in Lucene, mainly because of bad experiences, and
because I want to create modules which are fast, scalable and flexible; storing
the data alongside the index does not fit that, for me at least.

So yes, I will need to do a "foreach id in ids, get document" approach in the
searcher code, but at least I can optimize the retrieval of docs myself and let
Lucene do what it's good at: indexing and searching, not storage.

I am more and more thinking in terms of having different levels of searching
instead of searching all shards at the same time.

Let's say you start with 4 shards where each document is replicated 4 times
based on publish date.  Since all shards have the same data, you can
load-balance the query to any of the 4 shards.

One day you find that 4 shards are not enough because of search performance,
so you add 4 new shards.  Now you index new documents only into these 4 new
shards, making the old ones read-only.

The searcher would then prioritize the new shards, and only if the query
returns fewer than X results would you start querying the old shards.

This has a nice side effect: the most relevant/recent entries live in the
index which is searched the most.  Since the old shards will be mostly idle,
you could also convert two of the old shards into "new" shards, reducing the
need to buy new servers.

What I'm trying to say is that you will end up with an architecture which has
many nodes at the top, each holding few documents, and fewer and fewer nodes
as you go down the stack, with each node storing more documents, since search
speed becomes less and less relevant.

Something like this:

xxxxxxxx - Primary: 10M docs per shard; make sure 95% of the results come
from here.
   yyyy - Standby: 100M docs per shard - merges of 10 primary indices.
     zz - Archive: 1000M docs per shard - merges of 10 standby indices.

Search top-down.  The numbers are just speculative.  The drawback with this
architecture is that you get no indexing benefit at all if the layout drawn
above is the same one you use for indexing.  Personally, I think you should use
X indexers which then merge indices (MapReduce-style) for maximum performance,
and lay them out as described above.
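
A rough sketch of that top-down search; Tier.query() is a stand-in for whatever
actually hits one tier's shards, not a real Solr API, and the threshold is made
up.

    import java.util.ArrayList;
    import java.util.List;

    public class TieredSearch {
        /** Runs a query against one tier's shards and appends its hits to results. */
        interface Tier {
            void query(String q, List<String> results);
        }

        static final int ENOUGH = 100;   // stop as soon as this many hits are collected

        static List<String> search(String q, Tier primary, Tier standby, Tier archive) {
            List<String> results = new ArrayList<>();
            for (Tier tier : new Tier[] {primary, standby, archive}) {
                tier.query(q, results);
                if (results.size() >= ENOUGH) {
                    break;   // ideally ~95% of queries never get past the primary tier
                }
            }
            return results;
        }
    }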

I think Google does something like this.


Kindly

//Marcus

--
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
[hidden email]
http://www.tailsweep.com/
http://blogg.tailsweep.com/

Re: scaling / sharding questions

Otis Gospodnetic-2
In reply to this post by Jeremy Hinegardner
With Lance's MD5 scheme you'd do this:

1 shard: 0-f*
2 shards: 0-7*, 8-f*
3 shards: 0-5*, 6-a*, b-f*
4 shards: 0-3*, 4-7*, 8-b*, c-f*
...
16 shards: 0*, 1*, 2*....... d*, e*, f*
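
Or, in code form: map each leading hex digit (0-f) to one of N shards so the 16
buckets stay as even as possible.  This is just a sketch of the arithmetic; for
N = 3 it reproduces the 0-5*, 6-a*, b-f* split above.

    public class HexRanges {
        public static void main(String[] args) {
            int shards = 3;
            for (int digit = 0; digit < 16; digit++) {
                int shard = digit * shards / 16;             // 0 .. shards-1
                System.out.printf("digit %x -> shard %d%n", digit, shard);
            }
        }
    }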

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


----- Original Message ----

> From: Marcus Herou <[hidden email]>
> To: [hidden email]
> Cc: [hidden email]
> Sent: Saturday, June 14, 2008 5:53:35 AM
> Subject: Re: scaling / sharding questions
>
> Hi.
>
> We as well use md5 as the uid.
>
> I guess by saying each 1/16th is because the md5 is hex, right? (0-f).
> Thinking about md5 sharding.
> 1 shard: 0-f
> 2 shards: 0-7:8-f
> 3 shards: problem!
> 4 shards: 0-3....
>
> This technique would require that you double the amount of shards each time
> you split right ?
>
> Split by delete sounds really smart, damn that I did'nt think of that :)
>
> Anyway over time the technique of moving the whole index to a new shard and
> then delete would probably be more than challenging.
>
>
>
>
> I will never ever store the data in Lucene mainly because of bad exp and
> since I want to create modules which are fast,  scalable and flexible and
> storing the data alongside with the index do not match that for me at least.
>
> So yes I will have the need to do a "foreach id in ids get document"
> approach in the searcher code, but at least I can optimize the retrieval of
> docs myself and let Lucene do what it's good at: indexing and searching not
> storage.
>
> I am more and more thinking in terms of having different levels of searching
> instead of searcing in all shards at the same time.
>
> Let's say you start with 4 shards where you each document is replicated 4
> times based on publishdate. Since all shards have the same data you can lb
> the query to any of the 4 shards.
>
> One day you find that 4 shards is not enough because of search performance
> so you add 4 new shards. Now you only index these 4 new shards with the new
> documents making the old ones readonly.
>
> The searcher would then prioritize the new shards and only if the query
> returns less than X results you start querying the old shards.
>
> This have a nice side effect of having the most relevant/recent entries in
> the index which is searched the most. Since the old shards will be mostly
> idle you can as well convert 2 of the old shards to "new" shards reducing
> the need for buying new servers.
>
> What I'm trying to say is that you will end up with an architecture which
> have many nodes on top which each have few documents and fewer and fewer
> nodes as you go down the architecture but where each node store more
> documents since the search speed get's less and less relevant.
>
> Something like this:
>
> xxxxxxxx - Primary: 10M docs per shard, make sure 95% of the results comes
> from here.
>    yyyy - Standby: 100M docs per shard - merges of 10 primary indices.
>      zz - Archive: 1000M docs per shard - merges of 10 standby indices.
>
> Search top-down.
> The numbers are just speculative. The drawback with this architecture is
> that you get no indexing benefit at all if the architecture drawn above is
> the same as which you use for indexing. I think personally you should use X
> indexers which then merge indices (MapReduce) for max performance and lay
> them out as described above.
>
> I think Google do something like this.
>
>
> Kindly
>
> //Marcus
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> On Sat, Jun 14, 2008 at 2:27 AM, Lance Norskog wrote:
>
> > Yes, I've done this split-by-delete several times. The halved index still
> > uses as much disk space until you optimize it.
> >
> > As to splitting policy: we use an MD5 signature as our unique ID. This has
> > the lovely property that we can wildcard.  'contentid:f*' denotes 1/16 of
> > the whole index. This 1/16 is a very random sample of the whole index. We
> > use this for several things. If we use this for shards, we have a query
> > that
> > matches a shard's contents.
> >
> > The Solr/Lucene syntax does not support modular arithmetic,and so it will
> > not let you query a subset that matches one of your shards.
> >
> > We also found that searching a few smaller indexes via the Solr 1.3
> > Distributed Search feature is actually faster than searching one large
> > index, YMMV. So for us, a large pile of shards will be optimal anyway, so
> > we
> > have to need "rebalance".
> >
> > It sounds like you're not storing the data in a backing store, but are
> > storing all data in the index itself. We have found this "challenging".
> >
> > Cheers,
> >
> > Lance Norskog
>
>
> --
> Marcus Herou CTO and co-founder Tailsweep AB
> +46702561312
> [hidden email]
> http://www.tailsweep.com/
> http://blogg.tailsweep.com/

Reply | Threaded
Open this post in threaded view
|

Re: scaling / sharding questions

Marcus Herou
Yep got that.

Thanks.

/M

On Sun, Jun 15, 2008 at 8:42 PM, Otis Gospodnetic <[hidden email]> wrote:

> With Lance's MD5 schema you'd do this:
>
> 1 shard: 0-f*
> 2 shards: 0-8*, 9-f*
> 3 shards: 0-5*, 6-a*, b-f*
> 4 shards: 0-3*, 4-7*, 8-b*, c-f*
> ...
> 16 shards: 0*, 1*, 2*....... d*, e*, f*
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

--
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
[hidden email]
http://www.tailsweep.com/
http://blogg.tailsweep.com/
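
As an aside, here is a minimal Java sketch of the prefix-range routing Otis
lays out above, assuming the unique ID is the lowercase hex MD5 of some source
key. The class and method names are invented for illustration; none of this is
Solr API. The nice property, as Lance noted earlier in the thread, is that each
shard's contents can then also be described by a simple wildcard query on the
ID field.

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5RangeRouter {

    // Lowercase hex MD5 of an arbitrary source key (e.g. a document URL).
    static String md5Hex(String key) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("MD5").digest(key.getBytes());
        StringBuilder sb = new StringBuilder(32);
        for (byte b : digest) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    // Route by the first two hex digits: buckets 0x00..0xff are cut into
    // numShards contiguous ranges, so 3 shards cover roughly 0-5*, 6-a*, b-f*.
    static int shardFor(String hexId, int numShards) {
        int bucket = Integer.parseInt(hexId.substring(0, 2), 16);
        return bucket * numShards / 256;
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        for (String key : new String[] {"doc-1", "doc-2", "doc-3"}) {
            String id = md5Hex(key);
            System.out.println(key + " -> " + id + " -> shard " + shardFor(id, 3));
        }
    }
}
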
Reply | Threaded
Open this post in threaded view
|

RE: scaling / sharding questions

Lance Norskog
I cannot facet on one huge index; it runs out of RAM when it attempts to
allocate a giant array. If I store several shards in one JVM, there is
no problem.

Are there any performance benefits to one large index vs. several small
indexes?

Lance

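One plausible reading of that "giant array", offered only as a hedged
back-of-envelope: field-cache-style faceting needs roughly one ordinal entry
per document per faceted field, so the allocation grows with maxDoc no matter
how few documents actually match. The 4-bytes-per-document figure and the
field count below are assumptions, not measured Solr numbers.

public class FacetMemoryEstimate {
    public static void main(String[] args) {
        int facetedFields = 3;                 // assumed number of facet fields
        long bytesPerDocPerField = 4L;         // assumed: one int ordinal per doc
        long[] indexSizes = {10000000L, 100000000L, 1000000000L};
        for (long maxDoc : indexSizes) {
            double gb = maxDoc * bytesPerDocPerField * facetedFields
                    / (1024.0 * 1024.0 * 1024.0);
            System.out.printf("%,13d docs x %d fields -> ~%.1f GB of ordinal arrays%n",
                    maxDoc, facetedFields, gb);
        }
    }
}

At 10M documents per shard the arrays stay small; at the billion-document
scale discussed in this thread they alone exceed ten gigabytes, which is
consistent with the single huge index being the one that falls over.
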
Reply | Threaded
Open this post in threaded view
|

Re: scaling / sharding questions

Phillip Farber
In reply to this post by Lance Norskog-2
This may be slightly off topic, for which I apologize, but it is related to
the question of searching several indexes, as Lance describes below, quoting:

  "We also found that searching a few smaller indexes via the Solr 1.3
Distributed Search feature is actually faster than searching one large
index, YMMV."

The wiki page describing distributed search lists several limitations. Two of
them in particular set me wondering, mainly about their impact on scoring:

1) No distributed idf

Does this mean that the Lucene scoring algorithm is computed without the
idf factor, i.e. we just get term frequency scoring?

2) Doesn't support consistency between stages, e.g. a shard index can be
changed between STAGE_EXECUTE_QUERY and STAGE_GET_FIELDS

What does this mean or where can I find out what it means?

Thanks!

Phil




Lance Norskog wrote:

> Yes, I've done this split-by-delete several times. The halved index still
> uses as much disk space until you optimize it.
>
> As to splitting policy: we use an MD5 signature as our unique ID. This has
> the lovely property that we can wildcard.  'contentid:f*' denotes 1/16 of
> the whole index. This 1/16 is a very random sample of the whole index. We
> use this for several things. If we use this for shards, we have a query that
> matches a shard's contents.
>
> The Solr/Lucene syntax does not support modular arithmetic,and so it will
> not let you query a subset that matches one of your shards.
>
> We also found that searching a few smaller indexes via the Solr 1.3
> Distributed Search feature is actually faster than searching one large
> index, YMMV. So for us, a large pile of shards will be optimal anyway, so we
> have to need "rebalance".
>
> It sounds like you're not storing the data in a backing store, but are
> storing all data in the index itself. We have found this "challenging".
>
> Cheers,
>
> Lance Norskog
>
> -----Original Message-----
> From: Jeremy Hinegardner [mailto:[hidden email]]
> Sent: Friday, June 13, 2008 3:36 PM
> To: [hidden email]
> Subject: Re: scaling / sharding questions
>
> Sorry for not keeping this thread alive, lets see what we can do...
>
> One option I've thought of for 'resharding' would splitting an index into
> two by just copying it, the deleting 1/2 the documents from one, doing a
> commit, and delete the other 1/2 from the other index and commit.  That is:
>
>   1) Take original index
>   2) copy to b1 and b2
>   3) delete docs from b1 that match a particular query A
>   4) delete docs from b2 that do not match a particular query A
>   5) commit b1 and b2
>
> Has anyone tried something like that?
>
> As for how to know where each document is stored, generally we're
> considering unique_document_id % N.  If we rebalance we change N and
> redistribute, but that
> probably will take too much time.    That makes us move more towards a
> staggered
> age based approach where the most recent docs filter down to "permanent"
> indexes based upon time.
>
> Another thought we've had recently is to have many many many physical
> shards, on the indexing writer side, but then merge groups of them into
> logical shards which are snapshotted to reader solrs' on a frequent basis.
> I haven't done any testing along these lines, but logically it seems like an
> idea worth pursuing.
>
> enjoy,
>
> -jeremy
>
> On Fri, Jun 06, 2008 at 03:14:10PM +0200, Marcus Herou wrote:
>> Cool sharding technique.
>>
>> We as well are thinking of howto "move" docs from one index to another
>> because we need to re-balance the docs when we add new nodes to the
> cluster.
>> We do only store id's in the index otherwise we could have moved stuff
>> around with IndexReader.document(x) or so. Luke
>> (http://www.getopt.org/luke/) is able to reconstruct the indexed Document
> data so it should be doable.
>> However I'm thinking of actually just delete the docs from the old
>> index and add new Documents to the new node. It would be cool to not
>> waste cpu cycles by reindexing already indexed stuff but...
>>
>> And we as well will have data amounts in the range you are talking
>> about. We perhaps could share ideas ?
>>
>> How do you plan to store where each document is located ? I mean you
>> probably need to store info about the Document and it's location
>> somewhere perhaps in a clustered DB ? We will probably go for HBase for
> this.
>> I think the number of documents is less important than the actual data
>> size (just speculating). We currently search 10M (will get much much
>> larger) indexed blog entries on one machine where the JVM has 1G heap,
>> the index size is 3G and response times are still quite fast. This is
>> a readonly node though and is updated every morning with a freshly
>> optimized index. Someone told me that you probably need twice the RAM
>> if you plan to both index and search at the same time. If I were you I
>> would just test to index X entries of your data and then start to
>> search in the index with lower JVM settings each round and when
>> response times get too slow or you hit OOE then you get a rough estimate
> of the bare minimum X RAM needed for Y entries.
>> I think we will do with something like 2G per 50M docs but I will need
>> to test it out.
>>
>> If you get an answer in this matter please let me know.
>>
>> Kindly
>>
>> //Marcus
>>
>>
>> On Fri, Jun 6, 2008 at 7:21 AM, Jeremy Hinegardner
>> <[hidden email]>
>> wrote:
>>
>>> Hi all,
>>>
>>> This may be a bit rambling, but let see how it goes.  I'm not a
>>> Lucene or Solr guru by any means, I have been prototyping with solr
>>> and understanding how all the pieces and parts fit together.
>>>
>>> We are migrating our current document storage infrastructure to a
>>> decent sized solr cluster, using 1.3-snapshots right now.  
>>> Eventually this will be in the
>>> billion+ documents, with about 1M new documents added per day.
>>>
>>> Our main sticking point right now is that a significant number of
>>> our documents will be updated, at least once, but possibly more than
>>> once.  The volatility of a document decreases over time.
>>>
>>> With this in mind, we've been considering using a cascading series
>>> of shard clusters.  That is :
>>>
>>>  1) a cluster of shards holding recent data ( most recent week or
>>> two ) smaller
>>>    indexes that take a small amount of time to commit updates and
> optimise,
>>>    since this will hold the most volatile documents.
>>>
>>>  2) Following that another cluster of shards that holds some
>>> relatively recent
>>>    ( 3-6 months ? ), but not super volatile, documents, these are
>>> items that
>>>    could potentially receive updates, but generally not.
>>>
>>>  3) A final set of 'archive' shards holding the final resting place for
>>>    documents.  These would not receive updates.  These would be online
> for
>>>    searching and analysis "forever".
>>>
>>> We are not sure if this is the best way to go, but it is the
>>> approach we are leaning toward right now.  I would like some
>>> feedback from the folks here if you think that is a reasonable
>>> approach.
>>>
>>> One of the other things I'm wondering about is how to manipulate
>>> indexes We'll need to roll documents around between indexes over
>>> time, or at least migrate indexes from one set of shards to another as
> the documents 'age'
>>> and
>>> merge/aggregate them with more 'stable' indexes.   I know about merging
>>> complete
>>> indexes together, but what about migrating a subset of documents
>>> from one index into another index?
>>>
>>> In addition, what is generally considered a 'manageable' index of
>>> large size?  I was attempting to find some information on the
>>> relationship between search response times, the amount of memory for
>>> used for a search, and the number of documents in an index, but I
>>> wasn't having much luck.
>>>
>>> I'm not sure if I'm making sense here, but just thought I would
>>> throw this out there and see what people think.  Ther eis the
>>> distinct possibility that I am not asking the right questions or
>>> considering the right parameters, so feel free to correct me, or ask
>>> questions as you see fit.
>>>
>>> And yes, I will report how we are doing things when we get this all
>>> figured out, and if there are items that we can contribute back to
>>> Solr we will.  If nothing else there will be a nice article of how
>>> we manage TB of data with Solr.
>>>
>>> enjoy,
>>>
>>> -jeremy
>>>
>>> --
>>> ========================================================================
>>>  Jeremy Hinegardner                              [hidden email]
>>>
>>>
>>
>> --
>> Marcus Herou CTO and co-founder Tailsweep AB
>> +46702561312
>> [hidden email]
>> http://www.tailsweep.com/
>> http://blogg.tailsweep.com/
>
> --
> ========================================================================
>  Jeremy Hinegardner                              [hidden email]
>
>
Reply | Threaded
Open this post in threaded view
|

Re: scaling / sharding questions

Yonik Seeley-2
On Wed, Jun 18, 2008 at 5:53 PM, Phillip Farber <[hidden email]> wrote:
> Does this mean that the Lucene scoring algorithm is computed without the idf
> factor, i.e. we just get term frequency scoring?

No, it means that the idf calculation is done locally on a single shard.
With a big index that is randomly mixed, this should not have a
practical impact.
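
To put rough numbers on that, here is a small sketch using the classic Lucene
DefaultSimilarity idf formula, idf = 1 + ln(numDocs / (docFreq + 1)). When
documents are spread randomly across shards, each shard's local docFreq to
numDocs ratio tracks the global ratio, so the local idf lands very close to
the global value. The shard counts here are invented for illustration.

public class LocalVsGlobalIdf {

    // DefaultSimilarity-style idf: 1 + ln(numDocs / (docFreq + 1)).
    static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    public static void main(String[] args) {
        // Invented example: 100M docs split evenly over 4 shards, and a term
        // that occurs in about 2M of them, spread roughly evenly as well.
        System.out.printf("global idf: %.4f%n", idf(2000000L, 100000000L));

        long[] shardNumDocs = {25000000L, 25000000L, 25000000L, 25000000L};
        long[] shardDocFreq = {  510000L,   492000L,   501000L,   497000L};
        for (int i = 0; i < shardNumDocs.length; i++) {
            System.out.printf("shard %d local idf: %.4f%n",
                    i, idf(shardDocFreq[i], shardNumDocs[i]));
        }
    }
}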

> 2) Doesn't support consistency between stages, e.g. a shard index can be
> changed between STAGE_EXECUTE_QUERY and STAGE_GET_FIELDS
>
> What does this mean or where can I find out what it means?

STAGE_EXECUTE_QUERY finds the ids of matching documents.
STAGE_GET_FIELDS retrieves the fields of matching documents.

A change to a document could possibly happen in between, and one would
end up retrieving a document that no longer matched the query.  In
practice, this is rarely an issue.

-Yonik
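
For reference, a hedged SolrJ sketch of what a Solr 1.3 distributed request
looks like from the client side. The shards parameter fans the query out, and
the coordinating node internally runs the two stages described above: gather
matching ids and scores from each shard, merge them, then fetch stored fields
for the winners. The hostnames are invented, and the exact parameter
conventions are on the DistributedSearch wiki page.

import java.net.MalformedURLException;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ShardedQueryExample {
    public static void main(String[] args)
            throws MalformedURLException, SolrServerException {
        // Any node can act as coordinator for a distributed request.
        CommonsHttpSolrServer solr =
                new CommonsHttpSolrServer("http://search1.example.com:8983/solr");

        SolrQuery q = new SolrQuery("title:lucene");
        // Comma-separated shard list (host:port/path, no http:// prefix).
        q.set("shards",
                "search1.example.com:8983/solr,search2.example.com:8983/solr");
        q.setRows(10);

        QueryResponse rsp = solr.query(q);
        System.out.println("hits: " + rsp.getResults().getNumFound());
    }
}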