High add/delete rate and index fragmentation

High add/delete rate and index fragmentation

Rodrigo De Castro
We are considering Solr to store events that will be added to and deleted from
the index at a very fast rate. Solr would be used, in this case, to find the
right event to process (events may have several attributes, and we may search
for the best match based on the query attributes). Our understanding is that
the common use cases are those where the read rate is much higher than the
write rate and deletes are infrequent, so we are not sure Solr handles our use
case well or whether it is the right fit. Given that, I have a few questions:

1 - How does Solr/Lucene degrade as the index fragments? That would probably
determine the rate at which we would need to optimize the index. I presume it
depends on the rate of insertions and deletions, but do you have any
benchmarks on this degradation? Or, in general, what has your experience been
with this use case?

2 - Optimizing seems to be a very expensive process. While the index is being
optimized, how much does search performance degrade? A large degradation would
prevent us from optimizing unless we switched to another copy of the index
while the optimize runs.

3 - In terms of high availability, what has your experience been with
detecting failure of the master and having a slave take over?

Thanks,
Rodrigo

Re: High add/delete rate and index fragmentation

Jason Rutherglen
Rodrigo,

It sounds like you're asking about near-realtime search support, but
I'm not sure. So here are a few ideas.

#1 How often do you need to be able to search on the latest
updates (as opposed to updates from, let's say, 10 minutes ago)?

On topic #2, Solr provides master/slave replication. The
optimize would happen on the master and the new index files
would be replicated to the slave(s).
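
The optimize itself is just another update message; a minimal sketch,
posting to the master's stock /update handler (host and port are
placeholders):

    <optimize waitFlush="true" waitSearcher="false"/>

sent to http://master:8983/solr/update. With waitSearcher="false" the
call returns without waiting for a new searcher to open, which matters
less on a master that serves no queries anyway.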

#3 is a mixed bag at this point, and there is no official
solution yet. Shell scripts and a load balancer could sort of
work. Check out SOLR-1277 or SOLR-1395 for progress along these
lines.

Jason

Re: High add/delete rate and index fragmentation

Lance Norskog
#1: Yes, compared to relational DBs, Solr/Lucene in general are biased
towards slow indexing and fast queries. It automatically merges
segments and keeps fragmentation down. The rate of merging can be
controlled.
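
For example, in a stock 1.4 solrconfig.xml the knobs live in the index
section; a minimal sketch (these are roughly the defaults, not a
recommendation):

    <mainIndex>
      <!-- lower mergeFactor = fewer segments and faster searches,
           but more merge work at indexing time -->
      <mergeFactor>10</mergeFactor>
      <!-- buffer this much in RAM before writing a new segment -->
      <ramBufferSizeMB>32</ramBufferSizeMB>
    </mainIndex>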

#2: The standard architecture has a master that only does indexing
and one or more slaves that only handle queries. The slaves poll the
master for index updates regularly. Solr 1.4 has a built-in Java-based
replication system for this.
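
A sketch of that 1.4 replication config (URLs are placeholders). On the
master:

    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="master">
        <str name="replicateAfter">commit</str>
        <str name="replicateAfter">optimize</str>
      </lst>
    </requestHandler>

and on each slave:

    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="slave">
        <str name="masterUrl">http://master:8983/solr/replication</str>
        <str name="pollInterval">00:00:60</str>
      </lst>
    </requestHandler>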

#3: The standard architecture puts the query servers behind a load
balancer. It's the load balancer's job to watch for query servers
coming on and off line.
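
The usual hook for that is Solr's ping handler; if memory serves, the
stock solrconfig.xml ships with a healthcheck file you can create or
remove to take a server out of rotation, roughly:

    <admin>
      <defaultQuery>solr</defaultQuery>
      <!-- /admin/ping returns an error while this file is absent,
           so the load balancer stops routing queries here -->
      <healthcheck type="file">server-enabled</healthcheck>
    </admin>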

An alternate architecture has multiple servers that each do both indexing
and queries on the same index. This provides the shortest "pipeline"
time from receiving the data to making it available for search.
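
How short that pipeline is mostly comes down to commit frequency, e.g.
via autoCommit in solrconfig.xml (numbers are illustrative only):

    <updateHandler class="solr.DirectUpdateHandler2">
      <autoCommit>
        <maxDocs>10000</maxDocs> <!-- commit after this many documents -->
        <maxTime>1000</maxTime>  <!-- or after this many milliseconds -->
      </autoCommit>
    </updateHandler>

Each commit opens a new searcher, so very aggressive settings trade
cache warmth for freshness.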

--
Lance Norskog

Re: High add/delete rate and index fragmentation

Rodrigo De Castro
On Thu, Dec 3, 2009 at 3:59 PM, Lance Norskog <[hidden email]> wrote:

> #2: The standard architecture has a master that only does indexing
> and one or more slaves that only handle queries. The slaves poll the
> master for index updates regularly. Solr 1.4 has a built-in Java-based
> replication system for this.
>


How do you achieve durability with the standard architecture? For one of our
use cases (which does not have much churn), we are considering this
architecture, but I don't want an update to be lost if the master goes down
before the slaves update. What I was thinking initially is that this could be
achieved by having a master per datacenter, each synchronously updating the
other masters through a RequestHandler. That would guarantee durability, but
of course this architecture would have issues of its own, like handling
masters that are no longer in sync after a network partition. Is there some
work being done to address this use case?



> An alternate architecture has multiple servers that each do both indexing
> and queries on the same index. This provides the shortest "pipeline"
> time from receiving the data to making it available for search.
>


For our use case with a high add/delete rate, I was thinking of using this
architecture, as I noticed that records become available right away. But in
this case we have the concern about how well it performs while
adding/deleting. I did an initial test adding many thousands of documents and
did not see any degradation, which is why I asked about performance when
deleting records (since a delete only marks documents for deletion, and we
have some control over the automatic segment merging, I guess this is not
much of a problem).
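
(If deletes did pile up, my understanding is that 1.4 can force them out at
commit time without a full optimize; the field name here is made up:

    <delete><query>processed:true</query></delete>
    <commit expungeDeletes="true"/>

posted to /update.)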

Rodrigo



Re: High add/delete rate and index fragmentation

Rodrigo De Castro
On Wed, Dec 2, 2009 at 2:43 PM, Jason Rutherglen <[hidden email]> wrote:

> It sounds like you're asking about near-realtime search support, but
> I'm not sure. So here are a few ideas.
>
> #1 How often do you need to be able to search on the latest
> updates (as opposed to updates from, let's say, 10 minutes ago)?
>

You are right that we would need near-realtime support. The problem is not
so much new records becoming available as guaranteeing that deleted records
will not be returned. For this reason, our plan would be to update and search
a master index, provided that: (1) searching while updating records is OK,
(2) performance is not degraded substantially by fragmentation,
(3) optimization does not impact search, and (4) we ensure durability - if a
node goes down, the update has been replicated to another node that can take
over. It seems that 1 and 2 are not much of a problem, and 3 would need to be
tested. I would like to know more about how 4 has been addressed, so we don't
lose updates if a master goes down between an update and index replication.


> #3 is a mixed bag at this point, and there is no official
> solution yet. Shell scripts and a load balancer could sort of
> work. Check out SOLR-1277 or SOLR-1395 for progress along these
> lines.
>

Thanks for the links.

Rodrigo



Re: High add/delete rate and index fragmentation

Otis Gospodnetic
Hello,

> You are right that we would need near-realtime support. The problem is
> not so much new records becoming available as guaranteeing that deleted
> records will not be returned. For this reason, our plan would be to
> update and search a master index, provided that: (1) searching while
> updating records is OK,

It is in general, though I haven't fully tested NRT under high load.

> (2) performance is not degraded substantially by fragmentation,

You can control that somewhat via mergeFactor.

> (3) optimization does not impact search,

It will - disk IO, OS cache, and such will be affected, and that will affect search.

> and (4) we ensure durability - if a node goes down, the update has been
> replicated to another node that can take over.

Maybe just index to more than one master? For example, another (non-search) tool I'm using, Voldemort, has the notion of "required writes": how many copies of the data must be written at insert/add time.
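
In SolrJ terms, a rough sketch of that idea (the class name and hostnames are
made up, and the error handling is deliberately naive):

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    import java.io.IOException;
    import java.net.MalformedURLException;

    public class DualMasterWriter {
      private final SolrServer[] masters;

      public DualMasterWriter(String... masterUrls) throws MalformedURLException {
        masters = new SolrServer[masterUrls.length];
        for (int i = 0; i < masterUrls.length; i++) {
          masters[i] = new CommonsHttpSolrServer(masterUrls[i]);
        }
      }

      // "Required writes" = all masters: the add only succeeds if every
      // master acknowledges it; any failure propagates to the caller,
      // which can retry or queue the document.
      public void add(SolrInputDocument doc) throws SolrServerException, IOException {
        for (SolrServer master : masters) {
          master.add(doc);
        }
      }
    }

Used as new DualMasterWriter("http://master1:8983/solr",
"http://master2:8983/solr"). Requiring all masters is the strictest setting;
a Voldemort-style "required writes" below N would mean tolerating a failed
add on one master and repairing it later.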

> It seems that 1 and 2 are not much of a problem, and 3 would need to be
> tested. I would like to know more about how 4 has been addressed, so we
> don't lose updates if a master goes down between an update and index
> replication.

Lucene buffers documents while indexing, to avoid constant disk writes. The HDD itself does some of that, too. So I think you can always lose some data if whatever is in the buffers doesn't get flushed when somebody trips over the power cord in the data center.

Otis
