How large is your solr index?


How large is your solr index?

Bram Van Dam
Hi folks,

I'm trying to get a feel of how large Solr can grow without slowing down
too much. We're looking into a use-case with up to 100 billion documents
(SolrCloud), and we're a little afraid that we'll end up requiring 100
servers to pull it off.

The largest index we currently have is ~2 billion documents in a single
Solr instance. Documents are smallish (5k each) and we have ~50 fields
in the schema, with an index size of about 2TB. Performance is mostly
OK. Cold searchers take a while, but most queries are alright after
warming up. I wish I could provide more statistics, but I only have very
limited access to the data (...banks...).

I'd be very grateful to anyone sharing statistics, especially on the larger
end of the spectrum -- with or without SolrCloud.

Thanks,

  - Bram

Re: How large is your solr index?

Erick Erickson
When you say 2B docs on a single Solr instance, are you talking about only
one shard? Because if you are, you're very close to the absolute upper
limit of a shard: internally the doc id is an int, so the ceiling is
2^31 - 1 documents. One more than that will cause all sorts of problems.

But yeah, your 100B documents are going to use up a lot of servers...

Best,
Erick
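
A quick back-of-envelope on that limit, as a sketch using only numbers
already in this thread:

```java
public class ShardMath {
    public static void main(String[] args) {
        // Lucene doc ids are signed Java ints, so a single shard tops out
        // just under 2^31 documents -- and deletes count against that too.
        long perShardLimit = Integer.MAX_VALUE;  // 2,147,483,647
        long target = 100_000_000_000L;          // Bram's 100B documents

        // Ceiling division: the bare minimum number of shards.
        long bareMinimum = (target + perShardLimit - 1) / perShardLimit;
        System.out.println("absolute minimum shards: " + bareMinimum);  // 47

        // Nobody would run shards that full; at a more comfortable
        // 500M docs per shard the picture is quite different.
        System.out.println("at 500M docs/shard: " + (target / 500_000_000L));  // 200
    }
}
```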


Re: How large is your solr index?

ralph tice
Like all things it really depends on your use case.  We have >160B
documents in our largest SolrCloud and doing a *:* to get that count takes
~13-14 seconds.  Doing a text:happy query only takes ~3.5-3.6 seconds cold,
subsequent queries for the same terms take <500ms. We have a little over
3TB of RAM in the cluster, which is around 1/10th of the index size on
disk; the disks are fast SSDs (rated 300K IOPS per machine). More
importantly, we are using 12-13 large machines rather than dozens or
hundreds of small machines, and
if your use case is primarily full text search you probably could get away
with even fewer machines depending on query patterns.  We run several JVMs
per machine and many shards per JVM, but are careful to order shards so
that queries get dispersed across multiple JVMs across multiple machines
wherever possible.
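
Ralph doesn't spell out the shard-ordering trick, but a hypothetical
sketch of the idea, with invented host names, could look like this: pick
one replica per shard while rotating across hosts, so a single
distributed query fans out over as many JVMs as possible.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.apache.solr.client.solrj.SolrQuery;

public class ShardSpreader {
    // Given the replicas available for each shard, pick one per shard
    // while rotating the choice, so consecutive shards land on
    // different hosts/JVMs.
    static String spread(Map<String, List<String>> replicasByShard) {
        List<String> picked = new ArrayList<String>();
        int i = 0;
        for (List<String> replicas : replicasByShard.values()) {
            picked.add(replicas.get(i++ % replicas.size()));
        }
        StringBuilder sb = new StringBuilder();
        for (String s : picked) {
            if (sb.length() > 0) sb.append(',');
            sb.append(s);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, List<String>> cluster = new LinkedHashMap<String, List<String>>();
        cluster.put("shard1", Arrays.asList("hostA:8983/solr/c1_s1", "hostB:8983/solr/c1_s1"));
        cluster.put("shard2", Arrays.asList("hostA:8984/solr/c1_s2", "hostB:8984/solr/c1_s2"));

        SolrQuery q = new SolrQuery("text:happy");
        q.set("shards", spread(cluster));  // explicit routing for this query
        System.out.println(q);
    }
}
```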

Facets over high cardinality fields are going to be painful.  We currently
programmatically limit the range to around 1/12th or 1/13th of the data set
for facet queries, but plan on evaluating Heliosearch (initial results
didn't look promising) and Toke's sparse faceting patch (SOLR-5894) to help
out there.
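
A rough illustration of that range-limiting idea with plain Solr
faceting via SolrJ (a sketch only; the timestamp_dt and user_id field
names are invented, and this shows the general technique rather than
Ralph's actual queries):

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class SlicedFacet {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        SolrQuery q = new SolrQuery("*:*");
        // Restrict the query to one slice of the corpus before touching
        // the high-cardinality field.
        q.addFilterQuery("timestamp_dt:[NOW/DAY-30DAYS TO NOW/DAY]");
        q.setRows(0);                 // facet counts only, no documents
        q.setFacet(true);
        q.addFacetField("user_id");   // the expensive, high-cardinality field
        q.setFacetLimit(100);
        q.setFacetMinCount(1);

        QueryResponse rsp = solr.query(q);
        System.out.println(rsp.getFacetField("user_id").getValues());
    }
}
```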

If any given JVM goes OOM that also becomes a rough time operationally.  If
your indexing rate spikes past what your sharding strategy can handle, that
sucks too.

There could be more support / ease-of-use enhancements for moving shards
across SolrClouds, moving shards across physical nodes within a
SolrCloud, and snapshot/restore of a SolrCloud, but there has also been a
lot of recent work in these areas that is starting to provide the
underlying infrastructure for more advanced shard management.

I think more people are getting into the space of >100B documents, but
I only ran into a handful during my time at Lucene/Solr Revolution this
November. The majority of large-scale SolrCloud users seem to have many
collections (a collection per logical user) rather than many documents
in one or a few collections.

Regards,
--Ralph


Re: How large is your solr index?

Jack Krupansky-3
In reply to this post by Erick Erickson
And that Lucene index document limit includes deleted and updated
documents, so even if your actual document count stays under 2^31-1,
deleting and updating documents can push the apparent document count over
the limit unless you very aggressively merge segments to expunge deleted
documents.

-- Jack Krupansky
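
One way to keep an eye on that headroom from SolrJ is the Luke request
handler, which reports both counts. A minimal sketch (it assumes
SolrJ's LukeResponse getters, and uses a full optimize as the blunt
tool for expunging deletes):

```java
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.LukeRequest;
import org.apache.solr.client.solrj.response.LukeResponse;

public class DeletedDocHeadroom {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        LukeRequest luke = new LukeRequest();
        luke.setNumTerms(0);                  // skip the per-field term stats
        LukeResponse rsp = luke.process(solr);

        int numDocs = rsp.getNumDocs();       // live documents
        int maxDoc = rsp.getMaxDoc();         // live + not-yet-merged deletes
        System.out.println("deleted but not expunged: " + (maxDoc - numDocs));

        // It is maxDoc, not numDocs, that is bounded by 2^31-1. A full
        // optimize (forced merge) rewrites segments and drops deleted
        // documents -- expensive on a big index, so use sparingly.
        if (maxDoc - numDocs > maxDoc / 10) {
            solr.optimize();
        }
    }
}
```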


RE: How large is your solr index?

Toke Eskildsen
In reply to this post by Bram Van Dam
Bram Van Dam [[hidden email]] wrote:
> I'm trying to get a feel of how large Solr can grow without slowing down
> too much. We're looking into a use-case with up to 100 billion documents
> (SolrCloud), and we're a little afraid that we'll end up requiring 100
> servers to pull it off.

One recurring theme on this list is that it is very hard to compare indexes. Even if the data structure happens to be the same, performance will vary drastically depending on the types of queries and the processing requested. That being said, I acknowledge that it helps to have stories to get a feel for what can be done.

A second caveat is that I find it an exercise in futility to talk about scale without an idea of expected response times as well as the expected number of concurrent users. If you are just doing some nightly batch processing, you could probably run your (scaling up from your description) 100TB index off spinning drives on a couple of boxes. If you expect to be hammered with millions of requests per day, you would have to put a zero or two behind that number.

End of sermon.

At Lucene/Solr Revolution 2014, Grant Ingersoll also asked for user stories and pointed to https://wiki.apache.org/solr/SolrUseCases - sadly it has not caught on. The only entry is for our (State and University Library, Denmark) setup with 21TB / 7 billion documents on a single machine. To follow my own advice, I can elaborate that we have 1-3 concurrent users and a design goal of median response times below 2 seconds for faceted search. I guess that is at the larger end of the spectrum for pure size, but at the very low end for usage.

- Toke Eskildsen

Re: How large is your solr index?

Shawn Heisey-2
On 12/29/2014 2:30 PM, Toke Eskildsen wrote:
> At Lucene/Solr Revolution 2014, Grant Ingersoll also asked for user stories and pointed to https://wiki.apache.org/solr/SolrUseCases - sadly it has not caught on. The only entry is for our (State and University Library, Denmark) setup with 21TB / 7 billion documents on a single machine. To follow my own advice, I can elaborate that we have 1-3 concurrent users and a design goal of median response times below 2 seconds for faceted search. I guess that is at the larger end at the spectrum for pure size, but at the very low end for usage.

Off-Topic tangent:

I believe it would be useful to organize a session at Lucene Revolution,
possibly more interactive than a straight presentation, where users with
very large indexes are encouraged to attend.  The point of this session
would be to exchange war stories, configuration requirements, hardware
requirements, and observations.

Bringing people with similar goals together to discuss their solutions
should be beneficial.  The discussions could pinpoint areas where Solr
and SolrCloud are weak on scalability, and hopefully lead to issues in
Jira and fixes for those problems.  Better documentation for extreme
scaling is also a possible outcome.

Another idea, not sure if it would be good as an alternate idea or
supplemental, is a less formal gathering, perhaps over a meal or three.

My index is hardly large enough to mention, but I would be interested in
attending such a gathering to learn more about the topic.

Thanks,
Shawn


Re: How large is your solr index?

Alexandre Rafalovitch
On 29 December 2014 at 21:42, Shawn Heisey <[hidden email]> wrote:
> I believe it would be useful to organize a session at Lucene Revolution,
> possibly more interactive than a straight presentation, where users with
> very large indexes are encouraged to attend.  The point of this session
> would be to exchange war stories, configuration requirements, hardware
> requirements, and observations.

+1

And have a scribe take notes, so we can follow up with people later :-) And
interview them separately for a Solr podcast too.

Regards,
   Alex.
----
Sign up for my Solr resources newsletter at http://www.solr-start.com/

Re: How large is your solr index?

Bram Van Dam
In reply to this post by ralph tice
On 12/29/2014 08:08 PM, ralph tice wrote:
> Like all things it really depends on your use case.  We have >160B
> documents in our largest SolrCloud and doing a *:* to get that count takes
> ~13-14 seconds.  Doing a text:happy query only takes ~3.5-3.6 seconds cold,
> subsequent queries for the same terms take <500ms.

That seems perfectly reasonable.

> Facets over high cardinality fields are going to be painful.  We currently
> programmatically limit the range to around 1/12th or 1/13th of the data set
> for facet queries, but plan on evaluating Heliosearch (initial results
> didn't look promising) and Toke's sparse faceting patch (SOLR-5894) to help
> out there.

We had a look at Heliosearch a while ago and found it unsuitable. Seems
like they're trying to make use of some native x86_64 code and HotSpot
JVM specific features which we can't use. Some of our clients use IBM's
JVM so we're pretty much limited to strictly Java.

> There could be more support / ease of use enhancements for moving shards
> across SolrClouds, moving shards across physically nodes within a
> SolrCloud, and snapshot/restore of a SolrCloud, but there has also been a
> lot of recent work in these areas that are starting to provide the
> underlying infrastructure for more advanced shard management.

That's reassuring to hear. If we run into these issues we can probably
donate some time to work on them, so I'm not too worried about that.

> I think there are more people getting into the space of >100B documents but
> I only ran into or discovered a handful during my time at Lucene/Solr
> Revolution this November.  The majority of large scale SolrCloud users seem
> to have many collections (collections per logical user) rather than many
> documents in one/few collections.

That's my understanding as well. Lucene Revolution is on the wrong side
of the Atlantic for me. But there's an Open Source Search devroom at
FOSDEM this year, which seems like a sensible place to discuss these
things. I'll make a post on the relevant mailing lists about this after
the holidays if anyone is interested.

Thanks for your detailed response!

  - Bram

Re: How large is your solr index?

Bram Van Dam
In reply to this post by Jack Krupansky-3
On 12/29/2014 09:53 PM, Jack Krupansky wrote:

> And that Lucene index document limit includes deleted and updated
> documents, so even if your actual document count stays under 2^31-1,
> deleting and updating documents can push the apparent document count over
> the limit unless you very aggressively merge segments to expunge deleted
> documents.
> On Mon, Dec 29, 2014 at 12:54 PM, Erick Erickson <[hidden email]>
> wrote:
>> When you say 2B docs on a single Solr instance, are you talking only one
>> shard?
>> Because if you are, you're very close to the absolute upper limit of a
>> shard, internally
>> the doc id is an int or 2^31. 2^31 + 1 will cause all sorts of problems.

Thankfully we're not doing any updates on that particular instance. But
yes, we are getting close to the limits there. Is there any way to query
the internal document ID? :-/

  - Bram
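
There's no direct way to query Lucene's internal document ids from
outside, but the per-core counts that matter are exposed. A sketch of
pulling them through the CoreAdmin STATUS call (the exact response
layout can vary by Solr version):

```java
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.CoreAdminRequest;
import org.apache.solr.client.solrj.response.CoreAdminResponse;
import org.apache.solr.common.util.NamedList;

public class CoreHeadroom {
    public static void main(String[] args) throws Exception {
        // Note: CoreAdmin lives on the node URL, not on a core URL.
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");

        CoreAdminResponse status = CoreAdminRequest.getStatus("collection1", solr);
        NamedList<Object> index =
                (NamedList<Object>) status.getCoreStatus("collection1").get("index");

        System.out.println("numDocs: " + index.get("numDocs"));
        // maxDoc is the count that runs into the 2^31-1 ceiling.
        System.out.println("maxDoc:  " + index.get("maxDoc"));
    }
}
```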

Re: How large is your solr index?

Bram Van Dam
In reply to this post by Toke Eskildsen
On 12/29/2014 10:30 PM, Toke Eskildsen wrote:
> That being said, I acknowledge that it helps with stories to get a feel of what can be done.

That's pretty much what I'm after, mostly to reassure myself that it can
be done, even if it does require a lot of hardware (which is fine).


> At Lucene/Solr Revolution 2014, Grant Ingersoll also asked for user stories and pointed to https://wiki.apache.org/solr/SolrUseCases - sadly it has not caught on. The only entry is for our (State and University Library, Denmark) setup with 21TB / 7 billion documents on a single machine. To follow my own advice, I can elaborate that we have 1-3 concurrent users and a design goal of median response times below 2 seconds for faceted search. I guess that is at the larger end at the spectrum for pure size, but at the very low end for usage.

Thanks. I'll try to add some of our use cases!

  - Bram


RE: How large is your solr index?

Toke Eskildsen
In reply to this post by Shawn Heisey-2
Shawn Heisey [[hidden email]] wrote:
> I believe it would be useful to organize a session at Lucene Revolution,
> possibly more interactive than a straight presentation, where users with
> very large indexes are encouraged to attend.  The point of this session
> would be to exchange war stories, configuration requirements, hardware
> requirements, and observations.

From the perspective of the conference it might tie up a lot of time: If we were to get down to the configuration level, one session would not be enough. Some sort of pre-conference bar camp might do it? Or maybe even a whole pre-conference day?

(side-note to the side-note: Living in Europe, going to Lucene/Solr Revolution means spending more time on travel than the actual conference - extending the activities to 3 days would increase the odds of me going next year)

> Better documentation for extreme scaling is also a possible outcome.

I did at some point try to write a long blog entry on Solr hardware and setup for non-small corpuses, but had to give up: There were just too many "but if you need to scale X, you might be better off by choosing Y, unless your usage is Z". I think multiple detailed descriptions of setups are a great starting point. If we get enough of them, some pattern will hopefully emerge, although I am afraid that the pattern will be "to get this to work, we had to write custom code".

> Another idea, not sure if it would be good as an alternate idea or
> supplemental, is a less formal gathering, perhaps over a meal or three.

Outside of Lucene/Solr Revolution? How would that work geographically?

- Toke Eskildsen

Re: How large is your solr index?

Norgorn
In reply to this post by ralph tice
Please tell us a bit more about how you run your Solr instances.
When we try to run Solr with 5 shards, 50GB per shard, we often get OutOfMemory errors (especially for group queries). And while indexing, Solr often crashes (without exceptions - some JVM issue).

We are using Heliosearch.

Re: How large is your solr index?

Shawn Heisey-2
In reply to this post by Toke Eskildsen
On 12/30/2014 5:43 AM, Toke Eskildsen wrote:

> Shawn Heisey [[hidden email]] wrote:
>> I believe it would be useful to organize a session at Lucene Revolution,
>> possibly more interactive than a straight presentation, where users with
>> very large indexes are encouraged to attend.  The point of this session
>> would be to exchange war stories, configuration requirements, hardware
>> requirements, and observations.
>
> From the perspective of the conference it might tie up a lot of time: If we were to get down to the configuration level, one session would not be enough. Some sort of pre-conference bar camp might do it? Or maybe even a whole pre-conference day?
>
> (side-note to the side-note: Living in Europe, going to Lucene/Solr Revolution means spending more time on travel than the actual conference - extending the activities to 3 days would increase the odds of me going next year)

I've had the same problem with my desires to attend conventions.  Even
the ones that happen on my home continent (US) have extreme travel
expenses.  I may be able to attend ApacheCon this year, because I know
someone who lives in the hosting city.

It could indeed tie up a lot of time ... but for those who are dealing
with large indexes and those interested in the topic, I bet they would
be willing to dedicate that time.

I acknowledge that the whole idea as I have envisioned it might be
unworkable, too.

>> Better documentation for extreme scaling is also a possible outcome.
>
> I did at some point try to write a long blog entry on Solr hardware and setup for non-small corpuses, but have to give up: There were just too many "but if you need to scale X, you might be better off by choosing Y, unless your usage is Z". I think multiple detailed descriptions of setups is a great starting point. If we get enough of them, some pattern will hopefully emerge, although I am afraid that the pattern will be "to get this to work, we had to write custom code".

Even with all the caveats, a blog post like that might still be useful.
I can understand why you gave up ... different people will need
scalability in different ways, and unless you have direct experience
with all aspects, the resulting documentation or blog post would contain
much speculation ... and even the non-speculation parts probably would
only be correct for a subset of users.

You could be right about it coming down to custom code ... although if
that code is really useful, it could be donated and incorporated into
the project.

>> Another idea, not sure if it would be good as an alternate idea or
>> supplemental, is a less formal gathering, perhaps over a meal or three.
>
> Outside of Lucene/Solr Revolution? How would that work geographically?

I was thinking about this happening at or concurrently with LR, but now
that you have raised the possibility, a completely separate formal or
informal gathering does sound like a really good idea.  I don't know how
it would get organized or who would sponsor it, though.  I'd like to
discuss it with folks from LucidWorks and others who have experience with events.

I might not be able to attend something like that, but as much as I
would like to go, I am not really part of the main audience.

If there is sufficient interest, we could do a small-scale electronic
version of this gathering at any time ... either via email or at
pre-determined time on IRC.

Thanks,
Shawn


Re: How large is your solr index?

Erick Erickson
In reply to this post by Toke Eskildsen
bq: I did at some point try to write a long blog entry on Solr
hardware and setup for non-small corpuses, but had to give up:

Man, this makes me laugh! Oh the memories!

A common question from sales, quite a reasonable one at that: "can we
have a checklist that we can use to give clients an idea of how much
hardware to buy?". And do note that sales folks are talking to clients
of all different types and sizes.

I sat down and tried to do this... three separate times. Pretty soon
I'd get to the point of realizing that the doc was worthless exactly
because of all the "if this then that" phrases. I guess I can take
some comfort from the fact that it only took me about an hour the
third time to remember that it was hopeless, and after that I
remembered not to even try.

I think that it would be _extremely_ helpful to have a bunch of "war
stories" to reference. In my experience, people dealing with large
numbers of documents really are most concerned with whether what
they're doing is _possible_, and are mostly looking to see if someone
else has "been there and done that". Of course they'd like all the
specificity possible, but there's a lot of comfort in knowing
something similar has been done before.

Best,
Erick


Re: How large is your solr index?

Jack Krupansky-3
If people are so gung-ho to go down the "lots of endless pain" rabbit-hole
route by heavily under-configuring their clusters, I guess that's their
choice, but I would strongly advise against it. Sure, a few "the few and
the proud" warhorses can proudly proclaim how they "did it", and a small
number of elite young Turks can probably do it as well, but it's quite the
fool's errand for average developers to try to replicate the "heroic
efforts" of the few.

Rather, "average developers" are well-advised to simply seek "the easy
path" and cease and desist from trying to configure Solr clusters with a
billion documents or more per node, or even 500 million for that matter.
"Just say no" to any demands that you run Solr on so-called "fat nodes".

Go with relatively commodity hardware (e.g., 16-32 GB per node), even if
that means you need a lot more nodes. Or virtualize fat nodes into a
bunch of skinny nodes if that's all you have to work with.

My bottom line advice: use 100 million documents per node as your baseline
target, and make sure your index fits entirely in memory, with a proof of
concept implementation to validate whether the sweet spot for your
particular data, data model, and application access patterns may be well
above or even below that.
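
Jack's baseline combined with Bram's earlier numbers pencils out roughly
as follows (a sketch; every constant is an assumption lifted from this
thread):

```java
public class CapacitySketch {
    public static void main(String[] args) {
        long totalDocs = 100_000_000_000L; // Bram's target: 100B documents
        long docsPerNode = 100_000_000L;   // Jack's baseline: 100M docs/node
        double bytesPerDoc = 1000;         // Bram: ~2TB of index for ~2B docs

        long nodes = totalDocs / docsPerNode;
        double totalIndexTB = totalDocs * bytesPerDoc / 1e12;
        double indexPerNodeGB = docsPerNode * bytesPerDoc / 1e9;

        System.out.println("nodes needed:   " + nodes);          // 1,000
        System.out.println("total index TB: " + totalIndexTB);   // ~100
        // "Fits entirely in memory" at this density means ~100GB of RAM
        // (mostly OS page cache) per node -- well past a 16-32GB commodity
        // box, so either the RAM grows or docs-per-node shrinks.
        System.out.println("index/node GB:  " + indexPerNodeGB); // ~100
    }
}
```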

Yes, indeed, sing praises for heroes, but don't kill yourself and drag down
others trying to be one yourself.

</sermon>

-- Jack Krupansky


Re: How large is your solr index?

Alexandre Rafalovitch
I bet that while there are no specific numbers, there are indicators
that everybody - who knows what they are doing - looks at to decide
which particular aspect of the configuration is hurting most.

So perhaps a good article would be not so much the concrete numbers
but the indicators to check. I think I saw people throwing around
cache utilization as one of them. Any others?
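
For the cache-utilization indicator specifically, the numbers are
queryable from the MBeans handler. A sketch (the /admin/mbeans response
layout differs across Solr versions, so treat the NamedList walking as
approximate):

```java
import java.util.Map;

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.common.util.NamedList;

public class CacheIndicators {
    @SuppressWarnings("unchecked")
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        ModifiableSolrParams params = new ModifiableSolrParams();
        params.set("cat", "CACHE");     // only the cache MBeans
        params.set("stats", "true");    // include hitratio, evictions, ...
        QueryRequest req = new QueryRequest(params);
        req.setPath("/admin/mbeans");

        NamedList<Object> rsp = solr.request(req);
        NamedList<Object> categories = (NamedList<Object>) rsp.get("solr-mbeans");
        NamedList<Object> caches = (NamedList<Object>) categories.get("CACHE");

        // Low hitratio on filterCache/queryResultCache, or lots of
        // evictions, are the classic "look here first" indicators.
        for (Map.Entry<String, Object> cache : caches) {
            NamedList<Object> stats =
                    (NamedList<Object>) ((NamedList<Object>) cache.getValue()).get("stats");
            System.out.println(cache.getKey()
                    + " hitratio=" + stats.get("hitratio")
                    + " evictions=" + stats.get("evictions"));
        }
    }
}
```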

Regards,
   Alex.
----
Sign up for my Solr resources newsletter at http://www.solr-start.com/



Re: How large is your solr index?

Shawn Heisey-2
In reply to this post by Bram Van Dam
On 12/30/2014 1:19 AM, Bram Van Dam wrote:
> We had a look at Heliosearch a while ago and found it unsuitable. Seems
> like they're trying to make use of some native x86_64 code and HotSpot
> JVM specific features which we can't use. Some of our clients use IBM's
> JVM so we're pretty much limited to strictly Java.

Using IBM's Java is not recommended at all, for Solr or Heliosearch.
They enable many optimizations by default that are known to cause bugs
with Lucene and Solr.  Lucene has uncovered bugs with all JVMs, but the
bugs in IBM's Java are particularly persistent, and IBM seems to have
little interest in learning about them or fixing them.  The project does
have a good relationship with Oracle for problems in their code, so that
is the recommended implementation.

> That's my understanding as well. Lucene Revolution is on the wrong side
> of the Atlantic for me. But there's an Open Source Search devroom at
> FOSDEM this year, which seems like a sensible place to discuss these
> things. I'll make a post on the relevant mailing lists about this after
> the holidays if anyone is interested.

Lucene Revolution has happened in Europe.  In 2013, it was in Dublin.  I
don't have any information on the 2015 location.  There is also
something called Lucene EuroCon, but I can find no information about a
new event.  ApacheCon is another possibility, and the 2014 EU conference
was in Budapest.

Thanks,
Shawn

Reply | Threaded
Open this post in threaded view
|

Re: How large is your solr index?

Bram Van Dam
In reply to this post by Erick Erickson
On 12/30/2014 05:03 PM, Erick Erickson wrote:
> I think that it would be _extremely_ helpful to have a bunch of "war
> stories" to reference. In my experience, people dealing with large
> numbers of documents really are most concerned with whether what
> they're doing is _possible_, and are mostly looking to see if someone
> else has "been there and done that". Of course they'd like all the
> specificity possible, but there's a lot of comfort in knowing
> something similar has been done before.

That's right. We deal with some pretty interesting use cases for banks.
Some of them don't mind throwing hardware at a problem (some do).

One use case I can talk about is an archiving application. A customer
calls in, asks about something, someone has to physically walk down to
an archive, get a tape/cd/folder, plonk it in some ancient piece of
hardware, and then rely on awful tools like Windows file search to find
whatever it is they were looking for.

No matter *how bad* Solr performance might get at the
billions-of-documents-on-cheap-and-crappy-hardware scale, it's *always* going to be
better than the manual steps I just described. Even if it takes an hour
to run, the value added by being able to search and report using
structured & full-text search is immense.



Re: How large is your solr index?

Billnbell
In reply to this post by Jack Krupansky-3
For Solr 5 why don't we switch it to 64 bit ??

Bill Bell
Sent from mobile



RE: How large is your solr index?

Toke Eskildsen
Bill Bell [[hidden email]] wrote:

[solr maxdoc limit of 2b]

> For Solr 5 why don't we switch it to 64 bit ??

The biggest challenge for a switch is that Java's arrays can only hold about 2 billion (2^31 - 1) values. I support the idea of moving to much larger limits throughout the code, but it is a larger fix than replacing int with long.
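
To make that concrete, here is a minimal sketch of the paging that any
larger-than-int indexing implies; Lucene ships far more polished
versions of this idea in its packed/paged array utilities:

```java
// Java arrays are indexed by int, so addressing more than 2^31-1
// entries means paging: an array of arrays, with the high bits of a
// long index selecting the page and the low bits the slot.
public class PagedIntArray {
    private static final int PAGE_BITS = 30;            // 2^30 entries per page
    private static final int PAGE_SIZE = 1 << PAGE_BITS;
    private static final long PAGE_MASK = PAGE_SIZE - 1;

    private final int[][] pages;

    public PagedIntArray(long size) {
        int numPages = (int) ((size + PAGE_SIZE - 1) >>> PAGE_BITS);
        pages = new int[numPages][];
        for (int p = 0; p < numPages; p++) {
            long left = size - ((long) p << PAGE_BITS);
            pages[p] = new int[(int) Math.min(PAGE_SIZE, left)];
        }
    }

    public int get(long index) {   // a long index, unlike a plain int[]
        return pages[(int) (index >>> PAGE_BITS)][(int) (index & PAGE_MASK)];
    }

    public void set(long index, int value) {
        pages[(int) (index >>> PAGE_BITS)][(int) (index & PAGE_MASK)] = value;
    }

    public static void main(String[] args) {
        // 3 billion entries: beyond any single Java array. Needs ~12GB
        // of heap as written -- shrink the size to try it on a laptop.
        PagedIntArray a = new PagedIntArray(3_000_000_000L);
        a.set(2_500_000_000L, 42);
        System.out.println(a.get(2_500_000_000L)); // 42
    }
}
```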

- Toke Eskildsen