Largest number of indexed documents used by Solr

7 messages

Largest number of indexed documents used by Solr

Steven White
Hi everyone,

I'm about to start a project that requires indexing 36 million records
using Solr 7.2.1.  Each record ranges from 500 KB to 0.25 MB, with an
average of 0.1 MB.

Has anyone indexed this number of records?  What are the things I should
worry about?  And out of curiosity, what is the largest number of records
that Solr has indexed which is published out there?
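For rough capacity planning, the figures above work out as follows (a sketch; the index-to-raw ratios are illustrative assumptions that depend on schema, stored fields, and compression):

```python
# Back-of-envelope sizing from the figures above.
num_docs = 36_000_000   # records to index
avg_doc_mb = 0.1        # stated average record size, in MB

raw_mb = num_docs * avg_doc_mb
raw_tb = raw_mb / 1_000_000   # decimal TB
print(f"raw corpus: ~{raw_tb:.1f} TB")

# On-disk index size relative to raw input varies widely; these
# ratios are assumptions, not measurements.
for ratio in (0.5, 1.0, 2.0):
    print(f"index at {ratio}x raw: ~{raw_mb * ratio / 1_000_000:.1f} TB")
```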

Thanks

Steven

Re: Largest number of indexed documents used by Solr

Abhi Basu
We have tested Solr 4.10 with 200 million docs with an average doc size of
250 KB.  No performance issues when using 3 shards / 2 replicas.
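For reference, a layout like that can be created with Solr's stock tooling; the collection name and port below are placeholders, and this assumes a running SolrCloud cluster:

```shell
# Create a collection with 3 shards and 2 replicas per shard
# ("mycoll" is a placeholder name).
bin/solr create -c mycoll -shards 3 -replicationFactor 2

# Equivalent call to the Collections API:
curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=mycoll&numShards=3&replicationFactor=2"
```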






--
Abhi Basu

Re: Largest number of indexed documents used by Solr

Walter Underwood
In reply to this post by Steven White
We have a 24 million document index. Our documents (homework problems) are a bit smaller than yours.

The Hathi Trust probably has the record. They haven’t updated their blog for a while, but they were at 11 million books and billions of pages in 2014.

https://www.hathitrust.org/blogs/large-scale-search

wunder
Walter Underwood
[hidden email]
http://observer.wunderwood.org/  (my blog)



Re: Largest number of indexed documents used by Solr

Yago Riveiro
In reply to this post by Abhi Basu
Hi,

In my company we are running a 12-node cluster with 10 billion (US billion, 10^9) documents, 12 shards / 2 replicas.

We do mainly faceting queries, with very reasonable performance.

36 million documents is not an issue; you can handle that volume of documents with 2 nodes with SSDs and 32 GB of RAM.
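As a sanity check on those numbers (a sketch; it assumes documents are distributed evenly and that all replicas live on the 12 nodes):

```python
# Rough per-node load for the cluster described above.
total_docs = 10_000_000_000   # ~10 billion (10^9) documents
num_shards = 12
replication_factor = 2
num_nodes = 12

docs_per_shard = total_docs / num_shards
# Every document exists in `replication_factor` shard replicas.
docs_per_node = total_docs * replication_factor / num_nodes

print(f"~{docs_per_shard:,.0f} docs per shard")
print(f"~{docs_per_node:,.0f} indexed document copies per node")
```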

Regards.

--

Yago Riveiro


Re: Largest number of indexed documents used by Solr

苗海泉
Once we reached 49 shards per collection and more than 600 collections,
Solr developed serious performance problems that we don't know how to
resolve. My advice to you is to minimize the number of collections.
Our environment is 49 Solr server nodes, each with 32 CPUs / 128 GB RAM,
and the data volume is about 50 billion documents per day.
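Those numbers help explain the problem: every shard replica is a Lucene core with its own overhead (file handles, caches, merge activity), and the totals add up quickly. A sketch, assuming 2 replicas since the replication factor isn't stated above:

```python
num_collections = 600
shards_per_collection = 49
replication_factor = 2   # assumed; not stated above
num_nodes = 49

total_shards = num_collections * shards_per_collection
total_cores = total_shards * replication_factor
cores_per_node = total_cores / num_nodes

print(total_shards, total_cores, cores_per_node)  # 29400 58800 1200.0
```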






--
==============================
联创科技
知行如一 (unity of knowledge and action)
==============================

Re: Largest number of indexed documents used by Solr

Kelly, Frank
In reply to this post by Steven White
For us, we have ~350M documents stored on r3.xlarge nodes with an 8 GB heap
and about 31 GB of RAM.

We are using Solr 5.3.1 in a SolrCloud setup (3 collections, each with 3
shards and 3 replicas).

For us, lots of RAM is not as important as CPU (the EBS disks we run on
are quite fast and our memory hit rate is quite low).

Some things that helped:
1) Turned off the filter cache (it required too much heap)
2) Set a limit on replication bandwidth (recovering nodes can tie up a
lot of CPU), in particular maxWriteMBPerSec=100
3) Set the query timeout to 2 seconds to help kill "heavy" queries
4) Set preferLocalShards=true to help mitigate EC2 nodes with a
"noisy neighbor"
5) Implemented our own CloudWatch-based monitoring so that when Solr VM
CPU is high (> 90%) we queue up indexing traffic rather than send it to be
indexed.
We found that if you peg Solr CPU for too long, replicas can't keep up and
go into recovery, which drives CPU even higher; eventually the cluster
thinks the nodes are "down" when they repeatedly fail at recovery.
So we really try to manage Solr CPU load. (We'll probably look at switching
to compute-optimized nodes in the future.)
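For reference, the first four points map onto configuration roughly like this. This is a sketch against Solr 5.x, not our exact config, and element details may differ by version:

```xml
<!-- solrconfig.xml fragments (sketch) -->

<!-- 1) Effectively disable the filter cache -->
<filterCache class="solr.FastLRUCache" size="0" initialSize="0" autowarmCount="0"/>

<!-- 2) Cap replication/recovery write bandwidth at 100 MB/s -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="defaults">
    <str name="maxWriteMBPerSec">100</str>
  </lst>
</requestHandler>

<!-- 3) + 4) Default query parameters: 2-second time limit and
     prefer replicas on the node that received the request -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <int name="timeAllowed">2000</int>
    <bool name="preferLocalShards">true</bool>
  </lst>
</requestHandler>
```

Point 5 is application-level backpressure rather than Solr configuration, so it has no equivalent here.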

Best

-Frank




Re: Largest number of indexed documents used by Solr

Joe Obernberger
In reply to this post by 苗海泉
50 billion per day?  Wow!  How large are these documents?

We have a cluster with one large collection that contains 2.4 billion
documents spread across 40 machines, using HDFS for the index. We store
our data in HBase, and to re-index we pull from HBase and index with
SolrCloud. The most we can do is around 57 million per day, usually
limited by pulling data out of HBase, not by Solr.
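For comparing setups, it helps to convert that into a sustained rate (a quick sketch from the figure above):

```python
docs_per_day = 57_000_000
seconds_per_day = 24 * 60 * 60  # 86,400

docs_per_second = docs_per_day / seconds_per_day
print(f"~{docs_per_second:.0f} docs/sec sustained")
```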

-Joe

