push to the limit without going over

Arturas Mazeika
Hi Solr Folk,

I am trying to push Solr to the limit and sometimes I succeed. The
question is how not to go over it, i.e., how to avoid:

java.lang.RuntimeException: Tried fetching cluster state using the node
names we knew of, i.e. [192.168.56.1:9998_solr, 192.168.56.1:9997_solr,
192.168.56.1:9999_solr, 192.168.56.1:9996_solr]. However, succeeded in
obtaining the cluster state from none of them. If you think your Solr
cluster is up and is accessible, you could try re-creating a new
CloudSolrClient using working solrUrl(s) or zkHost(s).
        at org.apache.solr.client.solrj.impl.HttpClusterStateProvider.getState(HttpClusterStateProvider.java:109)
        at org.apache.solr.client.solrj.impl.CloudSolrClient.resolveAliases(CloudSolrClient.java:1113)
        at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:845)
        at org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:818)
        at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:194)
        at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:173)
        at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:138)
        at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:152)
        at com.asc.InsertDEWikiSimple$SimpleThread.run(InsertDEWikiSimple.java:132)


Details:

I am benchmarking a SolrCloud setup on a single machine (an Intel i7 with 8
"CPU cores", an SSD as well as an HDD) using the German Wikipedia collection.
I created a cluster with 4 nodes, 4 shards, and replication factor 2 on that
machine (and managed to push the CPU or the SSD to its hardware limits, i.e.,
~200MB/s, ~100% CPU). Now I wanted to see what happens if I push the HDD to
its limits. Indexing the files from the SSD (I am able to scan the collection
at an actual rate of 400-500MB/s) with 16 threads, I tried to send them to
the Solr cluster with all indexes on the HDD.

Clearly Solr needs to deal with a very slow hard drive (10-20MB/s actual
rate). If the cluster is left alone, SolrJ may start losing connections
after a few hours. If one checks the status of the cluster in the meantime,
it may happen sooner. After the connection is lost, the cluster calms down
with its writing within half a dozen minutes.

What would be a reasonable way to push to the limit without going over?

The exact parameters are:

- 4 cores running 2gb ram
- Schema:

  <fieldType name="ft_wiki_de" class="solr.TextField"
positionIncrementGap="100">
     <analyzer>
       <charFilter class="solr.HTMLStripCharFilterFactory"/>
       <tokenizer  class="solr.StandardTokenizerFactory"/>
       <filter     class="solr.GermanMinimalStemFilterFactory"/>
       <filter     class="solr.LowerCaseFilterFactory"/>
     </analyzer>
  </fieldType>

  <fieldType name="ft_url" class="solr.TextField" positionIncrementGap="100">
     <analyzer>
       <tokenizer  class="solr.StandardTokenizerFactory"/>
       <filter     class="solr.LowerCaseFilterFactory"/>
     </analyzer>
  </fieldType>

  <fieldType name="uuid" class="solr.UUIDField" indexed="true" />
  <field name="id" type="uuid" indexed="true" stored="true" required="true"/>
  <field name="_root_" type="uuid" indexed="true" stored="false"
docValues="false" />

  <field name="size"    type="pint"       indexed="true" stored="true"/>
  <field name="time"    type="pdate"      indexed="true" stored="true"/>
  <field name="content" type="ft_wiki_de" indexed="true" stored="true"/>
  <field name="url"     type="ft_url"     indexed="true" stored="true"/>

  <field name="_version_" type="plong"        indexed="false" stored="false"/>

I connect with SolrJ once:

        ArrayList<String> urls = new ArrayList<>();
        urls.add("http://localhost:9999/solr");
        urls.add("http://localhost:9998/solr");
        urls.add("http://localhost:9997/solr");
        urls.add("http://localhost:9996/solr");

        solrClient = new CloudSolrClient.Builder(urls)
            .withConnectionTimeout(10000)
            .withSocketTimeout(60000)
            .build();
        solrClient.setDefaultCollection("de_wiki_man");

and then run the following in 16 threads for as long as there is anything left to process:

                    Path p = getJobPath();
                    String content = new String(Files.readAllBytes(p));
                    UUID id = UUID.randomUUID();
                    SolrInputDocument doc = new SolrInputDocument();

                    BasicFileAttributes attr = Files.readAttributes(p, BasicFileAttributes.class);

                    doc.addField("id",      id.toString());
                    doc.addField("content", content);
                    doc.addField("time",    attr.creationTime().toString());
                    doc.addField("size",    content.length());
                    doc.addField("url",     p.getFileName().toAbsolutePath().toString());
                    solrClient.add(doc);


to go through all the wiki html files.

Cheers,
Arturas

Re: push to the limit without going over

Erick Erickson
First, I'd usually construct the CloudSolrClient using
the ZooKeeper ensemble string rather than URLs,
although that's probably not a cure for your problem.
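
Off the top of my head, the ZooKeeper-based construction in SolrJ 7.x looks
roughly like this (untested sketch; it assumes your embedded ZooKeeper runs
at Solr port + 1000, i.e. localhost:10999 for the node on 9999, and that the
collection is "de_wiki_man"):

        // Sketch only: connect through ZooKeeper instead of per-node HTTP URLs.
        // "localhost:10999" is an assumption about where the embedded ZK listens.
        // Needs java.util.Collections, java.util.Optional and
        // org.apache.solr.client.solrj.impl.CloudSolrClient.
        List<String> zkHosts = Collections.singletonList("localhost:10999");
        CloudSolrClient solrClient = new CloudSolrClient.Builder(zkHosts, Optional.empty())
            .withConnectionTimeout(10000)
            .withSocketTimeout(60000)
            .build();
        solrClient.setDefaultCollection("de_wiki_man");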

Here's what I _think_ is happening. If you're slamming Solr
with a lot of updates, you're doing a lot of merging. At some point,
when there are a lot of merges going on, incoming
updates block until one or more merge threads are done.

At that point, I suspect your client is timing out. And (perhaps)
if you used the ZooKeeper ensemble instead of HTTP, the
cluster-state fetch error would go away. I suspect that another
issue would come up instead, but....

It's also possible this would all go away if you increase your
timeouts significantly. That's still a "set it and hope" approach
rather than a totally robust solution though.

Let's assume that the above works and you start getting timeouts.
You can back off the indexing rate at that point, or just go to
sleep for a while. This isn't what you'd want as a permanent solution,
but it may let you get by.
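
A crude sketch of what I mean (made-up retry count and sleep steps; it
assumes it sits inside your SimpleThread.run() around the existing add call,
and catches SolrServerException/IOException from SolrClient.add):

        // Sketch: retry the add with a growing pause when Solr pushes back.
        int attempt = 0;
        final int MAX_RETRIES = 5;                        // arbitrary
        while (true) {
            try {
                solrClient.add(doc);
                break;                                    // indexed, move on to the next file
            } catch (SolrServerException | IOException e) {
                if (++attempt > MAX_RETRIES) {
                    throw new RuntimeException("giving up on " + p, e);
                }
                try {
                    Thread.sleep(attempt * 5000L);        // back off: 5s, 10s, 15s, ...
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    return;                               // assumes we're in run()
                }
            }
        }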

There's work afoot to separate out update thread pools from query
thread pools so _querying_ doesn't suffer when indexing is heavy,
but that hasn't been implemented yet. This could also address
your cluster state fetch error.

You will get significantly better throughput if you batch your
docs and use the client.add(list_of_documents) BTW.
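
Roughly like this (sketch only; BATCH_SIZE is an arbitrary knob to tune, and
buildDoc() is a hypothetical helper standing in for the addField(...) calls
you already have in your loop):

        // Sketch: collect documents and send them in one add() round trip.
        List<SolrInputDocument> batch = new ArrayList<>();
        final int BATCH_SIZE = 100;                       // tune this
        Path p;
        while ((p = getJobPath()) != null) {
            batch.add(buildDoc(p));                       // hypothetical helper = your addField calls
            if (batch.size() >= BATCH_SIZE) {
                solrClient.add(batch);                    // one request for the whole batch
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            solrClient.add(batch);                        // flush the tail
        }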

Another possibility is to use the new metrics API (available since Solr 6.4).
It provides over 200 metrics you can query, and it's quite
possible that they'd help your clients know when to self-throttle,
but AFAIK there's nothing built in to help you there.
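
If you want to poke at the metrics from SolrJ, something like this should be
in the ballpark (untested sketch; it just dumps the node-level group from one
node, and which metric names are worth watching is up to you):

        // Sketch: fetch /solr/admin/metrics?group=node from the node on port 9999.
        HttpSolrClient node = new HttpSolrClient.Builder("http://localhost:9999/solr").build();
        ModifiableSolrParams params = new ModifiableSolrParams();
        params.set("group", "node");
        GenericSolrRequest req =
            new GenericSolrRequest(SolrRequest.METHOD.GET, "/admin/metrics", params);
        System.out.println(req.process(node).getResponse());   // NamedList of node metrics
        node.close();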

Best,
Erick


Re: push to the limit without going over

Arturas Mazeika
Hi Erick et al,

Thanks a lot for the response. Your explanation seems very plausible and
I'd love to investigate it further.

Batching the docs improved the numbers (surprisingly so, to me):

Buffer size   secs   MB/s         Docs/s
N:500         1117   34.4077538   2400.72695
N:100         1073   35.8186962   2499.17241
N:10          1170   32.849112    2291.97607
N:5           1234   31.1454303   2173.10535
N:3           1433   26.8202798   1871.32729
N:2           1758   21.862037    1525.37656
N:1           2307   16.6594976   1162.38058

It looks like the larger the buffer (in terms of number of documents), the
faster the processing. I thought the gains would not be that high, since (1)
Solr buffers documents itself, and (2) the documents are pretty large.

The SolrJ API has changed a bit over the last few releases and it is becoming
incredibly difficult to find working code. You mentioned that I can connect
to the zkHost directly. I tried [1], [2], and [3] and their variants without
any success (the returned object was null). How would it look in the 7.2+
branch (I am currently running the embedded ZooKeeper; Solr runs on 9999, so
ZooKeeper should be on 10999 [4])?

I am impressed by the number of metrics I can get out of Solr even with my
very limited knowledge. You mentioned that there are 200+ metrics one can
query about the system. As the primary source of information, would you
recommend:

https://lucene.apache.org/solr/guide/7_4/collections-api.html

Can you maybe expand this list with additional references?

Cheers,
Arturas

Refs:

[1]
        String zkHostString = "localhost:10999";
        SolrClient solrClient = new CloudSolrClient(zkHostString, true);
        solrClient.setDefaultCollection("de_wiki_man");

[2]
        String zkHostString = "localhost:10999";
        SolrClient solrClient = new
CloudSolrClient.Builder().withZkHost(zkHostString).build();

[3]

        ArrayList<String> zkHosts = new ArrayList<>();
        zkHosts.add("localhost:10999");

        solrClient = new CloudSolrClient.Builder(zkHosts, null)
            .withConnectionTimeout(1000000)
            .withSocketTimeout(6000000)
            .build();

        solrClient.setDefaultCollection("de_wiki_man");

[4]
C:\WINDOWS\system32>netstat -aon | grep 13984
  TCP    0.0.0.0:9999           0.0.0.0:0              LISTENING       13984
  TCP    0.0.0.0:10999          0.0.0.0:0              LISTENING       13984
  TCP    127.0.0.1:8999         0.0.0.0:0              LISTENING       13984
  TCP    127.0.0.1:62888        127.0.0.1:62889        ESTABLISHED     13984
  TCP    127.0.0.1:62889        127.0.0.1:62888        ESTABLISHED     13984
  TCP    127.0.0.1:62891        127.0.0.1:62892        ESTABLISHED     13984
  TCP    127.0.0.1:62892        127.0.0.1:62891        ESTABLISHED     13984
  TCP    127.0.0.1:62900        127.0.0.1:62901        ESTABLISHED     13984
  TCP    127.0.0.1:62901        127.0.0.1:62900        ESTABLISHED     13984
  TCP    127.0.0.1:62902        127.0.0.1:62903        ESTABLISHED     13984
  TCP    127.0.0.1:62903        127.0.0.1:62902        ESTABLISHED     13984
  TCP    127.0.0.1:62904        127.0.0.1:62905        ESTABLISHED     13984
  TCP    127.0.0.1:62905        127.0.0.1:62904        ESTABLISHED     13984
  TCP    127.0.0.1:62906        127.0.0.1:62907        ESTABLISHED     13984
  TCP    127.0.0.1:62907        127.0.0.1:62906        ESTABLISHED     13984
  TCP    [::]:9999              [::]:0                 LISTENING       13984
  TCP    [::]:10999             [::]:0                 LISTENING       13984
  TCP    [::1]:10999            [::1]:62893            ESTABLISHED     13984
  TCP    [::1]:62893            [::1]:10999            ESTABLISHED     13984


Re: push to the limit without going over

Shawn Heisey-2
On 7/4/2018 3:32 AM, Arturas Mazeika wrote:

<snip>
> - 4 cores running 2gb ram

If this is saying that the machine running Solr has 2GB of installed
memory, that's going to be a serious problem.

The default heap size that Solr starts with is 512MB.  With 4 Solr nodes
running on the machine, each with a 512MB heap, all of your 2GB of
memory is going to be required by the heaps.  Java requires memory
beyond the heap to run.  Your operating system and its other processes
will also require some memory.

This means that not only are you going to have no memory left for the OS
disk cache, you're actually going to be allocating MORE than the 2GB of
installed memory, which means the OS is going to start swapping to
accommodate memory allocations.

When you don't have enough memory for good disk caching, Solr
performance is absolutely terrible.  When Solr has to wait for data to
be read off of disk, even if the disk is SSD, its performance will not
be good.

When the OS starts swapping, the performance of ANY software on the
system drops SIGNIFICANTLY.

You need a lot more memory than 2GB on your server.

Thanks,
Shawn


Re: push to the limit without going over

Erick Erickson
Arturas:

" it is becoming incredibly difficult to find working code"

Yeah, I sympathize totally. What I usually do is go into the test code
of whatever version of Solr I'm using and find examples there. _That_
code _must_ be kept up to date ;).

About batching docs: what you gain is basically more efficient I/O; you
don't have to wait around for the client to connect/disconnect for every
doc. Here are some numbers:
https://lucidworks.com/2015/10/05/really-batch-updates-solr-2/, with
all the caveats that YMMV.

Best,
Erick
