20180917-Need Apache SOLR support


20180917-Need Apache SOLR support

KARTHICKRM
Dear SOLR Team,

 

We are beginners with Apache Solr and need the following clarifications from you.

 

1.      In SolrCloud, how can we install more than one shard on a single PC?

 

2.      What is the maximum number of shards that can be added under one SolrCloud cluster?

 

3.      My application has no need for ACID properties; apart from that, can I
use Solr as a complete database?

 

4.      On which OS will we see better performance, Windows Server or
Linux?

 

5.      If a Solr core contains 2 billion indexed documents, what are the
recommended RAM size and Java heap space for better performance?

 

6.      I have 20 fields per document; what is the maximum number of documents
that can be inserted / retrieved in a single request?

 

7.      If I have billions of indexed documents, and the "start" parameter is
the 10-millionth document with "rows" set to 100, will any performance
issue be raised?

 

8.      Which .NET client is best for Solr?

 

9.      Is there any size limitation on a single field, for example for
blob data?

 

 

Thanks,

Karthick.R.M

+91 8124774480

 

 


Re: 20180917-Need Apache SOLR support

Jan Høydahl / Cominvent
> We are beginners to Apache SOLR, We need following clarifications from you.
>
>
>
> 1.      In SOLRCloud, How can we install more than one Shared on Single PC?

You typically have one installation of Solr on each server. Then you can add a collection with multiple shards, specifying how many shards you wish when creating the collection, e.g.

bin/solr create -c mycoll -shards 4

Although possible, it is normally not advised to install multiple instances of Solr on the same server.

> 2.      How many maximum number of shared can be added under on SOLRCloud?

There is no limit. You should find a good number based on the number of documents, the size of your data, the number of servers in your cluster, available RAM and disk size and the required performance.

In practice you will guess the initial number of shards and then benchmark a few different settings before you decide.
Note that you can also adjust the number of shards as you go via the CREATESHARD / SPLITSHARD APIs, so even if you start out with few shards you can grow later.

> 3.      In my application there is no need of ACID properties, other than
> this can I use SOLR as a Complete Database?

You COULD, but Solr is not intended to be your primary data store. You should always design your system so that you can re-index all content from some source (does not need to be a database) when needed. There are several use cases for a complete re-index that you should consider.

> 4.      In Which OS we can feel the better performance, Windows Server OS /
> Linux?

I'd say Linux if you can. If you HAVE to, then you could also run on Windows :-)

> 5.      If a SOLR Core contains 2 Billion indexes, what is the recommended
> RAM size and Java heap space for better performance?

It depends. It is not likely that you will ever put 2bn docs in one single core. Normally you would have sharded long before that number.
The amount of physical RAM and the amount of Java heap to allocate to Solr must be calculated and decided on a per case basis.
You could also benchmark this - test whether a larger RAM size improves performance due to caching. Depending on your bottlenecks, adding more RAM may be a way to scale further before needing to add more servers.

Sounds like you should consult a Solr expert to dive deep into your exact use case and architect the optimal setup, if you have these amounts of data.

> 6.      I have 20 fields per document, how many maximum number of documents
> can be inserted / retrieved in a single request?

No limit. But there are practical limits.
For indexing (update), try various batch sizes and find which gives the best performance for you. It is just as important to do inserts (updates) over many parallel connections as in large batches.
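As a rough sketch of that advice (the `send_batch` helper below is a hypothetical stand-in for a real Solr client call, e.g. an HTTP POST to /update; batch size and worker count are illustrative, not recommendations):

```python
from concurrent.futures import ThreadPoolExecutor

def send_batch(batch):
    # Hypothetical stand-in for a real update call (e.g. POSTing the batch
    # as JSON to /update); here it only reports the batch size.
    return len(batch)

def index_all(docs, batch_size=500, workers=4):
    # Cut the stream of documents into fixed-size batches...
    batches = [docs[i:i + batch_size] for i in range(0, len(docs), batch_size)]
    # ...and push the batches over several parallel connections, since
    # parallelism matters as much as batch size for indexing throughput.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(send_batch, batches))

docs = [{"id": str(n)} for n in range(2000)]
print(index_all(docs))  # 2000
```

Benchmark several (batch_size, workers) combinations against your own cluster before settling on numbers.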

For searching, why would you want to know a maximum? Normally the use case for search is to get the TOP N docs, not a maximum number.
If you need to retrieve thousands of results, you should have a look at /export handler and/or streaming expressions.

> 7.       If I have Billions of indexes, If the "start" parameter is 10th
> Million index and "end" parameter is  start+100th index, for this case any
> performance issue will be raised ?

Don't do it!
This is a warning sign that you are using Solr in a wrong way.

If you need to scroll through all docs in the index, have a look at streaming expressions or cursorMark instead!
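For illustration, the cursorMark loop looks like this; `solr_select` below is a hypothetical stand-in for the real HTTP request to /select (which requires a sort on the uniqueKey field and starts with cursorMark=*):

```python
def solr_select(params, _all_docs=list(range(25))):
    # Hypothetical stand-in for GET /select?q=...&sort=id asc&cursorMark=...
    # It mimics Solr's contract: when there are no more results, the
    # returned nextCursorMark equals the cursorMark that was sent.
    start = 0 if params["cursorMark"] == "*" else int(params["cursorMark"])
    page = _all_docs[start:start + params["rows"]]
    return {"docs": page, "nextCursorMark": str(start + len(page))}

def scroll_all(rows=10):
    # Walk the whole result set page by page using cursorMark.
    cursor, out = "*", []
    while True:
        resp = solr_select({"q": "*:*", "rows": rows, "cursorMark": cursor})
        out.extend(resp["docs"])
        if resp["nextCursorMark"] == cursor:  # cursor stopped moving: done
            break
        cursor = resp["nextCursorMark"]
    return out

print(len(scroll_all()))  # 25
```

Unlike start/rows paging, each request only has to pick up where the cursor left off, so the cost per page stays flat no matter how deep you scroll.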

> 8.      Which .net client is best for SOLR?

The only one I'm aware of is SolrNet. There may be others. None of them are supported by the Solr project.

> 9.      Is there any limitation for single field, I mean about the size for
> blob data?

I think there is some default cutoff for very large values.

Why would you want to put very large blobs into documents?
This is a warning flag that you may be using the search index in a wrong way. Consider storing large blobs outside of the search index and reference them from the docs.


In general, it would help a lot if you start telling us WHAT you intend to use Solr for, what you try to achieve, what performance goals/requirements you have etc, instead of a lot of very specific max/min questions. There are very seldom hard limits, and if there are, it is usually not a good idea to approach them :)

Jan


Re: 20180917-Need Apache SOLR support

Susheel Kumar-3
I'd highly advise using the Java library (SolrJ) to connect to Solr
rather than .NET.  CloudSolrClient and the other classes take care of
many things when communicating with a SolrCloud cluster that has shards,
replicas, etc., and if the .NET ports of SolrJ are not up to date or are
missing functionality (which I suspect), you may run into issues.

Thnx

On Mon, Sep 17, 2018 at 10:01 AM Jan Høydahl <[hidden email]> wrote:


Re: 20180917-Need Apache SOLR support

Walter Underwood
In reply to this post by Jan Høydahl / Cominvent
Do not use Solr as a database. It was never designed to be a database.
It is missing a lot of features that are normal in databases.

* no transactions
* no rollback (in Solr Cloud)
* no session isolation (one client’s commit will commit all data in progress)
* no schema migration
* no version migration
* no real backups (Solr backup is a cold server, not a dump/load)
* no dump/load
* no way to modify a record in place (atomic updates cover only a subset of this)

Solr assumes you can always reload all the data from a repository. This is done
instead of migration or backups.

If you use Solr as a database and lose all your data, don’t blame us. It was
never designed to do that.

wunder
Walter Underwood
[hidden email]
http://observer.wunderwood.org/  (my blog)

> On Sep 17, 2018, at 7:01 AM, Jan Høydahl <[hidden email]> wrote:

Re: 20180917-Need Apache SOLR support

Shawn Heisey-2
In reply to this post by KARTHICKRM
On 9/17/2018 7:04 AM, KARTHICKRM wrote:
> Dear SOLR Team,
>
> We are beginners to Apache SOLR, We need following clarifications from you.

Much of what I'm going to say is a mirror of what you were already told
by Jan.  All of Jan's responses are good.

> 1.      In SOLRCloud, How can we install more than one Shared on Single PC?

One Solr instance can run multiple indexes.  Except for one specific
scenario that I hope you don't run into, you should NOT run multiple
Solr instances per server.  There should only be one.  If your query
rate is very low, then you can get good performance from multiple shards
per node, but with a high query rate, you'll only want one shard per node.

> 2.      How many maximum number of shared can be added under on SOLRCloud?

There is no practical limit.  If you create enough of them (more than a
few hundred), you can end up with severe scalability problems related to
SolrCloud's interaction with ZooKeeper.

> 3.      In my application there is no need of ACID properties, other than
> this can I use SOLR as a Complete Database?

Solr is NOT a database.  All of its capability and all the optimizations
it contains are all geared towards search.  If you try to use it as a
database, you're going to be disappointed with it.

> 4.      In Which OS we can feel the better performance, Windows Server OS /
> Linux?

From those two choices, I would strongly recommend Linux. If you have
an open source operating system that you prefer to Linux, go with that.

> 5.      If a SOLR Core contains 2 Billion indexes, what is the recommended
> RAM size and Java heap space for better performance?

I hope you mean 2 billion documents here, not 2 billion indexes.  Even
though technically speaking there's nothing preventing SolrCloud from
handling that many indexes, you'll run into scalability problems long
before you reach that many.

If you do mean documents ... don't put that many documents in one core. 
That number includes deleted documents, which means there's a good
possibility of going beyond the actual limit if you try to have 2
billion documents that haven't been deleted.

> 6.      I have 20 fields per document, how many maximum number of documents
> can be inserted / retrieved in a single request?

There's no limit to the number that can be retrieved.  But because the
entire response must be built in memory, you can run your Solr install
out of heap memory by trying to build a large response.  Streaming
expressions can be used for really large results to avoid the memory issues.

As for the number of documents that can be inserted by a single request
... Solr defaults to a maximum POST body size of 2 megabytes.  This can
be increased through an option in solrconfig.xml.  Unless your documents
are huge, this is usually enough to send several thousand at once, which
should be plenty.
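As a back-of-envelope illustration of why 2 MB is usually enough (the per-document size below is an assumption for the sake of arithmetic, not a measurement -- your own documents may differ):

```python
# Assumed sizes for illustration only -- measure your real documents.
post_limit_bytes = 2 * 1024 * 1024  # Solr's default POST body limit (2 MB)
avg_doc_bytes = 20 * 25             # e.g. 20 fields at ~25 bytes of JSON each

docs_per_request = post_limit_bytes // avg_doc_bytes
print(docs_per_request)  # 4194 -- several thousand documents per request
```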

> 7.       If I have Billions of indexes, If the "start" parameter is 10th
> Million index and "end" parameter is  start+100th index, for this case any
> performance issue will be raised ?

Let's say that you send a request with these parameters, and the index
has three shards:

start=10000000&rows=100

Every shard in the index is going to return a result to the coordinating
node of ten million plus 100.  That's thirty million individual
results.  The coordinating node will combine those results, sort them,
and then request full documents for the 100 specific rows that were
requested.  This takes a lot of time and a lot of memory.
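The arithmetic behind that cost, for the numbers above:

```python
shards = 3
start, rows = 10_000_000, 100

# Each shard must collect and return its own top (start + rows) entries,
# and the coordinating node merges all of them to pick just 100 documents.
per_shard = start + rows
merged = shards * per_shard
print(merged)  # 30000300 entries handled to return 100 docs
```

The work grows linearly with the start offset (and with shard count), which is why deep start/rows paging gets so expensive.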

For deep paging, use cursorMark.  For large result sets, use streaming
expressions.  I have used cursorMark ... its only disadvantage is that
you can't jump straight to page 10000, you must go through all of the
earlier pages too.  But page 10000 will be just as fast as page 1.  I
have never used streaming expressions.

> 8.      Which .net client is best for SOLR?

No idea.  The only client produced by this project is the Java client. 
All other clients are third-party, including .NET clients.

> 9.      Is there any limitation for single field, I mean about the size for
> blob data?

There are technically no limitations here.  But if your data is big
enough, it begins to cause scalability problems.  It takes time to read
data off the disk, for the CPU to process it, etc.

In conclusion, I have much the same thing to say as Jan said.  It sounds
to me like you're not after a search engine, and that Solr might not be
the right product for what you're trying to accomplish.  I'll say this
again: Solr is NOT a database.

Thanks,
Shawn


Re: 20180917-Need Apache SOLR support

Ere Maijala

Shawn Heisey wrote on 17.9.2018 at 19.03:

>> 7.       If I have Billions of indexes, If the "start" parameter is 10th
>> Million index and "end" parameter is  start+100th index, for this case
>> any
>> performance issue will be raised ?
>
> Let's say that you send a request with these parameters, and the index
> has three shards:
>
> start=10000000&rows=100
>
> Every shard in the index is going to return a result to the coordinating
> node of ten million plus 100.  That's thirty million individual
> results.  The coordinating node will combine those results, sort them,
> and then request full documents for the 100 specific rows that were
> requested.  This takes a lot of time and a lot of memory.

What Shawn says above means that even if you give Solr a heap big enough
to handle that, you'll run into serious performance issues even with a
light load, since these huge allocations easily lead to stop-the-world
garbage collections that kill performance. I've tried it and it was bad.

If you are thinking of a user interface that allows jumping to an
arbitrary result page, you'll have to limit it to some sensible number
of results (10 000 is probably safe, 100 000 may also work) or use
something other than Solr. Cursor mark and streaming are great options,
but only if you want to process all the records. Often the deep paging
need is really just the need to see the last results, and that can also
be achieved by allowing reverse sorting.

Regards,
Ere

--
Ere Maijala
Kansalliskirjasto / The National Library of Finland

Re: 20180917-Need Apache SOLR support

zhenyuan wei
In reply to this post by Shawn Heisey-2
Does that mean a small number of shards gives better performance?
I also have a use case with 3 billion documents; the collection
contains 60 shards now. Would 10 shards be better than 60 shards?



Shawn Heisey <[hidden email]> wrote on Tuesday, September 18, 2018 at 12:04 AM:


Re: 20180917-Need Apache SOLR support

Shawn Heisey-2
On 9/17/2018 9:05 PM, zhenyuan wei wrote:
> Is that means: Small amount of shards  gains  better performance?
> I also have a usecase which contains 3 billion documents,the collection
> contains 60 shard now. Is that 10 shard is better than 60 shard?

There is no definite answer to this question.  It depends on a bunch of
things.  How big is each shard once it's finally built?  What's your
query rate?  How many machines do you have, and how much memory do those
machines have?

Thanks,
Shawn


Re: 20180917-Need Apache SOLR support

zhenyuan wei
I have 6 machines, and each machine runs a Solr server using 18 GB of RAM.
The total document count is 3.2 billion (1.4 TB).
My collection's replication factor is 1 and the shard count is 60;
currently each shard is 20-30 GB, with 15 fields per document.
The query rate is low for now, maybe 100-500 requests per second.

Shawn Heisey <[hidden email]> wrote on Tuesday, September 18, 2018 at 12:07 PM:


Re: 20180917-Need Apache SOLR support

Shawn Heisey-2
On 9/18/2018 1:11 AM, zhenyuan wei wrote:
> I have 6 machines,and each machine run a solr server, each solr server use
> RAM 18GB.  Total document number is 3.2billion,1.4TB ,
> my collection‘s replica factor is 1。collection shard number is
>   60,currently each shard is 20~30GB。
> 15 fields per document。 Query rate is slow now,maybe 100-500 requests per
> second.

That is NOT a slow query rate.  In the recent past, I was the
administrator of a Solr install.  When things got *REALLY BUSY*, the
servers would see as many as five requests per second.  Usually the
request rate was less than one per second.  A high request rate can
drastically impact overall performance.

I have heard of big Solr installs that handle thousands of requests per
second, which is certainly larger than yours ... but 100-500 is NOT
slow.  I'm surprised that you can get acceptable performance on an index
that big, with that many queries, and only six machines.  Congratulations.

Despite appearances, I wasn't actually asking you for this information.
I was telling you that those things would all be factors in the decision
about how many shards you should have. Perhaps I should have worded the
message differently.

See this page for a discussion about how total memory size and index
size affect performance:

https://wiki.apache.org/solr/SolrPerformanceProblems

Thanks,
Shawn


RE: 20180917-Need Apache SOLR support

Liu, Daphne
In reply to this post by zhenyuan wei
You have to increase your RAM. We have upgraded our Solr cluster to 12 Solr nodes, each with 64 GB RAM. Our shard size is around 25 GB, and each server hosts only one shard (leader or replica). Performance is very good.
For better performance, memory needs to be larger than your shard size.
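That rule of thumb (RAM beyond the Java heap should cover the index data on the node, so the OS page cache can hold it) can be sketched as a quick check. The heap figures below are illustrative assumptions, not measurements, and the rule itself is a heuristic, not a guarantee:

```python
def cache_headroom_ok(ram_gb, solr_heap_gb, index_on_node_gb):
    # Heuristic: the RAM left over after the Java heap should be at least
    # the size of the index data hosted on the node, so the OS page cache
    # can hold (most of) the index.
    return ram_gb - solr_heap_gb >= index_on_node_gb

# One ~25 GB shard on a 64 GB server (assuming, say, an 8 GB heap): fits.
print(cache_headroom_ok(64, 8, 25))   # True
# Ten ~25 GB shards (~250 GB) on one node leaves far too little headroom.
print(cache_headroom_ok(64, 8, 250))  # False
```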


Kind regards,

Daphne Liu
BI Architect • Big Data - Matrix SCM

CEVA Logistics / 10751 Deerwood Park Blvd, Suite 200, Jacksonville, FL 32256 USA / www.cevalogistics.com
T 904.9281448 / F 904.928.1525 / [hidden email]

Making business flow

-----Original Message-----
From: zhenyuan wei <[hidden email]>
Sent: Tuesday, September 18, 2018 3:12 AM
To: [hidden email]
Subject: Re: 20180917-Need Apache SOLR support


Shawn Heisey <[hidden email]> wrote on Tuesday, September 18, 2018 at 12:07 PM:



Re: [OT] 20180917-Need Apache SOLR support

Christopher Schultz
In reply to this post by Walter Underwood

Walter,

On 9/17/18 11:39, Walter Underwood wrote:
> Do not use Solr as a database. It was never designed to be a
> database. It is missing a lot of features that are normal in
> databases.
>
> [...] * no real backups (Solr backup is a cold server, not a
> dump/load)

I'm just curious... if Solr has "no real backups", why is there a
complete client API for performing backups and restores?

https://lucene.apache.org/solr/guide/7_4/making-and-restoring-backups.html

Thanks,
-chris

Re: [OT] 20180917-Need Apache SOLR support

Walter Underwood
It isn’t very clear from that page, but the two backup methods make a copy
of the indexes in a commit-aware way. That is all. One method copies them
to a new server, the other to files in the data directory.

Database backups generally have a separate backup format which is
independent of the database version. For example, mysqldump generates
a backup as SQL statements.

The Solr backup is version-locked, because it is just a copy of the index files.
People who are used to database backups might be very surprised when they
could not load a Solr backup into a server with a different version or on a
different architecture.

The only version-independent restore in Solr is to reload the data from the
source repository.

wunder
Walter Underwood
[hidden email]
http://observer.wunderwood.org/  (my blog)

> On Sep 18, 2018, at 8:15 AM, Christopher Schultz <[hidden email]> wrote:
>
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
>
> Walter,
>
> On 9/17/18 11:39, Walter Underwood wrote:
>> Do not use Solr as a database. It was never designed to be a
>> database. It is missing a lot of features that are normal in
>> databases.
>>
>> [...] * no real backups (Solr backup is a cold server, not a
>> dump/load)
>
> I'm just curious... if Solr has "no real backups", why is there a
> complete client API for performing backups and restores?
>
> https://lucene.apache.org/solr/guide/7_4/making-and-restoring-backups.html
>
> Thanks,
> - -chris

Reply | Threaded
Open this post in threaded view
|

Re: [OT] 20180917-Need Apache SOLR support

Christopher Schultz

Walter,

On 9/18/18 11:24, Walter Underwood wrote:

> It isn’t very clear from that page, but the two backup methods make
> a copy of the indexes in a commit-aware way. That is all. One
> method copies them to a new server, the other to files in the data
> directory.
>
> Database backups generally have a separate backup format which is
> independent of the database version. For example, mysqldump
> generates a backup as SQL statements.
>
> The Solr backup is version-locked, because it is just a copy of the
> index files. People who are used to database backups might be very
> surprised when they could not load a Solr backup into a server with
> a different version or on a different architecture.
>
> The only version-independent restore in Solr is to reload the data
> from the source repository.

Thanks for the explanation.

We recently re-built from source and it took about 10 minutes. If we
can get better performance for a restore starting with a "backup"
(which is likely), we'll probably go ahead and do that, with the
understanding that the ultimate fallback is reload-from-source.

When upgrading to a new version of Solr, what are the rules for when
you have to discard your whole index and reload from source? We have
been in the 7.x line since we began development and testing and have
not had any reason to reload from source so far. (Well, except when we
had to make schema changes.)

Thanks,
- -chris

>> On Sep 18, 2018, at 8:15 AM, Christopher Schultz
>> <[hidden email]> wrote:
>>
> Walter,
>
> On 9/17/18 11:39, Walter Underwood wrote:
>>>> Do not use Solr as a database. It was never designed to be a
>>>> database. It is missing a lot of features that are normal in
>>>> databases.
>>>>
>>>> [...] * no real backups (Solr backup is a cold server, not a
>>>> dump/load)
>
> I'm just curious... if Solr has "no real backups", why is there a
> complete client API for performing backups and restores?
>
> https://lucene.apache.org/solr/guide/7_4/making-and-restoring-backups.html
>
>
> Thanks, -chris
>
>
Reply | Threaded
Open this post in threaded view
|

Re: [OT] 20180917-Need Apache SOLR support

Erick Erickson
The only hard-and-fast rule is that you must re-index from source when
you upgrade to Solr X+2. Solr (well, Lucene) tries very hard to
maintain one-major-version back-compatibility, so Solr 8 will function
with Solr 7 indexes but _not_ any index _ever touched_ by 6x.

That said, it's usually a good idea to re-index anyway when jumping a
major version (say Solr 7 -> Solr 8) if possible.
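
That hard-and-fast rule can be sketched as a tiny check (the version numbers
here are illustrative; what matters in practice is the *oldest* major version
that ever touched the index, not the version you most recently ran):

```shell
# Sketch of the Lucene back-compatibility rule: an index is readable only by
# the major version that created it (N) or the next one (N+1), never N+2.
# Succeeds (exit 0) when an index whose oldest touching major version is $1
# is readable by major version $2.
index_readable() {
  created=$1
  target=$2
  diff=$((target - created))
  [ "$diff" -ge 0 ] && [ "$diff" -le 1 ]
}

index_readable 7 8 && echo "Solr 8 can open a Solr 7 index"
index_readable 6 8 || echo "Solr 8 cannot open an index ever touched by 6.x: re-index"
```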

Best,
Erick
On Tue, Sep 18, 2018 at 11:22 AM Christopher Schultz
<[hidden email]> wrote:

>
> Walter,
>
> On 9/18/18 11:24, Walter Underwood wrote:
> > It isn’t very clear from that page, but the two backup methods make
> > a copy of the indexes in a commit-aware way. That is all. One
> > method copies them to a new server, the other to files in the data
> > directory.
> >
> > Database backups generally have a separate backup format which is
> > independent of the database version. For example, mysqldump
> > generates a backup as SQL statements.
> >
> > The Solr backup is version-locked, because it is just a copy of the
> > index files. People who are used to database backups might be very
> > surprised when they could not load a Solr backup into a server with
> > a different version or on a different architecture.
> >
> > The only version-independent restore in Solr is to reload the data
> > from the source repository.
>
> Thanks for the explanation.
>
> We recently re-built from source and it took about 10 minutes. If we
> can get better performance for a restore starting with a "backup"
> (which is likely), we'll probably go ahead and do that, with the
> understanding that the ultimate fallback is reload-from-source.
>
> When upgrading to a new version of Solr, what are the rules for when
> you have to discard your whole index and reload from source? We have
> been in the 7.x line since we began development and testing and have
> not had any reason to reload from source so far. (Well, except when we
> had to make schema changes.)
>
> Thanks,
> - -chris
>
> >> On Sep 18, 2018, at 8:15 AM, Christopher Schultz
> >> <[hidden email]> wrote:
> >>
> > Walter,
> >
> > On 9/17/18 11:39, Walter Underwood wrote:
> >>>> Do not use Solr as a database. It was never designed to be a
> >>>> database. It is missing a lot of features that are normal in
> >>>> databases.
> >>>>
> >>>> [...] * no real backups (Solr backup is a cold server, not a
> >>>> dump/load)
> >
> > I'm just curious... if Solr has "no real backups", why is there a
> > complete client API for performing backups and restores?
> >
> > https://lucene.apache.org/solr/guide/7_4/making-and-restoring-backups.html
> >
> >
> > Thanks, -chris
> >
> >
Reply | Threaded
Open this post in threaded view
|

Re: [OT] 20180917-Need Apache SOLR support

Jan Høydahl / Cominvent
In reply to this post by Walter Underwood
I guess you could do a version-independent backup with the /export handler and store
docs in XML or JSON format. Or you could use streaming expressions and store the entire
index as JSON tuples, which could then be ingested into another version.

But it is correct that the backup/restore feature of Solr is not primarily intended for archival
or moving a collection to a completely different version. It is primarily intended as a
much faster disaster recovery method than reindex from slow sources. But you COULD
also use it to quickly migrate from an old cluster to the next major version.

It would be cool to investigate an alternate backup command, which instructs each shard
leader to stream all documents to JSON inside the backup folder, in parallel. But you may
still get issues with the ZooKeeper part if restoring to a very different version.
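
A minimal sketch of the /export idea (host, collection, and field names are
placeholders; /export requires a sort, and the sort field and all exported
fields must have docValues enabled):

```shell
# Sketch: dump a collection as version-independent JSON via /export.
# Collection and field names below are illustrative placeholders.
COLL="mycoll"
EXPORT_URL="http://localhost:8983/solr/${COLL}/export?q=*:*&sort=id+asc&fl=id,title"
echo "$EXPORT_URL"
# curl "$EXPORT_URL" > "${COLL}-dump.json"   # run against a live node
```

The resulting JSON can then be re-ingested into a cluster of any version,
which sidesteps the index-format lock entirely, at the cost of a full re-index.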

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> 18. sep. 2018 kl. 17:24 skrev Walter Underwood <[hidden email]>:
>
> It isn’t very clear from that page, but the two backup methods make a copy
> of the indexes in a commit-aware way. That is all. One method copies them
> to a new server, the other to files in the data directory.
>
> Database backups generally have a separate backup format which is
> independent of the database version. For example, mysqldump generates
> a backup as SQL statements.
>
> The Solr backup is version-locked, because it is just a copy of the index files.
> People who are used to database backups might be very surprised when they
> could not load a Solr backup into a server with a different version or on a
> different architecture.
>
> The only version-independent restore in Solr is to reload the data from the
> source repository.
>
> wunder
> Walter Underwood
> [hidden email]
> http://observer.wunderwood.org/  (my blog)
>
>> On Sep 18, 2018, at 8:15 AM, Christopher Schultz <[hidden email]> wrote:
>>
>> Walter,
>>
>> On 9/17/18 11:39, Walter Underwood wrote:
>>> Do not use Solr as a database. It was never designed to be a
>>> database. It is missing a lot of features that are normal in
>>> databases.
>>>
>>> [...] * no real backups (Solr backup is a cold server, not a
>>> dump/load)
>>
>> I'm just curious... if Solr has "no real backups", why is there a
>> complete client API for performing backups and restores?
>>
>> https://lucene.apache.org/solr/guide/7_4/making-and-restoring-backups.html
>>
>> Thanks,
>> - -chris
>