SOLR Data Locality

SOLR Data Locality

Muhammad Imad Qureshi
We have a 30-node Hadoop cluster, and each data node also runs a Solr instance. Data is stored in HDFS. We are adding 10 nodes to the cluster. After adding the nodes, we'll run the HDFS balancer and also create Solr replicas on the new nodes. This will affect data locality. Does this impact how Solr performs if the data is on a remote node?

Thanks
Imad

Re: SOLR Data Locality

Mike Thomsen
I've only ever used the HDFS support with Cloudera's build, but my experience turned me off of HDFS. I'd much rather use the native file system than HDFS.


Re: SOLR Data Locality

Muhammad Imad Qureshi
Hi Mike

I understand that but unfortunately that's not an option right now. We already have 16 TB of index in HDFS.

So let me rephrase the question: how important is data locality for Solr? Is performance impacted if Solr's data is on a remote node?

Thanks
Imad


Re: SOLR Data Locality

Toke Eskildsen
Imad Qureshi <[hidden email]> wrote:
> I understand that but unfortunately that's not an option right now.
> We already have 16 TB of index in HDFS.
>
> So let me rephrase this question. How important is data locality for
> SOLR. Is performance impacted if SOLR data is on a remote node?

The short answer is yes. The long answer is https://lucidworks.com/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

Anecdotally, we ran some experiments before building our multi-TB search setup, comparing local SSDs with remote (Isilon) SSDs. That setup used simple searches and some faceting. I was a bit surprised that the slowdown was only about 3x. I would expect the difference to be even smaller if the underlying storage is slow (spinning disks). Old blog post: https://sbdevel.wordpress.com/2013/12/06/danish-webscale/


I don't understand the expected gain of adding replicas if the data are remote. Why can't the replica Solrs run on the nodes holding the data? Do you have very CPU-intensive searches?

- Toke Eskildsen

Re: SOLR Data Locality

Shawn Heisey
In reply to this post by Muhammad Imad Qureshi
On 3/17/2017 11:14 AM, Imad Qureshi wrote:
> I understand that but unfortunately that's not an option right now. We already have 16 TB of index in HDFS.
>
> So let me rephrase this question. How important is data locality for SOLR. Is performance impacted if SOLR data is on a remote node?

What matters is how fast the data can be retrieved. With standard local filesystems, the operating system uses unallocated memory to cache the data, so if you have enough available memory for that caching to be effective, access is lightning fast: the most requested index data will be in memory and pulled directly from there into the application. If the disk has to be read to obtain the needed data, it will be slow. If data has to be transferred over a network that's gigabit or slower, that is also slow. Faster network technologies are available at a price premium, but if a disk has to be read to get the data, the network speed won't matter. Good performance means avoiding going to the disk or transferring over the network.
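The tiering argument above can be made concrete with some back-of-envelope arithmetic. The throughput figures below are rough, commonly cited ballparks, not measurements from any real cluster:

```python
# Back-of-envelope: seconds to deliver 1 GiB of index data from each tier.
# All throughput numbers are illustrative assumptions.
GIB = 1024 ** 3

tiers_mb_per_s = {
    "page cache (RAM)": 10_000,  # assumed ~10 GB/s
    "local SSD":        500,     # assumed SATA-class SSD, sequential
    "spinning disk":    150,     # assumed sequential; random seeks are far worse
    "gigabit network":  110,     # ~1 Gbit/s minus protocol overhead
}

def seconds_for(bytes_needed, mb_per_s):
    """Time to move bytes_needed at the given throughput (MB/s)."""
    return bytes_needed / (mb_per_s * 1_000_000)

for tier, rate in tiers_mb_per_s.items():
    print(f"{tier:>18}: {seconds_for(GIB, rate):7.3f} s per GiB")
```

Even under these generous sequential-read assumptions, a gigabit link is roughly two orders of magnitude slower than the page cache, which is why cache hits dominate search latency.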

SSD storage is faster than regular disks, but still not as fast as main
memory, and increased storage speed probably won't matter if the network
can't keep up.

If I'm not mistaken, an HDFS client can allocate system memory for caching to avoid slow transfers of frequently requested data. If my understanding is correct, then enough memory allocated to the HDFS client MIGHT avoid network/disk transfers for the important data in the index ... but whether this works in practice is a question I cannot answer.
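For what it's worth, Solr's HdfsDirectoryFactory does expose a block cache along these lines. A hedged solrconfig.xml sketch (the parameter names come from the Solr reference guide's HDFS documentation; the values and the hdfs:// path are placeholders to verify against your own version and cluster):

```xml
<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
  <!-- placeholder path: point at your own NameNode and Solr home -->
  <str name="solr.hdfs.home">hdfs://namenode:8020/solr</str>
  <!-- enable the off-heap block cache for HDFS reads -->
  <bool name="solr.hdfs.blockcache.enabled">true</bool>
  <bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
  <!-- cache size = slab.count x blocksperbank x 8 KB blocks; tune to spare RAM -->
  <int name="solr.hdfs.blockcache.slab.count">1</int>
  <int name="solr.hdfs.blockcache.blocksperbank">16384</int>
</directoryFactory>
```

Whether this cache closes the gap to a local filesystem's page cache in practice is exactly the open question Shawn raises above.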

Unless your 16 TB of index data is spread across MANY Solr servers that each use a very small part of the data and can cache a significant percentage of what they use, it's highly unlikely that you'll have enough memory for good caching. Indexes that large are typically slow unless you can afford a LOT of hardware, which means a lot of money.
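A quick sanity check of that claim for the cluster described in this thread. The per-node cache figure is a hypothetical, chosen only to illustrate the arithmetic:

```python
# Can the cluster's spare RAM hold a useful fraction of a 16 TB index?
index_tb = 16
nodes = 40                  # 30 original nodes + 10 new ones, per the thread
cache_gb_per_node = 64      # hypothetical RAM left over for caching per node

total_cache_tb = nodes * cache_gb_per_node / 1024
fraction_cached = total_cache_tb / index_tb
print(f"cacheable fraction of the index: {fraction_cached:.1%}")
```

With those assumed numbers, only about a sixth of the index fits in cache cluster-wide, which supports Shawn's point: at this scale, either the hot working set must be a small slice of the index, or the hardware bill grows quickly.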

Thanks,
Shawn
