Data duplication using Cloud+HDFS+Mirroring

Greg Walters
While testing Solr's new ability to store its data and transaction log directories in HDFS, I added an additional core to one of my test servers, configured as a backup (active but not leader) core for a shard hosted elsewhere. It looks like this extra core copies the data into its own directory rather than using the existing directory, whose data is already available to it.

Since HDFS likely already covers redundancy of the data through its replication factor, is there a reason for non-leader cores to create their own data directories rather than reading from the existing master copy? I searched Jira for anything suggesting this behavior might change and didn't find any issues. Is there any intent to address this?

Thanks,
Greg
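
To put rough numbers on the duplication described above, here is a minimal sketch in Python. It assumes each Solr core (leader or not) writes its own full copy of the shard's index into HDFS, as observed in the test, and that HDFS then replicates every block according to its own replication factor. The function names and the 10 GiB index size are hypothetical, chosen only for illustration; 3 is the stock HDFS default for dfs.replication.

```python
def physical_copies(solr_cores: int, hdfs_replication: int) -> int:
    """Number of block-level copies of one shard's index held in HDFS.

    Assumes every Solr core for the shard keeps its own index directory
    (the behavior observed in the test above), so the two replication
    layers multiply rather than overlap.
    """
    if solr_cores < 1 or hdfs_replication < 1:
        raise ValueError("both counts must be at least 1")
    return solr_cores * hdfs_replication


def storage_gib(index_gib: float, solr_cores: int, hdfs_replication: int) -> float:
    """Raw HDFS capacity consumed by one shard, in GiB."""
    return index_gib * physical_copies(solr_cores, hdfs_replication)


if __name__ == "__main__":
    # Hypothetical shard: 10 GiB index, one leader plus one backup core,
    # HDFS replication factor of 3.
    copies = physical_copies(solr_cores=2, hdfs_replication=3)
    total = storage_gib(index_gib=10, solr_cores=2, hdfs_replication=3)
    print(f"{copies} copies, {total} GiB")
```

Under these assumptions a single backup core doubles the raw HDFS footprint of the shard, which is the overhead the question is asking about.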
Re: Data duplication using Cloud+HDFS+Mirroring

Isaac Hebsh
Hi Greg, did you get an answer?
I'm interested in the same question.

More generally, what are the benefits of HdfsDirectoryFactory, besides the
transparent restoration of shard contents after a disk failure and
the ability to rebuild the index using MapReduce?
Is the following statement accurate? Blocks of a particular shard that are
replicated to another node will never be queried, since no Solr
core is configured to read them.


On Wed, Aug 7, 2013 at 8:46 PM, Greg Walters
<[hidden email]> wrote:

> While testing Solr's new ability to store data and transaction directories
> in HDFS I added an additional core to one of my testing servers that was
> configured as a backup (active but not leader) core for a shard elsewhere.
> It looks like this extra core copies the data into its own directory rather
> than just using the existing directory with the data that's already
> available to it.
>
> Since HDFS likely already has redundancy of the data covered via the
> replicationFactor is there a reason for non-leader cores to create their
> own data directory rather than doing reads on the existing master copy? I
> searched Jira for anything that suggests this behavior might change and
> didn't find any issues; is there any intent to address this?
>
> Thanks,
> Greg
>