Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment


Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment

Safdar Kureishy
Hi,

I'm trying to set up a large-scale *Crawl + Index + Search* infrastructure
using Nutch and Solr/Lucene. The targeted scale is *5 billion web pages*,
crawled + indexed every *4 weeks*, with a search latency of less than 0.5
seconds.

Needless to say, the search index needs to scale to 5 billion pages. It
is also possible that I might need to store multiple indexes -- one for
crawled content, and one for ancillary data that is also very large. Each
of these indexes would likely need to be logically distributed and
replicated.

However, I would like such a system to be homogeneous with the Hadoop
infrastructure already installed on the cluster (for the crawl). In other
words, I would much prefer that the replication and distribution of the
Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of
using another scalability framework (such as SolrCloud). In addition, it
would be ideal if this environment were flexible enough to be dynamically
scaled based on the size of the index and the search traffic at the time
(e.g., if it is deployed on an Amazon cluster, it should be easy to
automatically provision additional processing power into the cluster
without requiring server restarts).

However, I'm not sure which Solr-based tool in the Hadoop ecosystem would
be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra,
Lily, ElasticSearch, IndexTank, etc., but I'm really unsure which of these
is mature enough and would be the right architectural choice to go with a
Nutch crawler setup, while also satisfying the dynamic/auto-scaling
requirements above.

Lastly, how much hardware (assuming medium-sized EC2 instances) would you
estimate I'd need for this setup, for regular web data (HTML text) at this
scale?

Any architectural guidance would be greatly appreciated. The more details
provided, the wider my grin :).

Many many thanks in advance.

Thanks,
Safdar

Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment

project2501
You could use SolrCloud (for the automatic scaling) and just mount an
HDFS directory via FUSE [1], then configure Solr to use that directory
for its data.

[1] https://ccp.cloudera.com/display/CDHDOC/Mountable+HDFS
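Roughly, that might look like the following (the hostname, port and paths
are placeholders; the exact package and mount command are in the Cloudera
doc linked above):

  # Mount HDFS as a local filesystem via FUSE
  $ mkdir -p /mnt/hdfs
  $ hadoop-fuse-dfs dfs://namenode:8020 /mnt/hdfs

  # Then point each Solr core's data dir at the mount, e.g. in solrconfig.xml:
  #   <dataDir>/mnt/hdfs/solr/core0/data</dataDir>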


Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment

Safdar Kureishy
Thanks Darren.

Actually, I would like the system to be homogeneous -- i.e., to use
Hadoop-based tools that already provide all the necessary scaling for the
Lucene index (in terms of throughput, latency of writes/reads, etc.).
Since SolrCloud adds its own layer of sharding/replication that is
outside Hadoop, I feel that using SolrCloud would be redundant, and a
step in the opposite direction from what I'm trying to do in the first
place. Or am I mistaken?

Thanks,
Safdar



RE: Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment

Darren Govoni
In reply to this post by Safdar Kureishy
SolrCloud or any other tech-specific replication isn't going to 'just work' with Hadoop replication. But with some significant custom coding, anything should be possible. Interesting idea.


Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment

Otis Gospodnetic-2
In reply to this post by Safdar Kureishy
Hello Ali,

> I'm trying to set up a large-scale *Crawl + Index + Search* infrastructure
> using Nutch and Solr/Lucene. The targeted scale is *5 billion web pages*,
> crawled + indexed every *4 weeks*, with a search latency of less than 0.5
> seconds.


That's fine.  Whether it's doable with any tech will depend on how much hardware you give it, among other things.

> Needless to say, the search index needs to scale to 5 billion pages. It
> is also possible that I might need to store multiple indexes -- one for
> crawled content, and one for ancillary data that is also very large. Each
> of these indexes would likely need to be logically distributed and
> replicated.


Yup, OK.

> However, I would like such a system to be homogeneous with the Hadoop
> infrastructure already installed on the cluster (for the crawl). In other
> words, I would much prefer that the replication and distribution of the
> Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of
> using another scalability framework (such as SolrCloud). In addition, it
> would be ideal if this environment were flexible enough to be dynamically
> scaled based on the size of the index and the search traffic at the time
> (e.g., if it is deployed on an Amazon cluster, it should be easy to
> automatically provision additional processing power into the cluster
> without requiring server restarts).


There is no such thing just yet.
There is no Search+Hadoop/HDFS in a box just yet.  There was an attempt to automatically index HBase content, but that was either not completed or not committed into HBase.

> However, I'm not sure which Solr-based tool in the Hadoop ecosystem would
> be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra,
> Lily, ElasticSearch, IndexTank, etc., but I'm really unsure which of these
> is mature enough and would be the right architectural choice to go with a
> Nutch crawler setup, while also satisfying the dynamic/auto-scaling
> requirements above.


Here is a summary of all of them:
* Search on HBase - I assume you are referring to the same thing I mentioned above.  Not ready.
* Solandra - uses Cassandra+Solr; plus, DataStax now has a different (commercial) offering that combines search and Cassandra.  Looks good.
* Lily - data stored in an HBase cluster gets indexed to separate Solr instance(s) on the side.  Not really integrated the way you want it to be.
* ElasticSearch - solid at this point, the most dynamic solution today, can scale well (we are working on a maaaany-B-documents index and hundreds of nodes with ElasticSearch right now), etc.  But again, not integrated with Hadoop the way you want it.
* IndexTank - has some technical weaknesses and is not integrated with Hadoop; not sure about its future considering LinkedIn uses Zoie and Sensei already.
* And there is SolrCloud, which is coming soon and will be solid, but is again not integrated.

If I were you and I had to pick today, I'd pick ElasticSearch if I were completely open.  If I had a Solr bias, I'd give SolrCloud a try first.

> Lastly, how much hardware (assuming medium-sized EC2 instances) would you
> estimate I'd need for this setup, for regular web data (HTML text) at this
> scale?

I don't know off the top of my head, but I'm guessing several hundred instances for serving search requests.

HTH,

Otis
--
Search Analytics - http://sematext.com/search-analytics/index.html

Scalable Performance Monitoring - http://sematext.com/spm/index.html


> Any architectural guidance would be greatly appreciated. The more details
> provided, the wider my grin :).
>
> Many many thanks in advance.
>
> Thanks,
> Safdar
>

Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment

Jan Høydahl / Cominvent
Hi,

For a web crawl+search like this you will probably need a lot of additional Big Data crunching, so a Hadoop-based solution is wise.

In addition to the products mentioned, we also now have Amazon's own CloudSearch (http://aws.amazon.com/cloudsearch/). It's new and not as cool as Solr (it's not even Lucene based), but I guess it gives you the elasticity you're asking for. If you already run your Hadoop cluster in EC2, it would be quite efficient to batch-load the crawled and processed data into a "SearchDomain" in the same availability zone. However, both cost and features may rule it out as a realistic choice for you.

It would be cool to explore a Hadoop/HDFS + SolrCloud integration. SolrCloud would not build the indexes, but would pull pre-built indexes from HDFS down to local disk every time it's told to. Or perhaps the SolrCloud nodes could be part of the Hadoop cluster, responsible for the reduce phase that builds the indexes?
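To make the pull half concrete, here is a minimal, untested sketch using
the stock Hadoop FileSystem and SolrJ CoreAdmin APIs (the HDFS path, local
path and core name are made up, and in practice you'd copy into a fresh
directory and swap it in rather than overwrite a live index):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.client.solrj.request.CoreAdminRequest;

  public class IndexPuller {
    public static void main(String[] args) throws Exception {
      // Copy a pre-built shard from HDFS down to the local index directory
      FileSystem fs = FileSystem.get(new Configuration());
      fs.copyToLocalFile(new Path("/indexes/shard1"),
                         new Path("/var/solr/core0/data/index"));

      // Ask the local Solr to reopen the core so it sees the new segments
      HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");
      CoreAdminRequest.reloadCore("core0", solr);
    }
  }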

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com


Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment

Safdar Kureishy
In reply to this post by Otis Gospodnetic-2
Thanks Otis.

I really appreciate the details offered here. This was very helpful
information.

I'm going to go through Solandra and ElasticSearch and see if those make
sense. I was also given a suggestion to use SolrCloud on FuseDFS (that's
two recommendations for SolrCloud so far), so I will give that a shot when
it is available. However, do you know when SolrCloud is expected to be
available?

Thanks again!

Warm regards,
Safdar




Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment

Jan Høydahl / Cominvent
In reply to this post by project2501
Hi,

This won't give you the performance you need, unless you have enough RAM on the Solr box to cache the whole index in memory.
Have you tested this yourself?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com


Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment

Otis Gospodnetic-2
In reply to this post by Safdar Kureishy
Hello,

Unfortunately, I don't know exactly when the SolrCloud release will be ready, but we've used trunk versions in the past and didn't have major issues.

Otis 
----
Performance Monitoring SaaS for Solr - http://sematext.com/spm/solr-performance-monitoring/index.html




Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment

Lance Norskog-2
It sounds like you really want the final map/reduce phase to put Solr
index files into HDFS. Solr has a feature for this called 'embedded
Solr', which packages Solr as a library instead of an HTTP servlet. The
Solr committers mostly hate it and want it to go away, but it is useful
for exactly this problem.

There is some integration work here: both bolting embedded Solr onto the
Hadoop output libraries and some trickery to write out the HDFS files,
since HDFS is append-only and most of the codecs (Lucene segment formats)
like to seek a lot. Then, at the end, it needs a way to tell SolrCloud
about the files.

If someone wants a great Summer of Code project, Hadoop -> Lucene
indexes -> SolrCloud would be a lot of fun and would make you widely
loved by people with money. I'm not kidding. Do a good job of this, write
clean code, and you'll get offers for very cool jobs.
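
To make the embedded-Solr piece concrete, here is a bare-bones, untested
sketch of what the indexing side of that final phase could look like (the
solr home, core name and fields are placeholders, and the CoreContainer
bootstrap differs a bit between Solr versions):

  import java.io.File;
  import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
  import org.apache.solr.common.SolrInputDocument;
  import org.apache.solr.core.CoreContainer;

  public class EmbeddedIndexer {
    public static void main(String[] args) throws Exception {
      // Boot Solr in-process as a library instead of an HTTP servlet
      String solrHome = "/tmp/solr-home"; // placeholder; a task-local dir in a real job
      CoreContainer container = new CoreContainer();
      container.load(solrHome, new File(solrHome, "solr.xml"));
      EmbeddedSolrServer solr = new EmbeddedSolrServer(container, "core0");

      // In a real reducer this would loop over the values for each key
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "http://example.com/");
      doc.addField("content", "page text goes here");
      solr.add(doc);

      solr.commit();        // flush segments into the core's data dir
      container.shutdown(); // release the index lock before the task exits
    }
  }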


--
Lance Norskog
[hidden email]

Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment

Jason Rutherglen
This was done in SOLR-1301, going on several years ago now.


Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment

Jason Rutherglen
In reply to this post by Otis Gospodnetic-2
One of the big weaknesses of SolrCloud (and ES?) is the lack of the
ability to redistribute shards across servers - meaning, splitting a
shard as it grows too large, while taking live updates.

How do you plan on elastically adding more servers without this feature?

Cassandra and HBase handle elasticity in their own ways: Cassandra has
successfully implemented the Dynamo model, and HBase uses the traditional
BigTable 'split'. Both systems are complex, though each is at a singular
level of maturity.

Also, Cassandra [successfully] implements multiple data center support;
is that available in SolrCloud or ES?


Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment

Jan Høydahl / Cominvent
Hi,

I think the Katta integration is nice, but it is not very real-time. What if you want both?
Perhaps a Katta/SolrCloud integration could make the two frameworks play together, so that some shards in SolrCloud are marked as "static" while others are "realtime". SolrCloud would handle indexing of the realtime shards as it does today, while indexing of the static shards would be handled by Katta. If Katta adds a shard, it tells SolrCloud by updating the ZK tree, and SolrCloud picks up the shard and starts serving search for it.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com


Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment

Otis Gospodnetic-2
In reply to this post by Jason Rutherglen
I think Jason is right - there is no index splitting in ES and SolrCloud, so one has to think ahead, "overshard", and then count on redistributing shards from oversubscribed nodes to other nodes.  No resharding on demand and no index/shard splitting yet.

Otis 
----
Performance Monitoring SaaS for Solr - http://sematext.com/spm/solr-performance-monitoring/index.html



>________________________________
> From: Jason Rutherglen <[hidden email]>
>To: [hidden email]
>Sent: Monday, April 16, 2012 8:42 PM
>Subject: Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment
>
>One of the big weaknesses of Solr Cloud (and ES?) is the lack of the
>ability to redistribute shards across servers - meaning, the ability to
>split a shard that has grown too large while live updates continue.
>
>How do you plan on elastically adding more servers without this feature?
>
>Cassandra and HBase handle elasticity in their own ways.  Cassandra
>has successfully implemented the Dynamo model, and HBase uses the
>traditional BigTable 'split'.  Both systems are complex, though each is
>at a singular level of maturity.
>
>Also, Cassandra [successfully] implements multiple data center support;
>is that available in SC or ES?
>
>On Thu, Apr 12, 2012 at 7:23 PM, Otis Gospodnetic
><[hidden email]> wrote:
>> [quoted text from earlier in the thread trimmed]
Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment

Jason Rutherglen
> redistributing shards from oversubscribed nodes to other nodes

Redistributing shards on a live system is not possible, however, because
the updates in flight will likely be lost.  It is also not simple
technology to build from the ground up.

As it stands today, one would need to schedule downtime; for multi-terabyte
live realtime systems, that is not acceptable and will cause the system
to miss its SLAs.

Solr Cloud seems limited to a simple hashing algorithm for sending
updates to the appropriate shard.  This is precisely what Dynamo (and
Cassandra) solves, e.g. by elastically and dynamically rearranging the
hash 'ring', both logically and physically.

In addition, there is the potential for data loss, which Cassandra has
the technology to guard against.
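
To make that concrete, here is a rough, hypothetical Java sketch of the
kind of hash 'ring' Dynamo-style systems use (String.hashCode() stands in
for a real hash function, and the node names are invented).  The point is
that adding a node only remaps the keys between it and its neighbor on
the ring, instead of reshuffling everything the way hash(key) % numServers
does:

import java.util.Map;
import java.util.TreeMap;

public class HashRing {
    // The sorted map is the "ring": hash positions -> owning node.
    private final TreeMap<Integer, String> ring = new TreeMap<Integer, String>();
    private static final int VNODES = 100;  // virtual nodes per server

    void addNode(String node) {
        for (int i = 0; i < VNODES; i++)
            ring.put((node + "#" + i).hashCode(), node);
    }

    void removeNode(String node) {
        for (int i = 0; i < VNODES; i++)
            ring.remove((node + "#" + i).hashCode());
    }

    // Walk clockwise from the key's hash to the first node position.
    String nodeFor(String key) {
        Map.Entry<Integer, String> e = ring.ceilingEntry(key.hashCode());
        return (e != null ? e : ring.firstEntry()).getValue();  // wrap around
    }

    public static void main(String[] args) {
        HashRing r = new HashRing();
        r.addNode("nodeA");
        r.addNode("nodeB");
        System.out.println("doc42 -> " + r.nodeFor("doc42"));
        r.addNode("nodeC");  // only ~1/3 of keys move, not all of them
        System.out.println("doc42 -> " + r.nodeFor("doc42"));
    }
}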

On Tue, Apr 17, 2012 at 1:33 PM, Otis Gospodnetic
<[hidden email]> wrote:

> [quoted text from earlier in the thread trimmed]
Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment

Lukáš Vlček
In reply to this post by Jason Rutherglen
Hi,

speaking about ES, I think it would be fair to mention that one has to
specify the number of shards upfront, when the index is created - that is
correct. However, it is possible to give an index one or more aliases,
which basically means that you can add new indices on the fly and give
them the same alias, which is then used to search against. Given that you
can add/remove indices, nodes, and aliases on the fly, I think there is a
way to handle a growing data set with ease. If anyone is interested, such
a scenario has been discussed in detail on the ES mailing list.
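
For readers who have not tried this, here is a hedged sketch of the
pattern in plain Java against elasticsearch's REST API. The index and
alias names are invented; the _aliases endpoint and its "add" action are
the standard elasticsearch calls:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class AliasRollover {
    static void send(String method, String path, String body) throws Exception {
        HttpURLConnection c = (HttpURLConnection)
                new URL("http://localhost:9200" + path).openConnection();
        c.setRequestMethod(method);
        c.setDoOutput(true);
        OutputStream os = c.getOutputStream();
        os.write(body.getBytes("UTF-8"));
        os.close();
        System.out.println(method + " " + path + " -> HTTP " + c.getResponseCode());
    }

    public static void main(String[] args) throws Exception {
        // 1. Create a fresh index for the new data (shard count fixed here).
        send("PUT", "/pages_2012_05", "{\"settings\":{\"number_of_shards\":8}}");
        // 2. Add it to the search alias; queries against /pages/_search
        //    transparently fan out to every index behind the alias.
        send("POST", "/_aliases",
             "{\"actions\":[{\"add\":{\"index\":\"pages_2012_05\",\"alias\":\"pages\"}}]}");
    }
}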

Regards,
Lukas

On Tue, Apr 17, 2012 at 2:42 AM, Jason Rutherglen <
[hidden email]> wrote:

> [quoted text from earlier in the thread trimmed]
Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment

Jason Rutherglen
I'm curious how on-the-fly updates are handled as a new shard is added
to an alias.  E.g., how does the system know which shard to send an
update to?

On Tue, Apr 17, 2012 at 4:00 PM, Lukáš Vlček <[hidden email]> wrote:

> [quoted text from earlier in the thread trimmed]
Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment

Lukáš Vlček
AFAIK it cannot. You can only add new shards by creating a new index, and
you will then need to index new data into that new index. Index aliases
are useful mainly for the searching part, so you need to plan for this
when you implement your indexing logic. On the other hand, the query
logic does not need to change, as you only add new indices and give them
all the same alias.
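
In other words, the routing decision lives in the application. A minimal
sketch of what that indexing logic might look like, assuming invented
per-month index names behind a single search alias (the "page" type name
is also made up):

import java.text.SimpleDateFormat;
import java.util.Date;

public class IndexRouter {
    static final String ALIAS = "pages";

    // Writes must name a concrete index -- the application decides which.
    static String writeIndexFor(Date crawlDate) {
        return "pages_" + new SimpleDateFormat("yyyy_MM").format(crawlDate);
    }

    // Reads never change: always the alias, covering every index behind it.
    static String searchEndpoint() {
        return "/" + ALIAS + "/_search";
    }

    public static void main(String[] args) {
        System.out.println("index into: /" + writeIndexFor(new Date()) + "/page");
        System.out.println("search via: " + searchEndpoint());
    }
}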

I am not an expert on this, but I think that index splitting and
re-sharding can be expensive for [near] real-time search systems, and the
point is that you can probably use different techniques to support your
large-scale needs. Index aliasing and routing in elasticsearch can help a
lot in supporting various large-scale data scenarios; check the following
thread on the ES ML for some examples:
https://groups.google.com/forum/#!msg/elasticsearch/49q-_AgQCp8/MRol0t9asEcJ

Just to sum it up: the fact that elasticsearch has a fixed number of
shards per index and does not support resharding or index splitting does
not mean you cannot scale your data easily.

(I was not following this whole thread in every detail, so maybe you have
specific needs that can be solved only by splitting or resharding; in
that case I would recommend asking further questions on the ES ML - I do
not want to get into a system X vs. system Y flame war here...)

Regards,
Lukas

On Wed, Apr 18, 2012 at 2:22 PM, Jason Rutherglen <
[hidden email]> wrote:

> [quoted text from earlier in the thread trimmed]
Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment

Jason Rutherglen
The main point being made is that established NoSQL solutions (e.g.,
Cassandra, HBase, et al.) have solved the update problem (among many
other scalability issues) for several years now.

If an update is being performed and it is not known where the record
lives, the update capability of the system is inefficient.  In addition,
in a production system, the mere possibility of losing data or applying
inaccurate updates is usually a red flag.
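
A small, contrived Java sketch of the difference (all names invented):
under stable hash routing the owner of a record is computed, while under
add-an-index-per-period schemes it has to be discovered, which is exactly
where the inefficiency and the window for lost or duplicated updates
creep in:

import java.util.*;

public class UpdateProblem {
    // With hash routing the owner is computed -- O(1), no lookup needed.
    static String ownerByHash(String docId, List<String> indices) {
        return indices.get((docId.hashCode() & 0x7fffffff) % indices.size());
    }

    // With per-period indices the owner must be discovered, e.g. by
    // probing every index behind the alias until the doc turns up.
    static String ownerBySearch(String docId, List<String> indices,
                                Map<String, Set<String>> contents) {
        for (String idx : indices)
            if (contents.get(idx).contains(docId)) return idx;
        return null;  // not found: updates can be lost or duplicated here
    }

    public static void main(String[] args) {
        List<String> indices = Arrays.asList("pages_2012_03", "pages_2012_04");
        Map<String, Set<String>> contents = new HashMap<String, Set<String>>();
        contents.put("pages_2012_03", new HashSet<String>(Arrays.asList("doc42")));
        contents.put("pages_2012_04", new HashSet<String>());
        System.out.println("hash-routed owner:  " + ownerByHash("doc42", indices));
        System.out.println("searched-for owner: " + ownerBySearch("doc42", indices, contents));
    }
}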

On Wed, Apr 18, 2012 at 6:40 AM, Lukáš Vlček <[hidden email]> wrote:

> [quoted text from earlier in the thread trimmed]