anyone use hadoop+solr?

anyone use hadoop+solr?

James liu-2
Can you talk about it?

Maybe I will use Hadoop + Solr.

Thanks for your advice.



--
regards
j.L

Re: anyone use hadoop+solr?

Muneeb Ali
Hey James,

Just wondering if you ever had a chance to try out Hadoop with Solr? I would appreciate any information/directions you could give.

I am particularly interested in indexing using a MapReduce job.

Cheers,
-Ali

Re: anyone use hadoop+solr?

Marc Sturlese
I think there are people using this patch in production: https://issues.apache.org/jira/browse/SOLR-1301
I have tested it myself, indexing data from CSV and from HBase, and it works properly.

Re: anyone use hadoop+solr?

Muneeb Ali
Thanks Marc,

Well, I have an HBase storage architecture and a Solr master-slave setup with two slave servers.

Would this patch work with my setup? Do I need sharding in place? And what tasks would be run in the map and reduce phases?

I was thinking something like:

At Map: read documents as key/value pairs, convert each to a SolrInputDocument, and add it to the server.
At Reduce: merge indexes? and commit/optimize?

Also, are there any quick guidelines on how to get started with this setup? I am new to Hadoop as well as fairly new to Solr.

Appreciate your help,
-A


Re: anyone use hadoop+solr?

Blargy
Neeb,

Seems like we are in the same boat. Our index consists of 5M records, which comes to roughly 30 gigs. All in all that's not too bad; however, our indexing process (we use DIH, but I'm now revisiting that idea) takes a whopping 30+ hours!!!

I just bought the Hadoop in Action early edition but haven't had time to read it yet. I was wondering what resources you are using to learn Hadoop and, more importantly, its applications to Solr. Would you mind explaining in more detail how you plan to use Hadoop?

Re: anyone use hadoop+solr?

Marc Sturlese
In reply to this post by Muneeb Ali
Well, the patch consumes the data from a CSV. You have to modify the input to use TableInputFormat (I don't remember if it's called exactly like that) and it will work.
Once you've done that, you have to specify as many reducers as shards you want.
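
Roughly, the job setup might look like this (a sketch only: the table name and shard count below are examples, not taken from the patch, and the mapper is omitted):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.mapreduce.Job;

public class ShardIndexDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set(TableInputFormat.INPUT_TABLE, "documents"); // example table name
        Job job = new Job(conf, "solr-shard-indexing");
        job.setJarByClass(ShardIndexDriver.class);
        job.setInputFormatClass(TableInputFormat.class); // instead of the CSV input
        // Mapper omitted: it would turn each HBase Result into an id/text pair.
        job.setNumReduceTasks(4); // one reducer per target shard
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}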

I know two ways to index using Hadoop.

Method 1 (SOLR-1301 & Nutch):
- Map: just get data from the source and create key-value pairs
- Reduce: does the analysis and indexes the data
So the index is built on the reducer side.

Method 2 (Hadoop Lucene index contrib):
- Map: does analysis and opens an IndexWriter to add docs
- Reduce: merges the small indexes built in the map
So the indexes are built on the map side.
Method 2 has no good integration with Solr at the moment.

In the JIRA (SOLR-1301) there's a good explanation of the advantages and disadvantages of indexing on the map or reduce side. I recommend you read all the comments on the JIRA in detail to know exactly how it works.
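
To make the reduce-side flow (method 1) concrete, here is a bare-bones sketch in plain Hadoop + SolrJ terms. The field names and the embedded-core setup are assumptions, and the actual patch ships its own output-format plumbing:

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.core.CoreContainer;

public class IndexReducer extends Reducer<Text, Text, Text, Text> {

    private SolrServer solr;

    @Override
    protected void setup(Context ctx) throws IOException {
        try {
            // Assumption: a Solr home (solr.xml + conf/) has been shipped to
            // the task, e.g. via the DistributedCache.
            System.setProperty("solr.solr.home",
                    ctx.getConfiguration().get("solr.home"));
            CoreContainer container = new CoreContainer.Initializer().initialize();
            solr = new EmbeddedSolrServer(container, "");
        } catch (Exception e) {
            throw new IOException(e);
        }
    }

    @Override
    protected void reduce(Text id, Iterable<Text> values, Context ctx)
            throws IOException {
        for (Text value : values) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", id.toString());      // field names are assumptions
            doc.addField("body", value.toString());
            try {
                solr.add(doc); // analysis and indexing happen here, reduce-side
            } catch (Exception e) {
                throw new IOException(e);
            }
        }
    }

    @Override
    protected void cleanup(Context ctx) throws IOException {
        try {
            solr.commit(); // one commit per reducer == one shard per reducer
        } catch (Exception e) {
            throw new IOException(e);
        }
    }
}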


Re: anyone use hadoop+solr?

Muneeb Ali
In reply to this post by Blargy
Hi Blargy,

Nice to hear that I am not alone ;)

Well, we have been using Hadoop for other data-intensive services, the kind that can be done in parallel. We have multiple nodes, which Hadoop uses for all our MapReduce jobs. I personally don't have much experience with it, and hence wouldn't be able to help you much with that.

Our indexing takes 6+ hours to index 15 million documents (using SolrJ's StreamingUpdateSolrServer). I wanted to explore Hadoop for this task, as it can be done in parallel.
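
For reference, the client-side pattern looks roughly like this (the URL, queue size, thread count, and field names are illustrative):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BulkIndexer {
    public static void main(String[] args) throws Exception {
        // Buffers up to 1000 docs and sends them with 4 background threads.
        SolrServer solr = new StreamingUpdateSolrServer(
                "http://localhost:8983/solr", 1000, 4);
        for (int i = 0; i < 1000; i++) { // stand-in for the real document source
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", Integer.toString(i));
            doc.addField("text", "document body " + i);
            solr.add(doc); // queued; sent asynchronously by the background threads
        }
        solr.commit(); // make everything visible in one go at the end
    }
}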

I have just started investigating this, and will keep this post updated if I find anything helpful.
 
-Neeb

Re: anyone use hadoop+solr?

Jason Rutherglen
In reply to this post by Muneeb Ali
We (Attensity Group) have been using SOLR-1301 for 6+ months now
because we have a ready Hadoop cluster and need to be able to re/index
up to 3 billion docs.  I read the various emails and wasn't sure what
you're asking.

Cheers...

On Tue, Jun 22, 2010 at 8:27 AM, Neeb <[hidden email]> wrote:

>
> Hey James,
>
> Just wondering if you ever had a chance to try out Hadoop with Solr? I would
> appreciate any information/directions you could give.
>
> I am particularly interested in indexing using a MapReduce job.
>
> Cheers,
> -Ali

Re: anyone use hadoop+solr?

Blargy
In reply to this post by Muneeb Ali
Muneeb Ali wrote:
> Our indexing takes 6+ hours to index 15 million documents (using SolrJ's
> StreamingUpdateSolrServer). I wanted to explore Hadoop for this task, as
> it can be done in parallel.
Would you mind explaining how your full indexing strategy is implemented using the StreamingUpdateSolrServer? I am currently only familiar with using the DataImportHandler. Thanks.

Re: anyone use hadoop+solr?

Otis Gospodnetic-2
In reply to this post by Marc Sturlese
Marc is referring to the very informative thread by Ted Dunning from maybe a month or so ago.

For what it's worth, we just used Hadoop Streaming, JRuby, and EmbeddedSolr to speed up indexing by parallelizing it.
 Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/




Re: anyone use hadoop+solr?

Marc Sturlese
Hi Otis, just out of curiosity, which strategy do you use? Do you index on the map or the reduce side?
Do you use it to build shards or a single monolithic index?
Thanks

Re: anyone use hadoop+solr?

Otis Gospodnetic-2
Marc,

In the map phase, purposely ending up with lots of smaller indices/shards at the end of the whole MapReduce job.
 Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/




Re: anyone use hadoop+solr?

MitchK
Hi,

This topic started a few months ago; however, there are some questions on my side that I couldn't answer by looking at the SOLR-1301 issue or the wiki pages.

Let me try to explain my thoughts:
Given: a Hadoop cluster, a Solr search cluster, and Nutch as a crawling engine that also performs LinkRank and webgraph-related tasks.

Once a list of documents is created by Nutch, you put the list plus the LinkRank values etc. into a Solr+Hadoop job, as described in SOLR-1301, to index or reindex the given documents.
When the shards are built, they are sent over the network to the Solr search cluster.
Is this description correct?

What puzzles me is this:
Assume I have a document X on machine Y in shard Y.
When I reindex that document X together with lots of other documents that may or may not be present in shard Y, and I put the resulting shard on a machine Z, how does machine Y notice that it has an older version of document X than machine Z?

Furthermore, assume that shard Y was replicated to three other machines; how do they all notice that their version of document X is not the newest available one?
In such an environment we do not have a master (right?), so: how do we keep the index as consistent as possible?

Thank you for clarifying.

Kind regards

Re: anyone use hadoop+solr?

Andrzej Bialecki
On 2010-09-04 19:53, MitchK wrote:

>
> Hi,
>
> This topic started a few months ago; however, there are some questions on
> my side that I couldn't answer by looking at the SOLR-1301 issue or the
> wiki pages.
>
> Let me try to explain my thoughts:
> Given: a Hadoop cluster, a Solr search cluster, and Nutch as a
> crawling engine that also performs LinkRank and webgraph-related tasks.
>
> Once a list of documents is created by Nutch, you put the list plus the
> LinkRank values etc. into a Solr+Hadoop job, as described in
> SOLR-1301, to index or reindex the given documents.

There is no out-of-the-box integration between Nutch and SOLR-1301, so
there is some step that you omitted from this chain... e.g. "export from
Nutch segments to CSV".


> When the shards are built, they are sent over the network to the
> Solr search cluster.
> Is this description correct?

Not really. SOLR-1301 doesn't deal with how you deploy the results of
indexing; it simply creates the shards on HDFS. It creates the index
data but doesn't deal with serving it...

>
> What puzzles me is this:
> Assume I have a document X on machine Y in shard Y.
> When I reindex that document X together with lots of other documents that
> may or may not be present in shard Y, and I put the resulting shard on a
> machine Z, how does machine Y notice that it has an older version of
> document X than machine Z?
>
> Furthermore, assume that shard Y was replicated to three other
> machines; how do they all notice that their version of document X is not
> the newest available one?
> In such an environment we do not have a master (right?), so: how do we
> keep the index as consistent as possible?

It's not possible to do it like this, at least for now...

Looking into the future: eventually, when SolrCloud arrives we will be
able to index straight to a SolrCloud cluster, assigning documents to
shards through a hashing schema (e.g. 'md5(docId) % numShards'). Since
shards would be created in a consistent way, then newer versions of
documents would end up in the same shards and they would replace the
older versions of the same documents - thus the problem would be solved.
An additional benefit of this model is that it's not a disruptive and
copy-intensive operation like SOLR-1301 (where you have to do "create
new indexes, deploy them and switch") but rather a regular online update
that is already supported in Solr.
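
As a rough sketch (the class and method names below are mine, not Solr's):

import java.math.BigInteger;
import java.security.MessageDigest;

public class ShardMapper {
    // Maps a document id to a shard number via md5(docId) % numShards.
    public static int shardFor(String docId, int numShards) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        BigInteger hash = new BigInteger(1, md5.digest(docId.getBytes("UTF-8")));
        return hash.mod(BigInteger.valueOf(numShards)).intValue();
    }
}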

Once this is in place, we can modify Nutch to send documents directly to
a SolrCloud cluster. Until then, you need to build and deploy indexes
more or less manually (or using Katta, but again Katta is not integrated
with Nutch).

SolrCloud is not far away from hitting the trunk (right, Mark? ;) ), so
medium-term I think this is your best bet.

--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: anyone use hadoop+solr?

MitchK
Thanks for your detailed feedback, Andrzej!

From what I understood, SOLR-1301 becomes obsolete once Solr becomes cloud-ready, right?

> Looking into the future: eventually, when SolrCloud arrives we will be
> able to index straight to a SolrCloud cluster, assigning documents to
> shards through a hashing schema (e.g. 'md5(docId) % numShards')
Hm, let's say md5(docId) produces a value of 17 (it won't, but let's assume it).
If I have a constant number of shards, the doc will be published to the same shard again and again.

i.e.: 17 % numShards(5) = 2 -> the doc will be indexed at shard 2.

A few days later the rest of the cluster is available, and now it looks like this:

17 % numShards(10) = 7 -> the doc will be indexed at shard 7... and what about the older version at shard 2? I am no expert when it comes to cloud computing and the other stuff.
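
Or, in code (17 is just a stand-in hash value):

public class ReshardDemo {
    public static void main(String[] args) {
        int hash = 17;                 // stand-in for md5(docId)
        System.out.println(hash % 5);  // 2 -> doc lands in shard 2
        System.out.println(hash % 10); // 7 -> doc moves to shard 7; shard 2
                                       //      still holds the stale copy
    }
}
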
If you can point me to a reference or two where I can read about it, that would help me a lot; at the moment I just want to understand how it works.

The problem with Solr is its lack of documentation in some classes and the lack of encapsulation of some very complex things into separate methods or extra classes. Of course, this is because doing so costs extra time, but it makes understanding and modifying things very complicated if you do not understand what's going on from a theoretical point of view.

Since the cloud feature will be complex, a lack of documentation and no understanding of the theory behind the code will make contributing back very, very complicated.

Thank you :-)
- Mitch

Re: anyone use hadoop+solr?

Yonik Seeley-2-2
On Mon, Sep 6, 2010 at 8:37 AM, MitchK <[hidden email]> wrote:
> 17 % numShards(10) = 7 -> the doc will be indexed at shard 7... and what
> about the older version at shard 2? I am no expert when it comes to
> cloud computing and the other stuff.
> If you can point me to a reference or two where I can read about it,
> that would help me a lot; at the moment I just want to understand how it
> works.

SolrCloud hasn't done anything on the indexing front yet (it still
relies on user partitioning).
The problem you point out is why a simple hash won't work well.  When
we do support indexing, we'll be able to figure out what document goes
to what shard by consulting the state we keep in ZooKeeper.

-Yonik
http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8

SolrCloud distributed indexing (Re: anyone use hadoop+solr?)

Andrzej Bialecki
In reply to this post by MitchK
(I adjusted the subject to better reflect the content of this discussion).

On 2010-09-06 14:37, MitchK wrote:
>
> Thanks for your detailed feedback, Andrzej!
>
> From what I understood, SOLR-1301 becomes obsolete once Solr becomes
> cloud-ready, right?

Who knows... I certainly didn't expect this code to become so popular ;)
so even after SolrCloud becomes available it's likely that some people
will continue to use it. But SolrCloud should solve the original problem
that I tried to solve with this patch.

>> Looking into the future: eventually, when SolrCloud arrives we will be
>> able to index straight to a SolrCloud cluster, assigning documents to
>> shards through a hashing schema (e.g. 'md5(docId) % numShards')
>>
> Hm, let's say md5(docId) produces a value of 17 (it won't, but
> let's assume it).
> If I have a constant number of shards, the doc will be published to the same
> shard again and again.
>
> i.e.: 17 % numShards(5) = 2 ->  the doc will be indexed at shard 2.
>
> A few days later the rest of the cluster is available, and now it looks
> like this:
>
> 17 % numShards(10) = 7 ->  the doc will be indexed at shard 7... and what
> about the older version at shard 2? I am no expert when it comes to
> cloud computing and the other stuff.

There are several possible solutions to this, and they all boil down to
the way you assign documents to shards... Keep in mind that nodes
(physical machines) can manage several shards, and the aggregate
collection of all unique shards across all nodes forms your whole index
- so there's also a related, but different issue, of how to assign
shards to nodes.

Here are some scenarios for how you can solve the doc-to-shard mapping
problem (note: I removed the issue of replication from the picture to
make this clearer):

a) keep the number of shards constant no matter how large the
cluster is. The mapping schema is then as simple as the one above. In this
scenario you create relatively small shards, so that a single physical
node can manage dozens of shards (each shard using one core, or perhaps
a more lightweight structure like MultiReader). This is also known as
micro-sharding. As the number of documents grows the size of each shard
will grow until you have to reduce the number of shards per node,
ultimately ending up with a single shard per node. After that, if your
collection continues to grow, you have to modify your hashing schema to
split some shards (and reindex some shards, or use an index splitter tool).

b) use consistent hashing as the mapping schema to assign documents to a
changing number of shards. There are many explanations of this schema on
the net; here's one that is very simple:

http://www.tomkleinpeter.com/2008/03/17/programmers-toolbox-part-3-consistent-hashing/

In this case, you can grow/shrink the number of shards (and their size)
as you see fit, incurring only a small reindexing cost.
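
A toy version of such a ring (the hash function here is just a stand-in for md5, and the virtual-node count is arbitrary):

import java.util.SortedMap;
import java.util.TreeMap;

// Each shard is placed on the ring several times (virtual nodes); a doc
// goes to the first shard clockwise from its own hash. Adding or removing
// a shard only remaps the keys between it and its neighbours.
public class ShardRing {
    private final SortedMap<Integer, String> ring =
            new TreeMap<Integer, String>();

    public void addShard(String shard, int virtualNodes) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.put(hash(shard + "#" + i), shard);
        }
    }

    public String shardFor(String docId) {
        int h = hash(docId);
        SortedMap<Integer, String> tail = ring.tailMap(h);
        Integer key = tail.isEmpty() ? ring.firstKey() : tail.firstKey();
        return ring.get(key);
    }

    // Stand-in hash; a real implementation would use md5 or similar.
    private static int hash(String s) {
        return s.hashCode() & 0x7fffffff;
    }
}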

> If you can point me to a reference or two where I can read about it,
> that would help me a lot; at the moment I just want to understand how it
> works.

http://wiki.apache.org/solr/SolrCloud ...

>
> The problem with Solr is its lack of documentation in some classes and the
> lack of encapsulation of some very complex things into separate methods or
> extra classes. Of course, this is because doing so costs extra time,
> but it makes understanding and modifying things very complicated if you do
> not understand what's going on from a theoretical point of view.

In this case the lack of good docs and user-level API can be blamed on
the fact that this functionality is still under heavy development.

>
> Since the cloud feature will be complex, a lack of documentation and no
> understanding of the theory behind the code will make contributing back
> very, very complicated.

For now, yes, it's an issue - though as soon as SolrCloud gets committed
I'm sure people will follow up with user-level convenience components
that will make it easier.


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: anyone use hadoop+solr?

MitchK
In reply to this post by Yonik Seeley-2-2
Yonik,

are there any discussions about SolrCloud indexing?

I would be glad to join them if I can find some interesting papers on the topic.

- Mitch

Re: anyone use hadoop+solr?

Yonik Seeley-2-2
On Mon, Sep 6, 2010 at 9:47 AM, MitchK <[hidden email]> wrote:
> are there any discussions about SolrCloud indexing?

Not recently - personally I've been sidetracked by other stuff.

Mapping docs to shards is the easy part... take a hash of the id, and
then I imagine the shard id (the label for the index) can just be a
range of hashes (i.e. 0x00000000-0xffffffff would be the entire
collection).
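
Sketching that idea (the names and the linear lookup are just for illustration):

import java.util.ArrayList;
import java.util.List;

// Each shard owns a contiguous slice of the unsigned 32-bit hash space.
public class HashRange {
    final long min, max; // inclusive bounds in 0x00000000..0xffffffffL
    final String shardId;

    HashRange(long min, long max, String shardId) {
        this.min = min; this.max = max; this.shardId = shardId;
    }

    static String shardFor(List<HashRange> ranges, long hash) {
        for (HashRange r : ranges) {
            if (hash >= r.min && hash <= r.max) return r.shardId;
        }
        throw new IllegalStateException("no shard covers hash " + hash);
    }

    public static void main(String[] args) {
        List<HashRange> ranges = new ArrayList<HashRange>();
        ranges.add(new HashRange(0x00000000L, 0x7fffffffL, "shard1"));
        ranges.add(new HashRange(0x80000000L, 0xffffffffL, "shard2"));
        System.out.println(shardFor(ranges, 0x9abcdef0L)); // -> shard2
    }
}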

The harder part in all this is actually handling updates.
For high availability, updates need to go to multiple hosts (the
replication factor).
But some hosts may be down, and updates may go to any shard replica.
The shards/nodes need to be able to exchange info and fix things up.

> I would be glad to join them if I can find some interesting papers on the
> topic.

Awesome!  Here's some stuff to browse:
http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html
http://sourceforge.net/mailarchive/forum.php?forum_name=bailey-developers
http://cassandra.apache.org/


-Yonik
http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8

Re: SolrCloud distributed indexing (Re: anyone use hadoop+solr?)

MitchK
In reply to this post by Andrzej Bialecki
Andrzej,

thank you for sharing your experiences.

> b) use consistent hashing as the mapping schema to assign documents to a
> changing number of shards. There are many explanations of this schema on
> the net; here's one that is very simple:

Boom.
With the given explanation, I understand it as follows:
You can use Hadoop and run some MapReduce jobs per CSV file.
On the reducer side, the reducer has to look up the id of the current doc and create a hash of it.
Then it looks inside a SortedSet, picks the next-best server, and checks in a map whether this server has free capacity or not. That's cool.

But it doesn't solve the problem entirely. Correct me if I am wrong, but: if you add a new server, let's call it IP3-1, and IP3-1 is nearer to the current resource X, then doc X will be indexed at IP3-1, even if IP2-1 holds the older version.
Am I right?

Thank you for sharing the paper. I will look for more like it.

> In this case the lack of good docs and user-level API can be blamed on
> the fact that this functionality is still under heavy development.

I don't mean only documentation at the user level, but also inside a class when something complicated is going on.

- Mitch