|
can u talk about it ?
maybe i will use hadoop + solr. thks for ur advice. -- regards j.L |
|
Hey James,
Just wondering if you ever had a chance to try out hadoop with solr? Would appreciate any information/directions you could give. I am particularly interested in indexing using a mapreduce job. Cheers, -Ali |
|
I think there's people using this patch in production: https://issues.apache.org/jira/browse/SOLR-1301
I have tested it myself indexing data from CSV and from HBase and it works properly |
|
Thanks Marc,
Well I have an HBASE storage architecture and solr master-slave setup with two slave servers. Would this patch work with my setup? Do I need sharding in place? and what tasks would be run at map and reduce phases? I was thinking something like: At Map: read documents as key/value and convert it to solrInputDoc and add it to the server. At Reduce: merge index? and commit>optimioze? Also is there any quick guidelines on how to get start with this setup? As I am new to hadoop as well as fairly new to Solr. Appreciate your help, -A |
|
Neeb,
Seems like we are in the same boat. Our index consist of 5M records which roughly equals around 30 gigs. All in all thats not too bad however our indexing process (we use DIH but I'm now revisiting that idea) takes a whopping 30+ hours!!! I just bought the Hadoop In Action early edition but haven't had time to read it yet. I was wondering what resources you are using to learn Hadoop and more importantly its applications to Solr. Would you mind explaining your thought process on how you will be using Hadoop in more detail? |
|
In reply to this post by Muneeb Ali
Well, the patch consumes the data from a csv. You have to modify the input to use TableInputFormat (I don't remember if it's called exaclty like that) and it will work.
Once you've done that, you have to specify as much reducers as shards you want. I know 2 ways to index using hadoop method 1 (solr-1301 & nutch): -Map: just get data from the source and create key-value -Reduce: does the analysis and index the data So, the index is build on the reducer side method 2 (hadoop lucene index contrib) -Map: does analysis and open indexWriter to add docs -Reducer: Merge small indexs build in the map So, indexs are build on the map side method 2 has no good integration with Solr at the moment. In the jira (SOLR-1301) there's a good explanation of the advantages and disadvantages of indexing on the map or reduce side. I recomend you to read with detail all the comments on the jira to know exactly how it works. |
|
In reply to this post by Blargy
Hi Blargy,
Nice to hear that I am not alone ;) Well we have been using Hadoop for other data-intensive services, those that can be done in parallel. We have multiple nodes, which are used by Hadoop for all our MapReduce jobs. I personally don't have much experience with its use and hence wouldn't be able to help you much with that. Our indexing takes 6+ hours to index 15 million documents (using solrj.streamUpdateSolrServer). I wanted to explore hadoop for this task, as it can be done in parallel. I have just started investigating into this, will keep this post updated if found anything helpful. -Neeb |
|
In reply to this post by Muneeb Ali
We (Attensity Group) have been using SOLR-1301 for 6+ months now
because we have a ready Hadoop cluster and need to be able to re/index up to 3 billion docs. I read the various emails and wasn't sure what you're asking. Cheers... On Tue, Jun 22, 2010 at 8:27 AM, Neeb <[hidden email]> wrote: > > Hey James, > > Just wondering if you ever had a chance to try out hadoop with solr? Would > appreciate any information/directions you could give. > > I am particularly interested in indexing using a mapreduce job. > > Cheers, > -Ali > -- > View this message in context: http://lucene.472066.n3.nabble.com/anyone-use-hadoop-solr-tp485333p914450.html > Sent from the Solr - User mailing list archive at Nabble.com. > |
|
In reply to this post by Muneeb Ali
Would you mind explaining how your full indexing strategy is implemented using the StreamingUpdateSolrServer? I am currently only familar with using the DataImportHandler. Thanks. |
|
In reply to this post by Marc Sturlese
Marc is referring to the very informative by Ted Dunning from maybe a month or so ago.
For what it's worth, we just used Hadoop Streaming, JRuby, and EmbeddedSolr to speed up indexing by parallelizing it. Otis ---- Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ ----- Original Message ---- > From: Marc Sturlese <[hidden email]> > To: [hidden email] > Sent: Tue, June 22, 2010 12:43:27 PM > Subject: Re: anyone use hadoop+solr? > > Well, the patch consumes the data from a csv. You have to modify the input > to use TableInputFormat (I don't remember if it's called exaclty like that) > and it will work. Once you've done that, you have to specify as much > reducers as shards you want. I know 2 ways to index using > hadoop method 1 (solr-1301 & nutch): -Map: just get data from the > source and create key-value -Reduce: does the analysis and index the > data So, the index is build on the reducer side method 2 (hadoop > lucene index contrib) -Map: does analysis and open indexWriter to add > docs -Reducer: Merge small indexs build in the map So, indexs are build on > the map side method 2 has no good integration with Solr at the > moment. In the jira (SOLR-1301) there's a good explanation of the > advantages and disadvantages of indexing on the map or reduce side. I > recomend you to read with detail all the comments on the jira to know exactly > how it works. -- View this message in context: > href="http://lucene.472066.n3.nabble.com/anyone-use-hadoop-solr-tp485333p914625.html" > target=_blank > >http://lucene.472066.n3.nabble.com/anyone-use-hadoop-solr-tp485333p914625.html Sent > from the Solr - User mailing list archive at Nabble.com. |
|
Hi Otis, just for curiosity, wich strategy do you use? Index in the map or reduce side?
Do you use it to build shards or a single monolitic index? Thanks |
|
Marc,
In Map, purposely ending up with lots of smaller indices/shards at the end of the whole MapReduce job. Otis ---- Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ ----- Original Message ---- > From: Marc Sturlese <[hidden email]> > To: [hidden email] > Sent: Thu, June 24, 2010 8:14:22 AM > Subject: Re: anyone use hadoop+solr? > > Hi Otis, just for curiosity, wich strategy do you use? Index in the map > or reduce side? Do you use it to build shards or a single monolitic > index? Thanks -- View this message in context: > href="http://lucene.472066.n3.nabble.com/anyone-use-hadoop-solr-tp485333p919335.html" > target=_blank > >http://lucene.472066.n3.nabble.com/anyone-use-hadoop-solr-tp485333p919335.html Sent > from the Solr - User mailing list archive at Nabble.com. |
|
Hi,
this topic started a few months ago, however there are some questions from my side, that I couldn't answer by looking at the SOLR-1301-issue nor the wiki-pages. Let me try to explain my thoughts: Given: a Hadoop-cluster, a solr-search-cluster and nutch as a crawling-engine which also performs LinkRank and webgraph-related tasks. Once a list of documents is created by nutch, you put the list + the LinkRank-values etc. into a Solr+Hadoop-job like it is described in Solr-1301 to index or reindex the given documents. When the shards are built, they will be sent over the network to the solr-search-cluster. Is this description correct? What makes me thinking is: Assumed I got a Document X on machine Y in shard Y... When I reindex that document X together with lots of other documents that are present or not present in Shard Y... and I put the resulting shard on a machine Z, how does machine Y notice that it has got an older version of document X than machine Z? Furthermore: Go on and assume that the shard Y was replicated to three other machines, how do they all notice, that their version of document X is not the newest available one? In such an environment, we do not have a master (right?), so far: How to keep the index as consistent as possible? Thank you for clearifying. Kind regards |
|
On 2010-09-04 19:53, MitchK wrote:
> > Hi, > > this topic started a few months ago, however there are some questions from > my side, that I couldn't answer by looking at the SOLR-1301-issue nor the > wiki-pages. > > Let me try to explain my thoughts: > Given: a Hadoop-cluster, a solr-search-cluster and nutch as a > crawling-engine which also performs LinkRank and webgraph-related tasks. > > Once a list of documents is created by nutch, you put the list + the > LinkRank-values etc. into a Solr+Hadoop-job like it is described in > Solr-1301 to index or reindex the given documents. There is no out of the box integration between Nutch and SOLR-1301, so there is some step that you omitted from this chain... e.g. "export from Nutch segments to CSV". > When the shards are built, they will be sent over the network to the > solr-search-cluster. > Is this description correct? Not really. SOLR-1301 doesn't deal with how you deploy the results of indexing. It simply creates the shards on HDFS. SOLR-1301 just creates the index data - it doesn't deal with serving the data... > > What makes me thinking is: > Assumed I got a Document X on machine Y in shard Y... > When I reindex that document X together with lots of other documents that > are present or not present in Shard Y... and I put the resulting shard on a > machine Z, how does machine Y notice that it has got an older version of > document X than machine Z? > > Furthermore: Go on and assume that the shard Y was replicated to three other > machines, how do they all notice, that their version of document X is not > the newest available one? > In such an environment, we do not have a master (right?), so far: How to > keep the index as consistent as possible? It's not possible to do it like this, at least for now... Looking into the future: eventually, when SolrCloud arrives we will be able to index straight to a SolrCloud cluster, assigning documents to shards through a hashing schema (e.g. 'md5(docId) % numShards'). Since shards would be created in a consistent way, then newer versions of documents would end up in the same shards and they would replace the older versions of the same documents - thus the problem would be solved. Additional benefit from this model is that it's not a disruptive and copy-intensive operation like SOLR-1301 (where you have to do "create new indexes, deploy them and switch") but rather a regular online update that is already supported in Solr. Once this is in place, we can modify Nutch to send documents directly to a SolrCloud cluster. Until then, you need to build and deploy indexes more or less manually (or using Katta, but again Katta is not integrated with Nutch). SolrCloud is not far away from hitting the trunk (right, Mark? ;) ), so medium-term I think this is your best bet. -- Best regards, Andrzej Bialecki <>< ___. ___ ___ ___ _ _ __________________________________ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com |
|
Thanks for your detailed feedback Andzej!
From what I understood, SOLR-1301 becomes obsolete ones Solr becomes cloud-ready, right? Hm, let's say the md5(docId) would produce a value of 10 (it won't, but let's assume it). If I got a constant number of shards, the doc will be published to the same shard again and again. i.e.: 10 % numShards(5) = 2 -> doc 10 will be indexed at shard 2. A few days later the rest of the cluster is available, now it looks like 10 % numShards(10) -> 1 -> doc 10 will be indexed at shard 1... and what about the older version at shard 2? I am no expert when it comes to cloudComputing and the other stuff. If you can point me to one or another reference where I can read about it, it would help me a lot, since I only want to understand how it works at the moment. The problem with Solr is its lack of documentation in some classes and the lack of capsulating some very complex things into different methods or extra-classes. Of course, this is because it costs some extra time to do so, but it makes understanding and modifying things very complicated if you do not understand whats going on from a theoretical point of view. Since the cloud-feature will be complex, a lack of documentation and no understanding of the theory behind the code will make contributing back very, very complicated. Thank you :-) - Mitch |
|
On Mon, Sep 6, 2010 at 8:37 AM, MitchK <[hidden email]> wrote:
> 10 % numShards(10) -> 1 -> doc 10 will be indexed at shard 1... and what > about the older version at shard 2? I am no expert when it comes to > cloudComputing and the other stuff. > If you can point me to one or another reference where I can read about it, > it would help me a lot, since I only want to understand how it works at the > moment. SolrCloud hasn't done anything on the indexing front yet (it still relies on user partitioning). The problem you point out is why a simple hash won't work well. When we do support indexing, we'll be able to figure out what document goes to what shard by consulting the state we keep in zookeeper. -Yonik http://lucenerevolution.org Lucene/Solr Conference, Boston Oct 7-8 |
|
In reply to this post by MitchK
(I adjusted the subject to better reflect the content of this discussion).
On 2010-09-06 14:37, MitchK wrote: > > Thanks for your detailed feedback Andzej! > >> From what I understood, SOLR-1301 becomes obsolete ones Solr becomes > cloud-ready, right? Who knows... I certainly didn't expect this code to become so popular ;) so even after SolrCloud becomes available it's likely that some people will continue to use it. But SolrCloud should solve the original problem that I tried to solve with this patch. >> Looking into the future: eventually, when SolrCloud arrives we will be >> able to index straight to a SolrCloud cluster, assigning documents to >> shards through a hashing schema (e.g. 'md5(docId) % numShards') >> > Hm, let's say the md5(docId) would produce a value of 10 (it won't, but > let's assume it). > If I got a constant number of shards, the doc will be published to the same > shard again and again. > > i.e.: 10 % numShards(5) = 2 -> doc 10 will be indexed at shard 2. > > A few days later the rest of the cluster is available, now it looks like > > 10 % numShards(10) -> 1 -> doc 10 will be indexed at shard 1... and what > about the older version at shard 2? I am no expert when it comes to > cloudComputing and the other stuff. There are several possible solutions to this, and they all boil down to the way how you assign documents to shards... Keep in mind that nodes (physical machines) can manage several shards, and the aggregate collection of all unique shards across all nodes forms your whole index - so there's also a related, but different issue, of how to assign shards to nodes. Here are some scenarios how you can solve the doc-to-shard mapping problem (note: I removed the issue of replication from the picture to make this clearer): a) keep the number of shards constant no matter how large is the cluster. The mapping schema is then as simple as the one above. In this scenario you create relatively small shards, so that a single physical node can manage dozens of shards (each shard using one core, or perhaps a more lightweight structure like MultiReader). This is also known as micro-sharding. As the number of documents grows the size of each shard will grow until you have to reduce the number of shards per node, ultimately ending up with a single shard per node. After that, if your collection continues to grow, you have to modify your hashing schema to split some shards (and reindex some shards, or use an index splitter tool). b) use consistent hashing as the mapping schema to assign documents to a changing number of shards. There are many explanations of this schema on the net, here's one that is very simple: http://www.tomkleinpeter.com/2008/03/17/programmers-toolbox-part-3-consistent-hashing/ In this case, you can grow/shrink the number of shards (and their size) as you see fit, incurring only a small reindexing cost. > If you can point me to one or another reference where I can read about it, > it would help me a lot, since I only want to understand how it works at the > moment. http://wiki.apache.org/solr/SolrCloud ... > > The problem with Solr is its lack of documentation in some classes and the > lack of capsulating some very complex things into different methods or > extra-classes. Of course, this is because it costs some extra time to do so, > but it makes understanding and modifying things very complicated if you do > not understand whats going on from a theoretical point of view. In this case the lack of good docs and user-level API can be blamed on the fact that this functionality is still under heavy development. > > Since the cloud-feature will be complex, a lack of documentation and no > understanding of the theory behind the code will make contributing back > very, very complicated. For now, yes, it's an issue - though as soon as SolrCloud gets committed I'm sure people will follow up with user-level convenience components that will make it easier. -- Best regards, Andrzej Bialecki <>< ___. ___ ___ ___ _ _ __________________________________ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com |
|
In reply to this post by Yonik Seeley-2-2
Yonik,
are there any discussions about SolrCloud-indexing? I would be glad to join them, if I find some interesting papers about that topic. - Mitch |
|
On Mon, Sep 6, 2010 at 9:47 AM, MitchK <[hidden email]> wrote:
> are there any discussions about SolrCloud-indexing? Not recently - personally I've been sidetracked by other stuff. Mapping docs to shards is the easy part... take a hash of the id, and then I imagine the shard id (the label for the index) can just be a range of hashes (i.e. 0x00000000-0xffffffff would be the entire collection). The harder part in all this is actually handling updates. For high availability, updates need to go to multiple hosts (the replication factor). But some hosts may be down, and updates may go to any shard replica. The shards/nodes need to be able to exchange info and fix things up. > I would be glad to join them, if I find some interesting papers about that > topic. Awesome! Here's some stuff to browse: http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html http://sourceforge.net/mailarchive/forum.php?forum_name=bailey-developers http://cassandra.apache.org/ -Yonik http://lucenerevolution.org Lucene/Solr Conference, Boston Oct 7-8 |
|
In reply to this post by Andrzej Bialecki
Andrzej,
thank you for sharing your experiences. Boom. With the given explanation, I understand it as the following: You can use hadoop and do some map-reduce-jobs per csv-file. At the reducer-side, the reducer has to look for the id of the current doc and needs to create a hash of it. Now it looks inside a SortedSet, picks the next-best server and looks in a map, whether this server has got free capacity or not. That's cool. But it doesn't solve the problem at all, correct me if I am wrong, but: If you add a new server, let's call him IP3-1, and IP3-1 is nearer to the current ressource X, than doc x will be indexed at IP3-1 - even if IP2-1 holds the older version. Am I right? Thank you for sharing the paper. I will have a look for more like this. I do not only mean documentation at the user-level but also inside a class, if there is going on some complicated stuff. - Mitch |
| Powered by Nabble | Edit this page |
