
Introduction

Our current Bigtop deployment code is showing its age (it originated in pre-2.x Puppet). That code currently serves as the main engine for dynamically deploying Bigtop clusters on EC2 for testing purposes. However, given that the Bigtop distro is the foundation for the commercial distros, we would like our Puppet code to be the go-to place for all puppet-driven Hadoop deployment needs.

Thus at the highest level, our Puppet code needs to be:

  1. self-contained, to the point where we might end up forking somebody else's module(s) and maintaining it (subject to licensing)
  2. useful with as many versions of Puppet as possible from the code (manifest) compatibility point of view. Obviously, we shouldn't obsess too much over something like Puppet 0.24, but we should keep the goal in mind. Also, we are NOT trying to solve the problem of mixing different versions of Puppet master and Puppet agents here.
  3. useful in a classical puppet-master driven setup where one has access to modules, hiera/extlookup, etc., all nicely set up and maintained under /etc/puppet
  4. useful in a masterless mode so that things like Puppet/Whirr integration can be utilized: WHIRR-385 This is the use case where the Puppet classes are guaranteed to be delivered to each node out of band and --modulepath will be given to puppet apply. Everything else (hiera/extlookup files, etc.) is likely to require additional out-of-band communication that we would like to minimize.
  5. useful in orchestration scenarios (Apache Ambari, Reactor8), although this could be viewed as a subset of the previous one. We need to be absolutely clear, though, that we are NOT proposing to use Puppet itself as an orchestration platform. What is being proposed is to treat Puppet as a tool for predictably stamping out the state of each node when, and only when, instructed by a higher-level orchestration engine. The same orchestration engine will also make sure to provide all the required data for the Puppet classes. We are leaving it up to the orchestration implementation to either require a Puppet master or not. The important part is the temporal sequence in which either puppet agent or puppet apply gets invoked on every node.
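As a sketch of the masterless use case, assuming the modules have already been delivered to the node (the class name, module path, and file locations below are illustrative only, not part of the current code):

```puppet
# Hypothetical site.pp shipped to a node out of band; the class
# bigtop::hdfs::datanode and the paths are placeholders.
node default {
  include bigtop::hdfs::datanode
}

# A higher-level tool (Whirr, an orchestration engine) would then
# invoke, when and only when it decides the node is ready:
#   puppet apply --modulepath=/tmp/bigtop-puppet/modules /tmp/bigtop-puppet/site.pp
```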

Proof of concept implementation (ongoing)

The next generation Puppet implementation will be a pretty radical departure from what we currently have in Bigtop. It feels like a POC with a subset of Bigtop components will be a useful first step. HBase deployment is one such subset that, on the one hand, is still small enough, but on the other hand requires Puppet deployment for 2 supporting components:

  1. Zookeeper
  2. HDFS

before HBase itself gets fully bootstrapped. For those unfamiliar with the HBase architecture the following provides a good introduction: http://hbase.apache.org/book.html#distributed

Also, there's now a GitHub repository for the code to evolve: https://github.com/rvs/bigtop-puppet

Issues to be resolved

Parameterized classes vs. external lookups

Which style is preferred:

class { "bigtop::hdfs::secondarynamenode": 
   hdfs_namenode_uri => "hdfs://nn.cluster.company.com:8020/"
} 

vs.

# all the data required for the class gets
# to us via external lookups
include bigtop::hdfs::secondarynamenode

jcbollinger says:

If you are going to rely on hiera or another external source for all class data, then you absolutely should use the 'include' form, not the parametrized form. The latter carries the constraint that it can be used only once for any given class, and that use must be the first one parsed. There are very good reasons, however, why sometimes you would like to declare a given class in more than one place. You can work around any need to do so with enough effort and code, but that generally makes your manifest set more brittle, and / or puts substantially greater demands on your ENC. I see from your subsequent comment (elided) that you recognize that, at least to some degree.
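A minimal sketch of the 'include' style backed by hiera, assuming a key such as hdfs_namenode_uri exists in the hierarchy (the key name and resource body are illustrative):

```puppet
class bigtop::hdfs::secondarynamenode {
  # All class data comes from the external source rather than from
  # class parameters, so it is safe to 'include' this class from
  # any number of places in the manifest set.
  $hdfs_namenode_uri = hiera("hdfs_namenode_uri")

  # ... resources that use $hdfs_namenode_uri go here ...
}

# Multiple manifests may now declare the class without conflict:
include bigtop::hdfs::secondarynamenode
```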

Source of external configuration: extlookup/hiera

Should we rely on extlookup or hiera? Should we somehow push that decision onto the consumer of our Puppet code so that they can use either one?

rvs says:

My biggest fear of embracing hiera 100% is compatibility concerns with older Puppets
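One way to push that decision onto the consumer would be a thin wrapper function; bigtop_lookup below is a hypothetical name, not an existing function, sketched only to show the shape of the idea:

```puppet
# Hypothetical wrapper: Bigtop classes would call bigtop_lookup()
# and never reference hiera() or extlookup() directly, so the
# consumer could swap the backend without touching our manifests.
$hdfs_namenode_uri = bigtop_lookup("hdfs_namenode_uri")

# On a hiera-enabled Puppet the wrapper would delegate to:
#   hiera("hdfs_namenode_uri")
# while on older Puppets it could fall back to:
#   extlookup("hdfs_namenode_uri")
```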

What is the best tool for modeling the data that characterize the configuration to be deployed

It seems like class parameters are the obvious choice here

jcbollinger says:

You are supposing that these will be modeled as class parameters in the first place. Certainly they are data that characterize the configuration to be deployed, and it is possible to model them as class parameters, but that is not the only – and maybe not the best – alternative available to you. Class parametrization is a protocol for declaring and documenting that data on which a class relies, and it enables mechanisms for obtaining that data that non-parametrized classes cannot use, but the same configuration objectives can be accomplished without them.
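A middle ground, sketched under the assumption that hiera is available, is to keep the parameter as documentation of the data the class relies on while defaulting it from the external source, so plain 'include' still works:

```puppet
# The parameter documents and constrains the class's inputs, but
# its default value comes from the external lookup, so the class
# can still be declared with a bare 'include'.
class bigtop::hdfs::secondarynamenode (
  $hdfs_namenode_uri = hiera("hdfs_namenode_uri")
) {
  # ... resources that use $hdfs_namenode_uri go here ...
}
```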
