FC047 Attribute Assignment Does Not Specify Precedence

Modified to use discovery and options preparation.

Cookbook dependencies:

  • java
  • apt
  • runit
  • thrift
  • iptables
  • volumes
  • metachef
  • install_from

  • - Cassandra cluster name (default: "cluster_name")
    • The name for the Cassandra cluster in which this node should participate. The default is 'Test Cluster'.
  • - (default: "/usr/local/share/cassandra")
    • Directories, hosts and ports
  • - (default: "/etc/cassandra")
  • - (default: "/mnt/cassandra/commitlog")
  • -
  • - (default: "/var/lib/cassandra/saved_caches")
  • - cassandra (default: "cassandra")
  • - (default: "localhost")
  • -
  • - (default: "localhost")
  • - (default: "9160")
  • - (default: "7000")
  • - (default: "12345")
  • - (default: "8081")
  • - (default: "127.0.0.1")
  • - (default: ":apache_mirror:/cassandra/:version:/apache-cassandra-:version:-bin.tar.gz")
    • install_from_release: tarball url
  • - (default: "git://git.apache.org/cassandra.git")
  • - (default: "cdd239dcf82ab52cb840e070fc01135efb512799")
    • until ruby gem is updated, use cdd239dcf82ab52cb840e070fc01135efb512799
  • - (default: "http://debian.riptano.com/maverick/pool/libjna-java_3.2.7-0~nmu.2_amd64.deb")
  • - Cassandra automatic bootstrap boolean (default: "false")
    • Boolean indicating whether a node should automatically bootstrap on startup.
  • - Cassandra keyspaces

    • Make a data bag called 'cassandra', with an element 'clusters'. Within that, define a hash named for your cluster (see the sketch after this attribute list):
    • keys_cached: specifies the number of keys per sstable whose locations we keep in memory in "mostly LRU" order. (JUST the key locations, NOT any column values.) Specify a fraction (value less than 1) or an absolute number of keys to cache. Defaults to 200000 keys.
    • rows_cached: specifies the number of rows whose entire contents we cache in memory. Do not use this on ColumnFamilies with large rows, or ColumnFamilies with high write:read ratios. Specify a fraction (value less than 1) or an absolute number of rows to cache. Defaults to 0. (i.e. row caching is off by default)
    • comment: used to attach additional human-readable information about the column family to its definition.
    • read_repair_chance: specifies the probability with which read repairs should be invoked on non-quorum reads. must be between 0 and 1. defaults to 1.0 (always read repair).
    • preload_row_cache: If true, will populate row cache on startup. Defaults to false.
    • gc_grace_seconds: specifies the time to wait before garbage collecting tombstones (deletion markers). defaults to 864000 (10 days). See http://wiki.apache.org/cassandra/DistributedDeletes
  • - Cassandra authenticator (default: "org.apache.cassandra.auth.AllowAllAuthenticator")

    • The IAuthenticator to be used for access control.
  • - (default: "org.apache.cassandra.dht.RandomPartitioner")

  • -

  • - (default: "128")

  • - (default: "15")

  • - (default: "auto")

  • - (default: "64")

  • - (default: "64")

  • - (default: "64")

  • - (default: "0.3")

  • - (default: "60")

  • - (default: "8")

  • - (default: "32")

  • - (default: "periodic")

  • - (default: "10000")

  • - (default: "org.apache.cassandra.auth.AllowAllAuthority")

  • - (default: "true")

  • - (default: "3600000")

  • - (default: "50")

  • - (default: "org.apache.cassandra.locator.SimpleSnitch")

  • - (default: "true")

  • - (default: "128M")

  • - (default: "1650M")

  • - (default: "1500M")

  • - (default: "1")

  • - (default: "16")

  • -

  • -

  • - (default: "64")

  • - (default: "true")

  • - (default: "0.75")

  • - (default: "0.85")

  • - (default: "0.6")

  • - (default: "10000")

  • - (default: "false")

  • - (default: "8")

  • - (default: "org.apache.cassandra.scheduler.NoScheduler")

  • - (default: "80")

  • - (default: "keyspace")

  • - (default: "/var/log/cassandra")

  • - (default: "/var/lib/cassandra")

  • - (default: "/var/run/cassandra")

  • - nogroup (default: "nogroup")

    • The group that cassandra belongs to
  • - (default: "0.7.10")

  • - (default: "3.0.2")

  • - (default: "http://downloads.sourceforge.net/project/mx4j/MX4J%20Binary/x.x/mx4j-x.x.zip?r=http%3A%2F%2Fsourceforge.net%2Fprojects%2Fmx4j%2Ffiles%2F&ts=1303407638&use_mirror=iweb")

    • MX4J location (at least as of Version 3.0.2)
  • - (default: "330")

  • - (default: "330")

  • -
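  A rough sketch of the keyspaces data bag described in the attribute list above. The exact nesting of keyspaces and column families is an assumption -- check the cookbook's templates for the structure it actually reads. It is shown here as the Ruby hash a recipe would see after Chef parses the JSON item; cluster, keyspace, and column family names are illustrative only.

      # Hypothetical 'clusters' item in the 'cassandra' data bag.
      # Cluster, keyspace, and column family names are illustrative only.
      {
        'id' => 'clusters',
        'Test Cluster' => {
          'my_keyspace' => {
            'column_families' => {
              'users' => {
                'keys_cached'        => 200_000, # key locations kept in memory
                'rows_cached'        => 0,       # row caching off by default
                'read_repair_chance' => 1.0,     # always read-repair non-quorum reads
                'preload_row_cache'  => false,   # do not warm the row cache on startup
                'gc_grace_seconds'   => 864_000, # wait 10 days before GCing tombstones
                'comment'            => 'user profiles'
              }
            }
          }
        }
      }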

  Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

    readme generated by cluster_chef's cookbook_munger

    hadoop_cluster chef cookbook

    Hadoop: distributed massive-scale data processing framework. Store and analyze terabyte-scale datasets with ease

    Overview

    The hadoop_cluster cookbook lets you spin up hadoop clusters of arbitrary size, makes it easy to configure and tune them, and opens up powerful new ways to reduce your compute bill.

    This cookbook installs Apache hadoop using the Cloudera hadoop distribution (CDH), and it plays well with the infochimps cookbooks for HBase, Flume, ElasticSearch, Zookeeper, Ganglia and Zabbix.

    Cluster instantiation

    Instantiating a cluster from thin air requires:

    • starting the master, with all of the daemons' service-state attributes set to 'stop'
    • running the namenode bootstrap step, which is now fairly robust and re-runnable -- it should start the namenode on its own
    • setting the service states to 'start', re-running chef, and then restarting the master
    • launching the jobtracker, workers, etc.
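    As a concrete (but hypothetical) picture of the first step, here is a minimal Chef role sketch that holds every daemon at 'stop' for the initial master launch. The attribute paths and recipe names are assumptions -- verify them against the cookbook's attributes files and your cluster definition.

        # Minimal sketch, assuming attribute paths like node[:hadoop][<daemon>][:service_state]
        # and recipe names like hadoop_cluster::namenode. Adjust to the real names.
        name 'hadoop_initial_launch'
        description 'Bring the master up with every hadoop daemon held at stop'
        run_list 'recipe[hadoop_cluster]',
                 'recipe[hadoop_cluster::namenode]',
                 'recipe[hadoop_cluster::jobtracker]'
        override_attributes(
          'hadoop' => {
            'namenode'           => { 'service_state' => 'stop' },
            'secondary_namenode' => { 'service_state' => 'stop' },
            'jobtracker'         => { 'service_state' => 'stop' },
            'datanode'           => { 'service_state' => 'stop' },
            'tasktracker'        => { 'service_state' => 'stop' }
          }
        )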

    Tunables

    For more details on the so, so many config variables, see

    Advanced Cluster-Fu for the impatient cheapskate

    Stop-start clusters

    If you have a persistent HDFS, you can shut the cluster down at the end of your workday and restart it in less time than it takes to get your morning coffee. Typical time from typing "knife cluster launch science-worker" until the node reports in to the jobtracker is <= 6 minutes on launch -- faster than that on start.

    Stopped nodes don't cost you anything in compute, though you do continue to pay for the storage on their attached drives. See the example science cluster for the setup we use.

    Reshapable clusters

    The hadoop cluster definition we use at infochimps for production runs uses its HDFS ONLY as a scratch pad - anything we want to keep goes into S3.

    This lets us do stupid, dangerous, awesome things like:

    • spin up a few dozen c1.xlarge CPU-intensive machines, parse a ton of data, store it back into S3.
    • blow all the workers away and reformat the namenode with the shell script.
    • spin up a cluster of m2.2xlarge memory-intensive machines to group and filter it, storing final results into S3.
    • shut the entire cluster down before anyone in accounting notices.

    Tasktracker-only workers

    Who says your workers should also be datanodes? Sure, "bring the compute to the data" is the way the robots want you to do it, but a tasktracker-only node on an idle cluster is one you can kill with no repercussions.

    This lets you blow up the size of your cluster and not have to wait later for nodes to decommission. Non-local map tasks obviously run slower-than-optimal, but we'd rather have sub-optimal robots than sub-optimal data scientists.
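    As a hedged sketch, a compute-only worker could be expressed as a Chef role that runs the tasktracker recipe and deliberately omits the datanode recipe, so killing the node can never cost you HDFS blocks. The role and recipe names here are assumptions based on the recipe list below.

        # Hypothetical tasktracker-only worker role -- recipe names are assumptions.
        name 'hadoop_tasktracker_only'
        description 'Compute-only hadoop worker: tasktracker, no datanode'
        run_list 'recipe[hadoop_cluster]', 'recipe[hadoop_cluster::tasktracker]'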

    Author:

    Author:: Joshua Timberman (joshua@opscode.com), Flip Kromer (flip@infochimps.org), much code taken from Tom White (tom@cloudera.com)'s hadoop-ec2 scripts and Robert Berger (http://blog.ibd.com)'s blog posts.

    Copyright:: 2009, Opscode, Inc.; 2010, 2011 Infochimps, Inc.

    Recipes

    • - Add Cloudera repo to package manager
    • - Configure cluster
    • - Installs Hadoop Datanode service
    • - Base configuration for hadoop_cluster
    • - Installs Hadoop documentation
    • - Pretend that groups of machines are on different racks so you can execute them without guilt
    • - Installs Hadoop HDFS Fuse service (regular filesystem access to HDFS files)
    • - Installs Hadoop Jobtracker service
    • - Installs Hadoop Namenode service
    • - Installs Hadoop Secondary Namenode service
    • - Simple Dashboard
    • - Installs Hadoop Tasktracker service
    • - Wait on HDFS Safemode -- insert between cookbooks to ensure HDFS is available

    Integration

    Supports platforms: debian and ubuntu

    Cookbook dependencies:

    • java
    • apt
    • runit
    • volumes
    • tuning
    • metachef
    • dashpot

    Attributes

    • - Number of machines in the cluster (default: "5")
      • Number of machines in the cluster. This is used to size things like handler counts, etc.
    • - Override the distro name apt uses to look up repos (default: "maverick")
      • Typically, leave this blank. However, if (as is the case in Nov 2011) you are on natty but Cloudera's repo only has packages up to maverick, use this to override.
    • - Release identifier (eg cdh3u2) of the cloudera repo to use. See also hadoop/deb_version (default: "cdh3u2")
    • - Version prefix for the daemons and other components (default: "hadoop-0.20")
      • Cloudera distros prefix most (but not all) things with this. It helps isolate the cases where they say 'hadoop-0.20' vs. plain 'hadoop'.
    • - Apt revision identifier (eg 0.20.2+923.142-1~maverick-cdh3) of the specific cloudera apt to use. See also apt/release_name (default: "0.20.2+923.142-1~maverick-cdh3")
    • - Default HDFS replication factor (default: "3")
      • HDFS blocks are by default reproduced to this many machines.
    • - (default: "10")
    • - (default: "false")
    • - (default: "BLOCK")
    • - (default: "org.apache.hadoop.io.compress.DefaultCodec")
    • - (default: "true")
    • - (default: "org.apache.hadoop.io.compress.DefaultCodec")
    • - (default: "24")
    • - (default: "1000")
      • uses /etc/default/hadoop-0.20 to set the hadoop daemon's java_heap_size_max
    • - (default: "134217728")
      • You may wish to set the following to the same value as your HDFS block size, especially if you're seeing issues with s3:// turning 1TB files into 30_000+ map tasks
    • - fs.s3n.block.size (default: "134217728")
      • Block size to use when reading files using the native S3 filesystem (s3n: URIs).
    • - dfs.block.size (default: "134217728")
      • The default block size for new files
    • - (default: "3")
    • - (default: "2")
    • - (default: "-Xmx2432m -Xss128k -XX:+UseCompressedOops -XX:MaxNewSize=200m -server")
    • - (default: "7471104")
    • - (default: "25")
    • - (default: "250")
    • -
      • Other recipes can add to this under their own special key, for instance node[:hadoop][:extra_classpaths][:hbase] = '/usr/lib/hbase/hbase.jar:/usr/lib/hbase/lib/zookeeper.jar:/usr/lib/hbase/conf' (see the sketch after this attribute list)
    • - (default: "/usr/lib/hadoop")
    • - (default: "/etc/hadoop/conf")
    • - (default: "/var/run/hadoop")
    • -
    • -
    • - (default: "hdfs")
    • -
      • Define a rack topology? If false (the default), all nodes are in the same 'rack'.
    • - (default: "40")
    • - (default: "stop")
    • -
    • - (default: "/hadoop/mapred/system")
    • - (default: "/hadoop/mapred/system")
    • - (default: "8021")
    • - (default: "50030")
    • - (default: "mapred")
    • - (default: "8008")
    • - (default: "40")
    • - (default: "stop")
      • What states to set for services. You want to bring the big daemons up deliberately on initial start. Override in your cluster definition when things are stable.
    • -
    • - (default: "8020")
    • - (default: "50070")
    • - (default: "hdfs")
    • -
      • These are handled by volumes, which imprints them on the node. If you set an explicit value it will be used and no discovery is done.

        Chef Attr                         Owner          Permissions  Path                                       Hadoop Attribute
        [:namenode   ][:data_dir]         hdfs:hadoop    drwx------   {persistent_vols}/hadoop/hdfs/name         dfs.name.dir
        [:sec..node  ][:data_dir]         hdfs:hadoop    drwxr-xr-x   {persistent_vols}/hadoop/hdfs/secondary    fs.checkpoint.dir
        [:datanode   ][:data_dir]         hdfs:hadoop    drwxr-xr-x   {persistent_vols}/hadoop/hdfs/data         dfs.data.dir
        [:tasktracker][:scratch_dir]      mapred:hadoop  drwxr-xr-x   {scratch_vols}/hadoop/hdfs/name            mapred.local.dir
        [:jobtracker ][:system_hdfsdir]   mapred:hadoop  drwxr-xr-x   {!!HDFS!!}/hadoop/mapred/system            mapred.system.dir
        [:jobtracker ][:staging_hdfsdir]  mapred:hadoop  drwxr-xr-x   {!!HDFS!!}/hadoop/mapred/staging           mapred.system.dir

        Important: In CDH3, the mapred.system.dir directory must be located inside a directory that is owned by mapred. For example, if mapred.system.dir is specified as /mapred/system, then /mapred must be owned by mapred. Don't, for example, specify /mrsystem as mapred.system.dir because you don't want / owned by mapred.
    • - (default: "8004")
    • - (default: "8")
    • - (default: "start")
      • You can just kick off the worker daemons; they'll retry. On a full-cluster stop/start (or any other time the main daemons' IP addresses change), however, you will need to converge chef and then restart them all.
    • -
    • - (default: "50010")
    • - (default: "50020")
    • - (default: "50075")
    • - (default: "hdfs")
    • -
    • - (default: "8006")
    • - (default: "32")
    • - (default: "start")
    • -
    • - (default: "50060")
    • - (default: "mapred")
    • -
    • - (default: "8009")
    • - (default: "stop")
    • -
    • - (default: "50090")
    • - (default: "hdfs")
    • -
    • - (default: "8005")
    • - (default: "stop")
    • - (default: "stop")
    • - (default: "8007")
    • - (default: "1048576")
      • bytes per second -- 1MB/s by default
    • - (default: "300")
    • - (default: "301")
    • - (default: "302")
    • - (default: "303")
    • - (default: "/usr/lib/jvm/java-6-sun/jre")
    • - (default: "302")
    • - (default: "303")
    • -
    • -
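    The sketch referenced in the extra_classpaths item above: another cookbook's recipe might register its jars like this. The node.default precedence level is an assumption (the inline example in the attribute description uses a bare node[...] assignment).

        # Sketch only: register HBase jars on hadoop's classpath under a dedicated key,
        # as the extra_classpaths attribute description suggests.
        node.default[:hadoop][:extra_classpaths][:hbase] =
          '/usr/lib/hbase/hbase.jar:/usr/lib/hbase/lib/zookeeper.jar:/usr/lib/hbase/conf'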

    License and Author

    Author:: Philip (flip) Kromer - Infochimps, Inc (coders@infochimps.com)
    Copyright:: 2011, Philip (flip) Kromer - Infochimps, Inc

    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

    readme generated by cluster_chef's cookbook_munger

    Collaborator Number Metric

    3.0.5 failed this metric

    Failure: Cookbook has 0 collaborators. A cookbook must have at least 2 collaborators to pass this metric.

    Contributing File Metric

    3.0.5 failed this metric

    Failure: To pass this metric, your cookbook metadata must include a source url, the source url must be in the form of https://github.com/user/repo, and your repo must contain a CONTRIBUTING.md file

    Foodcritic Metric

    3.0.5 failed this metric

    FC034: Unused template variables: hadoop_cluster/templates/default/hadoop-topologizer.rb.erb:1
    FC046: Attribute assignment uses assign unless nil: hadoop_cluster/attributes/default.rb:60
    FC046: Attribute assignment uses assign unless nil: hadoop_cluster/attributes/default.rb:61
    FC047: Attribute assignment does not specify precedence: hadoop_cluster/recipes/cluster_conf.rb:28
    FC047: Attribute assignment does not specify precedence: hadoop_cluster/recipes/cluster_conf.rb:29
    FC047: Attribute assignment does not specify precedence: hadoop_cluster/recipes/cluster_conf.rb:30
    FC047: Attribute assignment does not specify precedence: hadoop_cluster/recipes/default.rb:125
    FC047: Attribute assignment does not specify precedence: hadoop_cluster/recipes/default.rb:133
    FC047: Attribute assignment does not specify precedence: hadoop_cluster/recipes/default.rb:140
    FC047: Attribute assignment does not specify precedence: hadoop_cluster/recipes/default.rb:144
    FC064: Ensure issues_url is set in metadata: hadoop_cluster/metadata.rb:1
    FC065: Ensure source_url is set in metadata: hadoop_cluster/metadata.rb:1
    FC066: Ensure chef_version is set in metadata: hadoop_cluster/metadata.rb:1
    FC069: Ensure standardized license defined in metadata: hadoop_cluster/metadata.rb:1
    FC072: Metadata should not contain "attribute" keyword: hadoop_cluster/metadata.rb:1
    Run with Foodcritic Version 12.3.0 with tags metadata,correctness ~FC031 ~FC045 and failure tags any
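    For readers unfamiliar with these rules: FC047 flags recipe code that assigns a node attribute without naming a precedence level, and FC046 flags the "assign unless nil" (||=) idiom. An illustrative before/after (not the cookbook's actual code; the attribute name is made up):

        node[:hadoop][:java_heap_size_max] = 1000            # flagged by FC047: no precedence given
        node.default[:hadoop][:java_heap_size_max] ||= 1000  # flagged by FC046: assign-unless-nil

        # preferred: state the precedence explicitly and assign directly
        node.default[:hadoop][:java_heap_size_max] = 1000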

    License Metric

    3.0.5 passed this metric

    No Binaries Metric

    3.0.5 passed this metric

    Publish Metric

    3.0.5 passed this metric

    Supported Platforms Metric

    3.0.5 passed this metric

    Testing File Metric

    3.0.5 failed this metric

    Failure: To pass this metric, your cookbook metadata must include a source url, the source url must be in the form of https://github.com/user/repo, and your repo must contain a TESTING.md file

    Version Tag Metric

    3.0.5 failed this metric

    Failure: To pass this metric, your cookbook metadata must include a source url, the source url must be in the form of https://github.com/user/repo, and your repo must include a tag that matches this cookbook version number
