Storage cluster



Installing the HDFS-only cluster

You can set up an HDFS cluster by following the official guidelines, which are quite detailed (and probably complex for a newcomer), or you can use a Hadoop distribution whose installation manager will make your life easier.

Tools from the Cosmos ecosystem have been tested with HDP 2.2, but should work with any other Hadoop distribution.

Since this is a storage-only cluster, install only the HDFS service. Do not install YARN, MapReduce2, or any analysis tool such as Hive. The complete list of services and daemons is:

  • HDFS service:
    • (Active) Namenode (mandatory)
    • (Stand-by) Namenode (optional)
    • SecondaryNamenode (optional)
    • Datanodes (mandatory)


Installing the services node

In addition to the Hadoop cluster, it is highly recommended to deploy a special node that is not part of the cluster (i.e. it hosts no Hadoop daemon) but has the Hadoop libraries installed, together with a copy of all the configuration files of the cluster. The reason is that this node can act as the single endpoint for the computing services, hiding the details of the cluster and thus saving many public IP addresses (this node is the only one that exposes a public address).

Available services should be:

  • HttpFS server. This service works as a gateway for WebHDFS: it implements the same REST API, but is specifically designed to hide the IP addresses/FQDNs of the cluster nodes, even in those WebHDFS operations that rely on redirections (e.g. creating a file with initial content). HttpFS can be installed from sources, but if you ended up using a Hadoop distribution, you will most probably be able to install it from a repository.
  • ssh server. If you want your users to manage their HDFS user space through the File System Shell, they will need to ssh into this services node.

Of course, you can achieve the same goals by exposing those services on one of the cluster nodes and allowing public access to that node. However, this is not recommended in terms of performance (the services would have to share the node's resources with the Hadoop daemons) or security (you do not want any user to gain access to a machine that actually stores other users' data).
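As an illustration of how a client reaches HDFS through the services node, the following is a minimal sketch of a WebHDFS-style request sent to HttpFS. It assumes HttpFS listens on its default port (14000); the host name, HDFS path and user name in the usage line are hypothetical, and the helper names httpfs_url and httpfs_list are made up for this example.

```shell
# Build a WebHDFS-style URL served by HttpFS on the services node.
# $1 host, $2 absolute HDFS path, $3 operation, $4 user name.
# Assumption: HttpFS uses its default port 14000 unless HTTPFS_PORT is set.
httpfs_url() {
    printf 'http://%s:%s/webhdfs/v1%s?op=%s&user.name=%s\n' \
        "$1" "${HTTPFS_PORT:-14000}" "$2" "$3" "$4"
}

# List an HDFS directory through the services node.
# $1 host, $2 absolute HDFS path, $3 user name.
# Needs curl and a reachable HttpFS server, so it is only defined here.
httpfs_list() {
    curl -s -X GET "$(httpfs_url "$1" "$2" LISTSTATUS "$3")"
}
```

A call such as httpfs_list services.example.com /user/alice alice would then return the directory listing as JSON, and the client never learns the address of any Namenode or Datanode.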



Configuring the HDFS-only cluster

The different managers/installers developed by the Hadoop distributions do most of the work for you regarding the configuration. Simply follow their "next-next" wizards and you will be done.

Nevertheless, for further reference, these are the configuration files used by Hadoop:

  • Read-only default configurations, when no site-specific ones are given:
    • /etc/hadoop/conf/core-default.xml
    • /etc/hadoop/conf/hdfs-default.xml
  • Site-specific configurations:
    • /etc/hadoop/conf/core-site.xml
    • /etc/hadoop/conf/hdfs-site.xml
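To give an idea of what a site-specific file contains, here is a minimal sketch that writes a core-site.xml with a single property, fs.defaultFS, into a scratch directory. The Namenode host and port passed in the usage line are placeholders, and the function name write_core_site is hypothetical; a real cluster will normally carry many more properties, generated by the distribution's manager.

```shell
# Write a minimal core-site.xml for inspection.
# $1 target directory, $2 Namenode host, $3 Namenode RPC port.
# Assumption: only fs.defaultFS is set; everything else keeps its default.
write_core_site() {
    mkdir -p "$1" || return 1
    cat > "$1/core-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://$2:$3</value>
  </property>
</configuration>
EOF
}
```

For example, write_core_site /tmp/hadoop-conf namenode.example.com 8020 produces a file you can compare against the one your manager generated under /etc/hadoop/conf.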


Configuring the services node

HttpFS must be configured as stated in the official documentation.

The ssh server can be used with its default configuration. More relevant is creating an administrative Unix user with sudo permissions, together with a public-private key pair for that user; the public key must be installed. This user is required by the Cosmos GUI in order to run certain administration commands on the computing cluster. See annex A for more details.



Running the HDFS-only cluster

Once again, using the manager shipped with any of the existing distributions makes everything easier. These managers usually expose very simple and intuitive means of starting and stopping a cluster.

Nevertheless, for further reference, this is the command that starts or stops each of the daemons run by the HDFS service:

$ (su -l hdfs -c) /usr/lib/hadoop/sbin/ --config /etc/hadoop/conf [start|stop] [namenode|datanode|journalnode]


Running the services

As stated in the official documentation, HttpFS is started, stopped or restarted by running:

$ /usr/lib/hadoop-httpfs/sbin/ [start|stop|restart]



You can use the HDFS administration commands.
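As a rough sketch of what such administration looks like, the function below chains two dfsadmin subcommands into a quick health check. It assumes the hdfs client is on the PATH and that it is run by the HDFS superuser (typically the hdfs Unix user); the function name hdfs_status is made up for this example.

```shell
# Print a short health summary of the HDFS service.
# Assumption: run as the HDFS superuser, with the hdfs client installed.
hdfs_status() {
    if ! command -v hdfs >/dev/null 2>&1; then
        echo "hdfs client not found on PATH" >&2
        return 1
    fi
    # Report whether the Namenode is in safe mode.
    hdfs dfsadmin -safemode get || return 1
    # Report overall capacity and the status of each Datanode.
    hdfs dfsadmin -report
}
```

On the services node this could be invoked as hdfs_status from a shell opened with su -l hdfs.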