How to Run Hortonworks HDP 2.5 in AWS EC2

Hortonworks is one of the industry leaders when it comes to data solutions. They provide an enterprise-ready platform that facilitates the creation and management of big data clusters, leveraging open-source software such as Apache Ambari and Apache Hadoop, among many others.

Hortonworks provides this platform as a virtual machine image that is available for the most common virtualisation platforms (Docker, VMware, and VirtualBox). So by just downloading the VM file and double-clicking it, you will have a single-node cluster running on your PC.

But who needs a cluster that contains only one machine, running in VirtualBox on a small PC? Especially if you have a PC with limited resources like mine. If this is the case, then you are on the right path: after reading this post, you will be able to create your own Ambari-managed cluster on AWS.

Pre-requisites

Before you start, you will need the following:

Preparation

First, we need to copy the OVA file to AWS. Here is a short video describing how to create an AMI from a virtual machine using the AWS CLI.

Notes About the Video

  1. The policies used in the video can be found in the following gists. Make sure you use your own S3 bucket names in DemouserPolicy.json and VMImportPolicy.json.
  2. Do not forget to save the Access Key ID and the Secret Access Key for the newly created user, because you are going to need them when you configure the AWS CLI.
  3. The video does not mention anything about configuring the AWS CLI. After installing it, run the command aws configure. This command takes you through a simple wizard like the one shown below. You just need to enter the Access Key ID and the Secret Access Key of the newly created user, and specify the region in which you want your AMI to be created.
    $ aws configure
    AWS Access Key ID [****************FMWQ]: the-access-key-of-the-user
    AWS Secret Access Key [****************ILVR]: the-secret-key-of-the-user
    Default region name [us-east-1]: us-east-1
    Default output format [json]: json
    
  4. One major thing that the video does not mention is that you need to upload the OVA file to AWS S3 manually; the importer will not do that for you automatically (that’s what I thought in the beginning). So before using the aws ec2 import-image command, you have to use the aws s3 cp command to upload the OVA file. Also, below you can find the container.json file.
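
The upload-then-import sequence from note 4 can be sketched roughly as follows. This is not the exact file from the video: the bucket name my-import-bucket, the file name sandbox.ova, the helper names, and the contents of container.json are all assumptions for illustration, so replace them with your own values.

```shell
# Hypothetical names: replace with your own bucket and OVA file.
BUCKET="my-import-bucket"
OVA_FILE="sandbox.ova"

# Print a disk-container description for `aws ec2 import-image`.
# $1 = S3 bucket name, $2 = S3 key of the uploaded OVA file
make_container_json() {
  cat <<EOF
[
  {
    "Description": "HDP 2.5 sandbox",
    "Format": "ova",
    "UserBucket": { "S3Bucket": "$1", "S3Key": "$2" }
  }
]
EOF
}

# Upload the OVA file first, then start the import task.
import_ova() {
  make_container_json "$BUCKET" "$OVA_FILE" > container.json
  aws s3 cp "$OVA_FILE" "s3://${BUCKET}/${OVA_FILE}"
  aws ec2 import-image --description "HDP 2.5 sandbox" \
    --disk-containers file://container.json
}
```

Calling import_ova prints a task description that includes an ImportTaskId. The import takes a while, and its progress can be checked with aws ec2 describe-import-image-tasks --import-task-ids followed by that ID.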

Creating The EC2 Instances

Now that the AMI is created in the specified region (us-east-1 in my case), we can use it to create the instance that will run Ambari in the HDP 2.5 VM.

To do so, first log in to your AWS console and navigate to EC2.

From there, select “AMIs” from the left panel.

In the “AMIs” section, you will find your imported VM (provided that you are in the region that you selected when you configured your AWS CLI). From here, right-click the AMI entry and select “Launch”.

Now just follow the instructions in the wizard until you reach “6. Configure Security Group”. In this step, we will allow all TCP traffic from anywhere.

WARNING! Please note that this is not good practice; you should open only the ports that are needed. I am doing this here only for educational purposes, to ease the installation. We will talk about this in another post.
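
As a rough idea of what opening only selected ports could look like with the AWS CLI, here is a small sketch. The helper name open_port, the security group ID, and the CIDR ranges are placeholders, and the ports shown are only the Ambari ports named in this post (8080, 8440, 8441) plus SSH, which is most likely not everything a full cluster needs.

```shell
# Allow inbound TCP on one port from one CIDR range.
# $1 = security group ID, $2 = port, $3 = source CIDR
open_port() {
  aws ec2 authorize-security-group-ingress \
    --group-id "$1" --protocol tcp --port "$2" --cidr "$3"
}

# Example with placeholder values: Ambari UI and SSH from one admin IP,
# Ambari agent ports from the VPC's private address range.
# open_port sg-0123456789abcdef0 8080 203.0.113.10/32
# open_port sg-0123456789abcdef0 22   203.0.113.10/32
# open_port sg-0123456789abcdef0 8440 172.31.0.0/16
# open_port sg-0123456789abcdef0 8441 172.31.0.0/16
```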

Note: For the instance I created, I am using the t2.large instance type to get better performance, but this instance type is NOT eligible for the Free Tier. You can use t2.micro, but this will mean slower performance.

Now that the machine is running, we can access it from the internet using its public IP address on port 8080. For example, if your EC2 public IP address is 1.2.3.4, you can access the Ambari dashboard by navigating to http://1.2.3.4:8080 and logging in (username: maria_dev, password: maria_dev). You should then see the Ambari dashboard.

Configuring Main Node

The main (and only) node is now up and running, but a cluster with one node is not a cluster. So let’s add a few more servers to our cluster. But before that, let’s configure our main node to accept communication from Ambari client agents.

Ambari client agents and the Ambari server communicate on ports 8440 and 8441. These two ports are not open by default on our Docker image (for some weird reason!). To open them, we need to stop the container, commit it to a new image, and run a new container with the ports exposed, as follows:

$ docker stop sandbox # Stop the container

$ docker commit sandbox image01 # Commit the container to a new image to save previous config inside the container

$ docker rm sandbox # remove old container

# Finally run the new container ....
$ docker run -d -v hadoop:/hadoop --hostname "sandbox.hortonworks.com" --privileged -p 6080:6080 -p 9090:9090 -p 9000:9000 -p 8000:8000 -p 8020:8020 -p 2181:2181 -p 42111:42111 -p 10500:10500 -p 16030:16030 -p 8042:8042 -p 8040:8040 -p 2100:2100 -p 4200:4200 -p 4040:4040 -p 8050:8050 -p 9996:9996 -p 9995:9995 -p 8080:8080 -p 8088:8088 -p 8886:8886 -p 8889:8889 -p 8443:8443 -p 8744:8744 -p 8888:8888 -p 8188:8188 -p 8983:8983 -p 1000:1000 -p 1100:1100 -p 11000:11000 -p 10001:10001 -p 15000:15000 -p 10000:10000 -p 8993:8993 -p 1988:1988 -p 5007:5007 -p 50070:50070 -p 19888:19888 -p 16010:16010 -p 50111:50111 -p 50075:50075 -p 50095:50095 -p 18080:18080 -p 60000:60000 -p 8090:8090 -p 8091:8091 -p 8005:8005 -p 8086:8086 -p 8082:8082 -p 60080:60080 -p 8765:8765 -p 5011:5011 -p 6001:6001 -p 6003:6003 -p 6008:6008 -p 1220:1220 -p 21000:21000 -p 6188:6188 -p 2222:22 -p 8440:8440 -p 8441:8441 --name sandbox image01
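
The port list in the command above is long enough that a typo is easy to miss. One option, sketched here with a hypothetical helper named port_flags and only a handful of the ports, is to generate the -p flags from a plain list of port numbers:

```shell
# Build "-p N:N" flags from a list of port numbers given as arguments.
port_flags() {
  flags=""
  for port in "$@"; do
    flags="${flags:+$flags }-p ${port}:${port}"
  done
  printf '%s\n' "$flags"
}

# Example (shortened port list; the full list is in the command above).
# $(port_flags ...) is left unquoted on purpose so it splits into words:
# docker run -d -v hadoop:/hadoop --hostname "sandbox.hortonworks.com" \
#   --privileged $(port_flags 8080 8440 8441) -p 2222:22 \
#   --name sandbox image01
```

Ports that map to a different number inside the container, like 2222:22, still have to be written out by hand.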

Now find the container’s IP address: run docker ps to get the container ID, then run docker inspect container-id. The second command produces long JSON output that contains a lot of information, including the IP address. Once you have the IP address, you can log in to the container using SSH.

Log in to your container with ssh root@172.17.0.2 (the default password is root). Now, let’s enable our Ambari admin account. To do that, execute the command ambari-admin-password-reset, which helps you set a password for the Ambari admin account. You can then log in to your Ambari dashboard as an admin, which will later allow you to add new hosts to your cluster.

Last, but not least, we need to edit the hosts file /etc/hosts to add the default host name used by Hortonworks:
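
If you would rather not search through the JSON by hand, docker inspect accepts a format template that prints only the address. The helper name container_ip is made up for this sketch, and it assumes the container is called sandbox, as in the docker run command above.

```shell
# Print only the IP address(es) of a container.
# $1 = container name or ID
container_ip() {
  docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' "$1"
}

# Example:
# ssh root@"$(container_ip sandbox)"
```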

$ sudo vi /etc/hosts # Open the file
# And add this line ... replace 172.x.x.x with your instance's private IP address
172.x.x.x sandbox.hortonworks.com sandbox sandbox.hortonworks.com

Configuring Slaves

You can have as many slaves as you want, and they can be as big or as small as you wish. There is only one requirement for the slaves to work well with the master: they should all run the same OS type and version as the master, which in our case is RHEL 6.x. So to create them, you need to find a RHEL AMI.

As with the master, we will open all TCP ports for the slaves. Again, I need to remind you that this is not good practice.

Finally, for each of the slaves:

  1. Connect via SSH
  2. sudo vi /etc/hosts
  3. Add the master private IP along with its sandbox DNS name
    172.x.x.x sandbox.hortonworks.com sandbox sandbox.hortonworks.com
  4. Finally, add the IPs of the other nodes in the cluster along with their private DNS names (as shown on the EC2 page).

By the end of this step, the file /etc/hosts on all the slave instances must be identical, i.e. all of them have records for themselves and for their neighbours.

Also, the file /etc/hosts on the master must contain the slaves’ IPs and private DNS names.

An example of how /etc/hosts should look:

172.xx.xx.xx sandbox.hortonworks.com ip-172-xx-xx-xx.ec2.internal
172.xx.xx.xx ip-172-xx-xx-xx.ec2.internal
172.xx.xx.xx ip-172-xx-xx-xx.ec2.internal
172.xx.xx.xx ip-172-xx-xx-xx.ec2.internal
172.xx.xx.xx ip-172-xx-xx-xx.ec2.internal
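
Typing these lines by hand on every node gets tedious. The slave entries can be generated from a list of private IPs, as in the sketch below. The helper names are hypothetical, and the ip-a-b-c-d.ec2.internal pattern is an assumption that matches the us-east-1 naming in the example above; other regions use a different suffix, so check the Private DNS value on the EC2 page first.

```shell
# Derive the us-east-1 style private DNS name from a private IP.
# $1 = private IP address, e.g. 172.31.5.10
private_dns() {
  printf 'ip-%s.ec2.internal\n' "$(printf '%s' "$1" | tr '.' '-')"
}

# Print one /etc/hosts line per private IP given as an argument.
hosts_lines() {
  for ip in "$@"; do
    printf '%s %s\n' "$ip" "$(private_dns "$ip")"
  done
}

# Example with made-up IPs; appends the lines to /etc/hosts:
# hosts_lines 172.31.5.10 172.31.5.11 | sudo tee -a /etc/hosts
```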

Creating The Cluster

Now the final step is to create the cluster, which should go very smoothly if everything in the previous steps was done properly.

  1. Head to http://master-ip-address:8080 to access Ambari’s dashboard, and log in using the admin account.
  2. On the top navigation bar, click Hosts.
  3. From the Actions menu, select Add New Hosts and follow the instructions of the wizard.

What is Next

At this point, we have a micro cluster running Big Data services like Hadoop, Hive, and others. Now you are ready to do some data analysis on your cloud-hosted cluster … go try it out :).

Image source : https://www.pexels.com/photo/interior-of-office-building-325229/
