Hortonworks is one of the industry leaders when it comes to data solutions. It provides an enterprise-ready platform that facilitates the creation and management of big data clusters, leveraging open-source software such as Apache Ambari and Apache Hadoop, among many others.
Hortonworks provides this as a virtual machine image that is available for the most common virtualisation platforms (Docker, VMware, VirtualBox). Just by downloading the VM file and double-clicking it, you get a single-node cluster running on your PC.
But who needs a cluster that contains only one machine, running in VirtualBox on a small PC? Especially if you have a PC with limited resources like mine. If this is your case, you are on the right path: after reading this post, you will be able to create your own Ambari-managed cluster on AWS.
Before you start, you will need the following:
- Obviously, an AWS account
- AWS CLI installed on your PC
- The Hortonworks HDP 2.5 VM (.ova file). It can be found in the archive under “Hortonworks Sandbox in the Cloud”
First, we need to copy the OVA file to AWS. Here is a short video describing how to create an AMI from a virtual machine using AWS CLI.
Notes About the Video
- The policies used in the previous video can be found in the following gists. You need to make sure you use your S3 buckets names in
- Do not forget to save the Access Key ID and the Secret Access Key for the newly created user, because you are going to need them when you configure AWS CLI.
- The video does not mention anything about configuring AWS CLI. After installing AWS CLI, configure it with this command:
aws configure
This command takes you through a simple wizard like the one shown below. You just need to enter the Access Key ID and the Secret Access Key of the newly created user, and specify the region in which you want your AMI to be created.
$ aws configure
AWS Access Key ID [****************FMWQ]: the-access-key-of-the-user
AWS Secret Access Key [****************ILVR]: the-secret-key-of-the-user
Default region name [us-east-1]: us-east-1
Default output format [json]: json
- One major thing that the video does not mention is that you need to upload the ova file to AWS S3 manually; the importer will not do that for you automatically (that’s what I thought in the beginning). So before using the aws ec2 import-image command, you have to use the aws s3 cp command to upload the ova file. Also below you can find the
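The upload-then-import order described in the last note can be sketched as follows. This is only an illustration, not the script from the video: the bucket name, file name, and description are hypothetical placeholders, and it assumes the IAM role and policies from the video are already in place. The aws calls sit inside a function, so nothing runs until you call it.

```shell
#!/usr/bin/env bash
# Hypothetical placeholders: replace with your own bucket and OVA file name.
BUCKET="my-import-bucket"
OVA="sandbox.ova"

# Print the disk-container JSON that `aws ec2 import-image` reads.
disk_containers_json() {
  local bucket="$1" key="$2"
  printf '[{"Description":"HDP 2.5 sandbox","Format":"ova","UserBucket":{"S3Bucket":"%s","S3Key":"%s"}}]\n' "$bucket" "$key"
}

# Upload the OVA to S3 first, then start the import task that builds the AMI.
upload_and_import() {
  aws s3 cp "$OVA" "s3://$BUCKET/$OVA"
  disk_containers_json "$BUCKET" "$OVA" > containers.json
  aws ec2 import-image --description "HDP 2.5 sandbox" \
    --disk-containers file://containers.json
}

# The import runs asynchronously; its progress can be polled with:
# aws ec2 describe-import-image-tasks --import-task-ids <import-task-id>
```

Call upload_and_import once AWS CLI is configured. The import task ID printed by import-image is what you pass to describe-import-image-tasks.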
Creating The EC2 Instances
Now that the AMI is created in the specified region (us-east-1 in my case), we can use it to create the instance that will run Ambari in the HDP 2.5 VM.
To do so, first log in to your AWS console, and navigate to EC2
From there, select “AMIs” from the left panel
In the “AMIs” section, you will find your imported VM (provided that you are in the region that you selected when you configured your AWS CLI). From here, right click the AMI entry and select “Launch”
Now just follow the instructions in the wizard until you reach “6. Configure Security Group”. In this step we will allow all TCP traffic from anywhere.
WARNING! Please note that this is not good practice; you should open only the ports you need. I am doing this here for educational purposes, to ease the installation. We will talk about this in another post.
Note: For the instance I created I am using t2.large instance type, to get better performance, but this instance type is NOT eligible for Free Tier. You can use t2.micro but this will mean slower performance.
Now that the machine is running, we can access it from the internet using its public IP address on port 8080. If your EC2 public IP address is, for example, 126.96.36.199, then you can access the Ambari dashboard by navigating to http://126.96.36.199:8080. After logging in (username: maria_dev, password: maria_dev) you should see something similar to the picture below:
Configuring Main Node
The main (and only) node is now up and running, but a cluster with one node is not a cluster. So let’s add a few more servers to our cluster. But before that, let’s configure our main node to accept communication from Ambari client agents.
Ambari client agents and the Ambari server communicate on ports 8440 and 8441. These two ports are not open by default on our docker image (for some weird reason!). To open them we need to stop the container, expose the ports, and run the container again, as follows:
$ docker stop sandbox # Stop the container
$ docker commit sandbox image01 # Commit the container to a new image to save previous config inside the container
$ docker rm sandbox # remove old container
# Finally run the new container ....
$ docker run -d -v hadoop:/hadoop --hostname "sandbox.hortonworks.com" --privileged -p 6080:6080 -p 9090:9090 -p 9000:9000 -p 8000:8000 -p 8020:8020 -p 2181:2181 -p 42111:42111 -p 10500:10500 -p 16030:16030 -p 8042:8042 -p 8040:8040 -p 2100:2100 -p 4200:4200 -p 4040:4040 -p 8050:8050 -p 9996:9996 -p 9995:9995 -p 8080:8080 -p 8088:8088 -p 8886:8886 -p 8889:8889 -p 8443:8443 -p 8744:8744 -p 8888:8888 -p 8188:8188 -p 8983:8983 -p 1000:1000 -p 1100:1100 -p 11000:11000 -p 10001:10001 -p 15000:15000 -p 10000:10000 -p 8993:8993 -p 1988:1988 -p 5007:5007 -p 50070:50070 -p 19888:19888 -p 16010:16010 -p 50111:50111 -p 50075:50075 -p 50095:50095 -p 18080:18080 -p 60000:60000 -p 8090:8090 -p 8091:8091 -p 8005:8005 -p 8086:8086 -p 8082:8082 -p 60080:60080 -p 8765:8765 -p 5011:5011 -p 6001:6001 -p 6003:6003 -p 6008:6008 -p 1220:1220 -p 21000:21000 -p 6188:6188 -p 2222:22 -p 8440:8440 -p 8441:8441 --name sandbox image01
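A long run of -p flags is easy to mistype, so one option is to generate it from a list of ports. The sketch below is illustrative only: the port list is deliberately shortened (the full set is in the command above), and the script prints the resulting command for review rather than running it.

```shell
#!/usr/bin/env bash
# Shortened, illustrative port list; extend it with the ports from the full command.
PORTS="8080 8088 8440 8441 50070"

# Turn "8080 8440" into "-p 8080:8080 -p 8440:8440".
port_flags() {
  local flags="" p
  for p in "$@"; do
    flags="$flags -p $p:$p"
  done
  echo "${flags# }"
}

# Print the docker run command instead of executing it.
# 2222:22 is added by hand because its host and container ports differ.
echo docker run -d -v hadoop:/hadoop --hostname sandbox.hortonworks.com --privileged \
  $(port_flags $PORTS) -p 2222:22 --name sandbox image01
```

Once the printed command looks right, remove the leading echo to actually start the container.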
Now check the container’s IP address: run docker ps to get the container ID, then run docker inspect container-id with that ID. The second command gives a long JSON output containing a lot of information, including the IP address. Once you have the IP address, you can log in to the container using SSH.
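Instead of scanning the full JSON, docker inspect accepts a Go template through its -f flag and can print only the address. A minimal sketch, assuming the container is named sandbox as in the commands above (the function name is my own):

```shell
#!/usr/bin/env bash
# Print the IP address(es) of a container, given its name or ID.
container_ip() {
  docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' "$1"
}

# Example: container_ip sandbox
```

docker inspect accepts either a container name or an ID, so the docker ps step can be skipped when the name is known.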
Log in to your container with ssh email@example.com (the default password is root). Now, let’s enable our Ambari admin account. To do that, execute the command ambari-admin-password-reset. This command helps you set a password for your Ambari admin account, so that you can log in to your Ambari dashboard as an admin, which will later allow you to add new hosts to your cluster.
Last, but not least, we need to change the hosts file /etc/hosts to add the default name used by Hortonworks:
$ sudo vi /etc/hosts # Open the file
# And add this line ... replace 172.x.x.x with your instance's private IP address
172.x.x.x sandbox.hortonworks.com sandbox sandbox.hortonworks.com
You can have as many slaves as you want, as big or as small as you wish. The only requirement for the slaves to work well with the master is that they all run the same OS type and version as the master; in our case, RHEL 6.x. So to create them you need to find a RHEL AMI.
As with the master, we will open all TCP ports for the slaves. Again, I need to remind you that this is not good practice.
Finally, for each of the slaves:
- Connect via SSH
- Open the hosts file: sudo vi /etc/hosts
- Add the master private IP along with its sandbox DNS name:
172.x.x.x sandbox.hortonworks.com sandbox sandbox.hortonworks.com
- Finally, add the IPs of the other nodes in the cluster along with their private DNS names (as defined in the EC2 page).
By the end of this step, the file /etc/hosts on all the slave instances must be identical, i.e. all of them have records of themselves and of their neighbours.
Also, the file /etc/hosts on the master must contain references to the slaves’ IPs and private DNS names.
An example of how /etc/hosts should look:
172.xx.xx.xx sandbox.hortonworks.com ip-172-xx-xx-xx.ec2.internal
172.xx.xx.xx ip-172-xx-xx-xx.ec2.internal
172.xx.xx.xx ip-172-xx-xx-xx.ec2.internal
172.xx.xx.xx ip-172-xx-xx-xx.ec2.internal
172.xx.xx.xx ip-172-xx-xx-xx.ec2.internal
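Editing the same file by hand on every node invites typos and duplicate lines. A small helper along these lines could add an entry only when it is missing, so it is safe to run more than once. This is a sketch: the function name and the addresses in the example are hypothetical.

```shell
#!/usr/bin/env bash
# Append "IP NAME..." to a hosts file unless the first name is already listed.
# Usage: add_host_entry FILE IP NAME [ALIAS...]
add_host_entry() {
  local file="$1" ip="$2"
  shift 2
  # -F: match the name literally, -w: match whole words only.
  if grep -qwF -- "$1" "$file" 2>/dev/null; then
    return 0
  fi
  echo "$ip $*" >> "$file"
}

# Example (run as root, with your own private IPs and DNS names):
# add_host_entry /etc/hosts 172.x.x.x sandbox.hortonworks.com sandbox
# add_host_entry /etc/hosts 172.x.x.x ip-172-x-x-x.ec2.internal
```

Running the same list of add_host_entry calls on the master and on every slave is one way to keep the files identical, as required above.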
Creating The Cluster
Now the final step is to create the cluster, which should be very smooth if everything in the previous steps was done properly.
- Head to http://master-ip-address:8080 to access Ambari’s dashboard, and log in using the admin account.
- On the top navigation bar, hit Hosts
- From the Actions menu, select Add New Hosts and follow the instructions of the wizard.
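Once the wizard finishes, one way to double-check that the new hosts registered is to ask Ambari's REST API for its host list. The sketch below assumes the admin account enabled earlier and Ambari listening on port 8080; the function names are my own, and the address and password in the example are placeholders.

```shell
#!/usr/bin/env bash
# Build the URL of Ambari's host list for a given server address.
ambari_hosts_url() {
  echo "http://$1:8080/api/v1/hosts"
}

# Fetch the registered hosts as JSON; $1 = master address, $2 = admin password.
list_ambari_hosts() {
  curl -s -u "admin:$2" -H 'X-Requested-By: ambari' "$(ambari_hosts_url "$1")"
}

# Example (placeholders): list_ambari_hosts master-ip-address 'your-admin-password'
```

The response is JSON; each registered host should appear with its host_name, which should match the private DNS names you put in /etc/hosts.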
What is Next
At this point, we have a micro cluster, running Big Data services like Hadoop, Hive, and others. Now you are ready to do some data analysis, on your cloud hosted cluster … go try it out :).
Image source : https://www.pexels.com/photo/interior-of-office-building-325229/