Enterprise Reference Architecture for Apache Hadoop – AWS Cloud

Architecture for Apache

Many Fortune 500 organizations are adopting AWS to deploy Java application services, and Cloudera Enterprise makes it possible for them to deploy Hadoop clusters in the AWS cloud as well.

According to one trusted business publication, ‘Running Cloudera Enterprise on AWS provides the greatest flexibility in deploying Hadoop – while providing robust security, governance, and data management.’

AWS Overview

A Cloudera Enterprise deployment can use the following AWS services:

1) Elastic Compute Cloud (EC2)

Amazon EC2 provides a wide selection of instances optimized to fit different use cases.

Instances are available under the following purchasing options:

  • On-Demand Instances
  • Reserved Instances
  • Spot Instances

2) Simple Storage Service (S3)

S3 is designed for 99.999999999% (eleven nines) durability, and the S3 Standard storage class is designed for 99.99% availability. It is a crucial component of an AWS data lake.
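
To make the eleven-nines durability figure concrete, the following sketch works out the expected number of objects lost per year. It assumes the figure is read as an annual per-object durability, and the object count of 10,000,000 is a hypothetical input chosen for illustration.

```python
from decimal import Decimal

# Durability figure quoted above, read as an annual per-object probability
# of survival (an assumption made for this illustration).
DURABILITY = Decimal("0.99999999999")


def expected_annual_loss(object_count: int) -> Decimal:
    """Expected number of objects lost per year for a given object count."""
    return Decimal(object_count) * (Decimal(1) - DURABILITY)


def years_per_lost_object(object_count: int) -> Decimal:
    """Average number of years before a single object is expected to be lost."""
    return Decimal(1) / expected_annual_loss(object_count)


if __name__ == "__main__":
    stored_objects = 10_000_000  # hypothetical bucket size
    print("Expected objects lost per year:", expected_annual_loss(stored_objects))
    print("Years per lost object:", years_per_lost_object(stored_objects))
```

Under these assumptions, storing ten million objects corresponds to an expected loss of about one object every ten thousand years.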

3) Relational Database Service (RDS)

Relational Database Service (RDS) lets users provision various kinds of supported relational database instances, including Oracle and MySQL. RDS handles database management tasks such as backups with a user-defined retention period, point-in-time recovery, patch management, and replication, freeing users to focus on higher-value application development or database refinements.

4) Virtual Private Cloud (VPC)

A Virtual Private Cloud is an isolated network environment for running Hadoop clusters and applications in the cloud.

Within a VPC, we can create multiple subnets in different Availability Zones (AZs).
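
As a rough illustration of spreading subnets across zones, the sketch below carves a VPC address range into one equally sized subnet per zone using Python's standard `ipaddress` module. The CIDR block `10.0.0.0/16`, the zone names, and the `plan_subnets` helper are all hypothetical and are not taken from the Cloudera reference architecture.

```python
import ipaddress


def plan_subnets(vpc_cidr: str, zones: list[str], new_prefix: int) -> dict[str, str]:
    """Assign one equally sized subnet of the VPC range to each zone."""
    network = ipaddress.ip_network(vpc_cidr)
    candidates = network.subnets(new_prefix=new_prefix)
    plan = {}
    for zone in zones:
        try:
            plan[zone] = str(next(candidates))
        except StopIteration:
            # The VPC range cannot hold one subnet per requested zone.
            raise ValueError("VPC CIDR is too small for the requested zones") from None
    return plan


if __name__ == "__main__":
    # Hypothetical VPC range and zone names, for illustration only.
    zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
    for zone, cidr in plan_subnets("10.0.0.0/16", zones, 24).items():
        print(zone, cidr)
```

The plan only computes address ranges; creating the subnets themselves would still be done through the AWS console, CLI, or an SDK.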

Deployment Architecture

This section describes Cloudera’s recommendations and best practices applicable to Hadoop cluster system architecture.

Instance Types & Resource Planning

Cloudera recommends deploying these instance types into production:

  • Compute Optimized – used for compute-intensive applications such as image processing
  • Memory Optimized – used for in-memory processing such as Spark applications
  • Storage Optimized – used for storage-intensive workloads
  • General Purpose – used for balanced, general-purpose workloads
  • Accelerated Computing (GPU) – used for gaming and 3D visualization
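
The mapping above can be expressed as a simple lookup. This is only a sketch: the workload labels and the `pick_category` helper are hypothetical names, and falling back to General Purpose for unknown workloads is an assumption rather than a Cloudera recommendation.

```python
from enum import Enum


class InstanceCategory(Enum):
    COMPUTE_OPTIMIZED = "Compute Optimized"
    MEMORY_OPTIMIZED = "Memory Optimized"
    STORAGE_OPTIMIZED = "Storage Optimized"
    GENERAL_PURPOSE = "General Purpose"
    ACCELERATED_COMPUTING = "Accelerated Computing (GPU)"


# Workload labels are hypothetical; the pairings follow the list above.
WORKLOAD_TO_CATEGORY = {
    "image-processing": InstanceCategory.COMPUTE_OPTIMIZED,
    "in-memory": InstanceCategory.MEMORY_OPTIMIZED,
    "spark": InstanceCategory.MEMORY_OPTIMIZED,
    "storage": InstanceCategory.STORAGE_OPTIMIZED,
    "gaming": InstanceCategory.ACCELERATED_COMPUTING,
    "3d-visualization": InstanceCategory.ACCELERATED_COMPUTING,
}


def pick_category(workload: str) -> InstanceCategory:
    """Return the instance category for a workload label.

    Unknown workloads fall back to General Purpose (an assumption).
    """
    return WORKLOAD_TO_CATEGORY.get(workload.lower(), InstanceCategory.GENERAL_PURPOSE)


if __name__ == "__main__":
    for workload in ("Spark", "image-processing", "web-server"):
        print(workload, "->", pick_category(workload).value)
```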

Hosts & Services Layout

  • Master Nodes
  • Worker Nodes
  • Utility Nodes
  • Edge Nodes

Master Nodes

  • ResourceManager
  • NameNode
  • Standby NameNode
  • JournalNodes
  • ZooKeeper

Worker Nodes

  • HDFS DataNode
  • YARN NodeManager
  • HBase RegionServer
  • Impala Daemons
  • Solr Servers

Utility Nodes

  • Cloudera Manager
  • Cloudera Management Services
  • JournalNode
  • ZooKeeper
  • Oozie
  • Hive Server
  • Impala Catalog Server
  • Impala State Store
  • Job History Server

Edge Nodes

  • Third-party tools
  • Hadoop command-line client
  • Hive command-line client
  • Impala command-line client
  • Flume agents
  • Hue Server
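
The layout above can be captured as plain data, which makes it easy to check where a given service runs. The sketch below is a minimal illustration: the `LAYOUT` dictionary and the `nodes_hosting` helper are hypothetical names, and "JournalNode" is spelled in the singular throughout so that the same service matches across node types.

```python
# Role layout transcribed from the lists above. "JournalNode" is normalized
# to the singular form so lookups match on both master and utility nodes.
LAYOUT = {
    "master": [
        "ResourceManager", "NameNode", "Standby NameNode",
        "JournalNode", "ZooKeeper",
    ],
    "worker": [
        "HDFS DataNode", "YARN NodeManager", "HBase RegionServer",
        "Impala Daemons", "Solr Servers",
    ],
    "utility": [
        "Cloudera Manager", "Cloudera Management Services", "JournalNode",
        "ZooKeeper", "Oozie", "Hive Server", "Impala Catalog Server",
        "Impala State Store", "Job History Server",
    ],
    "edge": [
        "Third-party tools", "Hadoop command-line client",
        "Hive command-line client", "Impala command-line client",
        "Flume agents", "Hue Server",
    ],
}


def nodes_hosting(service: str) -> list[str]:
    """Return the node types whose layout includes the given service."""
    return [node for node, services in LAYOUT.items() if service in services]


if __name__ == "__main__":
    print(nodes_hosting("ZooKeeper"))
    print(nodes_hosting("Hue Server"))
```

A lookup like this shows at a glance that ZooKeeper and JournalNode are listed on both master and utility nodes.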

Regions and Availability Zones

Regions are separate geographical locations in which AWS hosts its services.

Availability Zones

An Availability Zone (AZ) is a fully isolated piece of infrastructure with one or more data centers. Each AZ has independent power, cooling, and physical security, and AZs are connected to one another via redundant, ultra-low-latency networks.

Amazon Machine Image (AMI)

Amazon Machine Images (AMIs) are the virtual machine images from which EC2 instances are launched.

An AMI is a template that contains a software configuration, including an operating system, an application server, and applications.

When launching EC2 instances, AMIs can be obtained from several sources:

  • AWS Marketplace
  • Community AMI
  • My AMI

Cloud Storage Options

1) Object Storage (S3)

Unlike HDFS, S3 is an object store that holds virtually unlimited data and offers industry-grade security and data availability.

S3 offers robust capabilities for managing access, cost, replication, and data protection.

Data in S3 is stored as objects organized into buckets.

Each object in a bucket is stored under one of the S3 storage classes, which differ in access pattern and cost.


2) Elastic Block Storage (EBS)

Amazon Elastic Block Store (Amazon EBS) provides block-level storage volumes for use with EC2 instances.


EBS Volume types:

  • SSD-backed volumes – optimized for transactional workloads involving frequent read/write operations with small I/O size, where the dominant performance attribute is IOPS
  • HDD-backed volumes – optimized for large streaming workloads where throughput is a better performance measure than IOPS
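
The distinction between IOPS and throughput follows from simple arithmetic: throughput is the number of operations per second multiplied by the size of each operation. The sketch below illustrates this and adds a rough rule of thumb for picking a volume family. The 256 KiB threshold and the `choose_volume_family` helper are hypothetical, not AWS or Cloudera guidance.

```python
# Hypothetical cut-off above which sequential I/O is treated as "large streaming".
LARGE_IO_KIB = 256


def throughput_mib_per_s(iops: float, io_size_kib: float) -> float:
    """Throughput in MiB/s implied by an IOPS rate and an I/O size."""
    return iops * io_size_kib / 1024


def choose_volume_family(io_size_kib: float, access_pattern: str) -> str:
    """Return "HDD" for large sequential I/O and "SSD" otherwise."""
    if access_pattern not in {"random", "sequential"}:
        raise ValueError("access_pattern must be 'random' or 'sequential'")
    if access_pattern == "sequential" and io_size_kib >= LARGE_IO_KIB:
        return "HDD"
    return "SSD"


if __name__ == "__main__":
    # Many small operations: high IOPS, modest throughput.
    print(throughput_mib_per_s(iops=3000, io_size_kib=16))
    # Few large operations: low IOPS, high throughput.
    print(throughput_mib_per_s(iops=500, io_size_kib=1024))
    print(choose_volume_family(16, "random"))
    print(choose_volume_family(1024, "sequential"))
```

A workload issuing many small random operations is limited by IOPS, while one streaming large sequential blocks is limited by throughput, which is why the two volume families are measured differently.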

3) Enterprise Relational Database

Cloudera Enterprise deployments require relational databases for the following components: Cloudera Manager, Cloudera Navigator, Hive metastore, Hue, Sentry, Oozie, and others.

On AWS, these databases can be hosted on Amazon RDS.

Relational Database Service (RDS) is a fully managed AWS service with a choice of popular database engines.

Cloudera supports the following enterprise databases:

  • MySQL
  • Oracle Database
  • MariaDB
  • PostgreSQL

Deployment Topologies

1. Cluster inside a public subnet in VPC

A public subnet in this context is a subnet with a route to the Internet gateway. Instances provisioned in public subnets inside a VPC can have direct access to the Internet as well as to other external services, such as AWS services in another region.

2. Cluster inside a private subnet in VPC

Instances provisioned in private subnets inside a VPC don’t have direct access to the Internet. Instead, they reach AWS services through VPC endpoints and reach the Internet through a NAT.
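
The difference between the two topologies comes down to the subnet's route table. The sketch below classifies a subnet as public or private by checking whether its default route points at an Internet gateway. The route dictionaries are a simplified, hypothetical model rather than the shape returned by any AWS API, and the example IDs are made up; the check relies only on the `igw-` prefix that AWS uses for Internet gateway IDs.

```python
DEFAULT_ROUTE = "0.0.0.0/0"


def classify_subnet(routes: list[dict]) -> str:
    """Return "public" if the default route targets an Internet gateway."""
    for route in routes:
        if route["destination"] == DEFAULT_ROUTE and route["target"].startswith("igw-"):
            return "public"
    return "private"


if __name__ == "__main__":
    # Hypothetical route tables; the IDs are placeholders.
    public_routes = [
        {"destination": "10.0.0.0/16", "target": "local"},
        {"destination": "0.0.0.0/0", "target": "igw-0example"},
    ]
    private_routes = [
        {"destination": "10.0.0.0/16", "target": "local"},
        {"destination": "0.0.0.0/0", "target": "nat-0example"},
    ]
    print(classify_subnet(public_routes))
    print(classify_subnet(private_routes))
```

In this model a subnet whose default route targets a NAT is still private, because inbound connections from the Internet cannot be initiated through it.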
