AWS EMR

Learn about the architecture of Amazon EMR and how it helps in data processing.

Amazon EMR (previously called Elastic MapReduce) is a cloud-based service offered by Amazon Web Services (AWS) that helps us process and analyze large amounts of data. It simplifies running big data frameworks like Hadoop and Spark on AWS for data processing and analysis. As a managed service, it removes the complexity of managing big data infrastructure: it scales processing power with data volume, and we pay only for the resources we use. In this lesson, we will learn about the features of EMR and how it works.


Amazon EMR cluster

The core processing unit of Amazon EMR is the cluster: a group of Amazon EC2 instances working together as a single compute resource. Each instance is called a node. Nodes are categorized into different types depending on the roles they perform, which are determined by the software components that Amazon EMR installs on them.

Let’s look at the different types of nodes (a configuration sketch follows the list):

  • Primary node: The primary node in an Amazon EMR cluster has a software component that manages the overall coordination and execution of tasks in the cluster. It coordinates the distribution of tasks across core and task nodes, monitors the health of the cluster, and manages communication between nodes.

  • Core node: Core nodes have a software component that lets them store and process data. They typically run the Hadoop Distributed File System (HDFS) and execute data processing, storage, and retrieval tasks. Core nodes store data blocks and perform data replication for fault tolerance.

  • Task node: Task nodes are additional compute resources used for processing tasks in parallel. Their software component does not store data the way core nodes do; they only execute tasks assigned by the primary node. These are optional nodes and are often added to increase processing capacity without increasing storage capacity.
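
As a minimal sketch of how these roles map onto cluster configuration, here is a hypothetical boto3 `run_job_flow` call that defines one primary, two core, and two task instances. The cluster name, region, instance types, and counts are illustrative assumptions:

```python
import boto3

# Hypothetical example: name, region, types, and counts are illustrative.
emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="example-cluster",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}, {"Name": "Hive"}],
    Instances={
        "InstanceGroups": [
            # The primary node coordinates the cluster.
            {"Name": "Primary", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            # Core nodes run HDFS and both store and process data.
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
            # Optional task nodes add compute only; the Spot market keeps them cheap.
            {"Name": "Task", "InstanceRole": "TASK",
             "InstanceType": "m5.xlarge", "InstanceCount": 2,
             "Market": "SPOT"},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```

Using the Spot market for the task group reflects its optional, compute-only role: losing a task node costs capacity, not stored data.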


Amazon EMR architecture

The Amazon EMR service architecture consists of multiple layers, each providing the cluster with specific functionality and capabilities. These layers are discussed below:

Storage

Amazon EMR offers several storage options for big data processing requirements (a short path sketch follows the list):

  • HDFS (Hadoop Distributed File System): This distributed and scalable storage system is ideal for large datasets. It stores data across multiple cluster instances for redundancy and is often used for temporary storage, like intermediate results during processing.

  • EMRFS (EMR File System): This extends Hadoop’s capabilities by allowing direct access to data stored in Amazon S3, a scalable object storage service. EMRFS functions like a file system similar to HDFS but leverages S3 for persistent storage. This is useful for storing input, output, and intermediate results.

  • Local file system: This refers to the temporary storage available on each EC2 instance within the cluster. This storage is ephemeral and disappears when the instance terminates.
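
To applications on the cluster, these three layers are mostly a matter of which URI scheme a path uses. The following hypothetical PySpark snippet (bucket and paths are made up) illustrates the distinction:

```python
# Hypothetical PySpark snippet run on an EMR cluster; bucket and paths are made up.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-demo").getOrCreate()

# EMRFS: "s3://" URIs read and write durable objects in Amazon S3.
df = spark.read.json("s3://example-bucket/input/events/")

# HDFS: "hdfs:///" paths target the cluster's distributed storage, a good
# fit for intermediate results that may disappear with the cluster.
df.write.mode("overwrite").parquet("hdfs:///tmp/events_staged/")

# Local file system: "file:///" paths point at an instance's own ephemeral
# disk and vanish when that instance terminates.
df.limit(10).write.mode("overwrite").csv("file:///tmp/events_sample/")
```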

Cluster resource management

Amazon EMR uses YARN (Yet Another Resource Negotiator) to manage resources and schedule jobs within the big data cluster. YARN allocates computing resources such as CPU and memory, and it acts as a central coordinator for various data processing frameworks. Here are some key points:

  • Centralized resource management: YARN manages resources across the cluster, ensuring efficient allocation for running jobs.

  • Agent on each node: EMR has an agent on each cluster node to manage YARN components, maintain cluster health, and communicate with the EMR service.

  • Node labeling: EMR automatically labels core nodes to differentiate them from task nodes, allowing YARN schedulers to prioritize core nodes for application masters.

  • Application primary placement: EMR ensures that critical application primary processes (which control jobs) run only on core nodes (reliable instances) to prevent job failures caused by the termination of Spot Instances (used for task nodes).

  • Configuration management: EMR pre-configures YARN properties to achieve optimal scheduling based on node labels. Manual modification of these configurations could disrupt this functionality.

YARN acts like a traffic controller for the cluster’s resources, ensuring jobs run smoothly even when potentially volatile Spot Instances handle some tasks. While YARN is the default, some EMR applications might use different resource managers. A short configuration sketch follows.
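
These YARN properties are surfaced through EMR configuration classifications at cluster launch. Below is a minimal sketch, assuming a boto3 `run_job_flow` call like the one earlier; the property shown is an illustrative tuning example, not one of the node-label defaults, which, as noted above, are best left unmodified:

```python
# Sketch: overriding a yarn-site property at cluster creation time.
# The property below is an illustrative assumption, not an EMR default;
# EMR's own node-label settings should generally be left untouched.
configurations = [
    {
        "Classification": "yarn-site",
        "Properties": {
            "yarn.nodemanager.vmem-check-enabled": "false",
        },
    }
]

# Passed along with the rest of the cluster definition:
# emr.run_job_flow(..., Configurations=configurations)
```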

Data processing frameworks

EMR offers various data processing frameworks to suit big data needs. Here’s a breakdown of two main options:

  • Hadoop MapReduce: This open-source framework simplifies writing parallel distributed applications. It handles the underlying logic while we provide the core processing functions (Map and Reduce), which transform and combine data to produce the desired output. Several higher-level frameworks, such as Hive, can automatically generate MapReduce programs (see the word-count sketch after this list).

  • Apache Spark: This is another open-source framework for big data processing. Spark uses directed acyclic graphs (DAGs) to build efficient execution plans and can cache data in memory, leading to faster processing compared to the traditional disk-based methods of Hadoop MapReduce. When running Spark on EMR, we can leverage EMRFS to directly access data stored in Amazon S3. Spark also supports interactive query modules like Spark SQL for data exploration (a PySpark sketch follows the comparison below).
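
To make the Map and Reduce division concrete, here is a minimal word-count pair written for Hadoop Streaming, which lets MapReduce run scripts that read stdin and write stdout. This is a sketch under the assumption that the job is launched with the streaming jar; the file names are hypothetical.

```python
# mapper.py -- emits "word<TAB>1" for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
# reducer.py -- sums the counts per word; Hadoop delivers mapper output
# sorted by key, so identical words arrive as a contiguous run.
import sys

current, count = None, 0
for line in sys.stdin:
    word, value = line.rsplit("\t", 1)
    if word != current:
        if current is not None:
            print(f"{current}\t{count}")
        current, count = word, 0
    count += int(value)
if current is not None:
    print(f"{current}\t{count}")
```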


The choice between these frameworks depends on the specific use case. Hadoop MapReduce is a good choice for traditional batch processing tasks, while Spark offers more flexibility for various processing needs and faster performance, especially when working with in-memory data.
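As a short illustration of Spark’s in-memory, SQL-friendly style, the following PySpark sketch caches a dataset read from S3 through EMRFS and queries it with Spark SQL. The bucket and column names are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

# EMRFS lets Spark read directly from S3; bucket and columns are made up.
orders = spark.read.parquet("s3://example-bucket/orders/")

# Caching keeps the dataset in memory for the queries that follow.
orders.cache()

orders.createOrReplaceTempView("orders")
top_customers = spark.sql("""
    SELECT customer_id, SUM(amount) AS total
    FROM orders
    GROUP BY customer_id
    ORDER BY total DESC
    LIMIT 10
""")
top_customers.show()
```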

Data processing in EMR

EMR clusters are equipped with various frameworks and applications for data processing. We can process data in these clusters through two main methods:

  • Connecting to the cluster’s primary node and using the interfaces or tools provided by the installed software to submit jobs and interact directly.

  • Submitting a sequence of ordered steps to the cluster. Each step acts as a unit of work with instructions to manipulate data using software installed on the cluster.

EMR typically uses data stored in the chosen file system (such as S3 or HDFS) as input. This data is passed between steps in the processing sequence, and the final step writes the processed data to a specified location (e.g., an S3 bucket). Let’s look at the execution order of the steps (a small status-polling sketch follows the list):

  1. A request triggers step processing.

  2. All steps are initially set to PENDING.

  3. The first step transitions to RUNNING, while others remain PENDING.

  4. Upon completion, the first step changes to COMPLETED.

  5. The next step starts running and changes to RUNNING. This pattern repeats until all steps are completed.
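
These states are visible through the EMR API. Here is a minimal boto3 sketch (the cluster ID is hypothetical) that reads them:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Hypothetical cluster ID; list_steps returns the most recent steps first.
for step in emr.list_steps(ClusterId="j-EXAMPLE12345")["Steps"]:
    print(step["Name"], step["Status"]["State"])  # e.g., PENDING, RUNNING, COMPLETED
```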

An example of a three-step process would be submitting an input dataset, processing the dataset using a Hive program, and writing the final output dataset.
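
Below is a sketch of how that three-step sequence might be submitted with boto3. The cluster ID, script, and bucket locations are hypothetical; each step invokes command-runner.jar, which EMR provides for running installed applications and utilities such as s3-dist-cp:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Hypothetical cluster ID, script, and bucket locations.
response = emr.add_job_flow_steps(
    JobFlowId="j-EXAMPLE12345",
    Steps=[
        {
            "Name": "Copy input dataset",
            "ActionOnFailure": "CANCEL_AND_WAIT",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["s3-dist-cp",
                         "--src", "s3://example-bucket/raw/",
                         "--dest", "hdfs:///input/"],
            },
        },
        {
            "Name": "Run Hive program",
            "ActionOnFailure": "CANCEL_AND_WAIT",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["hive-script", "--run-hive-script", "--args",
                         "-f", "s3://example-bucket/scripts/transform.hql"],
            },
        },
        {
            "Name": "Write output dataset",
            "ActionOnFailure": "CANCEL_AND_WAIT",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["s3-dist-cp",
                         "--src", "hdfs:///output/",
                         "--dest", "s3://example-bucket/results/"],
            },
        },
    ],
)
print(response["StepIds"])
```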


If a step fails, its state changes to FAILED. We can define how subsequent steps behave in case of failure (see the sketch after this list):

  • Default: The remaining steps are set to CANCELLED and won’t run.

  • Option 1: Ignore the failure and continue processing with the remaining steps.

  • Option 2: Terminate the entire cluster immediately.
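
These behaviors map onto each step’s ActionOnFailure field in the EMR API, sketched below with the step body elided:

```python
# Sketch: a step's ActionOnFailure controls what happens if it fails.
step = {
    "Name": "Example step",
    # "CANCEL_AND_WAIT"   -> remaining steps are CANCELLED (the default behavior above)
    # "CONTINUE"          -> ignore the failure and run the remaining steps
    # "TERMINATE_CLUSTER" -> shut the whole cluster down immediately
    "ActionOnFailure": "CANCEL_AND_WAIT",
    "HadoopJarStep": {"Jar": "command-runner.jar", "Args": ["..."]},
}
```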

Understanding these methods for submitting work and processing data empowers us to effectively leverage EMR clusters for big data analytics tasks.

Benefits of AWS EMR

Here are some of the key benefits of using Amazon EMR for big data processing:

  • Simplified cluster management: EMR eliminates the need to manually provision, configure, and manage Hadoop cluster infrastructure. EMR handles these tasks for us, allowing us to focus on developing and running the data processing applications.

  • Scalability: EMR clusters can be easily scaled up or down based on the processing needs. We can add or remove nodes to adjust processing power as required. This helps optimize costs by paying only for the resources we use.

  • Cost-effectiveness: EMR can leverage multiple instance purchase options to optimize costs. For example, we can use Reserved Instances for the primary and core nodes and Spot Instances for task nodes. This can significantly reduce processing costs compared to using On-Demand Instances all the time.

  • Integration with AWS services: EMR integrates seamlessly with other AWS services like Amazon S3 for data storage, Amazon DynamoDB for NoSQL databases, and Amazon CloudWatch for monitoring cluster performance.

  • Security: EMR offers features like IAM roles and cluster security configuration to restrict access to the cluster and data. Additionally, EMR utilizes data encryption to ensure the security of our sensitive data during processing.

  • Flexibility and ease of use: EMR offers flexibility in submitting work to the clusters. We can define processing steps during cluster creation, submit jobs directly to applications, or run ordered steps within the cluster. It also provides a user-friendly interface and tools for launching, managing, and monitoring the big data clusters. We can use the EMR console, API, or AWS CLI to interact with the EMR clusters, as the brief sketch after this list shows.
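
For example, here is a minimal boto3 sketch that lists active clusters through the EMR API:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# List clusters that are currently starting, running, or waiting.
clusters = emr.list_clusters(ClusterStates=["STARTING", "RUNNING", "WAITING"])
for c in clusters["Clusters"]:
    print(c["Id"], c["Name"], c["Status"]["State"])
```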

In essence, Amazon EMR provides a managed platform that simplifies big data processing on AWS by offering scalability, cost-effectiveness, a variety of tools, integration with other AWS services, and a user-friendly experience.
