Amazon Redshift
Explore how Amazon Redshift helps create high-performance data warehouses.
Amazon Redshift is a fully managed data warehousing service that can scale to petabytes of data. It is based on PostgreSQL and can be queried using standard SQL. Amazon Redshift stores data in a columnar format and uses a query engine optimized for analytic queries.
How Amazon Redshift works
Redshift uses a distributed architecture with multiple nodes working in parallel to execute queries. This Massively Parallel Processing (MPP) architecture enables Redshift to run complex queries and scale horizontally to process large datasets quickly.
The core component of Redshift is a cluster. A cluster consists of multiple compute nodes and a leader node.
We connect to the leader node to interact with the cluster. The leader node generates a query execution plan and aggregates results. It manages the communication between the client and the compute nodes.
The compute nodes execute the query and return the results to the leader node. Each compute node has its own memory and processing power, determined by its node type. As the workload increases, we can change the node type, the number of nodes, or both.
Redshift creates a separate isolated network for the leader and compute nodes within which the nodes communicate over high bandwidth connections and custom communication protocols. The illustration below depicts an overview of the Redshift architecture.
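To make the data-distribution side of this architecture concrete, here is a minimal sketch of a table definition that controls how rows are spread across compute nodes. The sales table and its columns are hypothetical; DISTKEY and SORTKEY are standard Redshift DDL options.

```sql
-- Minimal sketch: a hypothetical sales table whose rows Redshift spreads
-- across compute nodes. DISTKEY controls which node a row lands on;
-- SORTKEY controls on-disk ordering within each node's slices.
CREATE TABLE sales (
    sale_id     BIGINT,
    customer_id BIGINT,
    product_id  BIGINT,
    sale_date   DATE,
    amount      DECIMAL(12, 2)
)
DISTSTYLE KEY
DISTKEY (customer_id)
SORTKEY (sale_date);
```

Choosing a distribution key that rows are frequently joined or grouped on keeps related rows on the same node, which reduces data movement between compute nodes during MPP query execution.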
Load data in Redshift
To optimize costs and performance, data should be loaded in batches into the Redshift cluster. There are three main methods to load data into a Redshift cluster.
Amazon Kinesis Firehose to Redshift: Load data into an Amazon S3 bucket. Firehose then issues a COPY command to copy the data from the S3 bucket to the Redshift cluster (see the sketch after this list).
Redshift auto copy: The Redshift auto copy feature allows us to specify the name of an S3 bucket in a copy job. Whenever a new object is added to that S3 bucket, Redshift automatically copies the data into the cluster.
Insert data using the JDBC/ODBC driver: If we have an application running on an EC2 instance producing large batches of data, we can use the JDBC/ODBC driver to insert data directly from the EC2 instance into the Redshift cluster.
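The sketch below shows the COPY statement used by the first two methods: a one-off batch load, and an auto copy job that keeps ingesting new objects. The bucket names, IAM role ARN, and job name are placeholders, and the exact auto copy clause is shown as commonly documented for the COPY ... JOB CREATE syntax.

```sql
-- One-off batch load: copy CSV files from S3 into the sales table.
-- The bucket name and IAM role ARN are placeholders.
COPY sales
FROM 's3://example-sales-bucket/2024/'
IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole'
FORMAT AS CSV
IGNOREHEADER 1;

-- Auto copy: register a copy job so new objects under the prefix are
-- ingested automatically as they arrive in the bucket.
COPY sales
FROM 's3://example-sales-bucket/incoming/'
IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole'
FORMAT AS CSV
JOB CREATE sales_auto_copy AUTO ON;
```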
Snapshots for fault tolerance
Redshift supports both Single-AZ (Availability Zone) and Multi-AZ clusters. Multi-AZ clusters are fault tolerant, while Redshift offers snapshots to recover a Single-AZ cluster from a disaster.
Amazon Redshift snapshots are point-in-time backups of data warehouse clusters. When a snapshot is taken, it captures the entire cluster state, including data, schema, and configuration settings. These snapshots are incremental in nature and are stored in an S3 bucket.
We can restore a snapshot to a cluster in the same or a different AWS Region. This feature can help back up data in a different Region or Availability Zone to increase fault tolerance. Redshift allows us to create snapshots manually or automatically, roughly every eight hours or after every 5 GB per node of data changes.
Amazon Redshift Serverless
Amazon Redshift Serverless automatically scales data warehouse capacity to meet growing workload requirements. Thus, developers, analysts, and data scientists can load their data into Redshift and start querying it without having to tune or manage a Redshift cluster. They pay only for the data warehouse while it is in use.
Amazon Redshift ML
Amazon Redshift ML allows developers to train models on their data using simple and familiar SQL commands. It works with SageMaker to train models on our data. SageMaker Autopilot obtains the data and chooses the best model for training and making predictions.
We train a model on our data using the SQL statement CREATE MODEL. To train the model, Redshift first uploads our data to an S3 bucket. SageMaker then preprocesses this data to prepare it for machine learning. SageMaker Autopilot selects the best algorithm and hyperparameters for the most accurate training. The resulting prediction function is registered in the cluster as a SQL function.
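Here is a minimal sketch of this workflow, assuming a hypothetical customer_activity table with a churned label; the model, function, bucket, and role names are placeholders.

```sql
-- Hypothetical example: train a churn classifier on a customer_activity
-- table. Redshift exports the data to the named S3 bucket, SageMaker
-- Autopilot trains the model, and the prediction function is registered
-- in the cluster under the name given in FUNCTION.
CREATE MODEL customer_churn
FROM (
    SELECT age, tenure_months, monthly_spend, churned
    FROM customer_activity
)
TARGET churned
FUNCTION predict_customer_churn
IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftMLRole'
SETTINGS (S3_BUCKET 'example-redshift-ml-bucket');

-- Inference: call the generated prediction function like any other
-- SQL function, here on a hypothetical new_customers table.
SELECT customer_id,
       predict_customer_churn(age, tenure_months, monthly_spend) AS will_churn
FROM new_customers;
```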
Redshift ML additionally charges for training the model and for storing data in the S3 bucket. However, it does not charge for inference through the trained models.
Redshift Spectrum
Redshift Spectrum allows us to query data in an external S3 bucket without loading it into the Redshift cluster. To use Redshift Spectrum, we need a Redshift cluster and a SQL client to run SQL queries. The query is distributed to thousands of Spectrum nodes, which execute it against the data in the S3 bucket.
Note that, with Redshift Spectrum, the cluster and the S3 bucket must be in the same AWS Region.
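A minimal sketch of the setup follows: register an external schema, define an external table over files in S3, and query it in place. The schema, table, bucket, and role names are hypothetical.

```sql
-- Register an external schema backed by the AWS Glue Data Catalog, then
-- define an external table over files in S3. Names and ARNs are placeholders.
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'spectrum_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleSpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

CREATE EXTERNAL TABLE spectrum_schema.clickstream (
    event_time TIMESTAMP,
    user_id    VARCHAR(64),
    page_url   VARCHAR(256)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://example-clickstream-bucket/events/';

-- This query scans the S3 data through Spectrum nodes; no data is
-- loaded into the cluster.
SELECT page_url, COUNT(*) AS visits
FROM spectrum_schema.clickstream
GROUP BY page_url
ORDER BY visits DESC
LIMIT 10;
```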
Use case: Business analytics
Consider a retail company that wants to analyze its sales data to gain insights into customer behavior, product performance, and market trends. They have large volumes of transactional data stored in various databases, including sales records, customer information, and inventory data.
The company uses Amazon Redshift as a centralized data warehouse to consolidate and store all its transactional data in a single location. To load the data from multiple data sources to a Redshift cluster, they use AWS Glue, a data integration service, to extract, transform, and load data. The data analysts and data scientists use SQL queries and Business Intelligence (BI) tools like Amazon QuickSight, Tableau, or Looker to analyze the data stored in Redshift. They create data models, dashboards, and reports to visualize key metrics such as sales revenue, customer demographics, and product trends.
To advance their analytics strategies, they can use Redshift ML and derive deeper insights from the data. As the volume of data grows or the complexity of analytics increases, the company can easily scale its Redshift cluster up or down to meet its evolving needs.
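As an illustration of the kind of analysis the team might run, the sketch below computes monthly revenue and distinct customers per product category, assuming hypothetical sales and products tables in the warehouse.

```sql
-- Hypothetical analytics query: monthly revenue and unique customers
-- per product category.
SELECT p.category,
       DATE_TRUNC('month', s.sale_date) AS sales_month,
       SUM(s.amount)                    AS revenue,
       COUNT(DISTINCT s.customer_id)    AS unique_customers
FROM sales s
JOIN products p ON p.product_id = s.product_id
GROUP BY p.category, DATE_TRUNC('month', s.sale_date)
ORDER BY sales_month, revenue DESC;
```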