Amazon Redshift

Amazon Redshift is a fully managed data warehousing service that can scale to petabytes of data. It is based on PostgreSQL and can be queried using SQL queries. Amazon Redshift stores data in a columnar format and offers a multi-query engine that optimizes analytic queries.

Press + to interact

How Amazon Redshift works

Redshift uses a distributed architecture with multiple nodes working in parallel to execute queries. This Massively Parallel (MPP) architecture enables Redshift to run complex queries and quickly scale horizontally to process large datasets.

The core component of Redshift is a cluster. A cluster consists of multiple compute nodes and a leader node.

  • We connect to the leader node to interact with the cluster. The leader node generates a query execution plan and aggregates results. It manages the communication between the client and the compute nodes.

  • The compute nodes perform the query and respond to the leader node with the results. Each compute node has its own memory and processing power based on the instance’s size. As the workload increases, we can change the compute instance’s type, number of instances, or both.

Redshift creates a separate isolated network for the leader and compute nodes within which the nodes communicate over high bandwidth connections and custom communication protocols. The illustration below depicts an overview of the Redshift architecture.

Press + to interact
Redshift architecture
Redshift architecture

Load data in Redshift

To optimize costs and performance, data should be loaded in batches to the Redshift clusters. There are three main methods to load data in a redshift cluster.

  • Amazon Kinesis Firehose to Redshift: Load data in an Aazmon S3 bucket. The Firehouse will issue a COPY command to copy data from the S3 bucket to the Redshift cluster

Press + to interact
Load data from S3 to Redshift cluster using Kinesis Firehose
Load data from S3 to Redshift cluster using Kinesis Firehose
  • Redshift auto copy: The Redshift auto copy feature allows us to specify the name of an S3 bucket in the copy job. Whenever an object is added to the S3 bucket, it automatically copies the data to the Redshift cluster.

Press + to interact
Redshift Auto Copy
Redshift Auto Copy
  • Insert data using the JDBC/ODBS driver: If we have an application running on an EC2 instance producing large batches of data, we can use the JDBC/ODBC driver to directly upload data from the EC2 instance to the Redshift cluster.

Press + to interact
Using JDBC/ODBC driver to upload data
Using JDBC/ODBC driver to upload data

Snapshots for fault tolerance

Reshift supports both Single-AZ (Availability Zone) and Multi-AZ clusters. The Multi-AZ clusters are fault tolerant. However, Redshift offers snapshots to recover a Single-AZ cluster from a disaster.

Amazon Redshift snapshots are point-in-time backups of data warehouse clusters. When the snapshot is taken, it captures the entire cluster state, including data, schema, and configuration settings. These snapshots are incremental in nature and are stored on an S3 bucket.

We can restore a snapshot to a cluster within the same or different AWS region. This feature can be helpful to backup data in a different region or availability zone to increase fault tolerance. Redshift allows us to manually create snapshots or automate them to create snapshots after every 12 hours or 5 GB of data change.

Press + to interact
Cluster replication across regions using Redshift Snapshots
Cluster replication across regions using Redshift Snapshots

Amazon Redshift serverless

Amazon Redshift serverless can scale data warehouses to meet the increasing workload requirements. Thus, developers, analytics, and data scientists can upload their data to Redshift and start querying it without having to tune or manage the Redshift cluster. The developers pay only for the data warehouse while they are using it.

Press + to interact
Redshift Serverless
Redshift Serverless

Amazon Redshift ML

Amazon Redshift ML allows developers to train models on their data using simple and familiar SQL commands. It works with SageMaker to train models on our data. The SageMaker autopilot obtains the data and chooses the best model for training and making predictions.

Press + to interact

We train the model on data using the SQL statement CREATE MODEL. To train the model, Redshift first uploads our data to an S3 bucket. Sagemaker then preprocesses this data to prepare it for machine learning. The SageMaker copilot obtains the best algorithm and algorithm hyperparameters for the most accurate training. The resulting prediction function is registered in the cluster as an SQL function.

Press + to interact
How Redshift ML works
How Redshift ML works

Redshift ML additionally charges for training the model and storing data in the S3 bucket. However, it does not charge for the inference through the trained models.

Redshift Spectrum

Redshift Spectrum allows us to query data in an external S3 bucket without loading data in the Redshift cluster. To use the Redshift spectrum, we need a Redshift cluster and an SQL client to run SQL queries. The query is submitted to thousands of spectrum nodes, which perform it on an S3 bucket.

Press + to interact
Redshift Spectrum
Redshift Spectrum

Note that, in the Redshift Spectrum, the cluster and S3 bucket should be in the same region.

Use case: Business analytics

Consider a retail company that wants to analyze its sales data to gain insights into customer behavior, product performance, and market trends. They have large volumes of transactional data stored in various databases, including sales records, customer information, and inventory data.

The company uses Amazon Redshift as a centralized data warehouse to consolidate and store all its transactional data in a single location. To load the data from multiple data sources to a Redshift cluster, they use AWS Glue, a data integration service, to extract, transform, and load data. The data analysts and data scientists use SQL queries and Business Intelligence (BI) tools like Amazon QuickSight, Tableau, or Looker to analyze the data stored in Redshift. They create data models, dashboards, and reports to visualize key metrics such as sales revenue, customer demographics, and product trends.

To advance their analytics strategies, they can use Redshift ML and derive deeper insights from the data. As the volume of data grows or the complexity of analytics increases, the company can easily scale its Redshift cluster up or down to meet its evolving needs.

Press + to interact
Architecture of business analytics for a retail company
Architecture of business analytics for a retail company

Get hands-on with 1300+ tech skills courses.