AWS Lake Formation

Learn about the features of AWS Lake Formation and how it integrates with AWS Glue.

AWS Lake Formation simplifies the setup and management of data lakes. Data lakes serve as repositories for all types of organizational data, structured or unstructured, at any scale. We can think of a data lake as a vast body of water where we can store raw data in its native format, such as text files, images, videos, sensor data, log files, and more. This data can come from various sources, and it stays in its raw form until it's needed for analysis.

AWS Lake Formation aids in data management, security management, and data sharing with other AWS services for analytics and machine learning tasks. It also helps manage fine-grained access control to data lakes. Lake Formation enables users to focus on deriving insights from their data rather than handling infrastructure and security complexities.


In this lesson, we will learn about the features of AWS Lake Formation, how they work, and their benefits.

Features of Lake Formation

AWS Lake Formation provides the following features:

Data management

AWS Lake Formation provides comprehensive data management capabilities for data lakes. Here’s a summary of the key data management features offered by AWS Lake Formation:

  • Data ingestion: Lake Formation facilitates ingesting data from various sources into the data lake. It supports seamless integration with AWS services like Amazon S3, Amazon RDS, Amazon Redshift, and Amazon DynamoDB, allowing organizations to ingest data in real-time or batch mode.

  • Data life cycle management: Lake Formation supports data life cycle management, allowing organizations to define policies for data retention, archiving, and deletion. Organizations can automate data life cycle workflows to optimize storage costs and ensure compliance with data retention policies.

  • Data sharing: It facilitates internal and external data sharing across multiple AWS accounts or organizations, ensuring secure data sharing within and outside the organization.


Security management

Lake Formation introduces its own permission model, enhancing the IAM permissions model. This model enables fine-grained access control to data lakes, similar to a relational database management system (RDBMS). Administrators can easily grant or revoke permissions, allowing control at the column, row, and cell levels across various AWS analytics and machine learning services.


Here’s a breakdown of the security management features:

  • Hybrid access mode: The hybrid access mode for AWS Glue Data Catalog integrates Lake Formation permissions with IAM permissions policies, enabling selective adoption and focusing on specific use cases. We can pick specific databases and tables to bring under Lake Formation’s permissions and set permissions for new users using Lake Formation without affecting the IAM permissions of existing user access.

  • Audit logging: Lake Formation tracks three aspects of data access: who accessed the data, what they accessed, and when they accessed it. This helps identify suspicious activity or potential breaches, and it demonstrates adherence to data access policies.

  • Row and cell-level security: Not all data needs the same level of protection. Lake Formation allows us to restrict access to specific rows or cells within a table. This helps us protect sensitive data, such as rows containing clients’ credit card information, and comply with data privacy regulations.

  • Tag-based access control (TBAC): Managing permissions for a large number of resources can be cumbersome. TBAC simplifies this: we create custom labels (LF-Tags) to categorize data (e.g., marketing data, sensitive data), attach these tags to databases, tables, or columns, and grant access based on the tags rather than on individual resources.

  • Cross-account access: Data lakes can span multiple AWS accounts. Lake Formation allows us to centrally manage permissions across these accounts and provide fine-grained access control to data stored in different accounts. This simplifies data governance for large organizations with distributed data storage.
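The tag-based flow described above can be sketched with the boto3 Lake Formation client. This is a minimal illustration, not a complete setup: the helper names `lf_tag_requests` and `apply_requests` are hypothetical, as are the tag, database, table, and role names used with them. The sketch only builds the request arguments for `create_lf_tag`, `add_lf_tags_to_resource`, and `grant_permissions`, and it assumes the caller is already a data lake administrator.

```python
from typing import Any


def lf_tag_requests(tag_key: str, tag_values: list[str], database: str,
                    table: str, principal_arn: str) -> dict[str, dict[str, Any]]:
    """Build keyword arguments for the three calls used in tag-based access control.

    All names passed in are illustrative; the first tag value is the one
    that gets attached to the table and granted to the principal.
    """
    granted_value = tag_values[:1]
    return {
        # 1. Define the LF-Tag and the values it may take.
        "create_lf_tag": {"TagKey": tag_key, "TagValues": tag_values},
        # 2. Attach one tag value to a Data Catalog table.
        "add_lf_tags_to_resource": {
            "Resource": {"Table": {"DatabaseName": database, "Name": table}},
            "LFTags": [{"TagKey": tag_key, "TagValues": granted_value}],
        },
        # 3. Grant SELECT on every table that carries that tag value.
        "grant_permissions": {
            "Principal": {"DataLakePrincipalIdentifier": principal_arn},
            "Resource": {
                "LFTagPolicy": {
                    "ResourceType": "TABLE",
                    "Expression": [{"TagKey": tag_key, "TagValues": granted_value}],
                }
            },
            "Permissions": ["SELECT"],
        },
    }


def apply_requests(client: Any, requests: dict[str, dict[str, Any]]) -> None:
    """Send each request through a Lake Formation client, in insertion order."""
    for operation, kwargs in requests.items():
        getattr(client, operation)(**kwargs)
```

To run it against an account, we could pass `boto3.client("lakeformation")` as `client`. Because the grant targets the tag rather than a table, any table tagged later with the same value is covered without another grant.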

Integration with AWS Glue

AWS Lake Formation is built upon AWS Glue. AWS Glue provides several key features that AWS Lake Formation utilizes:

  • Data catalog: AWS Glue provides a fully managed metadata catalog service that serves as the central repository for storing and managing metadata information about datasets, tables, and schemas. Lake Formation leverages the AWS Glue Data Catalog for metadata management, enabling organizations to discover, catalog, and access data assets and analyze them within the data lake.

  • Data crawlers: AWS Glue provides data crawlers that automatically discover and catalog metadata from various data sources such as Amazon S3, RDS, DynamoDB, and Redshift. Lake Formation leverages Glue’s data crawlers to infer schemas and populate the AWS Glue Data Catalog with metadata information about the underlying data sources.

  • ETL: AWS Glue offers serverless ETL capabilities for automating the process of extracting data from various sources, transforming it according to predefined transformations, and loading it into target data stores. Lake Formation utilizes AWS Glue’s ETL functionality to perform data ingestion and transformation tasks like schema mapping, data cleansing, and other data processing within the data lake environment.

  • Job execution: AWS Glue supports executing ETL jobs and data processing workflows using Apache Spark and Python-based scripts. Lake Formation utilizes Glue’s job execution capabilities to automate and schedule the execution of ETL jobs, ensuring efficient data processing and management within the data lake.

Integration of AWS Lake Formation with AWS Glue

Overall, AWS Glue is a foundational component within AWS Lake Formation, providing essential capabilities for metadata management, data discovery, data ingestion, transformation, and job execution within the data lake environment.
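As a rough illustration of the crawler piece, the sketch below builds the arguments for the Glue `create_crawler` call, which points a crawler at an S3 path and names the Data Catalog database that receives the inferred tables. The helper name `crawler_request` and every value passed to it are hypothetical. The optional schedule uses the cron syntax that Glue accepts.

```python
from typing import Any, Optional


def crawler_request(name: str, role_arn: str, database: str, s3_path: str,
                    schedule: Optional[str] = None) -> dict[str, Any]:
    """Build keyword arguments for glue.create_crawler (illustrative values only)."""
    request: dict[str, Any] = {
        "Name": name,
        "Role": role_arn,            # IAM role the crawler assumes to read the data
        "DatabaseName": database,    # Data Catalog database that receives the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }
    if schedule is not None:
        # Example: "cron(0 2 * * ? *)" runs the crawler daily at 02:00 UTC.
        request["Schedule"] = schedule
    return request
```

With a real client, this would be used as `boto3.client("glue").create_crawler(**crawler_request(...))`. Leaving `schedule` unset creates an on-demand crawler that runs only when started explicitly.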

AWS Lake Formation can also integrate seamlessly with other AWS analytics services, such as Amazon Athena, Amazon Redshift, Amazon EMR, and Amazon QuickSight. This integration enables organizations to leverage these services for querying, processing, visualizing, and deriving insights from data stored in their data lakes managed by Lake Formation.

How does Lake Formation work?

To understand how AWS Lake Formation works, let’s assume that we have a private S3 bucket that contains some data, an AWS Glue crawler with an IAM role that grants it permission to access the contents of the S3 bucket, and a user, “User A,” who wants to analyze the data.

In this scenario, we can use AWS Lake Formation to crawl the data in the S3 bucket with the AWS Glue crawler, save the metadata in Data Catalog tables, and then grant “User A” access to those tables. After that, “User A” will be able to access and analyze the data.
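The scenario can be sketched as an ordered plan of boto3 calls. This is a simplified outline under stated assumptions: the helper names `access_plan` and `run_plan` are hypothetical, as are the bucket, crawler, database, table, and user names. It also assumes the crawler already exists and that its role is allowed to create tables in the target database.

```python
from typing import Any


def access_plan(bucket: str, crawler_name: str, database: str, table: str,
                user_arn: str) -> list[tuple[str, str, dict[str, Any]]]:
    """Return (service, operation, kwargs) steps for the User A scenario."""
    return [
        # 1. Register the S3 bucket as a data lake location.
        ("lakeformation", "register_resource",
         {"ResourceArn": f"arn:aws:s3:::{bucket}", "UseServiceLinkedRole": True}),
        # 2. Run the crawler so the table metadata lands in the Data Catalog.
        ("glue", "start_crawler", {"Name": crawler_name}),
        # 3. Grant the user read access to the crawled table.
        ("lakeformation", "grant_permissions",
         {"Principal": {"DataLakePrincipalIdentifier": user_arn},
          "Resource": {"Table": {"DatabaseName": database, "Name": table}},
          "Permissions": ["SELECT"]}),
    ]


def run_plan(clients: dict[str, Any],
             plan: list[tuple[str, str, dict[str, Any]]]) -> None:
    """Execute each step with the matching client, e.g. boto3.client("glue")."""
    for service, operation, kwargs in plan:
        getattr(clients[service], operation)(**kwargs)
```

Note that `start_crawler` returns before the crawl finishes, so a real script would wait for the crawler to complete before granting permissions on the table it creates.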

Grant access to a user to analyze data in S3 bucket with AWS Lake Formation

Use case: Healthcare organization

Lake Formation’s true strength lies in its ability to consolidate data from various sources into a data lake on S3. Consider a scenario where a healthcare organization collects vast amounts of patient data from various sources, including electronic health records (EHR), medical imaging systems, wearable devices, and research databases. The organization wants to centralize and analyze this data to improve patient care, research outcomes, and operational efficiency while ensuring compliance with strict regulatory requirements such as HIPAA.

Infrastructure of healthcare organization

Data scientists and healthcare analysts leverage analytics tools like Amazon Athena (an interactive query service), Amazon Redshift, or Amazon EMR to query and analyze data stored in the data lake. They apply advanced analytics, predictive modeling, and machine learning algorithms to derive insights for personalized patient care, disease prevention, and treatment optimization. Using AWS Lake Formation, we can restrict the permissions of those accessing the data lake. For example, we want the analysts to be able to analyze the data, but we have to prevent them from accessing the column that holds patients’ social security numbers (SSNs).
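One way to express the SSN restriction is a column-level grant that gives analysts `SELECT` on every column except the excluded ones. The sketch below builds the arguments for the Lake Formation `grant_permissions` call. The helper name `select_without_columns` is hypothetical, as are the role, database, table, and column names used with it.

```python
from typing import Any, Iterable


def select_without_columns(principal_arn: str, database: str, table: str,
                           excluded_columns: Iterable[str]) -> dict[str, Any]:
    """Build kwargs for grant_permissions: SELECT on all columns except the excluded ones."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                # A column wildcard with exclusions means "every column but these".
                "ColumnWildcard": {"ExcludedColumnNames": list(excluded_columns)},
            }
        },
        "Permissions": ["SELECT"],
    }
```

For example, `boto3.client("lakeformation").grant_permissions(**select_without_columns(analyst_role_arn, "patients_db", "records", ["ssn"]))` would let the analyst role query the table in services such as Athena while the `ssn` column stays hidden, assuming those illustrative names match the catalog.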
