Select and Glacier Select
Learn how to fetch a subset of information from large object storage using S3 Select and S3 Glacier Select.
S3 provides durable storage and can store objects as large as 5 TB. Traditionally, we access objects as a whole, meaning we retrieve the entire object, even all 5 TB of it, when we only want a small piece of information from it. Amazon S3 offers S3 Select and S3 Glacier Select to let us fetch only a subset of the data in a large object using simple SQL expressions.
S3 Select
S3 Select can drastically improve performance and reduce latency by retrieving only the desired data from an object rather than the whole object. For example, consider a data analytics platform that stores massive log files in S3. These log files are compressed with GZIP and uploaded daily. If we analyze the logs weekly, downloading, decompressing, and processing every log file is slow and costly. With S3 Select, we can query these logs and retrieve only the relevant records. S3 performs the decompression and filtering on its side, so we never download the entire files. This can significantly reduce data transfer costs and processing time, especially when dealing with large datasets.
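The log-analysis scenario above could be sketched with boto3, the AWS SDK for Python, roughly as follows. This is a minimal sketch, assuming the logs are GZIP-compressed CSV files with a header row. The function names and the `status_code` column are hypothetical and chosen only for illustration.

```python
from typing import Any, Dict, Iterable


def build_select_request(bucket: str, key: str, expression: str) -> Dict[str, Any]:
    """Build keyword arguments for the S3 client's select_object_content call."""
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": expression,
        # Assumption: the object is a GZIP-compressed CSV whose first line
        # holds the column names.
        "InputSerialization": {
            "CSV": {"FileHeaderInfo": "USE"},
            "CompressionType": "GZIP",
        },
        # Matching rows come back as CSV text.
        "OutputSerialization": {"CSV": {}},
    }


def collect_records(event_stream: Iterable[Dict[str, Any]]) -> str:
    """Join the payloads of all Records events and decode them as UTF-8."""
    payload = b"".join(
        event["Records"]["Payload"] for event in event_stream if "Records" in event
    )
    return payload.decode("utf-8")


def query_error_lines(s3_client: Any, bucket: str, key: str) -> str:
    """Return only the log rows whose (hypothetical) status_code column is 500."""
    request = build_select_request(
        bucket, key, "SELECT * FROM S3Object s WHERE s.status_code = '500'"
    )
    response = s3_client.select_object_content(**request)
    # The response payload is a stream of events; only Records events carry data.
    return collect_records(response["Payload"])
```

With boto3 installed and credentials configured, a call might look like `query_error_lines(boto3.client("s3"), "my-log-bucket", "logs/day1.csv.gz")`, where the bucket and key are placeholders. Passing the client in as a parameter keeps the helpers easy to exercise without AWS access.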
Because less data crosses the network, S3 Select can speed up many kinds of applications that filter or sample large objects. S3 Select also allows us to specify a range of bytes to query, so a single request scans only part of an object.
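One way to use byte ranges is to split a large object into chunks and query each chunk separately through the `ScanRange` parameter of `select_object_content`. The sketch below assumes an uncompressed CSV object without a header row and treats the `End` offset as inclusive. The helper names are hypothetical.

```python
from typing import Any, Dict, Iterator


def scan_ranges(object_size: int, chunk_size: int) -> Iterator[Dict[str, int]]:
    """Yield consecutive, non-overlapping byte ranges covering the object."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, object_size, chunk_size):
        # End is the last byte offset of the chunk (inclusive).
        end = min(start + chunk_size, object_size) - 1
        yield {"Start": start, "End": end}


def select_in_range(
    s3_client: Any, bucket: str, key: str, expression: str, scan_range: Dict[str, int]
) -> Any:
    """Run the SQL expression against the given byte range of the object."""
    return s3_client.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType="SQL",
        Expression=expression,
        # Assumption: uncompressed CSV with no header row.
        InputSerialization={
            "CSV": {"FileHeaderInfo": "NONE"},
            "CompressionType": "NONE",
        },
        OutputSerialization={"CSV": {}},
        ScanRange=scan_range,
    )
```

Each range can then be queried independently, for example in parallel workers, and the partial results combined afterward.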
S3 Glacier Select
S3 Glacier is one of the most cost-efficient families of storage classes and is commonly used to archive large amounts of data for compliance purposes. However, archive storage classes such as S3 Glacier Flexible Retrieval do not provide real-time retrieval and can take hours to restore or fetch data. Retrieving a large object and waiting that long just to extract a small amount of information is inefficient and discouraging.
S3 Glacier Select allows us to extract a subset of the information from an archived object using SQL expressions. We choose how quickly results are retrieved by selecting one of three retrieval options:
Expedited Retrievals: It takes 1–5 minutes.
Standard Retrievals: It takes 3–5 hours.
Bulk Retrievals: It takes up to 12 hours.
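In boto3, a Glacier Select query is submitted as a restore request of type `SELECT`, and the retrieval option above is passed as the `Tier`. The following is a minimal sketch, assuming an uncompressed CSV archive and a separate bucket for the results. The function names and the output bucket and prefix are hypothetical, and the SQL expression is passed through unchanged.

```python
from typing import Any, Dict

# The three retrieval options listed above.
VALID_TIERS = {"Expedited", "Standard", "Bulk"}


def build_glacier_select_request(
    tier: str, expression: str, output_bucket: str, output_prefix: str
) -> Dict[str, Any]:
    """Build the RestoreRequest structure for a SELECT-type restore."""
    if tier not in VALID_TIERS:
        raise ValueError(f"tier must be one of {sorted(VALID_TIERS)}")
    return {
        "Type": "SELECT",
        "Tier": tier,
        "SelectParameters": {
            # Assumption: the archived object is an uncompressed CSV file.
            "InputSerialization": {"CSV": {}},
            "ExpressionType": "SQL",
            "Expression": expression,
            "OutputSerialization": {"CSV": {}},
        },
        # Results are written to this location rather than returned inline.
        "OutputLocation": {
            "S3": {"BucketName": output_bucket, "Prefix": output_prefix}
        },
    }


def start_glacier_select(
    s3_client: Any, bucket: str, key: str, restore_request: Dict[str, Any]
) -> Any:
    """Submit the request; results appear under the output prefix once the job finishes."""
    return s3_client.restore_object(
        Bucket=bucket, Key=key, RestoreRequest=restore_request
    )
```

Unlike S3 Select, the call returns before any data is available, so the caller checks the output location after the chosen retrieval window has passed.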
S3 Select supports data in CSV, JSON, and Apache Parquet formats, and CSV and JSON objects can be compressed with GZIP or BZIP2. S3 Glacier Select is more restrictive and expects archived objects to be uncompressed CSV. This means that we can use these features for tasks such as pattern matching, auditing, and analysis of large amounts of archived data, all while keeping both costs and latency low.
Note: We cannot query objects in the S3 Glacier Flexible Retrieval or S3 Glacier Deep Archive storage classes, or in the S3 Intelligent-Tiering Deep Archive Access tier, using S3 Select.