Parquet: Projection Schema & Misc. Tools

This lesson demonstrates the use of a projection schema and other miscellaneous Parquet-related tools.

Projection Schema & Misc. Tools

In the previous section, we learned to write a Parquet file using an Avro model. The appeal of storing data in Parquet format is the ability to read some columns of a record without reading the others. So how can we do that programmatically? We specify a projection schema and set it on the reader object to select the columns we want to read from the Parquet file. The projection schema is a subset of the schema that was used to write the Parquet file.

Thrift and Protocol Buffers also allow for projection in a similar manner. Avro, however, additionally offers the option of setting a read schema. The read schema is used to resolve the Avro records as they are read, since Avro allows a schema to evolve over time.

In this lesson's code example, we read only the horsepower column for all the car records and then compute the average horsepower for the car records in our Parquet file. We use the auto-generated Car class to read in the records. The unread columns are assigned their default value of null.
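The effect of a projection can be sketched without any Parquet dependency. The toy model below is not the Parquet API: it keeps each column in its own list, reads only the requested column, and leaves the other fields of each record as null before averaging the horsepower values. The Car class here, its field names other than horsepower, and the sample values are all hypothetical stand-ins for the auto-generated class and the real file. With the parquet-avro library, the projection schema is typically set on the Hadoop configuration with AvroReadSupport.setRequestedProjection before building the reader, and the read schema with AvroReadSupport.setAvroReadSchema.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of column projection. This is NOT the Parquet API.
class ProjectionSketch {

    // Hypothetical stand-in for the auto-generated Car class.
    // Boxed types are used so that unread fields can stay null.
    static final class Car {
        String make;
        String model;
        Integer horsepower;
    }

    // A column-oriented "file": column name -> one value per record.
    // The values are made-up sample data.
    static Map<String, List<Object>> sampleColumns() {
        Map<String, List<Object>> columns = new LinkedHashMap<>();
        columns.put("make", List.<Object>of("MakeA", "MakeB", "MakeC"));
        columns.put("model", List.<Object>of("ModelX", "ModelY", "ModelZ"));
        columns.put("horsepower", List.<Object>of(100, 150, 200));
        return columns;
    }

    // Build records using only the columns named in the projection.
    // Columns outside the projection are never touched, so the
    // corresponding fields keep their default value of null.
    static List<Car> read(Map<String, List<Object>> columns, List<String> projection) {
        int rows = columns.values().iterator().next().size();
        List<Car> cars = new ArrayList<>();
        for (int i = 0; i < rows; i++) {
            Car car = new Car();
            if (projection.contains("make")) {
                car.make = (String) columns.get("make").get(i);
            }
            if (projection.contains("model")) {
                car.model = (String) columns.get("model").get(i);
            }
            if (projection.contains("horsepower")) {
                car.horsepower = (Integer) columns.get("horsepower").get(i);
            }
            cars.add(car);
        }
        return cars;
    }

    // Average of the horsepower field; assumes horsepower was projected.
    static double averageHorsepower(List<Car> cars) {
        return cars.stream().mapToInt(c -> c.horsepower).average().orElse(Double.NaN);
    }

    public static void main(String[] args) {
        List<Car> cars = read(sampleColumns(), List.of("horsepower"));
        System.out.println("make of first car: " + cars.get(0).make);
        System.out.println("average horsepower: " + averageHorsepower(cars));
    }
}
```

In this sketch, the projection is just a list of column names. A real projection schema is a full Avro schema containing a subset of the fields from the write schema.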
