Creating and Parsing Data
Let's explore the options for creating and parsing the data sent through the network.
One of the main characteristics that sets distributed systems apart from other systems is that their nodes need to exchange data across the network boundary. In this chapter, we will examine how we can achieve this and the trade-offs of each approach.
Serialization and deserialization
Every node needs a way to transform data that resides in memory into a format that can be transmitted over the network, and it also needs a way to translate data received from the network back to the appropriate in-memory representation. These processes are called serialization and deserialization, respectively. The following illustration shows this process:
Note: Serialization is also used to transform data into a format that’s suitable for storage, not only communication.
Serializing and deserializing data
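The round trip described above can be sketched in a few lines of Python using the standard `struct` module. The record layout (`RECORD_FORMAT`) and the `serialize`/`deserialize` helpers are assumptions made up for this illustration, not a format used by any particular system.

```python
import struct
from typing import Tuple

# Hypothetical record layout, chosen only for this example:
# a 4-byte unsigned id followed by an 8-byte float, in network byte order.
RECORD_FORMAT = "!Id"


def serialize(sensor_id: int, temperature: float) -> bytes:
    """Turn in-memory values into bytes that can be sent or stored."""
    return struct.pack(RECORD_FORMAT, sensor_id, temperature)


def deserialize(payload: bytes) -> Tuple[int, float]:
    """Turn received bytes back into in-memory values."""
    sensor_id, temperature = struct.unpack(RECORD_FORMAT, payload)
    return sensor_id, temperature


payload = serialize(42, 21.5)
print(len(payload))          # 12
print(deserialize(payload))  # (42, 21.5)
```

Both sides must use the same `RECORD_FORMAT`; if the sender and receiver disagree on it, the bytes cannot be interpreted correctly.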
The various nodes of the system need to agree on a common way to serialize and deserialize data. Otherwise, they will not be able to translate the data they send to each other. There are various options available for this purpose. We will discuss three approaches in this lesson:
Native support
Some languages provide native support for serialization, such as Java and Python (the latter via its pickle module).
Advantages and disadvantages
- The main benefit of this option is convenience since there is very little extra code needed to serialize and deserialize an object. However, this comes at the cost of maintainability, security, and interoperability.
- Given the transparent nature of how these serialization methods work, it becomes hard to keep the format stable since it can be affected even by small changes to an object that do not affect the data contained in it (e.g., implementing a new interface).
- Furthermore, some of these mechanisms are not very secure since they indirectly allow a remote data sender to initialize any objects they want, thus introducing the risk of remote code execution.
- Last but not least, most of these methods are available only in specific languages, which means systems developed in different programming languages will not be able to communicate.
Note: Some third-party libraries operate in a similar way using reflection or bytecode manipulation, such as Kryo. These libraries tend to be subject to the same trade-offs.
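The convenience and the security risk of native serialization can both be seen in a short sketch with Python's `pickle` module. The `Untrusted` class and its payload are hypothetical and deliberately harmless; they only stand in for data received from a sender that should not be trusted.

```python
import pickle

# Convenience: built-in structures round-trip with no mapping code at all,
# including types such as sets and tuples.
session = {"user_id": 7, "roles": {"admin", "dev"}, "position": (3, 4)}
payload = pickle.dumps(session)
restored = pickle.loads(payload)
print(restored == session)  # True


# Security risk: a pickle can name any importable callable to run on load.
class Untrusted:
    def __reduce__(self):
        # Harmless stand-in; an attacker could reference os.system instead.
        return (print, ("this ran during unpickling",))


hostile_payload = pickle.dumps(Untrusted())
result = pickle.loads(hostile_payload)  # prints the message as a side effect
print(result)  # None: the "object" is whatever the callable returned
```

The receiver never asked for `print` to be called; the payload decided that. This is why pickled data should only be loaded from trusted sources.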
Libraries
Another option is a set of libraries that serialize an in-memory object based on instructions provided by the application. These instructions can be imperative (explicitly calling an operation for every field) or declarative (annotating the fields to be serialized).
An example of such a library is Jackson, which supports many different formats, such as JSON and XML.
Advantages and disadvantages
- The main benefit of this approach is the ability to interoperate between different languages.
- Most of these libraries also have rather simple rules on what gets serialized, so it’s easier to preserve backwards compatibility when evolving some data, e.g., by introducing new optional fields.
- They also tend to be a bit more secure, as they reduce the exploitation surface by reducing the number of types that can be instantiated during deserialization, i.e., only the ones that have been annotated for serialization and thus implicitly whitelisted.
- However, they can sometimes create additional development effort since the exact serialization mapping needs to be defined in every application.
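The declarative style described in this section can be approximated in Python with dataclass field metadata and the standard `json` module. This is a minimal sketch and not Jackson's API; the `wire` marker, the `User` class, and the `to_json`/`from_json` helpers are all hypothetical names introduced for the example.

```python
import json
from dataclasses import dataclass, field, fields
from typing import Optional


def wire(**kwargs):
    """Hypothetical marker: declares that a field belongs on the wire."""
    return field(metadata={"wire": True}, **kwargs)


@dataclass
class User:
    name: str = wire()
    email: Optional[str] = wire(default=None)  # optional field added later
    password_hash: str = field(default="", repr=False)  # unmarked: never serialized


def to_json(obj) -> str:
    """Serialize only the fields that were explicitly marked."""
    data = {f.name: getattr(obj, f.name) for f in fields(obj) if f.metadata.get("wire")}
    return json.dumps(data)


def from_json(cls, text: str):
    """Build the one expected type, ignoring anything that is not marked."""
    raw = json.loads(text)
    allowed = {f.name for f in fields(cls) if f.metadata.get("wire")}
    return cls(**{k: v for k, v in raw.items() if k in allowed})


text = to_json(User(name="Ada", email="ada@example.com", password_hash="secret"))
print(text)  # {"name": "Ada", "email": "ada@example.com"}

old = from_json(User, '{"name": "Bob"}')  # written before "email" existed
forged = from_json(User, '{"name": "Eve", "password_hash": "forged"}')
print(old.email, repr(forged.password_hash))  # None ''
```

The output is plain JSON, so a program in another language can read it. The cost is that every application has to define this mapping for its own types.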
Interface definition languages (IDL)
Interface definition languages (IDL) are specification languages used to define the schema of a data type in a language-independent way.
Advantages and disadvantages
- These definitions are used to generate code in different languages that performs serialization and deserialization when included in the application. This allows applications built in different programming languages to interoperate, depending on the languages supported by the IDL. In addition, each IDL allows for different forms of evolution of the underlying data types.
- They also reduce duplicate development effort, but they require adjusting build processes so that they can integrate with the code generation mechanisms. Some examples of IDLs are Protocol Buffers, Avro, and Thrift.
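The idea of deriving serialization code from a language-independent schema can be imitated with a toy example. The schema format below is invented for this sketch and is not Protocol Buffers, Avro, or Thrift. The `generate` function is a hypothetical stand-in for an IDL compiler: real tools usually emit source files at build time and use compact binary encodings, whereas this sketch builds a class at runtime and encodes to JSON.

```python
import json
from dataclasses import asdict, field, make_dataclass

# Toy, language-neutral schema (hypothetical format, not a real IDL).
SCHEMA = json.loads("""
{
  "name": "User",
  "fields": [
    {"name": "id",    "type": "int"},
    {"name": "name",  "type": "string"},
    {"name": "email", "type": "string", "default": ""}
  ]
}
""")

TYPES = {"int": int, "string": str}


def generate(schema):
    """Stand-in for an IDL compiler: derive a class, encoder, and decoder."""
    specs = []
    for f in schema["fields"]:
        py_type = TYPES[f["type"]]
        if "default" in f:
            specs.append((f["name"], py_type, field(default=f["default"])))
        else:
            specs.append((f["name"], py_type))
    cls = make_dataclass(schema["name"], specs)

    def encode(obj) -> bytes:
        return json.dumps(asdict(obj)).encode("utf-8")

    def decode(payload: bytes):
        raw = json.loads(payload.decode("utf-8"))
        values = {}
        for f in schema["fields"]:
            if f["name"] in raw:
                value = raw[f["name"]]
                if not isinstance(value, TYPES[f["type"]]):
                    raise TypeError("field %r has the wrong type" % f["name"])
                values[f["name"]] = value
        return cls(**values)  # fields absent from the payload use schema defaults

    return cls, encode, decode


User, encode, decode = generate(SCHEMA)
print(decode(encode(User(id=1, name="Ada"))))  # User(id=1, name='Ada', email='')

old = decode(b'{"id": 2, "name": "Bob"}')  # written before "email" was added
print(old.email == "")  # True
```

Because the schema is just data, the same definition could drive a generator for any other language, and adding a field with a default keeps older payloads readable.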
The difference between the library and the interface definition languages (IDL) approach, using Jackson and Protocol Buffers, is shown in the following illustration: