Meta has introduced ‘Tulip,’ a binary serialization protocol that supports schema evolution. Tulip simultaneously addresses protocol reliability and data schematization, and it replaces multiple legacy formats used across Meta’s data platform, delivering a considerable increase in performance and efficiency. Meta’s data platform is made up of numerous heterogeneous services, such as warehouse data storage and various real-time systems, that exchange large amounts of data and communicate via service APIs. As the number of AI- and machine learning (ML)-related workloads that consume data for model training grows, Meta must continually make its data logging systems more efficient. Schematization plays a huge role in building a data platform at Meta’s scale: these systems are designed with the knowledge that every decision and trade-off impacts reliability, data-preprocessing efficiency, performance, and the engineer’s developer experience. Changing the serialization format for the entire data infrastructure is a big bet, but it offers long-term benefits that let the platform evolve over time.
The Data Analytics Logging Library runs in the web tier and in internal services and is responsible for logging analytical and operational data via Scribe, a durable message-queuing system used by Meta. Data is read and ingested from Scribe by consumers that include a data platform ingestion service and real-time processing systems. The Data Analytics Reading Library deserializes data and rehydrates it into a structured payload. Thousands of engineers at Meta create, update, and delete logging schemas every month, and data conforming to these schemas flows over Scribe in the petabyte range every day.
Schematization is necessary to ensure that any message logged in the past, present, or future can be reliably (de)serialized at any time, regardless of the (de)serializer’s version, with the utmost fidelity and no data loss. This property is called safe schema evolution via backward and forward compatibility. The article focuses on the on-wire serialization format used to encode the data that is ultimately processed by the data platform. Compared with the two serialization formats previously in use, Hive Text Delimited and JSON, the new encoding format is more efficient, requiring 40 to 85 percent fewer bytes and 50 to 90 percent fewer CPU cycles to (de)serialize data.
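The compatibility property above can be sketched with a toy model. The following Python snippet is not Meta’s implementation; it only illustrates, under assumed `serialize`/`deserialize` helpers, why keying fields by stable IDs (rather than by position) lets readers and writers on different schema versions interoperate: unknown IDs are skipped and missing fields default to `None`.

```python
# Toy model of backward/forward compatibility via field IDs.
# A schema here is just a mapping of field name -> stable field ID.

def serialize(schema, record):
    """Encode fields keyed by the writer's field IDs."""
    return {schema[name]: value for name, value in record.items() if name in schema}

def deserialize(schema, payload):
    """Decode with the reader's schema: unknown IDs are ignored,
    fields absent from the payload come back as None."""
    return {name: payload.get(fid) for name, fid in schema.items()}

writer_v2 = {"user_id": 1, "event": 2, "region": 3}   # newer schema
reader_v1 = {"user_id": 1, "event": 2}                # older schema

payload = serialize(writer_v2, {"user_id": 42, "event": "click", "region": "eu"})
# Old reader, new writer: the unknown "region" field (ID 3) is skipped.
print(deserialize(reader_v1, payload))  # {'user_id': 42, 'event': 'click'}
```

Because IDs never change meaning once assigned, the same logic also covers the reverse case: a newer reader decoding an older payload simply sees `None` for fields the writer did not know about.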
Applications use the logging library, written in languages such as C++, Java, Haskell, Hack, and Python, to serialize a payload according to a logging schema; these logging schemas are defined according to business needs and written to Scribe for delivery. The logging library comes in two flavors: Code Generated and Generic. In the code-generated flavor, statically typed setters are generated for each field for type-safe use; for optimal efficiency, post-processing and serialization code is generated as well. In the generic flavor, a C++ library named Tulib performs (de)serialization of dynamically typed payloads: a dynamically typed message is serialized according to a logging schema supplied at runtime. This mode is more flexible than the code-generated mode because it permits (de)serialization of messages without rebuilding and redeploying the application binary.
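The contrast between the two flavors can be sketched as follows. This is a hypothetical illustration in Python (the real Tulib is a C++ library, and the class names `LogEvent` and `GenericLogger` are invented here), showing per-field typed setters on one side versus a runtime schema on the other.

```python
# Code-generated flavor (sketch): one statically declared setter per field,
# so type errors are caught at the call site and codegen can emit
# specialized serialization code.
class LogEvent:
    __slots__ = ("user_id", "event")

    def __init__(self):
        self.user_id = None
        self.event = None

    def set_user_id(self, v: int):
        self.user_id = v

    def set_event(self, v: str):
        self.event = v

# Generic flavor (sketch): the schema is a runtime value, so new schemas
# work without rebuilding or redeploying the binary.
class GenericLogger:
    def __init__(self, schema):
        self.schema = schema  # {field_name: expected_type}

    def log(self, payload):
        for name, value in payload.items():
            expected = self.schema.get(name)
            if expected is not None and not isinstance(value, expected):
                raise TypeError(f"{name}: expected {expected.__name__}")
        return payload  # a real implementation would serialize here

logger = GenericLogger({"user_id": int, "event": str})
print(logger.log({"user_id": 7, "event": "view"}))  # {'user_id': 7, 'event': 'view'}
```

The trade-off mirrors the article: the generated flavor buys type safety and speed, the generic flavor buys deployment flexibility.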
The logging library sends data to several back-end systems, each of which traditionally specified its own serialization rules. Using those formats to serialize payloads posed the following problems:
- Standardization: There was no standardization of serialization formats in the past; each downstream system had its own format, increasing maintenance and development costs.
- Reliability: To preserve deserialization reliability, new columns could only be appended at the end. Any attempt to insert a field in the middle of an existing schema or to remove a column would shift all subsequent columns, making rows impossible to deserialize unless the updated schema was distributed to every reader in real time.
- Efficiency: When compared to binary (de)serialization, both the Hive Text Delimited and JSON protocols are text-based and inefficient.
- Correctness: In text-based protocols like Hive Text, field and line delimiters must be escaped and unescaped. Every writer and reader must do this, which puts pressure on library authors. It is also difficult to deal with outdated or flawed implementations that merely check for the presence of these characters and reject the entire message rather than escaping the troublesome ones.
- Forward and backward compatibility: Consuming payloads serialized with a schema version either older or newer than the one the consumer sees is desirable. The Hive Text protocol does not provide this assurance.
- Metadata: Hive Text serialization does not trivially support inserting metadata into the payload. Propagating metadata is essential for downstream systems to implement features that benefit from its presence.
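The reliability and compatibility items above both stem from positional, delimiter-based encoding. The sketch below (a toy, with the `\x01` field delimiter assumed for illustration; not Hive's actual SerDe code) shows how inserting a column mid-schema silently corrupts every column after the insertion point for a reader that has not received the new column order.

```python
# Toy demonstration of the positional-shift problem in a
# delimiter-based format such as Hive Text.
DELIM = "\x01"  # assumed field delimiter

def hive_serialize(values):
    """Join values positionally; the wire carries no field names or IDs."""
    return DELIM.join(str(v) for v in values)

def hive_deserialize(row, columns):
    """Split positionally and pair with the reader's column order."""
    return dict(zip(columns, row.split(DELIM)))

# The writer has inserted "region" in the middle of its column list:
row = hive_serialize([42, "eu", "click"])  # user_id, region, event

# A reader still on the old column order misparses everything after
# the insertion point: "event" now silently holds the region value.
print(hive_deserialize(row, ["user_id", "event"]))
# {'user_id': '42', 'event': 'eu'}
```

Nothing on the wire tells the reader it is wrong, which is why, with such formats, schema changes and reader updates must be coordinated in lockstep.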
Tulip solves the fundamental problem, reliability, by supplying a secure schema-evolution format that is both backward- and forward-compatible across services with distinct deployment cycles. By solving all of these problems in one go, Tulip proved a better investment than the other options available.
Tulip is a binary serialization protocol that uses Thrift’s TCompactProtocol to serialize a payload. Fields are numbered with IDs in the same way an engineer would be expected to assign them when altering a Thrift struct. Engineers define a list of field names and types when they create a logging schema; the field IDs are managed by the data platform management module rather than by the engineers themselves. The serialization schema repository stores a translation of each logging schema into a serialization schema. A serialization config stores lists of the field name, field type, and field ID for the corresponding logging schema, along with a field history. When an engineer updates a logging schema, a transactional operation is carried out on the serialization schema.
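The ID-management scheme described above can be sketched in a few lines. This is a hypothetical model (the class name `SerializationSchema` and its methods are invented here, and the real repository is a managed service): the key invariants are that the platform, not the engineer, allocates IDs, and that a removed field’s ID is retired into the history rather than reused, so old payloads remain decodable.

```python
# Toy model of platform-managed field IDs with a retirement history.
class SerializationSchema:
    def __init__(self):
        self.fields = {}      # field name -> (field_id, field_type)
        self.history = []     # retired (field_name, field_id) pairs
        self._next_id = 1     # IDs are allocated by the platform

    def add_field(self, name, field_type):
        # New fields always receive a fresh, never-before-used ID.
        self.fields[name] = (self._next_id, field_type)
        self._next_id += 1

    def remove_field(self, name):
        # The ID is retired, not recycled: payloads written under the
        # old schema can still be decoded unambiguously.
        self.history.append((name, self.fields.pop(name)[0]))

schema = SerializationSchema()
schema.add_field("user_id", "i64")
schema.add_field("event", "string")
schema.remove_field("event")
schema.add_field("region", "string")  # gets ID 3, never reusing ID 2
print(schema.fields)   # {'user_id': (1, 'i64'), 'region': (3, 'string')}
print(schema.history)  # [('event', 2)]
```

In a real system the whole add/remove sequence would run inside one transaction against the schema repository, matching the transactional update the article describes.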