The Fall and Rise of the Need of Memory Optimization

In the early days of computers, memory has been a scarce resource. When memory sizes were measured in kilobytes, some developers used really dirty tricks, including jumping into the middle of instructions, or use of self-modifying code. Not only those times have ended, but advances in processing power and available memory sizes have been that good for a while, that programmers stopped to worry at all.

As a net effect, applications developed in the era of 32-bit computers often required longer startup times than their prior 8 bit counterparts (admittedly, the former offering a lot more functionality). Henry Petroski commented: “The most amazing achievement of the computer software industry is its continuing cancellation of the steady and staggering gains made by the computer hardware industry.”

The advent of mobile devices reintroduced the need to not waste resources. Once mobile devices became as powerful as earlier PCs, the Internet of Things, introducing even smaller devices, added new challenges. Nowadays, awareness of resource cost is important again. Also in clouds, environments offering almost unlimited computing power, the required resources directly related to the monthly bill. In competitive environments, the cost induced by a transaction can make a difference and decide about a company’s success.

A lot of books have been published about the subject of code optimization. Here, we want to analyze requirements for efficient data encoding, for data at rest (in some cache) and transmitted data (as required in service-oriented architectures). On the protocol layer, there has been good progress, moving from http/1.1 to http/2. Encoding of the payload is left to the application. Established standards include XML and JSON, but neither of these is famous for compact encoding and small payload size.

The reasons for the significant size include representation of data as readable text as well as including any data field name in the payload. Since the structure of the data is known, it should not be required to transmit these. In this case, care must be taken to not introduce a format which is completely inflexible and easily breaks for typical changes of data structures (the Java serialization format already covers a couple of compatible changes which do not require a new version).

The following upgrades of a data structure should not require a new version, and parsers, therefore, should without additional effort be capable of decoding data generated before the change: (Note: Examples below will use keywords of the Java programming language, but the format requirements are in fact programming language independent)

  • Propagation of a data type to a type of higher precision, or generally having a superset of its domain, for example, float to double, short to int, int to long, or an enum with additional instances.
  • Propagation from primitive to its corresponding wrapped data type, for example, int to Integer.
  • Adding nullable fields at the end of any class.
  • Increasing the size of arrays or lists.

The data format should also provide good coverage of data types because transmission of data as strings with subsequent conversions adds to program execution time as well as (in languages as Java) GC overhead.

A typical problem when serializing complex data structures are repeated or even cyclic references. While repeated references just add to the size of the serialized data, cyclic references can cause endless loops and result in program crashes for some implementations. Therefore, the following additional requirements are important:

  • If composite classes are referenced multiple times, they should be serialized only once.
  • When serializing and deserializing a data structure, the identity relationship of user-defined classes should be preserved.
  • It should be possible to serialize data structures with cyclic references.

The Bonaparte compact format provides a serialization with the above properties. It is a binary encoding, but independent of processor endianness. Strings are encoded in either UTF-8 or UTF-16 encoding, whichever is shortest for the given string. Numeric values occurring most frequently such as numbers from -10 to 10 require a single byte only. Next, to the above compatibility rules, boolean fields can be propagated to numeric types. The supported data types include all Java primitive types and their wrapper classes, enumeration types, temporal types such as LocalDate, LocalTime, LocalDateTime, and Instant, as well as data structure such as lists and maps with integral or string key types.

A Java-based implementation is available. Serialization and deserialization work based on code generated by Xtext instead of reflection, resulting in a very good performance and very little GC overhead. At Arvato Systems, this has been established as an internal communication format between services in applications.

Leave a Reply