Open code review: FlatBuffers

by Max Galkin

This is one of the “open code review” posts, where I publish my notes after looking through and playing with one of the open source C++ projects. My main goal here is to become a better coder by learning from the experience of other developers; my secondary goal is to build a mental map of the tools and frameworks available “out there”, so as not to reinvent the proverbial wheel, should I ever need one. The blog post expresses my personal opinion; it is not affiliated, endorsed, sponsored, etc. I am not arguing for or against the usage of any specific open source library. I will be grateful if you take the time to point out any misunderstanding I might have.

Today I looked into google/flatbuffers, a “Memory Efficient Serialization Library”.

FlatBuffers is a member of a large family of serialization frameworks: it is a lesser-known sibling of Google’s Protocol Buffers (FlatBuffers even understands a subset of their .proto language) and a distant relative of such libraries as Cap’n Proto, RapidJSON, pugixml and Thrift. I’m not going to compare them all here, and the optimal choice is likely to be problem-specific anyway.

The FlatBuffers documentation is quite good in my opinion: it shows some benchmarks and highlights the specific features of the library. What FlatBuffers claims as its strength is the absence of a packing/unpacking step per se: instead of “deserializing” the data from the binary representation into a newly allocated struct, you get the fields of the struct mapped as offsets onto the underlying byte array. That’s why the benchmark page proudly demonstrates a comparison with “raw structs” serialization performance. FlatBuffers uses a platform-agnostic, language-independent binary format, always little-endian and with predefined alignment. The few poor souls with big-endian machines will pay for some extra byte swapping here.

The workflow of using FlatBuffers is two-phased, just like with protobuf. In the first phase, you write an interface definition file describing the structure of the payload you’d like to serialize. Having that description, you run the FlatBuffers code generator to produce the reading/writing code in C++, Go, Java or C# (or in all of them). In the second phase, you use the generated code and some extra headers in your project to serialize/deserialize your data. Normally, you’d use the default binary serialization, as that is the most efficient approach, but the library also lets you serialize to/from JSON if you so desire, e.g. for debugging purposes or to interact with a JavaScript API.

Here is what an interface definition file looks like; it’s really pretty straightforward. For more detailed explanations, go to this page:

namespace MyGame;

attribute "priority";

enum Color : byte { Red = 1, Green, Blue }

struct Vec3 {
  x:float;
  y:float;
  z:float;
}

table Monster {
  pos:Vec3;
  mana:short = 150;
  hp:short = 100;
  name:string;
  friendly:bool = false (deprecated, priority: 1);
  inventory:[ubyte];
  color:Color = Blue;
}

root_type Monster;


Conceptually, FlatBuffers seems to be quite solid: the rationale and the need are well-defined, and, as I mentioned, the documentation is quite helpful too. Now what about the implementation? Show me the code!

The sources contain a Visual Studio solution, so building was easy for me. I hit a couple of warnings, but only because I built with the VS 2015 preview, which comes with a few stricter checks; I can’t blame the library here.

The first thing I looked at was the tests: the coverage is pretty good. In addition to some “educational” tests, which demonstrate and verify basic library usage, there are a couple of “fuzz tests”, which generate randomized schemas and data and perform the serialization/deserialization round-trip. I modified one of the tests to generate some extra cases, and it still passed, so there you go, not so easy to break :) I’ve opened a pull request, so maybe that update will even make it into the main branch. There are also tests covering the functionality in Java, C# and Go.

Now to mention a few things that I didn’t like that much in the implementation.

I’m going to leave aside the fact that the library isn’t written in “modern C++”. I don’t consider this a drawback; it’s a design trade-off: there are people who still have to use older compilers, and a few folks explicitly requested making it C++98-compliant.

I also won’t dig into the idiosyncrasies of the Google C++ Style Guide; suffice it to say that a somewhat typical function prelude is to get a reference to an argument passed in by pointer (without checking for nullptr…):


static void GenEnum(const Parser &parser, EnumDef &enum_def,
                    std::string *code_ptr, std::string *code_ptr_post,
                    const GeneratorOptions &opts) {
   if (enum_def.generated) return;
   std::string &code = *code_ptr;
   std::string &code_post = *code_ptr_post;

What bothers me more is the fact that the parsing of the input files and the code generation are implemented in a somewhat ad-hoc way, without clearly separated lexer, parser, AST or DOM components. FlatBuffers actually documents the grammar of the interface definition file; surely these days there is a way to take such a grammar and turn it into a parser without implementing it manually from scratch. I suspect that having the formal grammar at hand could also have helped to fuzz-test the project even more thoroughly.

Here is for example a fragment of the FlatBuffers input parser, the full implementation is here:


case '/':
   if (*cursor_ == '/') {
      const char *start = ++cursor_;
      while (*cursor_ && *cursor_ != '\n') cursor_++;
      if (*start == '/') { // documentation comment
         if (cursor_ != source_ && !seen_newline)
            Error("a documentation comment should be on a line on its own");
         doc_comment_.push_back(std::string(start + 1, cursor_));
      }
      break;
   }

Similarly, code generation is done by appending little code chunks to a long string. Yes, there is some code sharing between generators and some notion of language traits, but the producing code and the produced code are so intertwined that it’s difficult to tell one from the other. Maybe some kind of text templates should have been used, or some sort of DOM… The drawback of the existing implementation is that it seems difficult to analyze, verify and maintain, and if a fix is ever needed in one of the code generators, it may be difficult to properly “port” that fix to the generators for the other languages. Adding support for new languages could be quite a tedious task as well.

Here is a fragment from the C++ code generator:


  // Generate a builder struct, with methods of the form:
  // void add_name(type name) { fbb_.AddElement<type>(offset, name, default); }
  code += "struct " + struct_def.name;
  code += "Builder {\n  flatbuffers::FlatBufferBuilder &fbb_;\n";
  code += "  flatbuffers::uoffset_t start_;\n";
  for (auto it = struct_def.fields.vec.begin();
       it != struct_def.fields.vec.end();
       ++it) {
    auto &field = **it;
    if (!field.deprecated) {
      code += "  void add_" + field.name + "(";
      code += GenTypeWire(parser, field.value.type, " ", true) + field.name;
      code += ") { fbb_.Add";
      if (IsScalar(field.value.type.base_type)) {
        code += "Element<" + GenTypeWire(parser, field.value.type, "", false);
        code += ">";
      } else if (IsStruct(field.value.type)) {
        code += "Struct";
      } else {
        code += "Offset";
      }
      code += "(" + NumToString(field.value.offset) + ", ";
      code += GenUnderlyingCast(parser, field, false, field.name);
      if (IsScalar(field.value.type.base_type))
        code += ", " + field.value.constant;
      code += "); }\n";
    }
  }
  code += "  " + struct_def.name;
  code += "Builder(flatbuffers::FlatBufferBuilder &_fbb) : fbb_(_fbb) ";
  code += "{ start_ = fbb_.StartTable(); }\n";
  code += "  " + struct_def.name + "Builder &operator=(const ";
  code += struct_def.name + "Builder &);\n";
  code += "  flatbuffers::Offset<" + struct_def.name;
  code += "> Finish() {\n    auto o = flatbuffers::Offset<" + struct_def.name;
  code += ">(fbb_.EndTable(start_, ";
  code += NumToString(struct_def.fields.vec.size()) + "));\n";
  for (auto it = struct_def.fields.vec.begin();
       it != struct_def.fields.vec.end();
       ++it) {
    auto &field = **it;
    if (!field.deprecated && field.required) {
      code += "    fbb_.Required(o, " + NumToString(field.value.offset);
      code += ");  // " + field.name + "\n";
    }
  }
  code += "    return o;\n  }\n};\n\n";

I want to emphasize that this doesn’t mean you shouldn’t use this library, and I don’t want you to get the impression that I’m arguing for or against it: as far as I can see, the library does what it promises to do and has sufficient test coverage. As always, it is up to you to decide the acceptable trade-offs and requirements in your situation.