Automating Object Serialization/Deserialization in C++


Using C++ Template Meta-Programming vs Macros to generate visitors for objects

Topics: C++, Serialization, Reflection, Templates, JSON, Visitor pattern

by John Ryland © 2016


Overview

Reflection and introspection: some languages support them really well, while C++ has only its runtime type information (RTTI) system. Why would you need them? You can manage without, but they make some jobs easier, such as serialization and de-serialization. What I'll present are some ways to try to bridge some of the gaps.

Background


Why is serialization/de-serialization needed?

There are multiple uses for serialization. The most obvious is to save some state to disk so that it can be restored later, e.g. saving a document or file and being able to open it again later. In this particular case, if the data will be loaded again on the same machine and by the same program, then endianness and interoperability issues can likely be ignored. Other uses include interoperability, as well as transmission of objects over a network.

Formats


Generally pointers can't be serialized and de-serialized (array indexes are okay, provided what they reference is also serialized/de-serialized). Compound objects are usually serialized and de-serialized by calling the serializing and de-serializing functions of their component parts, so you generally end up with a hierarchy. Formats that can represent structured data like this are therefore used, XML and JSON being two notable ones. The downside is that these are text-based formats, so the data is inflated as text and more time is spent parsing. There are binary forms of XML and JSON. Generally, serializing/de-serializing in binary requires fixing the size of basic types, fixing number representations, fixing the endian order, and so on.

Google's solution is protocol buffers, for which Google provides implementations in quite a number of languages, including C++. Protocol buffers require running extra tools. If you are familiar with Qt, its Meta-Object system also bridges some introspection gaps, but it has historically required running moc (the meta-object compiler) to generate code, similar to protocol buffers. There are similarities with CORBA and IDL compilers. But I'm not really a fan of running extra tools to generate code, which you then need to make build rules for, and so on.
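To make the point about fixing type sizes and endian order concrete, here is a minimal sketch of writing and reading a 32-bit value in a fixed byte order. The helper names WriteU32LE and ReadU32LE are made up for this illustration; they are not from any library mentioned here.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Append a 32-bit value to a byte buffer in little-endian order,
// regardless of the host's native byte order.
inline void WriteU32LE(std::vector<uint8_t>& out, uint32_t value)
{
    for (int shift = 0; shift < 32; shift += 8)
        out.push_back(static_cast<uint8_t>((value >> shift) & 0xFF));
}

// Read a 32-bit little-endian value starting at 'offset'.
// The caller must ensure that offset + 4 <= in.size().
inline uint32_t ReadU32LE(const std::vector<uint8_t>& in, std::size_t offset)
{
    uint32_t value = 0;
    for (int i = 0; i < 4; ++i)
        value |= static_cast<uint32_t>(in[offset + i]) << (8 * i);
    return value;
}
```

Because the bytes are produced with shifts rather than by copying memory, the same file reads back correctly on both little-endian and big-endian machines.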

Languages / Comparison


In various languages, there is some degree of built-in support for serializing and de-serializing objects.
In Python, it is called pickling (https://docs.python.org/3/library/pickle.html); it is well supported and quite convenient to use, not to mention that converting Python objects to and from JSON is also quite easy.
In C#, the language provides some convenience attributes and classes for serializing and de-serializing objects to/from XML (see: https://msdn.microsoft.com/en-us/library/58a18dwa(v=vs.110).aspx).

    using System.IO;
    using System.Xml.Serialization;

    public class Address
    {
       // XML serialization attribute from the class library
       [XmlAttribute]
       public string Name;
       public string Line1;
    }

    // Code to serialize
    XmlSerializer serializer = new XmlSerializer(typeof(Address));
    TextWriter writer = new StreamWriter(filename);
    Address address = new Address();
    // ... populate the object ...
    serializer.Serialize(writer, address);
    writer.Close();

Solutions for C++


While Python (and C#) are useful for particular tasks, there are times when C/C++ is required or preferred, and you may need to add serializing and de-serializing to an existing project that already uses C/C++. So how can serializing and de-serializing be made more convenient in C++? Certainly C++11 has made some great improvements, but it doesn't fully solve this yet. Preferably the solution follows some or most of these kinds of principles:

http://webpro.github.io/programming-principles/

It's simple, as simple as it needs to be. It just does the job it does, and doesn't usurp other responsibilities. It doesn't require you to repeat yourself, etc. Preferably the solution doesn't require running other tools.

A survey for something that ticks many of these boxes for me turns up some solutions. One notable one that looks quite promising is this:

https://github.com/Loki-Astari/ThorsSerializer

What it looks like in practice:

    #include <iostream>
    #include "ThorSerialize/Traits.h"
    #include "ThorSerialize/JsonThor.h"

    struct Color
    {
        int     red;
        int     green;
        int     blue;
    };
    // Annotation of fields to serialize
    ThorsAnvil_MakeTrait(Color, red, green, blue);


    // Code to serialize
    int main()
    {
        using ThorsAnvil::Serialize::jsonExport;
        Color color = {255,0,0};
        std::cout << jsonExport(color) << "\n";
    }


That looks pretty close to ideal, wouldn't you think? Apart from the single ThorsAnvil_MakeTrait line (which is itself a macro), it is just plain C++. Looks nice.
I'm pretty pleased with the syntax Loki-Astari has managed to create and the minimal amount of code needed to annotate classes. But there are a few things I don't quite like. When calling MakeTrait, the names of the variables are entered again, which breaks the principle of not repeating oneself. The syntax also feels a bit clunky, with the type followed by the members in the parameters to MakeTrait. This particular solution also ships with its own JSON serializer/deserializer, coupling those together, although with some work I imagine it is possible to hook it up to RapidJSON, which is my preferred JSON implementation in C++ at the moment. I think the API might be better with a clearer separation of concerns: isolating the traversal of the objects from the more mundane matter of reading and writing files in a particular format, which existing libraries may be better tuned for or more powerful at.

Unfortunately it also appears that it requires C++14, and may not work with C++11. I'm not actually sure why, as I haven't looked deeply enough to say what that dependency is.

Visitors

Recall one criticism I had with this solution is that the project taken as a whole couples together serialization with the work of providing the class's reflection/introspection needed for this in C++.

So I think what interests me is the generation of the object traversal, rather than the detail of reading/writing involved in serialization. The GoF (https://en.wikipedia.org/wiki/Design_Patterns) would call this the visitor pattern. Serialization is just one specialization or use of this visitor pattern. For example, instead of actually generating all that XML or JSON or whatever you plan to serialize, you could traverse the objects in the same way and generate a hash of them.

I've used the visitor pattern this way in unit testing, to check the state of objects against a known hash they are expected to have, which saves dumping the entire state and comparing a large amount of data. I've also worked on server-authoritative client-server systems in which client actions are validated by comparing hashes of the client and server state generated in this way. The good thing about the visitor pattern is that it can be unobtrusive to an implementation, so it impacts neither performance when you aren't serializing or de-serializing, nor code clarity.

Before we look at how to do a visitor pattern correctly, let's look at some other ways people attempt to do this and the impacts these have.

Other solutions:


One way is to use MACROs (excuse my yelling; BTW, can you tell I'm not a huge fan of MACROs, despite having done a lot of 'clever' MACROs in my time?).

Say we have our color example again:

    struct Color
    {
        uint8_t     red;
        uint8_t     green;
        uint8_t     blue;
        uint8_t     alpha;
    };

This is a nice POD (plain old data) type, which has nice properties. Also note that in this example it fits in 32 bits.

With MACROs, commonly you see people do something like this:

In the header:

    DECLARE_CLASS(Color)
       DECLARE_MEMBER(uint8_t, red)
       DECLARE_MEMBER(uint8_t, green)
       DECLARE_MEMBER(uint8_t, blue)
       DECLARE_MEMBER(uint8_t, alpha)
    END_DECLARE_CLASS(Color)

And then in a cpp file:

    DEFINE_CLASS(Color)
       DEFINE_MEMBER(uint8_t, red)
       DEFINE_MEMBER(uint8_t, green)
       DEFINE_MEMBER(uint8_t, blue)
    END_DEFINE_CLASS(Color)

If you've done something like this, you're in good company: it's pretty common and seems like a natural solution in the absence of something better. But I don't really like it. Recall the principles: don't repeat yourself. It's error prone. Did you notice the error I made? (The alpha member is missing from the second list.)

Also unfortunately, if you define DECLARE_CLASS as something like this:

    #define DECLARE_CLASS(classname) \
        class classname : public SerializableBase {

The consequence is that, depending on the size of SerializableBase, all objects will have grown, using more memory. And it just gets worse if you have virtuals in there too. Actually it's pretty shit. Remember we started out with a nice POD that was 32 bits in size; now our data is horrible.

Hopefully you didn't do that. If you didn't, give yourself a pat on the back. Perhaps you then did this instead:
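The size cost is easy to check with a couple of static_asserts. This is only a sketch: the SerializableBase below is a hypothetical stand-in with two virtual functions, and PlainColor and HeavyColor are names invented for the comparison.

```cpp
#include <cstdint>

// The plain struct from above: four bytes, no hidden data.
struct PlainColor
{
    uint8_t red, green, blue, alpha;
};

// Hypothetical base class with virtual functions, standing in for
// whatever a DECLARE_CLASS macro might make every type inherit from.
struct SerializableBase
{
    virtual ~SerializableBase() {}
    virtual void serialize() {}
};

// The same four members, but now carrying the base class along.
struct HeavyColor : public SerializableBase
{
    uint8_t red, green, blue, alpha;
};

static_assert(sizeof(PlainColor) == 4,
              "PlainColor should be exactly four bytes");
static_assert(sizeof(HeavyColor) > sizeof(PlainColor),
              "the hidden vtable pointer makes every object bigger");
```

The exact size of HeavyColor depends on the platform's pointer size and alignment rules, so the second assertion only checks that it grew.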

    #define DECLARE_CLASS(classname) \
        class classname {  \
           void serialize(Serializer& s);

    #define DECLARE_MEMBER(typ, nam) \
           typ nam;

    #define DEFINE_CLASS(classname) \
        void classname::serialize(Serializer& s) {

    #define DEFINE_MEMBER(typ, nam) \
           if (s.isWriter())        \
             s << nam;              \
           else                     \
             s >> nam;

Not bad. Your set of macros can handle serializing and de-serializing! Pretty clever. There is no inheritance, and a non-virtual member function should keep POD types as POD. Unfortunately this still misses the more interesting possibilities of traversing the objects for something other than serializing, such as hashing, or whatever algorithm you wish or need to apply, rather than specifically serializing with a specific implementation or for a specific format.

An improvement is this:

    #define DECLARE_CLASS(classname) \
        class classname {  \
           template <typename Visitor> \
           void visit(Visitor& v);

    #define DECLARE_MEMBER(typ, nam) \
        typ nam;

    #define DEFINE_CLASS(classname) \
        template <typename Visitor> \
        void classname::visit(Visitor& v) {

    #define DEFINE_MEMBER(typ, nam) \
            v.visit(nam);

Nice work. Instead of that horrible branching for switching between serializing or de-serializing inside the macro (wasn't that yucky), it is instead controlled by which Visitor implementation we pass in and the two sets of code are generated. Sweet.

The Visitor implementation needs to implement the visit function. This function can be templated so that it is specialized for basic types, and can then call the visit function of other compound types, like the ones these macros are creating.

So that is not bad. If you managed to do this, give yourself a couple more pats on the back.
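As a rough sketch of such a Visitor, the class below uses plain overloads for a few basic types (rather than template specializations, which would work too) and a catch-all template that forwards to a compound type's own visit(). TextWriterVisitor, Point and Label are hypothetical names; the two structs are written out by hand to show roughly what the macros would expand to.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical visitor: writes every visited value into a text stream.
class TextWriterVisitor
{
public:
    // Overloads for the basic types end the recursion.
    void visit(int32_t& value)     { m_out << value << ' '; }
    void visit(float& value)       { m_out << value << ' '; }
    void visit(std::string& value) { m_out << value << ' '; }

    // Anything else is assumed to be a compound type with its own
    // visit() member, which in turn calls back into this visitor.
    template <typename T>
    void visit(T& compound) { compound.visit(*this); }

    std::string str() const { return m_out.str(); }

private:
    std::ostringstream m_out;
};

// Roughly what the macros would expand to for two small types.
struct Point
{
    int32_t x;
    int32_t y;

    template <typename Visitor>
    void visit(Visitor& v) { v.visit(x); v.visit(y); }
};

struct Label
{
    std::string text;
    Point       position;

    template <typename Visitor>
    void visit(Visitor& v) { v.visit(text); v.visit(position); }
};
```

Passing a Label to TextWriterVisitor::visit walks the nested Point as well. A reader visitor would have the same shape, with each overload assigning to the member instead of printing it.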

But there is still the duplication in declaring things in two sets of macros for both declaring the class and for generating the code to create the visitor function, which introduces issues of maintainability and is error prone as previously noted (DRY principle).

But really, what are we saving ourselves from writing with these macros anyway?

We could, if we don't mind duplication, simply write out explicitly the expansion of the macros inside the header file like this example:


    struct Person
    {
      int32_t        id;
      std::string    name;
      std::string    email;
      uint64_t       phone;

      // Visit being a member function, the members could
      // be private and this will still work
      template <class V>
      void Visit(V& v)
      {
        v.Enter("person");
        v.Visit("id",id);
        v.Visit("name",name);
        v.Visit("email",email);
        v.Visit("number",phone);
        v.Exit("person");
      }
    };


Or if we don't like having the member function, and want a clearer separation between the struct and the visitor, one way is like this:


    struct Person
    {
      int32_t        id;
      std::string    name;
      std::string    email;
      uint64_t       phone;
    };

    template <class V>
    void Visit(V& v, Person& p)
    {
      v.Enter("person");
      v.Visit("id",p.id);
      v.Visit("name",p.name);
      v.Visit("email",p.email);
      v.Visit("number",p.phone);
      v.Exit("person");
    }

Is that so bad? Everything is together in the one place. It avoids MACROs. It is quite idiomatic C++ code. I believe this will maintain the PODness of structs that would be POD without the visit function (whether it is a member function or not). This particular example also shows how each member can be given a name, which doesn't need to match the member's name in the class. The calls to Enter() and Exit() are there to name the type, and to help particular serialization implementations of the visitor deal with arrays.
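To show that one Visit function can drive different algorithms, here is a sketch with two visitors for the Person example. Both are hypothetical and only illustrative: JsonLikeWriter does no string escaping, so it is not a real JSON writer, and HashVisitor uses FNV-1a just because it is short, not because it is what I use in practice.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

struct Person
{
    int32_t     id;
    std::string name;
    std::string email;
    uint64_t    phone;
};

template <class V>
void Visit(V& v, Person& p)
{
    v.Enter("person");
    v.Visit("id", p.id);
    v.Visit("name", p.name);
    v.Visit("email", p.email);
    v.Visit("number", p.phone);
    v.Exit("person");
}

// Hypothetical visitor 1: writes a flat JSON-like object.
class JsonLikeWriter
{
public:
    void Enter(const char*) { m_out << '{'; m_first = true; }
    void Exit(const char*)  { m_out << '}'; }

    // Strings are quoted...
    void Visit(const char* key, const std::string& value)
    {
        WriteKey(key);
        m_out << '"' << value << '"';
    }

    // ...and every other type is streamed as-is.
    template <class T>
    void Visit(const char* key, const T& value)
    {
        WriteKey(key);
        m_out << value;
    }

    std::string str() const { return m_out.str(); }

private:
    void WriteKey(const char* key)
    {
        if (!m_first) m_out << ',';
        m_first = false;
        m_out << '"' << key << "\":";
    }

    std::ostringstream m_out;
    bool m_first = true;
};

// Hypothetical visitor 2: folds field names and values into a
// 64-bit FNV-1a hash instead of producing any output.
class HashVisitor
{
public:
    void Enter(const char* name) { Mix(name); }
    void Exit(const char*) {}

    void Visit(const char* key, const std::string& value)
    {
        Mix(key);
        Mix(value);
    }

    // Numbers are hashed through their decimal text form.
    template <class T>
    void Visit(const char* key, const T& value)
    {
        Mix(key);
        Mix(std::to_string(value));
    }

    uint64_t hash() const { return m_hash; }

private:
    void Mix(const std::string& text)
    {
        for (unsigned char c : text)
        {
            m_hash ^= c;
            m_hash *= 1099511628211ULL;  // FNV prime
        }
    }

    uint64_t m_hash = 14695981039346656037ULL;  // FNV offset basis
};
```

Calling Visit(writer, person) or Visit(hasher, person) runs the same traversal; only the visitor decides what happens to each field.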

Depending on taste, the visit function could be declared elsewhere, but there is a greater chance that someone adds a new member and doesn't update the visit function if these are in different files. A static_assert of the sizeof the type near the visit function may help detect this.
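A sketch of that static_assert idea, using the four-byte Color from earlier. FieldCounter is a hypothetical visitor that exists only to exercise Visit here.

```cpp
#include <cstdint>

struct Color
{
    uint8_t red;
    uint8_t green;
    uint8_t blue;
    uint8_t alpha;
};

// Fails to compile if the size of Color changes, which is a hint
// that the Visit function below needs to be revisited as well.
static_assert(sizeof(Color) == 4,
              "Color changed size: update Visit() for Color");

template <class V>
void Visit(V& v, Color& c)
{
    v.Enter("color");
    v.Visit("red",   c.red);
    v.Visit("green", c.green);
    v.Visit("blue",  c.blue);
    v.Visit("alpha", c.alpha);
    v.Exit("color");
}

// Hypothetical visitor: just counts how many fields were visited.
struct FieldCounter
{
    int count = 0;

    void Enter(const char*) {}
    void Exit(const char*) {}

    template <class T>
    void Visit(const char*, T&) { ++count; }
};
```

This is only a tripwire. A new member that happens to fit into existing padding, or a change that swaps one member for another of the same size, will not change sizeof and so will slip past it.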

Looking back at ThorsAnvil_MakeTrait to compare, ThorsAnvil does look like a bit less typing, but it requires C++14 and pulls in more code. The above formulation, however, doesn't require anything exotic, nor including headers or pulling in large amounts of outside code. The syntax also feels nice, and gives an opportunity to name the fields. (As JSON, the data can be quite large, and shorter field names can cut down its size. Naming can also help with compatibility/interoperability, by adapting the names to those of externally provided JSON.)

If you're not happy enough with this, and don't mind MACROs, perhaps with a bit of MACRO magic we can declare things just once so we don't need to repeat ourselves. MACROs always seem a bit messy, but we might be able to avoid a bit of duplication, save a bit of typing and overall make things a bit less error prone.

Solution:


So here we go:

Say this is what we want to end up writing when we declare a class, written just once, as the entire declaration and definition for these types:


    DECLARE_STRUCT(TestBaseStruct)
      DECLARE_MEMBER(int,    m_number,  9   /* default value */ )
      DECLARE_MEMBER(bool,   m_bool,    false)
    END_STRUCT()


    DECLARE_STRUCT(TestStruct)
      DECLARE_MEMBER(TestBaseStruct, m_base)
      DECLARE_MEMBER(int32_t,        m_int1) /* specifying defaults is optional */
      DECLARE_MEMBER(int32_t,        m_int2,  100)
      DECLARE_MEMBER(float,          m_flt,   9.0)
    END_STRUCT()


So what we need is something that will both make our struct declaration and simultaneously make our visitor function. Can it be done, you ask? Here I present what I call 'Ryland's Device':


    #define DECLARE_STRUCT(name) \
      struct name { \
        private: \
          typedef name _this_type; \
          static const char* _this_name() { return #name; } \
        public: \
          typedef struct { \
            template <class V> \
            static void Visit(_this_type* o, V& v) { \
            }

    #define DECLARE_MEMBER(type, name, ...) \
          } blah##name; \
          \
          type name = type(__VA_ARGS__);  /* if supporting defaults */ \
          /* type name;   if not supporting defaults */ \
          \
          typedef struct { \
            template <class V> \
            static void Visit(_this_type* o, V& v) { \
              blah##name::Visit(o, v); \
              v.Visit(#name, o->name); \
            }

    #define END_STRUCT() \
          } last; \
          template <class V> \
          void Visit(V& v) \
          { \
            v.Enter(_this_name()); \
            last::Visit(this, v); \
            v.Exit(_this_name()); \
          } \
      };


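It's not pretty. I believe this will work with C++11, and probably with earlier standards too if the default values are dropped (I think they are perhaps the only C++11-specific part). I've only tested with g++, but it will possibly work (hopefully without tweaks) with other compilers.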

I've applied 'Ryland's Device' now for a few things. Serialization/de-serialization, and also for a bit of a speciality use of this in the binding of attributes with shaders in OpenGL and the attribute declarations when setting up vertex buffer objects. I'll probably write a blog or make an article on the details of that separately, but it's interesting the applications that can open with this clever device of making the typedef'd name of the last item be defined in the current item, giving it the ability to call it and chain together the calls of the members. I've never come across this before, so as far as I know this is an original concept. Feel free to use this however you like. If I am the first person to come up with this, I'd feel pretty chuffed if people could refer to it as 'Ryland's Device', that would be all the credit I need :)

So in the end, we can end up with something reasonably close to what can be done in C#, but with added flexibility for options other than just serializing and de-serializing XML.


    DECLARE_STRUCT(TestStruct)
      DECLARE_MEMBER(TestBaseStruct,  m_base)
      DECLARE_MEMBER(int32_t,         m_int1)   /* default is optional */
      DECLARE_MEMBER(int32_t,         m_int2,  100)
      DECLARE_MEMBER(float,           m_flt,   9.0)
    END_STRUCT()


    XMLSerializerVisitor serializer;
    TestStruct test;
    // ... initialization of values ...
    serializer.Visit(test);
    printf("%s", serializer.Output().c_str());


    // Getting the hash of the objects
    MD5SumVisitor hasher;
    hasher.Visit(test);
    printf("hash: %s", hasher.hash().c_str());


    // Getting as json
    JsonVisitor jsonVisitor;
    jsonVisitor.Visit(test);
    printf("json: %s", jsonVisitor.value().toCompactString().c_str());


So is it worth it? I'll let you be the judge. I'm somewhat partial to the formulation with the explicitly provided visitor function declared outside of the type, as I'm not a huge MACRO fan. That said, defining each member and its properties just once, in one place, does have its advantages, despite having to use MACROs.

How does it work?


Basically, each time you call DECLARE_MEMBER it generates a static member function, and each of these also calls the one generated before it. But how do we call the one from before? To give each static member function a kind of namespace, I put it inside a struct, and the previous struct is named after the member currently being declared; that way the current function can call the previous one by name. Using a typedef of an unnamed struct allows the naming to happen after the struct's body, which is what lets the next DECLARE_MEMBER supply the name. The final struct is named 'last', so the real visitor function calls last::Visit, which in turn calls the function before it, and so on back to the empty one created by DECLARE_STRUCT. Hope that makes sense. I'm not sure whether there are any simplifications that could shorten this formulation, but this way does appear to work and is not too much code.


Conclusion:



Instead of duplication, such as doing this:


In the header:

    DECLARE_CLASS(Color)
       DECLARE_MEMBER(int, red)
       DECLARE_MEMBER(int, green)
       DECLARE_MEMBER(int, blue)
    END_DECLARE_CLASS(Color)

And then in a CPP file duplicating similar/same information:

    DEFINE_CLASS(Color)
       DEFINE_MEMBER(int, red)
       DEFINE_MEMBER(int, green)
       DEFINE_MEMBER(int, blue)
    END_DEFINE_CLASS(Color)

Alternatively, this way has duplication too:

    struct Person
    {
      int32_t        id;
      std::string    name;
      std::string    email;
      uint64_t       phone;
    };

Then in the visitor, we have to name the members again:

    template <typename V>
    void Visit(V& v, Person& p)
    {
      v.Enter("person");
      v.Visit("id",p.id);
      v.Visit("name",p.name);
      v.Visit("email",p.email);
      v.Visit("number",p.phone);
      v.Exit("person");
    }

We instead can just do this:

    DECLARE_STRUCT(Color)
       DECLARE_MEMBER(int, red)
       DECLARE_MEMBER(int, green)
       DECLARE_MEMBER(int, blue)
    END_STRUCT()


And done. No duplicated info anywhere. We also saw that this isn't locked in to just serializing/de-serializing, or to doing so for a certain format such as JSON or XML. It is not tightly coupled to a serialization implementation. Other algorithms, such as hashing, can be applied to the objects. It can also preserve POD data as POD, if this is important, provided you refrain from giving the members initializer values and make a minor change to the MACRO so it doesn't emit them (I think that is perhaps the only C++11-specific thing in the macros too).

P.S. - static_asserts


This reminds me: in case you aren't using static_asserts, they are really useful. There is little reason not to use them, as they only add overhead when compiling and have zero overhead on the size and speed of the generated code. They allow catching errors at the right time, at compile time instead of at runtime. (But I may be making an assumption here, that you value working code over compiling code.)

The type of error they can catch is not limited to what can be evaluated by the preprocessor. For example if you want to ensure a given type is POD and stays POD, one can statically assert this, so that if someone else came along and modified a struct to make it non-POD, the code would refuse to compile because of the static_assert. Nice isn't it? You can annotate your code with assertions about the kind of properties you want for a type and have it enforced. No need to let other people guess about what your intent is or inadvertently break the performance of critical code/data. In the case of asserting something is POD, you would do it like this:



    #include <type_traits>
    static_assert(std::is_pod<Color>::value, "Color is not POD");


If the compiler is pre-C++11 and doesn't support static_assert, it can be emulated with a macro, just Google for 'static_assert macro' for one of many options. But it's probably better to just update your compiler instead.
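For reference, one common shape of such a macro is sketched below. It relies on the fact that an array type with a negative size does not compile. The names STATIC_ASSERT and Rgba are made up for this example, and real-world versions often add __LINE__ tricks to generate unique names automatically.

```cpp
// Pre-C++11 emulation: a negative array size is a compile error, so
// this typedef only compiles when 'condition' is true.
// 'tag' must be a valid identifier; it becomes part of the typedef
// name and therefore shows up in the compiler's error message.
#define STATIC_ASSERT(condition, tag) \
    typedef char static_assertion_##tag[(condition) ? 1 : -1]

struct Rgba
{
    unsigned char r, g, b, a;
};

// Compiles quietly while Rgba stays at four bytes.
STATIC_ASSERT(sizeof(Rgba) == 4, Rgba_must_be_4_bytes);
```

Unlike the real static_assert, this cannot carry a free-form message, which is why the tag is written to read like one.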

Coming up


In another article, I can elaborate on implementing specific types of visitors, for example ones to hash and ones to serialize/de-serialize JSON and XML, and there is the OpenGL vertex shader binding and attribute setup I mentioned also.

Leave comments below to let me know what you think. I particularly want to find out if anyone knows of any existing prior art for the trick I did in the MACROs.

