2009-12-18

C# serialization container type pattern

The following pattern is not my invention, but I find it a neat alternative to implementing ISerializable and/or IDeserializationCallback (or the related attributes) when using .NET binary serialization. Typical reasons for wanting this are performance and backwards-compatible serialization.

This takes advantage of the fact that the binary serialization framework stores the type of each object, and so you don't have to know the type on deserialization.

Now, let MyData be the type we want to store:
  1. Do not mark MyData as [Serializable]. Instead,
  2. introduce the IMyDataStorage interface containing the single method ToMyData().
  3. Now, the first version of MyData's serialized format is introduced as the class (or struct) MyDataStorage1, marked [Serializable] and implementing IMyDataStorage.
  4. Introduce the method IMyDataStorage MyData.ToMyDataStorage().
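
The four steps above might look like the following minimal sketch. The Name and Age members are hypothetical and only there for illustration. The BinaryFormatter call is left out on purpose, so that the conversion between MyData and its storage type can be seen (and run) in isolation.

```csharp
using System;

// The domain type: deliberately NOT marked [Serializable] (step 1).
public sealed class MyData
{
    public string Name { get; }   // hypothetical member
    public int Age { get; }       // hypothetical member

    public MyData(string name, int age)
    {
        Name = name;
        Age = age;
    }

    // Step 4: always returns the newest storage version.
    public IMyDataStorage ToMyDataStorage()
    {
        return new MyDataStorage1 { Name = Name, Age = Age };
    }
}

// Step 2: the only type the deserializing code needs to know about.
public interface IMyDataStorage
{
    MyData ToMyData();
}

// Step 3: version 1 of the serialized format. This is what the formatter sees.
[Serializable]
public sealed class MyDataStorage1 : IMyDataStorage
{
    public string Name;
    public int Age;

    public MyData ToMyData()
    {
        return new MyData(Name, Age);
    }
}
```

With these types, new MyData("Ada", 36).ToMyDataStorage().ToMyData() yields an equivalent MyData; in real use the storage object would pass through the formatter in between.
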
When serializing, do this:

// Requires: using System.IO; using System.Runtime.Serialization.Formatters.Binary;
// Create an instance of the newest version of MyDataStorage.
IMyDataStorage storage = myData.ToMyDataStorage();
// Now, store it using regular binary serialization.
using (Stream stream = new ...)
{
    new BinaryFormatter().Serialize(stream, storage);
}

When deserializing, do this:

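// Deserialize. The stream records the concrete storage type,
// so casting to the interface is all that is needed here.
IMyDataStorage storage = (IMyDataStorage)new BinaryFormatter().Deserialize(stream);
MyData myData = storage.ToMyData();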

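Now, when you need to change the serialized format, introduce MyDataStorage2, have MyData.ToMyDataStorage() return it, and update MyDataStorage1.ToMyData() so that it still produces a valid MyData (for example by supplying defaults for values the old format never stored). Apart from that, leave MyDataStorage1 unchanged, so that existing streams can still be read.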
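
As a hedged illustration of such a format change, suppose version 2 adds an Email value to a MyData that previously only had a Name. Both members are hypothetical, chosen just for this example. MyDataStorage1 keeps its fields exactly as they were written to old streams; only its ToMyData() changes.

```csharp
using System;

public sealed class MyData
{
    public string Name { get; }
    public string Email { get; }   // hypothetical member, new in version 2

    public MyData(string name, string email)
    {
        Name = name;
        Email = email;
    }

    // Always writes the newest format.
    public IMyDataStorage ToMyDataStorage()
    {
        return new MyDataStorage2 { Name = Name, Email = Email };
    }
}

public interface IMyDataStorage
{
    MyData ToMyData();
}

// Old format: the fields stay untouched so existing streams still deserialize.
[Serializable]
public sealed class MyDataStorage1 : IMyDataStorage
{
    public string Name;

    // Upgraded conversion: supplies a default for the value version 1 never stored.
    public MyData ToMyData()
    {
        return new MyData(Name, string.Empty);
    }
}

// New format, introduced alongside the old one.
[Serializable]
public sealed class MyDataStorage2 : IMyDataStorage
{
    public string Name;
    public string Email;

    public MyData ToMyData()
    {
        return new MyData(Name, Email);
    }
}
```

The deserializing code does not change at all: it still casts to IMyDataStorage and calls ToMyData(), whichever version the stream happens to contain.
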

When you decide an old format is obsolete, just delete your definition of the old serialized type (and handle the ensuing SerializationException!).

When not to use ISerializable and IDeserializationCallback

When you need to serialize something other than the object's in-memory fields (e.g. for backwards compatibility or for performance, see the previous blog post on that topic), it is tempting to use the ISerializable interface (and in some cases the IDeserializationCallback interface or the OnDeserialized attribute). However, these have important limitations that often make it easier to create a separate type for serialization instead.
  1. On deserialization, the special constructor taking (SerializationInfo, StreamingContext) that ISerializable implicitly requires is invoked at an undefined point in the deserialization of the complete graph. This means that you cannot rely on the value of any referenced objects at the point when the constructor is invoked.
    If, for instance, you need to examine the contents of a field of type Dictionary during deserialization, that dictionary may appear empty, even though it will contain entries once the entire graph has been deserialized.
  2. IDeserializationCallback.OnDeserialization is called once the graph has been deserialized, but if multiple objects in the graph implement this interface, the ordering is not well-defined. This means that if any of your fields implement IDeserializationCallback (such as Dictionary), you cannot rely on them being completely initialized when IDeserializationCallback.OnDeserialization is called.
The workaround for point 1 is to implement IDeserializationCallback, but the workaround for point 2 is to call OnDeserialization on the field yourself (suggested in this blog post), which may or may not work, as the binary serialization framework will also call this method at some point.

(Incidentally, the workaround appears to work for Dictionary, so it is an ugly but functional workaround.)
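
A minimal sketch of that workaround, assuming a hypothetical Registry class with a Dictionary<string, int> field: the callback first asks the dictionary to finish its own deserialization, then runs the fixup that depends on its contents. Dictionary<TKey, TValue>.OnDeserialization is a public method; that a second call from the formatter is harmless is the observed behavior mentioned above, not something this sketch can guarantee.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.Serialization;

[Serializable]
public sealed class Registry : IDeserializationCallback
{
    private Dictionary<string, int> _ages = new Dictionary<string, int>();

    // Derived value: rebuilt after deserialization rather than stored.
    [NonSerialized]
    private int _maxAge;

    public int MaxAge { get { return _maxAge; } }

    public void Add(string name, int age)
    {
        _ages[name] = age;
        _maxAge = Math.Max(_maxAge, age);
    }

    // Called by the formatter once the whole graph has been read, but in no
    // guaranteed order relative to the dictionary's own callback.
    public void OnDeserialization(object sender)
    {
        // Workaround: make the dictionary complete its own initialization first.
        _ages.OnDeserialization(sender);

        // Now the entries can be inspected.
        _maxAge = 0;
        foreach (int age in _ages.Values)
        {
            _maxAge = Math.Max(_maxAge, age);
        }
    }
}
```

On a dictionary that was not produced by deserialization, the extra OnDeserialization call does nothing, so the callback can also be invoked directly on a freshly built Registry.
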

Whether or not to use ISerializable and IDeserializationCallback is arguably a design choice, but the fairly involved mechanics make the code harder for another programmer to read, and factoring the post-deserialization "fixup" code into a code path separate from the .NET serialization system may be a better choice where maintainability is concerned.

(See separate blog post on this.)

.NET binary serialization speed optimizations

When serializing large numbers of small objects (100,000+ instances) using .NET binary serialization (i.e. using the [Serializable] attribute and new BinaryFormatter().Serialize(stream, obj)), there are two important gotchas which can completely ruin serialization performance.

Applying the information below reduced deserialization times from 3:40 to 0:04 (yes, from almost four minutes to four seconds) for one concrete case.
  1. If your small objects are classes or structs (rather than primitives), you will take a major performance hit. This is because the field information is stored per instance, and the binary serializer takes some time to decipher this information. When the number of instances is high, this can add up to minutes.
  2. If you reference other objects from the stored objects, these references need additional bookkeeping on serialization and deserialization not to store the same object twice. (Note that this is the case even for arrays of arrays.)
    This takes a small amount of time per object, which can also add up to a significant number of seconds when the number of objects is large. Still, this is less of a performance hit than point 1.
Workarounds:
  1. Split the arrays of objects into arrays of PODs (plain old datatypes), one per field. For instance, if you have an array of structs or classes each containing string name and int age, this can be split into string[] names and int[] ages before serialization.
  2. Create a lookup table of the objects you're referencing, and store indices into this lookup table instead of the actual references. Then store the lookup table.
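
Workaround 1 can be sketched as follows, using the name/age example from the list. The Person struct and the PersonColumns type are hypothetical names; the point is that the object handed to the formatter holds two arrays instead of one object per person.

```csharp
using System;

public struct Person
{
    public string Name;
    public int Age;
}

// Serialized form: one array per field instead of an array of Person.
[Serializable]
public sealed class PersonColumns
{
    public string[] Names;
    public int[] Ages;

    public static PersonColumns FromPeople(Person[] people)
    {
        var columns = new PersonColumns
        {
            Names = new string[people.Length],
            Ages = new int[people.Length]
        };
        for (int i = 0; i < people.Length; i++)
        {
            columns.Names[i] = people[i].Name;
            columns.Ages[i] = people[i].Age;
        }
        return columns;
    }

    // Reassembles the original array after deserialization.
    public Person[] ToPeople()
    {
        var people = new Person[Names.Length];
        for (int i = 0; i < people.Length; i++)
        {
            people[i] = new Person { Name = Names[i], Age = Ages[i] };
        }
        return people;
    }
}
```
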
Now, IMO the cleanest way to implement this is to use an alternate type for serialization (which holds your separated fields and lookup tables), not using the quite limited ISerializable and IDeserializationCallback interfaces. See this blog post for more.
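
Combining workaround 2 with such an alternate type might look like the sketch below. Employee, Department and EmployeeStorage1 are hypothetical; each employee's department reference is replaced by an index into a table of distinct department names, so the formatter only sees flat arrays. The sketch assumes every employee has a non-null department.

```csharp
using System;
using System.Collections.Generic;

public sealed class Department
{
    public string Name;
}

public sealed class Employee
{
    public int Id;
    public Department Dept;   // shared reference: many employees, few departments
}

[Serializable]
public sealed class EmployeeStorage1
{
    public int[] Ids;
    public int[] DepartmentIndices;    // index into DepartmentNames
    public string[] DepartmentNames;   // the lookup table, stored once

    public static EmployeeStorage1 FromEmployees(IList<Employee> employees)
    {
        // Keyed by reference, since Department does not override Equals.
        var indexOf = new Dictionary<Department, int>();
        var names = new List<string>();
        var storage = new EmployeeStorage1
        {
            Ids = new int[employees.Count],
            DepartmentIndices = new int[employees.Count]
        };
        for (int i = 0; i < employees.Count; i++)
        {
            Department dept = employees[i].Dept;
            int index;
            if (!indexOf.TryGetValue(dept, out index))
            {
                index = names.Count;
                indexOf.Add(dept, index);
                names.Add(dept.Name);
            }
            storage.Ids[i] = employees[i].Id;
            storage.DepartmentIndices[i] = index;
        }
        storage.DepartmentNames = names.ToArray();
        return storage;
    }

    public Employee[] ToEmployees()
    {
        // Rebuild each shared Department once, then hand out references to it.
        var departments = new Department[DepartmentNames.Length];
        for (int i = 0; i < departments.Length; i++)
        {
            departments[i] = new Department { Name = DepartmentNames[i] };
        }
        var employees = new Employee[Ids.Length];
        for (int i = 0; i < employees.Length; i++)
        {
            employees[i] = new Employee
            {
                Id = Ids[i],
                Dept = departments[DepartmentIndices[i]]
            };
        }
        return employees;
    }
}
```

Employees that shared a Department before serialization share one again afterwards, without the formatter having to track those references itself.
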