2009-12-18

.NET binary serialization speed optimizations

When serializing large numbers of small objects (100,000+ instances) using .NET binary serialization (i.e. the [Serializable] attribute and new BinaryFormatter().Serialize(stream, object)), there are two important gotchas that can completely ruin serialization performance.

Applying the information below reduced deserialization time from 3:40 to 0:04 (yes, from almost four minutes to four seconds) in one concrete case.
  1. If your small objects are classes or structs, you will take a major performance hit. This is because field information is stored per instance, and the binary serializer takes time to decipher it. When the number of instances is high, this can add up to minutes.
  2. If you reference other objects from the stored objects, those references need additional bookkeeping during serialization and deserialization so that the same object is not stored twice. (Note that this is the case even for arrays of arrays.)
    This takes a small amount of time per object, which can also add up to a significant number of seconds when the object count is large. Still, it is less of a performance hit than point 1.
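For reference, the slow path described above looks like this (the Person type and field names are illustrative, not from any concrete case):

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;

// Hypothetical example type: one small serializable class per record.
[Serializable]
class Person
{
    public string Name;
    public int Age;
}

class NaiveSerialization
{
    static void Main()
    {
        // A large array of small objects: the slow path described above.
        var people = new Person[100000];
        for (int i = 0; i < people.Length; i++)
            people[i] = new Person { Name = "P" + i, Age = i % 100 };

        var watch = Stopwatch.StartNew();
        using (var stream = new MemoryStream())
        {
            // BinaryFormatter pays per-instance field bookkeeping here.
            new BinaryFormatter().Serialize(stream, people);
            Console.WriteLine("Serialized {0} bytes in {1} ms",
                stream.Length, watch.ElapsedMilliseconds);
        }
    }
}
```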
Workarounds:
  1. Split the arrays of objects into arrays of PODs (plain old datatypes), one per field. For instance, if you have an array of structs or classes each containing a string name and an int age, split it into string[] names and int[] ages before serialization.
  2. Create a lookup table of the objects you're referencing, store indices into this lookup table instead of the actual references, and then store the lookup table itself.
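As a sketch of the second workaround (all type and member names here are my own, illustrative inventions): suppose each stored object references a shared Department instance. Instead of serializing the reference, store an index into a separate Department[] table:

```csharp
using System;
using System.Collections.Generic;

// Illustrative types: each Person references a shared Department object.
[Serializable]
class Department
{
    public string Name;
}

class Person
{
    public string Name;
    public Department Dept; // object reference: needs bookkeeping to serialize
}

// Serializable form: the reference is replaced by an index into a
// Department[] lookup table, which is stored separately.
[Serializable]
class PersonData
{
    public string Name;
    public int DeptIndex;
}

class ReferenceFlattener
{
    // Converts references into indices; 'table' is then serialized
    // alongside the PersonData[] array.
    public static PersonData[] Flatten(Person[] people, out Department[] table)
    {
        var departments = new List<Department>();
        var indexOf = new Dictionary<Department, int>();
        var data = new PersonData[people.Length];
        for (int i = 0; i < people.Length; i++)
        {
            int idx;
            if (!indexOf.TryGetValue(people[i].Dept, out idx))
            {
                // First time we see this Department: add it to the table.
                idx = departments.Count;
                departments.Add(people[i].Dept);
                indexOf.Add(people[i].Dept, idx);
            }
            data[i] = new PersonData { Name = people[i].Name, DeptIndex = idx };
        }
        table = departments.ToArray();
        return data;
    }
}
```

On deserialization, the references are restored by indexing into the deserialized table, so the formatter never has to track object identity itself.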
Now, IMO the cleanest way to implement this is to use an alternate type for serialization (one which holds your separated fields and lookup tables), rather than the quite limited ISerializable and IDeserializationCallback interfaces. See this blog post for more.
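A minimal sketch of that alternate-type approach (again with illustrative names): the runtime type stays as-is, and a separate [Serializable] type holds the split POD arrays and performs the conversion in both directions:

```csharp
using System;

// Runtime representation: convenient to work with, slow to serialize directly.
class Person
{
    public string Name;
    public int Age;
}

// Alternate type used only for serialization: one POD array per field.
[Serializable]
class PersonListData
{
    public string[] Names;
    public int[] Ages;

    // Called just before serialization.
    public static PersonListData FromPeople(Person[] people)
    {
        var data = new PersonListData
        {
            Names = new string[people.Length],
            Ages = new int[people.Length]
        };
        for (int i = 0; i < people.Length; i++)
        {
            data.Names[i] = people[i].Name;
            data.Ages[i] = people[i].Age;
        }
        return data;
    }

    // Called just after deserialization.
    public Person[] ToPeople()
    {
        var people = new Person[Names.Length];
        for (int i = 0; i < people.Length; i++)
            people[i] = new Person { Name = Names[i], Age = Ages[i] };
        return people;
    }
}
```

You then serialize PersonListData.FromPeople(people) instead of the Person[] itself, and call ToPeople() on the deserialized instance; the formatter only ever sees a handful of flat arrays.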
