Avro 2.0 Specification Proposals

Optional Fields (AVRO-519)

Arguments in Favor

  • Large sets of very sparse optional fields have a variety of uses, e.g. tags
  • Some other serialization systems have direct support for optional fields
  • Avro can handle missing fields by providing a default value. However, this can be inefficient for sparse sets of options, and it is more complex because the default value doubles as the "empty" marker.
  • Avro can handle optional fields as a vector of a union of all the optional fields. While this works, it makes quickly determining whether a particular field is present complex and inefficient. Annotations can be used to decrease the API complexity, but that argues for making optional a top-level concept.
  • Support for fast "select" of sparse optional fields from a top-level record would benefit certain applications.

Proposal

An additional field attribute:

  • "optional" - with values "true"/"false" (where "false" is assumed)

For the encoding, any record which includes optional fields would be prefixed by a presence map, which would be a sequence of int8 values x, where:

  • x > 0 : the lower 7 bits are presence bits for the next 7 optional fields (low bit first)
  • -128 < x < 0 : the next present field is at position x + 135 (x runs from -1 down to -127, and the first 7 fields must be empty, otherwise the x > 0 encoding would be used)
  • x == -128 : no optional fields are present in the next 134 optional fields
  • x == 0 : end of sequence

Further, if the map has covered all the optional fields, the end-of-sequence marker can be elided. For example, a type with 3 optional fields would require only a single byte.

This permits encoding at 8/7 of a bit per present entry (worst case), and at a cost of 8/134 (about 0.06) bits per entry for all but the last run of not-present entries (about 7.5 bytes per 1000 optional fields).
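
The rules above can be sketched as an encoder and decoder. This is only an illustration, not part of the proposal: the class name PresenceMap and its methods are hypothetical. It assumes that positions are 1-based relative to the first optional field not yet covered, that a negative byte also consumes the field it points at, and that unused high bits of a final bitmask byte are zero.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper; the name and API are assumptions made for illustration.
final class PresenceMap {
    private PresenceMap() {}

    /** Encodes which optional fields are present using the int8 rules above. */
    static byte[] encode(boolean[] present) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int n = present.length;
        int cursor = 0; // index of the first optional field not yet covered
        while (cursor < n) {
            int next = cursor;
            while (next < n && !present[next]) next++;
            if (next == n) {            // nothing else present: x == 0 ends the map
                out.write(0);
                break;
            }
            int rel = next - cursor + 1; // 1-based position of the next present field
            if (rel <= 7) {              // x > 0: bitmask over the next 7 fields
                int mask = 0;
                for (int i = 0; i < 7 && cursor + i < n; i++) {
                    if (present[cursor + i]) mask |= 1 << i;
                }
                out.write(mask);
                cursor += 7;
            } else if (rel <= 134) {     // -128 < x < 0: jump to position x + 135
                out.write(rel - 135);
                cursor += rel;           // assumption: the jump consumes that field
            } else {                     // x == -128: skip 134 absent fields
                out.write(-128);
                cursor += 134;
            }
        }
        // If the loop ran past the last field, the end marker is elided.
        return out.toByteArray();
    }

    /** Decodes a presence map for a record with optionalCount optional fields. */
    static boolean[] decode(byte[] map, int optionalCount) {
        boolean[] present = new boolean[optionalCount];
        int cursor = 0;
        int i = 0;
        while (cursor < optionalCount && i < map.length) {
            byte x = map[i++];
            if (x == 0) break;
            if (x > 0) {
                for (int bit = 0; bit < 7 && cursor + bit < optionalCount; bit++) {
                    present[cursor + bit] = (x & (1 << bit)) != 0;
                }
                cursor += 7;
            } else if (x == -128) {
                cursor += 134;
            } else {
                int rel = x + 135;
                present[cursor + rel - 1] = true;
                cursor += rel;
            }
        }
        return present;
    }
}
```

Under these assumptions, three optional fields with the first and third present encode to the single byte 5, and 1000 optional fields with only the last one present encode to 8 bytes (seven -128 skips followed by -73). A real decoder would also need to report how many bytes it consumed so that the record body can be read next.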

This encoding is backward compatible, as schemas which do not contain optional elements do not have the presence map and the encoding is therefore identical. Backward compatibility can be maintained by simply using the default value for not-present fields.

Language APIs

Efficient support could include either an explicit presence test or a function which returns the value or default value (if the field is not present).
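
As a rough sketch of the two options just mentioned, a language binding might expose something like the following. The class OptionalFieldRecord and its method names are hypothetical and are not taken from any existing Avro API. The sketch keeps presence separate from the default value, so a field explicitly set to its default still reports as present.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical record wrapper; names are assumptions made for illustration.
final class OptionalFieldRecord {
    private final Map<String, Object> defaults;
    private final Map<String, Object> presentValues = new HashMap<>();

    OptionalFieldRecord(Map<String, Object> defaults) {
        this.defaults = new HashMap<>(defaults);
    }

    /** Marks the field as present and stores its value. */
    void put(String field, Object value) {
        presentValues.put(field, value);
    }

    /** Explicit presence test. */
    boolean has(String field) {
        return presentValues.containsKey(field);
    }

    /** Returns the stored value, or the schema default if the field is not present. */
    Object getOrDefault(String field) {
        return presentValues.containsKey(field)
                ? presentValues.get(field)
                : defaults.get(field);
    }
}
```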

Named Unions (AVRO-248)

Arguments in Favor

  • Anonymous unions make reuse difficult (AVRO-266)
  • Other serialization systems support names for unions and branches, arrays

Proposal

{ "type": "union", "name": "Foo", "branches": ["string", "Bar", ... ] }

Language APIs

For Java, when code is generated for a union, a class could be generated that includes an enum indicating which branch of the union is taken. For example, a union of string and int named Foo might produce a Java class like:

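public class Foo {
  public enum Type { STRING, INT }

  private Type type;     // which branch of the union is set
  private Object datum;  // the value of that branch

  public Type getType() { return type; }

  public String getString() {
    if (type == Type.STRING) return (String) datum;
    // the exception type here is illustrative only
    throw new IllegalStateException("union does not hold a string");
  }

  public void setString(String s) { type = Type.STRING; datum = s; }
  // ... matching accessors for the INT branch
}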

Then Java applications can easily use a switch statement to process union values rather than using instanceof.
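
A minimal sketch of that switch-based dispatch follows. FooUnion is a cut-down, hypothetical stand-in for the generated class described above, and FooPrinter.describe is an invented consumer. Neither is output of the actual Avro compiler.

```java
// Hypothetical stand-in for a generated named-union class.
final class FooUnion {
    enum Type { STRING, INT }

    private final Type type;
    private final Object datum;

    private FooUnion(Type type, Object datum) {
        this.type = type;
        this.datum = datum;
    }

    static FooUnion ofString(String s) { return new FooUnion(Type.STRING, s); }
    static FooUnion ofInt(int i) { return new FooUnion(Type.INT, i); }

    Type getType() { return type; }

    String getString() {
        if (type != Type.STRING) throw new IllegalStateException("not a string");
        return (String) datum;
    }

    int getInt() {
        if (type != Type.INT) throw new IllegalStateException("not an int");
        return (Integer) datum;
    }
}

// Hypothetical consumer that dispatches on the branch tag instead of instanceof.
final class FooPrinter {
    private FooPrinter() {}

    static String describe(FooUnion foo) {
        switch (foo.getType()) {
            case STRING:
                return "string:" + foo.getString();
            case INT:
                return "int:" + foo.getInt();
            default:
                throw new IllegalStateException("unknown branch");
        }
    }
}
```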

  • When using reflection, an abstract class with a set of concrete implementations can be represented as a union (AVRO-241). However, if one wishes to create an array, one must know the name of the base class, which is not represented in the Avro schema. One approach would be to add an annotation to the reflected array schema (AVRO-242) noting the base class. But if the union itself were named, that name could identify the base class. This would also make reflected protocol interfaces more concise, since the base class name could be used in parameters, return types, and fields.
  • Generalizing the above: Avro lacks class inheritance, unions are a way to model inheritance, and this model is more useful if the union is named.

Named Branches (discussed in AVRO-248)

Arguments in Favor

  • Anonymous branches are not supported in some languages and require casts or type checks in others
  • One argument against named branches was that anonymous branches are a good way of handling nullable fields; these could instead be handled as optionals (above)
  • Other serialization systems support names for unions and branches, arrays

Proposal

{ "type": "union", "name": "Foo", "branches": [{"name": "URL", "type": "string"} , {"name": "hostname", "type": "string"} , ... ] }

Language APIs

The language API should produce named, typed accessors in addition to the tag. Languages which have native support for named branches (e.g. C, C++, Pascal) should use an explicit tag and their native unions.
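
To illustrate why names matter here, the sketch below models the example union with two branches that share the type string, so the Java type alone cannot tell them apart. The class name Endpoint, the Branch enum, and the accessor names are hypothetical choices for this example, not a proposed generated API.

```java
// Hypothetical class for a named union with two string-typed branches.
final class Endpoint {
    enum Branch { URL, HOSTNAME }

    private final Branch branch; // explicit tag
    private final String value;

    private Endpoint(Branch branch, String value) {
        this.branch = branch;
        this.value = value;
    }

    static Endpoint url(String url) { return new Endpoint(Branch.URL, url); }
    static Endpoint hostname(String host) { return new Endpoint(Branch.HOSTNAME, host); }

    Branch getBranch() { return branch; }

    /** Named, typed accessor for the "URL" branch. */
    String getUrl() {
        if (branch != Branch.URL) throw new IllegalStateException("branch is " + branch);
        return value;
    }

    /** Named, typed accessor for the "hostname" branch. */
    String getHostname() {
        if (branch != Branch.HOSTNAME) throw new IllegalStateException("branch is " + branch);
        return value;
    }
}
```

With anonymous branches, both values would surface as a plain String and the distinction between a URL and a hostname would be lost.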
