Unicode Normalization for Path, Mergeinfo and Lock Lookup
This specification is the result of a number of ongoing discussions, including issue #2464 and the various discussions gathered on the UnicodeComposition page. It has also been strongly influenced by this blog post, which discusses the solution adopted by the ZFS filesystem.
Any solution to the normalization problem must maintain strict backwards compatibility between clients and servers. This implies that:
- we cannot change the network protocol to require that all paths are normalized;
- the server cannot store paths, or return them to clients, in a different representation than the one they were originally created with.
The solution must also not drastically affect the performance of the server or working copy. For example, the working copy database cannot use a normalization-independent collation for indexing paths, because that limits SQLite's ability to optimize queries.
For repositories that use the FSFS backend, the solution must not affect the layout of the revision files or directory contents. The repository administrator should be given the choice whether to implement the solution, regardless of format version.
All of the above boils down to:
- The server must accept paths in any representation that can be normalized to the same byte sequence as the normalized representation of the stored path (or mergeinfo entry or lock path).
- The client must send paths (and mergeinfo entries and lock tokens) in exactly the same representation as it received them from the server.
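The two rules above can be illustrated with a minimal sketch. It is not the actual implementation: the helper names normalize_path and find_stored_path are hypothetical, the choice of NFC as the comparison form is an assumption, and the linear scan stands in for whatever lookup structure the server really uses. The point is only that normalization is applied for comparison, while the value handed back is always the stored representation.

```python
import unicodedata


def normalize_path(path: str) -> str:
    """Return a canonical form of the path, used only for comparison.

    NFC is an assumption made for this sketch; any single normalization
    form works as long as both sides of the comparison use the same one.
    Equal Python strings encode to equal UTF-8 byte sequences.
    """
    return unicodedata.normalize("NFC", path)


def find_stored_path(stored_paths, requested):
    """Return the stored path equivalent to `requested`, or None.

    The result is the stored representation, never the requested one,
    so the caller can send it back to the client unchanged.
    """
    wanted = normalize_path(requested)
    for stored in stored_paths:
        if normalize_path(stored) == wanted:
            return stored
    return None


# The same name spelled with a precomposed "é" (U+00E9) and with
# "e" followed by a combining acute accent (U+0301).
composed = "trunk/caf\u00e9.txt"
decomposed = "trunk/cafe\u0301.txt"

stored = [decomposed]  # the path was originally created in decomposed form
found = find_stored_path(stored, composed)
```

Here the request arrives in composed form, but found holds the decomposed string that was originally stored. A client that follows the second rule would use that string, byte for byte, in later requests.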
FSX should incorporate the solution as a mandatory feature. BDB will not support it, ever.
Repository and FS API Implementation
In the FSFS back-end, we use paths as keys in three distinct ways:
- during lookup of directory entries physically stored on disk;
- for writing and reading entries in the node cache;
- when writing directory entries for new and changed nodes in transactions.