This page gathers design notes related to Unicode Composition.
Unicode Composition for Paths
The problem was originally described in the note Unicode Composition for Filenames and has since been discussed a number of times on the mailing list.
Different solutions to the issue are described below:
- NonNormalizingUnicodeCompositionAwareness: proposes Unicode composition awareness in the client and minimal changes in the server.
- UnicodeCollation: an experimental approach leveraging a SQLite collation, e.g. the SQLite ICU extension or a Subversion collation.
- NormalizationOfUnicodeComposition: (could be drafted as a competing proposal) normalization of all paths in the repository.
There is now a branch open for the client-side implementation, generally following these design discussions. It embeds the utf8proc library into libsvn_subr instead of using ICU, but otherwise follows the same general pattern.
The plan is to provide the following extensions for SQLite:
- A collation for paths that normalizes to NFD before comparing keys.
- A similar replacement for the LIKE and GLOB operators.
  This will remove the need to specify PRAGMA case_sensitive_like=1, since this LIKE operator will always be case-sensitive.
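The collation idea can be illustrated with a minimal sketch. It uses Python's standard sqlite3 and unicodedata modules rather than the utf8proc-based C code on the branch, and the collation name svn_nfd_path and the one-column table are hypothetical, chosen only for this illustration.

```python
import sqlite3
import unicodedata


def nfd_collation(a: str, b: str) -> int:
    """Compare two paths after normalizing both to NFD."""
    na = unicodedata.normalize("NFD", a)
    nb = unicodedata.normalize("NFD", b)
    return (na > nb) - (na < nb)


conn = sqlite3.connect(":memory:")
# "svn_nfd_path" is a hypothetical name, not one defined by Subversion.
conn.create_collation("svn_nfd_path", nfd_collation)
conn.execute(
    "CREATE TABLE nodes ("
    " local_relpath TEXT COLLATE svn_nfd_path PRIMARY KEY)"
)

composed = "caf\u00e9.txt"     # precomposed e-acute (NFC form)
decomposed = "cafe\u0301.txt"  # 'e' plus combining acute accent (NFD form)

conn.execute("INSERT INTO nodes VALUES (?)", (composed,))

# The search key differs byte-for-byte from the stored key, but the
# column's collation treats the two as equal.
rows = conn.execute(
    "SELECT local_relpath FROM nodes WHERE local_relpath = ?",
    (decomposed,),
).fetchall()
print(rows)
```

Because the collation is attached to the column, the primary-key index is built with it as well, so inserting the decomposed spelling as a second row would be rejected as a duplicate key.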
Since columns in the database will use non-standard collations, we'll also create a SQLite extension module, svnwcdb.sqlext, that defines the same collation and operators. A new command-line tool, svnwcdb, will launch a SQLite shell with the extension loaded and all other required parameters set.
N.B.: LIKE and GLOB are not and should not be used by libsvn_wc because they cannot use indexes. However, for completeness, the svnwcdb.sqlext SQLite extension must override them; otherwise, inspecting the working copy database using command-line SQLite tools would not be reliable.
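As a rough sketch of what overriding LIKE could look like, the example below relies on the documented SQLite behaviour that `X LIKE Y` calls an application-defined `like(Y, X)` function when one is registered. It is written with Python's sqlite3 module for brevity, not as the svnwcdb.sqlext extension itself; the helper name nfd_like is hypothetical, and the ESCAPE form (the three-argument `like`) is left out.

```python
import re
import sqlite3
import unicodedata


def nfd_like(pattern, value):
    """Case-sensitive LIKE that normalizes both operands to NFD first.

    SQLite calls like(pattern, value) for the expression `value LIKE pattern`.
    """
    if pattern is None or value is None:
        return None
    pattern = unicodedata.normalize("NFD", pattern)
    value = unicodedata.normalize("NFD", value)
    # Translate LIKE wildcards: % matches any run, _ matches one code point.
    regex = "".join(
        ".*" if ch == "%" else "." if ch == "_" else re.escape(ch)
        for ch in pattern
    )
    return 1 if re.fullmatch(regex, value, flags=re.DOTALL) else 0


conn = sqlite3.connect(":memory:")
# Registering a two-argument function named "like" replaces the built-in
# operator for this connection only.
conn.create_function("like", 2, nfd_like)

matches_prefix = conn.execute(
    "SELECT ? LIKE ?", ("cafe\u0301/file.txt", "caf\u00e9/%")
).fetchone()[0]
matches_other_case = conn.execute("SELECT 'ABC' LIKE 'abc'").fetchone()[0]
print(matches_prefix, matches_other_case)
```

One open detail in this sketch is that `_` matches a single code point of the NFD form, so a decomposed character counts as two; a real implementation would have to decide whether that is acceptable.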
N.B.: With a bit of magic we'll make svnwcdb work with the amalgamated SQLite, which happens to be an amazingly good idea if we use the amalgamation to override a too-old or broken installed version.
Working Copy Database
Every SQL statement currently used that returns information about the node, e.g., STMT_SELECT_NODE_INFO, must be modified to also return the actual local_relpath it found, since there's no guarantee that the search key will be byte-for-byte identical to the row key. Consequently, functions such as svn_wc__db_read_info must return that column along with all the others. It's an open question whether these changes will have to propagate all the way to the public svn_wc API.
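The following sketch shows why the stored local_relpath has to be part of the result. It uses a deliberately simplified, hypothetical nodes table (the real NODES schema has more columns) and a hypothetical read_info helper standing in for functions such as svn_wc__db_read_info; the collation name svn_nfd_path is likewise an assumption.

```python
import sqlite3
import unicodedata


def nfd_compare(a: str, b: str) -> int:
    """Order two strings by their NFD-normalized forms."""
    na = unicodedata.normalize("NFD", a)
    nb = unicodedata.normalize("NFD", b)
    return (na > nb) - (na < nb)


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.create_collation("svn_nfd_path", nfd_compare)
    # Simplified stand-in for the working copy NODES table.
    conn.execute(
        "CREATE TABLE nodes ("
        " wc_id INTEGER NOT NULL,"
        " local_relpath TEXT NOT NULL COLLATE svn_nfd_path,"
        " kind TEXT NOT NULL,"
        " PRIMARY KEY (wc_id, local_relpath))"
    )
    return conn


def read_info(conn, wc_id, local_relpath):
    """Return (stored_relpath, kind), or None if there is no such node."""
    return conn.execute(
        "SELECT local_relpath, kind FROM nodes"
        " WHERE wc_id = ? AND local_relpath = ?",
        (wc_id, local_relpath),
    ).fetchone()


conn = open_db()
stored = "docs/re\u0301sume\u0301.txt"   # decomposed form stored in the row
searched = "docs/r\u00e9sum\u00e9.txt"   # composed form used as search key
conn.execute("INSERT INTO nodes VALUES (1, ?, 'file')", (stored,))

info = read_info(conn, 1, searched)
print(info is not None and info[0] == stored)
```

The caller asked for the composed spelling but gets back the decomposed one that is actually stored, which is the value any later byte-for-byte operation would need.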
Alternative: These functions already return repos-relpath, which is what should be communicated to the repository and should ideally never come from the local disk, except for locally added files.