Non-normalizing Unicode Composition Awareness

Context

In Unicode, some characters can be represented in more than one way (composed/decomposed, canonical ordering, etc.) while rendering identically on screen or in print. A Unicode string (e.g. a file name) can be in a normalized form (NFC/NFD) or mixed (not normalized).
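As a minimal illustration of the forms mentioned above, the following Python sketch uses the standard unicodedata module. The sample character and the variable names are chosen for illustration only and do not come from Subversion.

```python
import unicodedata

# Two representations of the same visible character "Å".
composed = "\u00c5"      # one code point: LATIN CAPITAL LETTER A WITH RING ABOVE
decomposed = "A\u030a"   # two code points: "A" + COMBINING RING ABOVE

# A string containing both representations is "mixed": it is neither NFC nor NFD.
mixed = composed + decomposed

def to_nfc(name):
    """Return the composed (NFC) form of name."""
    return unicodedata.normalize("NFC", name)

def to_nfd(name):
    """Return the decomposed (NFD) form of name."""
    return unicodedata.normalize("NFD", name)

# The raw strings differ, but their normalized forms are equal.
print(composed == decomposed)
print(to_nfc(composed) == to_nfc(decomposed))
print(mixed == to_nfc(mixed), mixed == to_nfd(mixed))
```

The comparison of raw strings fails even though both render the same way, which is the root of the issues described on this page.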

The majority of file systems (e.g. NTFS, Ext3) will accept a Unicode filename in any form, store it, and give it back in the form in which it was input. These file systems will typically even accept multiple files whose paths look identical on screen but whose Unicode strings differ due to character composition.

A minority of file systems (currently Mac OS X HFS+ only) will normalize the paths. In the case of HFS+, the path will be normalized into NFD and it will be given back that way when listing the filesystem.

Most significant differences from the majority of filesystems:

The topic has been described here: http://svn.apache.org/repos/asf/subversion/trunk/notes/unicode-composition-for-filenames

http://svn.haxx.se/dev/archive-2010-09/0319.shtml

Issue Description

Differences to case-sensitivity

Similarities to case-sensitivity

To Normalize or Not to Normalize

Whether or not to normalize within a Subversion repository (server-side) has been debated. The note (unicode-composition-for-filenames) considers normalization to NFC to be the long-term (2.x) solution. That approach is hereafter referred to as "repository normalization".

There are implementation advantages with normalized paths which can simplify comparisons and storage.

There are also reasons not to normalize:

However, there is very little reason to allow the creation of new "normalized-name collisions". There are no known use-cases for creating multiple files in the same directory that would have identical normalized paths. Subversion should preferably refuse such add operations as early as possible, at the latest during commit. This feature is hereafter referred to as "uniqueness normalization".

Solution Overview

There are 2 components of this solution, one server side and one client side. These can be addressed individually, which is an important requirement for Subversion 1.x interoperability between client and server versions.

This solution does not normalize paths in the repository. Paths are only normalized for the purpose of comparisons.

Server Changes

The Subversion server should no longer accept 'add':ing paths that cause "normalized-name collisions". The comparison with existing paths (and other paths in the same txn) should be performed in normalized form. However, the paths created in the repository will keep the form input by the client.
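The check described above can be sketched in Python as follows. This is only an illustration of comparing in normalized form while storing the client's form. The function names are hypothetical and NFC is an arbitrary choice of comparison form. Nothing here reflects the actual Subversion filesystem code.

```python
import unicodedata

def collision_key(name):
    """Form used only for comparison; the stored name is never altered.

    NFC is an assumption here. Any single normalization form would do,
    as long as it is applied consistently to both sides.
    """
    return unicodedata.normalize("NFC", name)

def find_collision(existing_names, new_name):
    """Return the existing name that new_name collides with, or None.

    existing_names stands for the entries already in the parent directory,
    including other paths added in the same transaction. An exact duplicate
    is also reported, since it has the same normalized form.
    """
    key = collision_key(new_name)
    for name in existing_names:
        if collision_key(name) == key:
            return name
    return None

def add_entry(existing_names, new_name):
    """Refuse the add on a collision; otherwise keep the name as input."""
    clash = find_collision(existing_names, new_name)
    if clash is not None:
        raise ValueError("normalized-name collision with %r" % clash)
    existing_names.append(new_name)  # stored in the form input by the client
```

For example, adding "\u00c5.txt" to a directory that already holds "A\u030a.txt" would be refused, while a first add of a decomposed name is stored unchanged.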

There could be a performance impact. [Need more data] However, the 'add' operation is not one of the most frequent ones, in a typical installation.

The major impact would not stem from collision avoidance on add but normalization during directory search, which affects most other operations. For the server, it is probably better to store names twice (original for display and normalized for indexing) rather than normalize on every lookup.

ThomasAkesson: It might be better to store names twice, but I don't see why the server needs to do normalization during directory search? That would be a client side task in this proposal.

It is not possible to rely on client behavior. A Subversion server can be accessed via mod_dav_svn and by older Subversion clients.
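The "store names twice" idea from the discussion above could look roughly like the following Python sketch. The class and method names are hypothetical, and an in-memory dict stands in for whatever index the server would really use. It only shows that lookups by normalized key avoid normalizing every stored name on each search.

```python
import unicodedata

class DirectoryIndex:
    """Hypothetical directory index keyed by normalized name.

    Each entry keeps the original name (for display and for returning to
    clients) next to the normalized name (used only as the lookup key).
    """

    def __init__(self):
        self._by_key = {}  # normalized name -> original name as input

    @staticmethod
    def _key(name):
        # Assumption: NFC is the indexing form.
        return unicodedata.normalize("NFC", name)

    def add(self, name):
        key = self._key(name)
        if key in self._by_key:
            raise ValueError("normalized-name collision with %r" % self._by_key[key])
        self._by_key[key] = name

    def lookup(self, name):
        """Return the stored original name matching name, or None."""
        return self._by_key.get(self._key(name))

    def names(self):
        """Return the original names, exactly as they were input."""
        return sorted(self._by_key.values())
```

With this layout only the name being searched for is normalized per lookup. The cost moves to storage, since each name is held in two forms.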

The desired server behavior can be accomplished with Subversion 1.7 or earlier using a pre-commit hook, but it is desirable to have "uniqueness normalization" as the future default behavior.

This would make it impossible to load a dump of a repository with "normalized-name collisions". An important advantage of this proposal compared to normalizing approaches is that there is no requirement to process legacy data (see below for a discussion on 'svn mv' as cleanup tool). During loading of dump files, the normalized comparison should be disabled, either by default or via a switch, e.g. --ignore-utf-normalize.

Client Changes

The Working Copy needs an abstraction between the repository path provided by the server and the actual file system path. This is required for normalizing file systems (HFS+) regardless of whether the Subversion server performs normalization to NFC (repository normalization) or just enforces "uniqueness normalization".

It might be more feasible to implement such an abstraction now in wc-ng than it was in Subversion <=1.6.
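A rough Python sketch of such an abstraction is given below. It is not the wc-ng design. The function names and the boolean flag are hypothetical, and plain NFD is used as an approximation of what HFS+ returns, which may differ in detail.

```python
import unicodedata

def expected_disk_name(repos_name, fs_normalizes_to_nfd):
    """Name the file system is expected to report for repos_name.

    Assumption: a normalizing file system is modelled as plain NFD.
    A non-normalizing file system gives the name back unchanged.
    """
    if fs_normalizes_to_nfd:
        return unicodedata.normalize("NFD", repos_name)
    return repos_name

def build_path_map(repos_names, fs_normalizes_to_nfd):
    """Map on-disk name -> repository name for one directory.

    Raises ValueError when two repository names would end up as the same
    name on disk, i.e. an existing "normalized-name collision" arriving
    on a normalizing file system.
    """
    mapping = {}
    for repos_name in repos_names:
        on_disk = expected_disk_name(repos_name, fs_normalizes_to_nfd)
        if on_disk in mapping:
            raise ValueError(
                "%r and %r cannot coexist on this file system"
                % (mapping[on_disk], repos_name)
            )
        mapping[on_disk] = repos_name
    return mapping
```

The map lets the client translate a name from a directory listing back to the repository path, whatever form the repository stored.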

Alternative Approaches

There are different approaches to implementing this abstraction of paths. The following have been identified so far, each with its Wiki page:

The following sections are applicable to all above approaches.

Normalized uniqueness

Repository path uniqueness should be checked in normalized form during add operations, in order to prevent new "normalized-name collisions" as early as possible. It might be acceptable to identify this later during commit, since very few users will encounter this condition. At the latest, it will be identified by the server (with above change).

When an existing "normalized-name collision" arrives in a Working Copy on HFS+ via checkout or update, there will be a uniqueness issue in the column local_relpath (queried with collation) or in local_relpath_disk and a situation somewhat similar to an obstruction. This should be communicated in some friendly way, similar to conflicts on case-insensitive file systems.

Pristine Storage

Since svn 1.7, pristines are stored based on the SHA1 checksum of their contents, independently of their name. There should be very little impact.

Command Line

When referring to WC entries using the command line on Mac OS X, tab-completion works unreliably because the keyboard typically produces composed characters while file names on disk are NFD. Tab-completion is a general Mac OS X issue which should be addressed by Apple, specifically the case where the user types the beginning of a name including a composed character (which currently matches nothing on disk). However, Subversion could be helpful when attempting to identify entries referred to via the command line.

* Subversion must recognize paths that match the file system Unicode path (even if it does not match the repository path). Failure to do so makes tab-completion unusable, especially on Mac OS X.

* Subversion must recognize paths that match the repository path in NFC. Failure to do so might make scripts less portable and might require the use of tab-completion in order to reference non-NFC entries (since keyboard input is typically NFC). E.g. a file added by Mac OS X currently cannot be typed on other (actually, any) OSes.
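The two requirements above could be combined along the lines of this Python sketch. The entry representation (a pair of repository name and on-disk name) and the function name are hypothetical. Returning a list leaves room for reporting an ambiguous argument when legacy collisions exist.

```python
import unicodedata

def _nfc(name):
    return unicodedata.normalize("NFC", name)

def resolve_argument(arg, entries):
    """Find the WC entries that a command-line argument refers to.

    entries is a list of (repos_name, disk_name) pairs for one directory.
    Matching order:
      1. exact match on the on-disk name (what tab-completion produces)
         or on the repository name;
      2. comparison in NFC against the repository name (what a keyboard
         or a portable script typically produces).
    More than one result means the argument is ambiguous.
    """
    exact = [e for e in entries if arg == e[0] or arg == e[1]]
    if exact:
        return exact
    key = _nfc(arg)
    return [e for e in entries if _nfc(e[0]) == key]
```

Trying the exact match first keeps existing "normalized-name collisions" addressable, since each colliding entry can still be named by its exact code point sequence.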

Hashtables in WC-NG

Bert has mentioned expected issues related to hashtables.

TODO: Please elaborate on when they are used and approximately where in the codebase.

Subcommand Status

Current issues with svn subcommands related to Unicode composition are outlined below.

The investigations below were made on svn 1.7.x.

Checkout

Completes, but creates a "broken" WC, see Status below.

Update

Issues are related to the status issues when reporting the WC. Other issues?

Status

The status subcommand reports one unversioned and one missing entry for each non-NFD path on Mac OS X. This reflects the general WC issues with HFS+.
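The following Python sketch shows why a composition-unaware comparison produces this pair of results. It is a simplified model, not the actual status code in svn 1.7. The function name is hypothetical and HFS+ is approximated by plain NFD.

```python
import unicodedata

def naive_status(repos_names, disk_listing):
    """Compare names by exact code point sequence.

    This models a client that is not composition aware: a versioned name
    that is absent from the listing is "missing", and a listed name that
    is not versioned is "unversioned".
    """
    versioned = set(repos_names)
    on_disk = set(disk_listing)
    return {
        "missing": sorted(versioned - on_disk),
        "unversioned": sorted(on_disk - versioned),
    }

# The repository holds an NFC name. After checkout, a normalizing file
# system reports the same file under its NFD name (assumption: plain NFD).
repos_names = ["\u00c5.txt"]
disk_listing = [unicodedata.normalize("NFD", n) for n in repos_names]

report = naive_status(repos_names, disk_listing)
print(report)
```

The single file shows up twice: once as missing under its repository name and once as unversioned under its on-disk name.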

Add

Works and creates an entry with the same composition as on disk.

Since this approach does not dictate normalized repository storage, the add subcommand should not perform any normalization.

mkdir

TODO: Test. Suspect this might fail.

Commit

Seems to work.

...

TODO: More subcommands requiring attention?

Externals

External definitions are required to exactly match the Unicode URL. This is a currently existing requirement which is easily worked around using copy-paste. It would be difficult to look up the repository URL in a Unicode composition aware manner.

On Mac OS X it will be necessary to determine the actual filesystem path for the target, much like during update/checkout.

On Mac OS X it will not be possible to define externals that cause a "normalized-name collision".

In a URL there are several different parts: the hostname, the <Location> (httpd only), the repository relpath (ra_svn) or basename (ra_dav with SVNParentPath), and the fspath. Some of them might also be subject to canonicalization issues (e.g. repos basename as handled by Mac mod_dav_svn).

ThomasAkesson: Can we accept the limitation to not have decomposable characters in these parts? They are defined by administrators while paths inside repositories are defined by users.

Use Cases

Legacy Data

NonNormalizingUnicodeCompositionAwareness (last edited 2013-01-21 22:19:32 by Thomas Åkesson)