Differences between revisions 10 and 11
Revision 10 as of 2012-11-12 08:18:10
Size: 16817
Editor: brane
Revision 11 as of 2013-01-21 22:19:32
Size: 14085
Line 86: Line 86:
The major impact would not stem from collision avoidance on `add` but normalization during directory search, which affects most other operations. For the server, it is probably better to store names twice (original for display and normalized for indexing) rather than normalize on every lookup.}}} The major impact would not stem from collision avoidance on `add` but normalization during directory search, which affects most other operations. For the server, it is probably better to store names twice (original for display and normalized for indexing) rather than normalize on every lookup.

ThomasAkesson: It might be better to store names twice, but I don't see why the server needs to do normalization during directory search? That would be a client side task in this proposal.
}}}
Line 101: Line 104:
TODO: This section needs input from someone more familiar with wc-ng database design.

=== WC Database Columns ===

Columns of interest in wc.db:

 * The repository path as stored on server: repos_path (e.g. "project/dir/file.txt")

 * The local path from WC root to node: local_relpath (e.g. "dir/file.txt")

 * The local path from WC root to node parent: parent_relpath (e.g. "dir")

All three paths are in UTF-8 but NFC/NFD is not currently specified. local_relpath/parent_relpath get converted from UTF-8 to whatever locale encoding is in use whenever they are used to access the filesystem.

Takesson: Is this conversion done on the fly every time? I am guessing this works because locale encoding is a reversible process; otherwise lookups in the database would fail?

An abstraction between the repository path and the file system path can be achieved by ensuring that there is a column in wc.db that contains the file system path in exactly the same form that the file system gives back. APIs in wc need to be extended to ensure that all interaction with the file system is performed with the file system path.


==== Alternative 1: Redefine local_relpath ====

Redefine the existing column local_relpath to contain the path as stored in the file system. Code that currently relies on local_relpath being a substring of repos_path needs to be adjusted. E.g. a node might be considered switched when this condition is not met.

It would generally be desirable to use repos_path when referring to entries rather than local_relpath.

This alternative can be simulated using the attached script localrelpath2nfd.sh. This provides a Working Copy equivalent to what a checkout should produce if this alternative were implemented in Subversion itself:
 * svn co ...
 * svn stat #Shows any problematic items
 * localrelpath2nfd.sh
 * svn stat #Should be clean apart from misperception that some items are switched

TODO: provide a dump file with suitable test data.

==== Alternative 2: Introduce local_relpath_disk ====

A new column, local_relpath_disk, is added that contains the path as stored in the file system. This column will be used on all systems to interact with the file system. Currently, the content of columns local_relpath and local_relpath_disk will be identical on all file systems except HFS+.

I guess this would require parent_relpath_disk as well? Or would you plan to use the local_relpath==parent_relpath row to get local_relpath_disk for parent_relpath?

Takesson: thanks for pointing that out. I will update both alternatives, alt 1 redefining both and alt 2 "duplicating" both.

=== Alternative Approaches ===

There are different approaches to implementing this abstraction of paths. The following have been identified so far, each with its Wiki page:

 * WC Database columns: UnicodeClientColumns
 * SQLite collation: UnicodeCollation

The following sections are applicable to all above approaches.
Line 145: Line 118:
Repository path uniqueness should be checked in normalized form during add operations, in order to prevent new "normalized-name collisions" as early as possible. It might be acceptable to identify this later during commit, since it is a quite rare condition.

When an existing "normalized-name collision" arrives to a Working Copy on HFS+ via checkout or update, there will be a uniqueness issue in the column local_relpath/local_relpath_disk and a situation somewhat similar to an obstruction. This should be communicated in some friendly way, similar to conflicts on case-insensitive file systems.
Repository path uniqueness should be checked in normalized form during add operations, in order to prevent new "normalized-name collisions" as early as possible. It might be acceptable to identify this later during commit, since very few users will encounter this condition. At the latest, it will be identified by the server (with above change).

When an existing "normalized-name collision" arrives in a Working Copy on HFS+ via checkout or update, there will be a uniqueness issue in the column local_relpath (queried with collation) or in local_relpath_disk, and a situation somewhat similar to an obstruction. This should be communicated in some friendly way, similar to conflicts on case-insensitive file systems.
Line 156: Line 128:
When referring to WC entries using the command line on Mac OSX, the tab-completion works unreliably because the keyboard typically produces composed characters while files are NFD. The tab completion is a general Mac OSX issue which should be addressed by Apple. However, Subversion could be helpful when attempting to identify entries referred to via the command line.

 * Subversion must recognize paths that match the file system Unicode path (even if it does not match the repository path). Failure to do so makes tab-completion unusable.
  * Paths on the command line should be matched against local_relpath/local_relpath_disk.

 * Subversion should as a fallback (optional) recognize paths that match the repository Unicode path. Failure to do so might make scripts less portable and might require the use of tab-completion in order to reference entries.

=== Subcommand Changes ===

Specific changes to svn subcommands are outlined below.

All commands that access files in the Working Copy must do so by getting the path from the column local_relpath/local_relpath_disk.

TODO: Investigate which subcommands currently use local_relpath for other purposes than accessing the file. With alternative 1 (above), it will NOT be acceptable to use local_relpath for comparison/substring operations with other paths, e.g. repos_path.


==== Checkout/Update ====

When adding paths to the WC, determine the actual filesystem path and store that in local_relpath/local_relpath_disk. This is actually only required on OSX. How can this be done?
 * Do we get a handle back from the filesystem after creating a file/dir that can be queried for the path?
 * Use platform dependent APIs to establish the expected path.
 * Alternatively, first look for the exact same path (will find the one on most filesystems) then fall back to globbing with Unicode composition aware comparison.
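
The last option above can be sketched roughly as follows. This is only an illustration in Python, not Subversion code: the function name `find_on_disk_name` is hypothetical, and NFC is assumed as the form used for comparison (NFD would yield the same matches).

```python
import os
import unicodedata
from typing import Optional


def find_on_disk_name(directory: str, repos_name: str) -> Optional[str]:
    """Return the name under which repos_name is listed in directory.

    Hypothetical helper: returns None when no entry matches.
    """
    names = os.listdir(directory)

    # Fast path: most file systems give back exactly what was stored.
    if repos_name in names:
        return repos_name

    # Fallback: compare in a normalized form, so that a name the file
    # system re-normalized (e.g. to NFD on HFS+) is still found.
    wanted = unicodedata.normalize("NFC", repos_name)
    for name in names:
        if unicodedata.normalize("NFC", name) == wanted:
            return name
    return None
```

The returned value is the form the file system itself reports, which is what would be stored in local_relpath/local_relpath_disk under this proposal.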

TODO: Do we need to process paths that are not actually checked out due to the depth setting?
When referring to WC entries using the command line on Mac OSX, the tab-completion works unreliably because the keyboard typically produces composed characters while files are NFD. The tab completion is a general Mac OSX issue which should be addressed by Apple, specifically the case where the user types the beginning of a name that includes a composed character (which currently matches nothing on disk). However, Subversion could be helpful when attempting to identify entries referred to via the command line.

* Subversion must recognize paths that match the file system Unicode path (even if it does not match the repository path). Failure to do so makes tab-completion unusable, especially on Mac OS X.

* Subversion must recognize paths that match the repository path in NFC. Failure to do so might make scripts less portable and might require the use of tab-completion in order to reference non-NFC entries (since keyboard input is typically NFC). E.g. the name of a file added on Mac OS X currently cannot be typed on other OSes (or on any OS, actually).


=== Hashtables in WC-NG ===

Bert has mentioned expected issues related to hashtables.

TODO: Please elaborate on when they are used and approximately where in the codebase.


=== Subcommand Status ===

Current issues with svn subcommands related to Unicode composition are outlined below.

The investigations below were made on svn 1.7.x.

==== Checkout ====

Completes, but creates a "broken" WC, see Status below.

==== Update ====

Issues are related to the status issues when reporting the WC. Other issues?
Line 184: Line 158:
The status subcommand incorrectly reports externals when manually adjusting local_relpath to match the filesystem.

TODO: Clarify if status performs string comparisons between local_relpath and some other path.

TODO: how does status show a file whose name changed to a value that canonicalizes to the same value as the original name? (is that possible?)

==== Add and mkdir ====
The status subcommand reports one unversioned and one missing entry for each non-NFD path on Mac OS X. This reflects the general WC issues with HFS+.


==== Add ====

Works and creates an entry with the same composition as on disk.
Line 194: Line 167:
The uniqueness test should be Unicode aware to avoid a "normalized-name collision". This is not vital but desirable for better usability (has no effect on Mac OSX since it is not possible to create such collisions).

TODO: Anything else?

==== mkdir ====

TODO: Test. Suspect this might fail.
Line 201: Line 175:
No specific changes expected.

TODO: Confirm.

==== Changelist ====

Changelists should use repos_path to refer to entries, unless already the case.
Seems to work.
Line 225: Line 192:

ThomasAkesson: Can we accept the limitation to not have decomposable characters in these parts? They are defined by administrators while paths inside repositories are defined by users.

Non-normalizing Unicode Composition Awareness

Context

In the Unicode standard, some characters can be represented in different ways (composed/decomposed, canonical ordering, etc.) while being rendered identically on screen or in print. A Unicode string (e.g. a file name) can be in a normalized form (NFC/NFD) or mixed (not normalized).
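
As a small illustration (a Python sketch, not part of the proposal), the letter "é" can be written as one composed code point or as a base letter plus a combining accent. The two strings compare unequal until they are normalized to a common form:

```python
import unicodedata

composed = "\u00e9"      # "é" as a single code point (NFC form)
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT (NFD form)

# The raw strings differ, although they render identically.
raw_equal = composed == decomposed

# Normalizing either string to the other's form makes them equal.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed
same_after_nfd = unicodedata.normalize("NFD", composed) == decomposed

# The UTF-8 encodings differ in length as well.
byte_lengths = (len(composed.encode("utf-8")), len(decomposed.encode("utf-8")))

print(raw_equal, same_after_nfc, same_after_nfd, byte_lengths)
```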

The majority of file systems (e.g. NTFS, Ext3) will accept a Unicode filename in any form, store it, and give it back in the form in which it was input. These file systems will typically even accept multiple files whose paths look identical on screen but whose Unicode strings differ due to character composition.

A minority of file systems (currently Mac OS X HFS+ only) will normalize the paths. In the case of HFS+, the path will be normalized into NFD and it will be given back that way when listing the filesystem.

Most significant differences from the majority of filesystems:

  • A file whose name is stored in NFC or mixed form will not be returned with an identical name. This is generally considered a negative effect of the HFS+ Unicode implementation.
  • Multiple files whose name is rendered equally cannot be stored in the same directory. Often considered an advantage.

The topic has been described here: http://svn.apache.org/repos/asf/subversion/trunk/notes/unicode-composition-for-filenames

  • This RFC is not as complete in all areas, and depends on that note for additional context and issue description.
  • This RFC proposes a solution very similar to the note's solution 4, "Client and server-side path comparison routines". However, here it is proposed as a long term solution.
  • This RFC is essentially identical to what Erik H. proposes in this thread:

http://svn.haxx.se/dev/archive-2010-09/0319.shtml

Issue Description

  • Subversion and most file systems currently allow creation of multiple paths, which in normalized form are identical. Hereafter referred to as "normalized-name collisions". This could cause significant upgrade issues for repositories containing such collisions, depending on which solution is implemented. See section "Legacy Data".
  • Users have difficulty understanding and managing "normalized-name collisions". It is difficult to know which file is which and one of the paths is typically not possible to type on a keyboard.
  • Mac OS X clients can not interoperate with non-OSX clients when paths contain composed Unicode characters (added by a non-OSX client). The working copies report status issues directly after checkout/update on OSX. Tracked by: Bug 2464

Differences to case-sensitivity

  • NFC/NFD look the same when rendered on screen.
  • Different case can be controlled with the keyboard, while Unicode composition is more difficult.
  • Most modern case-insensitive file systems are case-preserving, i.e. they do not normalize to a preferred form and always return the same form that was stored. Normalizing file systems do not preserve the paths.

Similarities to case-sensitivity

  • If two Unicode strings differ only by letter case, on some computer systems they refer to the same file, while on other systems they refer to different files. The same applies if two Unicode strings differ only by composition. The rules are set by each file system.
  • Subversion inter-operates with different systems. When two file names that differ only by letter case are transferred from a case-sensitive system to a case-insensitive system, they will collide and Subversion should handle this in some friendly way. The same applies if two file names differ only by composition.

To Normalize or Not to Normalize

Whether or not to normalize within a Subversion repository (server-side) has been debated. The note (unicode-composition-for-filenames) considers normalization to NFC to be the long term (2.x) solution. That approach is hereafter referred to as "repository normalization".

There are implementation advantages with normalized paths which can simplify comparisons and storage.

There are also reasons not to normalize:

  • A file system is generally expected to give back exactly what was stored, or refuse up-front. HFS+ has been criticized for not living up to this expectation, which is also the reason the Svn WC has issues on HFS+. Subversion can be considered a sort of file system, and could therefore be expected to live up to this expectation.
  • Compatibility is a high priority for Subversion. Introducing normalization/translation/etc is not unlikely to introduce compatibility issues, now or later. There is a principle that Subversion should not be a limiting factor or impose undue limitations on allowed characters, file names etc.
  • Introducing normalization tends to complicate the upgrade process, especially for repositories that contain "normalized-name collisions". This is one of the reasons this very issue has not been addressed.

However, there is very little reason to allow the creation of new "normalized-name collisions". There are no known use-cases for creating multiple files in the same directory that would have identical normalized paths. Subversion should preferably refuse such add operations as early as possible, at the latest during commit. This feature is hereafter referred to as "uniqueness normalization".

Solution Overview

There are 2 components of this solution, one server side and one client side. These can be addressed individually, which is an important requirement for Subversion 1.x interoperability between client and server versions.

This solution does not normalize paths in the repository. Paths are only normalized for the purpose of comparisons.

Server Changes

The Subversion server should no longer accept 'add':ing paths that cause "normalized-name collisions". The comparison with existing paths (and other paths in the same txn) should be performed in normalized form. However, the paths created in the repository will keep the form input by the client.
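
A minimal sketch of such a check, written in Python purely for illustration. The function name `find_normalized_collision` is hypothetical, and NFC is an assumed choice of comparison form (the text above does not prescribe one). Note that the name itself is never rewritten; normalization is used only for the comparison.

```python
import unicodedata
from typing import Iterable, Optional


def find_normalized_collision(existing_names: Iterable[str],
                              new_name: str) -> Optional[str]:
    """Return the existing name that new_name collides with, or None.

    existing_names stands for the entries already in the directory plus
    any other names added in the same transaction.
    """
    key = unicodedata.normalize("NFC", new_name)
    for name in existing_names:
        if unicodedata.normalize("NFC", name) == key:
            return name
    return None


# The stored form is kept exactly as the client sent it.
directory = ["cafe\u0301.txt", "readme.txt"]   # first name is in NFD
clash = find_normalized_collision(directory, "caf\u00e9.txt")
print(repr(clash))
```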

There could be a performance impact. [Need more data] However, the 'add' operation is not one of the most frequent ones, in a typical installation.

The major impact would not stem from collision avoidance on add but normalization during directory search, which affects most other operations. For the server, it is probably better to store names twice (original for display and normalized for indexing) rather than normalize on every lookup.

ThomasAkesson: It might be better to store names twice, but I don't see why the server needs to do normalization during directory search? That would be a client side task in this proposal.

It is not possible to rely on client behavior. A Subversion server can be accessed via mod_dav_svn and by older Subversion clients.
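
The idea of storing each name twice (original for display, normalized for indexing) can be sketched with an in-memory index. This is an illustration only: the class `DirectoryIndex` is hypothetical, it says nothing about how the repository back ends actually store directories, and NFC is an assumed key form.

```python
import unicodedata
from typing import Dict, Optional


class DirectoryIndex:
    """Maps a normalized key to the name exactly as it was added."""

    def __init__(self) -> None:
        self._by_key: Dict[str, str] = {}

    @staticmethod
    def _key(name: str) -> str:
        return unicodedata.normalize("NFC", name)

    def add(self, name: str) -> None:
        key = self._key(name)
        if key in self._by_key:
            raise ValueError(
                f"normalized-name collision with {self._by_key[key]!r}")
        # The original form is stored untouched next to its key.
        self._by_key[key] = name

    def lookup(self, name: str) -> Optional[str]:
        # Only the query is normalized; stored names are not re-normalized.
        return self._by_key.get(self._key(name))
```

With this layout a lookup costs one normalization of the query string plus a hash lookup, rather than a normalization of every directory entry.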

The desired server behavior can be accomplished with Subversion 1.7 or earlier using a pre-commit hook, but it is desirable to have "uniqueness normalization" as the future default behavior.

This would make it impossible to load a dump of a repository with "normalized-name collisions". An important advantage of this proposal compared to normalizing approaches is that there is no requirement to process legacy data (see below for a discussion on 'svn mv' as cleanup tool). During loading of dump files, the normalized comparison should be disabled, either by default or via a switch, e.g. --ignore-utf-normalize.

Client Changes

The Working Copy needs an abstraction between the repository path provided by the server and the actual file system path. This is required for normalizing file systems (HFS+) regardless of whether the Subversion server performs normalization to NFC (repository normalization) or just enforces "uniqueness normalization".

It might be more feasible to implement such an abstraction now in wc-ng than it was in Subversion <=1.6.

Alternative Approaches

There are different approaches to implementing this abstraction of paths. The following have been identified so far, each with its Wiki page:

  • WC Database columns: UnicodeClientColumns
  • SQLite collation: UnicodeCollation

The following sections are applicable to all of the above approaches.

Normalized uniqueness

Repository path uniqueness should be checked in normalized form during add operations, in order to prevent new "normalized-name collisions" as early as possible. It might be acceptable to identify this later during commit, since very few users will encounter this condition. At the latest, it will be identified by the server (with above change).

When an existing "normalized-name collision" arrives in a Working Copy on HFS+ via checkout or update, there will be a uniqueness issue in the column local_relpath (queried with collation) or in local_relpath_disk, and a situation somewhat similar to an obstruction. This should be communicated in some friendly way, similar to conflicts on case-insensitive file systems.
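
The "queried with collation" variant can be illustrated with Python's sqlite3 module. This is a sketch under stated assumptions: the one-column `nodes` table is a simplified stand-in for the real wc.db schema, and the collation name `unicode_nfc` and both helper functions are hypothetical.

```python
import sqlite3
import unicodedata


def nfc_compare(a: str, b: str) -> int:
    """Collation callback: order strings by their NFC form."""
    na = unicodedata.normalize("NFC", a)
    nb = unicodedata.normalize("NFC", b)
    return (na > nb) - (na < nb)


def open_demo_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.create_collation("unicode_nfc", nfc_compare)
    # The primary key index uses the column's collation, so uniqueness
    # and equality lookups are composition aware.
    conn.execute(
        "CREATE TABLE nodes ("
        " local_relpath TEXT COLLATE unicode_nfc PRIMARY KEY)")
    return conn
```

In this sketch, inserting a second path that differs from an existing one only by composition raises `sqlite3.IntegrityError`, which is the point where a friendly message could be produced. A lookup with either form returns the row in the form in which it was stored.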

Pristine Storage

Since svn 1.7, pristines are stored based on the SHA1 checksum of their contents, independently of their name. There should be very little impact.

Command Line

When referring to WC entries using the command line on Mac OSX, the tab-completion works unreliably because the keyboard typically produces composed characters while files are NFD. The tab completion is a general Mac OSX issue which should be addressed by Apple, specifically the case where the user types the beginning of a name that includes a composed character (which currently matches nothing on disk). However, Subversion could be helpful when attempting to identify entries referred to via the command line.

* Subversion must recognize paths that match the file system Unicode path (even if it does not match the repository path). Failure to do so makes tab-completion unusable, especially on Mac OS X.

* Subversion must recognize paths that match the repository path in NFC. Failure to do so might make scripts less portable and might require the use of tab-completion in order to reference non-NFC entries (since keyboard input is typically NFC). E.g. the name of a file added on Mac OS X currently cannot be typed on other OSes (or on any OS, actually).
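
The two requirements above could be met by a lookup along these lines. It is a Python sketch for illustration only: `resolve_target` and the `(repository path, file system path)` pair are hypothetical and do not correspond to existing Subversion APIs.

```python
import unicodedata
from typing import Iterable, Optional, Tuple

Entry = Tuple[str, str]  # (repository path, file system path)


def resolve_target(arg: str, entries: Iterable[Entry]) -> Optional[Entry]:
    """Find the WC entry a command-line argument refers to."""
    entries = list(entries)

    # 1. Exact match against either stored form (covers tab-completed
    #    file system paths and pasted repository paths).
    for entry in entries:
        if arg in entry:
            return entry

    # 2. Fallback: composition-aware match, so typed NFC input can
    #    reference an entry stored in NFD or mixed form.
    key = unicodedata.normalize("NFC", arg)
    for entry in entries:
        if key in (unicodedata.normalize("NFC", path) for path in entry):
            return entry
    return None
```

Trying the exact match first keeps behavior unchanged for working copies that contain no decomposable characters.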

Hashtables in WC-NG

Bert has mentioned expected issues related to hashtables.

TODO: Please elaborate on when they are used and approximately where in the codebase.

Subcommand Status

Current issues with svn subcommands related to Unicode composition are outlined below.

The investigations below were made on svn 1.7.x.

Checkout

Completes, but creates a "broken" WC, see Status below.

Update

Issues are related to the status issues when reporting the WC. Other issues?

Status

The status subcommand reports one unversioned and one missing entry for each non-NFD path on Mac OS X. This reflects the general WC issues with HFS+.

Add

Works and creates an entry with the same composition as on disk.

Since this approach does not dictate normalized repository storage, the add subcommand should not perform any normalization.

mkdir

TODO: Test. Suspect this might fail.

Commit

Seems to work.

...

TODO: More subcommands requiring attention?

Externals

External definitions are required to exactly match the Unicode URL. This is a currently existing requirement which is easily worked around using copy-paste. It would be difficult to look up the repository URL in a Unicode composition aware manner.

On Mac OS X it will be necessary to determine the actual filesystem path for the target, much like during update/checkout.

On Mac OS X it will not be possible to define externals that cause a "normalized-name collision".

In a URL there are several different parts: the hostname, the <Location> (httpd only), the repository relpath (ra_svn) or basename (ra_dav with SVNParentPath), and the fspath. Some of them might also be subject to canonicalization issues (e.g. repos basename as handled by Mac mod_dav_svn).

ThomasAkesson: Can we accept the limitation to not have decomposable characters in these parts? They are defined by administrators while paths inside repositories are defined by users.

Use Cases

  • Interoperability between Mac OSX and non-OSX Subversion clients: an OS X user will after checkout/update from a repository containing NFC/mixed Unicode paths (added by non-OSX client) receive a fully functional WC where all normal operations can be performed. Tracked by: Bug 2464

  • It will no longer be possible to add paths that look like duplicates but use different Unicode composition. It is highly unlikely anyone is relying on this.

Legacy Data

  • This change will cause no problems when upgrading existing repositories even if they contain "normalized-name collisions".
  • If "normalized-name collisions" exist in HEAD, a check out on Mac OS X will still fail after an upgrade but potentially with a better error message. This is an issue that is very similar to case-collisions on case-insensitive file systems. The detection code is similar and the same friendly error message can potentially be used.
  • These "normalized-name collisions" can be resolved in HEAD via "svn mv SRC_URL DST_URL". Historical revisions will still be difficult to check out from Mac OS X.
  • Working Copies will be upgraded in the same way as any other wc-ng upgrade with SQL schema changes.
    • Working Copies on Mac OS X that are broken before upgrade might require a fresh check out.
    • Consequently, no transformation of paths in wc.db is required. With alternative 2 above, local_relpath should be copied to local_relpath_disk.
    • No identification of "normalized-name collisions" is required. Normal users should not be bothered with such maintenance tasks.
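
For alternative 2, the copy step mentioned above amounts to a simple schema change. The following Python/sqlite3 sketch assumes a simplified `nodes` table. The real wc.db has more columns and its own upgrade machinery, and the function name `add_disk_path_column` is hypothetical.

```python
import sqlite3


def add_disk_path_column(conn: sqlite3.Connection) -> None:
    """Add local_relpath_disk and seed it from local_relpath."""
    conn.execute("ALTER TABLE nodes ADD COLUMN local_relpath_disk TEXT")
    # No path transformation is needed: the existing value is copied as is.
    conn.execute("UPDATE nodes SET local_relpath_disk = local_relpath")
    conn.commit()
```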

NonNormalizingUnicodeCompositionAwareness (last edited 2013-01-21 22:19:32 by Thomas Åkesson)