Unicode Composition - WC Database columns

This page describes one approach to implementing NonNormalizingUnicodeCompositionAwareness. It involves redefining existing columns in wc.db and/or adding new ones.

More work is needed in this specification. Focus is currently on UnicodeCollation.

TODO: This section needs input from someone more familiar with wc-ng database design.

WC Database Columns

Columns of interest in wc.db:

* The repository path as stored on server: repos_path (e.g. "project/dir/file.txt")

* The local path from WC root to node: local_relpath (e.g. "dir/file.txt")

* The local path from WC root to node parent: parent_relpath (e.g. "dir")

All three paths are stored as UTF-8, but the normalization form (NFC or NFD) is not currently specified. local_relpath and parent_relpath are converted from UTF-8 to whatever locale encoding is in use whenever they are used to access the filesystem.
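To make the NFC/NFD ambiguity concrete, here is a small sketch using Python's standard unicodedata module. The example path is made up for illustration. It shows that the same visible name has two different UTF-8 byte sequences, so a byte-wise comparison treats them as different paths.

```python
import unicodedata

# The same visible path in both normalization forms (illustrative name only).
nfc_path = unicodedata.normalize("NFC", "dir/\u00e5.txt")  # precomposed "å"
nfd_path = unicodedata.normalize("NFD", "dir/\u00e5.txt")  # "a" + combining ring

# The UTF-8 encodings differ in both content and length.
nfc_bytes = nfc_path.encode("utf-8")
nfd_bytes = nfd_path.encode("utf-8")

# A plain string comparison sees two different paths ...
byte_wise_equal = nfc_path == nfd_path

# ... while comparing after normalizing both sides sees one path.
normalized_equal = (
    unicodedata.normalize("NFC", nfc_path)
    == unicodedata.normalize("NFC", nfd_path)
)

print(nfc_bytes, nfd_bytes, byte_wise_equal, normalized_equal)
```

Any column whose normalization form is unspecified can therefore hold either byte sequence for what a user sees as the same name.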

Takesson: Is this conversion done on the fly every time? I am guessing this works because the locale encoding conversion is reversible; otherwise lookups in the database would fail?

An abstraction between the repository path and the file system path can be achieved by ensuring that there is a column in wc.db that contains the file system path in exactly the same form that the file system gives back. APIs in wc need to be extended to ensure that all interaction with the file system is performed with the file system path.

Alternative 1: Redefine local_relpath and parent_relpath

Redefine the existing columns local_relpath and parent_relpath to contain the path as stored in the file system. Code that currently relies on local_relpath/parent_relpath being a substring of repos_path needs to be adjusted. E.g. a node might be considered switched when this condition is not met.
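As a rough illustration of the adjustment described above, the sketch below compares local_relpath against the tail of repos_path after normalizing both. The function name and the component-wise suffix rule are assumptions made for this example; the actual checks in Subversion may be structured differently.

```python
import unicodedata


def _nfc(path: str) -> str:
    """Normalize a path to NFC for comparison purposes only."""
    return unicodedata.normalize("NFC", path)


def relpath_matches_repos_path(repos_path: str, local_relpath: str) -> bool:
    """Return True if local_relpath is a component-wise suffix of repos_path,
    ignoring differences in Unicode composition.

    Hypothetical helper, not an existing Subversion API.
    """
    if local_relpath == "":
        return True  # the WC root matches any repository path
    repos_parts = _nfc(repos_path).split("/")
    local_parts = _nfc(local_relpath).split("/")
    if len(local_parts) > len(repos_parts):
        return False
    return repos_parts[-len(local_parts):] == local_parts


repos_path = unicodedata.normalize("NFC", "project/dir/\u00e5.txt")
disk_relpath = unicodedata.normalize("NFD", "dir/\u00e5.txt")

# The byte-wise check fails once local_relpath holds the on-disk (NFD) form.
naive_result = repos_path.endswith(disk_relpath)
aware_result = relpath_matches_repos_path(repos_path, disk_relpath)
print(naive_result, aware_result)
```

With a comparison of this kind, a node whose local_relpath differs from repos_path only in composition would no longer look switched.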

It would generally be desirable to use repos_path when referring to entries rather than local_relpath.

This alternative can be simulated using the attached script localrelpath2nfd.sh. This provides a Working Copy equivalent to what a checkout should produce if this alternative was implemented in Subversion itself (only local_relpath is currently adjusted by the script):

* svn co ...

* svn stat #Shows any problematic items

* localrelpath2nfd.sh

* svn stat #Should be clean apart from misperception that some items are switched

TODO: provide a dump file with suitable test data.

Alternative 2: Introduce local_relpath_disk and parent_relpath_disk

New columns, local_relpath_disk and parent_relpath_disk, are added that contain the path as stored in the file system. These columns will be used on all systems to interact with the file system. Currently, the content of columns local_relpath and local_relpath_disk will be identical on all file systems except HFS+.
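The following sketch shows what Alternative 2 could look like, using Python's sqlite3 module and an in-memory database. The nodes table here is a hypothetical, heavily simplified stand-in and not the real wc.db schema. The helper names are also invented for the example, and Python's NFD is used only as an approximation of the decomposition HFS+ applies.

```python
import sqlite3
import unicodedata

# Hypothetical, simplified stand-in for the node table in wc.db.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE nodes ("
    " local_relpath TEXT PRIMARY KEY,"
    " parent_relpath TEXT,"
    " repos_path TEXT)"
)

# Alternative 2: add on-disk variants next to the existing columns.
conn.execute("ALTER TABLE nodes ADD COLUMN local_relpath_disk TEXT")
conn.execute("ALTER TABLE nodes ADD COLUMN parent_relpath_disk TEXT")


def record_node(conn, repos_path, local_relpath, parent_relpath,
                local_relpath_disk, parent_relpath_disk):
    """Store a node together with the path form the filesystem reported."""
    conn.execute(
        "INSERT INTO nodes (repos_path, local_relpath, parent_relpath,"
        " local_relpath_disk, parent_relpath_disk) VALUES (?, ?, ?, ?, ?)",
        (repos_path, local_relpath, parent_relpath,
         local_relpath_disk, parent_relpath_disk),
    )


def disk_path_for(conn, local_relpath):
    """Return the on-disk form to use for filesystem access, or None."""
    row = conn.execute(
        "SELECT local_relpath_disk FROM nodes WHERE local_relpath = ?",
        (local_relpath,),
    ).fetchone()
    return row[0] if row else None


repo_form = unicodedata.normalize("NFC", "dir/\u00e5.txt")
disk_form = unicodedata.normalize("NFD", repo_form)  # roughly what HFS+ returns
record_node(conn, "project/" + repo_form, repo_form, "dir", disk_form, "dir")

print(disk_path_for(conn, repo_form) == disk_form)
```

In this layout local_relpath keeps its current relationship to repos_path, and only the code that touches the filesystem needs to switch to the _disk columns.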

Subcommand Changes

Specific changes to svn subcommands are outlined below.

All commands that access files in the Working Copy must do so using the path from the column local_relpath (Alternative 1) or local_relpath_disk (Alternative 2).

TODO: Investigate which subcommands currently use local_relpath for other purposes than accessing the file. With alternative 1 (above), it will NOT be acceptable to use local_relpath for comparison/substring operations with other paths, e.g. repos_path.

Checkout/Update

When adding paths to the WC, determine the actual filesystem path and store that in local_relpath/local_relpath_disk. This is actually only required on OSX. How can this be done?

* Do we get a handle back from the filesystem after creating a file/dir that can be queried for the path?

* Use platform dependent APIs to establish the expected path.

* Alternatively, first look for the exact same path (this will find it on most filesystems), then fall back to globbing with Unicode composition aware comparison.
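The last option above (exact match first, then a composition-aware scan of the parent directory) might be sketched as follows. Both function names are invented for this example. Treating an ambiguous result as "not found" is an assumption, not something the specification decides.

```python
import os
import unicodedata


def match_disk_name(entries, expected_name):
    """Pick the directory entry that corresponds to expected_name.

    An exact match wins. Otherwise fall back to comparing NFC-normalized
    names. Returns None if nothing matches or the fallback is ambiguous.
    """
    if expected_name in entries:
        return expected_name
    wanted = unicodedata.normalize("NFC", expected_name)
    matches = [
        entry for entry in entries
        if unicodedata.normalize("NFC", entry) == wanted
    ]
    return matches[0] if len(matches) == 1 else None


def find_disk_name(parent_dir, expected_name):
    """Look up the name in the form the filesystem itself reports."""
    return match_disk_name(os.listdir(parent_dir), expected_name)


# Simulated directory listing from a filesystem that stores names decomposed.
listing = ["README", unicodedata.normalize("NFD", "\u00e5.txt")]
found = match_disk_name(listing, unicodedata.normalize("NFC", "\u00e5.txt"))
print(found == listing[1])
```

The sketch reads the directory listing rather than testing for existence, because a normalizing filesystem may accept either form when opening a file while reporting only its own form when listing.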

TODO: Do we need to process paths that are not actually checked out due to the depth setting?

Status

The status subcommand incorrectly reports externals when local_relpath is manually adjusted to match the filesystem.

TODO: Clarify if status performs string comparisons between local_relpath and some other path.

TODO: How does status show a file whose name changed to a value that canonicalizes to the same value as the original name? (Is that possible?)

Add and mkdir

Since this approach does not dictate normalized repository storage, the add subcommand should not perform any normalization.

The uniqueness test should be Unicode aware to avoid a "normalized-name collision". This is not vital but desirable for better usability (it has no effect on Mac OS X, since such collisions cannot be created there).
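A Unicode-aware uniqueness test could look roughly like the sketch below. The function name is hypothetical. Names are normalized only for the comparison, so the name that would be added stays exactly as the user supplied it.

```python
import unicodedata


def normalized_name_collisions(existing_names, new_name):
    """Return the existing sibling names that differ from new_name byte-wise
    but are identical to it after NFC normalization.

    Hypothetical helper. Nothing is rewritten; normalization is used only
    as a comparison key.
    """
    key = unicodedata.normalize("NFC", new_name)
    return [
        name for name in existing_names
        if name != new_name and unicodedata.normalize("NFC", name) == key
    ]


siblings = ["README", unicodedata.normalize("NFC", "\u00e5.txt")]
candidate = unicodedata.normalize("NFD", "\u00e5.txt")

collisions = normalized_name_collisions(siblings, candidate)
print(len(collisions))
```

An add or mkdir could warn or refuse when the returned list is non-empty, while leaving the ordinary "already exists" case (an identical name) to the existing check.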

TODO: Anything else?

Commit

No specific changes expected.

TODO: Confirm.

Changelist

Changelists should use repos_path to refer to entries, if that is not already the case.

...

TODO: More subcommands requiring attention?

UnicodeClientColumns (last edited 2013-01-21 22:03:52 by Thomas Åkesson)