Part of the formatting process is the correct handling of white space. The rules and properties governing this process have been rewritten in the XSL-FO 1.1 Working Draft in response to comments by Karen Lease. The rewrite attempts to clarify white space handling and moves most of the white space handling processes from the refinement step to the area creation step, in particular to the line and inline building processes. However, the changes do not appear to be intended to alter XSL-FO 1.0 white space handling; they are merely a clarification. In the following, white space handling is therefore discussed in terms of the XSL-FO 1.1 WD, under the assumption that the outcomes are also valid for XSL-FO 1.0.

Properties and Rules in the XSL-FO 1.1 WD related to white space handling

The following properties control white space handling: linefeed-treatment, white-space-collapse, white-space-treatment and suppress-at-line-break.

An additional important description of parts of the white space handling process can be found in 4.7.2 Line-building.

Some thoughts about the concerns addressed in the XSL-FO specification are outlined in XslFoWhiteSpaceHandling.

What is white space?

XSL-FO defines white space as any character whose Unicode value is classified as white space in XML. This means that only U+0020 (space), U+0009 (tab), U+000D (carriage return) and U+000A (linefeed) are white space characters in XSL-FO. Note that the set of white space characters therefore differs from the set of line breaking characters, especially when non-Western scripts are taken into account.
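As a minimal sketch of this definition, the check below accepts only the four XML white space characters. The function name is_fo_white_space is made up for illustration; Python's str.isspace() is shown only as a contrast, since it follows Unicode rather than XML.

```python
# The four characters that XML classifies as white space.
XML_WHITE_SPACE = frozenset("\u0020\u0009\u000D\u000A")


def is_fo_white_space(ch: str) -> bool:
    """Return True if ch is white space in the XSL-FO (that is, XML) sense."""
    return ch in XML_WHITE_SPACE


if __name__ == "__main__":
    # U+00A0 (no-break space) and U+3000 (ideographic space) are Unicode
    # spaces but not XML white space, so XSL-FO does not treat them as such.
    for ch in ("\u0020", "\u0009", "\u00A0", "\u3000"):
        print(f"U+{ord(ch):04X}", is_fo_white_space(ch), ch.isspace())
```

Characters such as U+3000 fail this test, which is why they are not candidates for deletion around line breaks under the rules discussed here.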

/!\ The spec is very specific under rules 5. and 6. in 4.7.2 that only white space glyph areas can be deleted. Any other line breaking characters are not removed around line breaks. It remains to be seen if this is consistent with the typographical conventions for scripts which have their own Unicode white space characters.

Processing model

One problem in understanding XSL-FO white space handling is deriving a suitable processing model that matches the intention of the specification. The specification itself is in parts contradictory about which processing takes place at which stage of the XSL-FO model. It appears to leave many issues open in the area of white space handling, and its literal interpretation also seems to give unexpected or undesirable results. The approach chosen here is to ignore the details of the specification to a certain extent, to try to capture (= guess) its intent, and to derive a suitable model from there (see also XslFoWhiteSpaceHandling for a possible interpretation of the intent, which is consistent with the presentation here). At a high level, the following three rules may capture the intent if everything is at its default value:

1. White space exclusively between block level elements is to be ignored, that is, <fo:block>.<fo:block> is identical to <fo:block><fo:block>, and likewise for </fo:block>.</fo:block>, </fo:block>.<fo:block> and <fo:block>.</fo:block> (here '.' stands for a space).

2. All other sequences of white space are collapsed into a single space.

3. All white space at the beginning and end of an output line is deleted.

A processing model which achieves such an outcome would be:

Step 1. Refinement: linefeed-treatment

All fo:character objects which have a character property value of U+000A are dealt with according to the setting of the linefeed-treatment property. This is straightforward and involves either preserving or deleting the fo:character object, or replacing its character property value with U+0020 (space) or U+200B (zero width space).
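Step 1 could be sketched roughly as follows. The model is a simplification: each fo:character is represented only by its character value in a flat list, and the function name apply_linefeed_treatment is made up for this page. The four accepted values are those of the linefeed-treatment property.

```python
LINEFEED = "\u000A"

# What a linefeed becomes under each linefeed-treatment value.
# None means the fo:character object is deleted.
_REPLACEMENT = {
    "ignore": None,
    "preserve": LINEFEED,
    "treat-as-space": "\u0020",
    "treat-as-zero-width-space": "\u200B",
}


def apply_linefeed_treatment(chars, treatment="treat-as-space"):
    """Return a new list of characters with linefeeds handled per treatment."""
    if treatment not in _REPLACEMENT:
        raise ValueError(f"unknown linefeed-treatment: {treatment!r}")
    replacement = _REPLACEMENT[treatment]
    out = []
    for ch in chars:
        if ch != LINEFEED:
            out.append(ch)           # not a linefeed: untouched by this step
        elif replacement is not None:
            out.append(replacement)  # preserved or replaced
        # else: "ignore", the linefeed is dropped
    return out
```

For instance, "".join(apply_linefeed_treatment("a\nb")) gives "a b" with the default "treat-as-space".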

Step 2. Refinement: white-space-collapse

The processing model presented here deviates from the text of the specification (though not necessarily from its intent), as the specification makes white-space-collapse an area tree construction activity. However, the remainder of the description of the white-space-collapse property refers only to fo:character objects with certain character property values and to their direct siblings in the fo tree. It also refers directly to fo:character objects with a character property value of U+000A (linefeed) but does not refer to line breaks. All this leads to the conclusion that collapsing white space is really a refinement activity. The actual processing is again straightforward: if the property value is "false", skip the step. If it is "true", then for any sequence of direct sibling fo:character objects whose character property value is an XML white space value other than U+000A, retain only the first fo:character object, set its character property value to U+0020, and delete all the others.
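A minimal sketch of step 2 under the same simplification (a flat list of sibling character values, with the made-up function name collapse_white_space) might look like this:

```python
# XML white space other than U+000A (linefeed); these are collapsible.
COLLAPSIBLE = frozenset("\u0020\u0009\u000D")


def collapse_white_space(chars, collapse=True):
    """Collapse each run of sibling white space into a single U+0020."""
    if not collapse:
        return list(chars)           # white-space-collapse="false": skip the step
    out = []
    in_run = False
    for ch in chars:
        if ch in COLLAPSIBLE:
            if not in_run:
                out.append("\u0020")  # keep the first of the run, as a space
            in_run = True             # later members of the run are deleted
        else:
            out.append(ch)            # linefeeds and other characters end a run
            in_run = False
    return out
```

Note that a linefeed that survived step 1 ends a run, so a space on either side of it is kept by this step.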

Issues

/!\ The spec mentions replacing white space that is neither U+0020 (space) nor U+000A (linefeed) with a space only if white-space-treatment="preserve". This seems to indicate that U+0009 (tab) and U+000D (carriage return) are left unchanged in the fo tree in other circumstances. The current FOP version always replaces those with a space. That seems reasonable and consistent with other implementations, but is it compliant? (Resolved: not compliant, but this still leaves open the question of how a LayoutManager should handle a carriage return or a tab.)

/!\ The spec does not put any constraint on collapsing white space with different properties. e.g.

   &#x20;
   <fo:character font-size="80pt" character=" "/>
   <fo:character border="2pt solid red" font-size="10pt" character=" "/>

would be collapsed leaving only the initial space character. Is that intentional? This should be contrasted with the description in 4.7.2 of glyph merging/replacement which clearly states that only glyphs with matching properties can be merged/substituted.
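To make the concern concrete, the sketch below models each fo:character with its properties and collapses the three spaces from the snippet above. The FoCharacter class and the require_matching_props switch are hypothetical and exist only for this illustration; the switch models the stricter, 4.7.2-style behaviour for contrast and is not something the spec defines for white-space-collapse.

```python
from dataclasses import dataclass, field

COLLAPSIBLE = frozenset("\u0020\u0009\u000D")


@dataclass
class FoCharacter:
    character: str
    props: dict = field(default_factory=dict)


def collapse(chars, require_matching_props=False):
    """Collapse runs of white space, optionally only across equal properties."""
    out = []
    for ch in chars:
        if ch.character not in COLLAPSIBLE:
            out.append(ch)
            continue
        prev = out[-1] if out else None
        # Kept white space is always normalised to U+0020, so this test
        # tells us whether the previous kept object was white space.
        in_run = prev is not None and prev.character == "\u0020"
        if in_run and (not require_matching_props or prev.props == ch.props):
            continue  # deleted: the earlier object of the run survives
        out.append(FoCharacter("\u0020", ch.props))
    return out


run = [
    FoCharacter(" "),
    FoCharacter(" ", {"font-size": "80pt"}),
    FoCharacter(" ", {"border": "2pt solid red", "font-size": "10pt"}),
]
```

With these definitions, collapse(run) keeps a single space carrying the properties of the first object, so the 80pt space and the bordered space vanish, whereas collapse(run, require_matching_props=True) keeps all three.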

Step 3. Refinement: white-space-treatment

a) If the value of the white-space-treatment property is "preserve" skip this step.

b) If the value of the white-space-treatment property is "ignore" delete all white space.

c) If the value of the white-space-treatment property is "ignore-if-after-linefeed" or "ignore-if-surrounding-linefeed", delete any sequence of consecutive characters whose suppress-at-line-break property is "true" that directly follows a linefeed, or whose first element is the first child of a block level object.

d) If the value of the white-space-treatment property is "ignore-if-before-linefeed" or "ignore-if-surrounding-linefeed", delete any sequence of consecutive characters whose suppress-at-line-break property is "true" that directly precedes a linefeed, or whose last element is the last child of a block level object.
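Rules a) to d) could be sketched as below for the character children of a single block. Several things here are assumptions of the sketch, not statements from the spec: the flat list model, the function names, the literal reading of b) as deleting every XML white space character, and the simplification that only U+0020 counts as having suppress-at-line-break "true".

```python
LINEFEED = "\u000A"
XML_WHITE_SPACE = frozenset("\u0020\u0009\u000D\u000A")
AFTER = ("ignore-if-after-linefeed", "ignore-if-surrounding-linefeed")
BEFORE = ("ignore-if-before-linefeed", "ignore-if-surrounding-linefeed")


def suppressible(ch):
    # Assumption: only the plain space is suppressed at line breaks.
    return ch == "\u0020"


def _mark(chars, keep, indices):
    """Mark suppressible runs that touch a linefeed or the block edge."""
    deleting = True                  # the scan starts at an edge of the block
    for i in indices:
        if chars[i] == LINEFEED:
            deleting = True          # a linefeed restarts deletion
        elif deleting and suppressible(chars[i]):
            keep[i] = False
        else:
            deleting = False


def treat_white_space(chars, treatment="ignore-if-surrounding-linefeed"):
    chars = list(chars)
    if treatment == "preserve":      # rule a)
        return chars
    if treatment == "ignore":        # rule b), read literally
        return [c for c in chars if c not in XML_WHITE_SPACE]
    keep = [True] * len(chars)
    if treatment in AFTER:           # rule c): forward scan
        _mark(chars, keep, range(len(chars)))
    if treatment in BEFORE:          # rule d): backward scan
        _mark(chars, keep, range(len(chars) - 1, -1, -1))
    return [c for c, k in zip(chars, keep) if k]
```

With the default value, " This is some arbitrary text " loses its leading and trailing space but keeps all spaces between words.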

Step 4. line building: white-space-treatment

At any formatter generated line break:

a) If the value of the white-space-treatment property is "preserve" or "ignore" do nothing.

b) If the value of the white-space-treatment property is "ignore-if-after-linefeed" or "ignore-if-surrounding-linefeed", delete any sequence of consecutive characters whose suppress-at-line-break property is "true" that directly follows the line break.

c) If the value of the white-space-treatment property is "ignore-if-before-linefeed" or "ignore-if-surrounding-linefeed", delete any sequence of consecutive characters whose suppress-at-line-break property is "true" that directly precedes the line break.
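A rough sketch of step 4 follows. It assumes that the formatter has already chosen its break points and hands over the content of each line as a string, that only U+0020 is suppressible, and that white space preceding a break is governed by "ignore-if-before-linefeed", mirroring rule d) of step 3. The function name trim_at_line_breaks is made up.

```python
AFTER = ("ignore-if-after-linefeed", "ignore-if-surrounding-linefeed")
BEFORE = ("ignore-if-before-linefeed", "ignore-if-surrounding-linefeed")


def trim_at_line_breaks(lines, treatment="ignore-if-surrounding-linefeed"):
    """Remove suppressible spaces on either side of formatter line breaks."""
    if treatment in ("preserve", "ignore"):
        return list(lines)            # nothing to do at this stage
    out = []
    last = len(lines) - 1
    for i, line in enumerate(lines):
        if treatment in AFTER and i > 0:
            line = line.lstrip(" ")   # spaces directly following a break
        if treatment in BEFORE and i < last:
            line = line.rstrip(" ")   # spaces directly preceding a break
        out.append(line)
    return out
```

The start of the first line and the end of the last line are deliberately left alone, since in this model they were already dealt with during refinement in step 3.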

Issues

/!\ According to the spec, white-space-treatment and the related suppress-at-line-break property are dealt with during line-building (4.7.2). This model puts part of the processing into refinement. It is believed that this does not change the intended results.

/!\ While there are contradictions in the spec in that both 7.16.8 white-space-treatment and 7.16.12 white-space-collapse still mention refinement as the stage in which the white-space-treatment property is dealt with, these are most likely editorial mistakes.

/!\ As with white-space-collapse, the spec does not put any constraint on deleting white space with different properties under white-space-treatment. Is this intentional or not?

Examples

In the examples that follow spaces are represented by '.'. For brevity properties are not shown apart from the initial example. When areas are shown the notation |area|...|/area| is used.

Example 1: Simple text - all properties defaulting

<fo:block>
...This.is..some...arbitrary.text
</fo:block>

After step 1 (linefeed-treatment):

<fo:block>....This.is..some...arbitrary.text.</fo:block>

After step 2 (white-space-collapse):

<fo:block>.This.is.some.arbitrary.text.</fo:block>

After step 3 (white-space-treatment during refinement):

<fo:block>This.is.some.arbitrary.text</fo:block>

After step 4 (white-space-treatment and line-building):

|line|This.is.some.arbitrary.text|/line|
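Example 1 can be replayed with a few lines of code. This is only a sketch for the all-defaults case with a single output line: steps 3 and 4 are folded into one function because without a formatter-generated break only the start and end of the block matter. All function names are made up, and dots() reproduces the '.' notation used in these examples.

```python
import re


def dots(text):
    """Show spaces as '.', as in the examples on this page."""
    return text.replace(" ", ".")


def step1_linefeed_treatment(text):
    # Default linefeed-treatment="treat-as-space".
    return text.replace("\n", " ")


def step2_collapse(text):
    # Default white-space-collapse="true"; linefeeds are already gone.
    return re.sub(r"[ \t\r]+", " ", text)


def steps3_and_4_treatment(text):
    # Default white-space-treatment="ignore-if-surrounding-linefeed":
    # with a single line, only the block start and end are affected.
    return text.strip(" ")


# Character content of the fo:block in Example 1.
content = "\n   This is  some   arbitrary text\n"
after1 = step1_linefeed_treatment(content)
after2 = step2_collapse(after1)
line = steps3_and_4_treatment(after2)

if __name__ == "__main__":
    print(dots(after1))  # ....This.is..some...arbitrary.text.
    print(dots(after2))  # .This.is.some.arbitrary.text.
    print(dots(line))    # This.is.some.arbitrary.text
```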

Example 2: Simple nested block - all properties defaulting

<fo:block>
..<fo:block background-color="green">
...Green.background...text
..</fo:block>
...This.is..some...arbitrary.text
</fo:block>

After step 1 (linefeed-treatment) (Note: the linefeeds in the block below are for readability only!):

<fo:block>...<fo:block>....Green.background...text...</fo:block>....This.is..some...
arbitrary.text.</fo:block>

After step 2 (white-space-collapse) (Note: the linefeeds in the block below are for readability only!):

<fo:block>.<fo:block>.Green.background.text.</fo:block>.This.is.some.
arbitrary.text.</fo:block>

After step 3 (white-space-treatment) (Note: the linefeeds in the block below are for readability only!):

<fo:block><fo:block>Green.background.text</fo:block>This.is.some.
arbitrary.text</fo:block>

After step 4 (white-space-treatment and line-building):

|line|Green.background.text|/line|
|line|This.is.some.arbitrary.text|/line|

Example 3: Simple nested inline - all properties defaulting

<fo:block>
..<fo:inline background-color="green" border="solid 1pt red">
...Green.background...here
..</fo:inline>
...This.is..some...arbitrary.text
</fo:block>

After step 1 (linefeed-treatment) (Note: the linefeeds in the block below are for readability only!):

<fo:block>...<fo:inline>....Green.background...here...</fo:inline>....This.is..some...
arbitrary.text.</fo:block>

After step 2 (white-space-collapse) (Note: the linefeeds in the block below are for readability only!):

<fo:block>.<fo:inline>.Green.background.here.</fo:inline>.This.is.some.
arbitrary.text.</fo:block>

After step 3 (white-space-treatment) (Note: the linefeeds in the block below are for readability only!):

<fo:block><fo:inline>Green.background.here.</fo:inline>.This.is.some.
arbitrary.text</fo:block>

After step 4 (white-space-treatment and line-building) assuming everything fits on one line:

|line||inline|Green.background.here.|/inline|.This.is.some.arbitrary.text|/line|

<!> After step 2 there are still two spaces in front of the word 'Green' because white-space-collapse does not extend beyond direct siblings. However, in this case white-space-treatment removes all leading spaces from the line area, as it is defined in terms of the first/last glyph area of a line area. Two spaces are still left between the words 'here' and 'This'. They stay because neither white-space-collapse nor white-space-treatment affects them in this case. Contrast this with the case where, because of a narrow page width, the formatter generates a line break between the words 'here' and 'This':

After step 4 (white-space-treatment and line-building) assuming a formatter generated line break between the words 'here' and 'This':

|line||inline|Green.background.here|/inline||/line|
|line|This.is.some.arbitrary.text|/line|

<!> The two spaces between the words 'here' and 'This' have disappeared.
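The difference between the two cases can be reproduced with a small sketch. The representation is made up for this page: plain strings are text directly inside the outer fo:block, a tuple stands for the fo:inline and its text, and the starting point is the state after step 1. Collapsing is applied per sibling run, so it never reaches across the inline boundary.

```python
import re


def collapse(text):
    return re.sub(r"[ \t\r]+", " ", text)


# Children of the outer fo:block in Example 3 after step 1.
children = [
    "   ",
    ("inline", "    Green background   here   "),
    "    This is  some   arbitrary text ",
]


def collapse_siblings(nodes):
    """Step 2: collapse each run of direct siblings on its own."""
    out = []
    for node in nodes:
        if isinstance(node, tuple):
            out.append((node[0], collapse(node[1])))
        else:
            out.append(collapse(node))
    return out


def flatten(nodes):
    return "".join(n[1] if isinstance(n, tuple) else n for n in nodes)


collapsed = collapse_siblings(children)
flat = flatten(collapsed)

# Everything on one line: only the line start and end are trimmed.
one_line = flat.strip(" ")

# Formatter break between 'here' and 'This': trim both sides of the break.
cut = one_line.index("This")
two_lines = [one_line[:cut].rstrip(" "), one_line[cut:].lstrip(" ")]
```

Here one_line still contains "here  This" with two spaces, while two_lines is ["Green background here", "This is some arbitrary text"].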

/!\ The two cases above result in visually quite different output. In the first case (no line break) there is a space between the word 'here' and the border, while in the second case the border directly abuts the word 'here'. The same applies to the start of the inline if there is some text before it. In that case the space before the word 'Green' would not be removed and there would be a gap between the border and the word.

This gets even more confusing if the formatter decides to break directly after the word 'here', pushing the space after 'here' to the next line. The inline now finishes on the second line, and we end up with an 'empty' closing border at the start of the second line, as we cannot really move the border to the line above because the formatter would not have reserved space for it there.

|line||inline|Green.background.here|/line|
|line||/inline|This.is.some.arbitrary.text|/line|

However, it could probably be argued that the sequence <white-space><border-and-padding-end><white-space> should be handled by the formatter in a way that <border-and-padding-end> should never end up at the start of a line area and vice versa <border-and-padding-start> should never end up at the end of a line area.

This is WIP and more may come

LineLayout/WhitespaceHandling (last edited 2009-09-20 23:52:21 by localhost)