**Contents**

Contents

I'm planning to read the following literature:

- "Digital Typography", Knuth: glue/box/penalty model, optimal line-breaking algorithm
- "Pagination Reconsidered", Brüggemann-Klein, Klein, Wohlfeil: a better algorithm for placing floating objects than TeX's one.
- Pages of this Wiki related to the Knuth Approach
- Have a look at Simon Pepping's generalized glue/box/penalty model.

# 1. "Digital Typography": Breaking Paragraphs into Lines

The algorithm works when each line of the paragraph has a different length. This may be interesting for implementing side-floats, provided that we know in advance which height each line will have. And I think this may depend on the line-stacking strategy.

There are three main aspects which characterize the Knuth approach:

- build a sequence of glue/box/penalty elements from some input data;
- define a somewhat arbitrary formula used to compute the demerits of each break, and which is to be minimized;
- the algorithm itself, which corresponds to the finding of a shortest path in an acyclic graph. The nodes are breakpoints and each edge is tagged by the demerits of the corresponding line.

To enter a bit in the details:

- a line may be broken either at a penalty item, or at a glue immediately following a box;
- when a line is broken, all of the glue/penalty items at the beginning of the line up to the first box are removed.
- those properties may be utilized to achieve various effects, such as ragged-right/ragged-left/centered lines. Many effects rely on the fact that a negative value may be given to the stretching of a glue
- the definition of the to-be-minimized formula may vary according to the criteria of optimization. In the formula used by TeX to justify text, there are three criteria:
- the badness beta, related to the adjustment ratio of the line (how much it is stretched/shrinked);
- the penalty pi, if the line breaks at a penalty item;
- an additional penalty alpha, if this line and the previous ones end at a flagged penalty item.

Those three values are mixed together to compute the line's

*demerit*. The algorithm will find a set of lines with a minimal sum of demerits.

Regarding (before-)floats:

- the fo:float element (with the "float" property set to "before") will have to be converted to a sequence of glue/box/penalty elements, which themselves will have to be inserted at the beginning of certain pages;
- the formula used to determine the optimality of floats placements should be retrieved from the "Pagination Reconsidered" paper. I still have to see how to mix it (if any) with the formula currently used to break pages. I assume that it is different from the formula used for paragraphs, as there are different constraints (more to come when I've looked at the code);
- the line-breaking algorithm isn't designed to handle a floating sequence of g/b/p items. That is, a subsequence that is not fixed in the larger general sequence, but may be put at other places and possibly split. The algorithm will have to be extended, probably by the algorithm found in "Pagination Reconsidered";
- All what have been stated so far may be invalidated by further readings...

# 2. "On the Pagination of Complex Documents"

From Brüggeman-Klein, Klein, Wohlfeil. This article may be found here. Also, from the same authors, the article "Pagination Reconsidered" gives the details of the algorithm.

## 2.1. Some Theoretical Background

This article let me understand some abstract background of document layout. In fact, document layout is somewhat similar to a bin packing problem: bins correspond to pages, objects to typeset material of some height. Those objects usually come from different flows: before-floats, main flow, footnotes. The order of each flow must be respected (a figure appearing after another one in the list must appear later on the pages), but there is some freedom in choosing from which flow the next typeset element will be taken.

An optimal way of putting those elements on the pages must be found. In other words there is some cost function to minimize. This cost function should reflect what a human would call a good pagination; among various criteria we may cite the following: figures should appear as near to their first reference as possible; footnotes *must* appear on the same page as their citation; facing pages (or columns) should be balanced; etc.

As for packing problems, page layout will usually be a combinatorial NP-hard problem. However, if the cost function has some particular properties (namely optimal substructure), it becomes possible to solve it in polynomial time by using dynamic programming. The cost function used by Knuth and Plass for computing demerits of each line of a paragraph has such a property.

For the page layout problem, the main issue is then to find a cost function:

- which correctly reflects the quality of a layout;
- which has the right properties to make it possible to use dynamic programming for minimizing it.

Note that other techniques than dynamic programming may be used to layout pages. For example, by using some heuristic approach combined to genetic algorithms or simulated annealing (like in this article). But exploring those approaches would lead us much too far. Let's concentrate on the dynamic programming approach implemented in Fop.

## 2.2. Comments on the Article

- It is assumed that paragraphs have already been broken into lines. The main restriction of such an assumption is that it won't be possible to apply this method to sequences of pages of different widths. However, this shouldn't prevent us from registering several layouts of the same paragraph with differing numbers of lines. Thus it would be possible to choose a layout with fewer lines to avoid an orphan, for example.
- In his thesis (see below), Plass defines a badness function for measuring pagination quality that makes the problem NP-hard. In this article, another function is used which in the authors' opinion better reflects a human's judgment, and for which there is a polynomial solution. In short, they use the number of page turns necessary to read the document, that is, the total number of pages + the number of page turns caused by figures appearing on a different page from their citations. Plass used the sum of the squares of page differences between figures and their citations.
- The page model used to explain the algorithm is pretty similar to the one used by FO: an optional figure region at the top of the page, with optional glue separating it from the text. Good point.
- The authors utilize the fact that on double-sided documents, a figure may appear on the left page if its citation is on the right page. I think the XSL-FO spec doesn't allow this, but perhaps that a user-configurable extension...
- The computation time of the pagination algorithm presented in this paper is proportional to the number of text lines times the number of figures. Note that this article does not deal with footnotes, but it should be easy to extend it to handle them.
- It is possible to tweak the algorithm to allow pages which are not entirely full (yet still with balanced facing pages). Indeed sometimes it is impossible to find a layout which respects all of the contraints (full pages, no widow/orphan, keeps...)
- Although the algorithm assumes that each line of text is separated by some glue, it only relies on their minimum and maximum values. It doesn't take the optimal value into account. And as it tries to minimize the total number of pages, we may end up with spaces systematically set to their minimum value. Indeed, if there are no before-floats nor footnotes, it will obviously be the case. I think this may break the XSL-FO spec; at least we can expect complaints from users.
- Related to this: the possibility to break a page is binary: allowed or not. This constraint is IMO too heavy: although an orphan is undesirable, it should be acceptable when no better layout is possible. In fact this algorithm doesn't utilize the whole range of penalty values available in the Knuth model; only two are allowed, 0 and infinity.

## 2.3. Resulting Thoughts

The algorithm cannot be used directly as it is presented, mainly because of the two last items of the previous section. However, it provides a strong basis IMO. Some points:

- There may be cases where it is impossible to find a layout which fulfills all of the criteria. We may consider to perform a second pass with some relaxed properties, but:
- can we afford the additional computation time?
- which properties to relax, in which order? Possibilities:
- orphans/widows properties; but contrary to TeX, in XSL-FO it is only possible to define the number of orphans/widows, not a penalty associated to them.
- keep properties: those may have integer values that allow for some flexibility
- page fullness: AFAICT the XSL-FO spec does not impose that pages be full. And as explained earlier, the algorithm can use this possibility. According to the article this becomes quite easy to find a layout just by allowing pages to be only 90% filled. This might even become possible to perform only one pass (which would ideally find the optimal pagination when possible, and one with some percentage of fullness otherwise). (See also the recent thread on fop-dev on this subject.)

There is a free bonus implied by this algorithm: this should give pages which are vertically justified

^{1}. That is, the bottoms of pages should coincide with the bottom of the region-body area. The only exception would be in "relaxed mode", when a percentage of fullness is introduced. Such a feature should remain optional, however. But I think the exact same situation occurs with justified or ragged-right text: the algorithm is the same, just the glues are actually stretched/shrinked or left as is.- BTW, is there in XSL-FO a way of determining if the document is single- or double-sided? We can check if separate page masters are used for even and odd pages. Such a check should be reliable, though. Anyway, such a thing may probably be deferred to a fine-tuning period, when we have a working basic implementation.
- The algorithm may be generalized to allow several kinds of floats e.g., tables and figures. In such cases, even if the order of each float sequence must be respected, this is possible to place a table before a figure even if it is referenced later in the main text. That said, XSL-FO defines only one kind of before-floats. Anyway, there is room for future Fop extensions IMHO...

When there are no figures nor footnotes, pagination becomes equivalent to breaking paragraphs into lines. In such a particular case the Knuth algorithm may be used. It would thus be great to find a cost function and an associated algorithm that become equivalent to line-breaking in such a case. Anyway, this would be great to utilize the full potential of the g/b/p model for page-breaking. And, moreover, to rely as much as possible on the current source code. More to come when I have looked at it.

# 3. "Optimal Pagination Techniques for Automatic Typesetting Systems"

Michael Frederick Plass -- PhD thesis

## 3.1. Introduction

Plass starts by having a look at the practices which were in use in the past. In short, compositors were mostly performing a first-fit pagination, because of the difficulty to look ahead and because re-typesetting the n-1 previous pages to have a better-looking page n demanded too much time. Footnotes were particularly difficult to handle. Others problems were balanced facing pages, and obviously floats placement.

The g/b/p model used for line-breaking does not readily apply to page-breaking, it must be extended with some "insert" operator to handle figures and footnotes.

## 3.2. Pagination may be an NP-complete Problem

The basic algorithm which consists in choosing the best pagination among all of the possible ones is obviously exponential. But depending on the chosen badness function, the problem may become solvable in polynomial time.

Basically Plass shows that, among others, "quadratic" badness functions lead to an NP-complete problem. Such functions compute the square of the differences between the page numbers of their citations and the page numbers of their corresponding figures.

In reality this is a bit more subtle. Indeed, with certain constraints, quadratic functions may be solvable in polynomial time. Plass actually shows that this is quite easy to choose constraints which makes the problem exponential. For example, simply allowing glue between the elements suffices to make the problem NP-complete. The issue is then to find the right constraints (number of citations per figure, number of objects per page, constant page size or not, and so on) and the right badness function, so that the problem still is a good representation of the reality yet is solvable in polynomial time.

## 3.3. Making Pagination Polynomial

It may be interesting to quickly explain the case where there are no footnotes nor figures. The problem becomes equivalent to line-breaking and lets understand the main basis on which the whole dynamic programming approach relies.

We define B_{jk} to be the badness corresponding to the best way to put the first j lines on the first k pages. Then the following formula holds:

B

_{jk}= min_{0<=i<j}{B_{i k-1}+ β_{ijk}}

where β_{ijk} is the badness due to placing lines i+1 through j on page k. The best way to put the first j lines on the first k pages only depends on the best ways to put i lines on the first k-1 pages. The only problem is to find this particular i for which (the badness of the first k-1 pages) + (the badness of page k) will be minimal.

We can easily see the recursive nature of this problem. The rest, well, "only" corresponds to implementation details.

When introducing footnotes and figures, the challenge is to find a formula which has a similar structure. Plass begins with a formula which computes the distance (in absolute value) between a figure and each(!) of its citations. The problem is solvable but the constraints are unrealistic: their may be only one type of element on each page, either text or a figure. Plass then generalizes the problem by handling any number of independant flows (normal text, figures, tables, etc.), but again with only one element per page.

Ok, I've decided to stop here. At the end of the paragraph, Plass presents a differently generalized problem and tries to find sufficient conditions which make it solvable by dynamic programming. In this case the previous weird constraints are relaxed: their may be any number of element on each page, several flows, several references from any flow to any other, etc. Then he formulates the cost function as depending on 4 independent arbitrary sub-functions, which give some freedom to the characterization of good pagination.

Then there is an entire chapter describing an implementing algorithm, which is very similar to the line-breaking algorithm. Here I think it is more interesting to refer to the work of Brüggemann et al., which is easier to read and more directly re-usable than Plass' work. It will only have to be adapted to the g/b/p model, which shouldn't be too difficult.

It may be worth returning to Plass' work if we encounter unexpected difficulties during the implementation of floats (e.g., the chosen algorithm doesn't provide satisfying results), as it provides a strong theoretical basis. For now I think I know enough to proceed to the implementation.

# 4. The Fop Source Code

Even if it is well explained, the Knuth line-breaking algorithm isn't that easy to understand. ATM I've concentrated on the class layoutmgr.BreakingAlgorithm, which contains the part of the algorithm which is common to page- and line-breaking. It is splitted in parts which follow pretty closely those described in "Digital Typography". It relies on the following skeleton:

findBreakingPoints(a paragraph, max allowable ratio for stretching, force or not the finding of breakpoints) initialize the various fields and variables For each Knuth element of the paragraph Do update totalWidth, totalStretch, totalShrink If the element is a feasible breakpoint Then > considerLegalBreak(at this element) EndIf EndFor If no set of feasible breakpoints could be found Then reactivate the last node and restart the algorithm from it EndIf Select some active nodes among the available ones > filterActiveNodes() For each remaining active node Do Compute the corresponding set of optimal breakpoints > calculateBreakPoints EndFor considerLegalBreak(Knuth element) For each currently active node Do compute the adjustment ratio of a line starting at the active node and ending at this element If the adjustment ratio is unacceptable Then deactivate this node Else compute the fitness class of the corresponding line > computeFitness update the total demerits by adding the demerits of this line > computeDemerits If the result is better than the best recorded demerits for the given fitness class Then register this new best active node > BestRecords.addRecord EndIf EndIf (perform something when the adjustment ratio is unacceptable and we are in forced mode) EndFor add new active nodes for breaks at this element > addBreaks addBreaks(element ending the line, line number) compute the new totalWidth, totalStretch, totalShrink (after this element) For each fitness class Do If the recorded best active node for this fitness class is to be retained Then create a new active node for a break from this node to the element EndIf EndFor calculateBreakPoints(last active node, corresponding paragraph, corresponding number of lines) For j = number of lines DownTo 1 Do breakpoint = node.element node = node.previous EndFor

don't forget to have a look at Luca's LayoutExtensions on this issue (1)