You probably noticed the not-so-graceful hyphenation in the lines above. It was placed there on purpose, to make you twitch. The rest of this guide shows you how to avoid such embarrassing situations.
Our formatter of choice is Prince, an HTML-and-CSS-to-PDF converter. The screenshots you see, and the PDF documents linked from this guide, have all been generated with Prince. You can easily create the same PDF files by downloading Prince and pointing it to the HTML links provided in this document.
The longest word in the English dictionary is pneumonoultramicroscopicsilicovolcanoconiosis,
which is the name of an unpleasant lung disease. Let's see how Prince handles it when manual hyphenation is on:
body { hyphens: manual }
As you can see, Prince will not split a word when manual hyphenation is on, unless you place soft hyphens inside the word:
<p>Pneumonoultramicroscopic&shy;silicovolcanoconiosis
The named entity &shy; is a soft hyphen, which only becomes visible when the word is broken.
Adding soft hyphens in the HTML code is cumbersome. An easier method is to write a CSS rule which transforms all instances of a word into another string. And in that other string we can add soft hyphen characters, which are encoded as \AD in CSS:
body { hyphens: manual; prince-text-replace: "Pneumonoultramicroscopicsilicovolcanoconiosis" "Pneu\AD mono\AD ultra\AD micro\AD scopic\AD silico\AD volcano\AD coniosis"; }
The prince-text-replace property is a Prince-specific extension which is very powerful. And very dangerous. Let this example serve as a stern warning:
body { prince-text-replace: "oui" "non"; }
<p>Mais, oui!
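One way to limit the reach of prince-text-replace is to attach it to a narrower selector than body. The following is only a sketch under that assumption: the class name "term" is a hypothetical choice, and the approach does bring back some HTML markup.

```html
<style>
  /* "term" is a hypothetical class name chosen for this sketch */
  .term {
    hyphens: manual;
    /* \AD is the CSS escape for a soft hyphen */
    prince-text-replace: "Pneumonoultramicroscopicsilicovolcanoconiosis"
                         "Pneu\AD mono\AD ultra\AD micro\AD scopic\AD silico\AD volcano\AD coniosis";
  }
</style>
<p>... <span class="term">Pneumonoultramicroscopicsilicovolcanoconiosis</span> ...
```

With a rule like this, text outside the marked element is left untouched by the replacement.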
Adding soft hyphens on a per-word basis is hard work. It's much easier to turn on automatic hyphenation:
body { hyphens: auto }
Automatic hyphenation will sometimes result in too many split words. Notice how 3 out of 5 lines are hyphenated in this otherwise normal paragraph:
body { hyphens: auto }
There are several ways to avoid this. First, one can mark words that should not be hyphenated:
... <nobr>aliqua</nobr> ...
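The nobr element is not part of standard HTML. A possible alternative, sketched here with a hypothetical class name "nohyphen", is to switch hyphenation off for the marked word with the none value of the hyphens property:

```html
<style>
  body { hyphens: auto }
  /* "nohyphen" is a hypothetical class name; hyphens: none disables
     hyphenation inside the element, including at soft hyphens */
  .nohyphen { hyphens: none }
</style>
<p>... <span class="nohyphen">aliqua</span> ...
```

For a single word the effect should match the nobr version, since a word without hyphenation points offers no place to break.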
But again, marking specific words in the HTML code is cumbersome and it is easier to put CSS to work.
There are several properties to put constraints on hyphenation. One can use the hyphenate-lines property to limit the number of consecutive lines that end with a hyphen:
body { hyphenate-lines: 1 }
Or one can use the hyphenate-before property to specify the minimum number of letters that may be left at the end of a hyphenated line:
body { hyphenate-before: 3 }
Likewise, there is a property hyphenate-after to specify the minimum number of letters that may be moved to the next line. It doesn't work so well in our example:
body { hyphenate-after: 3 }
Increasing it to 4 helps:
body { hyphenate-after: 4 }
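The constraints can also be set together in one rule. The sketch below simply reuses the values from the examples above; whether this particular combination gives the best result depends on the text and the line width, so treat the numbers as a starting point rather than a recommendation.

```html
<style>
  body {
    hyphens: auto;         /* turn automatic hyphenation on */
    hyphenate-lines: 1;    /* limit on consecutive hyphenated lines */
    hyphenate-before: 3;   /* at least 3 letters before the hyphen */
    hyphenate-after: 4;    /* at least 4 letters moved to the next line */
  }
</style>
```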
When automatic hyphenation is on, Prince uses hyphenation patterns borrowed from TeX to decide where to split words. Prince ships with hyphenation patterns for a number of European languages. For example, in the hyph.css file, which you can find in the Prince installation directories, one reads:
:lang(da) { prince-hyphenate-patterns: url("../hyph/hyph-da.pat") }
The code above means that Danish text will use the hyphenation patterns found in the "hyph-da.pat" file. The "lang" attribute is used to set the language in the HTML code:
<html lang="da"> ... <p lang="it">...
In the above example, the document is labelled as being Danish, and the paragraph inside is labelled as Italian.
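Putting the pieces together, a complete document might look like the sketch below. It assumes that the default hyph.css rules are in effect and that the installation includes a pattern file for Italian as well as Danish; the paragraph text is placeholder content.

```html
<html lang="da">
<head>
<style>
  /* Automatic hyphenation for the whole document; the pattern file
     is selected per language by the :lang() rules in hyph.css */
  body { hyphens: auto }
</style>
</head>
<body>
<p>Danish text, hyphenated with the Danish patterns ...</p>
<p lang="it">Italian text, hyphenated with the Italian patterns ...</p>
</body>
</html>
```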
Hyphenation is a fundamentally difficult problem, and the techniques described in this guide are not sufficient for achieving optimal formatting in all cases. In particular, the hyphenation patterns have limitations, and language-specific dictionaries must be used to determine where words can be split. High-end publishers should try Prince for Books, a special edition of Prince which ships with dictionaries for English and German.