Exclude hyphens from selection?

jez
When hyphens get inserted into the PDF for justified text that uses `hyphens: auto;`, the hyphens are included when selecting and copying text. Is there a way to change this behavior, so that the PDF allows copy/pasting the original text, instead of the text as it's presented on the page?

For example, starting from this `foo.html` file:
<style>
@page {
  size: A5;
}

p {
  text-align: justify;
  hyphens: auto;
}
</style>

<p>
  Call me Ishmael. Some years ago—never mind how long precisely—having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. It is a way I have of driving off the spleen and regulating the circulation. Whenever I find myself growing grim about the mouth; whenever it is a damp, drizzly November in my soul; whenever I find myself involuntarily pausing before coffin warehouses, and bringing up the rear of every funeral I meet; and especially whenever my hypos get such an upper hand of me, that it requires a strong moral principle to prevent me from deliberately stepping into the street, and methodically knocking people’s hats off—then, I account it high time to get to sea as soon as I can. This is my substitute for pistol and ball. With a philosophical flourish Cato throws himself upon his sword; I quietly take to the ship. There is nothing surprising in this. If they but knew it, almost all men in their degree, some time or other, cherish very nearly the same feelings towards the ocean with me.
</p>

then running this command:
prince-15.4.1-macos/lib/prince/bin/prince foo.html

produces a PDF whose first two lines of content copy and paste like this:
Call me Ishmael. Some years ago—never mind how long precise-
ly—having little or no money in my purse, and nothing particular

where the hyphen (and even newline) between the two lines is included in the selection.

I'm interested in this because whatever the root cause is, it also seems to affect:

  • PDF highlight annotations (the PDF viewer's annotation list preview includes the hyphen characters)
  • search results (words that were hyphenated across a line break are not found when searching)

  1. foo.pdf (28.7 kB)

jez
Update: I was able to fix this with the `-prince-hyphenate-character` property:
  -prince-hyphenate-character: '\0000AD';

That instructs Prince to use soft hyphens (U+00AD) instead of ASCII hyphens (U+002D). Now when I select, copy/paste, highlight, and search, the hyphen characters are ignored.
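
For reference, here is a sketch of what the full stylesheet looks like with the fix in place. It is just the `<style>` contents of `foo.html` from the first post with the one property added to the `p` rule; nothing else is assumed to change, and it has only been tried with the Prince version mentioned above.

```css
@page {
  size: A5;
}

p {
  text-align: justify;
  hyphens: auto;
  /* Draw line-break hyphens as U+00AD (soft hyphen) instead of U+002D */
  -prince-hyphenate-character: '\0000AD';
}
```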

I'm not sure if this special treatment for soft hyphens is coming from the operating system, the PDF specification, individual PDF readers, or something else, but in any case it accomplishes what I want.
  1. foo.pdf (38.6 kB): fixed PDF from above, using soft hyphens

Edited by jez

jez
I wonder whether the `--tagged-pdf` option (or similar `--pdf-profile` options) should make Prince treat `-prince-hyphenate-character: auto;` as if the soft hyphen character had been specified. I found this discussion about another tool that generates PDFs:

reknih wrote:
The PDF 1.7 spec has the following to say on this in clause 14.8.2.2.3:
Hyphenation. Among the artifacts introduced by text layout is the hyphen marking the incidental division of a word at the end of a line. In Tagged PDF, such an incidental word division shall be represented by a soft hyphen character, which the Unicode mapping algorithm (see "Unicode Mapping in Tagged PDF" in 14.8.2.4, "Extraction of Character Properties") translates to the Unicode value U+00AD. (This character is distinct from an ordinary hard hyphen, whose Unicode value is U+002D.) The producer of a Tagged PDF document shall distinguish explicitly between soft and hard hyphens so that the consumer does not have to guess which type a given character represents.



https://github.com/typst/typst/issues/1267#issuecomment-1555983298