Extracting page break location

_savage
3 Sep 2024

Hi,

My question is somewhat related to Prince’s target-counter and perhaps logging, but I can’t quite put the puzzle together…

So, for a given XML file I’d like to know where the rendered PDF changes pages.

With the DOM in mind it’s probably a little tricky to log exactly where a text node crosses/breaks from one page to another or where in between two elements a page break happened. Notations like XPath (insufficient for text node offsets) or EPUB CFI or Python lxml’s ObjectPath come to mind.

How would I go about logging/extracting the page break locations in the rendered PDF?

Thank you!

mikeday
4 Sep 2024

The attached file gives a rough sketch of recursing into the box tree to find the position of the first content on the page. As you point out, it's difficult to give the exact position without doing a lot of work. For example, to give an accurate estimate of where the break occurred in a text node, it would be necessary to look back at the text on the previous page as well. You could instead just do a rough match, and it would be mostly right.

break.html (1.3 kB)
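
For readers without the attachment, the recursion described above might look roughly like the sketch below. It is not the contents of break.html. It assumes Prince's box tracking API (`Prince.trackBoxes`, the `complete` event, `PDF.pages`, and boxes exposing `type`, `text`, `element`, `parent` and `children`); those names are written from memory and should be checked against the Prince documentation. The helper names `firstTextBox` and `pageStarts` are made up for illustration.

```javascript
// Depth-first search for the first box under `box` that carries visible text.
// `box` is ASSUMED to look like a Prince box: { type, text, element, parent, children }.
function firstTextBox(box) {
  if (box.type === "TEXT" && box.text && box.text.trim() !== "") {
    return box;
  }
  // Accept `children` either as an array or as a function returning one,
  // since the exact shape is an assumption here.
  var kids = typeof box.children === "function" ? box.children() : (box.children || []);
  for (var i = 0; i < kids.length; i++) {
    var found = firstTextBox(kids[i]);
    if (found) return found;
  }
  return null;
}

// For each page box, report the first text on that page and the tag name of
// the nearest enclosing box that has an element attached.
function pageStarts(pages) {
  var out = [];
  for (var i = 0; i < pages.length; i++) {
    var hit = firstTextBox(pages[i]);
    var owner = hit;
    while (owner && !owner.element) owner = owner.parent;
    out.push({
      page: i + 1,
      text: hit ? hit.text : null,
      element: owner ? owner.element.tagName : null
    });
  }
  return out;
}

// Only runs inside Prince; the API names below are assumptions (see above).
if (typeof Prince !== "undefined") {
  Prince.trackBoxes = true;
  Prince.addEventListener("complete", function () {
    var starts = pageStarts(PDF.pages);
    for (var i = 0; i < starts.length; i++) {
      console.log("page " + starts[i].page + " <" + starts[i].element + "> " + starts[i].text);
    }
  }, false);
}
```

This only reports the first text on each page. As noted above, locating the break inside a text node would still require matching that text against the source node, for example by comparing it with the end of the previous page.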

_savage
7 Sep 2024

This is a great start, thank you @mikeday! I’ll play around with it, and I’ll let you know how things unfold…
