Hello, I am evaluating Prince for use in our batch application that converts EPUBs to PDFs. One of the things we need to do is insert application-specific attributes back into the HTML that indicate page numbers matching the pages produced by the PDF rendering engine. Is it possible to get callbacks or info messages that output each page break and the corresponding CFI (Canonical Fragment Identifier for EPUBs) or something similar? Is there something we can get out of debug mode? How costly would the development be?
Which platform are you running Prince on? We have updated the box tracking API to provide additional information that will help, but it will require building some new Prince packages.
I have uploaded the latest build for Windows, which includes an updated box tracking API. To use it, you need to enable it and then register a JavaScript function that will be called when Prince has finished generating the PDF file:
Prince.trackBoxes = true;
Prince.addEventListener("complete", checkPages, false);
function checkPages() {
    // called when the "complete" event fires, after the PDF has been generated
}
The checkPages function can look at the DOM and see which boxes were generated for each element, and which pages they appeared on. For example, to see where all the paragraphs are:
function checkPages() {
    var ps = document.getElementsByTagName("p");
    for (var i = 0; i < ps.length; ++i) {
        var p = ps[i];
        // boxes generated for this paragraph
        var bs = p.getPrinceBoxes();
        if (bs.length > 0) {
            // report the page of the paragraph's first box
            var box = bs[0];
            Log.warning("para "+p.id+" on page "+box.pageNum);
        }
    }
}
Alternatively, you can iterate over the pages in the PDF, and find the first text on the page and see what element it is from:
function checkPages() {
    for (var i = 0; i < PDF.pages.length; ++i) {
        var page = PDF.pages[i];
        var box = getFirstTextBox(page);
        if (box) {
            Log.warning("first text box on page "+(i+1)+" is for element "+box.element.id);
        }
    }
}

// depth-first search for the first TEXT box that belongs to an element
function getFirstTextBox(box)
{
    if (box.type == "TEXT" && box.element) return box;
    for (var i = 0; i < box.children.length; ++i)
    {
        var curr = getFirstTextBox(box.children[i]);
        if (curr) return curr;
    }
    return null;
}
So that is roughly how the API works. As you can see, there are quite a few things you can do with it, but that is a quick introduction to get started.
Thank you Mike. Is there anything written up about that PDF box model? In particular, will I be able to get an offset into text, when the text of the paragraph overflows from one page to the next? Do you do anything to the original DOM? For example, when I construct a CFI out of a box.element, will it point to the right place in the original HTML?
Currently it is not possible to get an offset into the actual text of an element, so I think the best you could do for now is link to the first paragraph that begins on that particular page, and ignore paragraphs that continue on to that page.
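For anyone reading along, here is a minimal sketch of that approach: map each page to the first paragraph that begins on it and skip paragraphs that merely continue onto it. It relies only on getPrinceBoxes() and pageNum as shown earlier in this thread. The helper name firstParagraphPerPage is made up for illustration, and treating bs[0] as the box where the paragraph starts is an assumption.

```javascript
// Sketch: map each page number to the first paragraph that *begins* on it.
// Assumption: getPrinceBoxes() returns boxes in layout order, so bs[0] is
// the box where the paragraph starts.
function firstParagraphPerPage(paragraphs) {
    var firstOnPage = {};
    for (var i = 0; i < paragraphs.length; ++i) {
        var p = paragraphs[i];
        var bs = p.getPrinceBoxes();
        if (bs.length === 0) continue; // element generated no boxes
        var startPage = bs[0].pageNum;
        // Keep only the earliest paragraph (in document order) per page.
        if (!(startPage in firstOnPage)) {
            firstOnPage[startPage] = p;
        }
    }
    return firstOnPage;
}

// Only runs inside Prince, where the Prince and Log objects exist.
if (typeof Prince !== "undefined") {
    Prince.trackBoxes = true;
    Prince.addEventListener("complete", function () {
        var map = firstParagraphPerPage(document.getElementsByTagName("p"));
        for (var page in map) {
            Log.warning("page " + page + " starts with paragraph " + map[page].id);
        }
    }, false);
}
```

Pages that contain only the continuation of an earlier paragraph get no entry in the map, so the caller has to decide how to label them.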
The original DOM is not changed, so if you have an ID on the element you can get access to that, or you could walk up the DOM tree to make some kind of path-based identifier.
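As a rough illustration of the path-based idea, the sketch below walks up from an element and records its position among its element siblings at each level. It uses only the standard DOM properties nodeType, nodeName, previousSibling and parentNode, which I am assuming are available in Prince's scripting environment. The function name elementPath is hypothetical, and the output is a simple path string, not a valid EPUB CFI.

```javascript
// Sketch: build a path such as "/html[1]/body[2]/p[5]" for an element.
// Each index is the element's 1-based position among its element siblings.
function elementPath(el) {
    var steps = [];
    // Stop at the document node or when there is no parent left.
    while (el && el.nodeType === 1) {
        var index = 1;
        for (var sib = el.previousSibling; sib; sib = sib.previousSibling) {
            if (sib.nodeType === 1) ++index; // count element nodes only
        }
        steps.unshift(el.nodeName.toLowerCase() + "[" + index + "]");
        el = el.parentNode;
    }
    return "/" + steps.join("/");
}
```

Since the original DOM is not changed, a path computed this way should resolve against the source HTML, provided the parser that later consumes it builds the same tree.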
Mike, are you sure that the Windows build you created on Apr 06 contains the PDF box model? Your code fails with PDF undefined, and getPrinceBoxes() fails as well.
The conversion itself works very well and my converted EPUBs look awesome.
A problem with the code example above is that a page sometimes starts not with text but with an image, for example. Sometimes there is no text on a page at all. So I decided to find the lowest box on the stack and use its box.element as the beginning of the page. This often yields good results, but I have a book that uses lots of <br> elements inside <p> elements. Unfortunately, a <br> element does not show up as any kind of box, and the last box on the stack has type TEXT and nodeName P. Is it possible to fix that? Or is this my mistake, and <br> elements are Prince boxes of some type?
I think in many situations <br> elements will not generate any boxes, they will just force a line break. Most empty spans will not generate boxes either, but if you wrap up some text in a span then it will generate a box.
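For pages that begin with an image or contain no text at all, one possible variation on getFirstTextBox is to ignore the box type and return the deepest first box that belongs to any element. This is only a sketch: it uses the children and element properties shown earlier in this thread, the name getFirstElementBox is made up, and it cannot find <br> elements, since those generate no boxes.

```javascript
// Sketch: return the deepest box, in depth-first order, that has an element,
// regardless of its type. Returns null if nothing on the page has an element.
function getFirstElementBox(box) {
    var children = box.children || [];
    for (var i = 0; i < children.length; ++i) {
        var found = getFirstElementBox(children[i]);
        if (found) return found;
    }
    // No descendant matched, so fall back to this box if it has an element.
    return box.element ? box : null;
}
```

The result's box.element can then be used as the start of the page, whether it came from text, an image, or some other content.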
We would hate to drop the idea of using Prince and go back to Flying Saucer and start modifying it for HTML5. Can somebody in your company quote a price for developing the functionality I am looking for in Prince? In other words, for each page break, generate a CFI (or some other DOM-based scheme) that points to the first visible element or text fragment on the page.
We have made a number of changes to the box tracking API in the latest builds, including providing access to the text of TEXT boxes via the "text" property (although this currently does not handle right-to-left scripts or Indic scripts very well).
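With the "text" property, it may be possible to estimate where in an element's text a page begins. The sketch below is a heuristic under stated assumptions: it assumes box.text is a plain string, that element.textContent is available in Prince's scripting environment, and that the box text appears verbatim in the element's text, which can fail with collapsed whitespace, hyphenation, or repeated phrases. The function names are hypothetical.

```javascript
// Sketch: find the first TEXT box on a page (same search as getFirstTextBox).
function findFirstTextBox(box) {
    if (box.type == "TEXT" && box.element) return box;
    var children = box.children || [];
    for (var i = 0; i < children.length; ++i) {
        var curr = findFirstTextBox(children[i]);
        if (curr) return curr;
    }
    return null;
}

// Heuristic: character offset of the box's text within its element's text.
// Returns -1 when the text is unavailable or has no exact match.
function estimateTextOffset(box) {
    if (!box || typeof box.text !== "string") return -1;
    var full = box.element.textContent || "";
    return full.indexOf(box.text);
}
```

An offset of 0 suggests the element begins on that page, while a positive offset suggests the page starts partway through it. Treat the value as a hint, not an exact position.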