We produce PDFs for our customers using Prince, and some of our customers have had issues with the PDFs when they needed to convert them to HTML.
While we have no access to the converter our customer uses, we suspect it may be based on pdf2htmlEX, and we have managed to reproduce a similar issue ourselves.
Attached are two HTML documents that differ only in the presence of a third p element. Prince 15.2 processes both without issues. However, the PDF generated from the document with only two p elements produces the following warning when it is converted back to HTML by pdf2htmlEX:
ToUnicode CMap is not valid and got dropped for font: 1
When this warning is produced, the text in the converted HTML is corrupted, and copying and pasting it results only in a mess of missing Unicode characters.
I have used a Google Font for better reproducibility, but have also seen the same warning with a sans-serif system font.
This could be a bug in pdf2htmlEX and not Prince, so I'm thankful for any help to find the root of the issue.
PS: even the successful conversion to HTML leaves some slightly corrupted text where ligatures occur: "fling, fish, and affinity" becomes "$ing, #sh, and a%nity".
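
To make the ligature symptom easier to check, here is a minimal Python sketch. It is not part of our actual pipeline, and the function name `expand_ligatures` is only illustrative. It assumes the font uses the standard fl, fi and ffi ligatures (U+FB02, U+FB01, U+FB03) and shows what the extracted text would look like if those glyphs were mapped to their Unicode code points, compared with what we actually get.

```python
import unicodedata


def expand_ligatures(text: str) -> str:
    """Expand Unicode presentation-form ligatures into plain letters (NFKC)."""
    return unicodedata.normalize("NFKC", text)


# Assumption: text as it would be extracted if the ligature glyphs were
# mapped to U+FB02 (fl), U+FB01 (fi) and U+FB03 (ffi).
expected_extraction = "\ufb02ing, \ufb01sh, and a\ufb03nity"

# What the converted HTML actually contains.
observed_extraction = "$ing, #sh, and a%nity"

print(expand_ligatures(expected_extraction))  # fling, fish, and affinity
print(expand_ligatures(observed_extraction))  # unchanged: no ligature code points to expand
```

The observed text contains plain ASCII characters rather than ligature code points, so normalization cannot recover the original words.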