Forum Bugs

Processing of big HTML files hangs when run without the --input=xml flag.

tomek
Hi,
I found a strange issue when processing big HTML files (around 16MB). Maybe it's not an issue but the way how Prince works with html and xhtml files. If so, I would like to better understand it.

So, given an HTML file (*.html), 16MB in size: when I run the prince command to convert it to PDF, it hangs on "Loading document..." for 13-15 minutes and then generates the PDF file. In this case the whole process takes 15-16 minutes in total.

We've noticed that there are two solutions to the problem:
- changing file name from *.html to *.xhtml
- adding --input=xml flag as a command parameter

After applying either of the above "fixes" the process doesn't hang and takes around 1 minute in total.

Could you shed some more light on this?
mikeday
That's interesting, by any chance would you be able to send me (mikeday@yeslogic.com) a compressed HTML file that demonstrates this behaviour?
tomek
Hi, sorry for the late reply, but I was checking whether I could send you any of those files. Unfortunately, all of them contain sensitive data that can't be shared with third parties. I also tried to generate some kind of dummy html file and reproduce the issue with it, but without any success.

It looks like we will simply stick with the --input=xml flag solution because it works, but it doesn't satisfy my curiosity as to why it works this way ;)
mikeday
It would be convenient to have an HTML anonymiser that replaces all of the text with aaaaa but keeps the overall file structure the same. :)
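As a rough sketch, an anonymiser along those lines could be written with Python's standard html.parser module. This is just one possible approach, not an official tool: it emits the markup verbatim but replaces every non-whitespace text character (and every entity) with "a", so tag structure, attribute layout, and approximate file size are preserved while the content is scrubbed.

```python
from html.parser import HTMLParser

class Anonymiser(HTMLParser):
    """Reproduce markup verbatim, but replace text content with 'a'."""
    def __init__(self):
        # convert_charrefs=False so entities reach handle_entityref/charref
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")           # keep the DOCTYPE intact

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # original tag text

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # keep whitespace so structure and line breaks are unchanged
        self.out.append("".join(c if c.isspace() else "a" for c in data))

    def handle_entityref(self, name):
        self.out.append("a")                    # scrub &amp;-style entities

    def handle_charref(self, name):
        self.out.append("a")                    # scrub &#nnn;-style entities

def anonymise(html: str) -> str:
    p = Anonymiser()
    p.feed(html)
    p.close()
    return "".join(p.out)
```

Note that attribute values are kept as-is by get_starttag_text(), so if sensitive data lives in attributes those would need scrubbing too.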

One possibility would be to run Prince with the --log=log.txt argument and check the timestamps in log.txt to see whether parsing or processing is taking longer; that will at least help narrow it down.
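When comparing timestamps between stages, a small helper can save some mental arithmetic. This is a hypothetical utility, and it assumes the log lines contain an HH:MM:SS-style timestamp (the exact format Prince writes to log.txt may differ):

```python
import re

# Assumed timestamp shape: HH:MM:SS somewhere on the line.
TIMESTAMP = re.compile(r"(\d{2}):(\d{2}):(\d{2})")

def elapsed_seconds(line_a: str, line_b: str) -> int:
    """Seconds between the first HH:MM:SS timestamps on two log lines."""
    def secs(line: str) -> int:
        m = TIMESTAMP.search(line)
        if not m:
            raise ValueError(f"no timestamp found in: {line!r}")
        h, mi, s = map(int, m.groups())
        return h * 3600 + mi * 60 + s
    return secs(line_b) - secs(line_a)
```

Feeding it the "Loading document..." line and the line for the next stage would show immediately whether the 13-15 minutes are spent in parsing.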

Does the document contain many tables?