We have a problem with SVG image filenames in an HTML document when they contain special characters.
All files are downloaded in advance with wget. The URL of the SVG image contains e.g. '\', which wget encodes as '%5C', so that sequence becomes part of the downloaded image filename.
Within the HTML document the '%' is encoded as '%25', so the resulting filename in the 'src' attribute contains '%255C' for the original '\'.
This all seems correct to me.
But the generated pdf document does not contain the image.
file system:
pic%5Ctest.svg
html document index.html:
<img src="pic%255Ctest.svg" />
If I rename the file removing all special characters and change the 'src' attribute within the html document accordingly, everything works fine.
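To make the two encoding steps concrete, here is a minimal Python sketch of the chain described above. It assumes that urllib's quote() is a reasonable stand-in for the escaping wget applies to the backslash; wget's real rules differ in detail, so treat this as an illustration of the expected result rather than of wget itself.

```python
from urllib.parse import quote, unquote

# Name as it appears in the original image URL (a single backslash).
original = "pic\\test.svg"

# Step 1: the backslash is percent-encoded when the local file name is built
# (quote() stands in for wget's own escaping here, which is an assumption).
on_disk = quote(original, safe="")

# Step 2: to reference that file from HTML, the literal '%' must be escaped too.
in_src = quote(on_disk, safe="")

# A URL-aware consumer decodes the src value exactly once to find the file.
decoded_once = unquote(in_src)

print(on_disk)   # pic%5Ctest.svg
print(in_src)    # pic%255Ctest.svg
```

Decoding the 'src' value once gives back the name on disk, which matches what --debug reports as the loaded resource.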
I found out some strange things trying to solve the problem:
1. Starting prince with --debug reports no error : prince: debug: loaded resource: pic%5Ctest.svg
2. With the special character '/' (%2F) there is no problem!
3. With png images there are also no problems with special characters
Another problem I noticed in this context: the SVG file requires the '.svg' extension; otherwise the warning 'Unknown image format' is printed and the SVG is missing from the PDF. Again: with PNG files the filename extension is not needed!
At the moment Prince does not try to guess whether a file is SVG or not. If it is a HTTP URL, then it will check the content-type header returned by the server, and if it is a local file it will check if it has an .svg extension.
For other image types, it will try loading the file in different ways, because of the high number of misidentified image files, e.g. PNG images saved with a .jpeg extension.
Are you using a different extension for your SVG images?
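For illustration only, here is a rough Python sketch of the difference between trusting the extension and looking at the content. This is not how Prince is implemented; sniff_image_type is a hypothetical helper, and the SVG test is a loose heuristic, because SVG is XML text and has no fixed signature the way PNG and JPEG do.

```python
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"   # fixed 8-byte PNG signature
JPEG_MAGIC = b"\xff\xd8\xff"       # JPEG start-of-image marker
UTF8_BOM = b"\xef\xbb\xbf"

def sniff_image_type(head: bytes) -> str:
    """Guess an image type from the first bytes of a file (hypothetical helper)."""
    if head.startswith(PNG_MAGIC):
        return "png"
    if head.startswith(JPEG_MAGIC):
        return "jpeg"
    # SVG has no magic number: skip an optional BOM and leading whitespace,
    # then look for markup containing an <svg tag near the start.
    if head.startswith(UTF8_BOM):
        head = head[len(UTF8_BOM):]
    text = head.lstrip()
    if text.startswith(b"<") and b"<svg" in text:
        return "svg"
    return "unknown"

print(sniff_image_type(b'<?xml version="1.0"?>\n<svg xmlns="http://www.w3.org/2000/svg"/>'))
```

A script like this could be run over the downloaded files to find extensionless SVGs before conversion, passing in the first kilobyte or so of each file.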
The CGI program exports the SVGs from a database, which is why our SVG filenames currently don't have any extension. It would take considerable effort to change that. Perhaps you could consider letting Prince also check for SVG content. That would be very helpful for us!
But the '?' is originally the separator between the URL's path and query string components, which is never escaped. The resulting filename comes from the download with wget. We had no problems with Prince 9 in this respect.
Yes, I was suggesting --restrict-file-names=windows, to avoid the ? character. The other approach would be to rewrite the image URLs inside your HTML document. You could use JavaScript regular expressions to do this.
The new Prince behaviour for URL parsing is more correct according to the specification, and also consistent with what web browsers do, so it is generally not possible to interpret the unescaped ? character without breaking other behaviour.
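As a sketch of the URL-rewriting approach, the following does the same kind of regular-expression rewrite in Python as a pre-processing step, rather than in JavaScript inside the document. It assumes that each local 'src' value currently holds the literal on-disk file name with nothing escaped yet; escape_local_src is a hypothetical helper name.

```python
import re
from urllib.parse import quote

# Matches <img ... src="..."> and captures the attribute value separately.
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=")([^"]*)(")', re.IGNORECASE)
# Anything starting with a URL scheme (http:, https:, file:, ...) is left alone.
HAS_SCHEME = re.compile(r"[a-zA-Z][a-zA-Z0-9+.-]*:")

def escape_local_src(html: str) -> str:
    """Percent-encode special characters in local img src values (sketch)."""
    def fix(match):
        src = match.group(2)
        if HAS_SCHEME.match(src):
            return match.group(0)
        # Assumption: src is the raw file name, so '%' is escaped as well;
        # only '/' is kept so relative paths still work.
        return match.group(1) + quote(src, safe="/") + match.group(3)
    return IMG_SRC.sub(fix, html)

print(escape_local_src('<img src="image.cgi?id=5" />'))
```

Running it twice would double-encode the values, and it does not decode HTML entities such as &amp; first, so it is only suitable as a one-off pass over freshly downloaded pages.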
Prince follows the same rules for decoding URLs as browsers, so special characters like ? & # and so on need to be escaped.
I think using --restrict-file-names=windows should be sufficient, yes. Although really it would be helpful if wget had more convenient mechanisms for rewriting links to local files.