Fix sanitizer with libxml2 >= 2.12.0
Somehow, with newer libxml2, prepending `<?xml encoding="UTF-8">` no longer enforces UTF-8. Instead, non-ASCII content is treated as ISO-8859-1 and gets mangled. For example, `<p>中文</p>` comes out as mojibake instead of `<p>中文</p>`.

Switching to another trick mentioned in [1], prepending `<meta charset="UTF-8">`, fixes the issue, and the new trick still works with older libxml2 (tested with 2.11.5).

As a side note, DOMDocument::loadHTML uses HTMLParser in libxml2 [2][3].

[1] https://stackoverflow.com/questions/8218230/php-domdocument-loadhtml-not-encoding-utf-8-correctly
[2] https://github.com/php/php-src/blob/php-8.1.26/ext/dom/document.c#L1855
[3] https://gnome.pages.gitlab.gnome.org/libxml2/devhelp/libxml2-HTMLparser.html
parent 2c7e000120
commit d4da4dcc32
@@ -72,7 +72,7 @@ class Sanitizer {
 $res = trim($str); if (!$res) return '';
 
 $doc = new DOMDocument();
-$doc->loadHTML('<?xml encoding="UTF-8">' . $res);
+$doc->loadHTML('<meta charset="UTF-8">' . $res);
 $xpath = new DOMXPath($doc);
 
 // is it a good idea to possibly rewrite urls to our own prefix?
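
The change above can be checked in isolation with a minimal sketch like the one below. It assumes the PHP `dom` extension is loaded and that the script file is saved as UTF-8. The helper name `load_utf8_fragment` is hypothetical and not part of the Sanitizer class. The error suppression is an extra, added here only to keep libxml warnings out of the output.

```php
<?php
// Hypothetical helper (not in the commit): parse an HTML fragment as UTF-8
// by prepending a <meta charset> hint, as the patched Sanitizer line does.
function load_utf8_fragment(string $html): DOMDocument {
    $doc = new DOMDocument();

    // Collect libxml warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML('<meta charset="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}

$doc = load_utf8_fragment('<p>中文</p>');
$p = $doc->getElementsByTagName('p')->item(0);

// If the charset hint is honoured, this prints the original text unchanged.
echo $p->textContent, "\n";
```

Running this against both an older and a newer libxml2 build is one way to confirm that the text survives the round trip on each.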