This issue has been moved from the zendframework repository as part of the bug migration program as outlined here - http://framework.zend.com/blog/2016-04-11-issue-closures.html
Original Issue: https://api.github.com/repos/zendframework/zendframework/issues/7618
User: @mtrippodi
Created On: 2015-08-26T13:51:12Z
Updated At: 2015-11-06T22:17:32Z
Body
use Zend\Dom\Query;
use Zend\Debug\Debug;
$html = '<div><h1>ßüöä</h1></div>';
$dom = new Query($html);
$nodes = $dom->execute('h1');
Debug::dump($nodes->current()->nodeValue);
...will result in something like:
Ãüöä
$html = '<div><h1>ßüöä</h1></div>';
$dom = new Query(utf8_decode($html));
$nodes = $dom->execute('h1');
Debug::dump($nodes->current()->nodeValue);
... will solve the problem and result in correct rendering.
For convenience I extended Zend\Dom\Query:
<?php
namespace MyNamespace\Dom;

use Zend\Dom\Query as ZF2Query;

class Query extends ZF2Query
{
    /**
     * Set document to query. If is UTF-8: decode.
     *
     * @param string $document
     * @param null|string $encoding Document encoding
     * @return Query
     */
    public function setDocument($document, $encoding = null)
    {
        if (0 === strlen($document)) {
            return $this;
        }
        $_encoding = empty($encoding) ? $this->getEncoding() : $encoding;
        if ($_encoding == 'UTF-8') {
            $document = utf8_decode($document);
        }
        return parent::setDocument($document, $encoding);
    }
}
Now I wonder whether this could be implemented in Zend\Dom\Query itself. Or am I missing something, and there's a better solution?
Thanks
m.
Comment
User: @mtrippodi
Created On: 2015-08-26T18:15:20Z
Updated At: 2015-08-26T19:17:05Z
Body
OK, forget my first "solution". It's bad because e.g. ...
$html = '<div><h1>€</h1></div>';
$dom = new Query(utf8_decode($html));
$nodes = $dom->execute('h1');
Debug::dump($nodes->current()->nodeValue);
...will result in:
?
This is because all that utf8_decode() does is convert a UTF-8 encoded string to ISO-8859-1. That is of course not good, because UTF-8 can represent many more characters than ISO-8859-1. See this comment at PHP Man.
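To make the loss concrete, here is a minimal sketch. It assumes the mbstring extension is available and uses mb_convert_encoding() as a stand-in for utf8_decode() (which is deprecated as of PHP 8.2); toLatin1() is a hypothetical helper name, not part of any library.

```php
<?php
// Sketch: converting UTF-8 to ISO-8859-1 is lossy for characters outside Latin-1.
// Assumes ext-mbstring; toLatin1() is a hypothetical helper for illustration.

function toLatin1(string $utf8): string
{
    // Characters with no ISO-8859-1 equivalent are replaced by mbstring's
    // substitute character, which is "?" unless it has been reconfigured.
    return mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
}

// "ß" (U+00DF) exists in ISO-8859-1 and becomes the single byte 0xDF.
echo bin2hex(toLatin1('ß')), PHP_EOL;

// "€" (U+20AC) does not exist in ISO-8859-1, so the character is lost.
echo toLatin1('€'), PHP_EOL;
```

So decoding only appears to work as long as every character in the document happens to fit into ISO-8859-1.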
The real problem is that DOMDocument::loadHTML() will by default always treat the source string as ISO-8859-1 encoded. Unfortunately, you can only change this behavior by specifying the encoding in the HTML head at the beginning of the source string. This comment at PHP Man still seems to apply, even though it is 10 years old and UTF-8 is so common nowadays!
So, based on this comment I again extended Zend\Dom\Query as follows:
<?php
namespace MyNamespace\Dom;

use Zend\Dom\Query as ZF2Query;

class Query extends ZF2Query
{
    /**
     * Set document to query
     *
     * @param string $document
     * @param null|string $encoding Document encoding
     * @return Query
     */
    public function setDocument($document, $encoding = null)
    {
        if (0 === strlen($document)) {
            return $this;
        }
        $prepend = '';
        $_encoding = empty($encoding) ? $this->getEncoding() : $encoding;
        if (!empty($_encoding) && strtolower($_encoding) != 'iso-8859-1') {
            $prepend = sprintf('<?xml encoding="%s">', $_encoding);
        }
        // breaking XML declaration to make syntax highlighting work
        if ('<' . '?xml' == substr(trim($document), 0, 5)) {
            if (preg_match('/<html[^>]*xmlns="([^"]+)"[^>]*>/i', $document, $matches)) {
                $this->xpathNamespaces[] = $matches[1];
                return $this->setDocumentXhtml($prepend . $document, $encoding);
            }
            return $this->setDocumentXml($document, $encoding);
        }
        if (strstr($document, 'DTD XHTML')) {
            return $this->setDocumentXhtml($prepend . $document, $encoding);
        }
        return $this->setDocumentHtml($prepend . $document, $encoding);
    }
}
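The prefix trick can also be tried against plain DOMDocument, without Zend\Dom\Query in the picture. The following is only a sketch: withEncodingHint() and firstHeading() are hypothetical helper names, and what the unhinted call returns depends on the installed libxml2 version, so no output is claimed here.

```php
<?php
// Sketch: give DOMDocument::loadHTML() an explicit encoding hint.
// withEncodingHint() and firstHeading() are hypothetical helpers,
// not part of ext-dom or Zend\Dom.

function withEncodingHint(string $html, string $encoding = 'UTF-8'): string
{
    // Same prefix that the subclass in this thread prepends to the document.
    return sprintf('<?xml encoding="%s">', $encoding) . $html;
}

function firstHeading(string $html): ?string
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $node = $doc->getElementsByTagName('h1')->item(0);
    return $node === null ? null : $node->nodeValue;
}

$html = '<div><h1>ßüöä€</h1></div>';

var_dump(firstHeading($html));                   // no hint: may come back garbled
var_dump(firstHeading(withEncodingHint($html))); // with hint: expected to survive intact
```

Comparing the two dumps on your own setup should show whether the hint is what makes the difference.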
Still, two questions remain:
- Is this the best solution?
- Should a solution be implemented in Zend\Dom\Query?
Comment
User: @croensch
Created On: 2015-08-28T14:15:05Z
Updated At: 2015-08-28T14:15:05Z
Body
AFAIK, if no header is present, the passed encoding is used; if the header is present, the passed encoding is ignored. So if your documents are always in ISO-8859-1, then just try setDocument() as it is?
Originally posted by @GeeH at https://github.com/zendframework/zend-dom/issues/10