Error in PHP simple_html_dom parser

I ran into an error while using the simple_html_dom class.

I need to parse an HTML string. I tried to get the meta tag named image with find('meta[name=image]') from the following HTML:

<!DOCTYPE html> 
<html lang="en"> 
<head> 
<title>Y-shaped ZnO Nanobelts Driven from Twinned</title> 

<meta name="site" content="Reports"/> 

<meta name="description" content="Description with twinned planes {11&#"/> 

<meta name="image" content="https://cdn.sstatic.net/Sites/stackoverflow/img/apple-touch-icon.png?v=c78bd457575a"/> 


... 


</body> 
</html>

However, I could not get it.

When I looked into the cause, it turned out to be the characters "&#" in the middle of the following line:

<meta name="description" content="Description with twinned planes {11&#"/>

So in this case simple_html_dom takes the content attribute of that meta tag to be

Description with twinned planes {11&#"/> <meta name="image" ....

What should I do to make simple_html_dom parse this HTML correctly?

Failing that, is there a library that parses this HTML correctly?

Source

2017-07-28 Cuza

Isn't the problem that {11&# should have been written as {11&amp;#?
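Since the unterminated "&#" is what trips the parser, one possible workaround is to escape every ampersand that does not start a complete character reference before handing the string to the parser. The sketch below is only an illustration: escape_stray_ampersands is a hypothetical helper, not part of simple_html_dom, and whether this preprocessing is enough for simple_html_dom has not been verified here.

```php
<?php
// Hypothetical helper: escape every "&" that does not begin a complete
// named (&amp;), decimal (&#123;) or hexadecimal (&#x1F;) character reference.
function escape_stray_ampersands(string $html): string
{
    return preg_replace(
        '/&(?![A-Za-z][A-Za-z0-9]*;|#[0-9]+;|#[xX][0-9A-Fa-f]+;)/',
        '&amp;',
        $html
    ) ?? $html; // fall back to the input if the regex engine reports an error
}

$line = '<meta name="description" content="Description with twinned planes {11&#"/>';
$clean = escape_stray_ampersands($line);
echo $clean, PHP_EOL;

// Assumption: the cleaned string would then be passed on to simple_html_dom,
// for example: $dom = str_get_html(escape_stray_ampersands($html));
```

One limitation of this pattern: legacy references written without a trailing semicolon (such as &copy) are also escaped, so it suits markup where such shorthand is not expected.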

Try this code. It uses PHP's DOMDocument.

You can get the meta tags with getElementsByTagName and read an attribute value with getAttribute.

$hml = '<!DOCTYPE html> 
<html lang="en"> 
<head> 
<title>Y-shaped ZnO Nanobelts Driven from Twinned</title> 

<meta name="site" content="Reports"/> 

<meta name="description" content="Description with twinned planes {11&#"/> 

<meta name="image" content="https://cdn.sstatic.net/Sites/stackoverflow/img/apple-touch-icon.png?v=c78bd457575a"/> 
</head> 
<body> 

</body> 
</html>'; 

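// Collect libxml parse warnings internally instead of emitting them;
// the stray "&#" in the description would otherwise raise a warning.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($hml);

// Walk every <meta> element and print the content of the one named "image".
$metas = $dom->getElementsByTagName('meta');

foreach ($metas as $meta) {
    if ($meta->getAttribute('name') === 'image') {
        echo $meta->getAttribute('content');
    }
}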

Output:

https://cdn.sstatic.net/Sites/stackoverflow/img/apple-touch-icon.png?v=c78bd457575a
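As a variation on the loop above, the same lookup can be written as a single XPath query, which is closer in spirit to find('meta[name=image]'). This is a minimal sketch, assuming the standard DOMXPath class from PHP's DOM extension; the variable names and the trimmed copy of the markup are chosen for illustration.

```php
<?php
// Trimmed copy of the markup from the question (variable name is illustrative).
$html = '<!DOCTYPE html>
<html lang="en">
<head>
<title>Y-shaped ZnO Nanobelts Driven from Twinned</title>
<meta name="description" content="Description with twinned planes {11&#"/>
<meta name="image" content="https://cdn.sstatic.net/Sites/stackoverflow/img/apple-touch-icon.png?v=c78bd457575a"/>
</head>
<body></body>
</html>';

// Keep libxml warnings about the malformed "&#" out of the output.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// Select <meta> elements whose name attribute equals "image".
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//meta[@name="image"]');

// Take the first match, or null when the tag is absent.
$image = $nodes->length > 0 ? $nodes->item(0)->getAttribute('content') : null;
echo $image, PHP_EOL;
```

The query returns a DOMNodeList, so checking its length before calling item(0) avoids calling getAttribute on null when no such tag exists.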

Note: if you are loading the content from a page, replace this call: $dom->loadHTML($hml);
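A minimal sketch of reading the markup from a file rather than from a string, assuming DOMDocument::loadHTMLFile() is a suitable substitute here. The temporary file and the example.com address are placeholders standing in for a real page.

```php
<?php
// Stand-in for a real page: write sample markup to a temporary file.
$path = tempnam(sys_get_temp_dir(), 'meta');
file_put_contents(
    $path,
    '<html><head><meta name="image" content="https://example.com/icon.png"/></head><body></body></html>'
);

libxml_use_internal_errors(true);
$dom = new DOMDocument();
// loadHTMLFile() takes a path; a URL may also be accepted, depending on the PHP configuration.
$loaded = $dom->loadHTMLFile($path);
libxml_clear_errors();
unlink($path);

// Same lookup as before: find the <meta> element named "image".
$content = null;
if ($loaded) {
    foreach ($dom->getElementsByTagName('meta') as $meta) {
        if ($meta->getAttribute('name') === 'image') {
            $content = $meta->getAttribute('content');
            break;
        }
    }
}
echo $content, PHP_EOL;
```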

Source

2017-07-28 12:06:34 NID
