Non-recursive find_all with the lxml builder

I found that I can't run a non-recursive bs4.BeautifulSoup.find_all on Python 2.7 when using the lxml builder.

Take the following example HTML snippet:

<p> <b> Cats </b> are interesting creatures </p> 

<p> <b> Dogs </b> are cool too </p> 

<div> 
<p> <b> Penguins </b> are pretty neat, but they're inside a div </p> 
</div> 

<p> <b> Llamas </b> don't live in New York </p>

Say I want to find all p elements that are direct children. I make the search non-recursive with find_all("p", recursive=False).

To test this, I set the HTML snippet above to a variable named html. I then created two BeautifulSoup instances, a and b:

a = bs4.BeautifulSoup(html, "html.parser")
b = bs4.BeautifulSoup(html, "lxml")

With a normal find_all, both of them work correctly:

>>> a.find_all("p") 
[<p> <b> Cats </b> are interesting creatures </p>, <p> <b> Dogs </b> are cool too </p>, <p> <b> Penguins </b> are pretty neat, but they're inside a div </p>, <p> <b> Llamas </b> don't live in New York </p>] 
>>> b.find_all("p") 
[<p> <b> Cats </b> are interesting creatures </p>, <p> <b> Dogs </b> are cool too </p>, <p> <b> Penguins </b> are pretty neat, but they're inside a div </p>, <p> <b> Llamas </b> don't live in New York </p>]

But if I turn off recursive searching, only a works. b returns an empty list:

>>> a.find_all("p", recursive=False) 
[<p> <b> Cats </b> are interesting creatures </p>, <p> <b> Dogs </b> are cool too </p>, <p> <b> Llamas </b> don't live in New York </p>] 
>>> b.find_all("p", recursive=False) 
[]

Why is this? Is it a bug, or am I doing something wrong? Does the lxml builder support non-recursive find_all?

Source

2016-03-25 Luke Taylor

This is because the lxml parser wraps your HTML in html/body elements when they are not present:

>>> b = bs4.BeautifulSoup(html, "lxml") 
>>> print(b) 
<html><body><p> <b> Cats </b> are interesting creatures </p> 
<p> <b> Dogs </b> are cool too </p> 
<div> 
<p> <b> Penguins </b> are pretty neat, but they're inside a div </p> 
</div> 
<p> <b> Llamas </b> don't live in New York </p> 
</body></html>

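Non-recursive find_all() looks only at the direct children of the object it is called on, so it never reaches the p elements: the top level of b contains only the html element, and the only child of html is body. Calling b.body.find_all("p", recursive=False) searches the direct children of body instead.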
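To make the explanation above concrete, here is a minimal sketch. The helper names top_level_tags and direct_paragraphs are hypothetical (they are not part of bs4), and the fallback to the soup object itself is an assumption about how you would want html.parser input handled, where no body element is added.

```python
from bs4 import BeautifulSoup

HTML = """<p> <b> Cats </b> are interesting creatures </p>
<p> <b> Dogs </b> are cool too </p>
<div>
<p> <b> Penguins </b> are pretty neat, but they're inside a div </p>
</div>
<p> <b> Llamas </b> don't live in New York </p>"""


def top_level_tags(markup, parser):
    """Names of the tags sitting directly under the BeautifulSoup object."""
    soup = BeautifulSoup(markup, parser)
    # True matches every tag; recursive=False limits the search to direct children.
    return [tag.name for tag in soup.find_all(True, recursive=False)]


def direct_paragraphs(markup, parser):
    """Hypothetical helper: <p> elements directly under the document root,
    whichever parser built the tree."""
    soup = BeautifulSoup(markup, parser)
    # lxml adds <html><body>; html.parser does not, so soup.body is None there.
    root = soup.body if soup.body is not None else soup
    return root.find_all("p", recursive=False)
```

With "html.parser", top_level_tags(HTML, "html.parser") lists the p and div elements themselves. With "lxml" it should list only html, which is why the non-recursive search on b comes back empty, while direct_paragraphs(HTML, "lxml") should return the same three paragraphs that a returned.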

Source

2016-03-25 18:33:13 alecxe

This seems inconsistent to me. Why should different parsers behave differently like this?

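@LukeTaylor It can be confusing, I agree. There is some information in the [Differences between parsers](http://www.crummy.com/software/BeautifulSoup/bs4/doc/#differences-between-parsers) paragraph of the documentation. It all comes down to different parsers turning non-well-formed HTML into valid HTML in different ways. – alecxe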
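One way to see these differences for yourself is to serialize the same markup after parsing it with each installed parser. This is only a sketch: parse_with_each is a hypothetical helper name, and the default parser list is an assumption about what you might have installed. Parsers that are missing are skipped.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def parse_with_each(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Serialize the same markup after parsing it with each available parser."""
    results = {}
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            # The parser library is not installed; skip it.
            continue
    return results
```

For the invalid fragment "<a></p>" used in the Beautiful Soup documentation, html.parser yields <a></a>, while lxml, if installed, should yield the same element wrapped in html and body.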
