Get all HTML elements using lxml

I am trying to parse a large div tag in an HTML document, and I need to get all the HTML tags and nested tags inside that div. My code:

innerTree = fromstring(str(response.text))
print("The tags inside the target div are")
print(innerTree.cssselect('div.story-body__inner'))

But it prints:

[<Element div at 0x66daed0>]

I want it to return all the HTML tags inside the div. How can I do this with lxml?


2017-02-17 Mehdi

In my case BeautifulSoup did not work very well and gave me incorrect HTML tags, so I had to find the solution above. – Mehdi

lxml is a great library. There is no need to use BeautifulSoup or anything else. Here is how to get the additional information you are looking for:

# import lxml HTML parser and HTML output function 
from __future__ import print_function 
from lxml.html import fromstring 
from lxml.etree import tostring as htmlstring 

# test HTML for demonstration 
raw_html = """ 
    <div class="story-body__inner"> 
     <p>Test para with <b>subtags</b></p> 
     <blockquote>quote here</blockquote> 
     <img src="..."> 
    </div> 
""" 

# parse the HTML into a tree structure 
innerTree = fromstring(raw_html) 

# find the divs you want 
# first by finding all divs with the given CSS selector 
divs = innerTree.cssselect('div.story-body__inner') 

# but that returns a list, so grab the first of those 
div0 = divs[0] 

# print that div, and its full HTML representation 
print(div0) 
print(htmlstring(div0)) 

# now to find sub-items 

print('\n-- etree nodes') 
for e in div0.xpath(".//*"): 
    print(e) 

print('\n-- HTML tags') 
for e in div0.xpath(".//*"): 
    print(e.tag) 

print('\n-- full HTML text') 
for e in div0.xpath(".//*"): 
    print(htmlstring(e))

Note that lxml node functions such as cssselect and xpath return lists, not single nodes. To get at the nodes they contain, you have to index into those lists.
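Because these queries return lists, indexing with [0] raises an IndexError when nothing matches. A minimal sketch of one way to guard against that follows; the helper name first_or_none and the sample HTML are assumptions made for illustration, not part of lxml or of the original answer. It uses xpath rather than cssselect only so that it does not depend on the separate cssselect package.

```python
from lxml.html import fromstring

# Hypothetical sample document, for illustration only.
raw_html = """
<html><body>
  <div class="story-body__inner"><p>Test para</p></div>
</body></html>
"""
tree = fromstring(raw_html)


def first_or_none(nodes):
    """Return the first node of a query result, or None if nothing matched.

    Hypothetical helper, not an lxml function.
    """
    return nodes[0] if nodes else None


# xpath() always returns a list for node queries, even for a single match.
found = first_or_none(tree.xpath('//div[@class="story-body__inner"]'))
missing = first_or_none(tree.xpath('//div[@class="no-such-class"]'))

# Compare against None explicitly; do not rely on the truthiness of an element.
print(found.tag if found is not None else "no match")
print(missing)
```

The same pattern applies to the list returned by cssselect.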

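When it comes to getting all of the sub-tags or sub-HTML, you might want the ElementTree nodes, the tag names, or the full HTML text of those nodes. This code demonstrates all three. It does so with an XPath query. Sometimes CSS selectors are more convenient, and sometimes XPath is. In this case, the XPath query .//* means "return all nodes, at any depth and with any tag name, beneath the current node."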
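To make the meaning of .//* concrete, the sketch below compares it with ./* (direct children only) and with the element method iterdescendants(), on the same test HTML wrapped in a full document. The variable names are illustrative. Note that iterdescendants() would also yield comment nodes if the markup contained any, which .//* does not; this sample has none.

```python
from lxml.html import fromstring

# Same test markup as in the answer, wrapped in <html><body> for clarity.
raw_html = """
<html><body>
<div class="story-body__inner">
  <p>Test para with <b>subtags</b></p>
  <blockquote>quote here</blockquote>
  <img src="...">
</div>
</body></html>
"""
tree = fromstring(raw_html)
div0 = tree.xpath('//div[@class="story-body__inner"]')[0]

# .//* : every element below div0, at any depth, in document order
all_tags = [e.tag for e in div0.xpath(".//*")]

# ./* : only the direct children of div0
child_tags = [e.tag for e in div0.xpath("./*")]

# iterdescendants() walks the same subtree without an XPath query
iter_tags = [e.tag for e in div0.iterdescendants()]

print(all_tags)
print(child_tags)
print(iter_tags)
```

The nested b element appears in the first and third lists but not in the second, because it is a grandchild of the div rather than a direct child.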

Running this under Python 2 gives the result below. (Under Python 3 the output text is slightly different, because etree.tostring returns bytes rather than Unicode strings there.)

<Element div at 0x106eac8e8> 
<div class="story-body__inner"> 
     <p>Test para with <b>subtags</b></p> 
     <blockquote>quote here</blockquote> 
     <img src="..."/> 
    </div> 


-- etree nodes 
<Element p at 0x106eac838> 
<Element b at 0x106eac890> 
<Element blockquote at 0x106eac940> 
<Element img at 0x106eac998> 

-- HTML tags 
p 
b 
blockquote 
img 

-- full HTML text 
<p>Test para with <b>subtags</b></p> 
<b>subtags</b> 
<blockquote>quote here</blockquote> 
<img src="..."/>


2017-02-17 06:22:32

I hope the OP accepts your answer. I misunderstood his question. –

@Mehdi If this answers your question, please mark it as the accepted answer (click the check mark at the top left of the answer). –
