Parsing HTML file contents with BeautifulSoup

I have 30,911 HTML files in a folder. I need to (1) check whether each file contains the tag

<strong>123</strong>

and (2) extract the content that follows it, up to the end of that section.

My problem is that in some files the section ends before

<strong>567</strong>

while other files have no such tag, and the section ends before another p whose p_number is

<strong>89</strong> or others (that I do not know, because I can't check 30K+ files).

Every file has this, but sometimes the id is missing.

At first I searched with BeautifulSoup, but I don't know how to extract the content that follows.

By the way, is it possible to save the content as a txt file but keep it looking like the HTML layout?

line 1 
line 2 
... 
line 50

If I use p.get_text(strip=True), everything runs together:

line1 content line2 content ... 
line50 content....

Source

2017-05-28 Michael Lin

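If I understand you correctly, you can first find the starting point: the p element that has a strong element with the "Question-and-Answer Session" text. Then you can iterate over the next siblings of that p element until you hit one that has a strong element with the "Copyright policy" text.

Complete reproducible example: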

import re 

from bs4 import BeautifulSoup 


data = """ 
<body> 
    <p class="p p4" id="question-answer-session"> 
     <strong> 
     Question-and-Answer Session 
     </strong> 
    </p> 

    <p class="p p4"> 
     Hi John and Greg, good afternoon. contents.... 
    </p> 

    <p class="p p14"> 
     <strong> 
     Copyright policy: 
     </strong> 
     other content about the policy.... 
    </p> 
</body> 
""" 

soup = BeautifulSoup(data, "html.parser") 

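def find_question_answer(tag):
    # Match a <p> that contains a <strong> with the section title.
    # `string=` is the current name of the older `text=` argument (Beautiful Soup 4.4.0+).
    return tag.name == 'p' and tag.find("strong", string=re.compile(r"Question-and-Answer Session"))

question_answer = soup.find(find_question_answer)
# Walk the following sibling <p> elements until the copyright paragraph.
for p in question_answer.find_next_siblings("p"):
    if p.find("strong", string=re.compile(r"Copyright policy")):
        break

    print(p.get_text(strip=True))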

Prints:

Hi John and Greg, good afternoon. contents....
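The question mentions that some files lack the starting tag and that the section may end at different markers. Below is a minimal sketch of one way to handle that, assuming the same sibling-p layout as the example above. The function `extract_section`, its parameters, and the extra "Disclaimer" stop pattern are hypothetical names chosen for illustration, not part of the original answer. It also passes `separator="\n"` to `get_text` so that text split by tags such as `<br>` stays on separate lines instead of running together.

```python
import re

from bs4 import BeautifulSoup


def extract_section(html, start_pattern, stop_patterns):
    """Return the text of each <p> after the start marker, up to the first stop marker.

    Returns None when no <p> contains a <strong> matching start_pattern.
    (Hypothetical helper; the patterns are supplied by the caller.)
    """
    soup = BeautifulSoup(html, "html.parser")
    start_re = re.compile(start_pattern)
    stop_res = [re.compile(pattern) for pattern in stop_patterns]

    def has_strong(tag, regex):
        # True if any <strong> inside the tag has text matching the regex.
        return any(regex.search(s.get_text()) for s in tag.find_all("strong"))

    start = soup.find(lambda tag: tag.name == "p" and has_strong(tag, start_re))
    if start is None:
        return None  # start marker missing in this file

    paragraphs = []
    for p in start.find_next_siblings("p"):
        if any(has_strong(p, regex) for regex in stop_res):
            break
        # separator="\n" keeps text fragments on their own lines.
        paragraphs.append(p.get_text(separator="\n", strip=True))
    return paragraphs


sample = """
<body>
  <p id="question-answer-session"><strong>Question-and-Answer Session</strong></p>
  <p>Hi John and Greg,<br>good afternoon.</p>
  <p><strong>Copyright policy:</strong> other content</p>
</body>
"""

result = extract_section(
    sample,
    r"Question-and-Answer Session",
    [r"Copyright policy", r"Disclaimer"],  # "Disclaimer" is an assumed extra marker
)
print(result)
```

Returning None for a missing start marker lets a loop over all the files record which ones need a different rule, rather than failing with an AttributeError when `find_next_siblings` is called on None.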

Source

2017-05-28 03:23:11 alecxe

If I write the content to a new HTML file, the formatting gets messed up.

@MichaelLin okay, which part do you want to write to the file? - alecxe

I think I solved it. I just save the content before the copyright, using p.prettify().encode('ascii', 'ignore').decode('utf-8', 'ignore')
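Note that encoding to ASCII with 'ignore' silently drops every non-ASCII character. As a hedged alternative, here is a minimal sketch that writes each paragraph's own markup to a UTF-8 file, so the HTML structure and any accented characters survive. The helper name `save_paragraphs` and the output file name are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup


def save_paragraphs(paragraphs, out_path):
    """Write the HTML markup of each tag to out_path, one tag per line.

    (Hypothetical helper; str(tag) returns the tag's markup.)
    """
    html = "\n".join(str(p) for p in paragraphs)
    Path(out_path).write_text(html, encoding="utf-8")


soup = BeautifulSoup("<p>line 1<br/>line 2</p><p>café</p>", "html.parser")

# Example output location in the system temp directory (assumed file name).
out_file = Path(tempfile.gettempdir()) / "section.html"
save_paragraphs(soup.find_all("p"), out_file)
print(out_file.read_text(encoding="utf-8"))
```

If indented, human-readable markup is preferred, `p.prettify()` can be used in place of `str(p)` inside the helper.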
