Python - strip HTML tags from a string, but keep links in a changed format

Is there a way to remove all HTML tags from a string, but keep the links and change how they are represented? Example:

description: <p>Animation params. For other animations, see <a href="#myA.animation">myA.animation</a> and the animation parameter under the API methods.  The following properties are supported:</p> 
<dl> 
    <dt>duration</dt> 
    <dd>The duration of the animation in milliseconds.</dd> 
<dt>easing</dt> 
<dd>A string reference to an easing function set on the <code>Math</code> object. See <a href="http://example.com">demo</a>.</dd> 
</dl> 
<p>

I want to replace

<a href="#myA.animation">myA.animation</a>

with just 'myA.animation', and

<a href="http://example.com">demo</a>

with 'demo: http://example.com'.

EDIT: this seems to work now:

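from bs4 import BeautifulSoup

def cleanComment(comment):
    soup = BeautifulSoup(comment, 'html.parser')
    for m in soup.find_all('a'):
        # Only rewrite links whose serialized form appears verbatim in the input.
        if str(m) in comment:
            # External links become "href : text"; "#..." anchors are left for get_text().
            if not m['href'].startswith("#"):
                comment = comment.replace(str(m), m['href'] + " : " + m.get_text())
    # Re-parse and drop all remaining tags, keeping only the text.
    soup = BeautifulSoup(comment, 'html.parser')
    comment = soup.get_text()
    return comment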
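
For comparison, here is a minimal sketch of the same idea using only the standard library's html.parser instead of BeautifulSoup. It is not the code from the post: the names LinkTextExtractor and strip_tags_keep_links are made up for illustration, and it emits the 'demo: http://example.com' order requested in the question rather than the "href : text" order of the function above.

```python
from html.parser import HTMLParser


class LinkTextExtractor(HTMLParser):
    """Collect text, rewriting <a> elements as 'text' or 'text: href'."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []    # output fragments, joined at the end
        self.href = None   # href of the <a> currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.href = dict(attrs).get('href') or ''

    def handle_endtag(self, tag):
        if tag == 'a' and self.href is not None:
            # In-page anchors ("#...") keep only their text;
            # every other link gets ": href" appended after its text.
            if self.href and not self.href.startswith('#'):
                self.parts.append(': ' + self.href)
            self.href = None

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags_keep_links(html):
    # Hypothetical helper name, not part of the original post.
    parser = LinkTextExtractor()
    parser.feed(html)
    parser.close()
    return ''.join(parser.parts)


sample = ('See <a href="#myA.animation">myA.animation</a> and '
          '<a href="http://example.com">demo</a>.')
result = strip_tags_keep_links(sample)
print(result)  # See myA.animation and demo: http://example.com.
```

Because the parser walks the document once and never searches the raw string for a serialized tag, it does not depend on the link markup round-tripping exactly.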

Source

2017-02-23 Ratka

Is the HTML in your example a global rule for you? Or are there some links you want to keep and others you don't? – arieljannai

Yes, there are only two kinds of links. – Ratka

This regex should work for you: (?=href="http)(?=(?=.*?">(.*?)<)(?=.*?"(https?:\/\/.*?)"))|"#(.*?)"

You can try it over here. In Python:

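import re

# Read the HTML to scan; 'textfile' is the input file name used in this example.
with open('textfile', 'r') as f:
    text = f.read()

# Groups: (link text, http(s) URL) for external links, or (anchor name) for "#..." links.
pattern = r'(?=href="http)(?=(?=.*?">(.*?)<)(?=.*?"(https?:\/\/.*?)"))|"#(.*?)"'
matches = re.findall(pattern, text)

strings = []
for m in matches:
    # Drop the empty groups and join what is left with ': '.
    strings.append(': '.join(filter(bool, m)))

print(strings)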

The result will be: ['myA.animation', 'demo: http://example.com']
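
Note that re.findall only returns the list of link strings; it does not produce the stripped text. A rough sketch of how the rewrite could be done in place with re.sub follows. The patterns and the names LINK_RE, TAG_RE, rewrite_link and strip_with_regex are assumptions for illustration, not part of the answer, and they only cover simple links with a double-quoted href and no nested tags; regexes are fragile on general HTML.

```python
import re

# Assumed simple link shape: <a href="...">text</a>, double-quoted href, no nested tags.
LINK_RE = re.compile(r'<a\s+href="([^"]*)"[^>]*>(.*?)</a>', re.DOTALL)
# Any remaining tag, opening or closing.
TAG_RE = re.compile(r'<[^>]+>')


def rewrite_link(match):
    href, text = match.group(1), match.group(2)
    # "#..." anchors keep only their text; other links become "text: href".
    return text if href.startswith('#') else text + ': ' + href


def strip_with_regex(html):
    # First rewrite the links, then drop every tag that is left.
    without_links = LINK_RE.sub(rewrite_link, html)
    return TAG_RE.sub('', without_links)


sample = ('<dd>See <a href="#myA.animation">myA.animation</a> and '
          '<a href="http://example.com">demo</a>.</dd>')
stripped = strip_with_regex(sample)
print(stripped)  # See myA.animation and demo: http://example.com.
```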

Source

2017-02-23 11:26:39 arieljannai

Whoa, I don't know how to get started with that in code – Ratka

I added a Python example for you – arieljannai

Thank you, but I found a solution that works for me – Ratka
