How to extract specific strings in Python

I'm trying to extract specific strings from the following line of markup and save them (the line needs some more complex processing):

<center><img border="0" src="http://www.world-of-waterfalls.com/images/Cascades_04_015L.jpg" WIDTH="500" HEIGHT="375" alt="Looking up the Merced River Canyon towards Bridalveil Fall from the Big Oak Flat Road" ***PINIT***></center><br clear="all"><br clear="all">

What I want to save is:

tempUrl = 'http://www.world-of-waterfalls.com/images/Cascades_04_015L.jpg' 

tempWidth = 500 

tempHeight = 375 

tempAlt = 'Looking up the Merced River Canyon towards Bridalveil Fall from the Big Oak Flat Road'

How would I go about doing that in Python, assuming, for example, that I read lines in from a file and the current line is the one above?

Thanks

Source

2016-12-15 Johnny

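Let me save you the trouble and tell you that regular expressions are not the right tool for this. Don't even try it; you will only end up banging your head later. If the data comes from a web source, look at BeautifulSoup, scrapy, or another scraping library. If you already have the markup, just use a parser to walk the nodes and collect the attribute information.

[`HTMLParser`](https://docs.python.org/2/library/htmlparser.html) or [`html.parser`](https://docs.python.org/3.4/library/html.parser.html), depending on your Python version.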
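As a rough illustration of the parser suggestion above, here is a minimal sketch using the standard library's `html.parser` on Python 3. The class name `ImgAttrCollector` and the variable `line` are made up for this example; only `HTMLParser`, `handle_starttag`, `feed`, and `close` come from the library.

```python
from html.parser import HTMLParser


class ImgAttrCollector(HTMLParser):
    """Hypothetical helper: collects the attributes of every <img> tag it sees."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names,
        # so WIDTH and HEIGHT arrive as "width" and "height".
        if tag == "img":
            self.images.append(dict(attrs))


# The line from the question, assumed to have been read from a file already.
line = '<center><img border="0" src="http://www.world-of-waterfalls.com/images/Cascades_04_015L.jpg" WIDTH="500" HEIGHT="375" alt="Looking up the Merced River Canyon towards Bridalveil Fall from the Big Oak Flat Road" ***PINIT***></center><br clear="all"><br clear="all">'

collector = ImgAttrCollector()
collector.feed(line)
collector.close()

img = collector.images[0]
tempUrl = img["src"]
tempWidth = int(img["width"])    # attribute values are strings, so convert
tempHeight = int(img["height"])
tempAlt = img["alt"]
```

This avoids a third-party dependency, at the cost of writing a small handler class yourself.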

You could get away with a few approaches here, but I recommend using an HTML parser, which is extensible and can cope with many of the quirks of HTML. Here is a working example with BeautifulSoup:

>>> from bs4 import BeautifulSoup 
>>> string = """<center><img border="0" src="http://www.world-of-waterfalls.com/images/Cascades_04_015L.jpg" WIDTH="500" HEIGHT="375" alt="Looking up the Merced River Canyon towards Bridalveil Fall from the Big Oak Flat Road" ***PINIT***></center><br clear="all"><br clear="all">""" 
>>> soup = BeautifulSoup(string, 'html.parser') 
>>> for attr in ['width', 'height', 'alt']: 
...  print('temp{} = {}'.format(attr.title(), soup.img[attr])) 
... 
tempWidth = 500 
tempHeight = 375 
tempAlt = Looking up the Merced River Canyon towards Bridalveil Fall from the Big Oak Flat Road

Source

2016-12-15 17:16:51 brianpck

I finally got bs4 installed, and this is a beautiful solution. Thanks! – Johnny

And the regex approach:

import re

string = "YOUR STRING"  # replace with the line of markup
# Capture the src, WIDTH, HEIGHT and alt values, in that order.
matches = re.findall(r'src="(.*?)".*WIDTH="(.*?)".*HEIGHT="(.*?)".*alt="(.*?)"', string)[0]
tempUrl = matches[0]
tempWidth = matches[1]
tempHeight = matches[2]
tempAlt = matches[3]

All of the values are strings, though, so cast them if you need to...
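For instance, the cast could look like the sketch below, which applies the same pattern to the line from the question. The names `line`, `pattern`, `width_text`, and `height_text` are only illustrative.

```python
import re

# The line from the question, assumed to have been read from a file already.
line = '<center><img border="0" src="http://www.world-of-waterfalls.com/images/Cascades_04_015L.jpg" WIDTH="500" HEIGHT="375" alt="Looking up the Merced River Canyon towards Bridalveil Fall from the Big Oak Flat Road" ***PINIT***></center><br clear="all"><br clear="all">'

pattern = r'src="(.*?)".*WIDTH="(.*?)".*HEIGHT="(.*?)".*alt="(.*?)"'

# findall returns a list of 4-tuples (one per match); take the first match.
tempUrl, width_text, height_text, tempAlt = re.findall(pattern, line)[0]

# The captured groups are strings, so convert the numeric ones explicitly.
tempWidth = int(width_text)
tempHeight = int(height_text)
```

Note that `int()` raises `ValueError` if the attribute holds something other than a plain integer, such as `"50%"`.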

Also be aware that copying and pasting regular expressions is a bad idea. Mistakes creep in easily.

Source
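To make that risk concrete, here is a small sketch of how the pattern above silently stops matching when the markup changes slightly. The three sample tags and the helper `extract` are invented for this illustration and are not taken from the question.

```python
import re

PATTERN = re.compile(r'src="(.*?)".*WIDTH="(.*?)".*HEIGHT="(.*?)".*alt="(.*?)"')


def extract(line):
    """Return (url, width, height, alt) as strings, or None if there is no match."""
    match = PATTERN.search(line)
    return match.groups() if match else None


# Hypothetical inputs: same information, different attribute order or case.
original = '<img src="a.jpg" WIDTH="500" HEIGHT="375" alt="A waterfall">'
reordered = '<img alt="A waterfall" src="a.jpg" HEIGHT="375" WIDTH="500">'
lowercase = '<img src="a.jpg" width="500" height="375" alt="A waterfall">'

print(extract(original))   # the attribute order the pattern was written for
print(extract(reordered))  # None: the pattern depends on attribute order
print(extract(lowercase))  # None: the pattern is case-sensitive
```

Returning `None` at least avoids the `IndexError` that indexing `[0]` on an empty `findall` result would raise, but an HTML parser handles both variations without any special casing.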

2016-12-15 17:44:09
