How do I get the quoted strings from an HTML header?

In this HTML snippet, how can I use the Python packages requests or lxml to find the quoted strings that follow href=?

<dl> 
    <dt><a href="oq-phys.htm"> 
     <b>Physics and Astronomy</b></a> 
    <dt><a href="oq-math.htm"> 
     <b>Mathematics</b></a> 
    <dt><a href="oq-life.htm"> 
     <b>Life Sciences</b></a> 
    <dt><a href="oq-tech.htm"> 
     <b>Technology</b></a> 
    <dt><a href="oq-geo.htm"> 
     <b>Earth and Environmental Science</b></a> 
</dl>

Source

2017-11-19 frogfanitw

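The snippet comes from the following Python code: (page = requests.get('http://www.openquestions.com'); print(page.text)) – frogfanitw

You are confusing me by asking about requests and XML as if they were similar things. Do you want to fetch the content of the HTML page, or do you want to parse the content and look for specific parts? – mrCarnivore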
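To illustrate the distinction raised in the comment above: fetching the page and parsing it are separate steps. The following is only a minimal sketch of the parsing step, assuming the snippet from the question is already available as a string. It uses the standard-library `html.parser` module rather than lxml, and the names `SNIPPET`, `HrefCollector` and `extract_hrefs` are hypothetical, introduced here for illustration.

```python
from html.parser import HTMLParser

# The snippet from the question, kept as a plain string (no network access needed).
SNIPPET = """
<dl>
    <dt><a href="oq-phys.htm">
     <b>Physics and Astronomy</b></a>
    <dt><a href="oq-math.htm">
     <b>Mathematics</b></a>
    <dt><a href="oq-life.htm">
     <b>Life Sciences</b></a>
    <dt><a href="oq-tech.htm">
     <b>Technology</b></a>
    <dt><a href="oq-geo.htm">
     <b>Earth and Environmental Science</b></a>
</dl>
"""


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> start tag it encounters."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and attribute names arrive lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def extract_hrefs(html_text):
    """Return the href values of all anchors in html_text, in document order."""
    parser = HrefCollector()
    parser.feed(html_text)
    parser.close()
    return parser.hrefs


print(extract_hrefs(SNIPPET))
# ['oq-phys.htm', 'oq-math.htm', 'oq-life.htm', 'oq-tech.htm', 'oq-geo.htm']
```

With requests, the fetching step would simply supply `page.text` in place of `SNIPPET`.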

A short requests + BeautifulSoup solution to find the quoted strings after href=:

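import requests, bs4 

# Fetch the page and parse it with the built-in 'html.parser' backend
soup = bs4.BeautifulSoup(requests.get('http://www.openquestions.com').content, 'html.parser') 
# Collect the href attribute of every <a> inside a <dt> inside a <dl>
hrefs = [a['href'] for a in soup.select('dl dt a')] 
print(hrefs)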

Output:

['oq-phys.htm', 'oq-math.htm', 'oq-life.htm', 'oq-tech.htm', 'oq-geo.htm', 'oq-map.htm', 'oq-about.htm', 'oq-howto.htm', 'oqc/oqc-home.htm', 'oq-indx.htm', 'oq-news.htm', 'oq-best.htm', 'oq-gloss.htm', 'oq-quote.htm', 'oq-new.htm']
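The values in this output are relative links. If absolute URLs are needed, one possible approach is the standard-library `urllib.parse.urljoin`. This is only a sketch: `BASE_URL` is assumed to be the page the links were scraped from, and the list holds just a few of the values from the output above.

```python
from urllib.parse import urljoin

# Assumption: the links are relative to the page they were scraped from.
BASE_URL = "http://www.openquestions.com/"

# A few of the relative hrefs from the output above.
hrefs = ["oq-phys.htm", "oq-math.htm", "oqc/oqc-home.htm"]

# Resolve each relative href against the base URL.
absolute = [urljoin(BASE_URL, h) for h in hrefs]
print(absolute)
# ['http://www.openquestions.com/oq-phys.htm',
#  'http://www.openquestions.com/oq-math.htm',
#  'http://www.openquestions.com/oqc/oqc-home.htm']
```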

Source

2017-11-19 17:01:36 RomanPerekhrest

This worked for me. – frogfanitw

In the example below, html_string is assumed to hold the parsed HTML containing the snippet above.

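import requests
import lxml.html as LH  # lxml.html, not lxml.etree: it parses HTML and provides text_content()

html_string = LH.fromstring(requests.get('http://openquestions.com').content)

# Select only the anchors that actually carry an href attribute
for quoted_link in html_string.xpath('//a[@href]'):
    print(quoted_link.attrib['href'], quoted_link.text_content())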

Source

2017-11-19 16:43:02 2324

I may have left important information out of the question, but this resulted in errors. Thanks for the effort! – frogfanitw

There are many ways to skin this cat. Here is how I wrote it:

import requests 
from lxml.html import fromstring 

req = requests.get('http://www.openquestions.com') 
resp = fromstring(req.content) 
hrefs = resp.xpath('//dt/a/@href') 
print(hrefs)

Edit:

I prefer XPath to CSS selectors. The requests/lxml solution here contains no (explicit) for loop, and it is faster.

Benchmark:

import requests,bs4 
from lxml.html import fromstring 
import timeit 

req = requests.get('http://www.openquestions.com').content 

def myfunc() : 
    resp = fromstring(req) 
    hrefs = resp.xpath('//dl/dt/a/@href') 

print("Time for lxml: ", timeit.timeit(myfunc, number=100)) 

############################################################## 

resp2 = requests.get('http://www.openquestions.com').content 

def func2() : 
    soup = bs4.BeautifulSoup(resp2, 'html.parser') 
    hrefs = [a['href'] for a in soup.select('dl dt a')] 

print("Time for beautiful soup:", timeit.timeit(func2, number=100))

Output:

('Time for lxml: ', 0.09621267095780464) 
('Time for beautiful soup:', 0.8594218329542824)
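These figures are totals for 100 calls each (`number=100`). As a rough worked example, the sketch below converts them to per-call times and a speed ratio. The two totals are copied from the output above; the variable names are hypothetical, and the result only describes this one run on the poster's machine.

```python
# Totals reported above, in seconds, for 100 calls each.
lxml_total = 0.09621267095780464
bs4_total = 0.8594218329542824
runs = 100

# Average time per call, converted to milliseconds.
lxml_ms = lxml_total / runs * 1000
bs4_ms = bs4_total / runs * 1000

# How many times slower the BeautifulSoup version was in this run.
ratio = bs4_total / lxml_total

print(f"lxml: {lxml_ms:.2f} ms per call")
print(f"BeautifulSoup: {bs4_ms:.2f} ms per call")
print(f"ratio: {ratio:.1f}x")
```

In this run lxml took just under 1 ms per call and BeautifulSoup with 'html.parser' just under 9 ms, so roughly a ninefold difference.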

Source

2017-11-19 21:39:43
