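Large time difference when parsing two similar HTML files

I have two search results from a web service saved as HTML files, and I need to parse them with BeautifulSoup to extract some data. I noticed that one of them takes roughly 35 times longer to parse than the other.

What can I do to improve the performance for the slow HTML file?

Setup: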

Python 2.7.13 
Jupyter Notebook 4.3.1 
beautifulsoup4 (4.5.3) 
lxml (3.8.0)

Code: I timed BeautifulSoup on the two HTML files:

from bs4 import BeautifulSoup 

path = "path to the files" 
file_1 = "slow.html" 
file_2 = "fast.html" 

with open(path+file_1) as rfile_1: 
    html_1 = rfile_1.read() 
with open(path+file_2) as rfile_2: 
    html_2 = rfile_2.read() 

%timeit soup = BeautifulSoup(html_1, 'lxml') 
>> 1 loop, best of 3: 4.67 s per loop 
%timeit soup = BeautifulSoup(html_2, 'lxml') 
>> 10 loops, best of 3: 136 ms per loop

Source

2017-08-12 RandomDude

My results are the reverse of yours: "fast" took about twice as long as "slow". I don't know why that should be.

>>> timeit.timeit("import bs4;HTML = open('slow.html').read();bs4.BeautifulSoup(HTML, 'lxml')", number=1000) 
83.10731378142236 
>>> timeit.timeit("import bs4;HTML = open('fast.html').read();bs4.BeautifulSoup(HTML, 'lxml')", number=1000) 
147.65896100030727

If parsing time matters, I suggest using scrapy. For each file it returned a result in about a quarter of the time.

>>> timeit.timeit("from scrapy.selector import Selector;HTML = open('slow.html').read();Selector(text=HTML)", number=1000) 
21.85675587779292 
>>> timeit.timeit("from scrapy.selector import Selector;HTML = open('fast.html').read();Selector(text=HTML)", number=1000) 
39.938533099930055
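
Both pairs of timings above put the import and the file read inside the timed statement, so they measure slightly more than parsing. Below is a minimal sketch of a measurement that times only the parse call on markup already held in memory. It assumes Python 3 and uses the standard-library `HTMLParser` as a stand-in parser so that it runs without extra packages; `parse_with_stdlib` and `best_parse_time` are hypothetical helper names, not part of bs4 or scrapy.

```python
import timeit
from html.parser import HTMLParser


def parse_with_stdlib(html):
    """Stand-in parser: feed the markup through the standard-library HTMLParser."""
    parser = HTMLParser()
    parser.feed(html)
    parser.close()


def best_parse_time(parse, html, repeat=3, number=10):
    """Return the best per-call time in seconds for parse(html).

    The markup is passed in as a string, so neither the import nor the
    file read is part of the measurement. Taking the minimum over the
    repeats reduces noise from other processes.
    """
    totals = timeit.repeat(lambda: parse(html), repeat=repeat, number=number)
    return min(totals) / number


if __name__ == "__main__":
    # Synthetic markup for illustration only; replace with the contents of a real file.
    sample = "<html><body>" + "<p>row</p>" * 1000 + "</body></html>"
    print("%.6f s per parse" % best_parse_time(parse_with_stdlib, sample))
```

With bs4 installed, the same helper could presumably be called as `best_parse_time(lambda h: BeautifulSoup(h, 'lxml'), html_1)` to compare the two files on equal terms.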

Source

2017-08-12 15:23:21

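slow.html is about half the size of fast.html, so your results make sense. Since you don't see the 35-times-longer result that I get, can I assume there is a problem with my Python/package installation? When I use scrapy I get the same results as you - thanks for the tip – RandomDude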
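
The size argument in the comment above can be checked with a little arithmetic: divide each total time by the file size and compare. The sketch below uses the totals reported in the answer; the sizes are hypothetical placeholders (1 and 2 units) that encode only the "about half" ratio from the comment, and `per_size_ratio` is an illustrative helper name.

```python
def per_size_ratio(time_a, time_b, size_a, size_b):
    """Ratio of (time per unit of size for A) to (time per unit of size for B)."""
    return (time_a / size_a) / (time_b / size_b)


# Totals for 1000 BeautifulSoup parses, copied from the answer.
slow_total = 83.10731378142236
fast_total = 147.65896100030727

# Hypothetical sizes: only the ratio (slow is about half of fast) comes from the comment.
slow_size = 1.0
fast_size = 2.0

ratio = per_size_ratio(slow_total, fast_total, slow_size, fast_size)
print("slow.html cost per unit of size relative to fast.html: %.2f" % ratio)
```

A ratio close to 1 would mean that parse time scales roughly with file size on that machine, which is what the comment suggests.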

The short answer is that I don't know. With Python, Jupyter and all these other pieces in play, who can say? Getting meaningful timings is a headache to begin with. –
