2012-05-10 17 views
3

Python BeautifulSoup error

I have this script:

import urllib2 
from BeautifulSoup import BeautifulSoup 
import html5lib 
import lxml 

soup = BeautifulSoup(urllib2.urlopen("http://www.hitmeister.de").read()) 

But this gives me the following error:

Traceback (most recent call last): 
    File "akaConnection.py", line 59, in <module> 
    soup = BeautifulSoup(urllib2.urlopen("http://www.hitmeister.de").read()) 
    File "/usr/lib/pymodules/python2.6/BeautifulSoup.py", line 1499, in __init__ 
    BeautifulStoneSoup.__init__(self, *args, **kwargs) 
    File "/usr/lib/pymodules/python2.6/BeautifulSoup.py", line 1230, in __init__ 
    self._feed(isHTML=isHTML) 
    File "/usr/lib/pymodules/python2.6/BeautifulSoup.py", line 1263, in _feed 
    self.builder.feed(markup) 
    File "/usr/lib/python2.6/HTMLParser.py", line 108, in feed 
    self.goahead(0) 
    File "/usr/lib/python2.6/HTMLParser.py", line 148, in goahead 
    k = self.parse_starttag(i) 
    File "/usr/lib/python2.6/HTMLParser.py", line 226, in parse_starttag 
    endpos = self.check_for_whole_start_tag(i) 
    File "/usr/lib/python2.6/HTMLParser.py", line 301, in check_for_whole_start_tag 
    self.error("malformed start tag") 
    File "/usr/lib/python2.6/HTMLParser.py", line 115, in error 
    raise HTMLParseError(message, self.getpos()) 
HTMLParser.HTMLParseError: malformed start tag, at line 56, column 872 

Then I tried this code:

soup = BeautifulSoup(urllib2.urlopen("http://www.hitmeister.de").read(),"lxml") 

or

soup = BeautifulSoup(urllib2.urlopen("http://www.hitmeister.de").read(),"html5lib") 

But this gives me this error:

Traceback (most recent call last): 
    File "akaConnection.py", line 59, in <module> 
    soup = BeautifulSoup(urllib2.urlopen("http://www.hitmeister.de").read(),"lxml") 
    File "/usr/lib/pymodules/python2.6/BeautifulSoup.py", line 1499, in __init__ 
    BeautifulStoneSoup.__init__(self, *args, **kwargs) 
    File "/usr/lib/pymodules/python2.6/BeautifulSoup.py", line 1230, in __init__ 
    self._feed(isHTML=isHTML) 
    File "/usr/lib/pymodules/python2.6/BeautifulSoup.py", line 1263, in _feed 
    self.builder.feed(markup) 
    File "/usr/lib/python2.6/HTMLParser.py", line 108, in feed 
    self.goahead(0) 
    File "/usr/lib/python2.6/HTMLParser.py", line 156, in goahead 
    k = self.parse_declaration(i) 
    File "/usr/lib/pymodules/python2.6/BeautifulSoup.py", line 1112, in parse_declaration 
    j = HTMLParser.parse_declaration(self, i) 
    File "/usr/lib/python2.6/markupbase.py", line 109, in parse_declaration 
    self.handle_decl(data) 
    File "/usr/lib/pymodules/python2.6/BeautifulSoup.py", line 1097, in handle_decl 
    self._toStringSubclass(data, Declaration) 
    File "/usr/lib/pymodules/python2.6/BeautifulSoup.py", line 1030, in _toStringSubclass 
    self.soup.endData(subclass) 
    File "/usr/lib/pymodules/python2.6/BeautifulSoup.py", line 1318, in endData 
    (not self.parseOnlyThese.text or \ 
AttributeError: 'str' object has no attribute 'text' 

I am running Linux Ubuntu 10.04, Python 2.6.5, and BeautifulSoup version 3.1.0.1. How can I fix my code, or is there something I missed?

+2

Your first script seems to work fine for me... Which version of BeautifulSoup do you have? Mine is 3.0.8.1. – Eli

+1

For really broken HTML, try running the HTML through Tidy first. Something like http://countergram.com/open-source/pytidylib – Eli

+1

Your second error comes from copying a BeautifulSoup 4 example and trying to use it with BeautifulSoup 3. BS3 does not use lxml or html5lib. –

Answers

3

As suggested in the comments, use pytidylib:

import urllib2
from StringIO import StringIO

from BeautifulSoup import BeautifulSoup
from tidylib import tidy_document

# Fetch the raw page, then let Tidy repair the markup before parsing.
html = urllib2.urlopen("http://www.hitmeister.de").read()
# tidy_document returns a (cleaned_document, errors) pair.
tidy, errors = tidy_document(html)
soup = BeautifulSoup(tidy)
print type(soup)

Running this:

(py26_default)[[email protected] ~]$ python foo.py
<class 'BeautifulSoup.BeautifulSoup'>
(py26_default)[[email protected] ~]$

The errors reported by pytidylib were:

line 53 column 1493 - Warning: '<' + '/' + letter not allowed here 
line 53 column 1518 - Warning: '<' + '/' + letter not allowed here 
line 53 column 1541 - Warning: '<' + '/' + letter not allowed here 
line 53 column 1547 - Warning: '<' + '/' + letter not allowed here 
line 132 column 239 - Warning: '<' + '/' + letter not allowed here 
line 135 column 231 - Warning: '<' + '/' + letter not allowed here 
line 434 column 98 - Warning: replacing invalid character code 156 
line 453 column 96 - Warning: replacing invalid character code 156 
line 780 column 108 - Warning: replacing invalid character code 159 
line 991 column 27 - Warning: replacing invalid character code 156 
line 1018 column 43 - Warning: '<' + '/' + letter not allowed here 
line 1029 column 40 - Warning: '<' + '/' + letter not allowed here 
line 1037 column 126 - Warning: '<' + '/' + letter not allowed here 
line 1039 column 96 - Warning: '<' + '/' + letter not allowed here 
line 1040 column 71 - Warning: '<' + '/' + letter not allowed here 
line 1041 column 58 - Warning: '<' + '/' + letter not allowed here 
line 1047 column 126 - Warning: '<' + '/' + letter not allowed here 
line 1049 column 96 - Warning: '<' + '/' + letter not allowed here 
line 1050 column 72 - Warning: '<' + '/' + letter not allowed here 
line 1051 column 58 - Warning: '<' + '/' + letter not allowed here 
line 1063 column 108 - Warning: '<' + '/' + letter not allowed here 
line 1066 column 58 - Warning: '<' + '/' + letter not allowed here 
line 1076 column 17 - Warning: <input> element not empty or not closed 
line 1121 column 140 - Warning: '<' + '/' + letter not allowed here 
line 1202 column 33 - Error: <g:plusone> is not recognized! 
line 1202 column 33 - Warning: discarding unexpected <g:plusone> 
line 1202 column 88 - Warning: discarding unexpected </g:plusone> 
line 1245 column 86 - Warning: replacing invalid character code 130 
line 1265 column 33 - Warning: entity "&gt" doesn't end in ';' 
line 1345 column 354 - Warning: '<' + '/' + letter not allowed here 
line 1361 column 255 - Warning: unescaped & or unknown entity "&_s_icmp" 
line 1361 column 562 - Warning: unescaped & or unknown entity "&_s_icmp" 
line 1361 column 856 - Warning: unescaped & or unknown entity "&_s_icmp" 
line 1397 column 115 - Warning: replacing invalid character code 130 
line 1425 column 116 - Warning: replacing invalid character code 130 
line 1453 column 115 - Warning: replacing invalid character code 130 
line 1481 column 116 - Warning: replacing invalid character code 130 
line 1509 column 116 - Warning: replacing invalid character code 130 
line 1523 column 251 - Warning: replacing invalid character code 159 
line 1524 column 259 - Warning: replacing invalid character code 159 
line 1524 column 395 - Warning: replacing invalid character code 159 
line 1533 column 151 - Warning: replacing invalid character code 159 
line 1537 column 115 - Warning: replacing invalid character code 130 
line 1565 column 116 - Warning: replacing invalid character code 130 
line 1593 column 116 - Warning: replacing invalid character code 130 
line 1621 column 115 - Warning: replacing invalid character code 130 
line 1649 column 115 - Warning: replacing invalid character code 130 
line 1677 column 115 - Warning: replacing invalid character code 130 
line 1705 column 115 - Warning: replacing invalid character code 130 
line 1750 column 150 - Warning: replacing invalid character code 130 
line 1774 column 150 - Warning: replacing invalid character code 130 
line 1798 column 150 - Warning: replacing invalid character code 130 
line 1822 column 150 - Warning: replacing invalid character code 130 
line 1826 column 78 - Warning: replacing invalid character code 130 
line 1854 column 150 - Warning: replacing invalid character code 130 
line 1878 column 150 - Warning: replacing invalid character code 130 
line 1902 column 150 - Warning: replacing invalid character code 130 
line 1926 column 150 - Warning: replacing invalid character code 130 
line 1954 column 186 - Warning: unescaped & or unknown entity "&charge" 
line 2004 column 100 - Warning: replacing invalid character code 156 
line 2033 column 162 - Warning: replacing invalid character code 159 
line 21 column 1 - Warning: <meta> proprietary attribute "property" 
line 22 column 1 - Warning: <meta> proprietary attribute "property" 
line 23 column 1 - Warning: <meta> proprietary attribute "property" 
line 29 column 1 - Warning: <meta> proprietary attribute "property" 
line 30 column 1 - Warning: <meta> proprietary attribute "property" 
line 31 column 1 - Warning: <meta> proprietary attribute "property" 
line 412 column 9 - Warning: <body> proprietary attribute "itemscope" 
line 412 column 9 - Warning: <body> proprietary attribute "itemtype" 
line 1143 column 1 - Warning: <script> inserting "type" attribute 
line 1225 column 44 - Warning: <table> lacks "summary" attribute 
line 1934 column 9 - Warning: <div> proprietary attribute "name" 
line 436 column 41 - Warning: trimming empty <li> 
line 446 column 89 - Warning: trimming empty <li> 
line 1239 column 33 - Warning: trimming empty <span> 
line 1747 column 37 - Warning: trimming empty <span> 
line 1771 column 37 - Warning: trimming empty <span> 
line 1795 column 37 - Warning: trimming empty <span> 
line 1819 column 37 - Warning: trimming empty <span> 
line 1851 column 37 - Warning: trimming empty <span> 
line 1875 column 37 - Warning: trimming empty <span> 
line 1899 column 37 - Warning: trimming empty <span> 
line 1923 column 37 - Warning: trimming empty <span> 
line 2018 column 49 - Warning: trimming empty <span> 
line 2026 column 49 - Warning: trimming empty <span> 
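
The `errors` value is one long string with one message per line, so a list like the one above can be condensed into counts per message type. This is only a minimal sketch: it assumes Python 3 (`collections.Counter` is not available in Python 2.6) and messages shaped like `line N column M - Level: text` as shown above. `summarize_tidy_errors` is a hypothetical helper, not part of pytidylib.

```python
import re
from collections import Counter

# Matches messages shaped like: "line 53 column 1493 - Warning: some text"
_MESSAGE = re.compile(r"^line (\d+) column (\d+) - (\w+): (.*)$")


def summarize_tidy_errors(errors):
    """Count tidy messages by (level, text). Hypothetical helper."""
    counts = Counter()
    for raw in errors.splitlines():
        match = _MESSAGE.match(raw.strip())
        if match:
            level, text = match.group(3), match.group(4)
            counts[(level, text)] += 1
    return counts


# A few lines copied from the output above, used as sample input.
sample = "\n".join([
    "line 53 column 1493 - Warning: '<' + '/' + letter not allowed here",
    "line 53 column 1518 - Warning: '<' + '/' + letter not allowed here",
    "line 434 column 98 - Warning: replacing invalid character code 156",
    "line 1202 column 33 - Error: <g:plusone> is not recognized!",
])

# Print the most frequent message types first.
for (level, text), n in summarize_tidy_errors(sample).most_common():
    print(n, level, text)
```

Grouping by message text makes it easier to see that most of the noise comes from a handful of recurring problems rather than from many distinct ones.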
+0

Maybe this question isn't really about BS, but I'm new to Python. Please explain this line: tidy, errors = tidy_document(html). – torayeff

+0

`tidy` is the HTML document cleaned up by `pytidylib`, and `errors` are the errors `pytidylib` found in the original that I sent it. –

+0

So you just assign two variables in one line, right? I – torayeff
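
The assignment discussed in these comments is ordinary tuple unpacking: the function returns a two-item tuple, and each name on the left receives one item. A minimal sketch follows; `fake_tidy_document` is a hypothetical stand-in (not pytidylib's function) so that it runs without any extra library.

```python
def fake_tidy_document(markup):
    """Hypothetical stand-in that returns a (document, errors) pair."""
    cleaned = markup.strip()
    if cleaned == markup:
        errors = ""
    else:
        errors = "line 1 column 1 - Warning: stripped whitespace\n"
    return cleaned, errors  # a two-item tuple


# One statement, two variables: the tuple is unpacked left to right.
tidy, errors = fake_tidy_document("  <p>hello</p>  ")

# The same thing written out in two steps.
result = fake_tidy_document("  <p>hello</p>  ")
tidy_again, errors_again = result[0], result[1]

print(tidy)
print(errors)
```

Both forms bind the same values; the one-line form is simply the idiomatic way to receive several return values at once.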

0

urllib and urllib2 have version dependencies. I did the same thing with:

import urllib

sock = urllib.urlopen("http://www.espncricinfo.com/ci/engine/match/903603.html")
htmlSource = sock.read()

I was using soup.find_all by attribute. Hope this helps.
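
The version dependency mentioned in this answer can be worked around with a guarded import. The sketch below is one possible way to do that, under the assumption that the code should run on both Python 2 and Python 3 (where `urlopen` lives in `urllib.request`). `fetch_html`, its `opener` parameter, and `fake_opener` are hypothetical names introduced here so the example can run offline.

```python
import io

try:
    from urllib.request import urlopen  # Python 3 location of urlopen
except ImportError:
    from urllib2 import urlopen  # Python 2 fallback


def fetch_html(url, opener=urlopen):
    """Return the raw bytes of a page.

    `opener` is a hypothetical hook so the function can be exercised
    without touching the network.
    """
    sock = opener(url)
    try:
        return sock.read()
    finally:
        sock.close()


def fake_opener(url):
    """Hypothetical offline opener that returns a canned page."""
    return io.BytesIO(b"<html><body><p class='score'>ok</p></body></html>")


# Offline demonstration; no request is sent.
html_source = fetch_html("http://example.invalid/", opener=fake_opener)
print(len(html_source))
```

With a live connection it would be called as `fetch_html("http://www.espncricinfo.com/ci/engine/match/903603.html")`, and the returned bytes can then be handed to BeautifulSoup as in the answer.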