Python string find function does not locate a position in text returned by BeautifulSoup

I'm trying to scrape sections of a 10-K filing. I'm having trouble locating the position of 'Item 7(a)' in the text returned by BeautifulSoup, even though the text contains those words. However, the same code works on a plain string that contains 'Item 7(a)'.

import urllib2 
import re 
import bs4 as bs 
url='https://www.sec.gov/Archives/edgar/data/1580608/000158060817000015/santander201610-k.htm'

html = urllib2.urlopen(url).read().decode('utf8') 
soup = bs.BeautifulSoup(html,'lxml') 
text = soup.get_text() 
text = text.encode('utf-8') 
text = text.lower() 
print type(text) 
print len(text) 
text1 = "hf dfbd item 7. abcd sfjsdf sdbfjkds item 7(a). adfbdf item 8. skjfbdk item 7. sdfkba ootgf sffdfd item 7(a). sfbdskf sfdf item 8. sdfbksdf " 
print text.find('item 7(a)') 
print text1.find('item 7(a)') 

Output: 
<type 'str'> 
592214 
-1 
37

Source

2017-12-03 Vinay

Are you by any chance using Python 2? –

Yes, I'm using Python 2.7. I also tried Python 3.6, but I got the same result. – Vinay

Did you print 'text'? Maybe the server returns a different result than it sends to a web browser. – furas

In the text ITEM 7(A), the page uses the entity &nbsp; (Non-Breaking SPace, character code 160) instead of a normal space (code 32).

You can replace every character with code 160 (chr(160)) with a normal space (" ").

I can only test this with Python 3:

#import urllib.request as urllib2 # Python 3 
import urllib2 
import re 
import bs4 as bs 

url='https://www.sec.gov/Archives/edgar/data/1580608/000158060817000015/santander201610-k.htm' 

html = urllib2.urlopen(url).read().decode('utf8') 
soup = bs.BeautifulSoup(html,'lxml') 
text = soup.get_text() 
text = text.encode('utf-8') # only Python 2 
text = text.lower() 

#text = text.replace(chr(160), " ") # Python 3 
text = text.replace(chr(194)+chr(160), " ") # Python 2

search = 'item 7(a)' 

# find every occurrence in text
pos = 0 
while True: 
    pos = text.find(search, pos) 
    if pos == -1: 
        break
    #print(pos, ">"+text[pos-1]+"<", ord(text[pos-1])) 
    print(text[pos:pos+20]) 
    pos += 1
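
To see the mismatch without downloading the filing, here is a minimal Python 3 sketch on a made-up string. The `sample` text and the variable names are assumptions for illustration only; `"\u00a0"` is the no-break space that &nbsp; decodes to.

```python
# Hypothetical sample text; "\u00a0" stands in for the page's &nbsp;.
sample = "item 7. overview ... item\u00a07(a). market risk ... item 8."

search = "item 7(a)"  # typed with a regular space (code 32)

# The plain-space search string does not match the no-break space.
before = sample.find(search)

# Inspect the character sitting between "item" and "7(a)".
gap = sample[sample.find("7(a)") - 1]
gap_code = ord(gap)

# After replacing U+00A0 with a regular space, the search succeeds.
normalized = sample.replace("\u00a0", " ")
after = normalized.find(search)

print(before, gap_code, after)  # -1 160 21
```

Printing `ord()` of the character just before the match, as the commented-out line in the loop above does, is a quick way to check which kind of space a page really uses.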

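In Python 2 (after encoding) you need to replace two characters, 194 and 160, because U+00A0 is encoded in UTF-8 as the two bytes 0xC2 0xA0:

text = text.replace(chr(160), " ") # Python 3
text = text.replace(chr(194)+chr(160), " ") # Python 2

EDIT: Alternatively, you can unescape the search string 'item&nbsp;7(a)' and search for that. Here you need to know that you have to use &nbsp; instead of " ".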

from html import unescape  # Python 3.4+
search = unescape('item&nbsp;7(a)')

Full code

import urllib.request as urllib2  # Python 3 (html.unescape was added in 3.4)
import re
import bs4 as bs
from html import unescape

url = 'https://www.sec.gov/Archives/edgar/data/1580608/000158060817000015/santander201610-k.htm'

html = urllib2.urlopen(url).read().decode('utf8')
soup = bs.BeautifulSoup(html, 'lxml')
text = soup.get_text()
text = text.lower()

# unescape() turns &nbsp; into the no-break space (U+00A0) used on the page
search = unescape('item&nbsp;7(a)')

# find every occurrence in text
pos = 0
while True:
    pos = text.find(search, pos)
    if pos == -1:
        break
    #print(pos, ">"+text[pos-1]+"<", ord(text[pos-1]))
    print(text[pos:pos+20])
    pos += 1
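
If other kinds of whitespace may also sit between the words (line breaks, doubled spaces), a broader option is to collapse all whitespace before searching. This is not part of the answer above, just a hedged Python 3 sketch: `normalize_whitespace`, `find_all`, and the `sample` string are hypothetical names and data introduced for illustration.

```python
import re


def normalize_whitespace(text):
    """Collapse every run of whitespace (including U+00A0) into one space."""
    # In Python 3, \s in a str pattern matches Unicode whitespace,
    # which includes the no-break space U+00A0.
    return re.sub(r"\s+", " ", text).strip()


def find_all(text, search):
    """Return the start index of every occurrence of search in text."""
    positions = []
    pos = text.find(search)
    while pos != -1:
        positions.append(pos)
        pos = text.find(search, pos + 1)
    return positions


# Made-up sample standing in for soup.get_text().lower()
sample = "item\u00a07. intro\n\nitem\u00a07(a).\n market risk item 8. item\u00a0\u00a07(a). more"

clean = normalize_whitespace(sample)
hits = find_all(clean, "item 7(a)")
print(hits)  # [14, 45]
```

Note that the positions refer to the normalized text, not to the original string, so any slicing has to be done on `clean` as well.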

Source

2017-12-03 01:45:16 furas
