Python HTML parser: data not found

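I am building a web page "crawler" that parses a web page and searches it for a word or words. This is where my problem arises: the data I am looking for is present in the parsed web page (I ran it with a specific word as a test), but the program says the data was not found.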

from html.parser import HTMLParser
from urllib import *

class dataFinder(HTMLParser):
    def open_webpage(self):
        import urllib.request
        request = urllib.request.Request('https://www.summet.com/dmsi/html/readingTheWeb.html')  # Insert Webpage
        response = urllib.request.urlopen(request)
        web_page = response.read()
        self.webpage_text = web_page.decode()
        return self.webpage_text

    def handle_data(self, data):
        wordtofind = 'PaperBackSwap.com'
        if data == wordtofind:
            print('Match found:', data)
        else:
            print('No matches found')


p = dataFinder()
print(p.open_webpage())
p.handle_data(p.webpage_text)

I have run the program without the open_webpage function, using the feed method instead, and it worked and found the data. However, it does not work now.

Any help in resolving this issue would be appreciated.

Source

2017-08-14 S0lo

What exactly do you want to extract from the website? Links from href tags? –

I'm trying to search for text anywhere in the page, whether it is in an href tag or a p tag – S0lo

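You got "No matches found" because you are comparing the whole HTML page with a single string using ==, and of course they are not equal. To search for a string inside another string, you can use the str.find() method. It returns the index of the first occurrence of the text, or -1 if the text is not found.

Corrected code: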

from html.parser import HTMLParser
import urllib.request

class dataFinder(HTMLParser):
    def open_webpage(self):
        request = urllib.request.Request('https://www.summet.com/dmsi/html/readingTheWeb.html')  # Insert Webpage
        response = urllib.request.urlopen(request)
        web_page = response.read()
        self.webpage_text = web_page.decode()
        return self.webpage_text

    def handle_data(self, data):
        wordtofind = 'PaperBackSwap.com'
        # str.find() returns the index of the first occurrence, or -1 if absent
        position = data.find(wordtofind)
        if position != -1:
            print('Match found position:', position)
        else:
            print('No matches found')

p = dataFinder()
print(p.open_webpage())
# handle_data is called directly on the whole page text here, not via feed()
p.handle_data(p.webpage_text)
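
The code above calls handle_data directly on the whole page text. The question also mentions the feed method, which makes HTMLParser call handle_data once per run of text between tags. A minimal sketch of that route follows. The class name WordFinder, the matches list, and the inline sample HTML are assumptions made for illustration, so that the example runs without a network connection:

```python
from html.parser import HTMLParser


class WordFinder(HTMLParser):
    """Collects every text node that contains a given word (hypothetical helper)."""

    def __init__(self, word):
        super().__init__()
        self.word = word
        self.matches = []

    def handle_data(self, data):
        # feed() calls this once per run of text between tags
        if self.word in data:
            self.matches.append(data.strip())


# Inline sample HTML (an assumption, standing in for the downloaded page)
sample_html = (
    "<html><body>"
    "<p>I trade books on PaperBackSwap.com every month.</p>"
    "<a href='https://example.com'>Unrelated link</a>"
    "</body></html>"
)

finder = WordFinder("PaperBackSwap.com")
finder.feed(sample_html)
finder.close()
print(finder.matches or "No matches found")
```

Collecting matches in a list instead of printing inside handle_data avoids a "No matches found" line for every text node that happens not to contain the word.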

Source

2017-08-14 10:25:13 Mentos

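This works. Thank you for introducing me to this. I'm new to programming and haven't had the chance to go through the documentation very thoroughly, so I would be very grateful if someone could point me to where this is covered in the documentation. Also, you said it returns the first position found. Is there a way to return all positions of the word? – S0lo

@S0lo http://code.activestate.com/recipes/499314-find-all-a-substring-in-a-given-string/#c1 gets all positions of a substring. You can use it like this: 'allindices(data, wordtofind)' – Mentos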
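
For readers who cannot open the recipe, here is a rough sketch of the same idea using only str.find() with its optional start argument. It is not the recipe's code. The function name find_all_positions and the sample string are made up for illustration:

```python
def find_all_positions(text, word):
    """Return the start index of every occurrence of word in text (overlaps included)."""
    positions = []
    start = text.find(word)
    while start != -1:
        positions.append(start)
        # Resume the search one character past the previous hit
        start = text.find(word, start + 1)
    return positions


sample = "swap books, swap music, swap films"
print(find_all_positions(sample, "swap"))
print(find_all_positions(sample, "PaperBackSwap.com"))
```

An empty list means the word does not occur, which plays the same role as the -1 returned by a single str.find() call.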

I am able to parse and find text in HTML content with BeautifulSoup; please check whether it works for you. Below is sample code for your case.

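from bs4 import BeautifulSoup

# web_page is the HTML returned by response.read() in the question's code
wordtofind = 'PaperBackSwap.com'
soup = BeautifulSoup(web_page, 'html.parser')
# Collect every text node that contains the search word
matches = soup.find_all(string=lambda text: text and wordtofind in text)
if matches:
    for data in matches:
        print('Match found:', data)
else:
    print('No matches found')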

Source

2017-08-14 10:30:06 SeJaPy

Late to the party, but I would strongly advise using the requests module for HTTP interactions. It will make your life much easier.

import requests
from html.parser import HTMLParser

class dataFinder(HTMLParser):
    def open_webpage(self):
        # requests.get() returns a response; .text is the decoded body
        response = requests.get('https://www.summet.com/dmsi/html/readingTheWeb.html')
        self.webpage_text = response.text
        return self.webpage_text

Source

2017-08-14 14:07:11
