A simple web crawler in Python

I am teaching myself Python and have built a simple web crawler engine. The code is below:

def find_next_url(page):
    start_of_url_line = page.find('<a href')
    if start_of_url_line == -1:
        return None, 0
    else:
        start_of_url = page.find('"http', start_of_url_line)
        if start_of_url == -1:
            return None, 0
        else:
            end_of_url = page.find('"', start_of_url + 1)
            one_url = page[start_of_url + 1 : end_of_url]
            return one_url, end_of_url

def get_all_url(page):
    p = []
    while True:
        url, end_pos = find_next_url(page)
        if url:
            p.append(url)
            page = page[end_pos + 1 : ]
        else:
            break
    return p

def union(a, b):
    for e in b:
        if e not in a:
            a.append(e)
    return a

def webcrawl(seed):
    tocrawl = [seed]
    crawled = []
    while True:
        page = tocrawl.pop()
        if page not in crawled:
            import urllib.request
            intpage = urllib.request.urlopen(page).read()
            openpage = str(intpage)
            union(tocrawl, get_all_url(openpage))
            crawled.append(page)
    return crawled

However, I always get an HTTP 403 error.

Source

2017-11-28 Sayan

403 means **Forbidden** (https://en.wikipedia.org/wiki/HTTP_403) - without knowing the URL you are trying to access, it is hard to say whether this is the *desired* behavior. –

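What I am trying to achieve is for the code to take some URLs from a page, visit each of them, and check whether more URLs can be added to the list of URLs already found. I could probably achieve this if I had a simple web page with a few HTTP hyperlinks. I tried it with https://xkcd.com/353/. – Sayan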

The HTTP 403 error is not related to your code. It means that access to the URL being crawled is forbidden. In most cases, this means the page is available only to logged-in users or to specific users.

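I actually ran the code and got a 403 on the creativecommons link. The cause appears to be the request headers: urllib's defaults identify the client as a script (its User-Agent is Python-urllib/x.y), and many servers inspect request headers such as Host and User-Agent to decide what content to serve, so you may need to set the headers manually to avoid the error. You could also use the Requests Python package instead of the built-in urllib, which is more Pythonic, IMO.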

Add a try-except clause to catch and log the error, then continue crawling the other links. There are many broken links on the web.

from urllib.request import urlopen
from urllib.error import HTTPError
...
def webcrawl(seed):
    tocrawl = [seed]
    crawled = []
    while tocrawl:  # stop when no URLs are left; pop() on an empty list raises IndexError
        page = tocrawl.pop()
        if page not in crawled:
            try:
                intpage = urlopen(page).read()
                openpage = str(intpage)
                union(tocrawl, get_all_url(openpage))
                crawled.append(page)
            except HTTPError as ex:
                # log the failure and carry on with the remaining links
                print('got http error while crawling', page)
    return crawled
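
The skip-and-continue idea can also be sketched in a form that can be tried without network access. In the sketch below, `crawl`, `fetch`, `extract_links`, and `max_pages` are hypothetical names that do not appear in the original code: `fetch` is any callable that returns page text, so a stub can stand in for `urlopen`. Catching `URLError` (the parent class of `HTTPError`) is an assumption about how DNS or connection failures should be treated.

```python
from urllib.error import HTTPError, URLError


def crawl(seed, fetch, extract_links, max_pages=100):
    """Crawl from seed, skipping pages whose fetch fails.

    fetch(url) -> str and extract_links(text) -> list of URLs are supplied
    by the caller (hypothetical helpers, e.g. wrappers around urlopen and
    get_all_url). max_pages is an assumed safety limit.
    """
    tocrawl = [seed]
    crawled = []
    failed = {}  # url -> short reason, so failures can be inspected later
    while tocrawl and len(crawled) + len(failed) < max_pages:
        url = tocrawl.pop()
        if url in crawled or url in failed:
            continue
        try:
            text = fetch(url)
        except HTTPError as ex:  # must come before URLError (its parent class)
            failed[url] = 'HTTP {}'.format(ex.code)
            continue
        except URLError as ex:
            failed[url] = 'URL error: {}'.format(ex.reason)
            continue
        crawled.append(url)
        for link in extract_links(text):
            if link not in tocrawl:
                tocrawl.append(link)
    return crawled, failed
```

Returning the failures alongside the crawled pages makes it easy to see afterwards which URLs answered with 403 and which were merely broken.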

Source

2017-11-28 13:31:38

Find the exact URL that causes the 403 error and add it to the question. The URL is most likely the problem. Try printing the URL before the 'urlopen' call. –

I found the URL in the first list - http://creativecommons.org/licenses/by-nc/2.5/ – Sayan

You need to add request headers or other authentication. Try adding a User-Agent to avoid reCaptcha.

Example (identifying as a different client):

User-Agent:Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36
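
One possible way to apply such a header with the standard library, shown only as a sketch, is to install an opener whose default headers carry it, so that existing `urlopen` calls need no changes. `make_opener` and `USER_AGENT` are hypothetical names introduced for illustration, and whether a given site then accepts the request is not guaranteed.

```python
import urllib.request

# Assumed browser-like identifier, copied from the example header above.
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
              'AppleWebKit/537.36 (KHTML, like Gecko) '
              'Chrome/62.0.3202.94 Safari/537.36')


def make_opener(user_agent=USER_AGENT):
    """Return an opener whose default headers carry the given User-Agent."""
    opener = urllib.request.build_opener()
    # Overwrites the opener's default header list, which otherwise
    # announces the client as Python-urllib/x.y.
    opener.addheaders = [('User-Agent', user_agent)]
    return opener


opener = make_opener()
# After this call, plain urllib.request.urlopen(url) goes through this opener.
urllib.request.install_opener(opener)
```

Installing the opener is a process-wide change, so it is best done once at start-up rather than inside the crawl loop.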

Source

2017-11-28 14:03:56

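As already stated, the error is not caused by the code itself, but there are a couple of things you can try.

Try adding an exception handler, perhaps even ignoring the problematic pages altogether, to make sure the crawler otherwise works as expected: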

import sys
import urllib.error
import urllib.request

def webcrawl(seed):
    tocrawl = [seed]
    crawled = []
    while tocrawl:  # replace `while True` with an actual condition,
                    # otherwise you'll be stuck in an infinite loop
                    # until you hit an exception
        page = tocrawl.pop()
        if page not in crawled:
            try:
                intpage = urllib.request.urlopen(page).read()
                openpage = str(intpage)
                union(tocrawl, get_all_url(openpage))
                crawled.append(page)
            except urllib.error.HTTPError as e:  # catch an exception
                if e.code == 401:  # check the status code and take action
                    pass  # or anything else you want to do in case of an `Unauthorized` error
                elif e.code == 403:
                    pass  # or anything else you want to do in case of a `Forbidden` error
                elif e.code == 404:
                    pass  # or anything else you want to do in case of a `Not Found` error
                # etc
                else:
                    print('Exception:\n{}'.format(e))  # print an unexpected exception
                    sys.exit(1)  # finish the process with exit code 1 (indicates there was a problem)
    return crawled

Try adding a User-Agent header to your request. From the urllib.request docs:

This is often used to "spoof" the User-Agent header value, which is used by a browser to identify itself - some HTTP servers only allow requests coming from common browsers as opposed to scripts. For example, Mozilla Firefox may identify itself as "Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11", while urllib's default user agent string is "Python-urllib/2.6" (on Python 2.6).
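
To check what your own interpreter announces by default, a short sketch like the following can be used. It relies on the `addheaders` attribute of the opener returned by `build_opener()`; the variable names are only illustrative, and the exact version suffix depends on the Python in use.

```python
import urllib.request

opener = urllib.request.build_opener()
# addheaders is a list of (name, value) pairs added to every request
default_headers = dict(opener.addheaders)
# Prints something like Python-urllib/3.x, depending on the interpreter
print(default_headers['User-agent'])
```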

So something like this might help you get around some of the 403 errors:

import urllib.request

# `page` is the URL being crawled, as in webcrawl() above
headers = {'User-Agent':'Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11'}
req = urllib.request.Request(page, headers=headers)
intpage = urllib.request.urlopen(req).read()
openpage = str(intpage)
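
The snippet above could be wrapped into small helpers, as in the sketch below. `build_request`, `decode_body`, and `fetch` are hypothetical names, and the UTF-8 fallback is an assumption. The sketch also decodes the response bytes instead of calling `str()` on them, because `str(b'abc')` yields the literal text `b'abc'` rather than `abc`.

```python
import urllib.request

# Assumed browser-like value, taken from the snippet above.
HEADERS = {'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11'}


def build_request(url, headers=HEADERS):
    """Create a Request carrying the custom headers (no network access yet)."""
    return urllib.request.Request(url, headers=headers)


def decode_body(raw, charset=None):
    """Turn response bytes into text; fall back to UTF-8 when no charset is known."""
    return raw.decode(charset or 'utf-8', errors='replace')


def fetch(url):
    """Download url and return its body as text (performs a real HTTP request)."""
    with urllib.request.urlopen(build_request(url)) as resp:
        charset = resp.headers.get_content_charset()
        return decode_body(resp.read(), charset)
```

With these in place, `fetch(page)` could replace the `urlopen(page).read()` and `str(intpage)` pair inside `webcrawl`.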

Source

2017-11-28 15:40:16 rmq
