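Python BeautifulSoup - Grab internal links from page

I have a basic loop that finds the links on a page I retrieved with urllib2.urlopen, but I am trying to follow only the internal links on the page.

Any idea how to make my loop below pick up only links that are on the same domain?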

for tag in soupan.findAll('a', attrs={'href': re.compile("^http://")}):
    webpage = urllib2.urlopen(tag['href']).read()
    print 'Deep crawl ----> ' + str(tag['href'])
    try:
        pass  # code-to-look-for-some-data...
    except Exception, e:
        print e

Source

2012-05-03 user1213488

>>> import urllib 
>>> print urllib.splithost.__doc__ 
splithost('//host[:port]/path') --> 'host[:port]', '/path'.

If the host is the same, or if the host is empty (which is the case for relative paths), the URL belongs to the same host.

for tag in soupan.findAll('a', attrs={'href': re.compile("^http://")}):
    href = tag['href']
    protocol, url = urllib.splittype(href)  # 'http://www.xxx.de/3/4/5' => ('http', '//www.xxx.de/3/4/5')
    host, path = urllib.splithost(url)  # '//www.xxx.de/3/4/5' => ('www.xxx.de', '/3/4/5')
    # theHostToCrawl must hold the lowercase host name of the site being crawled
    if host and host.lower() != theHostToCrawl:
        continue  # the link points at a different host

    webpage = urllib2.urlopen(href).read()

    print 'Deep crawl ----> ' + str(href)
    try:
        pass  # code-to-look-for-some-data...
    except:
        import traceback
        traceback.print_exc()
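
For readers on Python 3, where urllib2 no longer exists, the same-host rule can be sketched with urllib.parse.urlsplit. This is only an illustration, not part of the original answer: the function name is_same_host and the example.com URLs are made up, and the scheme check for mailto: or javascript: links is an added assumption.

```python
from urllib.parse import urlsplit


def is_same_host(href, host_to_crawl):
    """Return True if href is relative or points at host_to_crawl (hypothetical helper)."""
    parts = urlsplit(href)
    # Skip links that are not web pages at all, e.g. mailto: or javascript:
    if parts.scheme not in ("", "http", "https"):
        return False
    # hostname is None for relative URLs and lowercased otherwise
    host = parts.hostname
    return host is None or host == host_to_crawl.lower()


print(is_same_host("http://www.example.com/3/4/5", "www.example.com"))
print(is_same_host("/folder/file.htm", "www.example.com"))
print(is_same_host("http://other.example.org/", "www.example.com"))
```

Note that hostname drops any port number, so 'www.example.com:8080' and 'www.example.com' count as the same host in this sketch.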

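Note that because of this filter:

'href': re.compile("^http://")

relative paths will not be picked up. Links like the following are on the same host too:

<a href="/folder/file.htm"></a>

Maybe do not use re at all.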
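
One way to handle relative paths without a regular expression is to resolve every href against the URL of the page it came from and then compare hosts. The sketch below is written for Python 3 and takes a plain list of href strings, which in a real crawler might come from something like soupan.findAll('a', href=True). The function name internal_links and the example.com URLs are hypothetical.

```python
from urllib.parse import urljoin, urlsplit


def internal_links(page_url, hrefs):
    """Resolve hrefs against page_url and keep those on the same host (hypothetical helper)."""
    page_host = urlsplit(page_url).hostname
    result = []
    for href in hrefs:
        # Relative paths such as /folder/file.htm become full URLs here
        absolute = urljoin(page_url, href)
        parts = urlsplit(absolute)
        if parts.scheme in ("http", "https") and parts.hostname == page_host:
            result.append(absolute)
    return result


hrefs = [
    "/folder/file.htm",
    "http://www.example.com/a",
    "http://other.example.org/b",
    "mailto:someone@example.com",
]
print(internal_links("http://www.example.com/index.html", hrefs))
```

Because the returned URLs are already absolute, they can be passed straight to whatever function fetches the page.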

Source

2012-05-03 16:27:48 User

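I don't understand how to implement that in my loop, but I see the logic :) Do you know how to implement it in the loop? – user1213488

Do you like it this way? – User

You are saying not to use 're' at all, but you could come up with a regular expression that matches both 'http://whatever' and links with no 'http://'. – jadkik94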
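
A regular expression along the lines suggested in the comment above might look like the following sketch. It is an assumption about what such a pattern could be, not something posted in the thread: the helper name internal_href_pattern and the host www.example.com are made up. It accepts absolute URLs on the given host and root-relative paths that start with a single slash, but it does not match document-relative links such as 'file.htm'.

```python
import re


def internal_href_pattern(host):
    """Build a regex for absolute URLs on host or root-relative paths (hypothetical helper)."""
    escaped = re.escape(host)
    # First alternative: http(s)://host followed by ':', '/', '?', '#' or the end of the string
    # Second alternative: a single leading slash (but not '//', which starts another host)
    pattern = r"^(?:https?://%s(?=[:/?#]|$)|/(?!/))" % escaped
    return re.compile(pattern, re.IGNORECASE)


pattern = internal_href_pattern("www.example.com")
for href in ["http://www.example.com/a", "/folder/file.htm", "http://other.example.org/b"]:
    print(href, bool(pattern.match(href)))
```

The compiled pattern could then replace re.compile("^http://") in the attrs filter of the loop, although relative matches would still need to be joined with the site's base URL before they can be fetched.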

Crawler advice: using mechanize together with BeautifulSoup makes the job easier.

Source

2012-05-04 08:41:35 marbdq
