How to recursively find all links from a webpage with beautifulsoup?

I am trying to use some code I found in this answer to recursively find all links from a given URL: How to recursively find all links from a webpage with beautifulsoup?

import urllib2
from bs4 import BeautifulSoup

url = "http://francaisauthentique.libsyn.com/"

def recursiveUrl(url, depth):

    if depth == 5:
        return url
    else:
        page = urllib2.urlopen(url)
        soup = BeautifulSoup(page.read())
        newlink = soup.find('a')  # find just the first one
        if len(newlink) == 0:
            return url
        else:
            return url, recursiveUrl(newlink, depth + 1)


def getLinks(url):
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page.read())
    links = soup.find_all('a')
    for link in links:
        links.append(recursiveUrl(link, 0))
    return links

links = getLinks(url)
print(links)

Running it produces this warning:

/usr/local/lib/python2.7/dist-packages/bs4/__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently. 

The code that caused this warning is on line 28 of the file downloader.py. To get rid of this warning, change code that looks like this: 

BeautifulSoup(YOUR_MARKUP}) 

to this: 

BeautifulSoup(YOUR_MARKUP, "lxml")

I also get the following error:

Traceback (most recent call last): 
    File "downloader.py", line 28, in <module> 
    links = getLinks(url) 
    File "downloader.py", line 25, in getLinks 
    links.append(recursiveUrl(link,0)) 
    File "downloader.py", line 11, in recursiveUrl 
    page=urllib2.urlopen(url) 
    File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen 
    return _opener.open(url, data, timeout) 
    File "/usr/lib/python2.7/urllib2.py", line 396, in open 
    protocol = req.get_type() 
TypeError: 'NoneType' object is not callable

What is the problem?

Source

2017-10-08 Alex

I think you are passing a BeautifulSoup object to 'urlopen' rather than a URL. Try something like 'link['href']', but check that it exists first. – Thomas

Thanks, Thomas. But now I get the error "ValueError: unknown url type: /webpage/category/general". Maybe this is because it is a relative link rather than an absolute one? – Alex

@Alex correct :) –
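The point made in these comments can be illustrated with a minimal sketch. It assumes Python 3 and bs4, and the inline HTML is hypothetical markup standing in for a fetched page. It shows why a Tag object is not a URL (attribute access on a Tag looks for a child tag of that name, which would explain the 'NoneType' object is not callable error when urllib2 calls get_type() on it) and how to keep only anchors that actually have an href.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a fetched page.
html = """
<a href="/webpage/category/general">General</a>
<a name="top">anchor without href</a>
<a href="http://example.com/">absolute</a>
"""

# Naming the parser explicitly also silences the "No parser" warning.
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a")
# Attribute access on a Tag searches for a child tag with that name,
# so an unknown name such as get_type yields None instead of a method.
print(first.get_type)   # None
print(first["href"])    # /webpage/category/general

# href=True keeps only anchors that carry an href attribute,
# which avoids a KeyError on anchors such as <a name="top">.
hrefs = [a["href"] for a in soup.find_all("a", href=True)]
print(hrefs)
```

The strings in hrefs, not the Tag objects themselves, are what should be handed to urlopen or requests.get, after being made absolute.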

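Your recursiveUrl tries to access an invalid URL such as /webpage/category/general, which is the value extracted from one of the href attributes.

You need to append the extracted href value to the website's URL before trying to open the page. You will also need to work on your recursion algorithm, since I don't know what you want to achieve.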
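As an alternative to appending the href to the site URL by hand, the standard library can resolve it. The sketch below assumes Python 3 (urllib.parse; on Python 2 the same function lives in the urlparse module), and the example.com address is only a placeholder. It is not part of the answer's code.

```python
from urllib.parse import urljoin

base = "http://francaisauthentique.libsyn.com/"

# A root-relative href replaces the path of the base URL.
resolved = urljoin(base, "/webpage/category/general")
print(resolved)
# -> http://francaisauthentique.libsyn.com/webpage/category/general

# An href that is already absolute is returned unchanged.
unchanged = urljoin(base, "http://example.com/page")
print(unchanged)
# -> http://example.com/page

# Plain concatenation, by contrast, doubles the slash.
concatenated = base + "/webpage/category/general"
print(concatenated)
# -> http://francaisauthentique.libsyn.com//webpage/category/general
```

Most servers tolerate the doubled slash, but urljoin also handles absolute hrefs and page-relative hrefs, which concatenation does not.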

Code:

import requests
from bs4 import BeautifulSoup

def recursiveUrl(url, link, depth):
    # Stop after following five links in a row.
    if depth == 5:
        return url
    else:
        # Plain concatenation of site URL and href (hence the "//" in the output).
        print(url + link['href'])
        page = requests.get(url + link['href'])
        soup = BeautifulSoup(page.text, 'html.parser')
        # First anchor that has an href; find() returns None if there is none.
        newlink = soup.find('a', href=True)
        if newlink is None:
            return link
        else:
            return link, recursiveUrl(url, newlink, depth + 1)

def getLinks(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    # Collect into a separate list instead of appending to the list being iterated.
    results = []
    for link in soup.find_all('a', href=True):
        results.append(recursiveUrl(url, link, 0))
    return results

links = getLinks("http://francaisauthentique.libsyn.com/")
print(links)

Output:

http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/2017 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/2017/10 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/2017/09 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/2017/08 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/2017/07 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/category/general 
http://francaisauthentique.libsyn.com//webpage/category/general
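
The output above requests the same category page many times, because only the first link of each page is followed and nothing records what has already been visited. One possible way to rework the recursion, offered as a sketch rather than as the answer's method, is a breadth-first crawl with a set of seen URLs. It assumes Python 3; crawl, get_hrefs and fake_site are hypothetical names introduced here, and get_hrefs stands for whatever function returns the href strings of a page (for instance requests plus BeautifulSoup).

```python
from collections import deque
from urllib.parse import urljoin, urldefrag


def crawl(start_url, get_hrefs, max_depth=5):
    """Breadth-first crawl that visits each URL at most once.

    get_hrefs is a hypothetical callable: given a URL, it returns the raw
    href strings found on that page.
    """
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # keep the URL in the result but do not expand it
        for href in get_hrefs(url):
            # Resolve relative links against the current page, drop #fragment.
            absolute, _ = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                queue.append((absolute, depth + 1))
    return sorted(seen)


# Offline stand-in for a real site, so the sketch runs without network access.
fake_site = {
    "http://example.com/": ["/a", "/b", "#top"],
    "http://example.com/a": ["/", "/b"],
    "http://example.com/b": ["/a", "c"],
}

found = crawl("http://example.com/", lambda u: fake_site.get(u, []))
print(found)
```

For a real site, get_hrefs would fetch the page and return the href of every anchor that has one. A real crawl would probably also want to skip links that lead to other domains, which this sketch does not do.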

Source

2017-10-08 17:53:40 Ali
