BeautifulSoup Soup Recursive

I want to recursively retrieve the URLs of a web page and collect the results in a list.

This is the code I am using. The first loop gets the URLs found in catalog_url:

catalog_url = "http://nomads.ncep.noaa.gov:9090/dods/gfs_0p25/" 

from bs4 import BeautifulSoup # conda install -c asmeurer beautiful-soup=4.3.2 
import urllib2 
from datetime import datetime 

html_page = urllib2.urlopen(catalog_url) 
soup = BeautifulSoup(html_page) 

urls_day = [] 
for link in soup.findAll('a'): 
    if datetime.today().strftime('%Y') in link.get('href'): # String contains today's year in name 
        print link.get('href') 
        urls_day.append(link.get('href')) 

urls_final = [] 
for run in urls_day: 
    html_page2 = urllib2.urlopen(run) 
    soup2 = BeautifulSoup(html_page2) 
    for links in soup2.findAll('a'): 
        if datetime.today().strftime('%Y') in soup2.get('a'): 
            print links.get('href') 
            urls_final.append(links.get('href'))

urls_day is a list object holding the URLs that contain the current year as a string.

The second loop fails with the following output:

<a href="http://nomads.ncep.noaa.gov:9090/dods">GrADS Data Server</a> 
Traceback (most recent call last): 
    File "<stdin>", line 6, in <module> 
TypeError: argument of type 'NoneType' is not iterable

urls_final should be a list object containing the URLs I am interested in.

Is there a way to solve this? I have checked similar posts about recursion with Beautiful Soup, but I always get the same 'NoneType' response.

Source

2016-11-06 jordi vidal

You probably need 'if datetime.today().strftime('%Y') in soup2.findAll('a')' instead of '... soup2.get('a')'.
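For illustration, here is a minimal Python 3 sketch of the point this comment is getting at. As posted, soup2.get('a') is an attribute lookup on the soup object itself rather than a tag search, so it returns None, and testing "in None" raises the TypeError from the traceback. The sample HTML and the helper name hrefs_containing are made up for this example; the real catalog markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page; the real catalog markup may differ.
HTML = """
<a name="top">no href here</a>
<a href="http://example.com/gfs2016110600">run 00</a>
<a href="http://example.com/about">about</a>
"""


def hrefs_containing(html, needle):
    """Return the href values of <a> tags whose href contains `needle`."""
    soup = BeautifulSoup(html, "html.parser")
    found = []
    for link in soup.find_all("a"):
        href = link.get("href")  # None when the tag has no href attribute
        if href and needle in href:  # guard against None before using `in`
            found.append(href)
    return found


soup = BeautifulSoup(HTML, "html.parser")
print(soup.get("a"))  # attribute lookup on the soup object, not a tag search
print(hrefs_containing(HTML, "2016"))
```

Checking each link's own href, and skipping tags that have none, avoids the NoneType error without changing the shape of the loop.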

It won't work anyway. A string like 'Oct 24 04:42 UTC' is not part of the tag; it is text that comes before the tag. You need to find this text and then get the tag that follows it.
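One way to follow this suggestion, sketched in Python 3 under the assumption that each timestamp is a plain text node directly preceding its link. The HTML below is invented to match that assumption, and links_after_text is a hypothetical helper name. The string= argument needs a reasonably recent bs4; older releases spell it text=.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical listing: the timestamp is plain text that precedes each <a>.
HTML = """
<body>
Oct 24 04:42 UTC <a href="http://example.com/gfs20161024">gfs20161024</a><br>
Nov 06 10:15 UTC <a href="http://example.com/gfs20161106">gfs20161106</a><br>
</body>
"""


def links_after_text(html, pattern):
    """Return the href of the first <a> that follows each matching text node."""
    soup = BeautifulSoup(html, "html.parser")
    result = []
    # find_all(string=...) returns text nodes, not tags
    for text in soup.find_all(string=re.compile(pattern)):
        tag = text.find_next("a")  # next <a> in document order
        if tag is not None and tag.get("href"):
            result.append(tag["href"])
    return result


print(links_after_text(HTML, r"Oct 24"))
```

The pattern is applied with a regular-expression search, so a looser pattern such as r"UTC" would pick up every dated entry in this sample.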

You need to check whether the return value is NoneType before calling the recursive function. I wrote an example that you can improve on.

from bs4 import BeautifulSoup 
from datetime import datetime 
import urllib2 

CATALOG_URL = "http://nomads.ncep.noaa.gov:9090/dods/gfs_0p25/" 

today = datetime.today().strftime('%Y') 

cache = {} 


def cached(func): 
    def wraps(url): 
        # Fetch each URL only once; a repeated URL falls through and returns None 
        if url not in cache: 
            cache[url] = True 
            return func(url) 
    return wraps 


@cached 
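def links_from_url(url): 
    html_page = urllib2.urlopen(url) 
    soup = BeautifulSoup(html_page, "lxml") 
    # Keep only links whose href exists and contains the current year 
    hrefs = [link.get('href') for link in soup.findAll('a')] 
    s = set(href for href in hrefs if href and today in href) 
    # A page with no matching links is treated as a leaf: return its own URL 
    return s if len(s) else url 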


def crawl(links): 
    if not links: # Checking for NoneType 
        return 
    if isinstance(links, basestring): # hrefs parsed by bs4 are unicode, not str 
        return links 
    if len(links) > 1: 
        return [crawl(links_from_url(link)) for link in links] 


if __name__ == '__main__': 
    crawl(links_from_url(CATALOG_URL)) 
    print cache.keys()
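As a possible variation on the code above, here is a Python 3 sketch of the same recursive idea with the page fetch passed in as a function, so the traversal can be tried without network access. FAKE_SITE, get_links and the URL names are all hypothetical stand-ins; in real use get_links would download the page and extract its links.

```python
# Hypothetical in-memory "site": each URL maps to the links found on that page.
FAKE_SITE = {
    "catalog/": ["catalog/day1", "catalog/day2"],
    "catalog/day1": ["catalog/day1/run00"],
    "catalog/day2": [
        "catalog/day2/run00",
        "catalog/",  # link back to the parent, to exercise the `seen` check
    ],
}


def get_links(url):
    """Stand-in for a real fetch; returns None for pages with no entry."""
    return FAKE_SITE.get(url)


def crawl(url, get_links, seen=None):
    """Depth-first walk that returns the leaf URLs reachable from `url`."""
    if seen is None:
        seen = set()
    if url in seen:  # already visited: contribute nothing
        return []
    seen.add(url)
    children = get_links(url)
    if not children:  # covers both None and an empty list
        return [url]
    leaves = []
    for child in children:
        leaves.extend(crawl(child, get_links, seen))
    return leaves


print(crawl("catalog/", get_links))
```

Returning an empty list for visited pages, instead of None, lets every caller extend its result without a separate None check.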

Source

2016-11-06 17:13:59

I don't think this will solve the OP's problem. I'm also not sure about your 'cached' decorator: it returns nothing if 'url' is already cached. Is that intentional?

I didn't want to fetch URLs that had already been fetched.
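If skipping repeated URLs is the goal, one option is to make the skip explicit. The Python 3 sketch below is an assumption-laden variant of the decorator discussed here: fetch_once and FAKE_SITE are hypothetical names, and the decorated function reads from a dictionary instead of the network. It returns an empty set for a URL that was already seen, so callers can always iterate over the result.

```python
import functools


def fetch_once(func):
    """Call func(url) the first time a URL is seen; return an empty set afterwards."""
    seen = set()

    @functools.wraps(func)
    def wrapper(url):
        if url in seen:
            return set()  # explicit empty result instead of an implicit None
        seen.add(url)
        return func(url)

    wrapper.seen = seen  # exposed so callers can inspect the visited URLs
    return wrapper


# Hypothetical stand-in for real pages and their links.
FAKE_SITE = {"root": {"a", "b"}, "a": {"root"}, "b": set()}


@fetch_once
def links_from_url(url):
    return FAKE_SITE.get(url, set())


first = links_from_url("root")
second = links_from_url("root")
print(first, second)
```

Keeping the visited set on the wrapper also replaces the module-level cache dictionary, which makes the decorator reusable for more than one function.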
