Filtering out unwanted Base64 links with BeautifulSoup

I wrote a simple image scraper script that I use most of the time. I came across a website that has some nice jpg wallpapers. The script works fine, but it also prints unwanted base64 data image links. How can I exclude these links?

import requests 
from bs4 import BeautifulSoup 

r = requests.get('https://www.hongkiat.com/blog/60-most-execellent-ubuntu-wallpapers/') 
soup = BeautifulSoup(r.content, 'lxml') 

for link in soup.find_all('img'):
    image = link.get('src')
    print(image)

Output:

https://assets.hongkiat.com/uploads/60-most-execellent-ubuntu-wallpapers/cloudy-ubuntu-mate.jpg 
data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw== 
https://assets.hongkiat.com/uploads/60-most-execellent-ubuntu-wallpapers/ubuntu-feeling.jpg 
data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw== 
https://assets.hongkiat.com/uploads/60-most-execellent-ubuntu-wallpapers/two-gentlemen-in-car.jpg 
data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==

Update: thanks for the help. The finished code, which downloads all of the images, now looks like this. Cheers:

import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.hongkiat.com/blog/60-most-execellent-ubuntu-wallpapers/')
soup = BeautifulSoup(r.content, 'lxml')

# Select only <img> tags whose src ends in ".jpg"; this skips the base64 data URIs.
for link in soup.select('img[src$=".jpg"]'):
    image = link['src']
    # The src is already an absolute URL, so its last path segment is the file name.
    image_name = image.split('/')[-1]
    print('Downloading: {}'.format(image_name))
    r2 = requests.get(image)
    with open(image_name, 'wb') as f:
        f.write(r2.content)
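
As a side note on the file name step: splitting on '/' works for the URLs shown above, but it would keep a query string if a src ever had one. Here is a minimal sketch of a more defensive variant using only the standard library (Python 3). The helper name filename_from_url and the fallback name image.jpg are hypothetical, not part of the original script.

```python
import posixpath
from urllib.parse import urlsplit, unquote


def filename_from_url(url, default='image.jpg'):
    """Return the last path segment of a URL, ignoring any query or fragment.

    `default` is a hypothetical fallback used when the path has no file name.
    """
    path = urlsplit(url).path                  # drops "?query" and "#fragment"
    name = posixpath.basename(unquote(path))   # decode %20 etc., keep last segment
    return name or default


example = ('https://assets.hongkiat.com/uploads/'
           '60-most-execellent-ubuntu-wallpapers/cloudy-ubuntu-mate.jpg?ver=2')
print(filename_from_url(example))
```

In the download loop, image_name = filename_from_url(image) could then stand in for the split('/') line.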

2017-12-16 uzdisral

Give this a shot. It should get you the result you want. I used .select() here instead of .find_all().

import requests 
from bs4 import BeautifulSoup 

r = requests.get('https://www.hongkiat.com/blog/60-most-execellent-ubuntu-wallpapers/') 
soup = BeautifulSoup(r.content, 'lxml') 

for link in soup.select('img[src$=".jpg"]'): 
    print(link['src'])
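
To see what the selector does without hitting the network, here is a small sketch run against made-up markup. The HTML and the example.com URLs are assumptions for illustration only, and 'html.parser' is used so that lxml is not required. [src$=".jpg"] matches a src that ends with ".jpg", while [src^="data:"] matches a src that starts with "data:", so wrapping the latter in :not() keeps every image except the base64 ones.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: real images mixed with a base64 placeholder GIF.
SAMPLE_HTML = """
<img src="https://example.com/uploads/a.jpg">
<img src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==">
<img src="https://example.com/uploads/b.png">
<img src="https://example.com/uploads/c.jpg">
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# Keep only images whose src ends with ".jpg".
jpg_links = [img['src'] for img in soup.select('img[src$=".jpg"]')]

# Keep every image that has a src, except those starting with "data:".
non_data_links = [img['src']
                  for img in soup.select('img[src]:not([src^="data:"])')]

print(jpg_links)
print(non_data_links)
```

The second selector is the more general of the two, since it also keeps png or gif files that are served from real URLs.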

Or, if you prefer to do the same thing with .find_all():

for link in soup.find_all('img'):
    src = link.get('src', '')  # .get() avoids a KeyError on <img> tags without src
    if src.endswith('.jpg'):
        print(src)
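
If the goal is to drop only the base64 entries rather than to keep only .jpg files, the check can be made on the "data:" prefix instead. This is a rough sketch, and the helper name is_downloadable_src is hypothetical. The sample values are taken from the output in the question, plus a None to stand in for an <img> tag with no src.

```python
def is_downloadable_src(src):
    """Return False for missing/empty values and data: URIs, True otherwise."""
    if not src:
        return False
    # The URI scheme is case-insensitive, so compare in lower case.
    return not src.strip().lower().startswith('data:')


# With BeautifulSoup this list would come from something like:
#   sources = [img.get('src') for img in soup.find_all('img')]
sources = [
    'https://assets.hongkiat.com/uploads/60-most-execellent-ubuntu-wallpapers/cloudy-ubuntu-mate.jpg',
    'data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==',
    None,
]

kept = [s for s in sources if is_downloadable_src(s)]
print(kept)
```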

2017-12-16 19:07:14 SIM

That worked, thank you. So in this case you use select instead of find_all? And what about ('img[src$=".jpg"]')? – uzdisral

See the edited part. – SIM

Great. I've updated my code as well :) Thanks again – uzdisral
