Getting a download link from a web page using Python 2.7

I'm writing this program to make a repetitive task less annoying. It is supposed to take a link, filter for the "Download STV Demo" button, grab the URL from that button and use it to download the file. Downloading a file from a URL works fine; I just can't open the URL. It will download from stackoverflow, just not from the site I want: I get a 403 Forbidden error. Does anyone have ideas on how to get this working with http://sizzlingstats.com/stats/479453, and how to filter for that download STV button?

import random, sys, urllib2, httplib2, win32clipboard, requests, urlparse 
from copy import deepcopy 
from bs4 import SoupStrainer 
from bs4 import BeautifulSoup 
from urllib2 import Request 
from urllib2 import urlopen 
#When I wrote this, only God and I knew what I was writing 
#Now only God knows 

page = raw_input("Please copy the .ss link and hit enter... ") 
win32clipboard.OpenClipboard() 
page = win32clipboard.GetClipboardData() 
win32clipboard.CloseClipboard() 
s = page 
try: 
    page = s.replace("http://","http://www.") 
    print page + " Found..." 
except: 
    page = s.replace("www.","http://www.") 
    print page 

req = urllib2.Request(page, '', headers = { 'User-Agent' : 'Mozilla/5.0' }) 
req.headers['User-agent'] = 'Mozilla/5.0' 
req.add_header('User-agent', 'Mozilla/5.0') 
print req 
soup = BeautifulSoup(page, 'html.parser') 
print soup.prettify() 
links = soup.find_all("Download STV Demo") 
for tag in links: 
    link = links.get('href',None) 
    if "Download STV Demo" in link: 
     print link 

file_name = page.split('/')[-1] 
u = urllib2.urlopen(page) 
f = open(file_name, 'wb') 
meta = u.info() 
file_size = int(meta.getheaders("Content-Length")[0]) 
print "Downloading: %s Bytes: %s" % (file_name, file_size) 

file_size_dl = 0 
block_sz = 8192 
while True: 
    buffer = u.read(block_sz) 
    if not buffer: 
     break 
    file_size_dl += len(buffer) 
    f.write(buffer) 
    status = r"%10d [%3.2f%%]" % (file_size_dl, file_size_dl * 100./file_size) 
    status = status + chr(8)*(len(status)+1) 
    print status, 
f.close()

Source

2016-06-28 Ben Smith

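You need to add a user agent, and why are you passing the raw_input output to BS4?

I do have a user agent, and the page from the raw input is overwritten immediately. It is unnecessary though, so I'll remove it.

Either way it doesn't matter at all, because the page content is created dynamically. Look at the XHR tab in the Chrome developer tools and you can get all the data you need in JSON format.

The content of this page is generated dynamically with JavaScript from an API.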

>>> import requests 
>>> 
>>> requests.get('http://sizzlingstats.com/api/stats/479453').json()['stats']['stvUrl'] 
u'http://sizzlingstv.s3.amazonaws.com/stv/479453.zip'
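
A minimal sketch of the same lookup as small helpers, assuming the API path and the `stats` / `stvUrl` keys shown above stay as they are. The names `stats_id`, `api_url` and `stv_url` are made up for illustration, and the network call is left out so the parsing can be checked offline.

```python
import json

API_BASE = 'http://sizzlingstats.com/api/stats/'  # taken from the request above


def stats_id(page_url):
    """Return the last path segment of a stats page URL, e.g. '479453'."""
    return page_url.rstrip('/').split('/')[-1]


def api_url(page_url):
    """Build the JSON API URL that corresponds to a stats page URL."""
    return API_BASE + stats_id(page_url)


def stv_url(payload):
    """Pull the demo download URL out of an already-decoded API response."""
    return payload['stats']['stvUrl']


# Offline check with a hand-written payload shaped like the response above.
sample = json.loads(
    '{"stats": {"stvUrl": "http://sizzlingstv.s3.amazonaws.com/stv/479453.zip"}}'
)
print(api_url('http://sizzlingstats.com/stats/479453'))
print(stv_url(sample))
```

With requests installed, `stv_url(requests.get(api_url(page)).json())` would do the real lookup.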

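You are seeing the 403 because your user agent is being blocked.

You created the req object with a user agent, but you never use it; you call urllib2.urlopen(page) instead.

You are also passing page, which is just the URL string, to BeautifulSoup, which is a mistake.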

soup = BeautifulSoup(page, 'html.parser')
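
As a rough sketch of that fix: build the request once with the header, then hand that request object, not the bare URL string, to `urlopen`. The helper names `build_request` and `fetch_html` are hypothetical, and the `try`/`except` import is only there so the snippet also runs on Python 3.

```python
try:
    from urllib2 import Request, urlopen         # Python 2.7
except ImportError:
    from urllib.request import Request, urlopen  # Python 3


def build_request(url, user_agent='Mozilla/5.0'):
    """Create a request that carries a browser-like User-Agent header."""
    return Request(url, headers={'User-Agent': user_agent})


def fetch_html(url):
    """Open the URL through the prepared request and return the raw body."""
    response = urlopen(build_request(url))
    try:
        return response.read()
    finally:
        response.close()


req = build_request('http://sizzlingstats.com/stats/479453')
# urllib stores header names capitalized, hence 'User-agent' here.
print(req.get_header('User-agent'))
```

The bytes returned by `fetch_html`, rather than the URL itself, are what BeautifulSoup should receive.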

Source

2016-06-28 09:26:13

Let's look at your code. First, you import many modules that you don't use (maybe this isn't the whole code), and some others that you do use but don't need. In fact, you only need:

from urllib2 import urlopen

(you'll see why later) and probably win32clipboard for your input. Your input is OK, so I'll leave this part of the code:

import win32clipboard 
page = raw_input("Please copy the .ss link and hit enter... ")  # raw_input: Python 2's input() would eval what is typed
win32clipboard.OpenClipboard() 
page = win32clipboard.GetClipboardData() 
win32clipboard.CloseClipboard()

But I really don't see the purpose of this kind of input. It would be easier to just use something like:

page = raw_input("Please enter the .ss link: ")

Then, this part of the code is really unnecessary:

s = page 
try:            
    page = s.replace("http://","http://www.") 
    print page + " Found..."     
except:            
    page = s.replace("www.","http://www.")  
    print page
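
To see why that block does not help: `str.replace` does not raise when the substring is missing, so the `except` branch is effectively never reached, and a URL that already contains `www.` gets it a second time. A quick check, with `add_www` as a made-up name for what the `try` branch does:

```python
def add_www(url):
    """Same substitution as the try branch in the question."""
    return url.replace("http://", "http://www.")


# A URL without "www." gets one, as intended.
print(add_www("http://sizzlingstats.com/stats/479453"))
# A URL that already has it ends up with "www.www.".
print(add_www("http://www.sizzlingstats.com/stats/479453"))
# No "http://" at all: replace() returns the text unchanged and raises nothing.
print(add_www("sizzlingstats.com/stats/479453"))
```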

So I'll just remove it. The next part should look like this:

from urllib2 import Request, urlopen 
from bs4 import BeautifulSoup 
req = Request(page, headers = { 'User-Agent' : 'Mozilla/5.0' }) 
#req.headers['User-agent'] = 'Mozilla/5.0'  # you don't need this 
#req.add_header('User-agent', 'Mozilla/5.0') # you don't need this 
print req 
html = urlopen(req)  #you need to open page with urlopen before using BeautifulSoup 
# it is to fix this error: 
##  UserWarning: "b'http://www.sizzlingstats.com/stats/479453'" looks like a URL. 
##  Beautiful Soup is not an HTTP client. You should probably use an HTTP client 
##  to get the document behind the URL, and feed that document to Beautiful Soup. 
soup = BeautifulSoup(html, 'html.parser') # variable page changed to html 
# print soup.prettify()   # commented out because you don't need to print the HTML, 
                          # but uncomment it if you want to see that it works

But I won't use this code, and I'll try to explain why. If you need to scrape some other page with BeautifulSoup, though, you can use it.

You don't need it here because of this part:

links = soup.find_all("Download STV Demo")

The problem is that there is no "Download STV Demo" in the HTML code, at least not in the HTML that soup sees, because the page is built by JavaScript. So there is nothing for find_all to find; you can use print(links) to check that links == []. Because of this, you don't need this part either:

for tag in links: 
    link = links.get('href',None)  # like I said, there is no use for this 
    if "Download STV Demo" in link:  # because the variable links is an empty list 
        print link

As I said, the part of the page that holds the link we need is created by JavaScript. You could scrape the scripts to find it, but that would be a lot harder. However, if you look at the URL we are trying to find, it looks like this:

http://sizzlingstv.s3.amazonaws.com/stv/479453.zip

Now look at the URL you already have; it looks like this:

http://sizzlingstats.com/stats/479453

To get the link http://sizzlingstv.s3.amazonaws.com/stv/479453.zip you only need its last part, in this case 479453, and you already have it: it is also the last part of your own link (http://sizzlingstats.com/stats/479453). You even use that number as file_name. Here is code that does exactly that:

file_name = page.split('/')[-1]
download_link = 'http://sizzlingstv.s3.amazonaws.com/stv/' + file_name + '.zip'

After that, I'll copy some of your code:

u = urlopen(download_link)
meta = u.info()
file_size = int(meta.getheaders("Content-Length")[0])
print "Downloading: %s Bytes: %s" % (file_name, file_size)

The part below this works:

f = open(file_name + '.zip', 'wb') # I added '.zip'
file_size_dl = 0
block_sz = 8192
while True:
    buffer = u.read(block_sz)
    if not buffer:
        break
    file_size_dl += len(buffer)
    f.write(buffer)
    status = r"%10d [%3.2f%%]" % (file_size_dl, file_size_dl * 100./file_size)
    status = status + chr(8)*(len(status)+1)
    print status,  # trailing comma keeps the progress on one line
f.close()

Maybe you want to see the download progress message, but I think it's easier to use:

f = open(file_name + '.zip', 'wb')
f.write(u.read())
print "Downloaded"
f.close()
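
One trade-off between the two variants: `f.write(u.read())` holds the whole archive in memory at once, while the loop copies it block by block. A small sketch of the loop as a function, assuming only that the source has `read(n)` and the destination has `write(data)`. `copy_in_chunks` is a hypothetical helper, and `io.BytesIO` stands in for the network response so it can be tried offline.

```python
import io


def copy_in_chunks(src, dst, block_size=8192):
    """Copy src to dst block by block and return the number of bytes copied."""
    total = 0
    while True:
        chunk = src.read(block_size)
        if not chunk:          # empty read means the source is exhausted
            break
        dst.write(chunk)
        total += len(chunk)
    return total


# Stand-ins for the urlopen() response and the output file.
fake_response = io.BytesIO(b"x" * 20000)
out = io.BytesIO()
copied = copy_in_chunks(fake_response, out)
print(copied)
```

In the real script, `src` would be the object returned by `urlopen(download_link)` and `dst` the file opened with `'wb'`.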

And here is just the code:

from urllib2 import urlopen 

import win32clipboard 
page = raw_input("Please copy the .ss link and hit enter... ")  # raw_input: Python 2's input() would eval what is typed
win32clipboard.OpenClipboard() 
page = win32clipboard.GetClipboardData() 
win32clipboard.CloseClipboard() 

# or use: 
# page = raw_input("Please enter the .ss link: ") 

file_name = page.split('/')[-1] 
download_link = 'http://sizzlingstv.s3.amazonaws.com/stv/' + file_name + '.zip' 
u = urlopen(download_link) 
meta = u.info()  
file_size = int(meta.getheaders("Content-Length")[0]) 
print "Downloading: %s Bytes: %s" % (file_name, file_size) 

f = open(file_name + '.zip', 'wb') # I added '.zip' 
file_size_dl = 0 
block_sz = 8192 
while True: 
    buffer = u.read(block_sz) 
    if not buffer: 
     break 
    file_size_dl += len(buffer) 
    f.write(buffer) 
    status = r"%10d [%3.2f%%]" % (file_size_dl, file_size_dl * 100./file_size) 
    status = status + chr(8)*(len(status)+1) 
    print status,  # trailing comma keeps the progress on one line
f.close() 

# or use: 
##f = open(file_name + '.zip', 'wb') 
##f.write(u.read()) 
##print "Downloaded" 
##f.close()

Source

2016-06-28 19:23:56 ands

It is sad to see an answer like this without any comments. It also looks as though the OP neither accepted nor upvoted your answer. Thank you for your effort and the detailed explanation. – xverges
