I have some code in Python to scrape some data (ratings from reviews) from TripAdvisor. The problem is that every time I run the code a different number of rows shows up, and not all of the web pages get scraped.
The IndexError I get is the following:
Traceback (most recent call last):
File "C:/Users/thimios/PycharmProjects/TripadvisorScrapping/proxiro.py", line 26, in <module>
rating = soup.findAll("div", {'class': 'rating reviewItemInline'})[i]
IndexError: list index out of range
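As far as I can tell, this means that soup.findAll("div", {'class': 'rating reviewItemInline'}) returned fewer than 10 elements for that page, so indexing it with a fixed range(0, 10) runs past the end of the list. A minimal illustration of the same failure, with a made-up list instead of my real data:

items = ["50", "40", "30"]   # pretend only 3 rating divs were found on the page
for i in range(0, 10):
    print(items[i])          # raises IndexError: list index out of range once i reaches 3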
The code is the following:
from bs4 import BeautifulSoup
import os
import urllib.request

file2 = open(os.path.expanduser(r"~/Desktop/TripAdviser Reviews2.csv"), "wb")
file2.write(b"Organization,Rating" + b"\n")

WebSites = [
    "https://www.tripadvisor.com/Hotel_Review-g189400-d198932-Reviews-Hilton_Athens-Athens_Attica.html#REVIEWS"]
Checker = "REVIEWS"

# looping through each site until it hits a break
for theurl in WebSites:
    thepage = urllib.request.urlopen(theurl)
    soup = BeautifulSoup(thepage, "html.parser")
    # print(soup)

    while True:
        # Extract ratings from the text reviews
        altarray = ""
        for i in range(0, 10):
            rating = soup.findAll("div", {'class': 'rating reviewItemInline'})[i]
            rating1 = rating.find_all("span")[0]
            rating2 = rating1['class'][1][-2:]
            print(rating2)
            if len(altarray) == 0:
                altarray = [rating2]
            else:
                altarray.append(rating2)
        # print(altarray)
        # print(len(altarray))
        # print(type(altarray))

        # Extract Organization
        Organization1 = soup.find(attrs={'class': 'heading_name'})
        Organization = Organization1.text.replace('"', ' ').replace('Review of', ' ').strip()
        # print(Organization)

        # Loop through each review on the page
        for x in range(0, 10):
            Rating = altarray[x]
            Rating = str(Rating)
            # print(Rating)
            # print(type(Rating))
            Record2 = Organization + "," + Rating
            if Checker == "REVIEWS":
                file2.write(bytes(Record2, encoding="ascii", errors='ignore') + b"\n")

        link = soup.find_all(attrs={"class": "nav next rndBtn ui_button primary taLnk"})
        # print(link)
        # print(link[0])
        if len(link) == 0:
            break
        else:
            soup = BeautifulSoup(urllib.request.urlopen("http://www.tripadvisor.com" + link[0].get('href')),
                                 "html.parser")
            # print(soup)
            # print(Organization)
            print(link[0].get('href'))
            Checker = link[0].get('href')[-7:]
            # print(Checker)

file2.close()
I am assuming that TripAdvisor does not give full access to the data? Any ideas?
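One thing I was wondering: would the right fix simply be to loop over whatever find_all actually returns instead of assuming every page has exactly 10 reviews, since the last page may contain fewer? A rough sketch of what I mean, reusing the class name from my code above (untested, just an idea):

ratings = soup.find_all("div", {"class": "rating reviewItemInline"})
altarray = []
for rating in ratings:                   # iterate over however many rating divs exist
    rating1 = rating.find_all("span")[0]
    rating2 = rating1["class"][1][-2:]   # e.g. "bubble_40" -> "40"
    altarray.append(rating2)
# ... and later write one row per entry with "for Rating in altarray:" instead of range(0, 10)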
Minor terminology correction: the terms are "scrape" and "scraping". –