Is there a way to scrape different pages with BeautifulSoup when the source format is inconsistent?

I want to scrape data from multiple pages whose structure is not completely different but also not exactly the same. With d = soup.findAll('span', class_='property__base-info__value'), my code processes every page where len(d) == 7 and skips all the others. How can I get all of the pages? Is it possible to introduce the variables that do not exist on a page and give them NA values? This is my code:

import datetime as dt
import re

import requests
from bs4 import BeautifulSoup

# One list per output column.
A = []  # region
B = []  # rooms
C = []  # area
D = []  # date
E = []  # asking price (Utropspris)
F = []  # monthly fee (avgift)
G = []  # property type
H = []  # running cost (Driftskostnad)
I = []  # floor (Våning)
J = []  # year built (Byggår)
K = []  # price per square metre
L = []  # sold price

url = ['https://www.booli.se/annons/2272818', 'https://www.booli.se/annons/2082826']

for page in url:
    request = requests.get(page)
    soup = BeautifulSoup(request.text, 'lxml')
    d = soup.findAll('span', class_='property__base-info__value')
    # Only pages with exactly seven base-info values are processed.
    if len(d) == 7:
        region = soup.findAll('span', itemprop='name')[1].text.strip()
        # Rooms and area share one title span.
        a = soup.findAll('span', class_='property__base-info__title__size')
        ar = a[0].text.strip().split()
        room = ar[0]
        area = ar[2]
        # The seven base-info values are read by position.
        temp = [i.text.strip() for i in d]
        full_date = temp[0]
        date = dt.datetime.strptime(full_date, '%d %b %Y').strftime('%Y-%m-%d')
        tempo = temp[1].split('\n')[0]
        Utropspris = tempo.replace('kr', '')
        estimate = re.sub(r'(\d)\s+(\d)', r'\1\2', Utropspris)
        avgift = temp[2].replace('kr/mån', '')
        fee = re.sub(r'(?<=\d) (?=\d)', '', avgift)
        apt = []
        lag = temp[3]
        if lag == 'Lägenhet':
            apt = 'apartment'
        cost = temp[4].replace('kr/mån', '')
        floor = temp[5].replace('tr', '')
        year = temp[6]
        test = soup.find('span', class_='property__base-info__sub-value').text.strip().replace('kr/m²', '')
        krm2 = re.sub(r'(?<=\d) (?=\d)', '', test)
        main = soup.find('span', class_='property__base-info__title__price').text.strip().split('\n')[0].replace('kr', '')
        price = re.sub(r'(?<=\d) (?=\d)', '', main)
        A.append(region)
        B.append(room)
        C.append(area)
        D.append(date)
        E.append(estimate)
        F.append(fee)
        G.append(apt)
        H.append(cost)
        I.append(floor)
        J.append(year)
        K.append(krm2)
        L.append(price)

Update (from the asker's own answer)

I can change the limit, but that does not give me the correct output. len(d) == 7 is the case where I get all of the information. If I set the limit to more than 5, then for one house I get:

room, area, cost (Driftskostnad), floor (Våning), estimated price (Utropspris)

and for another house:

room, area, fee (avgift), year built (Byggår), estimated price (Utropspris)
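
To make the goal concrete, the two listings above could be laid out as rows over a shared set of columns, with NA wherever a field is absent. This is only a sketch of the desired result: the column names follow the lists above, and the values are made up for illustration.

```python
NA = 'NA'  # placeholder for a field that a page does not provide

# Shared column order for every listing.
COLUMNS = ['room', 'area', 'fee', 'cost', 'floor', 'year', 'estimate']

# Made-up values; only the set of fields present mirrors the two houses.
listings = [
    {'room': '2', 'area': '55', 'cost': '500', 'floor': '3', 'estimate': '1500000'},
    {'room': '3', 'area': '70', 'fee': '3200', 'year': '1965', 'estimate': '2000000'},
]

# dict.get falls back to NA when a listing lacks the column.
rows = [[listing.get(col, NA) for col in COLUMNS] for listing in listings]

for row in rows:
    print(row)
```

Every row then has the same length, so the per-column lists stay aligned even when a page is missing some fields.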

Source

2017-03-20 Mary

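If I understand correctly, you are asking how to scrape data that you do not know exists. Computers are the worst things in the universe.

You said your code goes through the pages where len(d) == 7. Can you set a different limit?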
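
Rather than tuning the length check, one option is to key each value by its label, so the number of fields on a page no longer matters. The following is a minimal sketch under a loud assumption: the sample markup and the class name property__base-info__label are hypothetical and not taken from the real pages, which may pair labels and values differently. Only property__base-info__value comes from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the label class and the layout are assumptions.
SAMPLE = """
<ul>
  <li><span class="property__base-info__label">Avgift</span>
      <span class="property__base-info__value">3 200 kr/mån</span></li>
  <li><span class="property__base-info__label">Byggår</span>
      <span class="property__base-info__value">1965</span></li>
</ul>
"""

# Fields we want for every page, whether or not the page has them.
WANTED = ['Avgift', 'Driftskostnad', 'Våning', 'Byggår']


def fields_by_label(soup):
    """Map each label text to the text of the value span that follows it."""
    fields = {}
    for label in soup.find_all('span', class_='property__base-info__label'):
        value = label.find_next_sibling('span', class_='property__base-info__value')
        if value is not None:
            fields[label.text.strip()] = value.text.strip()
    return fields


soup = BeautifulSoup(SAMPLE, 'html.parser')
found = fields_by_label(soup)

# Missing labels fall back to 'NA' instead of causing the page to be skipped.
row = {name: found.get(name, 'NA') for name in WANTED}
print(row)
```

With this approach the positional indexing into temp goes away, and a page with five fields is handled the same way as a page with seven.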

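Is it possible to introduce the variables that do not exist on a page and give them NA values?

Yes. You can check whether an element (field) exists with a simple if VARIABLE == None: or just if VARIABLE: (which should be true if it holds any data). Hopefully that answers your question. I would need to go into the details to answer it properly, so edit your question and I will reply.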
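
The existence check can be wrapped in a small helper so that every lookup either returns text or a placeholder. This is a sketch, not a definitive implementation: text_or_na and the NA constant are hypothetical names, and the HTML snippet is a stand-in for a real page. It relies on soup.find returning None when nothing matches.

```python
from bs4 import BeautifulSoup

NA = 'NA'  # placeholder for fields a page does not have


def text_or_na(soup, name, **attrs):
    """Return the stripped text of the first matching tag, or NA if absent."""
    tag = soup.find(name, **attrs)
    return tag.text.strip() if tag is not None else NA


# Stand-in page that has a price span but no sub-value span.
html = '<div><span class="property__base-info__title__price">2 500 000 kr</span></div>'
soup = BeautifulSoup(html, 'html.parser')

price = text_or_na(soup, 'span', class_='property__base-info__title__price')
sub_value = text_or_na(soup, 'span', class_='property__base-info__sub-value')
print(price, sub_value)
```

Calling the helper for each field avoids the AttributeError that .text raises on None when a span is missing.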

Source

2017-03-20 03:46:25

If something does not exist, variable == None does not work. I think it should be variable == []. – Mary

That is an empty list. –
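
The difference discussed in these comments comes from which lookup is used: find returns None when nothing matches, while find_all (the current name for findAll) returns an empty list-like result. A short sketch on a stand-in snippet, with the class names present and absent chosen only for illustration:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p><span class="present">x</span></p>', 'html.parser')

# Single lookup: None when there is no match.
missing_one = soup.find('span', class_='absent')
# Multi lookup: an empty result set (a list subclass) when there is no match.
missing_all = soup.find_all('span', class_='absent')

print(missing_one is None)
print(len(missing_all))

# A truthiness test covers both cases without comparing to None or [].
for result in (missing_one, missing_all):
    if not result:
        print('missing')
```

So comparing against [] is the right check for findAll results, and comparing against None is the right check for find results.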
