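So, if I understand the question right, you want only the unique names: strip the size with a regex and add each name to a set. Without that step, every product is listed once per size: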
Balsam Fir 15 ml
Balsam Fir 30 ml
Balsam Fir 5 ml
Basil Essential Oil 15ml
Basil Essential Oil 30ml
Basil Essential Oil 3ml
Basil Essential Oil 5ml
Bergamot Essential Oil 15ml
...
This is the script that prints that list:
import requests
from bs4 import BeautifulSoup
# You need to store cookies so use a session.
s = requests.Session()
# Request a page first to get the cookie.
s.get("https://www.theherbarium.com/products/?category=Essential%20Oils%20And%20Accessories")
# Make the real request.
page = s.get("https://www.theherbarium.com/products/?category=Essential%20Oils%20And%20Accessories&subcategory=Superior%20Quality%20Essential%20Oils")
soup = BeautifulSoup(page.content,'html.parser')
# Get the div.
divs = soup.find_all('div', attrs={'class': 'col-sm-4 column-spacer'})
# Get the a element text.
for div in divs:
    print(div.find('a').text)
To keep only the unique names, strip the size and collect the results in a set:
import requests
from bs4 import BeautifulSoup
import re
# You need to store cookies so use a session.
s = requests.Session()
# Request a page first to get the cookie.
s.get("https://www.theherbarium.com/products/?category=Essential%20Oils%20And%20Accessories")
# Make the real request.
page = s.get("https://www.theherbarium.com/products/?category=Essential%20Oils%20And%20Accessories&subcategory=Superior%20Quality%20Essential%20Oils")
soup = BeautifulSoup(page.content,'html.parser')
# Get the div.
divs = soup.find_all('div', attrs={'class': 'col-sm-4 column-spacer'})
# Get the a element text.
a = set()
for div in divs:
    text = div.find('a').text
    # Strip a trailing size such as " 15 ml" or "30ml" before adding the name.
    a.add(re.sub(r'\s*\d+\s*ml$', '', text))
print(a)
Output:
{'Lavender, Bulgarian Essential Oil', 'Thyme, White', 'Mandarin, Red Essential Oil', 'Pine Needle Essential Oil', 'Lemongrass Essential Oil', 'Fir Needle, Siberian', 'Spruce', 'Peppermint', 'Lime Essential Oil', 'Myrrh', 'Juniper Essential Oil', 'Petitgrain', 'Wintergreen', 'Lemon Essential Oil', 'Palmarosa', 'Balsam Fir', 'Chamomile, Roman', 'Cypress', 'Citronella', 'Rosemary', 'Lemon myrtle Essential Oil', 'Clary Sage', 'Cinnamon Bark', 'Frankincense', 'Tangerine', 'Cocoa, Absolute', 'Spearmint', 'Ravensara Essential Oil', 'Spike Lavender Essential Oil', 'Hyssop', 'Ylang Ylang', 'Basil Essential Oil', 'Bergamot Essential Oil', 'Fir Needle, Siberian1', 'Geranium Bourbon', 'Patchouli', 'Black Pepper Essential Oil', 'Fennel', 'Grapefruit Essential Oil', 'Eucalyptus', 'Carrot Seed Essential Oil', 'Chamomile, German', 'Vetiver', 'Tea Tree', 'Ginger', 'Marjoram, Sweet', 'Clove Bud'}
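The regex step can be tried offline, without hitting the site. The following is only a minimal sketch: `unique_names` and `SIZE_SUFFIX` are hypothetical names that are not part of the answer above, the sample titles are copied from the first output listing, and the extra `strip()` call is an assumption about stray whitespace in the link text. It uses `dict.fromkeys` instead of a `set` so the names keep their page order.

```python
import re

# Hypothetical helper names; the pattern is the one used in the answer,
# written as a raw string and compiled once.
SIZE_SUFFIX = re.compile(r'\s*\d+\s*ml$')


def unique_names(titles):
    """Strip a trailing size such as ' 15 ml' or '30ml' and drop duplicates,
    keeping the order in which names first appear."""
    stripped = (SIZE_SUFFIX.sub('', title.strip()) for title in titles)
    # dict keys are unique and preserve insertion order.
    return list(dict.fromkeys(stripped))


# Sample titles taken from the first output listing.
sample = [
    'Balsam Fir 15 ml',
    'Balsam Fir 30 ml',
    'Balsam Fir 5 ml',
    'Basil Essential Oil 15ml',
    'Basil Essential Oil 30ml',
    'Basil Essential Oil 3ml',
    'Basil Essential Oil 5ml',
    'Bergamot Essential Oil 15ml',
]

print(unique_names(sample))
```

If ordering does not matter, a plain `set` as in the answer is just as good; the ordered version only makes the result easier to compare against the page.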
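Sample HTML –
The document is not formatted properly, so I am sharing the link instead; I recommend viewing it in Chrome developer tools. I am trying to extract the first instance of each product name (the first column of each row). – rickyjoepr
@rickyjoepr Do you want to get the link to each product on the site? –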