2016-12-17 8 views
2

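Scrapy or BeautifulSoup to scrape links and text from multiple websites

I'm trying to scrape links from a URL that the user enters, but it only works for one URL. How can I make it scrape from any URL that is entered? I'm using BeautifulSoup, but would Scrapy be better suited for this?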

import urllib.request

from bs4 import BeautifulSoup


def WebScrape():
    linktoenter = input('Where do you want to scrape from today?: ')
    url = linktoenter
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, "lxml")

    if linktoenter in url:
        print('Retrieving your links...')
        links = {}
        n = 0
        link_title = soup.findAll('a', {'class': 'title'})
        n += 1
        links[n] = link_title
        for eachtitle in link_title:
            print(eachtitle['href'] + "," + eachtitle.string)
    else:
        print('Please enter another Website...')
+1

What do you mean by "it only works for one URL"? What happens when you give it a different one? Do you get an error or unexpected results? Which other URLs did you try?

+0

It looks like the site that works has `class="title"` on all the links you are trying to reach, and that is what your code depends on.
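
To illustrate the point in the comment above, here is a minimal sketch. The two HTML snippets (`page_with_titles`, `page_without_titles`) and the helper names are made up for illustration; they are not taken from any real site. It suggests why a search tied to `class="title"` finds nothing on pages that do not use that class, while a search for any `<a>` with an `href` still does.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippets standing in for two different sites.
page_with_titles = '<a class="title" href="/a">First</a><a href="/b">Second</a>'
page_without_titles = '<a class="headline" href="/c">Third</a>'


def title_links(html):
    """Return (href, text) pairs for <a class="title"> tags only."""
    soup = BeautifulSoup(html, "html.parser")
    return [(a["href"], a.get_text())
            for a in soup.find_all("a", {"class": "title"}, href=True)]


def all_links(html):
    """Return (href, text) pairs for every <a> tag that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    return [(a["href"], a.get_text()) for a in soup.find_all("a", href=True)]


print(title_links(page_with_titles))     # only the class="title" link
print(title_links(page_without_titles))  # nothing matches
print(all_links(page_without_titles))    # the generic search still finds it
```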

Answers

1

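You can write a more general scraper that searches all tags and collects every link found in them. Once you have the full list of links, you can use a regular expression (or something similar) to pick out the links that match the structure you want.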

import re

import requests
from bs4 import BeautifulSoup

response = requests.get('http://www.businessinsider.com')

# name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(response.content, 'html.parser')

# find every tag that carries an href attribute and keep the non-empty values;
# searching the whole document once avoids the duplicates that a search
# inside each nested tag would produce
links = [tag['href'] for tag in soup.find_all(href=True) if tag['href']]

# example: filter only careerbuilder links
# (map() and filter() are lazy in Python 3, so build the list explicitly)
careerbuilder_links = [
    link for link in links
    if re.search(r'[w]{3}\.careerbuilder\.com', link)
]
print(careerbuilder_links)
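
As a rough sketch of the same idea in reusable form, the function below collects links from an HTML string, resolves relative links against a base URL, drops duplicates, and optionally filters by a regular expression. The name `extract_links`, its parameters, and the `sample_html` snippet are hypothetical and only serve the illustration.

```python
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url, pattern=None):
    """Return unique absolute links from html, optionally filtered by a regex."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    seen = set()
    for tag in soup.find_all(href=True):
        href = tag["href"].strip()
        if not href:
            continue  # skip empty href attributes
        absolute = urljoin(base_url, href)  # resolve relative links
        if absolute in seen:
            continue  # skip duplicates
        if pattern and not re.search(pattern, absolute):
            continue  # skip links that do not match the filter
        seen.add(absolute)
        links.append(absolute)
    return links


# Hypothetical page content used only for this demonstration.
sample_html = """
<a href="/jobs">Jobs</a>
<a href="http://www.careerbuilder.com/listing/1">Listing</a>
<a href="/jobs">Jobs again</a>
<a href="">Empty</a>
"""

print(extract_links(sample_html, "http://example.com"))
print(extract_links(sample_html, "http://example.com", r"www\.careerbuilder\.com"))
```

Taking the HTML as an argument, rather than downloading it inside the function, keeps the link extraction independent of how the page was fetched.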
0

Code and output:

import urllib.request

import bs4


def WebScrape():
    url = input('Where do you want to scrape from today?: ')
    html = urllib.request.urlopen(url).read()
    soup = bs4.BeautifulSoup(html, "lxml")

    # collect (href, text) pairs for every <a class="title"> tag
    title_tags = soup.findAll('a', {'class': 'title'})
    url_titles = [(tag['href'], tag.text) for tag in title_tags]

    if title_tags:
        print('Retrieving your links...')
        for url_title in url_titles:
            print(*url_title)

Where do you want to scrape from today?: http://www.businessinsider.com 
Retrieving your links... 
http://www.businessinsider.com/trump-china-drone-navy-2016-12 Trump slams China's capture of a US Navy drone as 'unprecedented' act 
http://www.businessinsider.com/trump-thank-you-rally-alabama-2016-12 'This is truly an exciting time to be alive' 
http://www.businessinsider.com/how-smartwatch-pioneer-pebble-lost-everything-2016-12 How the hot startup that stole Apple's thunder wound up in Silicon Valley's graveyard 
http://www.businessinsider.com/china-will-return-us-navy-underwater-drone-2016-12 Pentagon: China will return US Navy underwater drone seized in South China Sea 
http://www.businessinsider.com/what-google-gets-wrong-about-driverless-cars-2016-12 Here's the biggest thing Google got wrong about self-driving cars 
http://www.businessinsider.com/sheriff-joe-arpaio-still-wants-to-investigate-obamas-birth-certificate-2016-12 Sheriff Joe Arpaio still wants to investigate Obama's birth certificate 
http://www.businessinsider.com/rents-dropping-in-new-york-bubble-pop-2016-12 Rents are finally dropping in New York City, and a bubble might be about to pop 
http://www.businessinsider.com/trump-david-friedman-ambassador-israel-2016-12 Trump's ambassador pick could drastically alter 2 of the thorniest issues in the US-Israel relationship 
http://www.businessinsider.com/can-hackers-be-caught-trump-election-russia-2016-12 Why Trump's assertion that hackers can't be caught after an attack is wrong 
http://www.businessinsider.com/theres-a-striking-commonality-between-trump-and-nixon-2016-12 There's a striking commonality between Trump and Nixon 
http://www.businessinsider.com/tesla-year-in-review-2016-12 Tesla's biggest moments of 2016 
http://www.businessinsider.com/heres-why-using-uber-to-fill-public-transportation-gaps-is-a-bad-idea-2016-12 Here's why using Uber to fill public transportation gaps is a bad idea 
http://www.businessinsider.com/useful-hard-adopt-early-morning-rituals-productive-exercise-2016-12 4 morning rituals that are hard to adopt but could really pay off 
http://www.businessinsider.com/most-expensive-champagne-bottles-money-can-buy-2016-12 The 11 most expensive Champagne bottles money can buy 
http://www.businessinsider.com/innovations-in-radiology-2016-11 5 innovations in radiology that could impact everything from the Zika virus to dermatology 
http://www.businessinsider.com/ge-healthcare-mr-freelium-technology-2016-11 A new technology is being developed using just 1% of the finite resource needed for traditional MRIs
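
Since the question asks about scraping more than one site, here is one possible sketch of looping over several URLs. Everything named here (`scrape_many`, `fetch_html`, the `fetch` parameter, and the `.example` addresses) is an assumption made for illustration. Links are taken from every `<a>` with an `href` instead of only `class="title"`, and download failures are recorded per URL instead of stopping the run.

```python
import urllib.request

from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a page; used as the default fetcher."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def scrape_many(urls, fetch=fetch_html):
    """Return (results, errors): links per URL, and error text per failed URL."""
    results = {}
    errors = {}
    for url in urls:
        try:
            html = fetch(url)
        except (OSError, ValueError) as exc:
            # urllib raises URLError (an OSError) for network problems and
            # ValueError for malformed URLs
            errors[url] = str(exc)
            continue
        soup = BeautifulSoup(html, "html.parser")
        results[url] = [(a["href"], a.get_text(strip=True))
                        for a in soup.find_all("a", href=True)]
    return results, errors


# Demonstration with a stand-in fetcher so that no network access is needed.
fake_pages = {
    "http://site-one.example": '<a href="/x">X</a>',
    "http://site-two.example": '<a class="title" href="/y"> Y </a>',
}
results, errors = scrape_many(list(fake_pages), fetch=fake_pages.__getitem__)
print(results)
print(errors)
```

With the default fetcher, the list of URLs could come from user input, for example `scrape_many(input('URLs: ').split())`.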