How do I extract text from an HTML tag?

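I'm trying to grab the text of a specific tag on each linked page and return a link only if that text contains a particular word. For example, if the text contains "chemical", return that link; otherwise, skip it. My code is below.

It looks like your get_all_joblinks function is collecting all the links from a single indeed.ca page.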

import requests
from bs4 import BeautifulSoup
import webbrowser

jobsearch = input("What type of job?: ")
location = input("What is your location: ")
url = "https://ca.indeed.com/jobs?q=" + jobsearch + "&l=" + location
base_url = 'https://ca.indeed.com/'

r = requests.get(url)
rcontent = r.content
prettify = BeautifulSoup(rcontent, "html.parser")

all_job_url = []


def get_all_joblinks():
    # Collect the href of every job-title link on the results page.
    for tag in prettify.find_all('a', {'data-tn-element': "jobTitle"}):
        link = tag['href']
        all_job_url.append(link)


def filter_links():
    # Fetch each job page and print its summary text.
    for eachurl in all_job_url:
        rurl = requests.get(base_url + eachurl)
        content = rurl.content
        soup = BeautifulSoup(content, "html.parser")
        snip = soup.find('td', {'class': 'snip'})
        if snip is not None:  # skip pages without a summary cell
            print(snip.get_text())


def search_job():
    if prettify.select('div.no_results'):
        print("no job matches found")
    else:
        # opens the web page of job search if entries are found
        webbrowser.open_new(url)


get_all_joblinks()
filter_links()

Source

2017-07-05 DJRodrigue


Here is one way to check whether a typical linked page mentions "chemical" anywhere in the text of its body element.

>>> import requests 
>>> import bs4 
>>> page = requests.get('https://jobs.sanofi.us/job/-/-/507/4895612?utm_source=indeed.com&utm_campaign=sanofi%20sem%20campaign&utm_medium=job_aggregator&utm_content=paid_search&ss=paid').content 
>>> soup = bs4.BeautifulSoup(page, 'lxml') 
>>> body = soup.find('body').text 
>>> chemical_present = body.lower().find('chemical')>-1 
>>> chemical_present 
True
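
One caveat with checking the whole body text is that it also includes the contents of script and style tags, so a word that appears only in JavaScript would still count as a match. The following is a minimal sketch of the same check that strips those tags first. The function name page_mentions and the sample HTML are made up for illustration, and it uses the built-in "html.parser" instead of lxml so that no extra parser is needed.

```python
from bs4 import BeautifulSoup


def page_mentions(html, keyword):
    """Return True if `keyword` occurs in the visible text of `html`."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop <script> and <style> contents so a keyword that only appears
    # in JavaScript or CSS does not count as a match.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # Fall back to the whole document if the page has no <body>.
    root = soup.body if soup.body is not None else soup
    text = root.get_text(separator=" ")
    # casefold() makes the comparison case-insensitive.
    return keyword.casefold() in text.casefold()


# Illustrative page, not taken from a real job posting.
sample = """
<html><body>
  <h1>Process Engineer</h1>
  <p>Experience in a Chemical plant is required.</p>
  <script>var tracking = "chemical";</script>
</body></html>
"""
print(page_mentions(sample, "chemical"))  # True
print(page_mentions(sample, "tracking"))  # False
```

The second call returns False because "tracking" occurs only inside the script tag, which is removed before the text is searched.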

I hope this is what you were looking for.

Edit, in response to the comment:

>>> # requests and bs4 are already imported in the session above
>>> import webbrowser
>>> from urllib import parse
>>> job_type = 'engineer'
>>> location = 'Toronto'
>>> url = "https://ca.indeed.com/jobs?q=" + job_type + "&l=" + location
>>> base_url = '%s://%s' % parse.urlparse(url)[0:2]
>>> page = requests.get(url).content
>>> soup = bs4.BeautifulSoup(page, 'lxml')
>>> for link in soup.find_all('a', {'data-tn-element': "jobTitle"}):
...     job_page = requests.get(base_url + link['href']).content
...     job_soup = bs4.BeautifulSoup(job_page, 'lxml')
...     body = job_soup.find('body').text
...     if body.lower().find('chemical') > -1:
...         webbrowser.open(base_url + link['href'])
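
If you would rather collect the matching links than open each one in a browser tab, the loop can append to a list instead. This is only a sketch: matching_links and fetch_text are hypothetical names, and fetch_text stands for any callable that takes a URL and returns the page text (for example, a wrapper around requests and BeautifulSoup). A dictionary of fake pages is used here so the example runs without network access.

```python
from urllib.parse import urljoin


def matching_links(hrefs, base_url, fetch_text, keyword):
    """Return the absolute URLs whose page text contains `keyword`.

    `fetch_text` is a hypothetical callable: it takes a URL and
    returns the text of that page as a string.
    """
    needle = keyword.casefold()
    matches = []
    for href in hrefs:
        # urljoin handles both relative and absolute hrefs.
        full_url = urljoin(base_url, href)
        if needle in fetch_text(full_url).casefold():
            matches.append(full_url)
    return matches


# Stand-in pages (made up) so the sketch runs offline.
fake_pages = {
    "https://ca.indeed.com/job/1": "Chemical process engineer wanted",
    "https://ca.indeed.com/job/2": "Software engineer wanted",
}
found = matching_links(
    ["/job/1", "/job/2"], "https://ca.indeed.com", fake_pages.get, "chemical"
)
for url in found:
    print(url)  # prints https://ca.indeed.com/job/1
```

Passing the fetch step in as an argument keeps the filtering logic separate from the network code, which makes it easy to try out on saved pages.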

Source

2017-07-05 20:05:04

Yes, I want to extract all the links and filter them based on specific text they contain. Once they are filtered, I want to display the links that passed the filter. Is that possible? – DJRodrigue

Please see the edit. I should probably warn you that this code takes a while to run. –
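
On the running time: the loop downloads one job page at a time, so most of the wait is network latency. One possible way to shorten it, not part of the answer above, is to fetch several pages at once with a thread pool from the standard library. In this sketch fetch_all and fake_fetch are hypothetical names; in real use the fetch argument might be something like `lambda u: requests.get(u, timeout=10).text`, and the worker count should stay small to avoid hammering the site.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=8):
    """Fetch every URL with `fetch` using a small thread pool.

    Returns a dict mapping each URL to whatever `fetch` returned for it.
    If `fetch` raises for any URL, that exception propagates to the caller.
    """
    urls = list(urls)  # allow generators; the list is iterated twice below
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so zip pairs them correctly.
        return dict(zip(urls, pool.map(fetch, urls)))


def fake_fetch(url):
    # Stand-in for a real download so the sketch runs offline.
    return "page for " + url


pages = fetch_all(["https://example.com/a", "https://example.com/b"], fake_fetch)
print(pages["https://example.com/a"])  # page for https://example.com/a
```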
