Scraping links from a website

I want to crawl http://www.medhelp.org/forums/list, which has many links to different diseases. Behind each of those links there are several pages, and each page contains a number of links.

I want to collect those links, so I used this code:

# assumed imports; the original post omitted them
import urllib.request
from bs4 import BeautifulSoup as bs

myArray = []
html_page = urllib.request.urlopen("http://www.medhelp.org/forums/list")
soup = bs(html_page)
temp = soup.findAll('div', attrs={'class': 'forums_link'})
for div in temp:
    myArray.append('http://www.medhelp.org' + div.a['href'])
myArray_for_questions = []
myPages = []

# this loop goes over all links on the main page (in this case, all diseases)
for link in myArray:

    # "link" is the URL for each link on the main page of our website
    html_page = urllib.request.urlopen(link)
    soup1 = bs(html_page)

    # getting the questions' links on the first page
    temp = soup1.findAll('div', attrs={'class': 'subject_summary'})
    for div in temp:
        myArray_for_questions.append('http://www.medhelp.org' + div.a['href'])

    # now getting the URLs of all following pages for this page
    pages = soup1.findAll('a', href=True, attrs={'class': 'page_nav'})
    for l in pages:
        html_page_t = urllib.request.urlopen('http://www.medhelp.org' + l.get('href'))
        soup_t = bs(html_page_t)
        other_pages = soup_t.findAll('a', href=True, attrs={'class': 'page_nav'})
        for p in other_pages:
            mystr = 'http://www.medhelp.org' + p.get('href')
            if mystr not in myPages:
                myPages.append(mystr)
            if p not in pages:
                pages.append(p)

    # getting all links inside these pages which are people's questions
    for page in myPages:
        html_page1 = urllib.request.urlopen(page)
        soup2 = bs(html_page1)
        temp = soup2.findAll('div', attrs={'class': 'subject_summary'})
        for div in temp:
            myArray_for_questions.append('http://www.medhelp.org' + div.a['href'])

However, it takes a very long time to get all the links I want from all the pages. Any ideas?

Thanks

Source

2017-07-06 Sanaz

This is too broad. Please show what you have tried so far and narrow down your question. – rowana

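When you ask a question, you should normally either have code that you tried to implement, or be asking for help understanding code you found while researching the topic (an excerpt or example, for instance). – gavsta707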

I have not started yet. This forum has questions about many different diseases and I need to save all of them to a file, so I want to build a dedicated web crawler. – Sanaz

Try following the Scrapy tutorial, replacing the example site it uses with your own web page.

https://doc.scrapy.org/en/latest/intro/tutorial.html
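Separately from Scrapy, one likely contributor to the slowness is that the code in the question, as posted, can download the same page many times: `myPages` is never reset between forums and is walked again for every forum link. Below is a minimal sketch of a crawl that visits each page only once. It is not the tutorial's approach and it swaps BeautifulSoup for the standard library's `html.parser` so that it runs without extra packages. The class names `subject_summary` and `page_nav` come from the question's code; `LinkCollector`, `crawl_forum` and `fetch` are hypothetical names chosen for illustration, and the site's real markup may differ.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects question links and pagination links from one page."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # nesting depth inside a subject_summary div
        self.taken = False    # first link of the current div already stored?
        self.questions = []
        self.pages = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        href = attrs.get("href")
        if tag == "div":
            if self.depth:
                self.depth += 1
            elif "subject_summary" in classes:
                self.depth = 1
                self.taken = False
        elif tag == "a" and href:
            if "page_nav" in classes:
                self.pages.append(href)
            elif self.depth and not self.taken:
                # mirrors div.a['href']: only the first link in the div
                self.questions.append(href)
                self.taken = True

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1


def fetch(url):
    """Download one page and return its HTML as text."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl_forum(first_page, fetch_page=fetch):
    """Return the question URLs of one forum, fetching each page once."""
    seen_pages = {first_page}
    queue = deque([first_page])
    seen_questions = set()
    questions = []
    while queue:
        url = queue.popleft()
        parser = LinkCollector()
        parser.feed(fetch_page(url))
        for href in parser.questions:
            question = urljoin(url, href)
            if question not in seen_questions:
                seen_questions.add(question)
                questions.append(question)
        for href in parser.pages:
            page = urljoin(url, href)
            if page not in seen_pages:   # set lookup, no repeated download
                seen_pages.add(page)
                queue.append(page)
    return questions
```

Calling `crawl_forum(link)` for each forum URL would replace the two inner loops of the question's code. Passing the download function as `fetch_page` keeps the crawl logic testable without network access, and the `set` lookups avoid the linear `not in` scans over a growing list.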

Source

2017-07-06 15:12:19 yaizer
