Scraping text from multiple web pages in Python

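I've been tasked with scraping all the text off of any web page that a certain client of ours hosts. I've managed to write a script that scrapes the text off a single web page, and I can manually replace the URL in the code each time I want to scrape a different page. But obviously this is very inefficient. Ideally, I could have Python connect to some list that contains all the URLs I need, iterate through the list, and print all the scraped text into a single CSV.

I've tried to write a "test" version of this code by creating a list two URLs long and trying to get my code to scrape both URLs. However, as you can see, my code only scrapes the most recent URL in the list and does not hold on to the first page it scraped. I think this is due to a deficiency in my print statement, since it always writes over itself. Is there a way to have everything I scraped held somewhere until the loop goes through the entire list, and then print everything?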

Feel free to totally dismantle my code. I know nothing of computer languages. I just keep getting assigned these tasks and use Google to do my best.

import urllib.request
import re 
from bs4 import BeautifulSoup 

data_file_name = 'C:\\Users\\confusedanalyst\\Desktop\\python_test.csv' 
urlTable = ['url1','url2'] 

def extractText(string): 
    page = urllib.request.urlopen(string) 
    soup = BeautifulSoup(page, 'html.parser') 

##Extracts all paragraph and header variables from URL as GroupObjects 
    text = soup.find_all("p") 
    headers1 = soup.find_all("h1") 
    headers2 = soup.find_all("h2") 
    headers3 = soup.find_all("h3") 

##Forces GroupObjects into str 
    text = str(text) 
    headers1 = str(headers1) 
    headers2 = str(headers2) 
    headers3 = str(headers3) 

##Strips HTML tags and brackets from extracted strings 
    text = text.strip('[') 
    text = text.strip(']') 
    text = re.sub('<[^<]+?>', '', text) 

    headers1 = headers1.strip('[') 
    headers1 = headers1.strip(']') 
    headers1 = re.sub('<[^<]+?>', '', headers1) 

    headers2 = headers2.strip('[') 
    headers2 = headers2.strip(']') 
    headers2 = re.sub('<[^<]+?>', '', headers2) 

    headers3 = headers3.strip('[') 
    headers3 = headers3.strip(']') 
    headers3 = re.sub('<[^<]+?>', '', headers3) 

    print_to_file = open (data_file_name, 'w' , encoding = 'utf') 
    print_to_file.write(text + headers1 + headers2 + headers3) 
    print_to_file.close() 


for i in urlTable: 
    extractText (i)

2016-08-04 confusedanalyst

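Try this. 'w' truncates the file and opens it with the pointer at the beginning, so each call to extractText overwrites what the previous call wrote. You want the pointer at the end of the file, which is what 'a' (append) gives you:

print_to_file = open (data_file_name, 'a' , encoding = 'utf')

For future reference, here are all of the read and write modes, as described in the C fopen(3) manual page (Python's open() accepts the same mode strings):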

The argument mode points to a string beginning with one of the following 
sequences (Additional characters may follow these sequences.): 

``r'' Open text file for reading. The stream is positioned at the 
     beginning of the file. 

``r+'' Open for reading and writing. The stream is positioned at the 
     beginning of the file. 

``w'' Truncate file to zero length or create text file for writing. 
     The stream is positioned at the beginning of the file. 

``w+'' Open for reading and writing. The file is created if it does not 
     exist, otherwise it is truncated. The stream is positioned at 
     the beginning of the file. 

``a'' Open for writing. The file is created if it does not exist. The 
     stream is positioned at the end of the file. Subsequent writes 
     to the file will always end up at the then current end of file, 
     irrespective of any intervening fseek(3) or similar. 

``a+'' Open for reading and writing. The file is created if it does not 
     exist. The stream is positioned at the end of the file. 
     Subsequent writes to the file will always end up at the then 
     current end of file, irrespective of any intervening fseek(3) or 
     similar.
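
To make the difference between 'w' and 'a' concrete, here is a small sketch that reopens a file once per "page", the way the loop in the question does. The helper names `write_pages` and `demo` and the sample strings are made up for illustration; only the built-in `open()` and the standard `tempfile` and `os` modules are used.

```python
import os
import tempfile


def write_pages(path, pages, mode):
    """Reopen `path` once per page with the given mode, like the loop in the question."""
    for page in pages:
        with open(path, mode, encoding='utf-8') as f:
            f.write(page + '\n')
    # Read back whatever survived in the file.
    with open(path, encoding='utf-8') as f:
        return f.read()


def demo():
    """Write the same two hypothetical pages with 'w' and with 'a'."""
    pages = ['first page', 'second page']
    with tempfile.TemporaryDirectory() as tmp:
        w_result = write_pages(os.path.join(tmp, 'w.txt'), pages, 'w')
        a_result = write_pages(os.path.join(tmp, 'a.txt'), pages, 'a')
    return w_result, a_result


if __name__ == '__main__':
    w_result, a_result = demo()
    print(repr(w_result))  # 'second page\n' (the first page was truncated away)
    print(repr(a_result))  # 'first page\nsecond page\n'
```

One thing to keep in mind with 'a': running the script a second time keeps adding to the existing file, so you may want to delete or rename the old CSV between runs.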

2016-08-04 19:52:25

Thank you! That was exactly what I was looking for. Once I have the real list of URLs from the client, I can apply the same principle, right? Thanks! – confusedanalyst
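
For the real URL list, another option is the one the question originally asked about: hold every page's text in memory and write the CSV once, after the loop has finished. The following is only a rough sketch under some assumptions. `scrape_all` is a hypothetical helper, and `extract` stands for any function that takes a URL and returns the page text as a string (for example, `extractText` changed to return `text + headers1 + headers2 + headers3` instead of writing it to the file). The `url` and `text` column names are also made up.

```python
import csv


def scrape_all(urls, extract, out_path):
    """Collect the text for every URL first, then write one CSV at the end."""
    rows = []
    for url in urls:
        # `extract` is assumed to return the scraped text as a str.
        rows.append((url, extract(url)))

    # The file is opened exactly once, so 'w' is safe here.
    # newline='' is what the csv module documentation recommends for csv.writer.
    with open(out_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['url', 'text'])  # header row (column names are an assumption)
        writer.writerows(rows)
    return rows
```

With the names from the question, the call would look something like `scrape_all(urlTable, extractText, data_file_name)`. Because each page becomes its own CSV row, commas and line breaks inside the scraped text are quoted by `csv.writer` rather than breaking the file layout.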
