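Learn to embrace Unicode... the world isn't ASCII anymore.

Assuming you are on Windows and viewing the .CSV in Excel or Notepad, use the following line in Python 3. This change alone (along with fixing the indentation in your post) writes the non-ASCII characters correctly. Notepad and Excel look for a UTF-8 BOM signature at the start of the file, which the `utf-8-sig` codec provides:

    with open('usnwr_schools.csv', 'w', newline='', encoding='utf-8-sig') as f: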
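If you read the file from another Python script, be sure to open it as shown below. Your example that read `b'University of Michigan\xe2\x80\x94\xe2\x80\x8bAnn Arbor'` was read in binary mode (`'rb'`), which returns the raw UTF-8 bytes instead of decoded text:

    with open('usnwr_schools.csv', encoding='utf-8-sig') as f: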
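To make the effect of the BOM and of binary mode concrete, here is a minimal sketch. The sample string and the temporary file name are assumptions for illustration only; the string is chosen to match the bytes quoted in the question.

```python
import os
import tempfile

# Hypothetical sample row: an em dash followed by a zero-width space.
name = 'University of Michigan\u2014\u200bAnn Arbor'

# Throwaway file so the sketch does not touch usnwr_schools.csv.
path = os.path.join(tempfile.mkdtemp(), 'demo.csv')

# utf-8-sig writes the three-byte BOM before the encoded text.
with open(path, 'w', newline='', encoding='utf-8-sig') as f:
    f.write(name + '\r\n')

# Binary mode returns the raw bytes: BOM first, then the UTF-8 text.
with open(path, 'rb') as f:
    raw = f.read()

# Text mode with utf-8-sig decodes the bytes and drops the BOM.
with open(path, newline='', encoding='utf-8-sig') as f:
    text = f.read()

# Plain utf-8 also decodes, but keeps the BOM as a leading U+FEFF character.
with open(path, newline='', encoding='utf-8') as f:
    text_with_bom = f.read()

print(raw)
print(ascii(text))
print(ascii(text_with_bom))
```

The `\xe2\x80\x94` and `\xe2\x80\x8b` sequences in the binary read are simply the UTF-8 encodings of the em dash and the zero-width space; nothing is corrupted in the file.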
On Linux, you can use `utf8` instead of `utf-8-sig`.

As an aside, you can replace your loop with the following:
    with open('usnwr_schools.csv', 'w', newline='', encoding='utf-8-sig') as f:
        writer = csv.writer(f)
        # A single find_all() call already returns every matching link.
        # Wrapping it in "for school in reqSoup" repeats the whole list
        # once per top-level node of the parsed document.
        for item in reqSoup.find_all("a", {"class": "school-name"}):
            writer.writerow([item.get_text()])

Reading it back:

    with open('usnwr_schools.csv', encoding='utf-8-sig') as f:
        print(f.read())

Output:

    Massachusetts Institute of Technology
    Stanford University
    University of California—Berkeley
    California Institute of Technology
    Carnegie Mellon University
    University of Michigan—Ann Arbor
    Georgia Institute of Technology
    University of Illinois—Urbana-Champaign
    Purdue University—West Lafayette
    University of Texas—Austin (Cockrell)
    Texas A&M University—College Station (Look)
    Cornell University
    University of Southern California (Viterbi)
    Columbia University (Fu Foundation)
    University of California—Los Angeles (Samueli)
    University of California—San Diego (Jacobs)
    Princeton University
    Northwestern University (McCormick)
    University of Pennsylvania
    Johns Hopkins University (Whiting)
    Virginia Tech
    University of California—Santa Barbara
    Harvard University
    University of Maryland—College Park (Clark)
    University of Washington

If you still want ASCII only, this will do it:
    import requests
    import bs4
    import csv

    results = requests.get('http://grad-schools.usnews.rankingsandreviews.com/best-graduate-schools/top-engineering-schools/eng-rankings?int=a74509')

    # Map both dashes to an ASCII hyphen and delete zero-width spaces.
    # str.translate() expects code points (ints) as keys, hence ord().
    replacements = {ord('\N{EN DASH}'): '-',
                    ord('\N{EM DASH}'): '-',
                    ord('\N{ZERO WIDTH SPACE}'): None}

    reqSoup = bs4.BeautifulSoup(results.text, "html.parser")

    # encoding='ascii' raises UnicodeEncodeError if any other non-ASCII
    # character is still present, rather than writing it silently.
    with open('usnwr_schools.csv', 'w', newline='', encoding='ascii') as f:
        writer = csv.writer(f)
        for item in reqSoup.find_all("a", {"class": "school-name"}):
            writer.writerow([item.get_text().translate(replacements)])

    with open('usnwr_schools.csv', encoding='ascii') as f:
        print(f.read())
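The translation table can be tried on its own, without fetching the page. This is a small sketch using the same `replacements` mapping; both sample strings are made up for illustration.

```python
# Same table as in the script: dashes become '-', zero-width space is removed.
replacements = {
    ord('\N{EN DASH}'): '-',
    ord('\N{EM DASH}'): '-',
    ord('\N{ZERO WIDTH SPACE}'): None,
}

# Hypothetical sample matching the school name quoted in the question.
sample = 'University of Michigan\N{EM DASH}\N{ZERO WIDTH SPACE}Ann Arbor'
cleaned = sample.translate(replacements)
print(cleaned)

# The cleaned text now encodes as ASCII without error.
ascii_bytes = cleaned.encode('ascii')

# Characters that are not in the table are left alone, so strict ASCII
# encoding still fails for them. Hypothetical sample with an accented letter.
other = 'Universit\N{LATIN SMALL LETTER E WITH ACUTE}'.translate(replacements)
try:
    other.encode('ascii')
    still_fails = False
except UnicodeEncodeError:
    still_fails = True
print(still_fails)
```

The table only covers the three characters seen in this data set, so any other non-ASCII character on the page would need its own entry.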
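How do you expect the em dash to be displayed? Unicode is an abstract enumeration of characters. A file is a sequence of bytes. UTF-8 is the default way of encoding each Unicode character as one or more bytes. If you want to remove the em dash or replace it with something else, you have to do that yourself; that is not the encoder's job. – chepner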
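The difference between a character, its code point, and its encoded bytes can be checked in a few lines. This is only an illustrative sketch; the choice of UTF-16 as a second encoding is an assumption made to show that the bytes depend on the codec, not on the character.

```python
dash = '\N{EM DASH}'

# The abstract character has a single code point, U+2014.
code_point = hex(ord(dash))

# Each encoding turns that one code point into a different byte sequence.
as_utf8 = dash.encode('utf-8')        # three bytes
as_utf16 = dash.encode('utf-16-le')   # two bytes

# ASCII has no byte for this character. The encoder can fail or substitute
# a placeholder, but it will not pick a "similar" ASCII character for you.
try:
    dash.encode('ascii')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
placeholder = dash.encode('ascii', errors='replace')

print(code_point, as_utf8, as_utf16, strict_ok, placeholder)
```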
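**All** of your data is being shown as UTF-8 (evidently the preferred encoding for your locale, since you did not set 'encoding' when opening the file). What did you want to see instead? The rest of the text is UTF-8 too (even though that text could also be encoded as ASCII). –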
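Note that the 'csv' module is just writing data in a particular format; it writes whatever data you pass to the writer. In other words, this is not a 'csv' module problem. You apparently want to pass different data instead, so you need a way to restrict your data to ASCII characters (perhaps just a-z, A-Z, 0-9 and basic punctuation). –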
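One way to restrict the data before it reaches the writer is a whitelist filter, sketched below. The function name `to_basic_ascii`, the allowed set, and the NFKD normalization step are all assumptions for illustration, not something taken from the question or the answer.

```python
import string
import unicodedata

# Assumed whitelist: ASCII letters, digits, punctuation and the space.
ALLOWED = set(string.ascii_letters + string.digits + string.punctuation + ' ')


def to_basic_ascii(text, dash_replacement='-'):
    """Return text reduced to the whitelisted ASCII characters (hypothetical helper)."""
    # Replace dashes first so they are not simply dropped by the filter.
    for dash in ('\N{EN DASH}', '\N{EM DASH}'):
        text = text.replace(dash, dash_replacement)
    # Optional step: split accented letters into base letter + combining mark,
    # so the base letter survives the filter and only the mark is removed.
    text = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in text if ch in ALLOWED)


print(to_basic_ascii('University of Michigan\u2014\u200bAnn Arbor'))
print(to_basic_ascii('Texas A&M University'))
```

Each value would then go through the helper before being passed to the writer, for example `writer.writerow([to_basic_ascii(item.get_text())])`. Note that a filter like this discards information silently, which may or may not be acceptable for a given data set.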