The web page is UTF-8 encoded and its text is in Russian, but when I fetch the page with Python, the text does not display correctly

import urllib.request
import chardet
import requests
from lxml import html

show_details = 1
base_url = "http://vesti.az"
link = "http://vesti.az" 
#these two lines identifies character encoding 
enc = urllib.request.urlopen(link).read() 
print(chardet.detect(enc)) 
#end of charset identifier 
page = requests.get(link) 
print(page.encoding) 
tree = html.fromstring(page.content)  
links = tree.xpath('//div[@class="news-list"]/ul/li/a/@href') #here I get the last added news link 
new_link = base_url + links[0] 
if show_details == 1: 
    print(new_link) 


info = requests.get(new_link) 
agac = html.fromstring(info.content) #here I open last news link 
newsTitle = agac.xpath('//title/text()') #here I get the news title 
newsTitle = u''.join(newsTitle) 

b0 = agac.xpath('//article[@class="article-content js-mediator-article"]//text()') 
b0 = u"".join(b0) 
b0 = b0.strip() 

newsBody = b0 #re.sub("Oxunub:.*", "", b0, flags=re.DOTALL) 

if show_details == 1: 
    print(new_link) 
    print(newsTitle) #here I print the news title 
    print(newsBody)

The text is in Russian, but when I print the news title, I unfortunately get something like this:

°Ð¹Ð'Ð¶Ð°Ð½ÐμÑÐ¿Ð¾Ð»NNDD»ÑÑÐ¶ÐμÐ»Ð¾ÐμND°Ð½ÐμÐ½Ð¸ÐμÐ²Ð¾Ð²ÑÐμÐ¼ÑÐ½Ð°Ð¿Ð°Ð'ÐμÐ½Ð¸ÑÐ½Ð°ÑÐ¿ÑÐ°Ð²ÐÐμÐ½Ð¸Ðμ» Ð¤Ð¡Ð| | Vesti.az | ÐÐ»Ð°Ð²Ð½ÑÐμÐ½Ð¾Ð²Ð¾ÑÑÐ¸ÐÐ・ÐμÑÐ±Ð°Ð¹Ð'Ð¶Ð°Ð½Ð°| ÐÐ¾Ð²Ð¾ÑÑÐ¸ÐÐ・ÐμÑÐ±Ð°Ð¹Ð'Ð¶°Ð½Ð°

This is definitely not what I am looking for. I tried changing the encoding of the Python file, but that did not help. Is there a way to solve this problem?

I used hash (#) comments in the code to explain what each step does.


2017-04-22 Sunuba

The key to getting the text out of the binary string is to decode it as UTF-8:

info.content.decode('utf-8', 'ignore')

The code below uses it:

import urllib.request 
import chardet 
import requests 
from lxml import html 

show_details = 1 
base_url = "http://vesti.az" 
link = "http://vesti.az" 
#these two lines identifies character encoding 
enc = urllib.request.urlopen(link).read() 
print(chardet.detect(enc)) 
#end of charset identifier 
page = requests.get(link) 
print(page.encoding) 
tree = html.fromstring(page.content)  
links = tree.xpath('//div[@class="news-list"]/ul/li/a/@href') #here I get the last added news link 
new_link = base_url + links[0] 
if show_details == 1: 
    print(new_link) 
info = requests.get(new_link) 
# print(info.content.decode('utf-8', 'ignore')) 
# print(info.content) 
agac = html.fromstring(info.content.decode('utf-8', 'ignore')) #here I open last news link 
newsTitle = agac.xpath('//title/text()') #here I get the news title 
newsTitle = u''.join(newsTitle) 

b0 = agac.xpath('//article[@class="article-content js-mediator-article"]//text()') 
b0 = u"".join(b0) 
b0 = b0.strip() 

newsBody = b0 #re.sub("Oxunub:.*", "", b0, flags=re.DOTALL) 

if show_details == 1: 
    # print(new_link) 
    print(newsTitle) #here I print the news title 
    # print(newsBody)

The output looks like this:

>python3.6 -u "russian_Cg.py" 
{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''} 
utf-8 
http://vesti.az/news/329186 
ВС Армении продолжают нарушать режим прекращения огня | Vesti.az | Главные новости Азербайджана | Новости Азербайджана
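
The garbled title in the question has the typical shape of UTF-8 bytes that were interpreted as a single-byte encoding such as Latin-1. That this is what happened inside the parser is an assumption, not something confirmed above. The sketch below reproduces the effect without any network access; the sample string `title` is hypothetical and simply stands in for the page title. It also shows what the 'ignore' argument does when the byte string is damaged.

```python
# Hypothetical sample text; it stands in for the real page title.
title = "Новости Азербайджана"

# The bytes a UTF-8 encoded page would deliver for this text.
raw = title.encode("utf-8")

# Wrong: reading UTF-8 bytes as Latin-1 maps every byte to one character,
# so each Cyrillic letter turns into two characters starting with "Ð" or "Ñ".
garbled = raw.decode("latin-1")

# Right: decoding the same bytes as UTF-8 restores the text.
fixed = raw.decode("utf-8", "ignore")

# Text that is already garbled this way can be repaired by undoing the wrong step.
repaired = garbled.encode("latin-1").decode("utf-8")

# 'ignore' silently drops bytes that are not valid UTF-8. Here the last
# two-byte letter is cut in half, so strict decoding fails and 'ignore'
# returns the text without that letter.
truncated = raw[:-1]
try:
    truncated.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
lenient = truncated.decode("utf-8", "ignore")

print(garbled)
print(fixed)
print(repaired)
print(strict_failed, lenient)
```

Because 'ignore' discards invalid bytes without any warning, it can hide real data loss. Passing 'replace' instead, or leaving the argument out to get an exception, makes such problems visible.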


2017-04-22 06:04:16 Claudio

Thank you very much. I went through the changes carefully and got it working. Thanks a lot! – Sunuba
