How to isolate part of an HTML page in Python 3

I wrote a simple script to fetch a page's source code, but I would like to "isolate" the IPs so that I can save them to a proxy.txt file. Any advice?

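import urllib.request

# Download the page and decode the bytes to text (assuming the page is UTF-8)
sourcecode = urllib.request.urlopen("https://www.inforge.net/xi/threads/dichvusocks-us-15h10-pm-update-24-24-good-socks.455588/")
sourcecode = sourcecode.read().decode("utf-8", errors="replace")

# Save the full page source to proxy.txt
with open("proxy.txt", "w", encoding="utf-8") as out_file:
    out_file.write(sourcecode)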

Source

2016-07-17 Sperly1987

I added a few lines to your code. The only problem is that the UI version (check the page source) is also picked up as an IP address.

import urllib.request
import re

# Download the page, decode it (assuming UTF-8) and save it to proxy.txt
sourcecode = urllib.request.urlopen("https://www.inforge.net/xi/threads/dichvusocks-us-15h10-pm-update-24-24-good-socks.455588/")
sourcecode = sourcecode.read().decode("utf-8", errors="replace")
with open("proxy.txt", "w", encoding="utf-8") as out_file:
    out_file.write(sourcecode)

# Collect dotted-quad matches from every line of the saved page
ip = []
with open("proxy.txt", encoding="utf-8") as fp:
    for line in fp:
        ip.extend(re.findall(r"(?:[\d]{1,3})\.(?:[\d]{1,3})\.(?:[\d]{1,3})\.(?:[\d]{1,3})", line))

for addr in ip:
    print(addr)
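
The pattern above matches any four dot-separated groups of digits, so it can also pick up strings that are not addresses. One possible way to narrow the results is to validate each match with the standard-library `ipaddress` module. This is only a sketch: the helper name `extract_ipv4` and the sample text are made up for illustration, and a version string that happens to be a valid dotted quad (for example 1.5.9.1) would still pass this check.

```python
import ipaddress
import re

# Same dotted-quad shape as the pattern above, with word boundaries added
IPV4_PATTERN = re.compile(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b")


def extract_ipv4(text):
    """Return the dotted quads in text that parse as valid IPv4 addresses."""
    valid = []
    for candidate in IPV4_PATTERN.findall(text):
        try:
            ipaddress.IPv4Address(candidate)
        except ValueError:
            continue  # rejected, e.g. an octet above 255
        valid.append(candidate)
    return valid


# Hypothetical sample text, not taken from the real page
sample = "proxy 192.168.0.1:1080, bogus 999.10.10.10, other 10.0.0.256"
print(extract_ipv4(sample))
```

Because validation cannot tell a version number from an address, restricting the search to the relevant part of the page is still the more reliable filter.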

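UPDATE: This is probably what you are looking for. BeautifulSoup lets us extract only the data we need from the page by its CSS class, although it has to be installed with pip first (the package is beautifulsoup4). There is no need to save the page to a file.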

from bs4 import BeautifulSoup
import urllib.request
import re

# Download the raw HTML of the thread
html = urllib.request.urlopen('https://www.inforge.net/xi/threads/dichvusocks-us-15h10-pm-update-24-24-good-socks.455588/').read()
soup = BeautifulSoup(html, "html.parser")

# Keep only the post bodies, selected by their CSS class name
msg_content = soup.find_all("div", class_="messageContent")

# Search for dotted quads inside the post bodies only
ips = re.findall(r'(?:[\d]{1,3})\.(?:[\d]{1,3})\.(?:[\d]{1,3})\.(?:[\d]{1,3})', str(msg_content))

for addr in ips:
    print(addr)
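
To get back to what the question asked for, the extracted addresses could then be written to proxy.txt, one per line. A minimal sketch, assuming `ips` is a list of strings like the one built above. The helper name `save_ips` and the sample addresses are hypothetical, and dropping duplicates is an added design choice rather than part of the original answer. The demo writes into a temporary directory so it does not touch existing files.

```python
import os
import tempfile


def save_ips(ips, path="proxy.txt"):
    """Write unique addresses to path, one per line, keeping first-seen order."""
    unique = list(dict.fromkeys(ips))  # dict keys preserve insertion order
    with open(path, "w") as out_file:
        for addr in unique:
            out_file.write(addr + "\n")
    return unique


# Sample data standing in for the list returned by re.findall
ips = ["10.0.0.1", "10.0.0.2", "10.0.0.1"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "proxy.txt")
    saved = save_ips(ips, path)
    with open(path) as in_file:
        written = in_file.read()
    print(written)
```

In a real script, calling `save_ips(ips)` with the default path would create proxy.txt in the current directory.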

Source

2016-07-18 06:26:36

Thank you! That's a starting point! But is it possible to focus on one part of the HTML page (in this case

) so that the script outputs only the IPs? Anyway, thanks again for the reply – Sperly1987

I'm being dumb.. "ip" is a list, so I can just remove the items in it that I don't want. – Sperly1987

Why not use re? Exactly how to do it depends on the page's source code.

Source

2016-07-17 22:04:46 UpmostScarab
