Remove the Doctype from HTML using BeautifulSoup4?

I'm new to Python, and BeautifulSoup is giving me trouble.

I'm trying to work out how to remove the Doctype from an HTML file using BeautifulSoup4, but I can't figure out exactly how to accomplish this.

import os
import re

from bs4 import BeautifulSoup
# QFileDialog and QMessageBox come from the Qt binding in use (import not shown in the original).

def saveToText(self):
    filename = os.path.join(self.parent.ReportPath, str(self.parent.CharName.text()) + "_report.txt")
    filename, filters = QFileDialog.getSaveFileName(self, "Save Report", filename, "Text (*.txt);;All Files (*.*)")

    if filename is not None and str(filename) != '':

        try:
            # Append ".txt" if the chosen name does not already end with it.
            if re.compile(r'\.txt$').search(str(filename)) is None:
                filename = str(filename)
                filename += '.txt'

            soup = BeautifulSoup(self.reportHtml, "lxml")

            try:  # THROWS AttributeError IF NOT FOUND ..
                soup.find('font').extract()
            except AttributeError:
                pass

            try:  # THROWS AttributeError IF NOT FOUND ..
                soup.find('head').extract()
            except AttributeError:
                pass

            soup.html.unwrap()
            soup.body.unwrap()

            for b in soup.find_all('b'):
                b.unwrap()

            for table in soup.find_all('table'):
                table.unwrap()

            for td in soup.find_all('td'):
                td.unwrap()

            for br in soup.find_all('br'):
                br.replace_with('\n')

            for center in soup.find_all('center'):
                center.insert_after('\n')

            for dl in soup.find_all('dl'):
                dl.insert_after('\n')

            for dt in soup.find_all('dt'):
                dt.insert_after('\n')

            for hr in soup.find_all('hr'):
                hr.replace_with(('-' * 80) + '\n')

            for tr in soup.find_all('tr'):
                tr.insert_before(' ')
                tr.insert_after('\n')

            print(soup)

        except IOError:
            QMessageBox.critical(None, 'Error!', 'Error writing to file: ' + filename, 'OK')

I tried using:

from bs4 import Doctype 

if isinstance(e, Doctype): 
    e.extract()

but it complains that 'e' is an unresolved reference. I've searched the documentation and Google, but I haven't found anything that works.

Also, is there a way to shorten this code?
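On the question of shortening the code, one possible approach is to group tags that get the same treatment and pass a list of names to `find_all`. The sketch below is only an illustration under assumptions: the function name `html_to_text`, the two tag-list constants, and the sample input are hypothetical, and it uses the built-in `html.parser` instead of `lxml`, so the parsed tree may differ slightly from the original.

```python
from bs4 import BeautifulSoup

# Hypothetical groupings of the tags that the original handles one loop at a time.
UNWRAP_TAGS = ["html", "body", "b", "table", "td"]
NEWLINE_AFTER_TAGS = ["center", "dl", "dt"]


def html_to_text(html):
    """Flatten report HTML into rough plain text (illustrative sketch)."""
    soup = BeautifulSoup(html, "html.parser")

    # Remove the first <font> and <head>, if present, without try/except.
    for name in ("font", "head"):
        tag = soup.find(name)
        if tag is not None:
            tag.extract()

    # find_all accepts a list of tag names, so one loop covers several tags.
    for tag in soup.find_all(UNWRAP_TAGS):
        tag.unwrap()

    for br in soup.find_all("br"):
        br.replace_with("\n")

    for tag in soup.find_all(NEWLINE_AFTER_TAGS):
        tag.insert_after("\n")

    for hr in soup.find_all("hr"):
        hr.replace_with("-" * 80 + "\n")

    for tr in soup.find_all("tr"):
        tr.insert_before(" ")
        tr.insert_after("\n")

    return str(soup)


sample = "<html><head><title>t</title></head><body><b>Bold</b><br/>line<hr/></body></html>"
result = html_to_text(sample)
print(result)
```

Checking `soup.find(name)` against `None` does the same job as catching `AttributeError`, since `find` returns `None` when nothing matches.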

Source

2017-11-08 Aaron Tomason

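Where did you define 'e'?

@SamChats I'm sure that's why it's happening, but the examples I was working from didn't define it either. I really don't know what 'e' is supposed to be defined as. The BeautifulSoup documentation is okay, but it didn't give me enough information to go on.

Would 'e' just be a reference to the main 'soup'?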

This seems to have fixed the problem completely.

from bs4 import BeautifulSoup, Doctype 

# Iterate over a copy, because extract() modifies soup.contents.
for item in list(soup.contents):
    if isinstance(item, Doctype):
        item.extract()
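For anyone who wants to try this in isolation, here is a minimal self-contained sketch of the same idea. The function name `strip_doctype` and the sample string are made up for illustration, and it assumes the built-in `html.parser` rather than `lxml`, so it needs nothing beyond bs4 itself.

```python
from bs4 import BeautifulSoup, Doctype


def strip_doctype(html):
    """Return the re-serialized HTML with any top-level Doctype node removed."""
    # Assumption: html.parser is used here; the question itself uses "lxml".
    soup = BeautifulSoup(html, "html.parser")

    # The Doctype is a node among the top-level children of the soup.
    # Iterate over a copy, because extract() modifies soup.contents.
    for item in list(soup.contents):
        if isinstance(item, Doctype):
            item.extract()

    return str(soup)


sample = "<!DOCTYPE html><html><body><p>Hello</p></body></html>"
cleaned = strip_doctype(sample)
print(cleaned)
```

This also shows what 'e' was meant to be in the snippet from the question: a loop variable standing for each top-level node of the parsed document, not the soup itself.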

Source

2017-11-08 06:49:47
