BeautifulSoup findAll

I have some XML:

<article> 
<uselesstag></uselesstag> 
<topic>oil, gas</topic> 
<body>body text</body> 
</article> 

<article> 
<uselesstag></uselesstag> 
<topic>food</topic> 
<body>body text</body> 
</article> 

<article> 
<uselesstag></uselesstag> 
<topic>cars</topic> 
<body>body text</body> 
</article>

There are many, many useless tags. I want to use BeautifulSoup to collect all the text in the body tags, together with the associated topic text, and create new XML from them.

I'm new to Python, but I suspect that some form of

import arff 
from xml.etree import ElementTree 
import re 
from StringIO import StringIO 

import BeautifulSoup 
from BeautifulSoup import BeautifulSoup 

totstring="" 

with open('reut2-000.sgm', 'r') as inF: 
    for line in inF: 
        string=re.sub("[^0-9a-zA-Z<>/\s=!-\"\"]+","", line) 
        totstring+=string 


soup = BeautifulSoup(totstring) 

body = soup.find("body") 



for anchor in soup.findAll('body'): 
    # Stick body and its topics in an associated array? 
    pass 

will work.

1) How do I do this? 2) Should I add a root node to the XML? Otherwise it isn't proper XML, is it?

Thanks very much.

Edit: What I want to end up with is:

<article> 
<topic>oil, gas</topic> 
<body>body text</body> 
</article> 

<article> 
<topic>food</topic> 
<body>body text</body> 
</article> 

<article> 
<topic>cars</topic> 
<body>body text</body> 
</article>

There are many, many useless tags.
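On the second question (whether a root node is needed): a well-formed XML document has exactly one root element, so the articles do need a wrapper. Below is a minimal sketch using the standard library's xml.etree.ElementTree. The root name `articles` and the `pairs` list are assumptions made for illustration; they do not appear in the question.

```python
import xml.etree.ElementTree as ET

# (topic, body) pairs; the values are the sample data from the question.
pairs = [("oil, gas", "body text"), ("food", "body text"), ("cars", "body text")]

# "articles" is a hypothetical root name; any single enclosing element would do.
root = ET.Element("articles")
for topic_text, body_text in pairs:
    article = ET.SubElement(root, "article")
    ET.SubElement(article, "topic").text = topic_text
    ET.SubElement(article, "body").text = body_text

xml_text = ET.tostring(root, encoding="unicode")

# The wrapped document parses back into three <article> children.
reparsed = ET.fromstring(xml_text)

# The same articles without the enclosing root are rejected by the parser.
bare = xml_text[len("<articles>"):-len("</articles>")]
try:
    ET.fromstring(bare)
    bare_is_well_formed = True
except ET.ParseError:
    bare_is_well_formed = False

print(len(reparsed), bare_is_well_formed)
```

Lenient parsers such as BeautifulSoup will still read the rootless version, which is why the sample input works there, but strict XML tools will not.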

Source

2012-05-09 RNs_Ghost

So do you want to get the content from tags A, B and C, or get the content of all tags and ignore tags D, E and F?

Yes, I want two kinds of tags (body and topic) and want to ignore the others (date, time, etc.).

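OK, here is a solution.

First, make sure you have beautifulsoup4 installed: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup

Here is my code to get all the body and topic tags: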

from bs4 import BeautifulSoup 
html_doc= """ 
<article> 
<topic>oil, gas</topic> 
<body>body text</body> 
</article> 

<article> 
<topic>food</topic> 
<body>body text</body> 
</article> 

<article> 
<topic>cars</topic> 
<body>body text</body> 
</article> 
""" 
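# Name the parser explicitly so the result does not depend on which parsers are installed 
soup = BeautifulSoup(html_doc, 'html.parser') 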

bodies = [a.get_text() for a in soup.find_all('body')] 
topics = [a.get_text() for a in soup.find_all('topic')]
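
Two separate lists only stay aligned if every article has exactly one topic and one body. A sketch that keeps each topic with its own body by walking the articles one at a time is shown here. The helper `text_of` is a hypothetical name introduced for illustration, and the choice of `html.parser` is an assumption, not something the answer specifies.

```python
from bs4 import BeautifulSoup

# Sample input based on the question, including one useless tag.
html_doc = """
<article>
<uselesstag></uselesstag>
<topic>oil, gas</topic>
<body>body text</body>
</article>

<article>
<topic>food</topic>
<body>body text</body>
</article>
"""

soup = BeautifulSoup(html_doc, "html.parser")


def text_of(parent, tag_name):
    # Hypothetical helper: text of the first matching descendant, or "" if missing.
    tag = parent.find(tag_name)
    return tag.get_text() if tag is not None else ""


# One (topic, body) tuple per article, so the association is kept.
pairs = [(text_of(a, "topic"), text_of(a, "body")) for a in soup.find_all("article")]
print(pairs)
```

Tags that are not asked for, such as uselesstag, are simply never read, so nothing has to be deleted from the tree.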

Source

2012-05-09 15:49:14

Thanks for the help @Arthur Neves, but I get an error in "convert.py", line 23: bodies = [a.get_text() for a in soup.find_all('body')] raises TypeError: 'NoneType' object is not callable. Is there something I need to do?

It works fine for me. Try this: curl https://raw.github.com/gist/2646540/129f95c11cffa159daeec184ba47a57217379060/convert.py > convert.py; python convert.py

from bs4 did it for me.
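
A likely explanation for that TypeError, given that importing from bs4 fixed it: BeautifulSoup 3 (imported with `from BeautifulSoup import BeautifulSoup`, as in the question) spells the method `findAll` and resolves unknown attributes as tag lookups. `soup.find_all` therefore searches for a tag named find_all, finds none, and returns None, which is then called. The class below is a hypothetical stand-in written only to reproduce that behaviour; it is not the real library.

```python
class OldStyleSoup:
    """Hypothetical stand-in, not the real library: mimics how BeautifulSoup 3
    treats an unknown attribute as a search for a tag with that name."""

    def findAll(self, name):
        # BeautifulSoup 3 spells the method in camelCase.
        return []

    def __getattr__(self, name):
        # Called only when normal lookup fails. No tag is named "find_all",
        # so the lookup yields None instead of raising AttributeError.
        return None


soup = OldStyleSoup()
try:
    soup.find_all("body")  # same call as in the answer's code
    message = ""
except TypeError as exc:
    message = str(exc)

print(message)
```

With bs4 installed and imported, `find_all` is a real method and the answer's code runs as written.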

Another way to remove empty XML or HTML tags is to use a recursive function that searches for empty tags and removes them with .extract(). This way, you don't have to manually list the tags you want to keep. It also makes it possible to clean up nested empty tags.

from bs4 import BeautifulSoup 
import re 
nonwhite=re.compile(r'\S+',re.U) 

html_doc1=""" 
<article> 
<uselesstag2> 
<uselesstag1> 
</uselesstag1> 
</uselesstag2> 
<topic>oil, gas</topic> 
<body>body text</body> 
</article> 

<p>21.09.2009</p> 
<p> </p> 
<p1><img src="http://www.www.com/"></p1> 
<p></p> 

<!--- This article is about cars---> 
<article> 
<topic>cars</topic> 
<body>body text</body> 
</article> 
""" 

def nothing_inside(thing): 
    # select only tags to examine, leave comments/strings 
    try: 
        # keep <img> tags that have a non-empty src, even though they have no contents 
        if thing.name == 'img' and thing['src'] != '': 
            return False 
        # check for any non-whitespace contents 
        for item in thing.contents: 
            # search(), not match(): the text may begin with whitespace 
            if nonwhite.search(item): 
                return False 
        return True 
    except Exception: 
        # a missing src or a nested tag lands here; treat the tag as not empty 
        return False 

def scrub(thing): 
    # repeat as long as an empty tag exists, so nested empty tags are removed too 
    while thing.find_all(nothing_inside, recursive=True) != []: 
        for emptytag in thing.find_all(nothing_inside, recursive=True): 
            emptytag.extract() 
            scrub(thing) 
    return thing 

soup = BeautifulSoup(html_doc1, 'html.parser') 
print(scrub(soup))

Result:

<article> 

<topic>oil, gas</topic> 
<body>body text</body> 
</article> 
<p>21.09.2009</p> 

<p1><img src="http://www.www.com/"/></p1> 

<!--- This article is about cars---> 
<article> 
<topic>cars</topic> 
<body>body text</body> 
</article>

Source

2012-08-16 16:51:45 Kao
