How do I remove the "Forwarded message" headers and unwanted content from the bodies of Enron emails?

I am trying to append the bodies of all the Enron emails to a single file so that I can process the text by removing stop words and splitting it into sentences with NLTK. My problem is with forwarded and replied-to messages: I am not sure how to clean them. This is my code so far:

    import os
    from email.parser import Parser

    rootdir = '/Users/art/Desktop/maildir/lay-k/elizabeth'

    # Function that appends the body of one email to email_body
    def email_analyse(inputfile, email_body):
        with open(inputfile, "r") as f:
            data = f.read()

        # Named msg so that it does not shadow the email module
        msg = Parser().parsestr(data)
        email_body.append(msg.get_payload())

    # List that will contain the bodies
    email_body = []

    # Call email_analyse for every file in the directory tree
    for directory, subdirectory, filenames in os.walk(rootdir):
        for filename in filenames:
            email_analyse(os.path.join(directory, filename), email_body)

    # The stage where I clean the emails
    with open("email_body.txt", "w") as f:
        for val in email_body:
            if val:
                val = val.replace("\n", "")
                val = val.replace("=01", "")
                # For some reason I had many ==20 and =01 in my text
                val = val.replace("==20", "")
                f.write(val)
                f.write("\n")

This is part of the output, so the result is not clean text at all:

Well, with the photographer and the band, I would say we've pretty much outdone our budget! Here's the information on the photographer. I have a feeling for some of the major packages we could negotiate at least a couple of hours at the rehearsal dinner. I have no idea how much this normally costs, but he isn't cheap!---------------------- Forwarded by Elizabeth Lay/HOU/AZURIX on 09/13/99 07:34 PM [email protected] on 09/13/99 05:37:37 PMPlease respond to [email protected] To: Elizabeth Lay/HOU/[email protected]: Subject: Denis Reggie Wedding PhotographyHello Elizabeth:Congratulations on your upcoming marriage! I am Ashley Collins, Mr.Reggie's Coordinator. Linda Kessler forwarded your e.mail address to me sothat I may provide you with information on photography coverage for Mr.Reggie's wedding photography.

Any ideas on how to do this properly?

Source

2017-12-10 Art

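Because the format should be consistent across the corpus, you may want to look at regular expressions to parse the forwarded and reply text.

To delete the forwarded text, you could use a regular expression like this:

    -{4,}(.*)(\d{2}:\d{2}:\d{2})\s*(PM|AM)

This matches everything between four or more hyphens and a time in the format XX:XX:XX PM.

Matching three dashes would probably work too. The point is just to avoid matching hyphens and em-dashes inside the email body. You can play around with this regex, and write your own for matching the To and Subject headers, at this link: https://regex101.com/r/VGG4bu/1/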
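As a rough illustration, the pattern above could be applied with Python's `re` module as sketched below. Two things here are assumptions of this sketch rather than part of the regex as posted: the `.*` is made non-greedy (`.*?`) so that a body containing several timestamps is not consumed up to the last one, and `re.DOTALL` is set so the match still works if newlines have not been stripped yet. The helper name `strip_forward_header` is hypothetical, and the sample string is taken from the output shown in the question.

```python
import re

# Four or more hyphens, then anything up to the first HH:MM:SS AM/PM timestamp.
# Assumption: non-greedy `.*?` and re.DOTALL are choices made for this sketch.
FORWARD_HEADER = re.compile(
    r"-{4,}.*?\d{2}:\d{2}:\d{2}\s*(?:PM|AM)",
    re.DOTALL,
)


def strip_forward_header(body):
    """Replace each forwarded-message header in body with a single space.

    Hypothetical helper: it only removes the span matched by FORWARD_HEADER,
    not the To/Subject lines that follow it.
    """
    return FORWARD_HEADER.sub(" ", body)


sample = (
    "but he isn't cheap!---------------------- Forwarded by "
    "Elizabeth Lay/HOU/AZURIX on 09/13/99 07:34 PM [email protected] "
    "on 09/13/99 05:37:37 PMPlease respond to [email protected]"
)

print(strip_forward_header(sample))
```

In the question's loop this could be called on `val` before writing it to the file. One limitation to keep in mind: if a header has hyphens but no HH:MM:SS timestamp, the non-greedy match will keep going until the next timestamp anywhere in the body, so it is worth checking the results on a sample of the corpus.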

You can also take a look at section 3.4 of the NLTK book, which covers regular expressions in Python: http://www.nltk.org/book/ch03.html

Good luck! This sounds like an interesting project.

Source

2017-12-19 18:48:26
