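How can I remove the "Forwarded message" header and unwanted content from the body of Enron emails?

I am trying to append the bodies of all the Enron emails to a single file so that I can process the text by removing stop words and splitting it into sentences with NLTK. My problem is with forwarded and replied messages: I don't know how to clean them. Here is my code so far:
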
import os, email, sys, re, nltk, pprint
from email.parser import Parser

rootdir = '/Users/art/Desktop/maildir/lay-k/elizabeth'

# function that appends the body of one email to the list
def email_analyse(inputfile, email_body):
    with open(inputfile, "r") as f:
        data = f.read()
    msg = Parser().parsestr(data)
    email_body.append(msg.get_payload())
# end of function

# defining a list that will contain bodies
email_body = []

# call the function email_analyse for every file in the directory
for directory, subdirectory, filenames in os.walk(rootdir):
    for filename in filenames:
        email_analyse(os.path.join(directory, filename), email_body)

# the stage where I clean the emails
with open("email_body.txt", "w") as f:
    for val in email_body:
        if val:
            val = val.replace("\n", "")
            val = val.replace("=01", "")
            # for some reason I had many of ==20 and =01 in my text
            val = val.replace("==20", "")
            f.write(val)
            f.write("\n")

Here is part of the output. As you can see, the result is far from clean text:

Well, with the photographer and the band, I would say we've pretty much outdone our budget! Here's the information on the photographer. I have a feeling for some of the major packages we could negotiate at least a couple of hours at the rehearsal dinner. I have no idea how much this normally costs, but he isn't cheap!---------------------- Forwarded by Elizabeth Lay/HOU/AZURIX on 09/13/99 07:34 PM [email protected] on 09/13/99 05:37:37 PMPlease respond to [email protected] To: Elizabeth Lay/HOU/[email protected]: Subject: Denis Reggie Wedding PhotographyHello Elizabeth:Congratulations on your upcoming marriage! I am Ashley Collins, Mr.Reggie's Coordinator. Linda Kessler forwarded your e.mail address to me sothat I may provide you with information on photography coverage for Mr.Reggie's wedding photography.

Any ideas on how to do this properly?
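
One possible direction, offered only as a rough sketch: cut each body at the first forwarded or reply marker before the newlines are removed, then decode what look like quoted-printable escapes (`=20`, `=01`) instead of replacing them by hand. The names `QUOTE_MARKER`, `strip_quoted_tail`, `decode_quoted_printable`, and `clean_body` are hypothetical. The "Forwarded by" pattern is taken from the sample output above, while the "Original Message" pattern and the guess that the `=XX` sequences are quoted-printable are assumptions that should be checked against the actual files.

```python
import quopri
import re

# Assumed markers: "Forwarded by" matches the sample output in the question;
# "-----Original Message-----" is an assumption about how replies are quoted.
QUOTE_MARKER = re.compile(
    r"-{2,}\s*(?:Forwarded by\b|Original Message\s*-{2,})",
    re.IGNORECASE,
)

# Control characters other than tab, newline and carriage return.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def strip_quoted_tail(body):
    """Keep only the text that precedes the first forwarded/reply marker."""
    match = QUOTE_MARKER.search(body)
    if match:
        body = body[:match.start()]
    return body.strip()


def decode_quoted_printable(text):
    """Decode =XX escapes, assuming they are quoted-printable.

    Encoding to latin-1 with errors="replace" is lossy for characters
    outside that range; this is a simplification for the sketch.
    """
    raw = text.encode("latin-1", errors="replace")
    return quopri.decodestring(raw).decode("latin-1")


def clean_body(body):
    """Drop the quoted tail, decode escapes, and collapse whitespace."""
    text = strip_quoted_tail(body)
    text = decode_quoted_printable(text)
    text = CONTROL_CHARS.sub("", text)
    # Join on single spaces so that words on adjacent lines are not fused.
    return " ".join(text.split())
```

With something like this, the three `replace` calls in the cleaning loop could be swapped for `val = clean_body(val)`. Joining lines with a space rather than an empty string should also avoid fused words such as "sothat" in the output. Note that this discards the forwarded text entirely; if the forwarded content is wanted, the header lines after the marker would need to be removed separately.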