2016-06-29 22 views
-1

Using Python to split text into paragraphs

I need to use Python to split text into paragraphs. I tried nltk.tokenize.texttiling, but it did not give me any results. Here is an excerpt of the text:

– [Voiceover] Bob Dylan is, 
you must be 20 years old now, 
aren't you? 
– [Voiceover] Yeah, I must be 20. 
(laughing) 
– [Voiceover] Are you? 
– [Voiceover] Yeah, I'm 20, I'm 20. 
(guitar music) 
My hands are cold. 
It's a pretty cold studio. 
– [Voiceover] The coldest studio. 
– [Voiceover] Usually can do this. 
There I just want to do it once. 
(guitar strumming) 
– [Voiceover] When I first heard Bob Dylan 
was, I think, about three 
years ago in Minneapolis. 
– [Voiceover] At that time I 
was just sort of doing nothing. 
I was there working, I guess. 
I was making pretend I was 
going to school out there. 
I'd just come there from South Dakota. 
– [Voiceover] You've sung 
now at Goody's here in town. 
Have you sung at any of the coffee houses? 
+3

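I have never heard of such a thing. In any case, are you trying to parse a transcript of a human conversation? Breaking it into paragraphs seems fairly pointless. Paragraphs are a way of organizing written text and do not apply to spoken conversation. –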

+0

How do you split sentences into paragraphs? Paragraphs are made up of sentences, and sentences are made up of words. – direprobs

+0

You can find the sentences with the NLTK module. Then, wherever the topic changes completely between sentences, a new paragraph begins. That is how TextTiling works. – Shurik
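
The idea in the comment above (start a new paragraph where the vocabulary shifts between neighbouring sentences) can be illustrated with a small standard-library sketch. This is not NLTK's TextTiling implementation, only a simplified stand-in: the function names, the `window` and `threshold` parameters, and the sample sentences are all assumptions made up for the illustration, and unlike the real algorithm it does no stop-word removal or smoothing.

```python
import math
import re
from collections import Counter


def bag_of_words(sentences):
    """Count lowercase word tokens across a list of sentences."""
    words = []
    for s in sentences:
        words.extend(re.findall(r"[a-z']+", s.lower()))
    return Counter(words)


def cosine(a, b):
    """Cosine similarity between two Counters (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def split_on_topic_shift(sentences, window=2, threshold=0.1):
    """Group sentences into paragraphs, breaking where the windows of
    sentences before and after a gap share almost no vocabulary."""
    if not sentences:
        return []
    paragraphs = [[sentences[0]]]
    for i in range(1, len(sentences)):
        left = bag_of_words(sentences[max(0, i - window):i])
        right = bag_of_words(sentences[i:i + window])
        if cosine(left, right) < threshold:
            paragraphs.append([])  # vocabulary changed: start a new paragraph
        paragraphs[-1].append(sentences[i])
    return [" ".join(p) for p in paragraphs]


# Hypothetical sample sentences, loosely based on the transcript's vocabulary.
sentences = [
    "The studio is cold.",
    "My hands are cold and the studio is cold.",
    "Minneapolis had many coffee houses.",
    "I sang at coffee houses around Minneapolis.",
]
paragraphs = split_on_topic_shift(sentences)
for p in paragraphs:
    print(p)
    print()
```

On short conversational turns like the excerpt in the question, word overlap is a weak signal, so the threshold would need tuning and the result may still be poor.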

Answers

0

This looks quite simple; you can use a regular expression to do it. I don't know which format you want, but here is an example:

import re 

sentence = """ 
– [Voiceover] Bob Dylan is, 
you must be 20 years old now, 
aren't you? 
– [Voiceover] Yeah, I must be 20. 
(laughing) 
– [Voiceover] Are you? 
– [Voiceover] Yeah, I'm 20, I'm 20. 
(guitar music) 
My hands are cold. 
It's a pretty cold studio. 
– [Voiceover] The coldest studio. 
– [Voiceover] Usually can do this. 
There I just want to do it once. 
(guitar strumming) 
– [Voiceover] When I first heard Bob Dylan 
was, I think, about three 
years ago in Minneapolis. 
– [Voiceover] At that time I 
was just sort of doing nothing. 
I was there working, I guess. 
I was making pretend I was 
going to school out there. 
I'd just come there from South Dakota. 
– [Voiceover] You've sung 
now at Goody's here in town. 
Have you sung at any of the coffee houses? 

""" 

# Each speaker turn starts with an en dash and a bracketed label, e.g. "– [Voiceover]".
start_re = re.compile(r'–\s\[.*?\]')
result = start_re.split(sentence)
# Collapse line breaks and extra whitespace inside each turn, then drop empty pieces.
result = [' '.join(s.split()) for s in result]
result = [s for s in result if s]
print(result)

The output is:

["Bob Dylan is, you must be 20 years old now, aren't you?", 'Yeah, I must be 20. (laughing)', 'Are you?', "Yeah, I'm 20, I'm 20. (guitar music) My hands are cold. It's a pretty cold studio.", 'The coldest studio.', 'Usually can do this. There I just want to do it once. (guitar strumming)', 'When I first heard Bob Dylan was, I think, about three years ago in Minneapolis.', "At that time I was just sort of doing nothing. I was there working, I guess. I was making pretend I was going to school out there. I'd just come there from South Dakota.", "You've sung now at Goody's here in town. Have you sung at any of the coffee houses?"]
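
If the goal is paragraph-formatted text rather than a Python list, one possible variation is to treat each speaker turn as a paragraph and join the turns with blank lines. The sketch below assumes that every turn begins at the start of a line with an en dash and a bracketed label, as in the excerpt; the names `transcript`, `turn_start`, and `paragraph_text` are only illustrative, and the sample is shortened.

```python
import re

# Shortened sample in the same shape as the excerpt from the question.
transcript = """\
– [Voiceover] Bob Dylan is,
you must be 20 years old now,
aren't you?
– [Voiceover] Yeah, I must be 20.
(laughing)
– [Voiceover] Are you?
"""

# A turn starts on a line that begins with an en dash and a bracketed label.
turn_start = re.compile(r"^–\s\[.*?\]\s*", flags=re.MULTILINE)

# Split into turns and collapse the line breaks inside each one.
turns = [" ".join(chunk.split()) for chunk in turn_start.split(transcript)]
turns = [t for t in turns if t]

# One blank line between turns, i.e. one "paragraph" per speaker turn.
paragraph_text = "\n\n".join(turns)
print(paragraph_text)
```

This only groups lines by speaker; it does not detect topic changes within or across turns.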
+0

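I think this finds sentences, not paragraphs. Identifying paragraphs is a hard task and the boundaries are subtle. I would be grateful if you could point me to a library other than nltk.tokenize.texttiling. – Shurik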
