Tokenizing text by separating punctuation from words, except in abbreviations and apostrophes

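I am trying to tokenize input text by separating punctuation from words, while leaving abbreviations and apostrophes intact. I am using Python and the nltk library, but I think my regular expression is wrong, because the output is still incorrect.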

# coding: utf-8
import re
import nltk
from nltk.tokenize import regexp_tokenize

# Adjacent string literals are concatenated, so the text can span several lines.
text = (
    "\"Predictions suggesting that large changes in weight will "
    "accumulate indefinitely in response to small sustained lifestyle "
    "modifications rely on the half-century-old 3,500 calorie rule, which "
    "equates a weight alteration of 2.2 lb to a 3,500 calories cumulative "
    "deficit or increment,\" write the study authors Dr. Jampolis, Dr. "
    "Chaudry, and Prof. Harlen, from N.P.C Clinic in OH. The 3,500- calorie "
    "rule \"predicts that a person who increases daily energy expenditure by "
    "100 calories by walking 1 mile per day\" will lose 50 pounds over five "
    "years, the authors say. But the true weight loss is only about 10 "
    "pounds if calorie intake doesn't increase, \"because changes in mass "
    "... alter the energy requirements of the body’s make-up.\" \"This is a "
    "myth, strictly speaking, but the smaller amount of weight loss achieved "
    "with small changes is clinically significant and should not be "
    "discounted,\" says Dr. Melina Jampolis, CNN diet and fitness expert."
)

# Raw string so the backslashes reach the regex engine unchanged.
print(regexp_tokenize(text, pattern=r'(?:(?!\d)\w)+|\S+'))

Any help would be appreciated.
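A short sample makes the problem easier to see. The sketch below applies the same pattern with `re.findall`, on the assumption that `regexp_tokenize` behaves the same way for a pattern like this one (no gaps, no capturing groups). The sample sentence is made up for illustration.

```python
import re

# Same pattern as in the question.
PATTERN = r'(?:(?!\d)\w)+|\S+'

# Hypothetical sample containing an abbreviation, an apostrophe,
# a number with a comma, and quotation marks.
sample = 'Dr. Jampolis doesn\'t eat 3,500 calories, "strictly speaking."'

tokens = re.findall(PATTERN, sample)
print(tokens)
# ['Dr', '.', 'Jampolis', 'doesn', "'t", 'eat', '3,500', 'calories', ',',
#  '"strictly', 'speaking', '."']
```

The first alternative stops at any non-word character, so "Dr." and "doesn't" are split, while the `\S+` fallback swallows whatever follows a punctuation mark up to the next space, which is why the opening quote stays glued to "strictly".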

Source

2017-09-18 user3432543

I'm not clear on what your desired output is. – rahlf23

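The desired output is the tokenized text, but with punctuation such as apostrophes not split off (the word stays one token) and with abbreviations not split either (each stays one token). – user3432543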

So you basically just want to remove "/", "\", "," and the quotation marks? – rahlf23

This should do the trick. You can simply use re.sub here to replace any of these unwanted punctuation marks with nothing (i.e. '').

import re

s = 'Insert your text here'

# Delete "\" and \" sequences, three-dot ellipses, and commas.
new = re.sub(r'(\"\\\")|(\\\")|[.]{3}|,', '', s)

print(new)
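To check what the substitution actually removes, here is a minimal sketch on two made-up sample strings. One detail worth keeping in mind: in an ordinary Python string literal, `\"` is just an escape for a plain quotation mark, so the resulting string contains no backslash. The two backslash alternatives therefore only fire when the text really holds literal backslash characters, as in the raw string below.

```python
import re

# Pattern copied from the answer above.
PATTERN = r'(\"\\\")|(\\\")|[.]{3}|,'

# Ordinary literal: the string holds plain quotation marks and no backslashes.
plain = 'He said, "wait ... here."'

# Raw literal: the string holds a literal backslash before each quote.
with_backslashes = r'He said, \"wait ... here.\"'

plain_result = re.sub(PATTERN, '', plain)
backslash_result = re.sub(PATTERN, '', with_backslashes)

print(repr(plain_result))      # 'He said "wait  here."'
print(repr(backslash_result))  # 'He said wait  here.'
```

In the first case only the comma and the ellipsis are deleted and the quotation marks survive. In the second case the backslash-quote pairs are deleted as well.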

The tricky part of this regular expression is escaping all the backslashes. To break it down:

(\"\\\")

finds any "\"

(\\\")

finds any \"

[.]{3}

finds any ...

,

finds any ,

The pipes act as 'or' operators. Hopefully this meets all of your requirements.
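If the goal is to keep punctuation as separate tokens rather than delete it, as the asker's comment describes, a tokenizer pattern may be closer to what is needed. The following is only a sketch under assumptions: the name `TOKEN_RE`, the `tokenize` helper, and the short list of title abbreviations are all hypothetical, and the list would need extending for real text. Alternatives are tried in order, so abbreviations and numbers are matched before plain words.

```python
import re

# Hypothetical token pattern; alternatives are tried from top to bottom.
TOKEN_RE = re.compile(r"""
    (?:[A-Z]\.)+(?:[A-Z]\b)?     # dotted initialisms, e.g. N.P.C or U.S.
  | (?:Dr|Prof|Mrs|Mr|Ms)\.      # assumed list of title abbreviations
  | \d+(?:[.,]\d+)*              # numbers such as 3,500 or 2.2
  | \w+(?:[-'’]\w+)*             # words with internal hyphens or apostrophes
  | \.{3}                        # an ellipsis as one token
  | [^\w\s]                      # any other single punctuation character
""", re.VERBOSE)


def tokenize(text):
    """Return the list of tokens matched by TOKEN_RE, in order."""
    return TOKEN_RE.findall(text)


sample = ('Dr. Jampolis, from N.P.C Clinic in OH, says it doesn\'t take '
          '3,500 calories ... "strictly speaking."')
tokens = tokenize(sample)
print(tokens)
# ['Dr.', 'Jampolis', ',', 'from', 'N.P.C', 'Clinic', 'in', 'OH', ',', 'says',
#  'it', "doesn't", 'take', '3,500', 'calories', '...', '"', 'strictly',
#  'speaking', '.', '"']
```

A known limitation of this sketch is that a single capital letter followed by a sentence-ending period (for example "Plan A.") is treated as an initialism, so the period stays attached.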

Source

2017-09-18 18:12:36 rahlf23
