String manipulation in Python

I'm trying to get into NLP using NLTK. I understand most of the code below, but I can't work out what x.sub("", word) and if not new_word in "" mean.

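import re
import string
from nltk.tokenize import word_tokenize

text = ["It is a pleasant evening.", "Guests, who came from the US arrived at the venue.", "Food was tasty."]

# Split each document into word and punctuation tokens.
tokenized_docs = [word_tokenize(doc) for doc in text]
print(tokenized_docs)

# Regex matching any single punctuation character.
x = re.compile("[%s]" % re.escape(string.punctuation))
token_nop = []
for sentence in tokenized_docs:
    new_sent = []
    for word in sentence:
        new_word = x.sub('', word)
        if not new_word in '':
            # Collect into new_sent; appending to sentence while iterating over it never terminates.
            new_sent.append(new_word)
    token_nop.append(new_sent)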

Source

2016-10-04 Savon Brown

What do you expect this code to do? Does it not do what you expect? – lenz

For simple things like this, Python is practically self-documenting. You can always start the Python interpreter and print a function's __doc__ attribute to see what it does.

>>> import re 
>>> print(re.compile(".*").sub.__doc__) 
sub(repl, string[, count = 0]) --> newstring 
    Return the string obtained by replacing the leftmost non-overlapping 
    occurrences of pattern in string by the replacement repl.

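So, as the docstring shows, sub simply performs a replacement on matches of the given regular expression pattern. (If you are not familiar with regular expressions in Python, check this out.) For example:

>>> import re 
>>> s = "Hello world" 
>>> p = re.compile("[Hh]ello") 
>>> p.sub("Goodbye", s) 
'Goodbye world'

As for in, it just checks whether new_word is the empty string: '' is the only substring of '', so new_word in '' is true only for an empty new_word, and not new_word in '' keeps the non-empty words.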
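Applied to the loop in the question, the two pieces fit together roughly as in the sketch below. It is an illustration under stated assumptions, not the original code: the token lists are typed in by hand as a stand-in for what word_tokenize might return (so the example runs without NLTK), and the names punct_re and strip_punctuation are made up for this example.

```python
import re
import string

# Character class matching any single ASCII punctuation character.
punct_re = re.compile("[%s]" % re.escape(string.punctuation))

# Hand-written tokens standing in for word_tokenize output (assumption).
tokenized_docs = [
    ["It", "is", "a", "pleasant", "evening", "."],
    ["Guests", ",", "who", "came", "from", "the", "U.S.", "arrived", "."],
]


def strip_punctuation(docs):
    """Return a copy of docs with punctuation removed and empty tokens dropped."""
    result = []
    for sentence in docs:
        new_sent = []
        for word in sentence:
            # Delete every punctuation character from the token.
            new_word = punct_re.sub("", word)
            # '' is the only substring of '', so this means "new_word is not empty".
            if not new_word in "":
                new_sent.append(new_word)
        result.append(new_sent)
    return result


token_nop = strip_punctuation(tokenized_docs)
print(token_nop)
```

Tokens that consist only of punctuation, such as "." and ",", become empty strings and are dropped, while "U.S." is kept as "US". A more common way to spell the same test is simply if new_word:.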

Source

2016-10-04 14:26:22

Oh, I simply misunderstood. I thought the first parameter of re.sub was the thing being replaced, but it is actually the replacement. So regex.sub("", string) replaces the first occurrence of the regex in string with ""?

@SavonBrown Yes, exactly. I have edited my answer to include an example.
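One detail worth checking in the interpreter: as the count = 0 default in the docstring suggests, sub replaces every non-overlapping match rather than only the first, and passing count=1 limits it to the leftmost one. A small sketch, where the sample word and the variable names are made up for illustration:

```python
import re
import string

# Same punctuation pattern as in the question.
punct_re = re.compile("[%s]" % re.escape(string.punctuation))
word = "U.S.A."

all_removed = punct_re.sub("", word)             # default count=0: replace all matches
first_removed = punct_re.sub("", word, count=1)  # replace only the leftmost match

print(all_removed)    # USA
print(first_removed)  # US.A.
```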
