2016-02-11 8 views
8

I was going through this question. Python re.split() vs nltk word_tokenize and sent_tokenize

I am just wondering whether NLTK is faster than regular expressions for word/sentence tokenization.

+0

...why not just try it? Run a sample and time it with `timeit`? – lenz

+0

I am new to Python, and especially to nltk. I just noticed that re.split() and s.split() were faster once I switched away from nltk. I was using this: sentences = sent_tokenize(txt), and now I use this: sentences = re.split(r'(?<=[^A-Z].[.?]) +(?=[A-Z])', txt) – wakamdr

+0

Could nltk be slow because it has to load WordNet at runtime? – wakamdr

Answers

15

The default nltk.word_tokenize() uses a tokenizer that emulates the tokenizer from the Penn Treebank.

Note that str.split() does not produce tokens in the linguistic sense, e.g.:

>>> sent = "This is a foo, bar sentence." 
>>> sent.split() 
['This', 'is', 'a', 'foo,', 'bar', 'sentence.'] 
>>> from nltk import word_tokenize 
>>> word_tokenize(sent) 
['This', 'is', 'a', 'foo', ',', 'bar', 'sentence', '.'] 

It is usually used to split a string on a specified delimiter: for example, in a tab-separated file you can use str.split('\t'), or you can split on the newline \n when your text file has one sentence per line.
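As a small illustration of those two delimiter cases, here is a minimal sketch. The sample strings `row` and `doc` are made up for this example and are not from the benchmark data.

```python
# Hypothetical sample data, invented for illustration only.
row = "alice\t30\tBerlin"
doc = "First sentence.\nSecond sentence.\nThird one."

# Splitting a tab-separated record into its fields.
fields = row.split('\t')
print(fields)

# Splitting a text that already has one sentence per line.
sentences = doc.split('\n')
print(sentences)
```

In both cases the delimiter already encodes the structure, so no linguistic knowledge is needed.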

Let's do some benchmarking in Python 3:

import time
import urllib.request

from nltk import word_tokenize

# Download the test corpus once and decode it to text.
url = 'https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt'
response = urllib.request.urlopen(url)
data = response.read().decode('utf8')

# Time ten passes of whitespace splitting over every line.
for _ in range(10):
    start = time.time()
    for line in data.split('\n'):
        line.split()
    print('str.split():\t', time.time() - start)

# Time ten passes of NLTK word tokenization over every line.
for _ in range(10):
    start = time.time()
    for line in data.split('\n'):
        word_tokenize(line)
    print('word_tokenize():\t', time.time() - start)

[out]:

str.split():  0.05451083183288574
str.split():  0.054320573806762695
str.split():  0.05368804931640625
str.split():  0.05416440963745117
str.split():  0.05299568176269531
str.split():  0.05304527282714844
str.split():  0.05356955528259277
str.split():  0.05473494529724121
str.split():  0.053118228912353516
str.split():  0.05236077308654785
word_tokenize():  4.056122779846191
word_tokenize():  4.052812337875366
word_tokenize():  4.042144775390625
word_tokenize():  4.101543664932251
word_tokenize():  4.213029146194458
word_tokenize():  4.411528587341309
word_tokenize():  4.162556886672974
word_tokenize():  4.225975036621094
word_tokenize():  4.22914719581604
word_tokenize():  4.203172445297241

If we try another tokenizer in bleeding-edge NLTK, from https://github.com/jonsafari/tok-tok/blob/master/tok-tok.pl:

import time 
from nltk.tokenize import ToktokTokenizer 

import urllib.request 
url = 'https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt' 
response = urllib.request.urlopen(url) 
data = response.read().decode('utf8') 

toktok = ToktokTokenizer().tokenize 

# Time ten passes of Toktok tokenization over every line.
for _ in range(10):
    start = time.time()
    for line in data.split('\n'):
        toktok(line)
    print('toktok:\t', time.time() - start)

[out]:

toktok: 1.5902607440948486 
toktok: 1.5347232818603516 
toktok: 1.4993178844451904 
toktok: 1.5635688304901123 
toktok: 1.5779635906219482 
toktok: 1.8177132606506348 
toktok: 1.4538452625274658 
toktok: 1.5094449520111084 
toktok: 1.4871931076049805 
toktok: 1.4584410190582275 

(Note: the text file comes from https://github.com/Simdiva/DSL-Task)
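As suggested in the comments, the same kind of comparison can be made with `timeit` instead of hand-rolled `time.time()` loops. The following is only a sketch under stated assumptions: `regex_tokenize` and its pattern are a hypothetical stand-in for a rule-based tokenizer (it is not NLTK's implementation), and the repeated sample sentence replaces the downloaded corpus so the snippet runs offline.

```python
import re
import timeit

# Hypothetical stand-in tokenizer: runs of word characters, or single
# punctuation marks. Pre-compiled once, outside the timed code.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def regex_tokenize(line):
    return TOKEN_RE.findall(line)

# Assumed toy corpus: one sentence repeated, instead of the real test file.
sample = "This is a foo, bar sentence."
lines = [sample] * 1000

def run_split():
    for line in lines:
        line.split()

def run_regex():
    for line in lines:
        regex_tokenize(line)

# repeat() returns one total per repetition; the minimum is the least noisy.
split_time = min(timeit.repeat(run_split, number=10, repeat=3))
regex_time = min(timeit.repeat(run_regex, number=10, repeat=3))
print('str.split():\t', split_time)
print('regex_tokenize():\t', regex_time)
```

The absolute numbers depend on the machine and the input, so they are only meaningful relative to each other on the same run.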


If we look at the native perl implementation, the python vs perl times for ToktokTokenizer are comparable. Admittedly, the Python implementation pre-compiles its regexes while the perl one does not, but the proof is still in the pudding:

$ wget https://raw.githubusercontent.com/jonsafari/tok-tok/master/tok-tok.pl
--2016-02-11 20:36:36-- https://raw.githubusercontent.com/jonsafari/tok-tok/master/tok-tok.pl 
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.31.17.133 
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.31.17.133|:443... connected. 
HTTP request sent, awaiting response... 200 OK 
Length: 2690 (2.6K) [text/plain] 
Saving to: ‘tok-tok.pl’ 

100%[===============================================================================================================================>] 2,690  --.-K/s in 0s  

2016-02-11 20:36:36 (259 MB/s) - ‘tok-tok.pl’ saved [2690/2690] 

$ wget https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt
--2016-02-11 20:36:38-- https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt 
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.31.17.133 
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.31.17.133|:443... connected. 
HTTP request sent, awaiting response... 200 OK 
Length: 3483550 (3.3M) [text/plain] 
Saving to: ‘test.txt’ 

100%[===============================================================================================================================>] 3,483,550 363KB/s in 7.4s 

2016-02-11 20:36:46 (459 KB/s) - ‘test.txt’ saved [3483550/3483550] 

$ time perl tok-tok.pl < test.txt > /tmp/null

real 0m1.703s 
user 0m1.693s 
sys 0m0.008s 
$ time perl tok-tok.pl < test.txt > /tmp/null

real 0m1.715s 
user 0m1.704s 
sys 0m0.008s 
$ time perl tok-tok.pl < test.txt > /tmp/null

real 0m1.700s 
user 0m1.686s 
sys 0m0.012s 
$ time perl tok-tok.pl < test.txt > /tmp/null

real 0m1.727s 
user 0m1.700s 
sys 0m0.024s 
$ time perl tok-tok.pl < test.txt > /tmp/null

real 0m1.734s 
user 0m1.724s 
sys 0m0.008s 

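(Note: when timing tok-tok.pl, we had to redirect the output to a file, so the timing here includes the time the machine takes to write that file, whereas the nltk.tokenize.ToktokTokenizer timing does not include any time for writing output to a file.)

With regard to sent_tokenize(), things are a little different, and comparing speed benchmarks without considering accuracy is a little quirky.

Consider this:

  • If a regex splits a text file/paragraph into 1 sentence, then the speed is almost instantaneous, i.e. 0 work is done. But that would be a horrible sentence tokenizer...

  • If the sentences in a file are already separated by \n, then it is simply a matter of comparing str.split('\n') with re.split('\n'), and nltk would have nothing to do with the sentence tokenization ;P

For how sent_tokenize() works in NLTK, see the following:

To compare sent_tokenize() effectively with regex-based methods, one would also have to evaluate accuracy, and have a dataset with human-evaluated sentences in tokenized form.

Consider this task: https://www.hackerrank.com/challenges/from-paragraphs-to-sentences

Given the text:

In the third category he included those Brothers (the majority) who saw nothing in Freemasonry but the external forms and ceremonies, and prized the strict performance of these forms without troubling about their purport or significance. Such were Willarski and even the Grand Master of the principal lodge. Finally, to the fourth category also a great many Brothers belonged, particularly those who had lately joined. These according to Pierre's observations were men who had no belief in anything, nor desire for anything, but joined the Freemasons merely to associate with the wealthy young Brothers who were influential through their connections or rank, and of whom there were very many in the lodge.Pierre began to feel dissatisfied with what he was doing. Freemasonry, at any rate as he saw it here, sometimes seemed to him based merely on externals. He did not think of doubting Freemasonry itself, but suspected that Russian Masonry had taken a wrong path and deviated from its original principles. And so toward the end of the year he went abroad to be initiated into the higher secrets of the order.What is to be done in these circumstances? To favor revolutions, overthrow everything, repel force by force?No! We are very far from that. Every violent reform deserves censure, for it quite fails to remedy evil while men remain what they are, and also because wisdom needs no violence. "But what is there in running across it like that?" said Ilagin's groom. "Once she had missed it and turned it away, any mongrel could take it," Ilagin was saying at the same time, breathless from his gallop and his excitement.

We want to get this: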

In the third category he included those Brothers (the majority) who saw nothing in Freemasonry but the external forms and ceremonies, and prized the strict performance of these forms without troubling about their purport or significance. 
Such were Willarski and even the Grand Master of the principal lodge. 
Finally, to the fourth category also a great many Brothers belonged, particularly those who had lately joined. 
These according to Pierre's observations were men who had no belief in anything, nor desire for anything, but joined the Freemasons merely to associate with the wealthy young Brothers who were influential through their connections or rank, and of whom there were very many in the lodge. 
Pierre began to feel dissatisfied with what he was doing. 
Freemasonry, at any rate as he saw it here, sometimes seemed to him based merely on externals. 
He did not think of doubting Freemasonry itself, but suspected that Russian Masonry had taken a wrong path and deviated from its original principles. 
And so toward the end of the year he went abroad to be initiated into the higher secrets of the order. 
What is to be done in these circumstances? 
To favor revolutions, overthrow everything, repel force by force? 
No! 
We are very far from that. 
Every violent reform deserves censure, for it quite fails to remedy evil while men remain what they are, and also because wisdom needs no violence. 
"But what is there in running across it like that?" said Ilagin's groom. 
"Once she had missed it and turned it away, any mongrel could take it," Ilagin was saying at the same time, breathless from his gallop and his excitement. 

So simply doing str.split('\n') will give you nothing. Even without considering the order of the sentences, you will get 0 positive results:

>>> text = """In the third category he included those Brothers (the majority) who saw nothing in Freemasonry but the external forms and ceremonies, and prized the strict performance of these forms without troubling about their purport or significance. Such were Willarski and even the Grand Master of the principal lodge. Finally, to the fourth category also a great many Brothers belonged, particularly those who had lately joined. These according to Pierre's observations were men who had no belief in anything, nor desire for anything, but joined the Freemasons merely to associate with the wealthy young Brothers who were influential through their connections or rank, and of whom there were very many in the lodge.Pierre began to feel dissatisfied with what he was doing. Freemasonry, at any rate as he saw it here, sometimes seemed to him based merely on externals. He did not think of doubting Freemasonry itself, but suspected that Russian Masonry had taken a wrong path and deviated from its original principles. And so toward the end of the year he went abroad to be initiated into the higher secrets of the order.What is to be done in these circumstances? To favor revolutions, overthrow everything, repel force by force?No! We are very far from that. Every violent reform deserves censure, for it quite fails to remedy evil while men remain what they are, and also because wisdom needs no violence. "But what is there in running across it like that?" said Ilagin's groom. "Once she had missed it and turned it away, any mongrel could take it," Ilagin was saying at the same time, breathless from his gallop and his excitement. """ 
>>> answer = """In the third category he included those Brothers (the majority) who saw nothing in Freemasonry but the external forms and ceremonies, and prized the strict performance of these forms without troubling about their purport or significance. 
... Such were Willarski and even the Grand Master of the principal lodge. 
... Finally, to the fourth category also a great many Brothers belonged, particularly those who had lately joined. 
... These according to Pierre's observations were men who had no belief in anything, nor desire for anything, but joined the Freemasons merely to associate with the wealthy young Brothers who were influential through their connections or rank, and of whom there were very many in the lodge. 
... Pierre began to feel dissatisfied with what he was doing. 
... Freemasonry, at any rate as he saw it here, sometimes seemed to him based merely on externals. 
... He did not think of doubting Freemasonry itself, but suspected that Russian Masonry had taken a wrong path and deviated from its original principles. 
... And so toward the end of the year he went abroad to be initiated into the higher secrets of the order. 
... What is to be done in these circumstances? 
... To favor revolutions, overthrow everything, repel force by force? 
... No! 
... We are very far from that. 
... Every violent reform deserves censure, for it quite fails to remedy evil while men remain what they are, and also because wisdom needs no violence. 
... "But what is there in running across it like that?" said Ilagin's groom. 
... "Once she had missed it and turned it away, any mongrel could take it," Ilagin was saying at the same time, breathless from his gallop and his excitement.""" 
>>> 
>>> output = text.split('\n') 
>>> sum(1 for sent in text.split('\n') if sent in answer) 
0 
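To make the accuracy point concrete, here is a minimal sketch of how a splitter could be scored against human-segmented sentences. Everything in it is an assumption for illustration: `regex_sent_split` and its pattern are a hypothetical naive splitter (not NLTK's sent_tokenize() and not the regex from the comments), `score` is a made-up helper that counts exact sentence matches, and the two tiny gold sets stand in for a real evaluated dataset.

```python
import re

def regex_sent_split(text):
    # Hypothetical naive rule: split on whitespace that follows ., ? or !
    # and precedes an uppercase letter or an opening double quote.
    parts = re.split(r'(?<=[.?!])\s+(?=[A-Z"])', text)
    return [p.strip() for p in parts if p.strip()]

def score(predicted, gold):
    # Exact-match precision and recall over whole sentences.
    gold_set = set(gold)
    correct = sum(1 for sent in predicted if sent in gold_set)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# An excerpt of the passage above, with its human-segmented sentences.
text = ("What is to be done in these circumstances? To favor revolutions, "
        "overthrow everything, repel force by force? No! "
        "We are very far from that.")
gold = [
    "What is to be done in these circumstances?",
    "To favor revolutions, overthrow everything, repel force by force?",
    "No!",
    "We are very far from that.",
]

print('regex:', score(regex_sent_split(text), gold))
print("str.split('\\n'):", score(text.split('\n'), gold))

# An invented abbreviation case, where the naive rule over-splits.
abbr_text = "Mr. Smith went home. He slept."
abbr_gold = ["Mr. Smith went home.", "He slept."]
print('abbreviation:', score(regex_sent_split(abbr_text), abbr_gold))
```

A fast splitter that scores poorly on cases like the abbreviation example is not really comparable to a slower one that handles them, which is why speed alone says little here.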
+0

Great answer. I liked that you included some simple benchmarks. – erewok

+0

I think the question is about sentence splitting, not word tokenization. – lenz

+0

lenz, it is still a very good answer – wakamdr