How do I split a multibyte string into words in PHP?

Here is what I have so far, but I would like to improve the code:

mb_internal_encoding('UTF-8');
mb_regex_encoding('UTF-8');
// "-" is placed last so it is a literal hyphen; ":-_" would be a range that also covers A-Z.
$arr = mb_split('[\s\[\]().,;:_-]', $str);

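Is there a way to say that a word is a sequence of "alpha" characters? (I do not want to use the a-z notation, because I need to include letters from outside the English alphabet.)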

Source

2011-12-07 ragnarius

What does your string look like, and which character set are you using? –

How about \b word boundaries? –

I am using UTF-8! – ragnarius

Here, try this:

preg_match_all('/[\p{L}\p{M}]+/u', $subject, $result, PREG_PATTERN_ORDER);
foreach ($result[0] as $word) {
    // $word holds one matched word
}

This matches every possible letter, accented ones included, as part of a word:

 " 
[\p{L}\p{M}]  # Match a single character present in the list below 
        # A character with the Unicode property “letter” (any kind of letter from any language) 
        # A character with the Unicode property “mark” (a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.)) 
    +    # Between one and unlimited times, as many times as possible, giving back as needed (greedy) 
"
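
As a rough sketch of how the pattern above could be wrapped for reuse: the helper name `split_words` is hypothetical and not part of the original answer, and the sample sentence is only an illustration. It assumes the source file and the input are valid UTF-8.

```php
<?php
// Hypothetical helper (an assumption, not from the original answer):
// returns the words of a UTF-8 string as an array.
function split_words(string $text): array
{
    // \p{L} = any letter, \p{M} = combining mark.
    // The u modifier makes PCRE treat pattern and subject as UTF-8.
    if (preg_match_all('/[\p{L}\p{M}]+/u', $text, $matches) === false) {
        return []; // preg_match_all() failed, e.g. on invalid UTF-8 input
    }
    return $matches[0];
}

print_r(split_words('också här finns hö, eller?'));
```

Digits and punctuation are not letters or marks, so they act as separators here; whether that is the desired definition of a "word" depends on the input.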


Source

2011-12-07 20:53:14 FailedDev

With non-Latin characters, the last letter of a word goes missing: "ocksåhärfinnshö" => ocks, här, finns, h – ragnarius

@ragnarius The answer has been fixed. The reason was that word boundaries do not work well with UTF-8. – FailedDev

Great! But what does it mean? – ragnarius
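
The remark about word boundaries and UTF-8 is not spelled out in the thread. One likely reading, offered here as an assumption, is that without the u modifier PCRE works on bytes rather than on characters, so a letter such as "å" is seen as two separate bytes. A minimal sketch of that difference:

```php
<?php
// "å" (U+00E5) written as its two UTF-8 bytes, so the example
// does not depend on the encoding of this source file.
$char = "\xC3\xA5";

// Without the u modifier, "." matches one byte at a time.
$byteCount = preg_match_all('/./', $char);

// With the u modifier, "." matches one whole code point.
$codePointCount = preg_match_all('/./u', $char);

printf("bytes: %d, code points: %d\n", $byteCount, $codePointCount);
// prints "bytes: 2, code points: 1"
```

This does not reproduce the exact output quoted in the comments; it only shows why a pattern run in byte mode can cut a word at a non-ASCII letter.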