2009-12-02 7 views
141

Java Stanford NLP: Part of Speech labels?

The Stanford NLP demo (here) gives output like this:

Colorless/JJ green/JJ ideas/NNS sleep/VBP furiously/RB ./. 

What do the part-of-speech tags mean? I am unable to find an official list. Is it Stanford's own system, or are they using universal tags? (What is JJ, for instance?)

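Also, when I iterate through the sentences looking for nouns, for instance, I end up checking whether the tag `.contains('N')`. This feels pretty weak. Is there a better way to programmatically search for a certain part of speech?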

+0

This may be a nitpick, but you should use `.starts_with('N')` rather than `contains`, since 'IN' and 'VBN' also contain 'N'. And it is probably the best way to find which words the tagger thinks are nouns. – Joseph
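To make the difference between the two checks concrete, here is a minimal sketch in plain Java. The class name `NounTagCheck` and its helper methods are made up for illustration; only the tag strings and the `word/TAG` output format come from this thread. It uses `startsWith("NN")`, a slightly stricter form of the prefix check suggested in the comment above.

```java
import java.util.ArrayList;
import java.util.List;

final class NounTagCheck {
    // The check from the question: also true for IN, VBN and other non-noun tags.
    static boolean containsN(String tag) {
        return tag.contains("N");
    }

    // Prefix check: true only for the noun tags NN, NNS, NNP and NNPS.
    static boolean isNoun(String tag) {
        return tag.startsWith("NN");
    }

    // Collects the words tagged as nouns from whitespace-separated "word/TAG" tokens.
    static List<String> nouns(String tagged) {
        List<String> result = new ArrayList<>();
        for (String token : tagged.trim().split("\\s+")) {
            int slash = token.lastIndexOf('/');
            if (slash <= 0) {
                continue; // no tag attached, skip the token
            }
            String word = token.substring(0, slash);
            String tag = token.substring(slash + 1);
            if (isNoun(tag)) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(containsN("VBN")); // true, although VBN is a verb tag
        System.out.println(isNoun("VBN"));    // false
        System.out.println(nouns("Colorless/JJ green/JJ ideas/NNS sleep/VBP furiously/RB ./."));
    }
}
```

Splitting on the last `/` rather than the first keeps words that themselves contain a slash intact.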

+3

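The answers are all cool, but for whoever can address this: is there a compact resource that explains the meaning of each tag beyond expanding the acronym into a few words, for those who no longer have a fresh memory of things like what a subordinating conjunction is? – matanster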

Answers

224

See the Penn Treebank Project. Look at the Part-of-speech tagging ps document.

JJ is adjective. NNS is noun, plural. VBP is verb, present tense (non-3rd person singular). RB is adverb.

That's for English. For Chinese, it's the Penn Chinese Treebank. And for German it's the NEGRA corpus.

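  1. CC Coordinating conjunction
  2. CD Cardinal number
  3. DT Determiner
  4. EX Existential there
  5. FW Foreign word
  6. IN Preposition or subordinating conjunction
  7. JJ Adjective
  8. JJR Adjective, comparative
  9. JJS Adjective, superlative
  10. LS List item marker
  11. MD Modal
  12. NN Noun, singular or mass
  13. NNS Noun, plural
  14. NNP Proper noun, singular
  15. NNPS Proper noun, plural
  16. PDT Predeterminer
  17. POS Possessive ending
  18. PRP Personal pronoun
  19. PRP$ Possessive pronoun
  20. RB Adverb
  21. RBR Adverb, comparative
  22. RBS Adverb, superlative
  23. RP Particle
  24. SYM Symbol
  25. TO to
  26. UH Interjection
  27. VB Verb, base form
  28. VBD Verb, past tense
  29. VBG Verb, gerund or present participle
  30. VBN Verb, past participle
  31. VBP Verb, non-3rd person singular present
  32. VBZ Verb, 3rd person singular present
  33. WDT Wh-determiner
  34. WP Wh-pronoun
  35. WP$ Possessive wh-pronoun
  36. WRB Wh-adverb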
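As a rough illustration of how such a list can be used in code, the sketch below looks up a short description for each `word/TAG` token of the example sentence. `TagGlossary` and `describe` are hypothetical names, and the table deliberately holds only the four tags explained in this answer rather than the full list.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class TagGlossary {
    // Small illustrative table; extend it with the remaining tags as needed.
    private static final Map<String, String> DESCRIPTIONS = new LinkedHashMap<>();
    static {
        DESCRIPTIONS.put("JJ", "Adjective");
        DESCRIPTIONS.put("NNS", "Noun, plural");
        DESCRIPTIONS.put("VBP", "Verb, non-3rd person singular present");
        DESCRIPTIONS.put("RB", "Adverb");
    }

    // Turns a "word/TAG" token into "word: description".
    static String describe(String token) {
        int slash = token.lastIndexOf('/');
        if (slash < 0) {
            return token + ": (no tag)";
        }
        String word = token.substring(0, slash);
        String tag = token.substring(slash + 1);
        String fallback = "(tag " + tag + " not in this small table)";
        return word + ": " + DESCRIPTIONS.getOrDefault(tag, fallback);
    }

    public static void main(String[] args) {
        String tagged = "Colorless/JJ green/JJ ideas/NNS sleep/VBP furiously/RB ./.";
        for (String token : tagged.split(" ")) {
            System.out.println(describe(token));
        }
    }
}
```

The final token `./.` falls through to the fallback text, because `.` is a punctuation tag and not one of the 36 word tags listed above.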
+0

A suggested edit to fix the deficiencies in this answer was rejected. So please also see my posted answer, which contains some information missing from this one. – Jules

+3

What is the 10th one, LS? – Devavrata

+2

"to" must be special. – quemeful

2

They seem to be Brown Corpus tags.

+14

No, they are Penn English Treebank POS tags, which are a simplification of the Brown Corpus tag set. –

+0

Are you sure? The example quoted above includes the tag ".", which is defined in the Brown Corpus but not in the list of Penn Treebank tags above, so I am fairly sure the answer is at least not as simple as them being Penn Treebank tags. – Jules

+0

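After additional research, it appears they *are* Penn Treebank tags, but the documentation quoted above is incomplete: the Penn Treebank tag set also includes 9 punctuation tags that are omitted from the list in the accepted answer. See my additional answer for details. – Jules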

95
Explanation of each tag from the documentation:

CC: conjunction, coordinating 
    & 'n and both but either et for less minus neither nor or plus so 
    therefore times v. versus vs. whether yet 
CD: numeral, cardinal 
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty- 
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025 
    fifteen 271,124 dozen quintillion DM2,000 ... 
DT: determiner 
    all an another any both del each either every half la many much nary 
    neither no some such that the them these this those 
EX: existential there 
    there 
FW: foreign word 
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous 
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte 
    terram fiche oui corporis ... 
IN: preposition or conjunction, subordinating 
    astride among uppon whether out inside pro despite on by throughout 
    below within for towards near behind atop around if like until below 
    next into if beside ... 
JJ: adjective or numeral, ordinal 
    third ill-mannered pre-war regrettable oiled calamitous first separable 
    ectoplasmic battery-powered participatory fourth still-to-be-named 
    multilingual multi-disciplinary ... 
JJR: adjective, comparative 
    bleaker braver breezier briefer brighter brisker broader bumper busier 
    calmer cheaper choosier cleaner clearer closer colder commoner costlier 
    cozier creamier crunchier cuter ... 
JJS: adjective, superlative 
    calmest cheapest choicest classiest cleanest clearest closest commonest 
    corniest costliest crassest creepiest crudest cutest darkest deadliest 
    dearest deepest densest dinkiest ... 
LS: list item marker 
    A A. B B. C C. D E F First G H I J K One SP-44001 SP-44002 SP-44005 
    SP-44007 Second Third Three Two * a b c d first five four one six three 
    two 
MD: modal auxiliary 
    can cannot could couldn't dare may might must need ought shall should 
    shouldn't will would 
NN: noun, common, singular or mass 
    common-carrier cabbage knuckle-duster Casino afghan shed thermostat 
    investment slide humour falloff slick wind hyena override subhumanity 
    machinist ... 
NNS: noun, common, plural 
    undergraduates scotches bric-a-brac products bodyguards facets coasts 
    divestitures storehouses designs clubs fragrances averages 
    subjectivists apprehensions muses factory-jobs ... 
NNP: noun, proper, singular 
    Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos 
    Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA 
    Shannon A.K.C. Meltex Liverpool ... 
NNPS: noun, proper, plural 
    Americans Americas Amharas Amityvilles Amusements Anarcho-Syndicalists 
    Andalusians Andes Andruses Angels Animals Anthony Antilles Antiques 
    Apache Apaches Apocrypha ... 
PDT: pre-determiner 
    all both half many quite such sure this 
POS: genitive marker 
    ' 's 
PRP: pronoun, personal 
    hers herself him himself hisself it itself me myself one oneself ours 
    ourselves ownself self she thee theirs them themselves they thou thy us 
PRP$: pronoun, possessive 
    her his mine my our ours their thy your 
RB: adverb 
    occasionally unabatingly maddeningly adventurously professedly 
    stirringly prominently technologically magisterially predominately 
    swiftly fiscally pitilessly ... 
RBR: adverb, comparative 
    further gloomier grander graver greater grimmer harder harsher 
    healthier heavier higher however larger later leaner lengthier less- 
    perfectly lesser lonelier longer louder lower more ... 
RBS: adverb, superlative 
    best biggest bluntest earliest farthest first furthest hardest 
    heartiest highest largest least less most nearest second tightest worst 
RP: particle 
    aboard about across along apart around aside at away back before behind 
    by crop down ever fast for forth from go high i.e. in into just later 
    low more off on open out over per pie raising start teeth that through 
    under unto up up-pp upon whole with you 
SYM: symbol 
    % & ' '' ''.)). * + ,. <=> @ A[fj] U.S U.S.S.R * ** *** 
TO: "to" as preposition or infinitive marker 
    to 
UH: interjection 
    Goodbye Goody Gosh Wow Jeepers Jee-sus Hubba Hey Kee-reist Oops amen 
    huh howdy uh dammit whammo shucks heck anyways whodunnit honey golly 
    man baby diddle hush sonuvabitch ... 
VB: verb, base form 
    ask assemble assess assign assume atone attention avoid bake balkanize 
    bank begin behold believe bend benefit bevel beware bless boil bomb 
    boost brace break bring broil brush build ... 
VBD: verb, past tense 
    dipped pleaded swiped regummed soaked tidied convened halted registered 
    cushioned exacted snubbed strode aimed adopted belied figgered 
    speculated wore appreciated contemplated ... 
VBG: verb, present participle or gerund 
    telegraphing stirring focusing angering judging stalling lactating 
    hankerin' alleging veering capping approaching traveling besieging 
    encrypting interrupting erasing wincing ... 
VBN: verb, past participle 
    multihulled dilapidated aerosolized chaired languished panelized used 
    experimented flourished imitated reunifed factored condensed sheared 
    unsettled primed dubbed desired ... 
VBP: verb, present tense, not 3rd person singular 
    predominate wrap resort sue twist spill cure lengthen brush terminate 
    appear tend stray glisten obtain comprise detest tease attract 
    emphasize mold postpone sever return wag ... 
VBZ: verb, present tense, 3rd person singular 
    bases reconstructs marks mixes displeases seals carps weaves snatches 
    slumps stretches authorizes smolders pictures emerges stockpiles 
    seduces fizzes uses bolsters slaps speaks pleads ... 
WDT: WH-determiner 
    that what whatever which whichever 
WP: WH-pronoun 
    that what whatever whatsoever which who whom whosoever 
WP$: WH-pronoun, possessive 
    whose 
WRB: Wh-adverb 
    how however whence whenever where whereby whereever wherein whereof why 
+0

Could you cite the source? –

+0

What about punctuation? For example, the ',' token gets the PoS ','. Is there a list that includes these PoS tags? –

+0

What about the PoS "-LRB-" for the '(' token? –

13

Just in case you wanted to code it...

/** 
* Represents the English parts-of-speech, encoded using the 
* de facto <a href="http://www.cis.upenn.edu/~treebank/">Penn Treebank 
* Project</a> standard. 
* 
* @see <a href="ftp://ftp.cis.upenn.edu/pub/treebank/doc/tagguide.ps.gz">Penn Treebank Specification</a> 
*/ 
public enum PartOfSpeech { 
    ADJECTIVE("JJ"), 
    ADJECTIVE_COMPARATIVE(ADJECTIVE + "R"), 
    ADJECTIVE_SUPERLATIVE(ADJECTIVE + "S"), 

    /* This category includes most words that end in -ly as well as degree
    * words like quite, too and very, posthead modifiers like enough and
    * indeed (as in good enough, very well indeed), and negative markers like
    * not, n't and never.
    */
    ADVERB("RB"), 

    /* Adverbs with the comparative ending -er but without a strictly comparative 
    * meaning, like <i>later</i> in <i>We can always come by later</i>, should 
    * simply be tagged as RB. 
    */ 
    ADVERB_COMPARATIVE(ADVERB + "R"), 
    ADVERB_SUPERLATIVE(ADVERB + "S"), 

    /* This category includes how, where, why, etc. 
    */ 
    ADVERB_WH("W" + ADVERB), 

    /* This category includes and, but, nor, or, yet (as in Yet it's cheap,
    * cheap yet good), as well as the mathematical operators plus, minus, less,
    * times (in the sense of "multiplied by") and over (in the sense of "divided
    * by"), when they are spelled out. <i>For</i> in the sense of "because" is
    * a coordinating conjunction (CC) rather than a subordinating conjunction.
    */
    CONJUNCTION_COORDINATING("CC"), 
    CONJUNCTION_SUBORDINATING("IN"), 
    CARDINAL_NUMBER("CD"), 
    DETERMINER("DT"), 

    /* This category includes which, as well as that when it is used as a 
    * relative pronoun. 
    */ 
    DETERMINER_WH("W" + DETERMINER), 
    EXISTENTIAL_THERE("EX"), 
    FOREIGN_WORD("FW"), 

    LIST_ITEM_MARKER("LS"), 

    NOUN("NN"), 
    NOUN_PLURAL(NOUN + "S"), 
    NOUN_PROPER_SINGULAR(NOUN + "P"), 
    NOUN_PROPER_PLURAL(NOUN + "PS"), 

    PREDETERMINER("PDT"), 
    POSSESSIVE_ENDING("POS"), 

    PRONOUN_PERSONAL("PRP"), 
    PRONOUN_POSSESSIVE("PRP$"), 

    /* This category includes the wh-word whose. 
    */ 
    PRONOUN_POSSESSIVE_WH("WP$"), 

    /* This category includes what, who and whom. 
    */ 
    PRONOUN_WH("WP"), 

    PARTICLE("RP"), 

    /* This tag should be used for mathematical, scientific and technical symbols
    * or expressions that aren't English words. It should not be used for any and
    * all technical expressions. For instance, the names of chemicals, units of
    * measurements (including abbreviations thereof) and the like should be
    * tagged as nouns.
    */
    SYMBOL("SYM"), 
    TO("TO"), 

    /* This category includes my (as in My, what a gorgeous day), oh, please,
    * see (as in See, it's like this), uh, well and yes, among others.
    */
    INTERJECTION("UH"), 

    VERB("VB"), 
    VERB_PAST_TENSE(VERB + "D"), 
    VERB_PARTICIPLE_PRESENT(VERB + "G"), 
    VERB_PARTICIPLE_PAST(VERB + "N"), 
    VERB_SINGULAR_PRESENT_NONTHIRD_PERSON(VERB + "P"), 
    VERB_SINGULAR_PRESENT_THIRD_PERSON(VERB + "Z"), 

    /* This category includes all verbs that don't take an -s ending in the 
    * third person singular present: can, could, (dare), may, might, must, 
    * ought, shall, should, will, would. 
    */ 
    VERB_MODAL("MD"), 

    /* Stanford. 
    */ 
    SENTENCE_TERMINATOR("."); 

    private final String tag;

    private PartOfSpeech(String tag) {
        this.tag = tag;
    }

    /**
     * Returns the encoding for this part-of-speech.
     *
     * @return A string representing a Penn Treebank encoding for an English
     * part-of-speech.
     */
    @Override
    public String toString() {
        return getTag();
    }

    protected String getTag() {
        return this.tag;
    }

    /**
     * Looks up the constant whose tag equals the given string.
     *
     * @throws IllegalArgumentException if no constant uses that tag
     */
    public static PartOfSpeech get(String value) {
        for (PartOfSpeech v : values()) {
            if (value.equals(v.getTag())) {
                return v;
            }
        }

        throw new IllegalArgumentException("Unknown part of speech: '" + value + "'.");
    }
} 
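The `get` method above scans all constants on every call. If that ever matters, one possible variation is a static lookup map built once. The sketch below is a trimmed-down illustration with only four constants; `MiniPartOfSpeech` and `fromTag` are hypothetical names and not part of the enum above, and returning an `Optional` instead of throwing is a design choice of this sketch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

enum MiniPartOfSpeech {
    ADJECTIVE("JJ"),
    NOUN_PLURAL("NNS"),
    VERB_PRESENT_NON_THIRD_PERSON("VBP"),
    ADVERB("RB");

    // Built once, after all constants exist; maps tag string to constant.
    private static final Map<String, MiniPartOfSpeech> BY_TAG = new HashMap<>();
    static {
        for (MiniPartOfSpeech pos : values()) {
            BY_TAG.put(pos.tag, pos);
        }
    }

    private final String tag;

    MiniPartOfSpeech(String tag) {
        this.tag = tag;
    }

    String getTag() {
        return tag;
    }

    // Returns the matching constant, or an empty Optional for unknown tags.
    static Optional<MiniPartOfSpeech> fromTag(String tag) {
        return Optional.ofNullable(BY_TAG.get(tag));
    }
}
```

A caller could then write `MiniPartOfSpeech.fromTag("NNS").isPresent()` to test for a known tag without catching an exception.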
31

The accepted answer above is missing the following information.

There are also 9 punctuation tags defined (they are not listed in some references, see here). These are:

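  1. #
  2. $
  3. '' (used for all forms of closing quote)
  4. ( (used for all forms of opening parenthesis)
  5. ) (used for all forms of closing parenthesis)
  6. ,
  7. . (used for all sentence-ending punctuation)
  8. : (used for colons, semicolons and ellipses)
  9. `` (used for all forms of opening quote)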
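When walking over tagged output, it can help to treat these tags separately from the word tags. The following is a minimal sketch under a few assumptions: `PunctuationTags` and its methods are invented names, the set holds the nine Penn Treebank punctuation tags, and it additionally includes `-LRB-` (mentioned in a comment in this thread as the tag reported for `(`) and its counterpart `-RRB-`, which is assumed here to be the tag reported for `)`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class PunctuationTags {
    // Nine Penn Treebank punctuation tags, plus the bracket tags -LRB- / -RRB-
    // (the latter two are an assumption about the tagger's output format).
    static final Set<String> TAGS = new HashSet<>(Arrays.asList(
            "#", "$", "''", "(", ")", ",", ".", ":", "``",
            "-LRB-", "-RRB-"));

    static boolean isPunctuation(String tag) {
        return TAGS.contains(tag);
    }

    // Keeps only the words whose tag is not a punctuation tag.
    static List<String> contentWords(String tagged) {
        List<String> words = new ArrayList<>();
        for (String token : tagged.trim().split("\\s+")) {
            int slash = token.lastIndexOf('/');
            if (slash <= 0) {
                continue; // token without a tag
            }
            if (!isPunctuation(token.substring(slash + 1))) {
                words.add(token.substring(0, slash));
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(contentWords(
                "Colorless/JJ green/JJ ideas/NNS sleep/VBP furiously/RB ./."));
    }
}
```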
+0

Updated the link to a new source – Jules

16

Here is a more complete list of tags for the Penn Treebank (posted here for completeness):

http://www.surdeanu.info/mihai/teaching/ista555-fall13/readings/PennTreebankConstituents.html

It also includes tags for clause and phrase levels.

Clause Level

- S 
- SBAR 
- SBARQ 
- SINV 
- SQ 

Phrase Level

- ADJP 
- ADVP 
- CONJP 
- FRAG 
- INTJ 
- LST 
- NAC 
- NP 
- NX 
- PP 
- PRN 
- PRT 
- QP 
- RRC 
- UCP 
- VP 
- WHADJP 
- WHADVP
- WHNP 
- WHPP 
- X 

(descriptions are in the link)
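These labels show up in bracketed parse trees rather than in the flat `word/TAG` output. As a sketch of how they can be told apart from word-level tags, the code below pulls every label out of a bracketed tree string and keeps the clause-level and phrase-level ones. `ConstituentLabels` is a made-up name, the tree in `main` is hand-written for illustration and not actual parser output, and labels with function suffixes (such as `NP-SBJ`) are not handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ConstituentLabels {
    static final Set<String> CLAUSE_LEVEL = new HashSet<>(Arrays.asList(
            "S", "SBAR", "SBARQ", "SINV", "SQ"));

    // The wh-adverb phrase label is spelled WHADVP in the Penn Treebank.
    static final Set<String> PHRASE_LEVEL = new HashSet<>(Arrays.asList(
            "ADJP", "ADVP", "CONJP", "FRAG", "INTJ", "LST", "NAC", "NP", "NX",
            "PP", "PRN", "PRT", "QP", "RRC", "UCP", "VP", "WHADJP", "WHADVP",
            "WHNP", "WHPP", "X"));

    // A label is whatever follows an opening parenthesis up to the next
    // whitespace or parenthesis.
    private static final Pattern LABEL = Pattern.compile("\\(([^\\s()]+)");

    // Returns clause- and phrase-level labels in order of appearance.
    static List<String> constituents(String tree) {
        List<String> found = new ArrayList<>();
        Matcher matcher = LABEL.matcher(tree);
        while (matcher.find()) {
            String label = matcher.group(1);
            if (CLAUSE_LEVEL.contains(label) || PHRASE_LEVEL.contains(label)) {
                found.add(label);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String tree = "(ROOT (S (NP (JJ Colorless) (JJ green) (NNS ideas)) "
                + "(VP (VBP sleep) (ADVP (RB furiously))) (. .)))";
        System.out.println(constituents(tree));
    }
}
```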

+2

This is the true list that people need. Not just the Penn Treebank POS tags, since those only cover words. –

5

I am providing the whole list here and also giving the reference link.

1. CC Coordinating conjunction 
2. CD Cardinal number 
3. DT Determiner 
4. EX Existential there 
5. FW Foreign word 
6. IN Preposition or subordinating conjunction 
7. JJ Adjective 
8. JJR Adjective, comparative 
9. JJS Adjective, superlative 
10. LS List item marker 
11. MD Modal 
12. NN Noun, singular or mass 
13. NNS Noun, plural 
14. NNP Proper noun, singular 
15. NNPS Proper noun, plural 
16. PDT Predeterminer 
17. POS Possessive ending 
18. PRP Personal pronoun 
19. PRP$ Possessive pronoun 
20. RB Adverb 
21. RBR Adverb, comparative 
22. RBS Adverb, superlative 
23. RP Particle 
24. SYM Symbol 
25. TO to 
26. UH Interjection 
27. VB Verb, base form 
28. VBD Verb, past tense 
29. VBG Verb, gerund or present participle 
30. VBN Verb, past participle 
31. VBP Verb, non-3rd person singular present 
32. VBZ Verb, 3rd person singular present 
33. WDT Wh-determiner 
34. WP Wh-pronoun 
35. WP$ Possessive wh-pronoun 
36. WRB Wh-adverb 

The list of part-of-speech tags is here.

4

Here is some sample code you can follow.

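import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.ling.tokensregex.TokenSequenceMatcher;
import edu.stanford.nlp.ling.tokensregex.TokenSequencePattern;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

public class NounExtractor {
    public static void main(String[] args) {
        Properties properties = new Properties();
        // Only tokenize, ssplit and pos are needed for the pattern below.
        properties.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(properties);

        String input = "Colorless green ideas sleep furiously.";
        Annotation annotation = pipeline.process(input);
        List<CoreMap> sentences = annotation.get(CoreAnnotations.SentencesAnnotation.class);
        List<String> output = new ArrayList<>();
        // Matches a single token whose POS tag is one of the noun tags.
        String regex = "([{pos:/NN|NNS|NNP|NNPS/}])";
        TokenSequencePattern pattern = TokenSequencePattern.compile(regex);
        for (CoreMap sentence : sentences) {
            List<CoreLabel> tokens = sentence.get(CoreAnnotations.TokensAnnotation.class);
            TokenSequenceMatcher matcher = pattern.getMatcher(tokens);
            while (matcher.find()) {
                output.add(matcher.group());
            }
        }
        System.out.println("Input: " + input);
        System.out.println("Output: " + output);
    }
}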

The output will look like this:

Input: Colorless green ideas sleep furiously. 
Output: [ideas]