2016-05-11 19 views

R - Substring in the middle of a string

I want to extract a substring from the middle of a string. The string is a field in a data frame, which looks like this:

Common.name price description 
Animal 1 $50 Field Collected\nRoughly 2-3 Inches In Length\nVibrant Red Coloration\nWill Do Fine In Groups\nFeeding On Various Vegetation & Fruits\nSizes Range From 1-2.5 Feet In Total Length\nField Collected\nSizes Vary From Juvenile ... 
Animal 2 $40 Captive Bred\nApproximately 10-12 Inches In Length\nBeautiful Hybrid to Add To Your Collection\nInsane Patterning And Beautiful Vibrant Colors\nMales And Females... 
... ... ... ... 
Animal 500 $29 Field Collected Beauties\nMales And Females Available\nApproximate Length: no bigger than a penny\nAmazingly Friendly! Make Great Pets!\nOnly Reach About 9 Inches At Most!\nFeeding On Vitamin Dusted Greens And ... 

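I want to extract the length and the unit from each description as new fields. For Animal 1, that would be 2-3 in one field and Inches in the other. For Animal 500, the length would be "no bigger than a penny" and the unit field would be NA.

How can I do this in R?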

You do realize that descriptions like "no bigger than an eyelash, or the tail of a juvenile pygmy marmoset" make this nearly impossible to extract? – cory

Haha, yes, I was hoping someone had a trick up their sleeve. Most of them are actual numbers, though, so in the worst case I will fill in the ones without numbers by hand. –

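This would be easy if every description contained something like 'Length: 5'. Instead they use 'Approximately' and 'Roughly', so you would have to write out all of those cases. If those are common, you could use 'regexpr' to get the character position where the identifier might be. – giraffehere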
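The `regexpr` idea from the comment above can be sketched as follows. This is only an illustration: the `desc` vector is a hypothetical stand-in for the description column, shortened from the sample rows, and it assumes the `\n` separators are a literal backslash followed by `n`.

```r
# Hypothetical description strings, shortened from the question's sample rows.
desc <- c(
  "Field Collected\\nRoughly 2-3 Inches In Length\\nVibrant Red Coloration",
  "Only Reach About 9 Inches At Most!"
)

# regexpr() returns the character position of the first match in each string,
# or -1 when the string does not contain the identifier at all.
pos <- regexpr("Length", desc, fixed = TRUE)

# Strings that mention "Length" somewhere.
has_length <- pos > 0

# Text from the identifier onwards, for the strings that have one.
tail_text <- substring(desc[has_length], pos[has_length])
```

The position only says where the identifier is; the number and unit around it still have to be pulled out with a pattern.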

Answers

Description

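This regular expression does the following:

  • finds lines that start with Animal followed by a number
  • captures the animal number
  • matches the first field in the line that contains the word Length somewhere in it
    • if the length is expressed as a number
      • captures a single number such as 234, or a range of numbers such as 3-342
      • assumes that the string after the number is the unit of measure
    • if the length is expressed as some odd text
      • captures everything after the :
      • leaves UnitOfMeasure as null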
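The steps above can be sketched in base R. Note that this is a simpler, swapped-in approach (split the description into fields first, then match each case separately) rather than the single regular expression given in this answer. The helper name `extract_length`, the `desc` vector and the `animals` data frame are hypothetical, and the code assumes the separators are a literal backslash followed by `n`.

```r
# Hypothetical helper (not from the original answer): returns the length and
# unit found in one description string, or NA where nothing can be extracted.
extract_length <- function(x) {
  empty <- c(length = NA_character_, unit = NA_character_)

  # Assumption: fields are separated by a literal backslash followed by "n".
  parts <- strsplit(x, "\\n", fixed = TRUE)[[1]]

  # First field that contains the word "Length".
  field <- grep("Length", parts, fixed = TRUE, value = TRUE)[1]
  if (is.na(field)) return(empty)

  # Numeric case: a single number or a range, followed by the unit word.
  num <- "([0-9]+\\s*(?:-\\s*[0-9]+)?)\\s+(\\S+)"
  m <- regmatches(field, regexec(num, field, perl = TRUE))[[1]]
  if (length(m) == 3) return(c(length = m[2], unit = m[3]))

  # Free-text case: keep everything after "Length:", leave the unit as NA.
  if (grepl("Length:", field, fixed = TRUE)) {
    empty["length"] <- sub("^.*Length:\\s*", "", field, perl = TRUE)
  }
  empty
}

# Descriptions shortened from the sample rows (Animal 1, Animal 3, Animal 500).
desc <- c(
  "Field Collected\\nRoughly 2-3 Inches In Length\\nVibrant Red Coloration",
  "Captive Bred\\nApproximately 10 Inches In Length\\nBeautiful Hybrid",
  "Males And Females Available\\nApproximate Length: no bigger than a penny\\nOnly Reach About 9 Inches At Most!"
)

# One row per description, with the two new fields added as columns.
res <- do.call(rbind, lapply(desc, extract_length))
animals <- data.frame(description = desc,
                      Length = res[, "length"],
                      Unit = res[, "unit"],
                      stringsAsFactors = FALSE)
```

Splitting first keeps each pattern short, at the cost of one function call per row. Fields that mention Length with neither a number nor a colon are left as NA.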

Regular expression

    ^(?<Animal>Animal\s[0-9]+)\s+\S+\s+(?:(?:(?!\\n|$).)*\\n)*?(?=(?:(?!\\n).)*Length)(?:(?:(?!\\n).)*?(?<Length>[0-9]+\s*(?:-\s*[0-9]+)?)\s+(?<UnitOfMeasure>\S+)|(?:(?!\\n).)*?Length:\s*(?<Length>(?:(?!\\n).)*))?

    Regular expression visualization

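Notes

  • I used the following flags: multiline, global, and allow duplicate subpattern names.
  • I could not tell whether the \n strings in the source text are literally \n or stand for newline characters. This regular expression was therefore built on the assumption of a literal \ character followed by an n character. If you meant them to represent newline characters, change every \\n in the regular expression to \n.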
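To find out which of the two cases applies to your data, a quick check in R might look like this. It is a minimal sketch; `x` is a hypothetical single description value, shortened from the sample input.

```r
# Hypothetical description value; "\\n" in an R string literal is a backslash
# followed by the letter n, not a newline character.
x <- "Captive Bred\\nApproximately 10 Inches In Length"

# Does the text contain a literal backslash-n, a real newline, or both?
has_literal <- grepl("\\n", x, fixed = TRUE)
has_newline <- grepl("\n", x, fixed = TRUE)

# If real newlines are easier to work with, convert the literal sequences.
y <- gsub("\\n", "\n", x, fixed = TRUE)
fields <- strsplit(y, "\n", fixed = TRUE)[[1]]
```

Using `fixed = TRUE` avoids having to double the escaping again for the regex engine, which is where most confusion between the two forms comes from.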

Live example

    https://regex101.com/r/nL1fW1/2

Sample input text

    Common.name price description 
    Animal 1 $50 Field Collected\nRoughly 2-3 Inches In Length\nVibrant Red Coloration\nWill Do Fine In Groups\nFeeding On Various Vegetation & Fruits\nSizes Range From 1-2.5 Feet In Total Length\nField Collected\nSizes Vary From Juvenile ... 
    Animal 2 $40 Captive Bred\nApproximately 10-12 Inches In Length\nBeautiful Hybrid to Add To Your Collection\nInsane Patterning And Beautiful Vibrant Colors\nMales And Females... 
    Animal 3 $40 Captive Bred\nApproximately 10 Inches In Length\nBeautiful Hybrid to Add To Your Collection\nInsane Patterning And Beautiful Vibrant Colors\nMales And Females... 
    ... ... ... ... 
    Animal 500 $29 Field Collected Beauties\nMales And Females Available\nApproximate Length: no bigger than a penny\nAmazingly Friendly! Make Great Pets!\nOnly Reach About 9 Inches At Most!\nFeeding On Vitamin Dusted Greens And ... 
    

Sample matches

    [0][0] = Animal 1 $50 Field Collected\nRoughly 2-3 Inches 
    [0][Animal] = Animal 1 
    [0][Length] = 2-3 
    [0][UnitOfMeasure] = Inches 
    
    [1][0] = Animal 2 $40 Captive Bred\nApproximately 10-12 Inches 
    [1][Animal] = Animal 2 
    [1][Length] = 10-12 
    [1][UnitOfMeasure] = Inches 
    
    [2][0] = Animal 3 $40 Captive Bred\nApproximately 10 Inches 
    [2][Animal] = Animal 3 
    [2][Length] = 10 
    [2][UnitOfMeasure] = Inches 
    
    [3][0] = Animal 500 $29 Field Collected Beauties\nMales And Females Available\nApproximate Length: no bigger than a penny 
    [3][Animal] = Animal 500 
    [3][UnitOfMeasure] = 
    [3][Length] = no bigger than a penny 
    

Explanation

This was copied from the explanation field in the live link above.

    ^ assert position at start of a line 
    (?<Animal>Animal\s[0-9]+) Named capturing group Animal 
    Animal matches the characters Animal literally (case sensitive) 
    \s match any white space character [\r\n\t\f ] 
    [0-9]+ match a single character present in the list below 
    Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy] 
    0-9 a single character in the range between 0 and 9 
    \s+ match any white space character [\r\n\t\f ] 
    Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy] 
    \S+ match any non-white space character [^\r\n\t\f ] 
    Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy] 
    \s+ match any white space character [\r\n\t\f ] 
    Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy] 
    (?:(?:(?!\\n|$).)*\\n)*? Non-capturing group 
    Quantifier: *? Between zero and unlimited times, as few times as possible, expanding as needed [lazy] 
    (?:(?!\\n|$).)* Non-capturing group 
    Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy] 
    (?!\\n|$) Negative Lookahead - Assert that it is impossible to match the regex below 
    1st Alternative: \\n 
    \\ matches the character \ literally 
    n matches the character n literally (case sensitive) 
    2nd Alternative: $ 
    $ assert position at end of a line 
    . matches any character (except newline) 
    \\ matches the character \ literally 
    n matches the character n literally (case sensitive) 
    (?=(?:(?!\\n).)*Length) Positive Lookahead - Assert that the regex below can be matched 
    (?:(?!\\n).)* Non-capturing group 
    Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy] 
    (?!\\n) Negative Lookahead - Assert that it is impossible to match the regex below 
    \\ matches the character \ literally 
    n matches the character n literally (case sensitive) 
    . matches any character (except newline) 
    Length matches the characters Length literally (case sensitive) 
    (?:(?:(?!\\n).)*?(?<Length>[0-9]+\s*(?:-\s*[0-9]+)?)\s+(?<UnitOfMeasure>\S+)|(?:(?!\\n).)*?Length:\s*(?<Length>(?:(?!\\n).)*))? Non-capturing group 
    Quantifier: ? Between zero and one time, as many times as possible, giving back as needed [greedy] 
    1st Alternative: (?:(?!\\n).)*?(?<Length>[0-9]+\s*(?:-\s*[0-9]+)?)\s+(?<UnitOfMeasure>\S+) 
    (?:(?!\\n).)*? Non-capturing group 
    Quantifier: *? Between zero and unlimited times, as few times as possible, expanding as needed [lazy] 
    (?!\\n) Negative Lookahead - Assert that it is impossible to match the regex below 
    \\ matches the character \ literally 
    n matches the character n literally (case sensitive) 
    . matches any character (except newline) 
    (?<Length>[0-9]+\s*(?:-\s*[0-9]+)?) Named capturing group Length 
    [0-9]+ match a single character present in the list below 
    Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy] 
    0-9 a single character in the range between 0 and 9 
    \s* match any white space character [\r\n\t\f ] 
    Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy] 
    (?:-\s*[0-9]+)? Non-capturing group 
    - matches the character - literally 
    \s* match any white space character [\r\n\t\f ] 
    Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy] 
    [0-9]+ match a single character present in the list below 
    Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy] 
    0-9 a single character in the range between 0 and 9 
    \s+ match any white space character [\r\n\t\f ] 
    Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy] 
    (?<UnitOfMeasure>\S+) Named capturing group UnitOfMeasure 
    \S+ match any non-white space character [^\r\n\t\f ] 
    Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy] 
    2nd Alternative: (?:(?!\\n).)*?Length:\s*(?<Length>(?:(?!\\n).)*) 
    (?:(?!\\n).)*? Non-capturing group 
    Quantifier: *? Between zero and unlimited times, as few times as possible, expanding as needed [lazy] 
    (?!\\n) Negative Lookahead - Assert that it is impossible to match the regex below 
    \\ matches the character \ literally 
    n matches the character n literally (case sensitive) 
    . matches any character (except newline) 
    Length: matches the characters Length: literally (case sensitive) 
    \s* match any white space character [\r\n\t\f ] 
    Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy] 
    (?<Length>(?:(?!\\n).)*) Named capturing group Length 
    (?:(?!\\n).)* Non-capturing group 
    Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy] 
    (?!\\n) Negative Lookahead - Assert that it is impossible to match the regex below 
    \\ matches the character \ literally 
    n matches the character n literally (case sensitive) 
    . matches any character (except newline) 
    m modifier: multi-line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string) 
    g modifier: global. All matches (don't return on first match) 
    J modifier: Allow duplicate subpattern names 
    
