2017-05-16 8 views

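pandas dataframe: insert values according to the range of another column's values

I have a DataFrame like the one below and want to insert a string according to the value in the sic2 column.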

 conm   sic2 
115466 ALLEGION PLC 34.0 
115471 AGILITY HEALTH INC 80.0 
115473 NORDIC AMERICAN OFFSHORE 44.0 
115474 AAD    54.0 
115477 DORIAN LPG LTD 44.0 
115484 NOMAD FOODS LTD 20.0 
115486 ATHENE HOLDING LTD 63.0 
115490 MIDATECH PHARMA PLC 28.0 
115495 MOTIF BIO PLC 28.0 

The sic2 number ranges map to strings as follows.

1-9 Agriculture, Forestry and Fishing 
10-14 Mining 
15-17 Construction 
18-19 not used 
20-39 Manufacturing 
40-49 Transportation, Communications, Electric, Gas and Sanitary service 
50-51 Wholesale Trade 
52-59 Retail Trade 
60-67 Finance, Insurance and Real Estate 
70-89 Services 
91-97 Public Administration 
99-99 Nonclassifiable 
0 -1 Agricultural Production-Crops 

How can I apply this mapping to an entire large dataset in a pandas DataFrame?

I have tried several conditional approaches, but they keep failing. The desired output looks like this:

 conm   sic2    industry 
115466 ALLEGION PLC 34.0    Manufacturing 
115471 AGILITY HEALTH INC 80.0   Services 
115473 NORDIC AMERICAN OFFSHORE 44.0 Transportation, Communications, Electric, Gas and Sanitary service 
115474 AAD    54.0    Retail Trade 

Answers

2

If you turn the SIC number ranges into a dictionary, looking up the industry for each code is fairly simple:

Code:

sic = [x.strip().split(' ', 1) for x in """ 
    1-9 Agriculture, Forestry and Fishing 
    10-14 Mining 
    15-17 Construction 
    18-19 not used 
    20-39 Manufacturing 
    40-49 Transportation, Communications, ... 
    50-51 Wholesale Trade 
    52-59 Retail Trade 
    60-67 Finance, Insurance and Real Estate 
    70-89 Services 
    91-97 Public Administration 
    99-99 Nonclassifiable 
""".split('\n')[1:-1]] 

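# Expand each inclusive "start-end" range into one dict entry per code.
# range() excludes its stop value, so add 1 to keep the upper bound
# (otherwise codes such as 39 or 99 would be missing).
sic_dict = {}
for v, z in sic:
    start, end = (int(y) for y in v.split('-'))
    for code in range(start, end + 1):
        sic_dict[code] = z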

Test code:

import pandas as pd
from io import StringIO

df = pd.read_fwf(StringIO(u""" 
    number conm      sic2 
    115466 ALLEGION PLC    34.0 
    115471 AGILITY HEALTH INC  80.0 
    115473 NORDIC AMERICAN OFFSHORE 44.0 
    115474 AAD      54.0 
    115477 DORIAN LPG LTD   44.0 
    115484 NOMAD FOODS LTD   20.0 
    115486 ATHENE HOLDING LTD  63.0 
    115490 MIDATECH PHARMA PLC  28.0 
    115495 MOTIF BIO PLC    28.0"""), header=1) 

df['industry'] = df.sic2.apply(lambda x: sic_dict[int(x)]) 

print(df) 

Results:

number      conm sic2        industry 
0 115466    ALLEGION PLC 34.0      Manufacturing 
1 115471  AGILITY HEALTH INC 80.0        Services 
2 115473 NORDIC AMERICAN OFFSHORE 44.0 Transportation, Communications, ... 
3 115474      AAD 54.0       Retail Trade 
4 115477   DORIAN LPG LTD 44.0 Transportation, Communications, ... 
5 115484   NOMAD FOODS LTD 20.0      Manufacturing 
6 115486  ATHENE HOLDING LTD 63.0 Finance, Insurance and Real Estate 
7 115490  MIDATECH PHARMA PLC 28.0      Manufacturing 
8 115495    MOTIF BIO PLC 28.0      Manufacturing 
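
As a possible variation on the lookup above, here is a minimal sketch that uses `Series.map` with the dictionary instead of `apply` with a lambda. With a plain dict, `map` yields NaN for codes that are not in the dictionary (for example 68, which falls in a gap between the ranges in the question) rather than raising `KeyError`. The `ranges` list and the "HYPOTHETICAL CO" row are made up for illustration, and the sketch assumes `sic2` has no missing values, since `astype(int)` would fail on NaN.

```python
import pandas as pd

# Inclusive (start, end, label) ranges, copied from the question.
ranges = [
    (1, 9, "Agriculture, Forestry and Fishing"),
    (10, 14, "Mining"),
    (15, 17, "Construction"),
    (18, 19, "not used"),
    (20, 39, "Manufacturing"),
    (40, 49, "Transportation, Communications, Electric, Gas and Sanitary service"),
    (50, 51, "Wholesale Trade"),
    (52, 59, "Retail Trade"),
    (60, 67, "Finance, Insurance and Real Estate"),
    (70, 89, "Services"),
    (91, 97, "Public Administration"),
    (99, 99, "Nonclassifiable"),
]

# One dict entry per code; end + 1 keeps the upper bound inclusive.
sic_dict = {
    code: label
    for start, end, label in ranges
    for code in range(start, end + 1)
}

df = pd.DataFrame({
    # The last row is hypothetical, added to show an unmapped code.
    "conm": ["ALLEGION PLC", "AGILITY HEALTH INC", "HYPOTHETICAL CO"],
    "sic2": [34.0, 80.0, 68.0],
})

# map() looks up every value in the dict; unmapped codes become NaN.
df["industry"] = df["sic2"].astype(int).map(sic_dict)
print(df)
```

Rows left as NaN can then be inspected or filled separately, which may be easier to debug on a large dataset than an exception raised partway through.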
0
#Save your mapping table to a data frame 

df2 = pd.DataFrame({'id_end': {0: 9, 1: 14, 2: 17, 3: 19, 4: 39, 5: 49, 6: 51, 7: 59, 8: 67, 9: 89, 10: 97, 11: 99, 12: 1}, 
'id_start': {0: 1, 1: 10, 2: 15, 3: 18, 4: 20, 5: 40, 6: 50, 7: 52, 8: 60, 9: 70, 10: 91, 11: 99, 12: 0}, 
'industry': {0: 'Agriculture, Forestry and Fishing', 1: 'Mining', 2: 'Construction', 3: 'not used', 4: 'Manufacturing', 
    5: 'Transportation, Communications, Electric, Gas and Sanitary service', 
    6: 'Wholesale Trade', 7: 'Retail Trade', 8: 'Finance, Insurance and Real Estate', 9: 'Services', 
    10: 'Public Administration', 11: 'Nonclassifiable', 12: 'Agricultural Production Crops'}}) 

df2 = df2.sort_values(by='id_end') 

Out[354]: 
    id_end id_start           industry 
12  1   0      Agricultural Production Crops 
0  9   1     Agriculture, Forestry and Fishing 
1  14  10            Mining 
2  17  15          Construction 
3  19  18           not used 
4  39  20          Manufacturing 
5  49  40 Transportation, Communications, Electric, Gas ... 
6  51  50         Wholesale Trade 
7  59  52          Retail Trade 
8  67  60     Finance, Insurance and Real Estate 
9  89  70           Services 
10  97  91        Public Administration 
11  99  99         Nonclassifiable 

#Map sic2 number to industry names 
# np.int was removed in NumPy 1.24; the builtin int does the same job here
df['industry'] = df['sic2'].astype(int).apply(lambda x: df2.loc[df2.id_end>=x,'industry'].iloc[0]) 


Out[352]: 
          conm sic2            industry 
115466    ALLEGION PLC 34.0          Manufacturing 
115471  AGILITY HEALTH INC 80.0            Services 
115473 NORDIC AMERICAN OFFSHORE 44.0 Transportation, Communications, Electric, Gas ... 
115474      AAD 54.0           Retail Trade 
115477   DORIAN LPG LTD 44.0 Transportation, Communications, Electric, Gas ... 
115484   NOMAD FOODS LTD 20.0          Manufacturing 
115486  ATHENE HOLDING LTD 63.0     Finance, Insurance and Real Estate 
115490  MIDATECH PHARMA PLC 28.0          Manufacturing 
115495    MOTIF BIO PLC 28.0          Manufacturing 
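
The row-wise `apply` above filters `df2` once per row. As a rough sketch of a vectorized take on the same idea (find the first range whose `id_end` is at least the code), `numpy.searchsorted` can locate all rows in one call. This assumes `df2` is sorted by `id_end`. The extra check against `id_start` is an addition not present in the answer above: it returns None for codes that fall in a gap or above the last range instead of assigning the next range. The "HYPOTHETICAL CO" row is invented for illustration.

```python
import numpy as np
import pandas as pd

# Mapping table with the same values as df2 above.
df2 = pd.DataFrame({
    "id_start": [0, 1, 10, 15, 18, 20, 40, 50, 52, 60, 70, 91, 99],
    "id_end": [1, 9, 14, 17, 19, 39, 49, 51, 59, 67, 89, 97, 99],
    "industry": [
        "Agricultural Production Crops",
        "Agriculture, Forestry and Fishing",
        "Mining",
        "Construction",
        "not used",
        "Manufacturing",
        "Transportation, Communications, Electric, Gas and Sanitary service",
        "Wholesale Trade",
        "Retail Trade",
        "Finance, Insurance and Real Estate",
        "Services",
        "Public Administration",
        "Nonclassifiable",
    ],
}).sort_values("id_end").reset_index(drop=True)

df = pd.DataFrame({
    # The last row is hypothetical, added to show a code in a gap.
    "conm": ["ALLEGION PLC", "AGILITY HEALTH INC", "NORDIC AMERICAN OFFSHORE",
             "AAD", "HYPOTHETICAL CO"],
    "sic2": [34.0, 80.0, 44.0, 54.0, 68.0],
})

codes = df["sic2"].astype(int).to_numpy()
ends = df2["id_end"].to_numpy()
starts = df2["id_start"].to_numpy()
names = df2["industry"].to_numpy()

# Position of the first range whose id_end >= code.
pos = np.searchsorted(ends, codes, side="left")
# Clip so codes above the last range do not index out of bounds.
pos_safe = np.minimum(pos, len(df2) - 1)
# A code is valid only if it also lies at or above that range's start.
in_range = (pos < len(df2)) & (starts[pos_safe] <= codes)

df["industry"] = np.where(in_range, names[pos_safe], None)
print(df)
```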