2017-10-17 9 views

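I've recently been working with single-cell RNA-sequencing data, which is 10k-100k samples (cells) x 20k features (genes) of sparse values and also comes with a lot of metadata, e.g. the tissue of origin ("Brain" vs. "Liver"). The metadata is ~10-100 columns, and I store it as a pandas.DataFrame. Right now I'm creating xarray.Dataset objects by dict-ifying the metadata and adding it as coordinates. This seems clunky and error-prone, since I'm copying the snippet between notebooks. Is there an easier way? Can I easily create an xarray Dataset from metadata + values?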

cell_metadata_dict = cell_metadata.to_dict(orient='list') 
coords = {k: ('cell', v) for k, v in cell_metadata_dict.items()} 
coords.update(dict(gene=counts.columns, cell=counts.index)) 

ds = xr.Dataset(
    {'counts': (['cell', 'gene'], counts), 
    }, 
    coords=coords) 

EDIT:

To show some example data, here's cell_metadata.head().to_csv():

cell,Uniquely mapped reads number,Number of input reads,EXP_ID,TAXON,WELL_MAPPING,Lysis Plate Batch,dNTP.batch,oligodT.order.no,plate.type,preparation.site,date.prepared,date.sorted,tissue,subtissue,mouse.id,FACS.selection,nozzle.size,FACS.instument,Experiment ID ,Columns sorted,Double check,Plate,Location ,Comments,mouse.age,mouse.number,mouse.sex 
A1-MAA100140-3_57_F-1-1,428699,502312,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F 
A10-MAA100140-3_57_F-1-1,324428,360285,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F 
A11-MAA100140-3_57_F-1-1,381310,431800,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F 
A12-MAA100140-3_57_F-1-1,393498,446705,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F 
A2-MAA100140-3_57_F-1-1,717,918,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F 

And here's counts.iloc[:5, :20].to_csv():

cell,0610005C13Rik,0610007C21Rik,0610007L01Rik,0610007N19Rik,0610007P08Rik,0610007P14Rik,0610007P22Rik,0610008F07Rik,0610009B14Rik,0610009B22Rik,0610009D07Rik,0610009L18Rik,0610009O20Rik,0610010B08Rik,0610010F05Rik,0610010K14Rik,0610010O12Rik,0610011F06Rik,0610011L14Rik,0610012G03Rik 
A1-MAA100140-3_57_F-1-1,308,289,81,0,4,88,52,0,0,104,65,0,1,0,9,8,12,283,12,37 
A10-MAA100140-3_57_F-1-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 
A11-MAA100140-3_57_F-1-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 
A12-MAA100140-3_57_F-1-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 
A2-MAA100140-3_57_F-1-1,375,325,70,0,2,72,36,13,0,60,105,0,13,0,0,29,15,264,0,65 

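Re: pandas.DataFrame.to_xarray(): this is very slow, and it seems odd to me to encode so much numerical and categorical data as a 100-level MultiIndex. On top of that, every time I've tried to use a MultiIndex it has ended in "oh, that's why I don't use MultiIndex", and I go back to keeping separate metadata and counts DataFrames.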


Can you provide a sample of your DataFrame (`df.head()`) and a more detailed description of your target Dataset or DataArray? Have you tried the pandas to_xarray() method? – jhamman


To add to Joe's comment, see the [pandas](http://xarray.pydata.org/en/stable/pandas.html) section of the xarray docs. If you can set up an appropriate `pandas.MultiIndex` for your data, conversion to xarray is usually straightforward. – shoyer

Answers


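Xarray uses pandas index/column labels as its default metadata. When all variables share the same dimensions you can convert with a single function call, but when different variables have different dimensions you need to convert them from pandas separately and then combine them on the xarray side. For example, converting the counts to a DataArray and attaching the cell metadata as coordinates (the code is listed at the end of this answer) gives: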

<xarray.DataArray (cell: 5, gene: 20)> 
array([[308, 289, 81, 0, 4, 88, 52, 0, 0, 104, 65, 0, 1, 0, 
      9, 8, 12, 283, 12, 37], 
     [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
      0, 0, 0, 0, 0, 0], 
     [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
      0, 0, 0, 0, 0, 0], 
     [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
      0, 0, 0, 0, 0, 0], 
     [375, 325, 70, 0, 2, 72, 36, 13, 0, 60, 105, 0, 13, 0, 
      0, 29, 15, 264, 0, 65]]) 
Coordinates: 
    * cell       (cell) object 'A1-MAA100140-3_57_F-1-1' ... 
    * gene       (gene) object '0610005C13Rik' ... 
    Uniquely mapped reads number (cell) int64 428699 324428 381310 393498 717 
    Number of input reads   (cell) int64 502312 360285 431800 446705 918 
    EXP_ID      (cell) object '170928_A00111_0068_AH3YKKDMXX' ... 
    TAXON       (cell) object 'mus' 'mus' 'mus' 'mus' 'mus' 
    WELL_MAPPING     (cell) object 'MAA100140' 'MAA100140' ... 
    Lysis Plate Batch    (cell) float64 nan nan nan nan nan 
    dNTP.batch     (cell) float64 nan nan nan nan nan 
    oligodT.order.no    (cell) float64 nan nan nan nan nan 
    plate.type     (cell) object 'Biorad 96well' ... 
    preparation.site    (cell) object 'Stanford' 'Stanford' ... 
    date.prepared     (cell) float64 nan nan nan nan nan 
    date.sorted     (cell) int64 170720 170720 170720 170720 ... 
    tissue      (cell) object 'Liver' 'Liver' 'Liver' ... 
    subtissue      (cell) object 'Hepatocytes' 'Hepatocytes' ... 
    mouse.id      (cell) object '3_57_F' '3_57_F' '3_57_F' ... 
    FACS.selection    (cell) float64 nan nan nan nan nan 
    nozzle.size     (cell) float64 nan nan nan nan nan 
    FACS.instument    (cell) float64 nan nan nan nan nan 
    Experiment ID     (cell) float64 nan nan nan nan nan 
    Columns sorted    (cell) float64 nan nan nan nan nan 
    Double check     (cell) float64 nan nan nan nan nan 
    Plate       (cell) float64 nan nan nan nan nan 
    Location      (cell) float64 nan nan nan nan nan 
    Comments      (cell) float64 nan nan nan nan nan 
    mouse.age      (cell) int64 3 3 3 3 3 
    mouse.number     (cell) int64 57 57 57 57 57 
    mouse.sex      (cell) object 'F' 'F' 'F' 'F' 'F' 

If you want a Dataset instead, put the DataArray objects into the Dataset constructor, e.g.:

# shouldn't really need to use .data_vars here, that might be an xarray bug 
>>> xarray.Dataset({'counts': xarray.DataArray(counts.set_index('cell'), 
...           dims=['cell', 'gene'])}, 
...    coords=cell_metadata.set_index('cell').to_xarray().data_vars) 
<xarray.Dataset>
Dimensions:      (cell: 5, gene: 20) 
Coordinates: 
    * cell       (cell) object 'A1-MAA100140-3_57_F-1-1' ... 
    * gene       (gene) object '0610005C13Rik' ... 
    Uniquely mapped reads number (cell) int64 428699 324428 381310 393498 717 
    Number of input reads   (cell) int64 502312 360285 431800 446705 918 
    EXP_ID      (cell) object '170928_A00111_0068_AH3YKKDMXX' ... 
    TAXON       (cell) object 'mus' 'mus' 'mus' 'mus' 'mus' 
    WELL_MAPPING     (cell) object 'MAA100140' 'MAA100140' ... 
    Lysis Plate Batch    (cell) float64 nan nan nan nan nan 
    dNTP.batch     (cell) float64 nan nan nan nan nan 
    oligodT.order.no    (cell) float64 nan nan nan nan nan 
    plate.type     (cell) object 'Biorad 96well' ... 
    preparation.site    (cell) object 'Stanford' 'Stanford' ... 
    date.prepared     (cell) float64 nan nan nan nan nan 
    date.sorted     (cell) int64 170720 170720 170720 170720 ... 
    tissue      (cell) object 'Liver' 'Liver' 'Liver' ... 
    subtissue      (cell) object 'Hepatocytes' 'Hepatocytes' ... 
    mouse.id      (cell) object '3_57_F' '3_57_F' '3_57_F' ... 
    FACS.selection    (cell) float64 nan nan nan nan nan 
    nozzle.size     (cell) float64 nan nan nan nan nan 
    FACS.instument    (cell) float64 nan nan nan nan nan 
    Experiment ID     (cell) float64 nan nan nan nan nan 
    Columns sorted    (cell) float64 nan nan nan nan nan 
    Double check     (cell) float64 nan nan nan nan nan 
    Plate       (cell) float64 nan nan nan nan nan 
    Location      (cell) float64 nan nan nan nan nan 
    Comments      (cell) float64 nan nan nan nan nan 
    mouse.age      (cell) int64 3 3 3 3 3 
    mouse.number     (cell) int64 57 57 57 57 57 
    mouse.sex      (cell) object 'F' 'F' 'F' 'F' 'F' 
Data variables: 
    counts      (cell, gene) int64 308 289 81 0 4 88 52 0 ... 
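
For comparison, here is the simpler case mentioned at the top of this answer, where every variable shares the same dimensions and a single call is enough. This is only a minimal sketch: the long-form table `long_form` and the shortened cell IDs ("A1", "A2") are made up for illustration, with count values copied from the sample data in the question. It assumes xarray is installed, since pandas imports it inside `to_xarray()`.

```python
import pandas as pd

# Toy long-form table: one row per (cell, gene) pair.
# Cell IDs are shortened for readability; counts come from the sample data.
long_form = pd.DataFrame({
    "cell": ["A1", "A1", "A2", "A2"],
    "gene": ["0610005C13Rik", "0610007C21Rik"] * 2,
    "counts": [308, 289, 375, 325],
})

# Each level of the MultiIndex becomes one dimension of the result,
# and each remaining column becomes a data variable on those dimensions.
ds_long = long_form.set_index(["cell", "gene"]).to_xarray()
print(ds_long)
```

This works well when the data really is one value per (cell, gene) pair. Per-cell metadata does not fit that shape, which is why it is converted separately in the rest of this answer.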

# Complete code that produces the DataArray printed at the top of this answer
import pandas as pd 
import io 
import xarray 

# read your data 
cell_metadata = pd.read_csv(io.StringIO(u"""\ 
cell,Uniquely mapped reads number,Number of input reads,EXP_ID,TAXON,WELL_MAPPING,Lysis Plate Batch,dNTP.batch,oligodT.order.no,plate.type,preparation.site,date.prepared,date.sorted,tissue,subtissue,mouse.id,FACS.selection,nozzle.size,FACS.instument,Experiment ID ,Columns sorted,Double check,Plate,Location ,Comments,mouse.age,mouse.number,mouse.sex 
A1-MAA100140-3_57_F-1-1,428699,502312,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F 
A10-MAA100140-3_57_F-1-1,324428,360285,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F 
A11-MAA100140-3_57_F-1-1,381310,431800,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F 
A12-MAA100140-3_57_F-1-1,393498,446705,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F 
A2-MAA100140-3_57_F-1-1,717,918,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F""")) 
counts = pd.read_csv(io.StringIO(u"""\ 
cell,0610005C13Rik,0610007C21Rik,0610007L01Rik,0610007N19Rik,0610007P08Rik,0610007P14Rik,0610007P22Rik,0610008F07Rik,0610009B14Rik,0610009B22Rik,0610009D07Rik,0610009L18Rik,0610009O20Rik,0610010B08Rik,0610010F05Rik,0610010K14Rik,0610010O12Rik,0610011F06Rik,0610011L14Rik,0610012G03Rik 
A1-MAA100140-3_57_F-1-1,308,289,81,0,4,88,52,0,0,104,65,0,1,0,9,8,12,283,12,37 
A10-MAA100140-3_57_F-1-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 
A11-MAA100140-3_57_F-1-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 
A12-MAA100140-3_57_F-1-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 
A2-MAA100140-3_57_F-1-1,375,325,70,0,2,72,36,13,0,60,105,0,13,0,0,29,15,264,0,65""")) 

# build the output 
xarray_counts = xarray.DataArray(counts.set_index('cell'), dims=['cell', 'gene']) 
xarray_counts.coords.update(cell_metadata.set_index('cell').to_xarray()) 
print(xarray_counts) 

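This results in a nice, tidy xarray.DataArray for counts, with the cell metadata attached as coordinates (the output printed at the top of this answer).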
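
Since the question mentions copying the snippet between notebooks, one option is to wrap the conversion in a small function. The sketch below is not part of xarray: `counts_to_dataset` is a hypothetical helper name, and it assumes both DataFrames are already indexed by cell ID (e.g. after `set_index('cell')`) and that no metadata column is itself named "cell" or "gene". The toy values are taken from the sample data above, with shortened cell IDs.

```python
import pandas as pd
import xarray as xr


def counts_to_dataset(counts, cell_metadata, name="counts"):
    """Hypothetical helper: combine a cells x genes table with per-cell metadata.

    Both DataFrames are assumed to be indexed by cell ID.
    """
    # Fail early if some cells have no metadata row.
    missing = counts.index.difference(cell_metadata.index)
    if len(missing) > 0:
        raise ValueError(f"no metadata for cells: {list(missing)}")

    # Put the metadata rows in the same order as the rows of counts.
    metadata = cell_metadata.loc[counts.index]

    # Every metadata column becomes a coordinate along the 'cell' dimension.
    coords = {col: ("cell", metadata[col].to_numpy()) for col in metadata.columns}
    coords["cell"] = counts.index.to_numpy()
    coords["gene"] = counts.columns.to_numpy()

    return xr.Dataset({name: (("cell", "gene"), counts.to_numpy())}, coords=coords)


# Toy data; the metadata rows are deliberately in a different order.
counts = pd.DataFrame(
    [[308, 289], [0, 0]],
    index=["A1", "A10"],
    columns=["0610005C13Rik", "0610007C21Rik"],
)
cell_metadata = pd.DataFrame(
    {"Uniquely mapped reads number": [324428, 428699], "tissue": ["Liver", "Liver"]},
    index=["A10", "A1"],
)

ds = counts_to_dataset(counts, cell_metadata)
print(ds)
```

Aligning the metadata explicitly with `.loc` guards against the two tables being in different row orders, which is an easy mistake to make when the coordinates are built by hand.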
