2017-04-12 16 views

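Thank you for your help. I want to use the Python code below to read and process data from an Affymetrix microarray dataset. My goal is to identify differential gene expression in mononuclear cells between the Crohn's disease and ulcerative colitis disease states. The code runs to completion, but when I inspect the contents of X I get an empty array (array([], dtype=float64)), which is of no use. The dataset is here: https://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS1615 I don't understand why the output is empty and cannot be processed. How can I get the microarray data to show up in the console? Here is the code: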

import gzip 
import numpy as np 

""" 
Read in a SOFT format data file. The following values can be exported: 

GID : A list of gene identifiers of length d 
SID : A list of sample identifiers of length n 
STP : A list of sample descriptions of length d 
X : A dxn array of gene expression values 
""" 
#path to the data file 
fname = "../data/GDS1615_full.soft.gz" 

## Open the data file directly as a gzip file 
with gzip.open(fname) as fid: 
    SIF = {} 
    for line in fid: 
     if line.startswith(line, len("!dataset_table_begin")): 
      break 
     elif line.startswith(line, len("!subject_description")): 
      subset_description = line.split("=")[1].strip() 
     elif line.startswith(line, len("!subset_sample_id")): 
      subset_ids = [x.strip() for x in subset_ids] 
      for k in subset_ids: 
       SIF[k] = subset_description 
    ## Next line is the column headers (sample id's) 
    SID = next(fid).split("\t") 

    ## The column indices that contain gene expression data 
    I = [i for i,x in enumerate(SID) if x.startswith("GSM")] 

    ## Restrict the column headers to those that we keep 
    SID = [SID[i] for i in I] 

    ## Get a list of sample labels 
    STP = [SIF[k] for k in SID] 

    ## Read the gene expression data as a list of lists, also get the gene 
    ## identifiers 
    GID,X = [],[] 
    for line in fid: 

     ## This is what signals the end of the gene expression data 
     ## section in the file 
     if line.startswith("!dataset_table_end"): 
      break 

     V = line.split("\t") 

     ## Extract the values that correspond to gene expression measures 
     ## and convert the strings to numbers 
     x = [float(V[i]) for i in I] 

     X.append(x) 
     GID.append(V[0] + ";" + V[1]) 
X = np.array(X) 

## The indices of samples for the ulcerative colitis group 
UC = [i for i,x in enumerate(STP) if x == "ulcerative colitis"] 

## The indices of samples for the Crohn's disease group 
CD = [i for i,x in enumerate(STP) if x == "Crohn's disease"] 

I get output like this:

X
Out[94]: array([], dtype=float64)

X.shape
Out[95]: (0,)

Thanks again for your replies.

Answer


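This worked perfectly. In Python 3, gzip.open reads in binary mode by default, so every line is a bytes object; the prefixes passed to startswith and the separator passed to split must therefore be bytes literals, and values are decoded before they are used as dictionary keys: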

import gzip
import numpy as np

"""
Read in a SOFT format data file. The following values can be exported:

GID : A list of gene identifiers of length d
SID : A list of sample identifiers of length n
STP : A list of sample descriptions of length n
X : A dxn array of gene expression values
"""
## Path to the data file
fname = "../data/GDS1615_full.soft.gz"

## Open the data file directly as a gzip file. The default mode is
## binary, so each line read from fid is a bytes object.
with gzip.open(fname) as fid:
    SIF = {}
    for line in fid:
        if line.startswith(b"!dataset_table_begin"):
            break
        elif line.startswith(b"!subset_description"):
            subset_description = line.decode('utf8').split("=")[1].strip()
        elif line.startswith(b"!subset_sample_id"):
            subset_ids = line.decode('utf8').split("=")[1].split(",")
            subset_ids = [x.strip() for x in subset_ids]
            for k in subset_ids:
                SIF[k] = subset_description

    ## Next line is the column headers (sample id's)
    SID = next(fid).split(b"\t")

    ## The column indices that contain gene expression data
    I = [i for i,x in enumerate(SID) if x.startswith(b"GSM")]

    ## Restrict the column headers to those that we keep
    SID = [SID[i] for i in I]

    ## Get a list of sample labels (SIF is keyed by str, so decode first)
    STP = [SIF[k.decode('utf8')] for k in SID]

    ## Read the gene expression data as a list of lists, also get the gene
    ## identifiers. This loop must stay inside the with block, because
    ## fid is closed as soon as the block ends.
    GID,X = [],[]
    for line in fid:

        ## This is what signals the end of the gene expression data
        ## section in the file
        if line.startswith(b"!dataset_table_end"):
            break

        V = line.split(b"\t")

        ## Extract the values that correspond to gene expression measures
        ## and convert the strings to numbers
        x = [float(V[i]) for i in I]

        X.append(x)
        GID.append(V[0].decode() + ";" + V[1].decode())

X = np.array(X)

## The indices of samples for the ulcerative colitis group
UC = [i for i,x in enumerate(STP) if x == "ulcerative colitis"]

## The indices of samples for the Crohn's disease group
CD = [i for i,x in enumerate(STP) if x == "Crohn's disease"]

Result:

X.shape
Out[4]: (22283, 127)
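
An alternative that avoids the bytes literals and the decode calls is to open the gzip file in text mode. The following is only a minimal sketch of that idea, assuming the file is UTF-8 text and has the same layout as above; read_soft is a hypothetical helper name, not part of any library, and it returns X as a plain list of lists so that it depends on the standard library only.

```python
import gzip


def read_soft(path):
    """Parse a GDS SOFT file opened in text mode (hypothetical helper).

    Returns (GID, SID, STP, X) with the same meaning as in the code above,
    except that X is a list of lists rather than a NumPy array.
    """
    SIF = {}
    subset_description = None
    # Mode "rt" makes gzip decode every line to str, so ordinary string
    # literals can be used with startswith() and split().
    with gzip.open(path, "rt", encoding="utf-8") as fid:
        for line in fid:
            if line.startswith("!dataset_table_begin"):
                break
            elif line.startswith("!subset_description"):
                subset_description = line.split("=", 1)[1].strip()
            elif line.startswith("!subset_sample_id"):
                for k in line.split("=", 1)[1].split(","):
                    SIF[k.strip()] = subset_description

        # The line after the table marker holds the column headers
        header = next(fid).rstrip("\n").split("\t")

        # Keep only the columns that hold expression values (sample ids)
        I = [i for i, x in enumerate(header) if x.startswith("GSM")]
        SID = [header[i] for i in I]
        STP = [SIF[k] for k in SID]

        GID, X = [], []
        for line in fid:
            if line.startswith("!dataset_table_end"):
                break
            V = line.rstrip("\n").split("\t")
            X.append([float(V[i]) for i in I])
            GID.append(V[0] + ";" + V[1])

    return GID, SID, STP, X
```

Calling read_soft("../data/GDS1615_full.soft.gz") and then wrapping the last return value in np.array(X) should give the same kind of array as the code above. Like that code, the sketch raises an error if an expression value cannot be converted by float or if a sample id is missing from every subset.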
