2016-05-29 13 views
0

異なる生物種を含むgenbankファイルから最大、最小、平均の長さを取得したいと思います。各生物からの配列長の取得 - genbankファイル

例:

Organism: Homo sapiens 
average length = 160 
shortest length = 20 | id of shortest seq 
longest length = 500 | id of longest seq 

Organism: Caenorhabditis elegans 
average length = 140 
shortest length = 40 | id of shortest seq 
longest length = 300 | id of longest seq 

私は長さを得ることができたが、私は、私は各生物のための長さを印刷するには変更する必要があり一部

use strict; 
use warnings; 

print "enter file path: "; 
my $filename = <>; 
chomp ($filename); 
open(IN, $filename) or die "\n error opening file \n;/\n"; 

$/ = "//"; 

my %organisms ; 
while (my $block = <IN>) { 
    next if $block =~ /^\s*\n\s*$/; 
    my ($definition , $sequence) = split "ORIGIN", $block; 

    my $accession; 
    $definition =~ m/(ACCESSION.+[0-9])/x 
     ? $accession = $1 
     : die "No ACCESSION"; 

    my $organism; 
    $definition =~ m/(ORGANISM\s+.+\n)/x 
     ? $organism = $1 
     : die "No ORGANISM"; 


    $sequence =~ s/[\d\n\s\t\/]//g; 
    $organisms{ $sequence } = [ $organism, $accession ]; 
    } 

my $sum = 0; 
foreach my $sequence(keys %organisms) { 

    my $current_len = length($sequence); 
     $sum += $current_len; 
}; 
my $number_seqs = scalar keys %organisms; 
my $average = ($sum/$number_seqs); 
print "average length = $average \n"; 

my @sorted_keys = sort { length $a <=> length $b } keys %organisms ; 
my $shortest = $sorted_keys[0]; 
my $longest = $sorted_keys[-1]; 
my $short = length $shortest; 
my $long = length $longest; 
my $short_id = $organisms{$shortest}->[1]; 
my $long_id = $organisms{$longest}->[1]; 
my $short_type = $organisms{$shortest}->[0]; 
my $long_type = $organisms{$longest}->[0]; 

print "shortest length = $short | $short_id | $short_type\n"; 
print "longest length = $long | $long_id | $long_type\n"; 


exit; 

生物

によってそれらを分離することはできませんか?

入力例:

LOCUS  NM_001112   40 bp mRNA linear PRI 20-AP 
DEFINITION Homo sapiens transcript variant 5, mRNA. 
ACCESSION NM_001112 
KEYWORDS RefSeq. 
SOURCE  Homo sapiens (human) 
    ORGANISM Homo sapiens 
ORIGIN  
     1 actgggcggc ccttagaccc 
     // 
LOCUS  NM_854    212 bp mRNA linear PRI 20-AP 
DEFINITION Homo sapiens transcript variant 1, mRNA. 
ACCESSION NM_854 
KEYWORDS RefSeq. 
SOURCE  Homo sapiens (human) 
    ORGANISM Homo sapiens 
ORIGIN  
     1 gggcaaaag aagcaggtca cacagcctgt ttcctgtttt caaacgggga acttagaaag 

// 
LOCUS  AW057463     469 bp mRNA linear EST 29-SE 
DEFINITION ca03d09.x1 C elegans fem3 Q23 S1 Caenorhabditis elegans cDNA 3' 
ACCESSION AW057463 
VERSION  AW057463.1 GI:5933102 
DBLINK  BioSample: LIBEST_002392 
KEYWORDS EST. 
SOURCE  Caenorhabditis elegans 
    ORGANISM Caenorhabditis elegans 
ORIGIN  
     1 ttttactcaa aactatctat ccaagttaat cagtagtgtt agttctagtt aagttattaa 
     61 ggcgcacggt ctgtctcctt gcttcttctc tttgtatccc ctttctcctt tttcaaaact 
     121 tcactttcat caataattgg ttctttagaa tacagttttc caatttccac gtactctctt 
     181 ctcttccgat ccttgtcaaa ctttttcttc gggagctcat cttctggaac tactttcaca 
// 
LOCUS  AW04463     259 bp mRNA linear EST 14-SE 
DEFINITION ca02d86.x1 C elegans fem6 S12 Q3 Caenorhabditis elegans cDNA 3' 
ACCESSION AW04463 
VERSION  AW04463.1 GI:90872 
DBLINK  BioSample: LIBEST_004372 
KEYWORDS EST. 
SOURCE  Caenorhabditis elegans 
    ORGANISM Caenorhabditis elegans 
ORIGIN  
     241 tttttcgatg gaaccaaacg ggaacgagtt ggcttttcca ccaaaagatt agcgtactcc 
     301 gaactgtatt tccccttctt tttcttttca agaggaacat tttctcgttg agtatcatcg 
     361 tcctccaaac tttgttgagt agtcatggac tgggtccgag agaattcaac ggtaggcatg 
     421 gaacctttgc tcttgtcgtc gtttgccttt ggtgcctttc ccttttgaa 
// 

答えて

0

私は各生物のための長さを印刷するように変更する必要があり一部?

私は、プログラムの多くの部分を変更する必要があると思います(主に、シーケンスをハッシュキーとして使用するロジックを変更して、その生物をハッシュキーとして使用します)。ここで提案です:

use feature qw(say); 
use strict; 
use warnings; 

my $filename = 'datafile.txt'; 
chomp ($filename); 
open(IN, $filename) or die "\n error opening file \n;/\n"; 

$/ = "//"; 

my %organisms ; 
while (my $block = <IN>) { 
    next if $block =~ /^\s*\n\s*$/; 
    my ($definition , $sequence) = split "ORIGIN", $block; 

    my $accession; 
    $definition =~ m/ACCESSION\s*(\S+)/x 
     ? $accession = $1 
     : die "No ACCESSION"; 

    my $organism; 
    $definition =~ m/ORGANISM\s*(\S.+)\n/x 
     ? $organism = $1 
     : die "No ORGANISM"; 

    $sequence =~ s/[\d\n\s\t\/]//g; 
    push @{ $organisms{ $organism }{info} }, [ $sequence, $accession ]; 
} 

for my $organism (keys %organisms) { 
    say "Organism: $organism"; 
    my $info = $organisms{ $organism }{info}; 
    my $sum = 0; 
    map { $sum += length $_->[0] } @$info; 
    my ($min_len, $min_id, $max_len, $max_id) = map { length $_->[0], $_->[1] } 
     (sort { length $a->[0] <=> length $b->[0]} @$info)[0,-1]; 

    say "average length = " . $sum/(scalar @$info); 
    say "shortest length = $min_len | $min_id"; 
    say "longest length = $max_len | $max_id"; 
    say ""; 
} 

出力は以下のようになります。

Organism: Homo sapiens 
average length = 39.5 
shortest length = 20 | NM_001112 
longest length = 59 | NM_854 

Organism: Caenorhabditis elegans 
average length = 234.5 
shortest length = 229 | AW04463 
longest length = 240 | AW057463 
関連する問題