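Perl: find the valid pair of lines among different cases

I have HTTP header request and reply data in tab-delimited form, with each GET/POST and each reply on a separate line. The data is such that a single TCP flow can contain multiple GETs, POSTs and REPLYs. Out of these I need to select only the first valid GET-REPLY pair. A simplified example: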

ID  Source Dest Bytes Type Content-Length host    lines.... 
1   A   B  10  GET  NA   yahoo.com   2 
1   A   B  10  REPLY  10   NA     2 
2   C   D  40  GET  NA   google.com   4 
2   C   D  40  REPLY  20   NA     4 
2   C   D  40  GET  NA   google.com   4 
2   C   D  40  REPLY  30   NA     4 
3   A   B  250 POST  NA   mail.yahoo.com  5 
3   A   B  250 REPLY  NA   NA     5 
3   A   B  250 REPLY  15   NA     5 
3   A   B  250 GET  NA   yimg.com    5 
3   A   B  250 REPLY  35   NA     5 
4   G   H  415 REPLY  10   NA     6 
4   G   H  415 POST  NA   facebook.com   6 
4   G   H  415 REPLY  NA   NA     6 
4   G   H  415 REPLY  NA   NA     6 
4   G   H  415 GET  NA   photos.facebook.com 6 
4   G   H  415 REPLY  50   NA     6 

....

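So basically I need to get one request-reply pair for each ID and write them to a new file.

'1' is easy since it is just one pair. But there can also be bogus cases where both lines are GET, POST or REPLY; such cases are ignored.

For '2', I select the first GET-REPLY pair.

For '3', I select the first request (a POST here) but take the second REPLY, because Content-Length is missing from the first one (which makes the subsequent REPLY a better candidate).

For '4', the first header cannot be a REPLY, so I select the first POST (or GET). Even though Content-Length is missing from the replies after the POST, I would not pick the REPLY that follows the second GET, so I just take the first REPLY.

After selecting the best request-reply pair, I need to merge the two into a single line. For example, the output would look like this: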
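The four cases above can be sketched as a single pass over the rows of one ID. This is only a minimal sketch under stated assumptions: pick_pair is a hypothetical helper name, each row is an array reference in the column order of the example, and a later REPLY may replace an NA Content-Length only if it arrives before the second request.

```perl
use strict;
use warnings;

# Hypothetical helper: takes the rows of ONE ID in file order
# (columns: ID Source Dest Bytes Type Content-Length host)
# and returns the merged request/reply row, or nothing if no valid pair exists.
sub pick_pair {
    my @rows = @_;
    my ($request, $length);
    my $seen_reply = 0;

    for my $row (@rows) {
        my ($type, $len) = @{$row}[4, 5];
        if ($type eq 'REPLY') {
            next unless $request;      # case '4': a REPLY before any request is ignored
            $seen_reply = 1;
            # keep the first reply's length, but let a later value replace NA (case '3')
            $length = $len if !defined $length or $length eq 'NA';
        }
        else {
            last if $request;          # second GET/POST: stop looking (cases '2', '3', '4')
            $request = $row;
        }
    }

    return unless $request and $seen_reply;   # bogus groups: no request or no reply
    my @out = @$request;
    $out[5] = $length;
    return \@out;
}

my $pair = pick_pair(
    [4, 'G', 'H', 415, 'REPLY', 10,   'NA'],
    [4, 'G', 'H', 415, 'POST',  'NA', 'facebook.com'],
    [4, 'G', 'H', 415, 'REPLY', 'NA', 'NA'],
    [4, 'G', 'H', 415, 'REPLY', 'NA', 'NA'],
    [4, 'G', 'H', 415, 'GET',   'NA', 'photos.facebook.com'],
    [4, 'G', 'H', 415, 'REPLY', 50,   'NA'],
);
print join("\t", @$pair), "\n" if $pair;
```

Grouping the lines by ID before calling such a helper is left out here; the rows could be collected while the ID stays the same.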

ID  Source Dest Bytes Type Content-Length host   ....
1   A   B  10  GET  10   yahoo.com
2   C   D  40  GET  20   google.com
3   A   B  250 POST  15   mail.yahoo.com
4   G   H  415 POST  NA   facebook.com

The actual data has many more headers, but this example shows most of what I need. How can I do this in Perl? I am stuck pretty much at the start, so I have only managed to read the file one line at a time.

# "or" binds more loosely than "||", so die fires when open itself fails
open my $fh, "<", "file.txt" or die "Cannot open file.txt: $!";

while (<$fh>) {
    chomp;
    my @line = split /\t/;

    # get the valid pairs for cases with multiple request - replies

    # get the paired up data together

}
close($fh);

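* Edit: I added a column giving the number of HTTP header lines for each ID, which helps in knowing how many subsequent lines to check. I also changed ID '4' so that its first header line is a REPLY. *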

2012-04-29 sfactor

+1. Thank you! –

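Is the ID sufficient to identify the group of lines to be processed? If so, can we assume that the source and destination are the same within an ID? –

@JonathanLeffler Yes, it is sufficient, since an ID represents one TCP flow with the same source, destination, ports and so on. As shown, I need to produce one request-reply pair for each ID. – sfactor

The program below does what I think you need.

It is commented and I think fairly clear. Please ask if anything is unclear.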

use strict; 
use warnings; 

use List::Util 'max'; 

my $file = $ARGV[0] // 'file.txt'; 
open my $fh, '<', $file or die qq(Unable to open "$file" for reading: $!); 

# Read the field names from the first line to index the hashes 
# Remember where the data in the file starts so we can get back here 
# 
my @fields = split ' ', <$fh>; 
my $start = tell $fh; 

# Build a format to print the accumulated data 
# Create a hash that relates column headers to their widths 
# 
my @headers = qw/ ID Source Dest Bytes Type Content-Length host /; 
my %len = map { $_ => length } @headers; 

# Read through the file to find the maximum data width for each column 
# 
while (<$fh>) { 
    my %data; 
    @data{@fields} = split; 
    next unless $data{ID} =~ /^\d/; 
    $len{$_} = max($len{$_}, length $data{$_}) for @headers; 
} 

# Build a format string using the values calculated 
# 
my $format = join ' ', map sprintf('%%%ds', $_), @len{@headers}; 
$format .= "\n"; 

# Go back to the start of the data 
# Print the column headers 
# 
seek $fh, $start, 0; 
printf $format, @headers; 

# Build transaction data hashes into $record and print them 
# Ignore any events before the first request 
# Ignore the second request and anything after it 
# Update the stored Content-Length field if a value other than NA appears 
# 
my $record; 
my $nreq = 0; 

while (<$fh>) { 

    my %data; 
    @data{@fields} = split; 
    my ($id, $type) = @data{ qw/ ID Type/}; 
    next unless $id =~ /^\d/; 

    if ($record and $id ne $record->{ID}) {
        printf $format, @{$record}{@headers};
        undef $record;
        $nreq = 0;
    }

    if ($type eq 'GET' or $type eq 'POST') {
        $record = \%data if $nreq == 0;
        $nreq++;
    }
    elsif ($nreq == 1) {
        if ($record->{'Content-Length'} eq 'NA' and $data{'Content-Length'} ne 'NA') {
            $record->{'Content-Length'} = $data{'Content-Length'};
        }
    }
} 

printf $format, @{$record}{@headers} if $record;

Output

With the data given in the question, this program produces

ID Source Dest Bytes Type Content-Length     host 
1  A  B  10  GET    10    yahoo.com 
2  C  D  40  GET    20   google.com 
3  A  B  250 POST    15  mail.yahoo.com 
4  G  H  415 POST    NA   facebook.com

2012-05-02 14:33:25 Borodin

This seems to work on the given data:

#!/usr/bin/env perl 
use strict; 
use warnings; 

# Shape of input records 
use constant ID  => 0; 
use constant Source => 1; 
use constant Dest  => 2; 
use constant Bytes => 3; 
use constant Type  => 4; 
use constant Length => 5; 
use constant Host  => 6; 

use constant fmt_head => "%-6s %-6s %-6s %-6s %-6s %-6s %s\n"; 
use constant fmt_data => "%-6d %-6s %-6s % 6d %-6s % 6s %s\n"; 

printf fmt_head, "ID", "Source", "Dest", "Bytes", "Type", "Length", "Host"; 

my @post_get; 
my @reply; 
my $lastid = -1; 
my $pg_count = 0; 

sub print_data 
{ 
    # Final validity checking 
    if ($lastid != -1) 
    { 
     printf fmt_data, $post_get[ID], $post_get[Source], 
       $post_get[Dest], $post_get[Bytes], $post_get[Type], $reply[Length], $post_get[Host]; 
     # Reset arrays; 
     @post_get =(); 
     @reply =(); 
     $pg_count = 0; 
    } 
} 

while (<>) 
{ 
    chomp; 
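    my @record = split;
    # Validate record here (number of fields, etc)
    # Skip blank lines and the header line, whose ID is not numeric
    next unless @record && $record[ID] =~ /^\d+$/;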
    # Detect change in ID 
    print_data if ($record[ID] != $lastid); 
    $lastid = $record[ID]; 

    if ($record[Type] eq "REPLY") 
    { 
     # Discard REPLY if there wasn't already a POST/GET 
     next unless defined $post_get[ID]; 
     # Discard REPLY if there was a second POST/GET 
     next if $pg_count > 1; 
     @reply = @record if !defined $reply[ID]; 
     $reply[Length] = $record[Length] 
         if $reply[Length] eq "NA" && $record[Length] ne "NA"; 
    } 
    else 
    { 
     $pg_count++; 
     @post_get = @record if !defined $post_get[ID]; 
     $post_get[Length] = $record[Length] 
          if $post_get[Length] eq "NA" && $record[Length] ne "NA"; 
    } 
} 
print_data;

Which produces:

ID Source Dest Bytes Type Content-Length    host 
1  A  B  10 GET    10  yahoo.com 
2  C  D  40 GET    20  google.com 
3  A  B  250 POST    15 mail.yahoo.com 
4  G  H  415 POST    NA  facebook.com

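The main deviation from the question is the use of 'Length' in place of 'Content-Length'. If that matters, it is easily fixed by changing the width of the sixth field in fmt_data and fmt_head to 14 and changing "Length" to "Content-Length".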
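A minimal sketch of that change, assuming the rest of the program stays as it is: only the sixth field of the two format constants is widened to 14 and the header label is changed. The sample values passed to the data format are taken from ID 1 in the question.

```perl
use strict;
use warnings;

# Sixth field widened from 6 to 14 so that "Content-Length" fits
use constant fmt_head => "%-6s %-6s %-6s %-6s %-6s %-14s %s\n";
use constant fmt_data => "%-6d %-6s %-6s % 6d %-6s % 14s %s\n";

my $head = sprintf fmt_head, "ID", "Source", "Dest", "Bytes", "Type", "Content-Length", "Host";
my $row  = sprintf fmt_data, 1, "A", "B", 10, "GET", 10, "yahoo.com";

print $head, $row;
```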

2012-04-29 16:58:48

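Please explain what you mean by improving it. Using global variables in 'print_data' and resetting those globals there might not be the best idea; use references instead, and clear the arrays inside the main loop. Also, 'chomp' is not needed when splitting on whitespace. However, respecting the tab-delimited format and using 'chomp' + 'split /\t/' would be the better choice. – TLP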
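The two suggestions in this comment could look roughly like the sketch below. The helper names parse_line and format_pair are hypothetical, and the sketch assumes the input really is tab-delimited, as the question states.

```perl
use strict;
use warnings;

# Hypothetical helper: split one tab-delimited line into its fields.
sub parse_line {
    my ($line) = @_;
    chomp $line;                      # needed here, because split /\t/ keeps the newline
    return split /\t/, $line, -1;     # -1 keeps trailing empty fields
}

# Hypothetical helper: receives array references instead of reading globals.
sub format_pair {
    my ($request, $reply) = @_;
    my @out = @$request;
    $out[5] = $reply->[5];            # Content-Length comes from the reply
    return join("\t", @out[0 .. 6]) . "\n";
}

my @request = parse_line("1\tA\tB\t10\tGET\tNA\tyahoo.com\t2\n");
my @reply   = parse_line("1\tA\tB\t10\tREPLY\t10\tNA\t2\n");
print format_pair(\@request, \@reply);
```

With this shape the main loop owns the two arrays and can clear them itself whenever the ID changes.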

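Also, an array slice, 'printf fmt_data, @post_get[ID, Source, Dest, Bytes, Type], $reply[Length], $post_get[Host]', would make it a little more readable. – TLP

@Jonathan Leffler: Using arrays that are effectively indexed by an 'enum', rather than a simple hash, seems like doing it the hard way. – Borodin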
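As a small self-contained illustration of the slice, assuming the same index constants and data format as in the answer; the sample rows are taken from ID 1 in the question:

```perl
use strict;
use warnings;

# Column indexes, as in the answer
use constant {
    ID => 0, Source => 1, Dest => 2, Bytes => 3,
    Type => 4, Length => 5, Host => 6,
};
use constant fmt_data => "%-6d %-6s %-6s % 6d %-6s % 6s %s\n";

my @post_get = (1, 'A', 'B', 10, 'GET',   'NA', 'yahoo.com');
my @reply    = (1, 'A', 'B', 10, 'REPLY', 10,   'NA');

# One slice replaces five separate $post_get[...] element accesses
my $line = sprintf fmt_data,
    @post_get[ID, Source, Dest, Bytes, Type], $reply[Length], $post_get[Host];
print $line;
```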
