Dumping or stripping out UTF-8 bytes in C

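I have a "C" program at my fire station that captures incoming packets to the station printer. The program then scans each packet and reports which apparatus is due on the call. The county recently started using UTF-8 packets, and the C program cannot handle all the extra "00" bytes in the data flow. I need to either ignore the 00s or set the program up to handle UTF-8. I have been searching for days and have found nothing concrete on handling UTF-8 that a novice like me can work with. Below is the interpreting portion of the program.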

72 00 65 00 61 00 74 00 68 00 69 00 6E 00 67 00 (later in the packet)

43 4F 44 45 53 45 54 3D 55 54 46 38 0A 40 50 4A (at the start of the packet)

void compressUtf16 (char *buff, size_t count) {
    size_t i;
    for (i = 0; i < count; i++)
        buff[i] = buff[i*2];  // for xx 00 xx 00 xx 00 ...
}

{
u_int i = 0;
char *searcher = 0;
char c;
int j;
int locflag;
static int locationtripped = 0;

static char currentline[256]; 
static int currentlinepos = 0; 
static char lastdispatched[256]; 
static char dispatchstring[256]; 

char betastring[256]; 

static int a = 0; 
static int e = 0; 
static int pe = 0; 
static int md = 0; 

static int pulse = 0; 

static char location[128]; 
static char type[16]; 
static char station[16]; 

static FILE *fp; 
static int printoutscanning = 0; 
static char printoutID[20]; 
static char printoutfileID[32]; 

static FILE *dbg; 

if(pulse) { 
    if(pulse == 80) { 
     sprintf(betastring, "beta a a a"); 
     printf("betastring: \"%s\"\n", betastring); 
     system(betastring); 
     pulse = 0; 
    } else 
     pulse++; 
} 

    if(header->len > 96) { 
     for(i=55; (i < header->caplen + 1) ; i++) { 
      c = pkt_data[i-1]; 

     if(c == 13 || c == 10) { 
      currentline[currentlinepos] = 0; 
      currentlinepos = 0; 
      j = strlen(currentline); 
      if(j && (j > 1)) { 
       if(strlen(printoutfileID) && printoutscanning) { 
        dbg = fopen(printoutfileID, "a"); 
        fprintf(dbg, "%s\n", currentline); 
        fclose(dbg); 
       } 

       if(!printoutscanning) { 
        searcher = 0; 
        searcher = strstr(currentline, "INCIDENT HISTORY DETAIL:"); 
        if(searcher) { 
         searcher = searcher + 26; 
         strncpy(printoutID, searcher, 9); 
         printoutID[9] = 0; 
         printoutscanning = 1; 
         a = 0; 
         e = 0;
         pe = 0;
         md = 0;
         for(j = 0; j < 128; j++)
          location[j] = 0;
         for(j = 0; j < 16; j++) { 
          type[j] = 0; 
          station[j] = 0; 
         } 
         sprintf(printoutfileID, "calls/%s %.6d.txt", printoutID, (int)header->ts.tv_usec);  /* cast: %d expects int */
         dbg = fopen(printoutfileID, "a"); 
         fprintf(dbg, "%s\n", currentline); 
         fclose(dbg); 
        }

Source

2011-07-12 PGFDBUG

Perhaps you mean UTF-16? –

UTF-8 does not use 00 as a special encoding character. Are you sure you don't have UTF-16? –

"C" program? Is its status as C in doubt? – nil

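With the exception of the zero code point itself, UTF-8 never contains zero bytes. The first byte of every multi-byte encoding (non-ASCII code points) always starts with the 11 bit pattern, and the subsequent bytes always start with the 10 bit pattern.

As you can see from the following table, U+0000 is the only code point that gives you a zero byte in UTF-8.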

+----------------+----------+----------+----------+----------+
| Unicode        | Byte 1   | Byte 2   | Byte 3   | Byte 4   |
+----------------+----------+----------+----------+----------+
| U+0000-007F    | 0xxxxxxx |          |          |          |
| U+0080-07FF    | 110yyyxx | 10xxxxxx |          |          |
| U+0800-FFFF    | 1110yyyy | 10yyyyxx | 10xxxxxx |          |
| U+10000-10FFFF | 11110zzz | 10zzyyyy | 10yyyyxx | 10xxxxxx |
+----------------+----------+----------+----------+----------+
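To make the table concrete, here is a minimal sketch of an encoder that follows it row by row, plus a helper that checks whether the encoding of a code point contains a zero byte. The names utf8_encode and utf8_has_zero_byte are made up for illustration and are not part of any library; surrogate code points are not rejected, to keep the sketch short.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-8 per the table above.
   Returns the number of bytes written to out (1 to 4), or 0 if cp is
   beyond U+10FFFF. Surrogates are not rejected in this sketch. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp <= 0x7F) {                       /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {                      /* 110yyyxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {                     /* 1110yyyy 10yyyyxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* 11110zzz 10zzyyyy 10yyyyxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Hypothetical helper: returns 1 if the UTF-8 encoding of cp contains a
   zero byte, 0 otherwise (including when cp cannot be encoded). */
int utf8_has_zero_byte(uint32_t cp)
{
    unsigned char bytes[4];
    size_t n = utf8_encode(cp, bytes);
    size_t i;
    for (i = 0; i < n; i++)
        if (bytes[i] == 0)
            return 1;
    return 0;
}
```

Every lead byte of a multi-byte sequence has its top two bits set and every continuation byte has its top bit set, so only the single-byte encoding of U+0000 can come out as 00.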

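UTF-16 will intersperse zero bytes between your otherwise ASCII bytes, but it is then a simple matter of throwing away every second byte. Whether those are bytes 0, 2, 4, ... or 1, 3, 5, ... depends on whether your UTF-16 encoding is big-endian or little-endian.

I see from your sample that your data stream does claim to be UTF-8 (43 4f 44 45 53 45 54 3d 55 54 46 38 translates to the text CODESET=UTF8), but I would be willing to guarantee you that it is lying :-)

The segment 72 00 65 00 61 00 74 00 68 00 69 00 6e 00 67 00 is UTF-16 for reathing, presumably part of the word breathing (I am not familiar with reathing as a word on its own, at least not in English).

I would suggest you clarify this with whoever is generating the data, since it is clearly erroneous. As for how to process the UTF-16, I have covered that above. Provided it is ASCII data in there (the alternate bytes are always zero), you can just throw away those alternates with something like: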

// Process a UTF16 buffer containing ASCII-only characters. 
// buff is the buffer, count is the quantity of UTF-16 chars. 
// Will change buffer. 

void compressUtf16 (char *buff, size_t count) {
    size_t i;  // same type as count, avoids a signed/unsigned comparison
    for (i = 0; i < count; i++)
        buff[i] = buff[i*2];  // for xx 00 xx 00 xx 00 ...
}

And, if you are using the other-endian UTF-16, simply change:

buff[i] = buff[i*2];  // for xx 00 xx 00 xx 00 ...

to:

buff[i] = buff[i*2+1]; // for 00 xx 00 xx 00 xx ...
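If it is not known in advance which byte of each pair is the zero one, a cautious variant can check the buffer before discarding anything. The following is only a sketch under the same ASCII-only assumption; guess_utf16_ascii, strip_utf16_ascii and the enum are hypothetical names, not part of the original program or of any library.

```c
#include <stddef.h>

/* Hypothetical result type for the layout check below. */
enum utf16_guess { NOT_UTF16_ASCII = 0, UTF16_LE_ASCII, UTF16_BE_ASCII };

/* Inspect len bytes and report whether every byte pair looks like an
   ASCII character stored as UTF-16: "xx 00" (little-endian) or
   "00 xx" (big-endian). A trailing odd byte is ignored. */
enum utf16_guess guess_utf16_ascii(const unsigned char *buf, size_t len)
{
    size_t pairs = len / 2;
    size_t i;
    int le = 1, be = 1;

    if (pairs == 0)
        return NOT_UTF16_ASCII;

    for (i = 0; i < pairs; i++) {
        unsigned char first = buf[2 * i];
        unsigned char second = buf[2 * i + 1];
        if (!(second == 0 && first != 0 && first < 0x80))
            le = 0;  /* pair is not "xx 00" with xx in ASCII range */
        if (!(first == 0 && second != 0 && second < 0x80))
            be = 0;  /* pair is not "00 xx" with xx in ASCII range */
    }
    if (le)
        return UTF16_LE_ASCII;
    if (be)
        return UTF16_BE_ASCII;
    return NOT_UTF16_ASCII;
}

/* Compress the buffer in place, keeping only the non-zero byte of each
   pair. Returns the number of characters kept, or 0 (buffer untouched)
   if the data does not look like ASCII-only UTF-16. */
size_t strip_utf16_ascii(unsigned char *buf, size_t len)
{
    enum utf16_guess g = guess_utf16_ascii(buf, len);
    size_t pairs = len / 2;
    size_t offset, i;

    if (g == NOT_UTF16_ASCII)
        return 0;

    offset = (g == UTF16_BE_ASCII) ? 1 : 0;
    for (i = 0; i < pairs; i++)
        buf[i] = buf[2 * i + offset];  /* reads stay at or ahead of writes */
    return pairs;
}
```

Note that the sample packet starts with plain single-byte text (CODESET=UTF8) and only later switches to the xx 00 layout, so a check like this would have to be applied to the two-byte portion only. Where that portion begins is not stated in the post, so the caller would need to work that out from the actual dump.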

Source

2011-07-12 02:41:46 paxdiablo

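The start of the packet is easy to interpret, and based on my information it is clearly UTF-8. If it helps, I can post a dump. – PGFDBUG

@PGFDBUG, yes, that would help. However, although it may _claim_ to be UTF8, there is no way you would get zero bytes unless the input stream actually contained zero code points. – paxdiablo

@PGFDBUG: Yes, that would help. Anonymize it as needed. –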
