カウンタを出力するループを最適化する

私は、5000000から1までの数値を出力する非常に小さなループプログラムを用意しています。カウンタを出力するループを最適化する

私はNASMでLinux x86-64アセンブリを学んでいます。

global main 
extern printf 
main: 
    push rbx      
    mov rax,5000000d 
print: 
    push rax      
    push rcx      
    mov  rdi, format    
    mov  rsi, rax     
    call printf     
    pop  rcx      
    pop  rax      
    dec  rax      
    jnz  print     
    pop  rbx      
    ret 

format: 
db "%ld", 10, 0

出典

2016-12-07 Anonymous

あなたはどんな種類のモンスター計算'printf'はやってますか？：-oあなたはすべきです。数値が10進形式に変換される方法を確認します。ピーターの答えは、明らかにあなたの例では、文字列を操作して数値を避けるのと同じように、この事例でははるかに高速です*。 – Ped7g

printfの呼び出しは、その非常に非効率的なループの実行時間を完全に支配します。（あなたがどこにも決して使っていないのに、あなたはrcxをプッシュ/ポップすることに気づいたのでしょうか？おそらくそれはthe slow LOOP instructionを使って残されているでしょう）。

効率的なx86 asmの作成方法については、Agner Fog's Optimizing Assembly guideを参照してください。（また、特定のCPUの詳細とその違いについて実際に知りたい場合は、マイクロアーキテクチャーのガイドもあります.1つのuarch CPUで最適なものが別のものにない可能性があります。たとえば、IMUL r64はインテルCPUの方がAMDよりも、CMOVとADCは2サイクルのレイテンシで、Intelのpre-Broadwellでは2μopsです.3入力ALU m-ops（FLAGS +両レジスタ）はAMDにとって問題ではないため、AMDでは1です。。）x86タグwikiの他のリンクも見てください。

は純粋5Mはprintf関数の呼び出し変更せずにループを最適化するだけ実際にこのコードをスピードアップしないように、適切にループを作成する方法の一例として有用です。しかし、のは、それから始めましょう：

; trivial fixes to loop efficiently while calling the same slow function 
global main 
extern printf 
main: 
    push rbx 
    mov  ebx, 5000000   ; don't waste a REX prefix for constants that fit in 32 bits 
.print: 
    ;; removed the push/pops from inside the loop. 
    ; Use call-preserved regs instead of saving/restoring stuff inside a loop yourself. 
    mov  edi, format   ; static data/code always has a 32-bit address 
    mov  esi, ebx 
    xor  eax, eax    ; The x86-64 SysV ABI requires al = number of FP args passed in FP registers for variadic functions 
    call printf     
    dec  ebx 
    jnz  .print 

    pop  rbx    ; restore rbx, the one call-preserved reg we actually used. 
    xor  eax,eax   ; successful exit status. 
    ret 

section .rodata  ; it's usually best to put constant data in a separate section of the text segment, not right next to code. 
format: 
db "%ld", 10, 0

これをスピードアップするために、我々は文字列に連続する整数に変換するには、冗長性の利点を取る必要があります。 "5000000\n"は8バイト長（改行を含む）なので、文字列表現は64ビットのレジスタに収まります。

この文字列をバッファに格納し、ポインタを文字列の長さだけインクリメントすることができます。（より小さい数字の場合は短くなるので、現在の文字列の長さをレジスタ内に保持してください。特殊文字の場合は、その文字列の長さを変更することができます）。

文字列表現をインプレース整数を小数点以下の文字列に変換するために10で割るという処理をこれまで通り繰り返してきました。

carry/borrowはレジスタ内で自然に伝播せず、AAS命令は64ビットモードでは使用できません（EAではなくEAで動作し、遅いです）。それは自分自身です。私たちは毎回1ずつ減らしているので、何が起こるかを知っています。私たちは10回アンロールすることで最下位桁を扱うことができるので、それを処理する分岐はありません。

また、印刷順序で数字を使用したいので、x86はリトルエンディアンであるため、キャリーは間違った方向に進みます。他のバイトオーダーで私たちの文字列を利用する良い方法があれば、おそらくBSWAPまたはMOVBEを使うことができます。（しかし、それらの2 ALUののuop、MOVBE R64は、Skylakeマイクロアーキテクチャ上の3融合したドメインのuopであることに注意してください。BSWAPのR64も2つのuopである。）

をおそらく、我々は、の二つの部分に、並列に奇数/偶数カウンタをやるべきXMMベクトルレジスタ。しかし、文字列が8Bよりも短いと、それはうまく動作しなくなります。一度に1つの数値列を格納すると、簡単に重複することができます。それでも、ベクトルレジスタで桁上げ伝播を行い、2つの半分をMOVQとMOVHPSで別々に格納することができます。あるいは、0から5Mの数字の4/5が7桁であるので、2つの数字の16Bベクトル全体を格納できる特殊なケースのコードを持つ価値があるかもしれません。 SSSE3 PSHUFBは、2つの文字列をベクトルレジスタに左揃えにシャッフルし、1つのMOVUPSに2つずつ格納します。文字列の長さ（桁数）が変更されたときにシャッフルマスクを更新するだけで済みますので、あまり頻繁に実行されないキャリー処理の特殊ケースコードでも処理できます。

ループのホットな部分のベクトル化は、非常に簡単で安価で、約2倍のパフォーマンスが必要です。

;;; Optimized version: keep the string data in a register and modify it 
;;; instead of doing the whole int->string conversion every time. 

section .bss 
printbuf: resb 1024*128 + 4096  ; Buffer size ~= half L2 cache size on Intel SnB-family. Or use a giant buffer that we write() once. Or maybe vmsplice to give it away to the kernel, since we only run once. 

global main 
extern printf 
main: 
    push rbx 

    ; use some REX-only regs for values that we're always going to use a REX prefix with anyway for 64-bit operand size. 
    mov  rdx, `5000000\n` ; (NASM string constants as integers work like little-endian, so AL = '5' = 0x35 and the high byte holds '\n' = 10). Note that YASM doesn't support back-ticks for C-style backslash processing. 
    mov  r9, 1<<56   ; decrement by 1 in the 2nd-last byte: LSB of the decimal string 
    ;xor  r9d, r9d 
    ;bts  r9, 56   ; IDK if this code-size optimization outside the loop would help or not. 

    mov  eax, 8   ; string length. 
    mov  edi, printbuf 

.storeloop: 

    ;; rdx = "????x9\n". We compute the start value for the next iteration, i.e. counter -= 10 in rdx. 

    mov  r8, rdx 
    ;; r8 = rdx. We modify it to have each last digit from 9 down to 0 in sequence, and store those strings in the buffer. 
    ;; The string could be any length, always with the first ASCII digit in the low byte; our other constants are adjusted correctly for it 
    ;; narrower than 8B means that our stores overlap, but that's fine. 

    ;; Starting from here to compute the next unrolled iteration's starting value takes the `sub r8, r9` instructions off the critical path, vs. if we started from r8 at the bottom of the loop. This gives out-of-order execution more to play with. 
    ;; It means each loop iteration's sequence of subs and stores are a separate dependency chain (except for the store addresses, but OOO can get ahead on those because we only pointer-increment every 2 stores). 

    mov  [rdi], r8 
    sub  r8, r9    ; r8 = "xxx8\n" 

    mov  [rdi + rax], r8 ; defer p += len by using a 2-reg addressing mode 
    sub  r8, r9    ; r8 = "xxx7\n" 

    lea  edi, [rdi + rax*2] ; if we had len*3 in another reg, we could defer this longer 
      ;; our static buffer is guaranteed to be in the low 31 bits of address space so we can safely save a REX prefix on the LEA here. Normally you shouldn't truncate pointers to 32-bits, but you asked for the fastest possible. This won't hurt, and might help on some CPUs, especially with possible decode bottlenecks. 

    ;; repeat that block 3 more times. 
    ;; using a short inner loop for the 9..0 last digit might be a win on some CPUs (like maybe Core2), depending on their front-end loop-buffer capabilities if the frontend is a bottleneck at all here. 

    ;; anyway, then for the last one: 
    mov  [rdi], r8    ; r8 = "xxx1\n" 
    sub  r8, r9 
    mov  [rdi + rax], r8  ; r8 = "xxx0\n" 

    lea  edi, [rdi + rax*2] 


    ;; compute next iteration's RDX. It's probably a win to interleave some of this into the loop body, but out-of-order execution should do a reasonably good job here. 
    mov  rcx, r9 
    shr  rcx, 8  ; maybe hoist this constant out, too 
    ; rcx = 1 in the second-lowest digit 
    sub  rdx, rcx 

    ; detect carry when '0' (0x30) - 1 = 0x2F by checking the low bit of the high nibble in that byte. 
    shl  rcx, 5 
    test rdx, rcx 
    jz  .carry_second_digit 
    ; .carry_second_digit is some complicated code to propagate carry as far as it needs to go, up to the most-significant digit. 
    ; when it's done, it re-enters the loop at the top, with eax and r9 set appropriately. 
    ; it only runs once per 100 digits, so it doesn't have to be super-fast 

    ; maybe only do buffer-length checks in the carry-handling branch, 
    ; in which case the jz .carry can be jnz .storeloop 
    cmp  edi, esi    ; } while(p < endp) 
    jbe  .storeloop 

    ; write() system call on the buffer. 
    ; Maybe need a loop around this instead of doing all 5M integer-strings in one giant buffer. 

    pop  rbx 
    xor  eax,eax   ; successful exit status. 
    ret

これは完全にfleshed-outではありませんが、何がうまくいくかのアイデアを与える必要があります。

SSE2を使用してベクトル化する場合は、スカラ整数レジスタを使用して、いつキャリーを分割してキャリーを処理する必要があるかを把握してください。すなわち、10からのダウンカウンタである。

このスカラバージョンでさえも、クロックごとに1つのストアを維持することに近くなり、ストアポートを飽和させる。彼らはわずか8Bストアです（文字列が短くなると有効な部分はそれよりも短くなります）ので、キャッシュミスのボトルネックがなければ、テーブルにパフォーマンスを残しています。しかし、3GHz CPUとデュアルチャネルDDR3-1600（理論上の最大帯域幅〜25.6GB/s）では、1クロックあたり8Bで、メイン・メモリを単一のコアで飽和させるのに十分です。

これを並列化して5M .. 1の範囲をチャンクに分割することができます。巧妙な数学を使って、"2500000\n"の最初の文字を書き込むバイトを特定するか、または各スレッドが正しい順序でwrite()を呼び出すようにすることができます。（同じ巧妙な数学を使って、pwrite(2)を別々のファイルオフセットで独立して呼び出すようにすると、カーネルは同じファイルに対する複数のライターのすべての同期を処理します）。

出典

2016-12-08 01:18:56

ピーターにその記述的な答えをありがとうございました..私は答えを全部理解するための時間が必要です。今は別の質問をしたいのです...質問を投稿する前に、実行に時間がかかり、60秒でした私はあなたに最初の提案ファイルを適用し、私は82秒を得ました。私はそれに驚きました。私は元のコードをもう一度走らせました。それは約90秒で実行されますので、i4 4200mコアのラップトップでubuntu 14.04.5を実行しています私は60秒間に8GBのRAMとさらに奇妙なときに私は同時に多くのアプリを実行していたので非常に奇妙な –

@Adou：それはターミナルウィンドウで印刷する場合、それはさらに大きなボトルネックです。プログラムの時刻を決めるには、その出力を/ dev/nullにリダイレクトします。または少なくともファイルに。それを 'wc -c'にパイプすることも妥当でしょう。 'time ./a.out>/dev/null'を使い、ユーザー時間とシステム時間をリアルタイムで比較します。また、 'perf stat ./a.out>/dev/null'を試して、CPUが実行している周波数を調べてみましょう。それはかなり早く立ち上がるはずですが、ターボはラップトップCPUに大きな違いをもたらします（パワー・バジェットが限られているということは最大ターボが最大持続時間よりもかなり高いことを意味するからです）。 –

ありがとうございました非常に便利でした –

あなたは基本的に固定文字列を印刷しています。私はその文字列を1つの長い定数にあらかじめ生成しています。

プログラムは、write（または不完全な書き込みを処理するための短いループ）への単一の呼び出しになります。

出典

2016-12-07 22:33:51 melpomene

私はmemsetの速度に近い速度でそれを生成することができると思います。これは、ディスクからあらかじめ生成された文字列を読み込むよりもはるかに高速です。私の答えを見てください。 –

@PeterCordesええ、私はプログラムがすでにメモリにロードされていると仮定しています。 – melpomene

これは悪い仮定です。特に、小さなバッファをその場で書き直すことに比べて、カーネルがpagecacheにコピーしたときにwrite（）呼び出しが既にL2キャッシュ内のデータを読み込んでいるためです。（または、パイプバッファーなどに）。 –

カウンタを出力するループを最適化する

答えて

関連する問題