ジュリア語。ベクトル化された操作をどのように克服するか？

Julia for-loopsはベクトル化された操作と同じくらい速く、さらに高速に使用されると言われています（正しく使用されている場合）。私は2つのコードを持っています。この考え方は、与えられた0-1シーケンスのサンプル統計を見つけることです。これはxです（これらの2つの例では、私は合計を見つけようとしていますが、もっと複雑な例があります。私のコードでの落とし穴の）。最初のようになります。ジュリア語。ベクトル化された操作をどのように克服するか？

S = 2 * sum(x) - n 
s_obs_new = abs(S)/sqrt(2 * n) 
pval = erfc(s_obs_new)

と第二は、「ナイーブ」と古典ものです：私は、最初の例の実行中の時間は約11.8ミリ秒であることがわかってきました@benchmarkを使用して

S = 0 
for i in eachindex(x) 
    S += x[i] 
end 
S = 2 * S - n 
s_obs_new = abs(S)/sqrt(2 * n) 
pval = erfc(s_obs_new)

2回目は38msです。

この例は私にとって非常に重要です。なぜなら、ベクトル化が不可能な場所がたくさんあるからです。私は、ベクトル化のように速くベクトル化された「方法」で計算を行いたいと思います。

ベクトル化されたコードがベクター化されたコードよりも4倍遅くなる可能性があるという考えはありますか？最初の関数のコード型安定性は、OKである不要な大きなメモリ割り当てが存在しない等

ある：

function frequency_monobit_test1(x :: Array{Int8, 1}, n = 0) 
# count 1 and 0 in sequence, we want the number 
# of 1's and 0's to be approximately the same 
# reccomendation n >= 100 
# decision Rule(at 1% level): if pval < 0.01 -> non-random 
if (n == 0) 
    n = length(x) 
end 
S = 2 * sum(x) - n 
s_obs_new = abs(S)/sqrt(2 * n) 
pval = erfc(s_obs_new) 
return pval

秒である：

function frequency_monobit_test2(x :: Array{Int8, 1}, n = 0) 
# count 1 and 0 in sequence, we want the number 
# of 1's and 0's to be approximately the same 
# reccomendation n >= 100 
# decision Rule(at 1% level): if pval < 0.01 -> non-random 
if (n == 0) 
    n = length(x) 
end 
S = 0 
@inbounds for i in eachindex(x) 
    S += x[i] 
end 
S = 2 * S - n 
s_obs_new = abs(S)/sqrt(2 * n) 
pval = erfc(s_obs_new) 
return pval

出典

2017-06-28 Kirill Tsar.

まず、公式の[パフォーマンスのヒント]（https://docs.julialang.org/en/stable/manual/performance-tips/は）しようとする –

別のものを読むことが重要です '@inboundsを置くことです'for'ステートメントの前の@ simd' –

ベンチマークを取得するために実行する実際のコードを表示してください。 –

これは好奇心であります場合。変数Int64にInt8を蓄積すると、パフォーマンス上の問題があるようです。

のは、これらの機能を試してみましょう：

using SpecialFunctions, BenchmarkTools 

function frequency_monobit_test1(x, n=length(x)) 
    S = sum(x) 
    return erfc(abs(2S - n)/sqrt(2n)) 
end 

function frequency_monobit_test3(typ::Type{<:Integer}, x, n=length(x)) 
    S = zero(typ) 
    for i in eachindex(x) 
     @inbounds S += x[i] 
    end 
    return erfc(abs(2S - n)/sqrt(2n)) 
end

いくつかのベクトル

N = 2^25; 
x64 = rand(0:1, N); 
x8 = rand(Int8[0, 1], N); 
xB = rand(Bool, N); 
xb = bitrand(N);

ベンチマークの初期化：Int64入力について

を：

julia> @btime frequency_monobit_test1($x64) 
    17.540 ms (0 allocations: 0 bytes) 
0.10302739916042186 

julia> @btime frequency_monobit_test3(Int64, $x64) 
    17.796 ms (0 allocations: 0 bytes) 
0.10302739916042186 

julia> @btime frequency_monobit_test3(Int32, $x64) 
    892.715 ms (67106751 allocations: 1023.97 MiB) 
0.10302739916042186

我々は0ことを確認と明示的なループは等しく高速で、Int32で初期化するのは悪い考えです。 Int32入力について

：

julia> @btime frequency_monobit_test1($x32) 
    9.137 ms (0 allocations: 0 bytes) 
0.2386386867682374 

julia> @btime frequency_monobit_test3(Int64, $x32) 
    8.839 ms (0 allocations: 0 bytes) 
0.2386386867682374 

julia> @btime frequency_monobit_test3(Int32, $x32) 
    7.274 ms (0 allocations: 0 bytes) 
0.2386386867682374

sumとループは、速度に類似しています。 Int32に蓄積すると、少し時間が節約されます。

Int8入力：

julia> @btime frequency_monobit_test1($x8) 
    5.681 ms (0 allocations: 0 bytes) 
0.16482999123032094 

julia> @btime frequency_monobit_test3(Int64, $x8) 
    19.517 ms (0 allocations: 0 bytes) 
0.16482999123032094 

julia> @btime frequency_monobit_test3(Int32, $x8) 
    4.815 ms (0 allocations: 0 bytes) 
0.16482999123032094

明示的なループInt32に蓄積わずかに速い場合は、しかし、聖なる牛！ Int64で何が起こっているのですか？それは超スローです！

Boolについて

julia> @btime frequency_monobit_test1($xB) 
    9.627 ms (0 allocations: 0 bytes) 
0.7728544347518309 

julia> @btime frequency_monobit_test3(Int64, $xB) 
    9.629 ms (0 allocations: 0 bytes) 
0.7728544347518309 

julia> @btime frequency_monobit_test3(Int32, $xB) 
    4.815 ms (0 allocations: 0 bytes) 
0.7728544347518309

ループとsumは同じ速度を持っていますが、Int32に蓄積すると、半分の時間を節約できます。

今、私たちはBitArrayを試してみましょう：

だから、

julia> @btime frequency_monobit_test1($xb) 
    259.044 μs (0 allocations: 0 bytes) 
0.7002576522570715 

julia> @btime frequency_monobit_test3(Int64, $xb) 
    19.423 ms (0 allocations: 0 bytes) 
0.7002576522570715 

julia> @btime frequency_monobit_test3(Int32, $xb) 
    19.430 ms (0 allocations: 0 bytes) 
0.7002576522570715

、BitArray上sumあなたはチャンクの追加を行うことができるので、高速でクレイジーですが、ループ内の単一の要素を抽出することは、いくつかのオーバーヘッドがあります。

結論：

あなたは非常に特殊なケースですBitArray Sを除いて、sumよりもループと同等以上の性能を得ることができます。
あなたの配列の長さを知っていて、Int32があなたの合計を保持するのに十分であることを知っていれば、それは時間を節約します。
Int64にInt8を蓄積すると、非常に奇妙なことが起こります。なぜパフォーマンスが悪いのか分かりません。
あなたは0と1にのみ関心がある場合は、Bool秒の、ないInt8の配列を使用し、そしておそらくInt32に蓄積します。
BitArrayは、場合によっては超高速にすることができます。
sumは、Int8のベクトル上で、sum(::Vector{Bool})の2倍の速さである。

出典

2017-07-02 12:46:04 DNF

ジュリア語。ベクトル化された操作をどのように克服するか？

答えて

関連する問題