FacebookはどうやってAccessSpreaderがうまくいくのですか？

ここ

は、FacebookのフォリーライブラリからAccessSpreaderのためのコードです： https://github.com/facebook/folly/blob/master/folly/concurrency/CacheLocality.h#L212 FacebookはどうやってAccessSpreaderがうまくいくのですか？

/// AccessSpreader arranges access to a striped data structure in such a 
/// way that concurrently executing threads are likely to be accessing 
/// different stripes. It does NOT guarantee uncontended access. 
/// Your underlying algorithm must be thread-safe without spreading, this 
/// is merely an optimization. AccessSpreader::current(n) is typically 
/// much faster than a cache miss (12 nanos on my dev box, tested fast 
/// in both 2.6 and 3.2 kernels). 
/// 
/// If available (and not using the deterministic testing implementation) 
/// AccessSpreader uses the getcpu system call via VDSO and the 
/// precise locality information retrieved from sysfs by CacheLocality. 
/// This provides optimal anti-sharing at a fraction of the cost of a 
/// cache miss. 
/// 
/// When there are not as many stripes as processors, we try to optimally 
/// place the cache sharing boundaries. This means that if you have 2 
/// stripes and run on a dual-socket system, your 2 stripes will each get 
/// all of the cores from a single socket. If you have 16 stripes on a 
/// 16 core system plus hyperthreading (32 cpus), each core will get its 
/// own stripe and there will be no cache sharing at all. 
/// 
/// AccessSpreader has a fallback mechanism for when __vdso_getcpu can't be 
/// loaded, or for use during deterministic testing. Using sched_getcpu 
/// or the getcpu syscall would negate the performance advantages of 
/// access spreading, so we use a thread-local value and a shared atomic 
/// counter to spread access out. On systems lacking both a fast getcpu() 
/// and TLS, we hash the thread id to spread accesses. 
/// 
/// AccessSpreader is templated on the template type that is used 
/// to implement atomics, as a way to instantiate the underlying 
/// heuristics differently for production use and deterministic unit 
/// testing. See DeterministicScheduler for more. If you aren't using 
/// DeterministicScheduler, you can just use the default template parameter 
/// all of the time. 
template <template <typename> class Atom = std::atomic> 
struct AccessSpreader { 
    /// Returns the stripe associated with the current CPU. The returned 
    /// value will be < numStripes. 
    static size_t current(size_t numStripes) { 
    // widthAndCpuToStripe[0] will actually work okay (all zeros), but 
    // something's wrong with the caller 
    assert(numStripes > 0); 

    unsigned cpu; 
    getcpuFunc(&cpu, nullptr, nullptr); 
    return widthAndCpuToStripe[std::min(size_t(kMaxCpus), numStripes)] 
           [cpu % kMaxCpus]; 
    } 

private: 
    /// If there are more cpus than this nothing will crash, but there 
    /// might be unnecessary sharing 
    enum { kMaxCpus = 128 }; 

    typedef uint8_t CompactStripe; 

    static_assert(
     (kMaxCpus & (kMaxCpus - 1)) == 0, 
     "kMaxCpus should be a power of two so modulo is fast"); 
    static_assert(
     kMaxCpus - 1 <= std::numeric_limits<CompactStripe>::max(), 
     "stripeByCpu element type isn't wide enough"); 

    /// Points to the getcpu-like function we are using to obtain the 
    /// current cpu. It should not be assumed that the returned cpu value 
    /// is in range. We use a static for this so that we can prearrange a 
    /// valid value in the pre-constructed state and avoid the need for a 
    /// conditional on every subsequent invocation (not normally a big win, 
    /// but 20% on some inner loops here). 
    static Getcpu::Func getcpuFunc; 

    /// For each level of splitting up to kMaxCpus, maps the cpu (mod 
    /// kMaxCpus) to the stripe. Rather than performing any inequalities 
    /// or modulo on the actual number of cpus, we just fill in the entire 
    /// array. 
    static CompactStripe widthAndCpuToStripe[kMaxCpus + 1][kMaxCpus]; 

    static bool initialized; 

    /// Returns the best getcpu implementation for Atom 
    static Getcpu::Func pickGetcpuFunc() { 
    auto best = Getcpu::resolveVdsoFunc(); 
    return best ? best : &FallbackGetcpuType::getcpu; 
    } 

    /// Always claims to be on CPU zero, node zero 
    static int degenerateGetcpu(unsigned* cpu, unsigned* node, void*) { 
    if (cpu != nullptr) { 
     *cpu = 0; 
    } 
    if (node != nullptr) { 
     *node = 0; 
    } 
    return 0; 
    } 

    // The function to call for fast lookup of getcpu is a singleton, as 
    // is the precomputed table of locality information. AccessSpreader 
    // is used in very tight loops, however (we're trying to race an L1 
    // cache miss!), so the normal singleton mechanisms are noticeably 
    // expensive. Even a not-taken branch guarding access to getcpuFunc 
    // slows AccessSpreader::current from 12 nanos to 14. As a result, we 
    // populate the static members with simple (but valid) values that can 
    // be filled in by the linker, and then follow up with a normal static 
    // initializer call that puts in the proper version. This means that 
    // when there are initialization order issues we will just observe a 
    // zero stripe. Once a sanitizer gets smart enough to detect this as 
    // a race or undefined behavior, we can annotate it. 

    static bool initialize() { 
    getcpuFunc = pickGetcpuFunc(); 

    auto& cacheLocality = CacheLocality::system<Atom>(); 
    auto n = cacheLocality.numCpus; 
    for (size_t width = 0; width <= kMaxCpus; ++width) { 
     auto numStripes = std::max(size_t{1}, width); 
     for (size_t cpu = 0; cpu < kMaxCpus && cpu < n; ++cpu) { 
     auto index = cacheLocality.localityIndexByCpu[cpu]; 
     assert(index < n); 
     // as index goes from 0..n, post-transform value goes from 
     // 0..numStripes 
     widthAndCpuToStripe[width][cpu] = 
      CompactStripe((index * numStripes)/n); 
     assert(widthAndCpuToStripe[width][cpu] < numStripes); 
     } 
     for (size_t cpu = n; cpu < kMaxCpus; ++cpu) { 
     widthAndCpuToStripe[width][cpu] = widthAndCpuToStripe[width][cpu - n]; 
     } 
    } 
    return true; 
    } 
}; 

template <template <typename> class Atom> 
Getcpu::Func AccessSpreader<Atom>::getcpuFunc = 
    AccessSpreader<Atom>::degenerateGetcpu; 

template <template <typename> class Atom> 
typename AccessSpreader<Atom>::CompactStripe 
    AccessSpreader<Atom>::widthAndCpuToStripe[kMaxCpus + 1][kMaxCpus] = {}; 

template <template <typename> class Atom> 
bool AccessSpreader<Atom>::initialized = AccessSpreader<Atom>::initialize(); 

// Suppress this instantiation in other translation units. It is 
// instantiated in CacheLocality.cpp 
extern template struct AccessSpreader<std::atomic>;

限り、私は原子クラス内でいくつかのデータをラップすることになっています理解し、それが複数のスレッドがアクセスしていますとき、それは偽のキャッシュを減らす必要があるとして、共有？フォリーと一緒に仕事をしてくれた人が、どのように働くのか少し詳しく説明できますか？私はしばらくそれを見てきましたが、原子変数のメンバーをどこに配置するのかわかりません。

出典

2017-10-29 shortage_radeon

いいえ、このクラスはあなたの考えをしません。

全体的な考え方は、同等のリソース/データ構造を持ち、競合を最小限に抑えデータのローカリティを最大化するために異なるスレッドに異なるスレッドを割り当てたい場合は、AccessSpreaderを使用して、現在のコア/スレッド。例を挙げると、例えば、以下を参照されたい。 https://github.com/facebook/folly/blob/master/folly/IndexedMemPool.h。このメモリプールの実装は、割り当て/解放のスレッド競合を減らすために、いくつかのフリーオブジェクトリストを保持します。そしてここでAccessSpreaderが使用されている方法である。

AtomicStruct<TaggedPtr,Atom>& localHead() { 
    auto stripe = AccessSpreader<Atom>::current(NumLocalLists); 
    return local_[stripe].head; 
}

すなわち、現在のスレッドによる使用のために推奨される（いくつかの配列またはベクター等で）要素のインデックスを与えます。

：異なるインデックスに異なるインデックスを割り当てることは必ずしも可能ではありません（例：可能なインデックス（ストライプ）の数がCPUの数より少ない場合コメントは明示的に "それは不測のアクセスを保証するものではない"と述べている。このクラスは、競合を最小限に抑えるだけでなく、データの局所性を最大化するためにも使用できます。たとえば、共通のキャッシュを持つスレッド間でデータのインスタンスを共有したい場合があります。推奨インデックスは、現在のCPU（内部ではgetCpuFunc）とストライプ数（パラメータnumStripesとして渡される）の2つの変数の関数です。これが2D配列が必要な理由です。配列は、システム固有の情報（クラスCacheLocalityを介して）を使用してプログラムの初期化時に満たされるため、推奨される索引ではデータ局所性が考慮されます。

std::atomicは、クラス宣言の直前のコメントで説明されているように、テスト用と運用用の別々のインスタンスを持つために使用されています。クラスには、任意のアトミックメンバー変数はありません（また、必要もありません）。

出典

2017-11-18 00:09:32

Ok、ちょうどもう1つ - AccessSpreader :: currentでは、widthAndCpuToStripeに1D配列を使用するだけに減らすことはできませんか、それとももっと良い方法ですか？CPU番号を返すだけであれば、異なるコアは異なるインデックスにアクセスしますか？私はwidthAndCpuToStripeを実装するために使用する公式とそれが達成しようとしているものをよく理解していません。 –

私は解答を更新しました。私は数式の詳細には入っていませんでしたが、その目的を説明しようとしました。 –

FacebookはどうやってAccessSpreaderがうまくいくのですか？

答えて

関連する問題