なぜこの再帰的なC++関数は、このような悪いキャッシュ動作をしますか？

Tをすべての内部ノードにちょうど2つの子があるようなルートのある2分木にします。ツリーのノードは配列に格納されます。プリオーダーのレイアウトに従って、TreeArrayとします。なぜこの再帰的なC++関数は、このような悪いキャッシュ動作をしますか？

これは我々が持っている木であれば、たとえば：

その後TreeArrayには、次のノードオブジェクトが含まれます：

7, 3, 1, 0, 2, 6, 12, 9, 8, 11, 13

このツリー内のノードは、この構造体であります種類：

struct tree_node{ 

    int id; //id of the node, randomly generated 
    int numChildren; //number of children, it is 2 but for the leafs it's 0 

    int pos; //position in TreeArray where the node is stored 
    int lpos; //position of the left child 
    int rpos; //position of the right child 

    tree_node(){ 
      id = -1; 
      pos = lpos = rpos = -1; 
      numChildren = 0; 
     } 

};

ここで、afツリー内のすべてのIDの合計を返すunction。非常に簡単ですが、あなたがしなければならないことは、forループを使用してTreeArrayを反復し、見つかったすべてのIDを累積することだけです。 500万葉、perf stat -B -e cache-misses,cache-references,instructions ./run_tests 111.txtでランダムツリーの

r = 0; //r is a global variable 
int main(int argc, char* argv[]){ 

    for(int i=0;i<100;i++) { 
     r = 0; 
     testCache1(0); 
    } 

    cout<<r<<endl; 
    return 0; 
}

：

void testCache1(int cur){ 

    //find the positions of the left and right children 
    int lpos = TreeArray[cur].lpos; 
    int rpos = TreeArray[cur].rpos; 

    //if there are no children we are at a leaf so update r and return 

    if(TreeArray[cur].numChildren == 0){ 
     r += TreeArray[cur].id; 
     return; 
    } 

    //otherwise we are in an internal node, so update r and recurse 
    //first to the left subtree and then to the right subtree 

    r += TreeArray[cur].id; 

    testCache1(lpos); 
    testCache1(rpos); 

}

は、私は次のような実験を持っているキャッシュの動作をテストするには：しかし、私は次の実装のキャッシュの動作を理解する上で興味を持っています

Performance counter stats for './run_tests 111.txt': 

    469,511,047  cache-misses    # 89.379 % of all cache refs  
    525,301,814  cache-references            
20,715,360,185  instructions    

    11.214075268 seconds time elapsed

最初は私がそれを含めて除外するツリーを生成する方法のためかもしれないと考えました

我々はキャッシュミスのほとんどを見ることができるように

は、この関数から来る：私は sudo perf record -e cache-misses ./run_tests 111.txtを実行したときのn私の質問は、私は次の出力を受けました。しかし、私はなぜこれが当てはまるのか理解できません。 curの値はシーケンシャルになり、私は何が起こっているかの私の理解に多くの疑問を追加するなど TreeArrayの最初のアクセス位置 0、そして位置 1、 2、 3

ますが、私が見つけた以下の機能を持っています同じ合計：

void testCache4(int index){ 

    if(index == TreeArray.size) return; 

    r += TreeArray[index].id; 

    testCache4(index+1); 

}

testCache4は同じようにTreeArrayの要素にアクセスしますが、キャッシュの動作はとても良いです。 perf stat -B -e cache-misses,cache-references,instructions ./run_tests 11.txtから

出力：出力の

Performance counter stats for './run_tests 111.txt': 

    396,941,872  cache-misses    # 54.293 % of all cache refs  
    731,109,661  cache-references            
11,547,097,924  instructions    

    4.306576556 seconds time elapsed

機能もありませんsudo perf record -e cache-misses ./run_tests 111.txtから：

私は長い記事について謝罪が、私は完全に失われた感じ。前もって感謝します。

編集：ここでは

が一緒に必要とされるパーサとすべてに、全体のテストファイルです。ツリーは引数として与えられたテキストファイルの中で利用可能であると仮定されています。 g++ -O3 -std=c++11 file.cppと入力してコンパイルし、./executable tree.txtと入力して実行します。私が使用しているツリーはhereです（開かないで、save usをクリックしてください）。

#include <iostream> 
#include <fstream> 
#define BILLION 1000000000LL 

using namespace std; 

/* 
* 
* Timing functions 
* 
*/ 

timespec startT, endT; 

void startTimer(){ 
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &startT); 
} 

double endTimer(){ 
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &endT); 
    return endT.tv_sec * BILLION + endT.tv_nsec - (startT.tv_sec * BILLION + startT.tv_nsec); 
} 

/* 
* 
* tree node 
* 
*/ 

//this struct is used for creating the first tree after reading it from the external file, for this we need left and child pointers 

struct tree_node_temp{ 

    int id; //id of the node, randomly generated 
    int numChildren; //number of children, it is 2 but for the leafs it's 0 
    int size; //size of the subtree rooted at the current node 
    tree_node_temp *leftChild; 
    tree_node_temp *rightChild; 

    tree_node_temp(){ 
     id = -1; 
     size = 1; 
     leftChild = nullptr; 
     rightChild = nullptr; 
     numChildren = 0; 
    } 

}; 

struct tree_node{ 

    int id; //id of the node, randomly generated 
    int numChildren; //number of children, it is 2 but for the leafs it's 0 
    int size; //size of the subtree rooted at the current node 

    int pos; //position in TreeArray where the node is stored 
    int lpos; //position of the left child 
    int rpos; //position of the right child 

    tree_node(){ 
     id = -1; 
     pos = lpos = rpos = -1; 
     numChildren = 0; 
    } 

}; 

/* 
* 
* Tree parser. The input is a file containing the tree in the newick format. 
* 
*/ 

string treeNewickStr; //string storing the newick format of a tree that we read from a file 
int treeCurSTRindex; //index to the current position we are in while reading the newick string 
int treeNumLeafs; //number of leafs in current tree 
tree_node ** treeArrayReferences; //stack of references to free node objects 
tree_node *treeArray; //array of node objects 
int treeStackReferencesTop; //the top index to the references stack 
int curpos; //used to find pos,lpos and rpos when creating the pre order layout tree 


//helper function for readNewick 
tree_node_temp* readNewickHelper() { 

    int i; 
    if(treeCurSTRindex == treeNewickStr.size()) 
     return nullptr; 

    tree_node_temp * leftChild; 
    tree_node_temp * rightChild; 

    if(treeNewickStr[treeCurSTRindex] == '('){ 
     //create a left child 
     treeCurSTRindex++; 
     leftChild = readNewickHelper(); 
    } 

    if(treeNewickStr[treeCurSTRindex] == ','){ 
     //create a right child 
     treeCurSTRindex++; 
     rightChild = readNewickHelper(); 
    } 

    if(treeNewickStr[treeCurSTRindex] == ')' || treeNewickStr[treeCurSTRindex] == ';'){ 
     treeCurSTRindex++; 
     tree_node_temp * cur = new tree_node_temp(); 
     cur->numChildren = 2; 
     cur->leftChild = leftChild; 
     cur->rightChild = rightChild; 
     cur->size = 1 + leftChild->size + rightChild->size; 
     return cur; 
    } 

    //we are about to read a label, keep reading until we read a "," ")" or "(" (we assume that the newick string has the right format) 
    i = 0; 
    char treeLabel[20]; //buffer used for the label 
    while(treeNewickStr[treeCurSTRindex]!=',' && treeNewickStr[treeCurSTRindex]!='(' && treeNewickStr[treeCurSTRindex]!=')'){ 
     treeLabel[i] = treeNewickStr[treeCurSTRindex]; 
     treeCurSTRindex++; 
     i++; 
    } 

    treeLabel[i] = '\0'; 
    tree_node_temp * cur = new tree_node_temp(); 
    cur->numChildren = 0; 
    cur->id = atoi(treeLabel)-1; 
    treeNumLeafs++; 

    return cur; 
} 

//create the pre order tree, curRoot in the first call points to the root of the first tree that was given to us by the parser 
void treeInit(tree_node_temp * curRoot){ 

    tree_node * curFinalRoot = treeArrayReferences[curpos]; 

    curFinalRoot->pos = curpos; 

    if(curRoot->numChildren == 0) { 
     curFinalRoot->id = curRoot->id; 
     return; 
    } 

    //add left child 
    tree_node * cnode = treeArrayReferences[treeStackReferencesTop]; 
    curFinalRoot->lpos = curpos + 1; 
    curpos = curpos + 1; 
    treeStackReferencesTop++; 
    cnode->id = curRoot->leftChild->id; 
    treeInit(curRoot->leftChild); 

    //add right child 
    curFinalRoot->rpos = curpos + 1; 
    curpos = curpos + 1; 
    cnode = treeArrayReferences[treeStackReferencesTop]; 
    treeStackReferencesTop++; 
    cnode->id = curRoot->rightChild->id; 
    treeInit(curRoot->rightChild); 

    curFinalRoot->id = curRoot->id; 
    curFinalRoot->numChildren = 2; 
    curFinalRoot->size = curRoot->size; 

} 

//the ids of the leafs are deteremined by the newick file, for the internal nodes we just incrementally give the id determined by the dfs traversal 
void updateInternalNodeIDs(int cur){ 

    tree_node* curNode = treeArrayReferences[cur]; 

    if(curNode->numChildren == 0){ 
     return; 
    } 
    curNode->id = treeNumLeafs++; 
    updateInternalNodeIDs(curNode->lpos); 
    updateInternalNodeIDs(curNode->rpos); 

} 

//frees the memory of the first tree generated by the parser 
void treeFreeMemory(tree_node_temp* cur){ 

    if(cur->numChildren == 0){ 
     delete cur; 
     return; 
    } 
    treeFreeMemory(cur->leftChild); 
    treeFreeMemory(cur->rightChild); 

    delete cur; 

} 

//reads the tree stored in "file" under the newick format and creates it in the main memory. The output (what the function returns) is a pointer to the root of the tree. 
//this tree is scattered anywhere in the memory. 

tree_node* readNewick(string& file){ 

    treeCurSTRindex = -1; 
    treeNewickStr = ""; 
    treeNumLeafs = 0; 

    ifstream treeFin; 

    treeFin.open(file, ios_base::in); 
    //read the newick format of the tree and store it in a string 
    treeFin>>treeNewickStr; 
    //initialize index for reading the string 
    treeCurSTRindex = 0; 
    //create the tree in main memory 
    tree_node_temp* root = readNewickHelper(); 

    //store the tree in an array following the pre order layout 
    treeArray = new tree_node[root->size]; 
    treeArrayReferences = new tree_node*[root->size]; 
    int i; 
    for(i=0;i<root->size;i++) 
     treeArrayReferences[i] = &treeArray[i]; 
    treeStackReferencesTop = 0; 

    tree_node* finalRoot = treeArrayReferences[treeStackReferencesTop]; 
    curpos = treeStackReferencesTop; 
    treeStackReferencesTop++; 
    finalRoot->id = root->id; 
    treeInit(root); 

    //update the internal node ids (the leaf ids are defined by the ids stored in the newick string) 
    updateInternalNodeIDs(0); 
    //close the file 
    treeFin.close(); 

    //free the memory of initial tree 
    treeFreeMemory(root); 
    //return the pre order tree 
    return finalRoot; 

} 

/* 
* experiments 
* 
*/ 

int r; 
tree_node* T; 

void testCache1(int cur){ 

    int lpos = treeArray[cur].lpos; 
    int rpos = treeArray[cur].rpos; 

    if(treeArray[cur].numChildren == 0){ 
     r += treeArray[cur].id; 
     return; 
    } 

    r += treeArray[cur].id; 

    testCache1(lpos); 
    testCache1(rpos); 

} 


void testCache4(int index){ 

    if(index == T->size) return; 

    r += treeArray[index].id; 

    testCache4(index+1); 

} 


int main(int argc, char* argv[]){ 

    string Tnewick = argv[1]; 
    T = readNewick(Tnewick); 
    double tt; 

    startTimer(); 
    for(int i=0;i<100;i++) { 
     r = 0; 
     testCache4(0); 
    } 
    tt = endTimer(); 
    cout<<r<<endl; 
    cout<<tt/BILLION<<endl; 

    startTimer(); 
    for(int i=0;i<100;i++) { 
     r = 0; 
     testCache1(0); 
    } 
    tt = endTimer(); 
    cout<<r<<endl; 
    cout<<tt/BILLION<<endl; 

    delete[] treeArray; 
    delete[] treeArrayReferences; 

    return 0; 
}

EDIT2：

私はvalgrindのといくつかのプロファイリングのテストを実行します。命令は実際にはオーバーヘッドかもしれませんが、私は理由を理解していません。例えば、上記のperfでの実験でさえ、あるバージョンは約200億命令と110億命令を提供します。それは90億の違いです。

ので、関数呼び出しがtestCache1に高価であり、testCache4に何の費用がかかりません：？-O3で

は、私は次の取得が有効両方の場合の関数呼び出しの量は同じでなければなりません...

出典

2016-09-07 jsguy

あなたは完全なテストプログラムを投稿できますか？ –

残念ながら、それは木の生成のために大きく、より大きなシステムの一部です。私はそれを扱いやすくするために多くの不必要な機能を整理する必要があります。しかし、少なくとも実験ファイルはそれほど大きくはありません。 http://pastebin.com/HwJYyuhQ 再帰に関するキャッシュの動作についての私の理解が間違っているかどうかは疑問でした。私の関数の多くは、再帰を使ってコード化する方が簡単です。私はこのプリオーダーレイアウトを使用して、コードがキャッシュフレンドリーであることを確認していますが、ツリー内のすべてのIDを合計する単純なものであっても、それを得ることはできません。 – jsguy

ノードのサイズは20なので、3つのノードがおおよそ1つの行を占有します。キャッシュサイズは、いくつかのi7でキャッシュサイズがL2 8MBであるとします。 50億個のノードがあるため、アレイ全体のサイズはL2キャッシュの12715倍です。なぜあなたは少数のミスを期待していますか？ – Serge

実際には、どのようなキャッシュ参照が実際にカウントされるのかという誤解があります。

で説明されているように、インテルCPUのキャッシュ参照は、実際には最終レベルキャッシュへの参照の数です。したがって、L1キャッシュによってサービスされたメモリ参照はカウントされません。 Intel 64 and IA-32 Architectures Developer's Manualには、L1プリフェッチャからのロードがカウントされると記載されています。

実際にキャッシュミスの絶対数を比較すると、両方の機能がほぼ同じであることがわかります。私は、テストのために完全にバランスの取れた木を使用したキャッシュラインにうまくフィット16のサイズを取得するためにposを削除し、次の番号だ：

testCache4：

843.628.131  L1-dcache-loads            (56,83%) 
193.006.858  L1-dcache-load-misses  # 22,73% of all L1-dcache hits (57,31%) 
326.698.621  cache-references            (57,07%) 
188.435.203  cache-misses    # 57,679 % of all cache refs  (56,76%)

testCache1：

3.519.968.253 L1-dcache-loads            (57,17%) 
193.664.806  L1-dcache-load-misses  # 5,50% of all L1-dcache hits (57,24%) 
256.638.490  cache-references            (57,12%) 
188.007.927  cache-misses    # 73,258 % of all cache refs  (57,23%)

を

すべてのハードウェアプリフェッチャを手動で無効にした場合：

testCache4：

846.124.474  L1-dcache-loads            (57,22%) 
192.495.450  L1-dcache-load-misses  # 22,75% of all L1-dcache hits (57,31%) 
193.699.811  cache-references            (57,03%) 
185.445.753  cache-misses    # 95,739 % of all cache refs  (57,17%)

：

3.534.308.118 L1-dcache-loads            (57,16%) 
193.595.962  L1-dcache-load-misses  # 5,48% of all L1-dcache hits (57,18%) 
193.639.498  cache-references            (57,12%) 
185.120.733  cache-misses    # 95,601 % of all cache refs  (57,15%)

あなたは違いが今なくなっている見ることができるように。プリフェッチャと実際の参照が2回カウントされるため、追加のキャッシュ参照イベントが発生しただけです。あなたはすべてのメモリ参照をカウントした場合、各tree_nodeは4回の代わりのものが、同じキャッシュライン上のtree_node嘘の各データメンバを参照し、その一つだけがありますされているので

は実際には、低い総キャッシュ・ミス率を持っていtestCache1 4回のミスのうち

testCache4の場合、sizeof(tree_node) == 16とキャッシュラインが64バイトの場合、L1dロードミス率は実際には25％に近いことがわかります。

コンパイラ（-O2を指定したgcc以上）は、testCache4の再帰を排除する両方の関数にテール再帰最適化を適用し、testCache1を一方向再帰的にします。したがって、testCache1には、testCache4にはないスタックフレームへの多くの追加のキャッシュ参照があります。

valgrindを使用することによってもプリフェッチャなしで結果を得ることができます。これはおそらく出力の信頼性も高まります。それはCPUキャッシュのすべての特性をシミュレートしません。編集内容について

：私は、GCC aplies末尾再帰の最適化をmetionedとして、そうtestCache1でtestCache4で、コースの再帰と追加メモリ負荷の左に通話がないがtestCache4に残っ単純なロード/追加ループに比べてオーバーヘッドの重要な命令を持っています。

出典

2016-09-08 17:02:07 user4407569

答えをありがとう、すべてがとてもはっきりしています。私が最適化フラグを無効にして実験を実行すると、 'testCache4'の関数呼び出しは実際に高価になり始めました。一見シンプルな機能をストレステストするときは、重要な役割を果たすことができるので、これらの低レベルのトピックについて詳しく読む必要があります。再度、感謝します！ – jsguy

なぜこの再帰的なC++関数は、このような悪いキャッシュ動作をしますか？

答えて

関連する問題