Switching from std::thread to std::async gives a huge performance gain. How is that possible?

I wrote some test code to compare std::thread and std::async, and ran it on a 4-core CentOS 7 box (GCC 4.8.5):

#include <cassert>
#include <cstddef>
#include <functional>
#include <future>
#include <iostream>
#include <memory>
#include <mutex>
#include <string>
#include <thread>
#include <vector>
#include <boost/asio.hpp>
#include <boost/date_time/posix_time/posix_time.hpp>
#include <boost/filesystem.hpp>
#include <boost/filesystem/fstream.hpp>
#include <boost/lexical_cast.hpp>
#include <boost/noncopyable.hpp>

namespace fs = boost::filesystem;
namespace pt = boost::posix_time;
namespace as = boost::asio;

class Log : private boost::noncopyable
{
public:
    // Removes any existing log file and opens a fresh one.
    void LogPath(const fs::path& filePath) {
        boost::system::error_code ec;
        if (fs::exists(filePath, ec)) {
            fs::remove(filePath);
        }
        this->ofStreamPtr_.reset(new fs::ofstream(filePath));
    }

    // Writes one line; the mutex is held for the whole call.
    void WriteLog(std::size_t i) {
        assert(this->ofStreamPtr_ && *this->ofStreamPtr_);
        std::lock_guard<std::mutex> lock(this->logMutex_);
        *this->ofStreamPtr_ << "Hello, World! " << i << "\n";
    }

private:
    std::mutex logMutex_;
    std::unique_ptr<fs::ofstream> ofStreamPtr_;
};

int main(int argc, char *argv[]) {
    if (argc != 2) {
        std::cout << "Wrong argument" << std::endl;
        return 1;
    }
    std::size_t iter_count = boost::lexical_cast<std::size_t>(argv[1]);

    Log log;
    log.LogPath("log.txt");

    std::function<void(std::size_t)> func = std::bind(&Log::WriteLog, &log, std::placeholders::_1);

    // Declared once so that any of the three versions can be enabled.
    pt::time_duration duration;
    auto start_time = pt::microsec_clock::local_time();

    ////// Version 1: use std::thread //////
    // {
    //     std::vector<std::shared_ptr<std::thread> > threadList;
    //     threadList.reserve(iter_count);
    //     for (std::size_t i = 0; i < iter_count; i++) {
    //         threadList.push_back(
    //             std::make_shared<std::thread>(func, i));
    //     }
    //
    //     for (auto it : threadList) {
    //         it->join();
    //     }
    // }
    //
    // duration = pt::microsec_clock::local_time() - start_time;
    // std::cout << "Version 1: " << duration << std::endl;

    ////// Version 2: use std::async //////
    start_time = pt::microsec_clock::local_time();
    {
        for (std::size_t i = 0; i < iter_count; i++) {
            // No launch policy is given, so the default policy applies.
            auto result = std::async(func, i);
        }
    }

    duration = pt::microsec_clock::local_time() - start_time;
    std::cout << "Version 2: " << duration << std::endl;

    ////// Version 3: use boost::asio::io_service //////
    // start_time = pt::microsec_clock::local_time();
    // {
    //     as::io_service ioService;
    //     as::io_service::strand strand{ioService};
    //     {
    //         for (std::size_t i = 0; i < iter_count; i++) {
    //             strand.post(std::bind(func, i));
    //         }
    //     }
    //     ioService.run();
    // }
    //
    // duration = pt::microsec_clock::local_time() - start_time;
    // std::cout << "Version 3: " << duration << std::endl;

    return 0;
}

Version 1 (using std::thread) is about 100 times slower than the other implementations:

 
Iterations  Version 1  Version 2  Version 3
100         0.0034s    0.000051s  0.000066s
1000        0.038s     0.00029s   0.00058s
10000       0.41s      0.0042s    0.0059s
100000      throws     0.026s     0.061s

Why is the threaded version so slow? I assumed each thread would not take long to complete the Log::WriteLog function.

Source

2016-03-21 Byoungchan Lee

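In my opinion, it is slow because you are launching too many threads (more than there are CPU cores), so all of them compete for CPU time and cause context switching. With async, the runtime manages the work and runs it with just enough threads, saving processor time where it can. – Saleem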

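Creating threads is very expensive. Running more threads than you have cores degrades performance (ignoring threads that are blocked on locks or I/O). That is why thread pools are recommended. –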

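The fact that your code fails at 100000 iterations is hint enough. A thread is an expensive operating system object, and you pay the cost of creating and destroying each one. When the amount of work done by a thread is this small, you will definitely see that overhead. An implementation of std::async can use a thread pool to amortize the cost. A rough guideline is that a thread should run for at least 100 microseconds, and an async function should take no more than a second. –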
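To illustrate what the comments above mean by a thread pool amortizing the creation cost, here is a minimal sketch. SimplePool and CountWithPool are hypothetical names made up for this example, not something from the question or from any library, and this is not a claim about how any particular std::async implementation works. A fixed number of workers is created once, and every task is pushed onto a shared queue instead of getting its own thread.

```cpp
#include <atomic>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical minimal pool: thread creation is paid once per worker,
// not once per task.
class SimplePool {
public:
    explicit SimplePool(std::size_t workers) : stop_(false) {
        for (std::size_t i = 0; i < workers; ++i) {
            threads_.emplace_back([this] { Run(); });
        }
    }

    // Lets the workers drain the queue, then joins them.
    ~SimplePool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stop_ = true;
        }
        cv_.notify_all();
        for (std::thread& t : threads_) {
            t.join();
        }
    }

    void Post(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
        }
        cv_.notify_one();
    }

private:
    void Run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return stop_ || !tasks_.empty(); });
                if (tasks_.empty()) {
                    return;  // only reached once stop_ is set and the queue is empty
                }
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // run outside the lock so other workers can dequeue
        }
    }

    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> tasks_;
    std::vector<std::thread> threads_;
    bool stop_;
};

// Runs `tasks` tiny jobs on `workers` threads and returns how many completed.
std::size_t CountWithPool(std::size_t workers, std::size_t tasks) {
    std::atomic<std::size_t> counter(0);
    {
        SimplePool pool(workers);
        for (std::size_t i = 0; i < tasks; ++i) {
            pool.Post([&counter] { ++counter; });
        }
    }  // the pool destructor waits for every queued task
    return counter.load();
}
```

Calling CountWithPool(4, 100000) runs 100000 tiny tasks on only four threads, which is the same iteration count at which the thread-per-task version in the question throws.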

Your function may never be called. Since you do not pass a std::launch policy in Version 2, you are relying on the default behavior of std::async (emphasis mine):

Behaves the same as async(std::launch::async | std::launch::deferred, f, args...). In other words, f may be executed in another thread or it may be run synchronously when the resulting std::future is queried for a value.
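The difference between the two policies can be checked with a small sketch. The function names below are made up for illustration. A deferred task whose future is destroyed without being queried never runs, while a future obtained with std::launch::async blocks in its destructor until the task has finished.

```cpp
#include <atomic>
#include <future>

// Deferred task, future dropped without wait()/get(): the task is never run.
int RunsWithoutWait() {
    int runs = 0;
    {
        auto f = std::async(std::launch::deferred, [&runs] { ++runs; });
    }
    return runs;
}

// Deferred task, future queried: the task runs on the calling thread.
int RunsWithWait() {
    int runs = 0;
    auto f = std::async(std::launch::deferred, [&runs] { ++runs; });
    f.wait();
    return runs;
}

// Async policy: the future's destructor blocks until the task has completed.
int RunsWithAsyncPolicy() {
    std::atomic<int> runs(0);
    {
        auto f = std::async(std::launch::async, [&runs] { ++runs; });
    }
    return runs.load();
}
```

If the implementation picks the deferred policy by default, the Version 2 loop in the question behaves like RunsWithoutWait: it only measures the cost of creating and discarding futures.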

Try re-running your benchmark with this minor change:

auto result = std::async(std::launch::async, func, i);

Alternatively, you could call result.wait() on each std::future in a second loop, just as you call join() on all the threads. This forces evaluation of the std::future.
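A rough sketch of the second-loop variant, using std::chrono instead of Boost for timing. RunResult and RunAndWait are hypothetical names, and the task is a trivial counter increment standing in for Log::WriteLog. Keeping the futures in a vector also matters with std::launch::async, because a future destroyed at the end of each loop iteration would block there until its task finished.

```cpp
#include <atomic>
#include <chrono>
#include <cstddef>
#include <future>
#include <vector>

struct RunResult {
    std::size_t completed;              // tasks that actually ran
    std::chrono::microseconds elapsed;  // wall time for launch plus wait
};

// Launches `count` trivial tasks with an explicit policy, keeps every
// future, and waits on all of them in a second loop.
RunResult RunAndWait(std::size_t count) {
    std::atomic<std::size_t> completed(0);
    std::vector<std::future<void>> futures;
    futures.reserve(count);

    auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < count; ++i) {
        futures.push_back(
            std::async(std::launch::async, [&completed] { ++completed; }));
    }
    for (std::future<void>& f : futures) {
        f.wait();  // plays the same role as join() in Version 1
    }
    auto stop = std::chrono::steady_clock::now();

    return RunResult{
        completed.load(),
        std::chrono::duration_cast<std::chrono::microseconds>(stop - start)};
}
```

Checking that completed equals count is a cheap way to confirm that the timed region really executed the work.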

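Note that this benchmark has a major, unrelated problem. func acquires a lock immediately and holds it for the entire duration of the call, which makes parallel execution impossible. There is no advantage to using threads here. I suspect the threaded versions are significantly slower than a serial implementation, because of the thread creation and locking overhead.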
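As a sketch of the serial baseline this refers to, the two functions below produce the same lines with and without threads. They write to a std::ostringstream rather than a file so the result is easy to inspect. WriteSerial, WriteLocked and CountLines are names invented for this example. In WriteLocked every write holds the mutex, so the workers take turns just as the calls to Log::WriteLog do.

```cpp
#include <algorithm>
#include <cstddef>
#include <mutex>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

// Serial baseline: one thread, no lock.
std::string WriteSerial(std::size_t count) {
    std::ostringstream out;
    for (std::size_t i = 0; i < count; ++i) {
        out << "Hello, World! " << i << "\n";
    }
    return out.str();
}

// Threaded variant: each write is done under the lock, so the writes are
// serialized and only the line order can differ from the baseline.
std::string WriteLocked(std::size_t count, std::size_t workers) {
    std::ostringstream out;
    std::mutex m;
    std::vector<std::thread> threads;
    for (std::size_t w = 0; w < workers; ++w) {
        threads.emplace_back([&, w] {
            // Worker w handles indices w, w + workers, w + 2 * workers, ...
            for (std::size_t i = w; i < count; i += workers) {
                std::lock_guard<std::mutex> lock(m);
                out << "Hello, World! " << i << "\n";
            }
        });
    }
    for (std::thread& t : threads) {
        t.join();
    }
    return out.str();
}

std::size_t CountLines(const std::string& text) {
    return static_cast<std::size_t>(
        std::count(text.begin(), text.end(), '\n'));
}
```

Timing WriteSerial against WriteLocked for the same count gives the sanity check: if the threaded variant is not faster, the threads are only adding overhead.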

Source

2016-03-21 04:30:36

Yes, you are right. Without a launch policy, that code is never executed... After changing the code to use std::launch::async, I get the same performance as Version 1 (0.29 s for 10000 iterations). Note: removing the std::lock_guard improves performance. –

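I would suggest benchmarking the parallel code against a serial implementation as a sanity check. std::fstream is not thread-safe, so you should not remove the std::lock_guard. Also, if this answered your original question, please consider accepting the answer. –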
