Avoiding thread-local storage overhead (making ffmpeg YADIF scalable)

I am trying to write a small ffmpeg "hack" that allows the yadif filter to run in parallel.

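I think I have found a solution, but only one instance can exist at a time. This is because "scalable_yadif_context" is a static local of the function "scalable_yadif_filter_line1", which replaces the original yadif "filter_line" function. I could make "scalable_yadif_context" thread-local, but this function is called so often that doing so would incur very high overhead.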

Any ideas on how to solve this?

// We need the context description in order to access the original filter_line function. Just redefine it here and hope that it is not changed inside of libavfilter. 
typedef struct { 
    int mode; 
    int parity; 
    int frame_pending; 
    int auto_enable; 
    AVFilterBufferRef *cur; 
    AVFilterBufferRef *next; 
    AVFilterBufferRef *prev; 
    AVFilterBufferRef *out; 
    void (*filter_line)(uint8_t *dst, 
         uint8_t *prev, uint8_t *cur, uint8_t *next, 
         int w, int prefs, int mrefs, int parity, int mode); 
    const AVPixFmtDescriptor *csp; 
} YADIFContext; 

struct scalable_yadif_context 
{ 
    std::vector<std::function<void()>> calls; 
    int end_prefs; 

    scalable_yadif_context() : end_prefs(std::numeric_limits<int>::max()){} 
}; 

void (*org_yadif_filter_line)(uint8_t *dst, uint8_t *prev, uint8_t *cur, uint8_t *next, int w, int prefs, int mrefs, int parity, int mode) = 0; 

void scalable_yadif_filter_line(scalable_yadif_context& ctx, uint8_t *dst, uint8_t *prev, uint8_t *cur, uint8_t *next, int w, int prefs, int mrefs, int parity, int mode) 
{ 
    if(ctx.end_prefs == std::numeric_limits<int>::max()) 
     ctx.end_prefs = -prefs; // Last call to filter_line will have negative pref 

    ctx.calls.push_back([=] 
    { 
     org_yadif_filter_line(dst, prev, cur, next, w, prefs, mrefs, parity, mode); 
    });  

    if(prefs == ctx.end_prefs) 
    {  
     // Capture ctx by reference: [=] would copy the whole call list into the functor.
     tbb::parallel_for(tbb::blocked_range<size_t>(0, ctx.calls.size()), [&ctx](const tbb::blocked_range<size_t>& r)
     { 
      for(auto n = r.begin(); n != r.end(); ++n) 
       ctx.calls[n](); 
     }); 
     ctx.calls.clear(); 
     ctx.end_prefs = std::numeric_limits<int>::max(); 
    } 
} 

void scalable_yadif_filter_line1(uint8_t *dst, uint8_t *prev, uint8_t *cur, uint8_t *next, int w, int prefs, int mrefs, int parity, int mode) 
{ 
    // local to the current function, making this thread local would incur heavy overhead. 
    static scalable_yadif_context ctx; 
    scalable_yadif_filter_line(ctx, dst, prev, cur, next, w, prefs, mrefs, parity, mode); 
} 

void make_scalable_yadif(AVFilterContext* ctx) 
{ 
    YADIFContext* yadif = (YADIFContext*)ctx->priv; 

    // Data race should not be problem since we are always writing the same value 
    org_yadif_filter_line = yadif->filter_line; 

    // hmm, will only work for one concurrent instance... 
    // I need a unique "scalable_yadif_filter_line1" for each call... 
    yadif->filter_line = scalable_yadif_filter_line1; 
}

I have come up with a very ugly solution that works for up to 18 concurrent instances.

#define RENAME(a) f ## a 

#define ff(x) \
void RENAME(x)(uint8_t *dst, uint8_t *prev, uint8_t *cur, uint8_t *next, int w, int prefs, int mrefs, int parity, int mode) \
{\
    static scalable_yadif_context ctx;\
    scalable_yadif_filter_line(ctx, dst, prev, cur, next, w, prefs, mrefs, parity, mode);\
}

ff(0); ff(1); ff(2); ff(3); ff(4); ff(5); ff(6); ff(7); ff(8); ff(9); ff(10); ff(11); ff(12); ff(13); ff(14); ff(15); ff(16); ff(17); 

void (*fs[])(uint8_t *dst, uint8_t *prev, uint8_t *cur, uint8_t *next, int w, int prefs, int mrefs, int parity, int mode) = 

{f0, f1, f2, f3, f4, f5, f6, f7, f8, f9, f10, f11, f12, f13, f14, f15, f16, f17}; 

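namespace caspar {

// Assumption: the post does not show the declaration of `tags`. Any thread-safe
// queue with push/try_pop fits the calls below, for example:
tbb::concurrent_queue<int> tags;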

void init() 
{ 
    for(int n = 0; n < 18; ++n) 
     tags.push(n); 
} 

int make_scalable_yadif(AVFilterContext* ctx) 
{ 
    static boost::once_flag flag = BOOST_ONCE_INIT; 
    boost::call_once(&init, flag); 

    YADIFContext* yadif = (YADIFContext*)ctx->priv; 
    org_yadif_filter_line = yadif->filter_line; 

    int tag; 
    if(!tags.try_pop(tag)) 
    { 
     LOG(warning) << "Not enough scalable-yadif instances. Running non-scalable"; 
     return -1; 
    } 

    yadif->filter_line = fs[tag]; 
    return tag; 
} 

void release_scalable_yadif(int tag) 
{ 
    if(tag != -1) 
     tags.push(tag); 
}

} // namespace caspar
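
The same "one function per slot" idea can also be written without the preprocessor. The following is only a sketch under assumptions: `slot_context`, `slot_ctx`, `slot_filter_line`, `slot_table`, `slot_contexts` and `slot_function` are hypothetical names, and `slot_context` merely counts calls where the real code would forward to `scalable_yadif_filter_line`. Each template instantiation gets its own function address and its own static context, which is what the `ff(n)` macro achieves.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>

// Fixed callback signature, copied from the YADIFContext definition in the question.
using filter_line_fn = void (*)(uint8_t* dst, uint8_t* prev, uint8_t* cur, uint8_t* next,
                                int w, int prefs, int mrefs, int parity, int mode);

// Hypothetical stand-in for scalable_yadif_context: it only counts calls.
struct slot_context {
    int calls = 0;
};

// One static context per template instantiation, i.e. per slot.
template <std::size_t N>
slot_context& slot_ctx() {
    static slot_context ctx;
    return ctx;
}

// One distinct function per slot, all sharing the fixed signature.
template <std::size_t N>
void slot_filter_line(uint8_t*, uint8_t*, uint8_t*, uint8_t*, int, int, int, int, int) {
    ++slot_ctx<N>().calls;  // real code would forward to scalable_yadif_filter_line here
}

constexpr std::size_t kSlots = 18;  // same limit as the macro version

template <std::size_t... I>
std::array<filter_line_fn, sizeof...(I)> make_table(std::index_sequence<I...>) {
    return {{&slot_filter_line<I>...}};
}

template <std::size_t... I>
std::array<slot_context*, sizeof...(I)> make_contexts(std::index_sequence<I...>) {
    return {{&slot_ctx<I>()...}};
}

// Table of per-slot functions and the matching per-slot contexts.
const std::array<filter_line_fn, kSlots> slot_table =
    make_table(std::make_index_sequence<kSlots>{});
const std::array<slot_context*, kSlots> slot_contexts =
    make_contexts(std::make_index_sequence<kSlots>{});

// Look up the function to install for a given slot (tag).
filter_line_fn slot_function(std::size_t slot) {
    assert(slot < kSlots);
    return slot_table[slot];
}
```

The number of slots is still fixed at compile time, so this does not remove the limit of 18 instances; it only replaces the macro and the hand-written table.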

Source

2011-07-30 ronag

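Why not just pass a per-thread buffer to the scalable_yadif_filter_line1 function? It may require some reorganization of your threads, but it is much better than using statics or thread-locals (after all, what happens to the thread-local buffer when the thread goes off to process something else?).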

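If you really must buffer inside the function (because of the fixed ffmpeg API), TLS is probably your only option. The overhead is not as bad as you might think, but it is still not great. I strongly recommend modifying ffmpeg to add a context parameter.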
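
As a rough illustration of the TLS option, here is a minimal sketch using C++11 `thread_local`. All names (`line_batch`, `current_batch`, `tls_filter_line`, `flush_batch`, `org_filter_line`, `noop_filter_line`) are hypothetical, and the flush step simply runs the queued calls in order where the question's code would hand them to `tbb::parallel_for`.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Fixed callback signature from the question.
using filter_line_fn = void (*)(uint8_t* dst, uint8_t* prev, uint8_t* cur, uint8_t* next,
                                int w, int prefs, int mrefs, int parity, int mode);

// Placeholder so the pointer below is never null.
void noop_filter_line(uint8_t*, uint8_t*, uint8_t*, uint8_t*, int, int, int, int, int) {}

// Plays the role of org_yadif_filter_line in the question.
filter_line_fn org_filter_line = &noop_filter_line;

// Hypothetical stand-in for scalable_yadif_context.
struct line_batch {
    std::vector<std::function<void()>> calls;
};

// Every thread that calls this gets its own batch object.
line_batch& current_batch() {
    thread_local line_batch batch;
    return batch;
}

// Replacement callback: one thread-local lookup per line, then the call is deferred.
void tls_filter_line(uint8_t* dst, uint8_t* prev, uint8_t* cur, uint8_t* next,
                     int w, int prefs, int mrefs, int parity, int mode) {
    line_batch& batch = current_batch();
    batch.calls.push_back([=] {
        org_filter_line(dst, prev, cur, next, w, prefs, mrefs, parity, mode);
    });
}

// Run and clear everything queued by the calling thread.
void flush_batch() {
    line_batch& batch = current_batch();
    for (auto& call : batch.calls) call();
    batch.calls.clear();
}
```

The cost being discussed is the `current_batch()` lookup on every line. Note also that the batch belongs to the calling thread, not to a filter instance, so this only works if all lines of one frame are submitted from the same thread and flushed before that thread moves on to another instance.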

Source

2011-07-30 22:46:26 bdonlan

How would I do that? yadif->filter_line is called internally by ffmpeg. I cannot change its signature or how it is called. – ronag

@ronag You could patch ffmpeg so that it passes in a context argument... – bdonlan

True, but that is a very roundabout route, especially since I cannot build ffmpeg. I work on Windows and have no experience with gcc/linux/ffmpeg build tools. I would prefer to avoid it if possible. – ronag
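
For reference, the kind of patch suggested in the comments could look roughly like the sketch below. This is not the real libavfilter API: `ctx_filter_line_fn`, `patched_yadif`, `instance_state`, `counting_filter_line` and `filter_one_line` are all hypothetical, and the only point is that an extra opaque pointer lets every filter instance carry its own state without statics or TLS.

```cpp
#include <cstdint>

// HYPOTHETICAL callback signature after patching: the original parameters
// plus a leading opaque pointer owned by the filter instance.
using ctx_filter_line_fn = void (*)(void* opaque,
                                    uint8_t* dst, uint8_t* prev, uint8_t* cur, uint8_t* next,
                                    int w, int prefs, int mrefs, int parity, int mode);

// HYPOTHETICAL slice of a patched filter context: the callback and its state.
struct patched_yadif {
    ctx_filter_line_fn filter_line;
    void* opaque;
};

// Per-instance state; stands in for scalable_yadif_context.
struct instance_state {
    int lines_seen = 0;
};

// A callback that reaches its own instance through the opaque pointer.
void counting_filter_line(void* opaque, uint8_t*, uint8_t*, uint8_t*, uint8_t*,
                          int, int, int, int, int) {
    ++static_cast<instance_state*>(opaque)->lines_seen;
}

// How the patched filter would invoke the callback for one line.
void filter_one_line(patched_yadif& y,
                     uint8_t* dst, uint8_t* prev, uint8_t* cur, uint8_t* next,
                     int w, int prefs, int mrefs, int parity, int mode) {
    y.filter_line(y.opaque, dst, prev, cur, next, w, prefs, mrefs, parity, mode);
}
```

With this shape, any number of instances can run concurrently, each with its own `instance_state`, at the price of maintaining a patched ffmpeg build.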
