2016-10-03 13 views

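BigQuery - Complex correlated query

I'm trying to query the public Reddit dataset on Google BigQuery. My goal is to compute the similarity of subreddits using the Jaccard index, defined by:

[Image: Jaccard's formula]

My plan is to select the top N = 1000 subreddits by number of comments in August 2016, then compute their Cartesian product to get every combination of subreddits in a subreddit1, subreddit2 shape.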
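For reference, the Jaccard index of two sets is conventionally written as shown here. Reading A and B as the sets of distinct authors who commented in each of the two subreddits is an assumption based on how the question describes the union and intersection of users; the formula image itself did not survive.

```latex
J(A, B) = \frac{|A \cap B|}{|A \cup B|}, \qquad 0 \le J(A, B) \le 1

% Illustrative numbers: 21000 shared authors out of 50000 in the union
J = \frac{21000}{50000} = 0.42
```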

I would then use those rows of combinations to query the union of users between subreddit1 and subreddit2, as well as their intersection.

The query I have so far is this:

SELECT
    subreddit1,
    subreddit2,
    (
        SELECT COUNT(DISTINCT author)
        FROM `fh-bigquery.reddit_comments.2016_08`
        WHERE subreddit = subreddit1
           OR subreddit = subreddit2
        LIMIT 1
    ) AS subreddits_union,
    (
        SELECT COUNT(DISTINCT author)
        FROM `fh-bigquery.reddit_comments.2016_08`
        WHERE subreddit = subreddit1
          AND author IN (
              SELECT author
              FROM `fh-bigquery.reddit_comments.2016_08`
              WHERE subreddit = subreddit2
              GROUP BY author
          )
    ) AS subreddits_intersection
FROM (
    SELECT a.subreddit AS subreddit1, b.subreddit AS subreddit2
    FROM (
        SELECT subreddit, COUNT(*) AS n_comments
        FROM `fh-bigquery.reddit_comments.2016_08`
        GROUP BY subreddit
        ORDER BY n_comments DESC
        LIMIT 1000
    ) a
    CROSS JOIN (
        SELECT subreddit, COUNT(*) AS n_comments
        FROM `fh-bigquery.reddit_comments.2016_08`
        GROUP BY subreddit
        ORDER BY n_comments DESC
        LIMIT 1000
    ) b
    WHERE a.subreddit < b.subreddit
)

Ideally, it would give results like these:

subreddit1 | subreddit2 | subreddits_union | subreddits_intersection
-----------+------------+------------------+------------------------
Art        | Politics   | 50000            | 21000
Art        | Science    | 92320            | 15000
...        | ...        | ...              | ...

However, this query gives me the following BigQuery error:

Error: Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN.

I understand the error, but I don't think this query can be transformed into an efficient join. Given that BQ has no apply method, is there any way to set up this query without resorting to individual queries? Maybe with PARTITION BY?

Answers

1


WITH top_most AS (
    -- Top 20 subreddits by number of comments
    SELECT subreddit, COUNT(*) AS n_comments
    FROM `fh-bigquery.reddit_comments.2016_08`
    GROUP BY subreddit
    ORDER BY n_comments DESC
    LIMIT 20
),
authors AS (
    -- One row per (author, subreddit) pair
    SELECT DISTINCT author, subreddit
    FROM `fh-bigquery.reddit_comments.2016_08`
)
SELECT
    COUNT(DISTINCT a1.author) AS subreddits_intersection,
    subreddit1,
    subreddit2
FROM (
    -- Every ordered pair of top subreddits, including each subreddit with itself
    SELECT t1.subreddit AS subreddit1, t2.subreddit AS subreddit2
    FROM top_most t1 CROSS JOIN top_most t2
    LIMIT 1000000
)
INNER JOIN authors a1 ON a1.subreddit = subreddit1
INNER JOIN authors a2 ON a2.subreddit = subreddit2
-- Keep only authors who commented in both subreddits
WHERE a1.author = a2.author
GROUP BY subreddit1, subreddit2
ORDER BY subreddit1, subreddit2

Ah, thank you! Both queries are exactly what I need, and they run super fast!

1

I'm not sure I understand well enough what you are trying to compute, but maybe this example can help with figuring out a solution. Something along the lines of:

SELECT
    subreddit1,
    subreddit2,
    COUNT(DISTINCT author) AS subreddits_union
FROM `fh-bigquery.reddit_comments.2016_08` AS f
CROSS JOIN (
    -- Unordered pairs of the top 10 subreddits by number of comments
    SELECT a.subreddit AS subreddit1, b.subreddit AS subreddit2
    FROM (
        SELECT subreddit, COUNT(*) AS n_comments
        FROM `fh-bigquery.reddit_comments.2016_08`
        GROUP BY subreddit
        ORDER BY n_comments DESC
        LIMIT 10
    ) a
    CROSS JOIN (
        SELECT subreddit, COUNT(*) AS n_comments
        FROM `fh-bigquery.reddit_comments.2016_08`
        GROUP BY subreddit
        ORDER BY n_comments DESC
        LIMIT 10
    ) b
    WHERE a.subreddit < b.subreddit
    LIMIT 1000000
)
-- Keep comments that belong to either subreddit of the pair
WHERE f.subreddit = subreddit1 OR f.subreddit = subreddit2
GROUP BY subreddit1, subreddit2
ORDER BY subreddit1, subreddit2

Thanks for your answer. This one works pretty well in returning the subreddit union; however, how would you implement the intersection?
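
The union and intersection queries can probably be folded into one query that returns the Jaccard index directly. What follows is a minimal sketch and not part of either answer: it assumes the same `fh-bigquery.reddit_comments.2016_08` table, and the CTE names (`top_subs`, `authors`, `sizes`, `overlap`) and the `jaccard` column are hypothetical. It gets the union by inclusion-exclusion (size of A plus size of B minus size of the intersection), so only the intersection needs a join.

```sql
WITH top_subs AS (
  -- Top subreddits by comment count (20 here to keep the sketch cheap).
  SELECT subreddit, COUNT(*) AS n_comments
  FROM `fh-bigquery.reddit_comments.2016_08`
  GROUP BY subreddit
  ORDER BY n_comments DESC
  LIMIT 20
),
authors AS (
  -- One row per (author, subreddit), restricted to the top subreddits.
  SELECT DISTINCT c.author, c.subreddit
  FROM `fh-bigquery.reddit_comments.2016_08` AS c
  JOIN top_subs AS t ON c.subreddit = t.subreddit
),
sizes AS (
  -- Number of distinct authors per subreddit.
  SELECT subreddit, COUNT(*) AS n_authors
  FROM authors
  GROUP BY subreddit
),
overlap AS (
  -- Authors seen in both subreddits of each unordered pair.
  SELECT
    a1.subreddit AS subreddit1,
    a2.subreddit AS subreddit2,
    COUNT(*) AS subreddits_intersection
  FROM authors AS a1
  JOIN authors AS a2
    ON a1.author = a2.author AND a1.subreddit < a2.subreddit
  GROUP BY subreddit1, subreddit2
)
SELECT
  o.subreddit1,
  o.subreddit2,
  s1.n_authors + s2.n_authors - o.subreddits_intersection AS subreddits_union,
  o.subreddits_intersection,
  o.subreddits_intersection
    / (s1.n_authors + s2.n_authors - o.subreddits_intersection) AS jaccard
FROM overlap AS o
JOIN sizes AS s1 ON s1.subreddit = o.subreddit1
JOIN sizes AS s2 ON s2.subreddit = o.subreddit2
ORDER BY jaccard DESC
```

Because `overlap` is built with an inner join, pairs of subreddits with no shared authors do not appear in the result; their index would be 0. Raising the `LIMIT` in `top_subs` to 1000 would match the original plan, at a correspondingly higher query cost.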