Django query with annotations and conditional count is too slow

I have this query with annotations, counts, and conditional expressions. It runs very slowly and takes forever.

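I have two models, one that stores Instagram publications and another that stores Twitter publications. Each publication also has a FK to another model representing a hexagonal geographic area within a city.

Publication [FK] -> HexCityArea

TwitterPublication [FK] -> HexCityArea

I am trying to count the publications in each hexagon, but the publications are pre-filtered by other fields such as date, so the code is the following: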

instagram_publications_ids = list(instagram_publications.values_list('id', flat=True)) 
twitter_publications_ids = list(twitter_publications.values_list('id', flat=True)) 

print "\n[HEXAGONS QUERY]> List of publications ids insta\n %s \n" % instagram_publications.query 
print instagram_publications.explain() 
print "\n[HEXAGONS QUERY]> List of publications ids twitter\n %s \n" % twitter_publications.query 
print twitter_publications.explain() 

# Get count of publications by hexagon 
resultant_hexagons = HexagonalCityArea.objects.filter(city=city).annotate(
    instagram_count=Count(Case(
        When(publication__id__in=instagram_publications_ids, then=1),
        output_field=IntegerField(),
    ))
).annotate(
    twitter_count=Count(Case(
        When(twitterpublication__id__in=twitter_publications_ids, then=1),
        output_field=IntegerField(),
    ))
)  # .filter(instagram_count__gt=0).filter(twitter_count__gt=0)  # Discard empty hexagons

# For debug only 
print "\n[HEXAGONS QUERY]> Count of publications\n %s \n" % resultant_hexagons.query 
print resultant_hexagons.explain() 

resultant_hexagons_list = list(resultant_hexagons) 
# Iterate remaining hexagons 
city_hexagons = [h for h in resultant_hexagons_list if h.instagram_count > 0 or h.twitter_count > 0]

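As you can see, I first get the list of ids of the selected publications, and I use them later to count only those publications.

One problem I see is that the id lists are very long, around 28000 elements. But if I don't use the lists of ids I don't get the desired results: the conditional count does not work properly and all the publications of the city are counted.

I tried this to avoid using the lists of ids: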

resultant_hexagons = HexagonalCityArea.objects.filter(city=city).annotate(
    instagram_count=Count(Case(
        When(publication__in=instagram_publications, then=1),
        output_field=IntegerField(),
    ))
).annotate(
    twitter_count=Count(Case(
        When(twitterpublication__in=twitter_publications, then=1),
        output_field=IntegerField(),
    ))
).filter(instagram_count__gt=0).filter(twitter_count__gt=0)  # Discard empty hexagons

# For debug only
print "\n[HEXAGONS QUERY]> Count of publications\n %s \n" % resultant_hexagons.query
print resultant_hexagons.explain()

Here is the generated SQL:

SELECT 
    "instanalysis_hexagonalcityarea"."id", 
    "instanalysis_hexagonalcityarea"."created", 
    "instanalysis_hexagonalcityarea"."modified", 
    "instanalysis_hexagonalcityarea"."geom", 
    "instanalysis_hexagonalcityarea"."city_id", 
    COUNT(
    CASE 
     WHEN 
     "instanalysis_publication"."id" IN 
     (
      SELECT 
       U0."id" 
      FROM 
       "instanalysis_publication" U0 
       INNER JOIN 
        "instanalysis_instagramlocation" U1 
        ON (U0."location_id" = U1."id") 
       INNER JOIN 
        "instanalysis_spot" U2 
        ON (U1."spot_id" = U2."id") 
       INNER JOIN 
        "instanalysis_city" U3 
        ON (U2."city_id" = U3."id") 
      WHERE 
       (
        U3."name" = Durban 
        AND U0."publication_date" >= 2016 - 12 - 01 00:00:00 + 01:00 
        AND U0."publication_date" <= 2016 - 12 - 11 00:00:00 + 01:00 
       ) 
     ) 
     THEN 
     1 
     ELSE 
     NULL 
    END 
) AS "instagram_count", COUNT(
    CASE 
     WHEN 
     "instanalysis_twitterpublication"."id" IN 
     (
      SELECT 
       U0."id" 
      FROM 
       "instanalysis_twitterpublication" U0 
       INNER JOIN 
        "instanalysis_twitterlocation" U1 
        ON (U0."location_id" = U1."id") 
       INNER JOIN 
        "instanalysis_spot" U2 
        ON (U1."spot_id" = U2."id") 
       INNER JOIN 
        "instanalysis_city" U3 
        ON (U2."city_id" = U3."id") 
      WHERE 
       (
        U3."name" = Durban 
        AND U0."publication_date" >= 2016 - 12 - 01 00:00:00 + 01:00 
        AND U0."publication_date" <= 2016 - 12 - 11 00:00:00 + 01:00 
       ) 
     ) 
     THEN 
     1 
     ELSE 
     NULL 
    END 
) AS "twitter_count" 
FROM 
    "instanalysis_hexagonalcityarea" 
    LEFT OUTER JOIN 
     "instanalysis_publication" 
     ON ("instanalysis_hexagonalcityarea"."id" = "instanalysis_publication"."hexagon_id") 
    LEFT OUTER JOIN 
     "instanalysis_twitterpublication" 
     ON ("instanalysis_hexagonalcityarea"."id" = "instanalysis_twitterpublication"."hexagon_id") 
WHERE 
    "instanalysis_hexagonalcityarea"."city_id" = 7 
GROUP BY 
    "instanalysis_hexagonalcityarea"."id" 
HAVING 
(COUNT(
    CASE 
     WHEN 
     "instanalysis_publication"."id" IN 
     (
      SELECT 
       U0."id" 
      FROM 
       "instanalysis_publication" U0 
       INNER JOIN 
        "instanalysis_instagramlocation" U1 
        ON (U0."location_id" = U1."id") 
       INNER JOIN 
        "instanalysis_spot" U2 
        ON (U1."spot_id" = U2."id") 
       INNER JOIN 
        "instanalysis_city" U3 
        ON (U2."city_id" = U3."id") 
      WHERE 
       (
        U3."name" = Durban 
        AND U0."publication_date" >= 2016 - 12 - 01 00:00:00 + 01:00 
        AND U0."publication_date" <= 2016 - 12 - 11 00:00:00 + 01:00 
       ) 
     ) 
     THEN 
     1 
     ELSE 
     NULL 
    END 
) > 0 
    AND COUNT(
    CASE 
     WHEN 
     "instanalysis_twitterpublication"."id" IN 
     (
      SELECT 
       U0."id" 
      FROM 
       "instanalysis_twitterpublication" U0 
       INNER JOIN 
        "instanalysis_twitterlocation" U1 
        ON (U0."location_id" = U1."id") 
       INNER JOIN 
        "instanalysis_spot" U2 
        ON (U1."spot_id" = U2."id") 
       INNER JOIN 
        "instanalysis_city" U3 
        ON (U2."city_id" = U3."id") 
      WHERE 
       (
        U3."name" = Durban 
        AND U0."publication_date" >= 2016 - 12 - 01 00:00:00 + 01:00 
        AND U0."publication_date" <= 2016 - 12 - 11 00:00:00 + 01:00 
       ) 
     ) 
     THEN 
     1 
     ELSE 
     NULL 
    END 
) > 0)

This is much faster; see the EXPLAIN ANALYZE output:

GroupAggregate (cost=1.14..743590.08 rows=3300 width=184) (actual time=5186.606..46907.530 rows=334 loops=1) 
    Group Key: instanalysis_hexagonalcityarea.id 
    Filter: ((count(CASE WHEN (hashed SubPlan 3) THEN 1 ELSE NULL::integer END) > 0) AND (count(CASE WHEN (hashed SubPlan 4) THEN 1 ELSE NULL::integer END) > 0)) 
    Rows Removed by Filter: 2966 
    -> Merge Left Join (cost=1.14..320194.96 rows=7166797 width=184) (actual time=4851.792..17369.232 rows=70436610 loops=1) 
     Merge Cond: (instanalysis_hexagonalcityarea.id = instanalysis_publication.hexagon_id) 
     -> Merge Left Join (cost=0.71..21686.40 rows=49328 width=180) (actual time=109.033..164.451 rows=30857 loops=1) 
       Merge Cond: (instanalysis_hexagonalcityarea.id = instanalysis_twitterpublication.hexagon_id) 
       -> Index Scan using instanalysis_hexagonalcityarea_pkey on instanalysis_hexagonalcityarea (cost=0.29..591.47 rows=3300 width=176) (actual time=22.783..23.878 rows=3300 loops=1) 
        Filter: (city_id = 7) 
        Rows Removed by Filter: 7282 
       -> Index Scan using instanalysis_twitterpublication_5c78aecb on instanalysis_twitterpublication (cost=0.42..64392.25 rows=504291 width=8) (actual time=0.018..111.677 rows=170305 loops=1) 
     -> Materialize (cost=0.43..501402.61 rows=3754731 width=8) (actual time=0.011..6788.670 rows=71922153 loops=1) 
       -> Index Scan using instanalysis_publication_5c78aecb on instanalysis_publication (cost=0.43..492015.78 rows=3754731 width=8) (actual time=0.005..4034.838 rows=1778030 loops=1) 
    SubPlan 1 
    -> Nested Loop (cost=0.72..105061.24 rows=27624 width=4) (actual time=0.326..74.024 rows=21824 loops=1) 
      -> Nested Loop (cost=0.29..620.11 rows=2767 width=4) (actual time=0.024..2.915 rows=3374 loops=1) 
       -> Nested Loop (cost=0.00..143.13 rows=504 width=4) (actual time=0.016..0.618 rows=829 loops=1) 
         Join Filter: (u2.city_id = u3.id) 
         Rows Removed by Join Filter: 3350 
         -> Seq Scan on instanalysis_city u3 (cost=0.00..1.10 rows=1 width=4) (actual time=0.004..0.006 rows=1 loops=1) 
          Filter: ((name)::text = 'Durban'::text) 
          Rows Removed by Filter: 7 
         -> Seq Scan on instanalysis_spot u2 (cost=0.00..89.79 rows=4179 width=8) (actual time=0.001..0.242 rows=4179 loops=1) 
       -> Index Scan using instanalysis_instagramlocation_e72b53d4 on instanalysis_instagramlocation u1 (cost=0.29..0.89 rows=6 width=8) (actual time=0.001..0.002 rows=4 loops=829) 
         Index Cond: (spot_id = u2.id) 
      -> Index Scan using instanalysis_publication_e274a5da on instanalysis_publication u0 (cost=0.43..37.45 rows=30 width=8) (actual time=0.006..0.021 rows=6 loops=3374) 
       Index Cond: (location_id = u1.id) 
       Filter: ((publication_date >= '2016-11-30 23:00:00+00'::timestamp with time zone) AND (publication_date <= '2016-12-10 23:00:00+00'::timestamp with time zone)) 
       Rows Removed by Filter: 80 
    SubPlan 2 
    -> Hash Join (cost=2595.62..25893.51 rows=9013 width=4) (actual time=22.511..73.141 rows=6220 loops=1) 
      Hash Cond: (u0_1.location_id = u1_1.id) 
      -> Seq Scan on instanalysis_twitterpublication u0_1 (cost=0.00..22927.36 rows=74772 width=8) (actual time=15.212..59.628 rows=75775 loops=1) 
       Filter: ((publication_date >= '2016-11-30 23:00:00+00'::timestamp with time zone) AND (publication_date <= '2016-12-10 23:00:00+00'::timestamp with time zone)) 
       Rows Removed by Filter: 428516 
      -> Hash (cost=2348.24..2348.24 rows=19790 width=4) (actual time=6.538..6.538 rows=15589 loops=1) 
       Buckets: 32768 Batches: 1 Memory Usage: 805kB 
       -> Nested Loop (cost=0.70..2348.24 rows=19790 width=4) (actual time=0.023..5.052 rows=15589 loops=1) 
         -> Nested Loop (cost=0.28..39.28 rows=504 width=4) (actual time=0.015..0.186 rows=829 loops=1) 
          -> Seq Scan on instanalysis_city u3_1 (cost=0.00..1.10 rows=1 width=4) (actual time=0.003..0.004 rows=1 loops=1) 
            Filter: ((name)::text = 'Durban'::text) 
            Rows Removed by Filter: 7 
          -> Index Scan using instanalysis_spot_c7141997 on instanalysis_spot u2_1 (cost=0.28..33.14 rows=504 width=8) (actual time=0.010..0.124 rows=829 loops=1) 
            Index Cond: (city_id = u3_1.id) 
         -> Index Scan using instanalysis_twitterlocation_e72b53d4 on instanalysis_twitterlocation u1_1 (cost=0.42..3.93 rows=65 width=8) (actual time=0.001..0.004 rows=19 loops=829) 
          Index Cond: (spot_id = u2_1.id) 
    SubPlan 3 
    -> Nested Loop (cost=0.72..105061.24 rows=27624 width=4) (actual time=0.348..80.863 rows=21824 loops=1) 
      -> Nested Loop (cost=0.29..620.11 rows=2767 width=4) (actual time=0.028..3.507 rows=3374 loops=1) 
       -> Nested Loop (cost=0.00..143.13 rows=504 width=4) (actual time=0.016..0.646 rows=829 loops=1) 
         Join Filter: (u2_2.city_id = u3_2.id) 
         Rows Removed by Join Filter: 3350 
         -> Seq Scan on instanalysis_city u3_2 (cost=0.00..1.10 rows=1 width=4) (actual time=0.003..0.004 rows=1 loops=1) 
          Filter: ((name)::text = 'Durban'::text) 
          Rows Removed by Filter: 7 
         -> Seq Scan on instanalysis_spot u2_2 (cost=0.00..89.79 rows=4179 width=8) (actual time=0.001..0.276 rows=4179 loops=1) 
       -> Index Scan using instanalysis_instagramlocation_e72b53d4 on instanalysis_instagramlocation u1_2 (cost=0.29..0.89 rows=6 width=8) (actual time=0.001..0.003 rows=4 loops=829) 
         Index Cond: (spot_id = u2_2.id) 
      -> Index Scan using instanalysis_publication_e274a5da on instanalysis_publication u0_2 (cost=0.43..37.45 rows=30 width=8) (actual time=0.007..0.022 rows=6 loops=3374) 
       Index Cond: (location_id = u1_2.id) 
       Filter: ((publication_date >= '2016-11-30 23:00:00+00'::timestamp with time zone) AND (publication_date <= '2016-12-10 23:00:00+00'::timestamp with time zone)) 
       Rows Removed by Filter: 80 
    SubPlan 4 
    -> Hash Join (cost=2595.62..25893.51 rows=9013 width=4) (actual time=41.392..92.680 rows=6220 loops=1) 
      Hash Cond: (u0_3.location_id = u1_3.id) 
      -> Seq Scan on instanalysis_twitterpublication u0_3 (cost=0.00..22927.36 rows=74772 width=8) (actual time=32.641..78.020 rows=75775 loops=1) 
       Filter: ((publication_date >= '2016-11-30 23:00:00+00'::timestamp with time zone) AND (publication_date <= '2016-12-10 23:00:00+00'::timestamp with time zone)) 
       Rows Removed by Filter: 428516 
      -> Hash (cost=2348.24..2348.24 rows=19790 width=4) (actual time=7.907..7.907 rows=15589 loops=1) 
       Buckets: 32768 Batches: 1 Memory Usage: 805kB 
       -> Nested Loop (cost=0.70..2348.24 rows=19790 width=4) (actual time=0.044..6.136 rows=15589 loops=1) 
         -> Nested Loop (cost=0.28..39.28 rows=504 width=4) (actual time=0.026..0.220 rows=829 loops=1) 
          -> Seq Scan on instanalysis_city u3_3 (cost=0.00..1.10 rows=1 width=4) (actual time=0.006..0.008 rows=1 loops=1) 
            Filter: ((name)::text = 'Durban'::text) 
            Rows Removed by Filter: 7 
          -> Index Scan using instanalysis_spot_c7141997 on instanalysis_spot u2_3 (cost=0.28..33.14 rows=504 width=8) (actual time=0.016..0.135 rows=829 loops=1) 
            Index Cond: (city_id = u3_3.id) 
         -> Index Scan using instanalysis_twitterlocation_e72b53d4 on instanalysis_twitterlocation u1_3 (cost=0.42..3.93 rows=65 width=8) (actual time=0.001..0.005 rows=19 loops=829) 
          Index Cond: (spot_id = u2_3.id) 
Planning time: 50.735 ms 
Execution time: 46908.482 ms

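The problem is that I don't get what I want. It seems to count many more publications. The publications were previously filtered by date, and only the filtered publications that fall inside each hexagon should be counted, but it seems to count all the publications in the hexagon, as if the When clause were not working.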
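One possible contributor to the inflated counts, not confirmed anywhere in this thread: the query LEFT JOINs both publication tables onto the hexagon table before grouping, so a hexagon with i Instagram and t Twitter publications produces i × t joined rows (the plan above goes from 30,857 rows after the first join to 70,436,610 after the second). The sketch below reproduces that effect with the standard-library sqlite3 module. The three-table schema is a simplified stand-in, not the real instanalysis tables.

```python
import sqlite3

# Hypothetical, simplified schema: one hexagon with 3 Instagram
# publications and 2 Twitter publications attached to it.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hexagon (id INTEGER PRIMARY KEY);
    CREATE TABLE publication (id INTEGER PRIMARY KEY, hexagon_id INTEGER);
    CREATE TABLE twitterpublication (id INTEGER PRIMARY KEY, hexagon_id INTEGER);
    INSERT INTO hexagon (id) VALUES (1);
    INSERT INTO publication (id, hexagon_id) VALUES (1, 1), (2, 1), (3, 1);
    INSERT INTO twitterpublication (id, hexagon_id) VALUES (1, 1), (2, 1);
""")

# Same join shape as the generated SQL: both tables joined, then grouped.
JOINS = """
    FROM hexagon h
    LEFT OUTER JOIN publication p ON p.hexagon_id = h.id
    LEFT OUTER JOIN twitterpublication t ON t.hexagon_id = h.id
    GROUP BY h.id
"""

# Plain COUNT sees one row per (publication, tweet) pair: 3 * 2 = 6 rows.
naive = conn.execute("SELECT COUNT(p.id), COUNT(t.id)" + JOINS).fetchone()

# COUNT(DISTINCT ...) collapses the duplicated ids again.
distinct = conn.execute(
    "SELECT COUNT(DISTINCT p.id), COUNT(DISTINCT t.id)" + JOINS
).fetchone()

print(naive)     # (6, 6)
print(distinct)  # (3, 2)
```

In the ORM, `Count` accepts `distinct=True`. With `then=1`, however, a distinct count would collapse to at most 1, so the `Case` would have to yield the publication id rather than a constant for this to work. Even then the database still has to build the full joined row set, so this would fix the numbers, not the run time.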

Thank you.

Source

2017-03-25 Martinez Mariano

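Why is [count aggregation](https://docs.djangoproject.com/en/1.10/topics/db/aggregation/#generating-aggregates-for-each-item-in-a-queryset) not an option? In theory, two aggregation queries with `count` should be far more efficient than a union query with an IN clause – Marat

Thanks for your comment @Marat. Your way is much faster, but the problem is that I get wrong results. I have updated the post with the SQL and the EXPLAIN ANALYZE output. –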

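The main reason it is slow is the subquery. That is, for every record in hexarea the DB server issues another query to count the Instagram/Twitter records matching its id. After the update it is still essentially the same.

Solution: run an aggregation query. This way the DB server can go through the list of ids linearly, only once. That is probably orders of magnitude more efficient. Example: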

from django.db.models import Count

# assuming "instagram_publications" is the related_name
# of the corresponding Instagram/Twitter post model
instacounts = HexagonalCityArea.objects.filter(city=city
      ).filter(instagram_publications__publication_date__lte=end_date
      ).filter(instagram_publications__publication_date__gte=start_date
      ).aggregate(Count('instagram_publications'))

Source

2017-03-25 15:38:10 Marat

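Thanks again @Marat. I can see that aggregate is much faster than annotate, but how can I use conditional aggregation to count only specific Instagram publications instead of all of them? In my case, instagram_publications is not the related_name of the related model. It is a queryset of previously filtered publications, like instagram_publications = Publication.objects.filter(location__spot__city__name=location).filter(publication_date__gte=start_date).filter(publication_date__lte=end_date) –

@MartinezMariano I updated the example. You don't need to filter posts by location, since the hexarea already pins them to an exact place – Marat

Thanks again @Marat. There are two problems with this approach. The first is that aggregate returns a single tuple with the total count over all hexagons, and I need the count per hexagon, so I have to use annotate. The second is that when I tried annotate, the filters applied to the publication model do not seem to work properly while the query runs on the Hexagon model, because the final count for each hexagon is wrong: it is too big. –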
