JOIN高速化のためにインデックスを作成しても効果はないのか

業務中に以前、自分が書いた記事を読み返す機会がありました。

www.okb-shelf.work

JOINを使っているSQLを高速化できないかと思って新規インデックスの作成を検討をしていました。
ただし、過去の記事にも書いた通り「結合述語にインデックスを作成しても、ハッシュ結合のパフォーマンスは良くならない」という事実があります。

ただ、前回は知識として「そうなんだ」という理解をしたものの、実際にインデックスを作成前後で、JOINのパフォーマンスが変化しないことを確認までしていませんでした。自分の理解をより深めるために実際に計測してみたいと思います。

サンプルデータについて

以下2つのテーブルを用意しました。

students: 生徒情報
exam_results: 試験結果

詳細はこちらからご覧下さい。

studentsには1万人のランダムなレコード、exam_resultsには一生徒あたり5教科分のレコードを作成しました。

postgres=> SELECT COUNT(*) FROM students;
 count
-------
 10000
(1 row)

postgres=> SELECT COUNT(*) FROM exam_results;
 count
-------
 50000
(1 row)

postgres=> SELECT * FROM students LIMIT 1;
 student_id | first_name | last_name |  birthday  | gender |     created_at      |     updated_at
------------+------------+-----------+------------+--------+---------------------+---------------------
          7 | 初音       | 坂田      | 1963-05-14 | 1      | 2023-09-03 23:30:17 | 2023-09-03 23:30:17
(1 row)

postgres=> SELECT * FROM exam_results LIMIT 1;
 exim_result_id | student_id | subject_id | score |     created_at      |     updated_at
----------------+------------+------------+-------+---------------------+---------------------
              6 |          7 | 1          |    53 | 2023-09-03 23:30:17 | 2023-09-03 23:30:17
(1 row)

検証結果

シンプルに結合のみを行うSQLをEXPLAIN ANALYZEを使って、結果を検証していきます。

EXPLAIN ANALYZE SELECT * FROM students JOIN exam_results ON students.student_id = exam_results.student_id;

この時点で作成されているインデックスはstudentsとexam_resultsにプライマリキー制約のために、自動で作成されたインデックスのみです。

結合先テーブル: インデックスなし

結合先テーブル(exam_results)にインデックスがない場合、Hash Joinを選択後にSeq Scanが実行されます。
特に注目すべき点はありません。一生懸命、最適解を探してくれるプランナーくん、ありがとう。

postgres=> EXPLAIN ANALYZE SELECT * FROM students JOIN exam_results ON students.student_id = exam_results.student_id;
                                                       QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=318.00..1317.31 rows=50000 width=66) (actual time=3.891..29.970 rows=50000 loops=1)
   Hash Cond: (exam_results.student_id = students.student_id)
   ->  Seq Scan on exam_results  (cost=0.00..868.00 rows=50000 width=28) (actual time=0.022..5.472 rows=50000 loops=1)
   ->  Hash  (cost=193.00..193.00 rows=10000 width=38) (actual time=3.727..3.728 rows=10000 loops=1)
         Buckets: 16384  Batches: 1  Memory Usage: 902kB
         ->  Seq Scan on students  (cost=0.00..193.00 rows=10000 width=38) (actual time=0.015..1.723 rows=10000 loops=1)
 Planning Time: 0.335 ms
 Execution Time: 32.429 ms
(8 rows)

結合先テーブル: インデックスあり

結合先テーブル(exam_results)のstudent_idに単一のインデックスを作成しました。
exam_results.student_idはJOINの結合述語に使用しているカラムですが、前述の通り、結合述語にインデックスを作成してもハッシュ結合のパフォーマンスが良くならないはずです。

postgres=> CREATE INDEX exam_results_student_id_idx ON exam_results (student_id);
CREATE INDEX
postgres=> EXPLAIN ANALYZE SELECT * FROM students JOIN exam_results ON students.student_id = exam_results.student_id;
                                                       QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=318.00..1317.31 rows=50000 width=66) (actual time=3.696..31.245 rows=50000 loops=1)
   Hash Cond: (exam_results.student_id = students.student_id)
   ->  Seq Scan on exam_results  (cost=0.00..868.00 rows=50000 width=28) (actual time=0.011..5.512 rows=50000 loops=1)
   ->  Hash  (cost=193.00..193.00 rows=10000 width=38) (actual time=3.655..3.656 rows=10000 loops=1)
         Buckets: 16384  Batches: 1  Memory Usage: 902kB
         ->  Seq Scan on students  (cost=0.00..193.00 rows=10000 width=38) (actual time=0.005..1.366 rows=10000 loops=1)
 Planning Time: 1.244 ms
 Execution Time: 33.941 ms
(8 rows)

期待通りの結果になりました。
プランナーが選択した結果は、先ほどのインデックスがない場合と全く同じです。結合述語を満たすインデックスを作成しても、JOINのパフォーマンスが改善されていないことが分かりました。

ただし、WHEREを含むSQLの場合には当然、インデックスを作成することでパフォーマンスが良くなるケースが考えられます。基本的には対象の結合元テーブル、結合先テーブルの行数を減らすことが、読み込みデータ・メモリ使用量という観点から重要になります。今まで検証したSQLに簡単なWHEREを追加して、結果を見てみます。

+WHERE 結合元: インデックスあり

誕生日による簡易な絞り込みを追加しました。
studentsからレコードが絞り込まれるはずなので、多少は高速化されるかなと思われますが、どうでしょうか。

postgres=> CREATE INDEX students_birthday_idx ON students (birthday);
postgres=> EXPLAIN ANALYZE SELECT * FROM students JOIN exam_results ON students.student_id = exam_results.student_id WHERE birthday > '2000/1/1';
                                                                   QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=258.95..1258.26 rows=18575 width=66) (actual time=1.606..20.779 rows=18585 loops=1)
   Hash Cond: (exam_results.student_id = students.student_id)
   ->  Seq Scan on exam_results  (cost=0.00..868.00 rows=50000 width=28) (actual time=0.007..5.408 rows=50000 loops=1)
   ->  Hash  (cost=212.51..212.51 rows=3715 width=38) (actual time=1.585..1.586 rows=3717 loops=1)
         Buckets: 4096  Batches: 1  Memory Usage: 320kB
         ->  Bitmap Heap Scan on students  (cost=73.08..212.51 rows=3715 width=38) (actual time=0.325..0.970 rows=3717 loops=1)
               Recheck Cond: (birthday > '2000-01-01'::date)
               Heap Blocks: exact=93
               ->  Bitmap Index Scan on students_birthday_idx  (cost=0.00..72.15 rows=3715 width=0) (actual time=0.310..0.310 rows=3717 loops=1)
                     Index Cond: (birthday > '2000-01-01'::date)
 Planning Time: 0.856 ms
 Execution Time: 21.937 ms
(12 rows)

最終的なデータ量をみると320kBとなっているので、WHEREがなかった場合のSQLと比べると1/3程度にデータ量が抑えられていることが分かります。その一方で、プランナーの結果をみると50,000レコードに対してSeq Scanをしてしまっているので、大幅なパフォーマンス改善がされたとは考えにくい結果となりました。

+WHERE 結合先: 単一インデックスあり

次は結合先テーブルで絞り込みを行います。
絞り込み条件はexam_results.subject_idによる科目の絞り込みです。ここでは1(国語)に絞り込みます。

postgres=> CREATE INDEX exam_results_subject_id_idx ON exam_results (subject_id);
postgres=> EXPLAIN ANALYZE SELECT * FROM students JOIN exam_results ON students.student_id = exam_results.student_id WHERE exam_results.subject_id = '1';;
                                                                     QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=435.87..955.28 rows=10010 width=66) (actual time=3.276..10.523 rows=10000 loops=1)
   Hash Cond: (exam_results.student_id = students.student_id)
   ->  Bitmap Heap Scan on exam_results  (cost=117.87..610.99 rows=10010 width=28) (actual time=0.580..3.733 rows=10000 loops=1)
         Recheck Cond: (subject_id = '1'::bpchar)
         Heap Blocks: exact=368
         ->  Bitmap Index Scan on exam_results_subject_id_idx  (cost=0.00..115.37 rows=10010 width=0) (actual time=0.529..0.530 rows=10000 loops=1)
               Index Cond: (subject_id = '1'::bpchar)
   ->  Hash  (cost=193.00..193.00 rows=10000 width=38) (actual time=2.658..2.660 rows=10000 loops=1)
         Buckets: 16384  Batches: 1  Memory Usage: 902kB
         ->  Seq Scan on students  (cost=0.00..193.00 rows=10000 width=38) (actual time=0.007..0.990 rows=10000 loops=1)
 Planning Time: 0.628 ms
 Execution Time: 11.022 ms
(12 rows)

結合先テーブルを絞り込んだ場合、HashJoinを通じてレコード数は10,010となりました。
作成したインデックスも使われているようですし、実行時間も今まで一番早いです。しかし、結合元テーブルのレコード数は全レコード(10,000)であるため、最終的なデータ量はインデックスがない場合の902kBと同じ値になりました。

+WHERE 結合先: 複合インデックスあり

結合述語を含めた複合インデックスを試してみましたが、結果は変わりませんでした。

postgres=> CREATE INDEX exam_results_student_id_subject_id_idx ON exam_results(student_id, subject_id);
postgres=> EXPLAIN ANALYZE SELECT * FROM students JOIN exam_results ON students.student_id = exam_results.student_id WHERE exam_results.subject_id = '1';
                                                       QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=318.00..1337.29 rows=10010 width=66) (actual time=3.003..13.469 rows=10000 loops=1)
   Hash Cond: (exam_results.student_id = students.student_id)
   ->  Seq Scan on exam_results  (cost=0.00..993.00 rows=10010 width=28) (actual time=0.019..6.553 rows=10000 loops=1)
         Filter: (subject_id = '1'::bpchar)
         Rows Removed by Filter: 40000
   ->  Hash  (cost=193.00..193.00 rows=10000 width=38) (actual time=2.964..2.966 rows=10000 loops=1)
         Buckets: 16384  Batches: 1  Memory Usage: 902kB
         ->  Seq Scan on students  (cost=0.00..193.00 rows=10000 width=38) (actual time=0.004..1.067 rows=10000 loops=1)
 Planning Time: 0.970 ms
 Execution Time: 14.266 ms
(10 rows)

+WHERE 結合先・結合元: インデックスあり

今までの結合元と結合先のWHEREを合体させて、それぞれのテーブルにインデックスを作成しました。
結果、本日計測したSQLの中で最もパフォーマンスが良い結果が得られました。SeqScanが実行されていないことも分かりますし、最終的なデータ量も320kBでした。

postgres=> CREATE INDEX students_birthday_idx ON students (birthday);
postgres=> CREATE INDEX exam_results_subject_id_idx ON exam_results (subject_id);
postgres=> EXPLAIN ANALYZE SELECT * FROM students JOIN exam_results ON students.student_id = exam_results.student_id WHERE birthday > '2000/1/1' AND exam_results.subject_id = '1';
                                                                     QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=376.82..896.23 rows=3719 width=66) (actual time=3.784..9.042 rows=3717 loops=1)
   Hash Cond: (exam_results.student_id = students.student_id)
   ->  Bitmap Heap Scan on exam_results  (cost=117.87..610.99 rows=10010 width=28) (actual time=1.208..3.803 rows=10000 loops=1)
         Recheck Cond: (subject_id = '1'::bpchar)
         Heap Blocks: exact=368
         ->  Bitmap Index Scan on exam_results_subject_id_idx  (cost=0.00..115.37 rows=10010 width=0) (actual time=1.158..1.158 rows=10000 loops=1)
               Index Cond: (subject_id = '1'::bpchar)
   ->  Hash  (cost=212.51..212.51 rows=3715 width=38) (actual time=2.558..2.560 rows=3717 loops=1)
         Buckets: 4096  Batches: 1  Memory Usage: 320kB
         ->  Bitmap Heap Scan on students  (cost=73.08..212.51 rows=3715 width=38) (actual time=0.788..1.614 rows=3717 loops=1)
               Recheck Cond: (birthday > '2000-01-01'::date)
               Heap Blocks: exact=93
               ->  Bitmap Index Scan on students_birthday_idx  (cost=0.00..72.15 rows=3715 width=0) (actual time=0.774..0.774 rows=3717 loops=1)
                     Index Cond: (birthday > '2000-01-01'::date)
 Planning Time: 1.363 ms
 Execution Time: 9.657 ms
(16 rows)

まとめ

やはり、JOINをする際は対象となるレコード数を少なくなるように絞り込んでおくのが重要になりそうです。
インデックスはそのために使われるべきで、何度も記載しているように結合述語にインデックスを作成しても、インデックスが使われることはありませんし、JOIN自体のパフォーマンスUPは期待できませんでした。