关系型数据库引入 JSON 支持后,出现了一些更高复杂度的 SQL 查询。
多年来,对象关系映射(ORM)工具已将查询生成流程标准化,让开发人员得以专注于优化少数真正需要调整的查询。但这些工具有时会生成过于复杂的 SQL,包含过多连接操作和业务逻辑,使得调优工作如同噩梦。开发者往往难以判断何时该将一个查询拆分为多个小型查询。而 JSON 数据类型的使用进一步加剧了这种复杂性。
ORM 生成的低效查询示例 以下是一个由 ORM 生成的复杂 SQL 查询的精简版本。由于其执行时间超过 90 分钟,不符合业务需求,我们需要对其进行调优。
WITH result_list AS ( SELECT o.id, o.data AS order_data, ARRAY_AGG ( DISTINCT cr.id) AS customer_refs_id, ARRAY_AGG ( DISTINCT cr.data) AS customer_refs_data, ARRAY_AGG ( DISTINCT cp.policy_id) AS customer_policy_id, ROW_NUMBER () OVER () AS rowid FROM orders o LEFT JOIN customer_refs cr ON cr.data - >> 'id' = o.data - > 'customer' - >> 'id' LEFT JOIN customer_policies cp ON cp.policy_data - >> 'id' = o.data - > 'customer' - >> 'id' AND cp.policy_data - >> 'region' = ANY ( ARRAY [ 'GLOBAL' , 'REGION_A' , 'REGION_B' , 'REGION_C' , 'REGION_D' ]) AND ( ( cp.policy_data - >> 'status' IS NOT NULL AND cp.policy_data - >> 'status' NOT IN ( 'inactive' , 'blocked' , 'suspended' ) ) AND ( ( cp.policy_data - >> 'customer_type' IS NOT NULL AND cp.policy_data - >> 'customer_type' = 'issuer' ) OR ( cp.policy_data - >> 'customer_type' IS NOT NULL AND cp.policy_data - >> 'customer_type' IN ( 'type1' , 'type2' , 'type3' ) ) OR ( get_text_from_path(cp.policy_data - > 'shared_groups' , '$[*]' , 'NONE' ) IS NOT NULL AND get_text_from_path(cp.policy_data - > 'shared_groups' , '$[*]' , 'NONE' ) && ARRAY [ 'GROUP_1' ] ) ) ) WHERE EXISTS ( SELECT 1 FROM account_policies ap WHERE ap.policy_id = ARRAY_TO_STRING( ARRAY [ o.data - > 'account' - > 'id' - >> 'system_ref' , o.data - > 'account' - > 'id' - >> 'type_ref' , o.data - > 'account' - > 'id' - >> 'region' , o.data - > 'account' - > 'id' - >> 'id' ], '-' ) AND ap.policy_data - >> 'region' = ANY ( ARRAY [ 'GLOBAL' , 'REGION_A' , 'REGION_B' , 'REGION_C' , 'REGION_D' ]) AND ap.policy_data - >> 'status' NOT IN ( 'inactive' , 'blocked' , 'suspended' ) AND ap.policy_data - >> 'contract' != 'prospect' AND ( ap.policy_data - >> 'account_type' IN ( 'typeX' , 'typeY' , 'typeZ' ) OR get_text_from_path(ap.policy_data - > 'shared_groups' , '$[*]' , 'NONE' ) && ARRAY [ 'GROUP_1' ] ) ) AND ( ( o.data - >> 'order_type' = 'MONEY_ORDER' AND ( o.data - > 'close_date' - >> 'value' > '2024-09-10' OR o.data - > 'close_date' - >> 'value' IS NULL OR o.data - > 'close_date' - >> 'value' = '' ) ) OR ( o.data - >> 'order_type' != 'MONEY_ORDER' AND COALESCE ( NULLIF (o.data - > 'valuation' - > 'quantity' - >> 'value' , '' ):: DECIMAL , 0 ) != 0 ) ) AND ( LOWER (o.data - >> 'display_name' ) LIKE '%current%' OR LOWER (o.data - > 'product' - > 'item' - > 'item_name' - >> 'abbreviation' ) LIKE '%current%' OR LOWER (o.data - > 'product' - > 'item' - > 'item_name' - >> 'full' ) LIKE '%current%' OR LOWER (o.data - > 'product' - > 'item' - >> 'identifier' ) LIKE '%current%' OR LOWER (o.data - > 'product' - > 'issuer' - >> 'display_name' ) LIKE '%current%' OR get_text_for_search_from_path(o.data - > 'product' - > 'underlying' , '$[*].item.item_name.abbreviation' , 'LOWER' ) LIKE '%current%' ) GROUP BY o.id, o.data ORDER BY o.id ), ordered_list AS ( SELECT * FROM result_list ORDER BY rowid LIMIT 23 ) SELECT * FROM ordered_list ORDER BY rowid;
查询的核心特征:
• 多连接操作 :查询包含多个 LEFT JOIN
,且 ON
条件涉及复杂的 JSON 字段操作。 • 嵌套 JSON 访问 :频繁使用 JSON 运算符( ->
、 ->>
)访问深层嵌套的字段。 • 自定义函数 :使用 get_text_from_path
和 get_text_for_search_from_path
等函数,这类函数可能未优化性能。 • 复杂 WHERE 条件 : WHERE
子句包含多个嵌套条件和子查询,且常涉及 JSON 字段。 查询低效的原因 1. 在大型数据集中,使用运算符访问深层嵌套的 JSON 字段速度较慢;若未正确建立索引,会导致全表扫描,大幅增加执行时间。 2. 查询对 JSON 字段执行大量的 LOWER()
和 LIKE
操作,计算成本高,且会导致索引无法生效。 3. 由于条件的写法问题,JSON 字段上的索引可能无法被利用。 4. LEFT JOIN
的条件中包含函数和 JSON 字段比较,导致执行计划效率低下。 5. EXISTS
子查询进一步增加了执行复杂性,可能引发全表扫描。 6. ORM 通常会生成通用性 SQL,未考虑特定数据库的优化方式。 7. ORM 的抽象特性可能导致冗余或不必要的条件与连接操作。 糟糕的执行计划 Limit (cost=292646835.98..292646835.99 rows=1 width=1813) (actual time=5504844.738..5504844.745 rows=23 loops=1) Buffers: shared hit=3622772239 read=7862262 dirtied=3639 -> Sort (cost=292646835.98..292646835.99 rows=1 width=1813) (actual time=5502255.883..5502255.889 rows=23 loops=1) Sort Key: (row_number() OVER (?)) Sort Method: quicksort Memory: 114kB Buffers: shared hit=3622772239 read=7862262 dirtied=3639 -> WindowAgg (cost=292646835.93..292646835.97 rows=1 width=1813) (actual time=5502254.506..5502255.810 rows=43 loops=1) Buffers: shared hit=3622772236 read=7862262 dirtied=3639 -> GroupAggregate (cost=292646835.93..292646835.96 rows=1 width=1805) (actual time=5502253.088..5502253.249 rows=43 loops=1) Group Key: o.id Buffers: shared hit=3622772236 read=7862262 dirtied=3639 -> Sort (cost=292646835.93..292646835.93 rows=1 width=1930) (actual time=5502244.952..5502244.964 rows=43 loops=1) Sort Key: o.id, cr.id Sort Method: quicksort Memory: 71kB Buffers: shared hit=3622772236 read=7862262 dirtied=3639 -> Nested Loop (cost=0.98..292646835.92 rows=1 width=1930) (actual time=2015030.222..5502243.020 rows=43 loops=1) Join Filter: (array_to_string(ARRAY[...]) = ap.policy_id) Rows Removed by Join Filter: 246198062 Buffers: shared hit=3622772236 read=7862262 dirtied=3639 --- "Planning:" " Buffers: shared hit=649 read=2" "Planning Time: 133.626 ms" "JIT:" " Functions: 32" " Options: Inlining true, Optimization true, Expressions true, Deforming true" " Timing: Generation 226.277 ms, Inlining 631.748 ms, Optimization 801.960 ms, Emission 1156.811 ms, Total 2816.797 ms" "Execution Time: 5505139.477 ms"
执行计划观察结果:
执行时间 :约 90 分钟
缓冲区使用 :
• 共享命中(shared hit):约 36 亿次 核心问题:
• 嵌套循环(Nested Loop) :逐行处理效率低下。 • 全表扫描 :连接筛选器过滤掉大量行,说明存在大量全表扫描。 优化查询 让我们来对查询做如下修改,以帮助优化器使用索引:
1. 建立索引并创建生成列(generated column),避免在 WHERE
子句中对字段应用函数。 2. 在可能的情况下,将 LEFT JOIN
改为 INNER JOIN
,以便在查询执行早期缩小数据集范围。 3. 重写部分条件,避免对 JSON 字段使用函数。 ALTER TABLE orders ADD COLUMN customer_id TEXT GENERATED ALWAYS AS (data - > 'customer' - >> 'id' ) STORED; ALTER TABLE orders ADD COLUMN order_type TEXT GENERATED ALWAYS AS (data - >> 'order_type' ) STORED; ALTER TABLE orders ADD COLUMN display_name_lower TEXT GENERATED ALWAYS AS ( LOWER (data - >> 'display_name' )) STORED; CREATE INDEX idx_orders_customer_id ON orders (customer_id); CREATE INDEX idx_orders_order_type ON orders (order_type); CREATE INDEX idx_orders_display_name_lower ON orders (display_name_lower); WITH result_list AS ( SELECT o.id, o.data AS order_data, ARRAY_AGG ( DISTINCT cr.id) AS customer_refs_id, ARRAY_AGG ( DISTINCT cr.data) AS customer_refs_data, ARRAY_AGG ( DISTINCT cp.policy_id) AS customer_policy_id, ROW_NUMBER () OVER () AS rowid FROM orders o INNER JOIN customer_refs cr ON cr.data - >> 'id' = o.customer_id INNER JOIN customer_policies cp ON cp.policy_data - >> 'id' = o.customer_id AND cp.policy_data - >> 'region' = ANY ( ARRAY [ 'GLOBAL' , 'REGION_A' , 'REGION_B' , 'REGION_C' , 'REGION_D' ]) AND cp.policy_data - >> 'status' NOT IN ( 'inactive' , 'blocked' , 'suspended' ) AND ( cp.policy_data - >> 'customer_type' = 'issuer' OR cp.policy_data - >> 'customer_type' IN ( 'type1' , 'type2' , 'type3' ) OR get_text_from_path(cp.policy_data - > 'shared_groups' , '$[*]' , 'NONE' ) && ARRAY [ 'GROUP_1' ] ) WHERE EXISTS ( SELECT 1 FROM account_policies ap WHERE ap.policy_id = ARRAY_TO_STRING( ARRAY [ o.data - > 'account' - > 'id' - >> 'system_ref' , o.data - > 'account' - > 'id' - >> 'type_ref' , o.data - > 'account' - > 'id' - >> 'region' , o.data - > 'account' - > 'id' - >> 'id' ], '-' ) AND ap.policy_data - >> 'region' = ANY ( ARRAY [ 'GLOBAL' , 'REGION_A' , 'REGION_B' , 'REGION_C' , 'REGION_D' ]) AND ap.policy_data - >> 'status' NOT IN ( 'inactive' , 'blocked' , 'suspended' ) AND ap.policy_data - >> 'contract' != 'prospect' AND ( ap.policy_data - >> 'account_type' IN ( 'typeX' , 'typeY' , 'typeZ' ) OR get_text_from_path(ap.policy_data - > 'shared_groups' , '$[*]' , 'NONE' ) && ARRAY [ 'GROUP_1' ] ) ) AND o.order_type = 'MONEY_ORDER' AND (o.data - > 'close_date' - >> 'value' > '2024-09-10' OR o.data - > 'close_date' - >> 'value' IS NULL OR o.data - > 'close_date' - >> 'value' = '' ) AND o.display_name_lower LIKE '%current%' GROUP BY o.id, o.data ORDER BY o.id ) SELECT * FROM result_list ORDER BY rowid LIMIT 23 ;
更新后的执行计划分析 Limit (cost=9013972.46..9013972.46 rows=1 width=1789) (actual time=175757.446..175757.449 rows=0 loops=1) Buffers: shared hit=20716139 read=836484 dirtied=1676 -> Sort (cost=9013972.46..9013972.46 rows=1 width=1789) (actual time=174722.049..174722.052 rows=0 loops=1) Sort Key: (row_number() OVER (?)) Sort Method: quicksort Memory: 25kB Buffers: shared hit=20716139 read=836484 dirtied=1676 -> WindowAgg (cost=9013972.40..9013972.45 rows=1 width=1789) (actual time=174722.010..174722.013 rows=0 loops=1) Buffers: shared hit=20716136 read=836484 dirtied=1676 -> GroupAggregate (cost=9013972.40..9013972.44 rows=1 width=1781) (actual time=174722.008..174722.011 rows=0 loops=1) Group Key: positions.id Buffers: shared hit=20716136 read=836484 dirtied=1676 -> Sort (cost=9013972.40..9013972.41 rows=1 width=1906) (actual time=174722.007..174722.009 rows=0 loops=1) Sort Key: positions.id, client_refs.id Sort Method: quicksort Memory: 25kB Buffers: shared hit=20716136 read=836484 dirtied=1676 -> Nested Loop (cost=1.41..9013972.39 rows=1 width=1906) (actual time=174722.001..174722.004 rows=0 loops=1) Buffers: shared hit=20716136 read=836484 dirtied=1676 -> Nested Loop Left Join (cost=0.99..9013961.75 rows=1 width=1933) (actual time=184.358..170945.266 rows=8554 loops=1) Buffers: shared hit=20683249 read=835147 dirtied=1672 -> Nested Loop Left Join (cost=0.56..9013950.91 rows=1 width=1931) (actual time=138.461..146595.622 rows=8554 loops=1) Join Filter: (client_refs.id = pv.client_id) Rows Removed by Join Filter: 328,884,138 Buffers: shared hit=20655467 read=828553 dirtied=1153 -> Nested Loop (cost=0.56..9008615.75 rows=1 width=1736) (actual time=33.568..75336.287 rows=8554 loops=1) Buffers: shared hit=27,870 read=824,707 dirtied=851 -> Seq Scan on positions_view pv (cost=0.00..9008612.97 rows=1 width=78) (actual time=16.384..12504.255 rows=8554 loops=1) Filter: ... Rows Removed by Filter: 32,144,940 Buffers: shared hit=7 read=809,465 -> Index Scan using positions_pkey on positions positions (cost=0.56..2.78 rows=1 width=1685) (actual time=7.327..7.327 rows=1 loops=8554) Index Cond: (id = pv.id) Buffers: shared hit=27,550 read=15,242 dirtied=602 -> Seq Scan on client_refs (cost=0.00..4513.96 rows=65,696 width=195) (actual time=0.004..5.734 rows=38,449 loops=8554) Buffers: shared hit=20,627,597 read=3,846 dirtied=302 -> Index Scan using clients_policy_pkey on clients_policy (cost=0.43..10.84 rows=1 width=26) (actual time=2.841..2.841 rows=0 loops=8554) Index Cond: (policy_id = pv.client_id) Filter: ... Rows Removed by Filter: 1 Buffers: shared hit=27,782 read=6,594 dirtied=519 -> Index Scan using idx_btree_portfolios_policy_id on portfolios_policy (cost=0.42..10.64 rows=1 width=28) (actual time=0.439..0.439 rows=0 loops=8554) Index Cond: (policy_id = pv.portfolio_id) Filter: ... Rows Removed by Filter: 1 Buffers: shared hit=32,887 read=1,337 dirtied=4 --- "Planning:" " Buffers: shared hit=954 read=78 dirtied=9" "Planning Time: 130.627 ms" "JIT:" " Functions: 33" " Options: Inlining true, Optimization true, Expressions true, Deforming true" " Timing: Generation 5.483 ms, Inlining 81.013 ms, Optimization 530.635 ms, Emission 423.850 ms, Total 1040.982 ms" "Execution Time: 175799.036 ms"
执行时间 :不到 3 分钟
缓冲区使用 :
优化点:
• 高效连接 :使用 INNER JOIN
在早期缩小数据集范围。 • 索引列 :通过索引实现快速数据检索,避免全表扫描。 • 简化条件 :移除 WHERE
子句中的部分函数操作,让索引得以生效。 与之前执行计划的对比 :尽管仍需处理大量行,但优化后的查询执行速度显著提升。
阅读原文:原文链接
该文章在 2025/9/20 10:55:05 编辑过