一、窗口函数是什么?为什么它能解决复杂分析问题?
说到SQL分析,很多人第一时间想到的就是GROUP BY。但当你需要同时保留明细数据又要做聚合计算时,GROUP BY就显得力不从心了。这时候窗口函数就像一把瑞士军刀,能优雅地解决这类问题。
窗口函数的特殊之处在于,它不会像GROUP BY那样折叠行,而是在每一行旁边"开个窗口",在这个窗口范围内做计算。比如你想知道每个员工的薪水在部门内的排名,同时又需要保留每个人的详细信息,窗口函数就是最佳选择。
举个例子(使用PostgreSQL 14):
-- 计算部门内员工薪水排名
SELECT
emp_name,
dept_name,
salary,
RANK() OVER (PARTITION BY dept_name ORDER BY salary DESC) as dept_rank
FROM employees;
/*
emp_name | dept_name | salary | dept_rank
---------+-----------+--------+----------
张三 | 研发部 | 15000 | 1
李四 | 研发部 | 12000 | 2
王五 | 销售部 | 18000 | 1
*/
二、窗口函数的四大金刚:分区、排序、框架和函数
窗口函数有四个核心组成部分,理解它们的关系非常重要:
- PARTITION BY - 相当于GROUP BY的分组
- ORDER BY - 决定窗口内的排序方式
- 框架子句 - 定义窗口范围
- 窗口函数本身 - 执行具体计算
来看个综合示例:
-- 计算每个部门内员工的累计薪水
SELECT
emp_id,
emp_name,
dept_name,
salary,
SUM(salary) OVER (
PARTITION BY dept_name
ORDER BY hire_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) as running_total
FROM employees;
/*
emp_id | emp_name | dept_name | salary | running_total
-------+----------+-----------+--------+--------------
101 | 张三 | 研发部 | 15000 | 15000
103 | 李四 | 研发部 | 12000 | 27000
105 | 王五 | 销售部 | 18000 | 18000
*/
框架子句特别有意思,它支持多种定义方式:
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING- 前后各扩展一行RANGE BETWEEN INTERVAL '7' DAY PRECEDING AND CURRENT ROW- 按时间范围
三、高级应用场景:从简单排名到复杂会话分析
3.1 会话分割:识别用户连续活动
电商场景中,我们经常需要识别用户的连续访问是否属于同一次会话。下面这个例子展示了如何使用窗口函数实现:
-- 识别用户会话(30分钟不活动视为新会话)
SELECT
user_id,
event_time,
event_type,
SUM(is_new_session) OVER (PARTITION BY user_id ORDER BY event_time) as session_id
FROM (
SELECT
user_id,
event_time,
event_type,
CASE WHEN EXTRACT(EPOCH FROM (event_time - LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time))) > 1800
THEN 1 ELSE 0 END as is_new_session
FROM user_events
) t;
/*
user_id | event_time | event_type | session_id
--------+---------------------+------------+-----------
1001 | 2023-01-01 10:00:00 | 浏览 | 1
1001 | 2023-01-01 10:05:00 | 点击 | 1
1001 | 2023-01-01 11:00:00 | 购买 | 2
*/
3.2 移动平均与趋势分析
金融分析中经常需要计算移动平均:
-- 计算7天移动平均股价
SELECT
trade_date,
stock_code,
closing_price,
AVG(closing_price) OVER (
PARTITION BY stock_code
ORDER BY trade_date
RANGE BETWEEN INTERVAL '3' DAY PRECEDING AND INTERVAL '3' DAY FOLLOWING
) as ma_7day
FROM stock_daily;
/*
trade_date | stock_code | closing_price | ma_7day
-----------+------------+---------------+--------
2023-06-01 | 600519 | 1850.00 | 1832.50
2023-06-02 | 600519 | 1865.00 | 1845.00
2023-06-05 | 600519 | 1820.00 | 1840.00
*/
四、性能优化与避坑指南
4.1 索引优化策略
窗口函数的性能很大程度上依赖于PARTITION BY和ORDER BY的列。为这些列创建合适的索引能显著提升性能:
-- 为窗口函数常用列创建索引
CREATE INDEX idx_employees_dept ON employees(dept_name, hire_date);
4.2 常见陷阱
- 框架范围过大:避免使用
UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING,除非确实需要 - 排序不稳定:当ORDER BY的值相同时,不同数据库可能返回不同结果
- 嵌套过深:多层嵌套窗口函数会导致性能急剧下降
-- 错误示例:过度使用窗口函数
SELECT
emp_id,
AVG(salary) OVER (PARTITION BY dept_id) as avg_salary,
RANK() OVER (ORDER BY AVG(salary) OVER (PARTITION BY dept_id)) as rank_by_avg -- 嵌套窗口函数
FROM employees;
4.3 替代方案对比
有时候CTE(Common Table Expression)可能是更好的选择:
-- 使用CTE优化复杂窗口函数
WITH dept_avg AS (
SELECT dept_id, AVG(salary) as avg_salary
FROM employees
GROUP BY dept_id
)
SELECT
e.emp_id,
d.avg_salary,
RANK() OVER (ORDER BY d.avg_salary) as rank_by_avg
FROM employees e
JOIN dept_avg d ON e.dept_id = d.dept_id;
五、实战案例:电商用户行为漏斗分析
让我们用一个完整的电商分析案例收尾:
-- 电商用户转化漏斗分析
WITH user_events_marked AS (
SELECT
user_id,
event_time,
event_type,
-- 标记关键事件
CASE WHEN event_type = '浏览商品' THEN 1 ELSE 0 END as view_product,
CASE WHEN event_type = '加入购物车' THEN 1 ELSE 0 END as add_cart,
CASE WHEN event_type = '提交订单' THEN 1 ELSE 0 END as place_order,
-- 会话标记
SUM(CASE WHEN EXTRACT(EPOCH FROM (event_time - LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time))) > 1800
THEN 1 ELSE 0 END) OVER (PARTITION BY user_id ORDER BY event_time) as session_id
FROM user_events
WHERE event_time BETWEEN '2023-06-01' AND '2023-06-30'
),
session_stats AS (
SELECT
user_id,
session_id,
MAX(view_product) as has_view,
MAX(add_cart) as has_add,
MAX(place_order) as has_order
FROM user_events_marked
GROUP BY user_id, session_id
)
SELECT
COUNT(*) as total_sessions,
SUM(has_view) as view_sessions,
SUM(has_add) as add_sessions,
SUM(has_order) as order_sessions,
ROUND(100.0 * SUM(has_add) / NULLIF(SUM(has_view), 0), 2) as view_to_add_rate,
ROUND(100.0 * SUM(has_order) / NULLIF(SUM(has_add), 0), 2) as add_to_order_rate
FROM session_stats;
/*
total_sessions | view_sessions | add_sessions | order_sessions | view_to_add_rate | add_to_order_rate
---------------+---------------+--------------+----------------+------------------+------------------
12560 | 9840 | 5420 | 3210 | 55.08 | 59.23
*/
这个案例展示了如何组合使用多种窗口函数技术,包括会话分割、事件标记和漏斗计算,最终输出完整的转化率分析。
窗口函数就像SQL中的超级武器,能解决传统GROUP BY难以处理的复杂分析场景。掌握它不仅能写出更简洁的SQL,还能大幅提升分析效率。不过也要注意合理使用,避免过度嵌套导致的性能问题。下次当你遇到既要聚合又要保留明细的需求时,不妨试试窗口函数这个利器。
评论