一、窗口函数是什么?为什么它能解决复杂分析问题?

说到SQL分析,很多人第一时间想到的就是GROUP BY。但当你需要同时保留明细数据又要做聚合计算时,GROUP BY就显得力不从心了。这时候窗口函数就像一把瑞士军刀,能优雅地解决这类问题。

窗口函数的特殊之处在于,它不会像GROUP BY那样折叠行,而是在每一行旁边"开个窗口",在这个窗口范围内做计算。比如你想知道每个员工的薪水在部门内的排名,同时又需要保留每个人的详细信息,窗口函数就是最佳选择。

举个例子(使用PostgreSQL 14):

-- 计算部门内员工薪水排名
SELECT 
    emp_name,
    dept_name,
    salary,
    RANK() OVER (PARTITION BY dept_name ORDER BY salary DESC) as dept_rank
FROM employees;
/*
emp_name | dept_name | salary | dept_rank
---------+-----------+--------+----------
张三     | 研发部    | 15000  | 1        
李四     | 研发部    | 12000  | 2        
王五     | 销售部    | 18000  | 1        
*/

二、窗口函数的四大金刚:分区、排序、框架和函数

窗口函数有四个核心组成部分,理解它们的关系非常重要:

  1. PARTITION BY - 相当于GROUP BY的分组
  2. ORDER BY - 决定窗口内的排序方式
  3. 框架子句 - 定义窗口范围
  4. 窗口函数本身 - 执行具体计算

来看个综合示例:

-- 计算每个部门内员工的累计薪水
SELECT 
    emp_id,
    emp_name,
    dept_name,
    salary,
    SUM(salary) OVER (
        PARTITION BY dept_name 
        ORDER BY hire_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) as running_total
FROM employees;
/*
emp_id | emp_name | dept_name | salary | running_total
-------+----------+-----------+--------+--------------
101    | 张三     | 研发部    | 15000  | 15000        
103    | 李四     | 研发部    | 12000  | 27000        
105    | 王五     | 销售部    | 18000  | 18000        
*/

框架子句特别有意思,它支持多种定义方式:

  • ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING - 前后各扩展一行
  • RANGE BETWEEN INTERVAL '7' DAY PRECEDING AND CURRENT ROW - 按时间范围

三、高级应用场景:从简单排名到复杂会话分析

3.1 会话分割:识别用户连续活动

电商场景中,我们经常需要识别用户的连续访问是否属于同一次会话。下面这个例子展示了如何使用窗口函数实现:

-- 识别用户会话(30分钟不活动视为新会话)
SELECT 
    user_id,
    event_time,
    event_type,
    SUM(is_new_session) OVER (PARTITION BY user_id ORDER BY event_time) as session_id
FROM (
    SELECT 
        user_id,
        event_time,
        event_type,
        CASE WHEN EXTRACT(EPOCH FROM (event_time - LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time))) > 1800 
             THEN 1 ELSE 0 END as is_new_session
    FROM user_events
) t;
/*
user_id | event_time          | event_type | session_id
--------+---------------------+------------+-----------
1001    | 2023-01-01 10:00:00 | 浏览       | 1         
1001    | 2023-01-01 10:05:00 | 点击       | 1         
1001    | 2023-01-01 11:00:00 | 购买       | 2         
*/

3.2 移动平均与趋势分析

金融分析中经常需要计算移动平均:

-- 计算7天移动平均股价
SELECT 
    trade_date,
    stock_code,
    closing_price,
    AVG(closing_price) OVER (
        PARTITION BY stock_code 
        ORDER BY trade_date
        RANGE BETWEEN INTERVAL '3' DAY PRECEDING AND INTERVAL '3' DAY FOLLOWING
    ) as ma_7day
FROM stock_daily;
/*
trade_date | stock_code | closing_price | ma_7day
-----------+------------+---------------+--------
2023-06-01 | 600519     | 1850.00       | 1832.50
2023-06-02 | 600519     | 1865.00       | 1845.00
2023-06-05 | 600519     | 1820.00       | 1840.00
*/

四、性能优化与避坑指南

4.1 索引优化策略

窗口函数的性能很大程度上依赖于PARTITION BY和ORDER BY的列。为这些列创建合适的索引能显著提升性能:

-- 为窗口函数常用列创建索引
CREATE INDEX idx_employees_dept ON employees(dept_name, hire_date);

4.2 常见陷阱

  1. 框架范围过大:避免使用UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING,除非确实需要
  2. 排序不稳定:当ORDER BY的值相同时,不同数据库可能返回不同结果
  3. 嵌套过深:多层嵌套窗口函数会导致性能急剧下降
-- 错误示例:过度使用窗口函数
SELECT 
    emp_id,
    AVG(salary) OVER (PARTITION BY dept_id) as avg_salary,
    RANK() OVER (ORDER BY AVG(salary) OVER (PARTITION BY dept_id)) as rank_by_avg -- 嵌套窗口函数
FROM employees;

4.3 替代方案对比

有时候CTE(Common Table Expression)可能是更好的选择:

-- 使用CTE优化复杂窗口函数
WITH dept_avg AS (
    SELECT dept_id, AVG(salary) as avg_salary
    FROM employees
    GROUP BY dept_id
)
SELECT 
    e.emp_id,
    d.avg_salary,
    RANK() OVER (ORDER BY d.avg_salary) as rank_by_avg
FROM employees e
JOIN dept_avg d ON e.dept_id = d.dept_id;

五、实战案例:电商用户行为漏斗分析

让我们用一个完整的电商分析案例收尾:

-- 电商用户转化漏斗分析
WITH user_events_marked AS (
    SELECT 
        user_id,
        event_time,
        event_type,
        -- 标记关键事件
        CASE WHEN event_type = '浏览商品' THEN 1 ELSE 0 END as view_product,
        CASE WHEN event_type = '加入购物车' THEN 1 ELSE 0 END as add_cart,
        CASE WHEN event_type = '提交订单' THEN 1 ELSE 0 END as place_order,
        -- 会话标记
        SUM(CASE WHEN EXTRACT(EPOCH FROM (event_time - LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time))) > 1800 
                 THEN 1 ELSE 0 END) OVER (PARTITION BY user_id ORDER BY event_time) as session_id
    FROM user_events
    WHERE event_time BETWEEN '2023-06-01' AND '2023-06-30'
),
session_stats AS (
    SELECT 
        user_id,
        session_id,
        MAX(view_product) as has_view,
        MAX(add_cart) as has_add,
        MAX(place_order) as has_order
    FROM user_events_marked
    GROUP BY user_id, session_id
)
SELECT 
    COUNT(*) as total_sessions,
    SUM(has_view) as view_sessions,
    SUM(has_add) as add_sessions,
    SUM(has_order) as order_sessions,
    ROUND(100.0 * SUM(has_add) / NULLIF(SUM(has_view), 0), 2) as view_to_add_rate,
    ROUND(100.0 * SUM(has_order) / NULLIF(SUM(has_add), 0), 2) as add_to_order_rate
FROM session_stats;
/*
total_sessions | view_sessions | add_sessions | order_sessions | view_to_add_rate | add_to_order_rate
---------------+---------------+--------------+----------------+------------------+------------------
12560          | 9840          | 5420         | 3210           | 55.08            | 59.23           
*/

这个案例展示了如何组合使用多种窗口函数技术,包括会话分割、事件标记和漏斗计算,最终输出完整的转化率分析。

窗口函数就像SQL中的超级武器,能解决传统GROUP BY难以处理的复杂分析场景。掌握它不仅能写出更简洁的SQL,还能大幅提升分析效率。不过也要注意合理使用,避免过度嵌套导致的性能问题。下次当你遇到既要聚合又要保留明细的需求时,不妨试试窗口函数这个利器。