When working with grouped data, you often need either the top N rows per group (e.g., highest scores per category) or a random sample per group. Here’s how to do both efficiently, especially in BigQuery.
You can check also how to get Random Sample per group in Pandas.
1. Selecting Top N Rows Per Group
To get the top N records for each group (based on a sort order), use window functions with ROW_NUMBER():
SELECT *
FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY group_col ORDER BY sort_col DESC) AS rn
FROM my_table
)
WHERE rn <= 3;
Explanation:
PARTITION BYgroups the data bygroup_col.ORDER BYdefines “top” within each group.rn <= 3keeps the top 3 rows per group.
2. Taking a Random Sample Per Group
To randomly sample N rows per group in BigQuery:
SELECT *
FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY group_col ORDER BY RAND()) AS rn
FROM my_table
)
WHERE rn <= 5;
Explanation:
ORDER BY RAND()randomizes rows within each group.- The outer filter keeps the first N (here 5) random rows per group.
3. Example & Demo
You can test the SQL on this link: Get Random N results per group in SQL:
CREATE TABLE yourtable (
year int,
id varchar(20),
rate decimal(10,2)
);
INSERT INTO yourtable VALUES
(2006, 'p01', 8),
(2003, 'p01', 7.4),
(2008, 'p01', 6.8),
(2001, 'p01', 5.9),
(2007, 'p01', 5.3),
(2009, 'p01', 4.4),
(2002, 'p01', 3.9),
(2004, 'p01', 3.5),
(2005, 'p01', 2.1),
(2000, 'p01', 0.8),
(2001, 'p02', 12.5),
(2004, 'p02', 12.4),
(2002, 'p02', 12.2),
(2003, 'p02', 10.3),
(2000, 'p02', 8.7),
(2006, 'p02', 4.6),
(2007, 'p02', 3.3);
SELECT yourtable.*
FROM yourtable INNER JOIN (
SELECT id, GROUP_CONCAT(year ORDER BY rate DESC) grouped_year
FROM yourtable
GROUP BY id) group_max
ON yourtable.id = group_max.id
AND FIND_IN_SET(year, grouped_year) <=3
ORDER BY
yourtable.id, yourtable.year DESC;
The result will be 3 samples for each group:
- p01
- p02
| year | id | rate |
|---|---|---|
| 2008 | p01 | 6.80 |
| 2006 | p01 | 8.00 |
| 2003 | p01 | 7.40 |
| 2004 | p02 | 12.40 |
| 2002 | p02 | 12.20 |
| 2001 | p02 | 12.50 |