GROUP BY
Overview
The GROUP BY clause is a core component in SQL queries used for data grouping and aggregation analysis. With GROUP BY, you can group a dataset by specified columns or expressions and apply aggregate functions (such as SUM(), COUNT(), AVG(), MAX(), MIN(), etc.) to each group to compute summary values.
Singdata Lakehouse fully supports standard SQL GROUP BY syntax and provides advanced aggregation capabilities:
| Feature | Description | Use Case |
|---|---|---|
| Standard GROUP BY | Basic grouping and aggregation | Daily statistics, basic reports |
| GROUPING SETS | Custom multiple grouping sets | Multi-dimensional business analysis |
| ROLLUP | Hierarchical rollup (from detail to total) | Financial reports, sales hierarchy analysis |
| CUBE | Full dimensional cross analysis | BI multi-dimensional data pivot, cross analysis |
Syntax Structure
Key Points
- SELECT clause constraint: Non-aggregate columns in the SELECT list must appear in the GROUP BY clause
- Execution order: FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT
- HAVING filter: Used to filter aggregation results, while WHERE filters original rows
- NULL value handling: NULL is treated as a separate group
Prepare Test Data
This document uses car dealership sales data as an example. Create the test view:
View the original data:
Query result:
| id | city | car_model | quantity |
|---|---|---|---|
| 100 | Fremont | Honda Accord | 15 |
| 100 | Fremont | Honda CRV | 7 |
| 100 | Fremont | Honda Civic | 10 |
| 200 | Dublin | Honda Accord | 10 |
| 200 | Dublin | Honda CRV | 3 |
| 200 | Dublin | Honda Civic | 20 |
| 300 | San Jose | Honda Accord | 8 |
| 300 | San Jose | Honda Civic | 5 |
1. Standard GROUP BY
1.1 Single-Column Grouping
Group by dealer ID and calculate the total quantity for each dealer:
Query result:
| id | total_quantity |
|---|---|
| 100 | 32 |
| 200 | 33 |
| 300 | 13 |
Business interpretation:
- Dealer 100 (Fremont) sold 32 cars
- Dealer 200 (Dublin) sold 33 cars
- Dealer 300 (San Jose) sold 13 cars
1.2 Multi-Column Grouping
Group by the combination of city and car model:
Query result:
| id | city | total_quantity |
|---|---|---|
| 200 | Dublin | 33 |
| 100 | Fremont | 32 |
| 300 | San Jose | 13 |
In this example, each dealer ID corresponds to a unique city, so the result is the same as single-column grouping.
1.3 Grouping with Expressions
Use a CASE expression to segment sales volumes:
Query result:
| sales_level | record_count | total_quantity |
|---|---|---|
| High (>= 15) | 2 | 35 |
| Low (< 10) | 4 | 23 |
| Medium (10-14) | 2 | 20 |
Business interpretation:
- High-volume records (>= 15 cars): 2 records, 35 cars total
- Low-volume records (< 10 cars): 4 records, 23 cars total
- Medium-volume records (10-14 cars): 2 records, 20 cars total
1.4 Global Aggregation (No Grouping)
When GROUP BY is not used, a global aggregation is performed on the entire table:
Query result:
| total_quantity |
|---|
| 78 |
The total quantity across all 8 records is 78 cars.
2. Combining Multiple Aggregate Functions
2.1 Common Aggregate Functions Summary
Use multiple aggregate functions in a single query:
Query result:
| city | model_count | total_quantity | avg_quantity | max_quantity | min_quantity |
|---|---|---|---|---|---|
| Dublin | 3 | 33 | 11.0 | 20 | 3 |
| Fremont | 3 | 32 | 10.666666666666666 | 15 | 7 |
| San Jose | 2 | 13 | 6.5 | 8 | 5 |
Business insights:
- Dublin: 3 car models, average 11 cars per model, highest single item 20 (Honda Civic)
- Fremont: 3 car models, average 10.67 cars, relatively balanced distribution
- San Jose: Only 2 car models, average 6.5 cars, the smallest market
2.2 Statistics by Car Model Covering Number of Cities
Count how many cities each car model is sold in:
Query result:
| car_model | total_sales | city_count | avg_per_dealer |
|---|---|---|---|
| Honda Civic | 35 | 3 | 11.67 |
| Honda Accord | 33 | 3 | 11.00 |
| Honda CRV | 10 | 2 | 5.00 |
Business insights:
- Honda Civic: Best-selling model, total sales 35 cars, covering 3 cities, average 11.67 per location
- Honda Accord: Total sales 33 cars, covering 3 cities
- Honda CRV: Lowest total sales, sold in only 2 cities (not present in San Jose market)
3. GROUPING SETS (Custom Grouping Sets)
3.1 Basic Concepts
GROUPING SETS allows generating multiple grouping results at different dimensions in a single query. It is essentially a shorthand for combining multiple GROUP BY queries via UNION ALL.
Syntax:
3.2 Multi-Dimensional Aggregation Analysis
Generate four-dimensional statistical results in a single query:
Query result:
| city | car_model | total_quantity |
|---|---|---|
| NULL | NULL | 78 |
| NULL | Honda Accord | 33 |
| NULL | Honda CRV | 10 |
| NULL | Honda Civic | 35 |
| Dublin | NULL | 33 |
| Dublin | Honda Accord | 10 |
| Dublin | Honda CRV | 3 |
| Dublin | Honda Civic | 20 |
| Fremont | NULL | 32 |
| Fremont | Honda Accord | 15 |
| Fremont | Honda CRV | 7 |
| Fremont | Honda Civic | 10 |
| San Jose | NULL | 13 |
| San Jose | Honda Accord | 8 |
| San Jose | Honda Civic | 5 |
Result interpretation:
- Grand total row (city=NULL, car_model=NULL): Total sales 78 cars
- Model subtotal rows (city=NULL, car_model has value): Cross-city totals for each car model
- City subtotal rows (city has value, car_model=NULL): Cross-model totals for each city
- Detail rows (both city and car_model have values): Specific city + model combination sales
Equivalent query:
Performance advantage: GROUPING SETS only scans the data once internally, making it more efficient than UNION ALL.
3.3 Identifying Aggregation Levels
Use the COALESCE function to replace NULL values, making aggregation rows more readable:
Query result:
| city | car_model | total_quantity |
|---|---|---|
| All Cities | All Models | 78 |
| Dublin | All Models | 33 |
| Dublin | Honda Accord | 10 |
| Dublin | Honda CRV | 3 |
| Dublin | Honda Civic | 20 |
| Fremont | All Models | 32 |
| Fremont | Honda Accord | 15 |
| Fremont | Honda CRV | 7 |
| Fremont | Honda Civic | 10 |
| San Jose | All Models | 13 |
| San Jose | Honda Accord | 8 |
| San Jose | Honda Civic | 5 |
4. ROLLUP (Hierarchical Rollup)
4.1 Basic Concepts
ROLLUP generates hierarchical rollups from fine granularity to global. For ROLLUP(a, b, c), the following grouping sets are generated:
(a, b, c)- Finest granularity(a, b)- Intermediate level(a)- Higher level()- Global aggregation
Equivalence relationship:
4.2 Sales Hierarchy Rollup
Generate "city → car model" two-level rollup:
Query result:
| city | car_model | total_quantity |
|---|---|---|
| Fremont | Honda Civic | 10 |
| Fremont | Honda Accord | 15 |
| Fremont | Honda CRV | 7 |
| Dublin | Honda Civic | 20 |
| Dublin | Honda Accord | 10 |
| Dublin | Honda CRV | 3 |
| San Jose | Honda Civic | 5 |
| San Jose | Honda Accord | 8 |
| Fremont | NULL | 32 |
| Dublin | NULL | 33 |
| San Jose | NULL | 13 |
| NULL | NULL | 78 |
Data hierarchy:
- Detail layer (first 8 rows): Sales of each car model in each city
- City subtotals (car_model=NULL): Total sales for each city
- Grand total (city=NULL, car_model=NULL): Total sales across all cities
4.3 Using the GROUPING() Function to Identify Levels
The GROUPING() function returns 0 or 1, used to distinguish real NULLs from aggregation markers:
- Returns 0: The column participates in grouping, the value is actual data
- Returns 1: The column does not participate in grouping, NULL indicates aggregation
Query result:
| city | car_model | total_quantity | city_grouping | model_grouping |
|---|---|---|---|---|
| Dublin | Honda Accord | 10 | 0 | 0 |
| Dublin | Honda CRV | 3 | 0 | 0 |
| Dublin | Honda Civic | 20 | 0 | 0 |
| Dublin | NULL | 33 | 0 | 1 |
| Fremont | Honda Accord | 15 | 0 | 0 |
| Fremont | Honda CRV | 7 | 0 | 0 |
| Fremont | Honda Civic | 10 | 0 | 0 |
| Fremont | NULL | 32 | 0 | 1 |
| San Jose | Honda Accord | 8 | 0 | 0 |
| San Jose | Honda Civic | 5 | 0 | 0 |
| San Jose | NULL | 13 | 0 | 1 |
| NULL | NULL | 78 | 1 | 1 |
GROUPING() value interpretation:
- (0, 0): Detail data, both columns participate in grouping
- (0, 1): City subtotal, only city participates in grouping
- (1, 1): Grand total, neither column participates in grouping
5. CUBE (Full Dimensional Cross Analysis)
5.1 Basic Concepts
CUBE generates grouping sets for all possible combinations of the specified columns. For CUBE(a, b), it generates 22 = 4 grouping sets:
(a, b)- Detail(a)- Aggregated by a(b)- Aggregated by b()- Global aggregation
Equivalence relationship:
5.2 Multi-Dimensional Cross Analysis
Generate all dimension combinations for city and car model:
Query result:
| city | car_model | total_quantity |
|---|---|---|
| Fremont | Honda Civic | 10 |
| Fremont | Honda Accord | 15 |
| Fremont | Honda CRV | 7 |
| Dublin | Honda Civic | 20 |
| Dublin | Honda Accord | 10 |
| Dublin | Honda CRV | 3 |
| San Jose | Honda Civic | 5 |
| San Jose | Honda Accord | 8 |
| NULL | Honda CRV | 10 |
| NULL | Honda Civic | 35 |
| NULL | Honda Accord | 33 |
| Fremont | NULL | 32 |
| Dublin | NULL | 33 |
| San Jose | NULL | 13 |
| NULL | NULL | 78 |
Result includes:
- Detail rows (first 8 rows): City + model combinations
- Model subtotal rows (city=NULL): Cross-city model sales
- City subtotal rows (car_model=NULL): Cross-model city sales
- Grand total row (city=NULL, car_model=NULL): Global total
5.3 Identifying Aggregation Levels
Use a CASE statement with the GROUPING() function to identify the aggregation level of each row:
Query result:
| city | car_model | total_quantity | aggregation_level |
|---|---|---|---|
| Dublin | Honda Accord | 10 | Detail |
| Dublin | Honda CRV | 3 | Detail |
| Dublin | Honda Civic | 20 | Detail |
| Fremont | Honda Accord | 15 | Detail |
| Fremont | Honda CRV | 7 | Detail |
| Fremont | Honda Civic | 10 | Detail |
| San Jose | Honda Accord | 8 | Detail |
| San Jose | Honda Civic | 5 | Detail |
| Dublin | NULL | 33 | City Subtotal |
| Fremont | NULL | 32 | City Subtotal |
| San Jose | NULL | 13 | City Subtotal |
| NULL | Honda Accord | 33 | Model Subtotal |
| NULL | Honda CRV | 10 | Model Subtotal |
| NULL | Honda Civic | 35 | Model Subtotal |
| NULL | NULL | 78 | Total |
This query clearly identifies the aggregation level to which each row belongs.
5.4 ROLLUP vs CUBE
| Dimension | ROLLUP | CUBE |
|---|---|---|
| Generated grouping sets | Hierarchical subset (left to right) | All possible combinations |
| ROLLUP(a,b) | (a,b), (a), () | (a,b), (a), (b), () |
| Number of groupings | n+1 (n = number of columns) | 2n |
| Use case | Hierarchical reports, top-down analysis | Multi-dimensional cross analysis, BI pivot tables |
| Performance | Fewer grouping sets, better performance | More grouping sets, higher performance overhead |
Selection suggestions:
- Use
ROLLUPwhen you need hierarchical rollups (e.g., Region → City → Store) - Use
CUBEwhen you need multi-dimensional cross analysis (e.g., Region x Product x Time) - Use
GROUPING SETSwhen you need custom combinations
6. HAVING Clause: Filtering Aggregation Results
6.1 WHERE vs HAVING
| Clause | Execution Timing | Target | Available Functions |
|---|---|---|---|
| WHERE | Before grouping | Original row data | Scalar functions, column comparisons |
| HAVING | After grouping | Aggregation results | Aggregate functions, GROUP BY columns |
Execution order: FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT
6.2 Using HAVING to Filter Aggregation Results
Filter cities with total sales exceeding 15 cars:
Query result:
| city | record_count | total_quantity |
|---|---|---|
| Dublin | 3 | 33 |
| Fremont | 3 | 32 |
San Jose (total sales 13 cars) is filtered out because it does not meet the HAVING condition.
6.3 Combining WHERE and HAVING
Filter original data with WHERE first, then filter aggregation results with HAVING:
Execution logic:
- WHERE filter: Exclude records where quantity < 10 (filters out 4 records)
- GROUP BY grouping: Group remaining records by city
- HAVING filter: Keep only groups with record count >= 2
7. Advanced Use Cases
7.1 Row-to-Column: Pivot Analysis
Convert car models from rows to columns to achieve a pivot table effect:
Query result:
| city | civic_sales | accord_sales | crv_sales | total_sales |
|---|---|---|---|---|
| Dublin | 20 | 10 | 3 | 33 |
| Fremont | 10 | 15 | 7 | 32 |
| San Jose | 5 | 8 | 0 | 13 |
Business value:
- Clearly shows the sales distribution of each car model in each city
- San Jose has no CRV sales records (shown as 0)
- Suitable for Excel reports and BI dashboards
7.2 Proportion Analysis: Window Functions and GROUP BY
Calculate the sales proportion of each car model within its city:
Query result:
| city | car_model | quantity | city_total | percentage |
|---|---|---|---|---|
| Dublin | Honda Accord | 10 | 33 | 30.30 |
| Dublin | Honda CRV | 3 | 33 | 9.09 |
| Dublin | Honda Civic | 20 | 33 | 60.61 |
| Fremont | Honda Accord | 15 | 32 | 46.88 |
| Fremont | Honda CRV | 7 | 32 | 21.88 |
| Fremont | Honda Civic | 10 | 32 | 31.25 |
| San Jose | Honda Accord | 8 | 13 | 61.54 |
| San Jose | Honda Civic | 5 | 13 | 38.46 |
Business insights:
- Dublin: Honda Civic accounts for 60.61%, the absolute flagship model
- Fremont: Sales distribution is relatively balanced, Accord accounts for the highest (46.88%)
- San Jose: Accord accounts for 61.54%, clear market preference
Technical note: The window function SUM() OVER (PARTITION BY city) calculates the total sales for each city without changing the number of rows.
7.3 Year-over-Year and Month-over-Month Analysis Pattern
Although the sample data does not include a time dimension, here is a standard YoY/MoM analysis SQL pattern:
Key techniques:
LAG()window function to get the previous row's value- MoM growth rate = (current value - previous value) / previous value x 100%
7.4 TopN Analysis: Taking Top N per Group
Use window functions combined with GROUP BY to implement grouped TopN:
Expected logic (based on current data):
- Dublin: Honda Civic (20), Honda Accord (10)
- Fremont: Honda Accord (15), Honda Civic (10)
- San Jose: Honda Accord (8), Honda Civic (5)
8. Performance Optimization and Best Practices
8.1 Query Performance Optimization
1. Use LIMIT Appropriately
For large datasets, always add LIMIT to restrict the number of returned rows:
2. Choose the Right Aggregation Tool
| Requirement | Recommended Solution | Grouping Sets Generated | Performance |
|---|---|---|---|
| Simple hierarchical rollup | ROLLUP | n+1 | Highest |
| Multi-dimensional cross analysis | CUBE | 2n | Moderate |
| Custom grouping sets | GROUPING SETS | Custom | High |
Example: For 3-column grouping
ROLLUP(a,b,c)generates 4 grouping setsCUBE(a,b,c)generates 8 grouping setsGROUPING SETScan customize any combination
3. Reduce Grouping Column Cardinality
Use functions to reduce granularity on high-cardinality columns (e.g., ID, timestamp):
4. WHERE Filtering is Better than HAVING Filtering
Try to filter data in WHERE to reduce the data volume processed by GROUP BY:
8.2 Common Errors and Solutions
Error 1: SELECT Column Not in GROUP BY
Error 2: Using Aggregate Functions in WHERE
Error 3: Confusing ROLLUP/CUBE Syntax
Error 4: Ignoring NULL Value Impact
8.3 Data Quality Checks
Verify Grouping Completeness
Verify that the sum of group subtotals equals the grand total:
9. Practical Case Summary
9.1 Sales Analysis Report
Requirement: Generate a sales report with detail rows, subtotals, and grand total
9.2 Multi-Dimensional Cross Analysis
Requirement: BI pivot table, needing all dimension combinations of city and car model
9.3 Top/Bottom N Analysis
Requirement: Find the cities with the highest and lowest sales
9.4 Proportion and Ranking
Requirement: Calculate the sales proportion and ranking of each car model
10. Feature Comparison and Selection Guide
10.1 Comparison of Four Grouping Methods
| Feature | GROUP BY | GROUPING SETS | ROLLUP | CUBE |
|---|---|---|---|---|
| Syntax complexity | Simple | Moderate | Simple | Simple |
| Grouping sets generated | 1 | Custom | n+1 | 2n |
| Flexibility | Low | High | Medium | Medium |
| Performance | Best | Good | Good | Poorer (with many columns) |
| Applicable scenarios | Basic statistics | Custom multi-dimensional analysis | Hierarchical reports | Full dimensional cross analysis |
10.2 Selection Decision Tree
10.3 Recommended Use Cases
| Business Scenario | Recommended Solution | Example |
|---|---|---|
| Daily statistics and reports | GROUP BY | Statistics by department |
| Financial hierarchical rollup | ROLLUP | Company → Department → Team hierarchical rollup |
| BI data pivot tables | CUBE | Product x Region x Time three-dimensional analysis |
| Custom business reports | GROUPING SETS | Generate daily, weekly, and monthly reports simultaneously |
| Proportion analysis | GROUP BY + window functions | Sales proportion of each product |
| Top N analysis | GROUP BY + LIMIT | Top 10 products by sales |
11. Advanced Techniques
11.1 Dynamic Grouping
Use variables or subqueries to implement dynamic grouping conditions (example logic):
11.2 Recursive Rollup (Simulated)
Although Singdata Lakehouse does not yet directly support recursive CTE, you can simulate multi-level rollups via UNION ALL:
11.3 NULL Value Handling Tips
Tip 1: Distinguish Real NULLs from Aggregation NULLs
Tip 2: Use COALESCE to Beautify Output
12. Summary
Key Takeaways
- GROUP BY is the foundation of data aggregation analysis; mastering its usage is an essential SQL analysis skill
- GROUPING SETS / ROLLUP / CUBE provide powerful multi-dimensional analysis capabilities, generating multi-level rollups in a single query
- HAVING is used to filter aggregation results, working with WHERE for flexible data filtering
- Window functions combined with GROUP BY enable complex proportion, ranking, and period-over-period analysis
- Performance optimization requires balancing flexibility and efficiency, choosing the right grouping method
Learning Path Suggestions
- Basic stage: Master standard GROUP BY and common aggregate functions
- Advanced stage: Learn the use cases of GROUPING SETS, ROLLUP, and CUBE
- Expert stage: Combine window functions and CTE for complex business analysis
- Practical stage: Apply to real business scenarios and optimize query performance
