Map Join Optimization
Introduction
Map Join is an efficient JOIN optimization in Lakehouse, particularly suited for small table and large table JOIN scenarios. Map Join broadcasts the small table to each compute node and completes the JOIN directly in the Map phase, avoiding expensive Shuffle and Reduce operations, thereby saving resources and improving query performance.
Syntax
Add the /*+ MAPJOIN(table_alias) */ hint to the query, where table_alias is the alias of the small table to be broadcast into memory:
Advantages
- Eliminates the Shuffle phase, reducing network transfer and disk I/O overhead.
- Avoids data skew issues, since data does not need to be distributed by JOIN column.
- Significantly improves query speed for small table JOIN large table scenarios.
Notes
- The small table must fit entirely in memory; otherwise it may cause out-of-memory errors. Lakehouse currently limits the small table size to 1 GB.
- Only suitable for small table and large table JOINs; not appropriate for large table and large table scenarios.
Usage Examples
Example 1: Employee and Department Join Query
The departments table (3 rows) is the small table; the employees table (5 rows) is the large table. Broadcast departments to each node to complete the JOIN.
Result:
Example 2: Order and Product Join Query
The products table (5 rows) is the small table; the orders table is the large table.
Example 3: Specifying Multiple Broadcast Tables in a Multi-Table JOIN
Multiple small tables can be specified in the same hint:
Notes
- The table name in the
MAPJOINhint is the alias used in the query, not the original table name. - If the small table exceeds 1 GB, Lakehouse will ignore the hint and automatically fall back to a regular JOIN.
- Map Join does not support FULL OUTER JOIN or RIGHT OUTER JOIN (the small table must be on the right side).
- In Lakehouse, the optimizer also automatically selects Map Join based on statistics, so you don't need to add the hint manually every time.
