TABLESAMPLE - Data Sampling

Overview

TABLESAMPLE is an efficient data sampling method provided by the Singdata Lakehouse platform, supporting random sampling based on probability or a fixed number of rows. With two different sampling strategies (SYSTEM and ROW), you can flexibly balance between performance and accuracy to meet a variety of needs, from rapid data exploration to precise statistical analysis.

Core Features

  • Flexible Sampling Methods: Supports percentage-based sampling and fixed-row-count sampling
  • Dual Sampling Strategies: Provides both file-level (SYSTEM) and row-level (ROW) modes
  • High-Performance Design: SYSTEM mode is optimized for big data scenarios
  • Precise Control: ROW mode delivers accurate random sampling results

Typical Use Cases

ScenarioRecommended ModeDescription
Quick data previewSYSTEMHigh performance, suitable for quickly browsing large tables
Data quality checkSYSTEMRapidly sample data to verify quality
Machine learning training setROWPrecise random sampling to ensure sample representativeness
Statistical analysisROWAccurate probability sampling meeting statistical requirements
Development/test data generationROWGenerate small-scale test datasets
Large-scale data analysisSYSTEMEfficient sampling of datasets with millions of rows or more

Syntax

SELECT <column_list> FROM <table_name> TABLESAMPLE [ROW | SYSTEM] ( { <percentage> | <num> ROWS } ) [ LIMIT <n> ] [ ...other clauses... ]

Parameter Description

Sampling Type

TypeDescriptionApplicable ScenariosPerformance
ROWRow-level random sampling - Each row is independently evaluated for sampling, with precise result row countsSmall to medium datasets (< 1 million rows); scenarios requiring precise randomnessSlower, needs to scan all rows
SYSTEMFile-level random sampling - Randomly selects storage file blocks for higher performanceLarge datasets (> 1 million rows); quick data explorationExtremely fast, only reads partial files
DefaultAutomatically uses SYSTEM mode when not specifiedGeneral purposeHigh performance

Sampling Quantity Specification

FormatDescriptionExampleNotes
<percentage>Sample by percentage, range 0-10030 means sample 30%Actual row count may fluctuate
<num> ROWSSpecify an exact number of rows to sample5 ROWS means sample 5 rowsMore precise in ROW mode

āš ļø Important Notes:

  • Percentage-based sampling in SYSTEM mode may be imprecise (especially for small datasets)
  • It is recommended to add a LIMIT clause to the query to optimize performance
  • Sampling results are random; each execution may return different data

Usage Examples

Prepare Test Data

-- Create a test view CREATE OR REPLACE VIEW test(id, name) AS VALUES ( 1, 'Lisa'), ( 2, 'Mary'), ( 3, 'Evan'), ( 4, 'Fred'), ( 5, 'Alex'), ( 6, 'Mark'), ( 7, 'Lily'), ( 8, 'Lucy'), ( 9, 'Eric'), (10, 'Adam'); -- Create a test table CREATE TABLE employee (id INT, name STRING); INSERT INTO employee VALUES ( 1, 'Lisa'), ( 2, 'Mary'), ( 3, 'Evan'), ( 4, 'Fred'), ( 5, 'Alex'), ( 6, 'Mark'), ( 7, 'Lily'), ( 8, 'Lucy'), ( 9, 'Eric'), (10, 'Adam');

Basic Sampling Examples

1. Percentage Sampling (SYSTEM Mode)

-- Randomly sample 30% of the data (default SYSTEM mode) SELECT * FROM test TABLESAMPLE (30) LIMIT 50; -- Result: approximately returns 3 rows (30% x 10 rows) -- Note: on small datasets, SYSTEM mode may return all data or be imprecise

2. Fixed-Row Sampling (SYSTEM Mode)

-- Randomly sample 5 rows of data SELECT * FROM test TABLESAMPLE (5 ROWS) LIMIT 50; -- Result: approximately returns 5 rows -- Sample output: -- | id | name | -- |----|------| -- | 1 | Lisa | -- | 2 | Mary | -- | 3 | Evan | -- | 4 | Fred | -- | 5 | Alex |

3. Precise Row-Level Sampling (ROW Mode)

-- Precisely random sample 5 rows (ROW mode) SELECT * FROM employee TABLESAMPLE ROW (5 ROWS) LIMIT 50; -- Result: exactly returns 5 random rows -- Sample output: -- | id | name | -- |----|------| -- | 10 | Adam | -- | 1 | Lisa | -- | 8 | Lucy | -- | 6 | Mark | -- | 3 | Evan |

Real-World Application Scenarios

Scenario 1: Quick Data Preview (SYSTEM Recommended)

-- Quickly view sample data from the orders table SELECT order_id, customer_id, order_date, amount FROM doc_test.orders TABLESAMPLE SYSTEM (50) -- 50% sampling (for small table demonstration) LIMIT 100; -- Applicable for: quickly understanding data structure and content

Scenario 2: Data Quality Check (SYSTEM Recommended)

-- Rapidly sample from the orders table to check for abnormal data SELECT order_id, amount, status FROM doc_test.orders TABLESAMPLE SYSTEM (50) WHERE amount < 0 LIMIT 50; -- Applicable for: quickly discovering data quality issues

Scenario 3: Generating Machine Learning Training Sets (ROW Recommended)

-- Randomly sample training data from the user behavior table SELECT user_id, behavior_type, item_id, timestamp FROM user_behavior TABLESAMPLE ROW (20) -- Precise 20% random sampling LIMIT 1000000; -- Applicable for: ensuring randomness and representativeness of samples

Scenario 4: Statistical Analysis Sampling (ROW Recommended)

-- Estimate average order amount SELECT AVG(total_amount) as avg_order_amount, COUNT(*) as sample_count FROM large_orders_table TABLESAMPLE ROW (5) -- 5% precise sampling LIMIT 1000000; -- Applicable for: scenarios requiring statistically rigorous random sampling

Scenario 5: Development Environment Test Data

-- Generate small-scale test data CREATE TABLE test_customers AS SELECT * FROM production_customers TABLESAMPLE ROW (1000 ROWS) -- Precisely 1000 rows LIMIT 1000; -- Applicable for: preparing data for development and testing environments

Advanced Usage

Combining with WHERE Conditions

-- TABLESAMPLE must immediately follow the table name, and WHERE conditions come after SELECT order_id, customer_id, amount, status FROM doc_test.orders TABLESAMPLE SYSTEM (50) WHERE status = 'completed' LIMIT 50;

Combining with Aggregate Analysis

-- Perform aggregate analysis on sampled data SELECT DATE_TRUNC('month', order_date) as month, COUNT(*) as order_count, AVG(total_amount) as avg_amount FROM large_orders TABLESAMPLE SYSTEM (5) GROUP BY DATE_TRUNC('month', order_date) ORDER BY month LIMIT 50;

Multi-Table Sampling with JOIN

-- Sample multiple large tables and then JOIN SELECT o.order_id, c.customer_name, o.total_amount FROM orders o TABLESAMPLE SYSTEM (10) JOIN customers c TABLESAMPLE SYSTEM (10) ON o.customer_id = c.customer_id LIMIT 100;


In-Depth Comparison of Sampling Strategies

SYSTEM Mode Details

How It Works:

  • Randomly selects at the file/data block level
  • If a file is selected, all rows within that file are returned
  • Does not read unselected files, resulting in extremely high performance

Performance Characteristics:

  • ⚔ Extremely fast: Only reads partial files, low I/O overhead
  • šŸ“Š Suitable for big data: The larger the data volume, the more obvious the performance advantage
  • šŸ’¾ Memory friendly: No need to cache all data

Applicable Scenarios:

  • āœ… Large tables (millions of rows or more)
  • āœ… Quick data exploration and preview
  • āœ… Scenarios with low precision requirements
  • āœ… Data quality checks

Caveats:

  • āš ļø Small datasets may produce imprecise results (may return all or no data)
  • āš ļø If data storage is skewed, sampling may be biased
  • āš ļø Result row count may differ significantly from expectations

ROW Mode Details

How It Works:

  • Evaluates each row independently for sampling (based on a pseudo-random algorithm)
  • Each row has an independent probability of being selected
  • Requires scanning all data rows

Performance Characteristics:

  • 🐢 Slower: Needs to read and evaluate all rows
  • šŸŽÆ Precise: Result row count is close to expectations
  • šŸ’” Statistically rigorous: Meets the statistical requirements for random sampling

Applicable Scenarios:

  • āœ… Small to medium tables (under 1 million rows)
  • āœ… Machine learning training set generation
  • āœ… Statistical analysis and scientific computing
  • āœ… Scenarios requiring precise randomness

Caveats:

  • āš ļø High performance overhead on large tables
  • āš ļø Still requires scanning the entire table
  • āœ… Result row count is more predictable

Performance Comparison

Data SizeSYSTEM ModeROW ModeRecommendation
< 10K rows~10ms~15msROW (precision matters more)
10K-100K rows~50ms~200msROW or SYSTEM
100K-1M rows~100ms~2sSYSTEM (noticeable performance difference)
> 1M rows~200ms~10s+SYSTEM (huge performance advantage)

Note: Actual performance depends on hardware, data distribution, and other factors


Best Practices

1. How to Choose a Sampling Mode

Data size < 1 million rows? |-- Yes -> Need precise randomness? | |-- Yes -> Use ROW mode | |-- No -> Use SYSTEM mode (faster) |-- No -> Use SYSTEM mode (performance first)

2. Performance Optimization Tips

-- 1. Always add a LIMIT clause SELECT * FROM large_table TABLESAMPLE (10) LIMIT 1000; -- āœ… Limit the final number of returned rows -- 2. Prioritize SYSTEM on large tables SELECT * FROM billion_rows_table TABLESAMPLE SYSTEM (1) -- āœ… 1% sampling is sufficient LIMIT 10000; -- 3. Sample first, then JOIN (reduce JOIN data volume) SELECT a.*, b.* FROM ( SELECT * FROM large_table_a TABLESAMPLE (5) LIMIT 100000 ) a JOIN large_table_b b ON a.id = b.id LIMIT 1000;

āŒ Practices to Avoid

-- 1. Using SYSTEM on small tables expecting precise results SELECT * FROM small_table_10_rows TABLESAMPLE SYSTEM (30); -- āŒ May return 0 rows or 10 rows -- 2. Using ROW on large tables without LIMIT SELECT * FROM billion_rows_table TABLESAMPLE ROW (10); -- āŒ Extremely poor performance, may return 100 million rows -- 3. Excessive sampling (close to 100%) SELECT * FROM large_table TABLESAMPLE (95); -- āŒ Nearly a full table scan, defeats the purpose of sampling

3. Sampling Accuracy Recommendations

RequirementRecommended Sampling RatioDescription
Quick preview0.1% - 1%Sufficient for understanding data structure
Data quality check5% - 10%Balances performance and coverage
Statistical estimation10% - 20%Ensures statistical significance
Machine learning training20% - 50%Adjust based on data volume and model complexity

4. Frequently Asked Questions

Q1: What happens if the sampling ratio exceeds 100%?

SELECT * FROM test TABLESAMPLE (150) LIMIT 50; -- A: An error will be reported! Singdata requires the sampling ratio to be between 0 and 100. -- Error message: tablesample percentage number should be greater than 0 and less than 100

Q2: What happens if the sampled row count exceeds the total number of rows in the table?

SELECT * FROM test TABLESAMPLE (100 ROWS) LIMIT 50; -- Table only has 10 rows -- A: The entire table data (10 rows) will be returned

Q3: What does sampling from an empty table return?

SELECT * FROM empty_table TABLESAMPLE (50) LIMIT 50; -- A: Returns an empty result set without errors

Q4: Is there a difference between sampling views and tables?

-- Both are supported, but views may have slightly lower performance (need to be materialized first) SELECT * FROM my_view TABLESAMPLE ROW (10) LIMIT 50;

Q5: How to obtain repeatable sampling results?

-- TABLESAMPLE does not support SEED. For repeatable sampling: SELECT * FROM ( SELECT *, ROW_NUMBER() OVER (ORDER BY id) as rn FROM my_table ) WHERE rn % 10 = 0 -- Take 1 row out of every 10 LIMIT 50;


Limitations and Notes

Usage Restrictions

  1. Unsupported Scenarios

    • āŒ Cannot be used in the WHERE clause of a subquery
    • āŒ Cannot be used in CTE (WITH clause) definitions
    • āŒ Cannot be used in materialized view definitions
  2. Syntax Limitations

    • āš ļø Must immediately follow the table name in the FROM clause
    • āš ļø TABLESAMPLE must come before the WHERE clause
    • āš ļø Cannot be used together with FOR UPDATE
    • āš ļø The percentage sampling range must be between 0 and 100 (exclusive of 0 and 100)
  3. Performance Notes

    • āš ļø ROW mode can be very slow on extremely large tables (> 10 million rows)
    • āš ļø Even with 1% sampling, ROW mode needs to scan the entire table

Data Consistency

  • šŸ”„ Non-deterministic: Each execution returns different results
  • šŸ“Š Statistical bias: SYSTEM mode may have bias when data distribution is uneven
  • ā±ļø Timeliness: Sampling reflects the data snapshot at the time of query

Best Practices Summary

ScenarioData VolumeRecommended ModeSampling RatioLIMIT
Data previewAnySYSTEM1-5%100-1000
Quality checkAnySYSTEM5-10%1000-10000
Statistical analysis< 1MROW10-20%As appropriate
Statistical analysis> 1MSYSTEM5-10%As appropriate
ML training set< 1MROW20-50%As appropriate
ML training set> 1MSYSTEM10-30%As appropriate

  • Data Exploration: Combine with DESCRIBE and SHOW commands to understand table structure
  • Performance Optimization: Use with partitioned tables and WHERE conditions to improve sampling efficiency
  • Data Analysis: Combine with aggregate functions and window functions