TABLESAMPLE - Data Sampling
Overview
TABLESAMPLE is an efficient data sampling method provided by the Singdata Lakehouse platform, supporting random sampling based on probability or a fixed number of rows. With two different sampling strategies (SYSTEM and ROW), you can flexibly balance between performance and accuracy to meet a variety of needs, from rapid data exploration to precise statistical analysis.
Core Features
- Flexible Sampling Methods: Supports percentage-based sampling and fixed-row-count sampling
- Dual Sampling Strategies: Provides both file-level (SYSTEM) and row-level (ROW) modes
- High-Performance Design: SYSTEM mode is optimized for big data scenarios
- Precise Control: ROW mode delivers accurate random sampling results
Typical Use Cases
| Scenario | Recommended Mode | Description |
|---|---|---|
| Quick data preview | SYSTEM | High performance, suitable for quickly browsing large tables |
| Data quality check | SYSTEM | Rapidly sample data to verify quality |
| Machine learning training set | ROW | Precise random sampling to ensure sample representativeness |
| Statistical analysis | ROW | Accurate probability sampling meeting statistical requirements |
| Development/test data generation | ROW | Generate small-scale test datasets |
| Large-scale data analysis | SYSTEM | Efficient sampling of datasets with millions of rows or more |
Syntax
Parameter Description
Sampling Type
| Type | Description | Applicable Scenarios | Performance |
|---|---|---|---|
| ROW | Row-level random sampling - Each row is independently evaluated for sampling, with precise result row counts | Small to medium datasets (< 1 million rows); scenarios requiring precise randomness | Slower, needs to scan all rows |
| SYSTEM | File-level random sampling - Randomly selects storage file blocks for higher performance | Large datasets (> 1 million rows); quick data exploration | Extremely fast, only reads partial files |
| Default | Automatically uses SYSTEM mode when not specified | General purpose | High performance |
Sampling Quantity Specification
| Format | Description | Example | Notes |
|---|---|---|---|
<percentage> | Sample by percentage, range 0-100 | 30 means sample 30% | Actual row count may fluctuate |
<num> ROWS | Specify an exact number of rows to sample | 5 ROWS means sample 5 rows | More precise in ROW mode |
ā ļø Important Notes:
- Percentage-based sampling in SYSTEM mode may be imprecise (especially for small datasets)
- It is recommended to add a
LIMITclause to the query to optimize performance - Sampling results are random; each execution may return different data
Usage Examples
Prepare Test Data
Basic Sampling Examples
1. Percentage Sampling (SYSTEM Mode)
2. Fixed-Row Sampling (SYSTEM Mode)
3. Precise Row-Level Sampling (ROW Mode)
Real-World Application Scenarios
Scenario 1: Quick Data Preview (SYSTEM Recommended)
Scenario 2: Data Quality Check (SYSTEM Recommended)
Scenario 3: Generating Machine Learning Training Sets (ROW Recommended)
Scenario 4: Statistical Analysis Sampling (ROW Recommended)
Scenario 5: Development Environment Test Data
Advanced Usage
Combining with WHERE Conditions
Combining with Aggregate Analysis
Multi-Table Sampling with JOIN
In-Depth Comparison of Sampling Strategies
SYSTEM Mode Details
How It Works:
- Randomly selects at the file/data block level
- If a file is selected, all rows within that file are returned
- Does not read unselected files, resulting in extremely high performance
Performance Characteristics:
- ā” Extremely fast: Only reads partial files, low I/O overhead
- š Suitable for big data: The larger the data volume, the more obvious the performance advantage
- š¾ Memory friendly: No need to cache all data
Applicable Scenarios:
- ā Large tables (millions of rows or more)
- ā Quick data exploration and preview
- ā Scenarios with low precision requirements
- ā Data quality checks
Caveats:
- ā ļø Small datasets may produce imprecise results (may return all or no data)
- ā ļø If data storage is skewed, sampling may be biased
- ā ļø Result row count may differ significantly from expectations
ROW Mode Details
How It Works:
- Evaluates each row independently for sampling (based on a pseudo-random algorithm)
- Each row has an independent probability of being selected
- Requires scanning all data rows
Performance Characteristics:
- š¢ Slower: Needs to read and evaluate all rows
- šÆ Precise: Result row count is close to expectations
- š” Statistically rigorous: Meets the statistical requirements for random sampling
Applicable Scenarios:
- ā Small to medium tables (under 1 million rows)
- ā Machine learning training set generation
- ā Statistical analysis and scientific computing
- ā Scenarios requiring precise randomness
Caveats:
- ā ļø High performance overhead on large tables
- ā ļø Still requires scanning the entire table
- ā Result row count is more predictable
Performance Comparison
| Data Size | SYSTEM Mode | ROW Mode | Recommendation |
|---|---|---|---|
| < 10K rows | ~10ms | ~15ms | ROW (precision matters more) |
| 10K-100K rows | ~50ms | ~200ms | ROW or SYSTEM |
| 100K-1M rows | ~100ms | ~2s | SYSTEM (noticeable performance difference) |
| > 1M rows | ~200ms | ~10s+ | SYSTEM (huge performance advantage) |
Note: Actual performance depends on hardware, data distribution, and other factors
Best Practices
1. How to Choose a Sampling Mode
2. Performance Optimization Tips
ā Recommended Practices
ā Practices to Avoid
3. Sampling Accuracy Recommendations
| Requirement | Recommended Sampling Ratio | Description |
|---|---|---|
| Quick preview | 0.1% - 1% | Sufficient for understanding data structure |
| Data quality check | 5% - 10% | Balances performance and coverage |
| Statistical estimation | 10% - 20% | Ensures statistical significance |
| Machine learning training | 20% - 50% | Adjust based on data volume and model complexity |
4. Frequently Asked Questions
Q1: What happens if the sampling ratio exceeds 100%?
Q2: What happens if the sampled row count exceeds the total number of rows in the table?
Q3: What does sampling from an empty table return?
Q4: Is there a difference between sampling views and tables?
Q5: How to obtain repeatable sampling results?
Limitations and Notes
Usage Restrictions
-
Unsupported Scenarios
- ā Cannot be used in the WHERE clause of a subquery
- ā Cannot be used in CTE (WITH clause) definitions
- ā Cannot be used in materialized view definitions
-
Syntax Limitations
- ā ļø Must immediately follow the table name in the FROM clause
- ā ļø TABLESAMPLE must come before the WHERE clause
- ā ļø Cannot be used together with
FOR UPDATE - ā ļø The percentage sampling range must be between 0 and 100 (exclusive of 0 and 100)
-
Performance Notes
- ā ļø ROW mode can be very slow on extremely large tables (> 10 million rows)
- ā ļø Even with 1% sampling, ROW mode needs to scan the entire table
Data Consistency
- š Non-deterministic: Each execution returns different results
- š Statistical bias: SYSTEM mode may have bias when data distribution is uneven
- ā±ļø Timeliness: Sampling reflects the data snapshot at the time of query
Best Practices Summary
| Scenario | Data Volume | Recommended Mode | Sampling Ratio | LIMIT |
|---|---|---|---|---|
| Data preview | Any | SYSTEM | 1-5% | 100-1000 |
| Quality check | Any | SYSTEM | 5-10% | 1000-10000 |
| Statistical analysis | < 1M | ROW | 10-20% | As appropriate |
| Statistical analysis | > 1M | SYSTEM | 5-10% | As appropriate |
| ML training set | < 1M | ROW | 20-50% | As appropriate |
| ML training set | > 1M | SYSTEM | 10-30% | As appropriate |
Related References
- Data Exploration: Combine with DESCRIBE and SHOW commands to understand table structure
- Performance Optimization: Use with partitioned tables and
WHEREconditions to improve sampling efficiency - Data Analysis: Combine with aggregate functions and window functions
