Virtual Cluster (VCluster)
A Virtual Cluster (VCluster) is the Lakehouse's elastic compute resource unit, providing CPU and memory resources for SQL queries, ETL jobs, and streaming analytics. Storage and compute are fully separated — data is stored in object storage, and virtual clusters handle only computation. Multiple clusters can access the same data simultaneously without interfering with each other.
Think of a virtual cluster as an "on-demand compute engine" — start it when you need it, stop it when you're done, and pay only for actual usage time. This is fundamentally different from traditional databases, where compute and storage are bound to the same machine and scaling requires data migration. In the Lakehouse, virtual clusters can be created, resized, and paused at any time without affecting any data.
Cluster Types
| Type | Use Case | Characteristics |
|---|---|---|
| General (GENERAL) | ETL data processing, offline batch jobs | Jobs share resources with fair scheduling; supports elastic scaling |
| Analytics (ANALYTICS) | BI queries, ad-hoc analysis, high-concurrency queries | Multi-instance auto-scaling; supports result caching for acceleration |
| Integration (INTEGRATION) | Data sync jobs (offline/real-time) | Optimized for integration tasks; multiple jobs share one cluster |
Selection Guide
| Scenario | Recommended Type | Reason |
|---|---|---|
| Periodic ETL jobs | General | Shared resources, lower cost |
| BI reports / ad-hoc queries | Analytics | Multi-instance concurrency, result caching |
| Data sync jobs | Integration | Optimized for integration tasks |
| Dynamic Table refresh | General (low-frequency, large data volume) or Analytics (high-frequency, small data volume) | Choose based on refresh frequency and data volume |
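The selection guide can be read as a simple lookup. The sketch below encodes it in Python for illustration only: the scenario labels (`"etl"`, `"bi"`, `"sync"`, `"dynamic_table"`) and the `high_frequency` flag are hypothetical names chosen for this example, not Lakehouse identifiers.

```python
from enum import Enum


class ClusterType(Enum):
    GENERAL = "GENERAL"
    ANALYTICS = "ANALYTICS"
    INTEGRATION = "INTEGRATION"


def recommend_cluster_type(scenario: str, high_frequency: bool = False) -> ClusterType:
    """Return the cluster type the selection guide suggests for a scenario.

    Scenario labels are hypothetical and exist only for this sketch.
    """
    if scenario == "etl":            # periodic ETL jobs
        return ClusterType.GENERAL
    if scenario == "bi":             # BI reports / ad-hoc queries
        return ClusterType.ANALYTICS
    if scenario == "sync":           # data sync jobs
        return ClusterType.INTEGRATION
    if scenario == "dynamic_table":
        # High-frequency, small-volume refresh -> Analytics;
        # low-frequency, large-volume refresh -> General.
        return ClusterType.ANALYTICS if high_frequency else ClusterType.GENERAL
    raise ValueError(f"unknown scenario: {scenario!r}")
```

For example, `recommend_cluster_type("dynamic_table", high_frequency=True)` returns `ClusterType.ANALYTICS`.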
Core Mechanisms
CRU (Compute Resource Unit): The Lakehouse's abstract unit of compute capacity, which hides differences between cloud platforms and CPU architectures. Usage is metered in CRU-hours: 1 CRU running for 1 hour consumes 1 CRU-hour.
Auto start/stop: Clusters can automatically pause when idle (billing stops) and automatically start when a new job is submitted. Recommended configuration:
- ETL job clusters: set auto-stop to 60 seconds to release resources quickly
- BI query clusters: set auto-stop to 30 minutes or more to leverage caching for query acceleration
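The auto start/stop behavior described above can be modeled as a small decision function. This is a conceptual sketch, not the Lakehouse's actual scheduler: `ClusterState`, its fields, and the action names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ClusterState:
    running: bool
    idle_seconds: float  # time since the last job finished (hypothetical field)
    pending_jobs: int    # jobs submitted and waiting (hypothetical field)


def next_action(state: ClusterState, auto_stop_seconds: float) -> str:
    """Decide what an auto start/stop policy would do next."""
    if not state.running:
        # A paused cluster starts again as soon as a job is submitted.
        return "start" if state.pending_jobs > 0 else "stay_stopped"
    if state.pending_jobs == 0 and state.idle_seconds >= auto_stop_seconds:
        # Idle for at least the configured timeout: pause, billing stops.
        return "stop"
    return "keep_running"
```

With 90 seconds of idle time, an ETL cluster configured with a 60-second auto-stop would pause, while a BI cluster configured with 30 minutes (1800 seconds) would keep running and retain its cache.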
Horizontal scaling (Analytics only): When concurrent queries exceed the capacity of a single instance, additional replicas are automatically started to share the load and scaled back down after queries complete.
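As a rough illustration of horizontal scaling, the number of replicas can be thought of as the concurrent query count divided by what one instance can handle, clamped to configured bounds. The parameter names and default bounds below are assumptions made for this sketch; the actual scaling policy and its settings are not specified here.

```python
import math


def replicas_needed(
    concurrent_queries: int,
    per_instance_capacity: int,  # hypothetical: queries one instance can serve at once
    min_replicas: int = 1,       # hypothetical lower bound
    max_replicas: int = 4,       # hypothetical upper bound
) -> int:
    """Estimate how many replicas are needed to serve the current load."""
    if per_instance_capacity <= 0:
        raise ValueError("per_instance_capacity must be positive")
    wanted = math.ceil(concurrent_queries / per_instance_capacity)
    # Clamp to the allowed range; load dropping back down shrinks the count again.
    return max(min_replicas, min(max_replicas, wanted))
```

Under these assumptions, 20 concurrent queries against instances that each handle 8 would call for 3 replicas, and the count falls back to 1 once the load drops.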
Quick Operations
Cost Implications
Compute Cost
- Billed by CRU × hours; no charges when suspended
- Usage under 1 minute is billed as 1 minute
- Setting auto-stop to less than 1 minute may cause frequent start/stop cycles; because each run is billed for at least 1 minute, this can increase costs
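The billing rules above can be worked through with a short calculation. This sketch assumes that `cluster_size_cru` is the cluster's size expressed in CRU and that usage beyond the first minute is metered per second; the source only states the CRU × hours formula and the 1-minute minimum, so the finer granularity is an assumption.

```python
def billed_cru_hours(cluster_size_cru: float, run_seconds: float) -> float:
    """Estimate the CRU-hours billed for one continuous run of a cluster."""
    if run_seconds <= 0:
        return 0.0  # a suspended cluster incurs no compute charge
    # Usage under 1 minute is billed as 1 minute.
    billed_seconds = max(run_seconds, 60.0)
    return cluster_size_cru * billed_seconds / 3600.0


# Ten separate 20-second runs vs. one continuous 200-second run (same total work time).
frequent = sum(billed_cru_hours(4, 20) for _ in range(10))
continuous = billed_cru_hours(4, 200)
```

In this comparison each 20-second run is rounded up to a full minute, so the ten short runs are billed for 600 seconds in total while the continuous run is billed for 200 seconds, which shows how frequent start/stop cycles can raise the bill.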
Storage Cost
- Virtual clusters themselves do not incur storage costs; data is stored in object storage
- PRELOAD_TABLES on Analytics clusters uses local SSD cache space (temporary storage)
Lifecycle Management
Best Practices
- Workload isolation: Use separate clusters for ETL jobs and BI queries to avoid resource contention
- Right-sizing: Start with a small size, then scale up incrementally to the minimum size that meets your SLA
- Auto start/stop: Set ETL clusters to auto-stop after 60 seconds; set BI clusters to 30 minutes or more
- Large job isolation: Use separate clusters for large and small jobs to prevent large jobs from starving small ones
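The right-sizing practice amounts to trying sizes from smallest to largest and stopping at the first one that meets the SLA. A minimal sketch follows, assuming a hypothetical `measure_latency` callback that runs a representative workload at a given size and returns its latency in seconds; the size values are placeholders, not actual CRU sizes.

```python
from typing import Callable, Optional, Sequence


def smallest_size_meeting_sla(
    sizes: Sequence[int],
    measure_latency: Callable[[int], float],  # hypothetical benchmark callback
    sla_seconds: float,
) -> Optional[int]:
    """Return the smallest size whose measured latency meets the SLA, else None."""
    for size in sorted(sizes):
        if measure_latency(size) <= sla_seconds:
            return size
    return None  # no candidate size met the SLA


# Toy latency model for demonstration only: latency shrinks as size grows.
chosen = smallest_size_meeting_sla([1, 2, 4, 8, 16], lambda s: 120 / s, sla_seconds=20)
```

With the toy model above, sizes 1, 2, and 4 miss the 20-second SLA and size 8 is the first to meet it, so `chosen` is 8.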
Related Documentation
- Virtual Cluster Details — Full concepts and Web UI operations
- Concurrency Scaling — Horizontal scaling for Analytics clusters
- Size Reference — CRU size comparison table
- CREATE VCLUSTER — Complete SQL syntax
