Built-in Sample Datasets
clickzetta_sample_data is a public dataset built into Singdata Lakehouse via the data sharing mechanism. It is available out of the box for every tenant — no application or import required, and it does not count against your storage quota.
Query it directly in SQL using three-part naming:
It contains 5 schemas covering performance benchmarking, business analytics practice, and AI vector retrieval:
| Schema | Data Content | Scale |
|---|---|---|
tpch_100g | TPC-H standard benchmark, supply-chain order data | 100 GB, 600 million rows in lineitem |
tpcds_10tb | TPC-DS standard benchmark, retail multi-channel sales data | 10 TB, 28.8 billion rows in store_sales |
ecommerce_events_history | E-commerce user behavior event stream | 110 million rows in history table, 370 million rows in live table |
nyc_taxi_tripdata | New York City ride-hailing trip records | 1.49 billion rows |
clickzetta_doc_kb | Singdata product documentation vector knowledge base | Contains 1024-dimensional embeddings |
tpch_100g
Data source: TPC-H is the industry-standard database benchmark, simulating a supply-chain scenario with 8 tables covering orders, parts, suppliers, customers, and more.
Scale: 100 GB, 8 tables total.
| Table | Row Count | Description |
|---|---|---|
lineitem | 600 million | Order line items, the largest table |
orders | 150 million | Order master table |
customer | 15 million | Customer information |
supplier | 1 million | Supplier information |
part | 2 million | Part information |
partsupp | 8 million | Part-supplier relationships |
nation | 25 | Country dimension |
region | 5 | Region dimension |
Use cases:
- Test SQL query performance and compare response times across different VCluster sizes
- Learn multi-table JOIN, aggregation, window function, and other SQL syntax
- Validate index effectiveness (Bloomfilter, inverted index)
Example query: Calculate annual revenue by region (TPC-H Q5)
tpcds_10tb
Data source: TPC-DS is a more complex retail benchmark than TPC-H, simulating multi-channel (store, catalog, web) sales scenarios with 24 tables and 99 standard queries.
Scale: 10 TB, 24 tables total.
| Table | Row Count | Description |
|---|---|---|
store_sales | 28.8 billion | Store sales line items |
catalog_sales | 14.4 billion | Catalog sales line items |
web_sales | 7.2 billion | Web sales line items |
store_returns | 2.88 billion | Store return records |
inventory | 1.31 billion | Inventory records |
customer | 65 million | Customer information |
item | 402,000 | Product information |
date_dim | 73,000 | Date dimension |
| Other 16 tables | — | Dimension tables (store, promotion, address, etc.) |
Use cases:
- Large-scale stress testing to validate query performance at the 10 TB level
- Test execution plans for complex multi-table JOINs and subqueries
- Compare different VCluster sizes under heavy queries
Example query: Calculate quarterly sales by store
ecommerce_events_history
Data source: From a Kaggle public dataset, recording user behavior events on a multi-category e-commerce platform, including browsing, adding to cart, and purchasing.
Scale: 2 tables.
| Table | Row Count | Description |
|---|---|---|
ecommerce_events_multicategorystore | 110 million | Historical event snapshot table |
ecommerce_events_multicategorystore_live | 370 million | Continuously updated live event table with change_tracking enabled; supports creating a Table Stream |
Field descriptions:
| Field | Type | Description |
|---|---|---|
event_time | varchar | Event occurrence time |
event_timestamp | timestamp_ltz | Event timestamp |
event_type | varchar | Event type: view / cart / purchase |
product_id | varchar | Product ID |
category_id | varchar | Category ID |
category_code | varchar | Category path, e.g. electronics.smartphone |
brand | varchar | Brand |
price | decimal(10,2) | Product price |
user_id | varchar | User ID |
user_session | varchar | Session ID |
event_date | date | Event date (partition key) |
Use cases:
- Funnel analysis (view → add to cart → purchase conversion rate)
- User retention and repeat purchase analysis
- Category and brand sales ranking
- Incremental data processing practice based on Table Stream
Example query: Calculate the conversion funnel by event type
nyc_taxi_tripdata
Data source: Publicly released trip data from the New York City Taxi and Limousine Commission (TLC), covering ride-hailing trips from platforms such as Uber and Lyft (FHVHV: For-Hire Vehicle High Volume).
Scale: 1 table, 1.49 billion rows.
Key fields:
| Field | Type | Description |
|---|---|---|
hvfhs_license_num | varchar | Platform license number (HV0003=Uber, HV0005=Lyft) |
pickup_datetime | timestamp_ltz | Pickup time |
dropoff_datetime | timestamp_ltz | Dropoff time |
trip_miles | double | Trip distance (miles) |
trip_time | bigint | Trip duration (seconds) |
base_passenger_fare | double | Base passenger fare |
tips | double | Tips |
driver_pay | double | Driver earnings |
shared_request_flag | varchar | Shared ride request (Y/N) |
wav_request_flag | varchar | Wheelchair-accessible vehicle request (Y/N) |
Use cases:
- Time-series aggregation analysis (trip volume by hour, by day of week)
- Large-table aggregation performance testing
- Platform comparison analysis (Uber vs Lyft)
- Geospatial data analysis (combined with pickup/dropoff zone fields)
Example query: Average trip distance and fare by platform
clickzetta_doc_kb
Data content: A vector knowledge base of Singdata Lakehouse product documentation. Document content is converted into 1024-dimensional vectors using Alibaba Cloud DashScope's text embedding model, for use in semantic retrieval and AI question answering.
Scale: 1 table, dashscope_clickzetta_elements.
Key fields:
| Field | Type | Description |
|---|---|---|
id | string | Unique record ID |
type | string | Element type (Title / NarrativeText / Table, etc.) |
filename | string | Source document filename |
text | string | Raw text content |
embeddings | vector(float, 1024) | 1024-dimensional vector representation of the text |
element_type | string | Document element classification |
documents_source | string | Document source identifier |
date_processed | timestamp_ltz | Vector processing timestamp |
Use cases:
- Experience vector similarity retrieval (the
cosine_distancefunction) - Build a RAG (Retrieval-Augmented Generation) Q&A system based on product documentation
- Learn how to use the
AI_EMBEDDINGfunction together with vector indexes
Example query: Retrieve document fragments most relevant to "dynamic table" using vector similarity
