Built-in Sample Datasets

clickzetta_sample_data is a public dataset built into Singdata Lakehouse via the data sharing mechanism. It is available out of the box for every tenant — no application or import required, and it does not count against your storage quota.

Query it directly in SQL using three-part naming:

SELECT * FROM clickzetta_sample_data.tpch_100g.orders LIMIT 10;

It contains 5 schemas covering performance benchmarking, business analytics practice, and AI vector retrieval:

SchemaData ContentScale
tpch_100gTPC-H standard benchmark, supply-chain order data100 GB, 600 million rows in lineitem
tpcds_10tbTPC-DS standard benchmark, retail multi-channel sales data10 TB, 28.8 billion rows in store_sales
ecommerce_events_historyE-commerce user behavior event stream110 million rows in history table, 370 million rows in live table
nyc_taxi_tripdataNew York City ride-hailing trip records1.49 billion rows
clickzetta_doc_kbSingdata product documentation vector knowledge baseContains 1024-dimensional embeddings

tpch_100g

Data source: TPC-H is the industry-standard database benchmark, simulating a supply-chain scenario with 8 tables covering orders, parts, suppliers, customers, and more.

Scale: 100 GB, 8 tables total.

TableRow CountDescription
lineitem600 millionOrder line items, the largest table
orders150 millionOrder master table
customer15 millionCustomer information
supplier1 millionSupplier information
part2 millionPart information
partsupp8 millionPart-supplier relationships
nation25Country dimension
region5Region dimension

Use cases:

  • Test SQL query performance and compare response times across different VCluster sizes
  • Learn multi-table JOIN, aggregation, window function, and other SQL syntax
  • Validate index effectiveness (Bloomfilter, inverted index)

Example query: Calculate annual revenue by region (TPC-H Q5)

SELECT n.n_name, SUM(l.l_extendedprice * (1 - l.l_discount)) AS revenue FROM clickzetta_sample_data.tpch_100g.customer c JOIN clickzetta_sample_data.tpch_100g.orders o ON c.c_custkey = o.o_custkey JOIN clickzetta_sample_data.tpch_100g.lineitem l ON o.o_orderkey = l.l_orderkey JOIN clickzetta_sample_data.tpch_100g.supplier s ON l.l_suppkey = s.s_suppkey JOIN clickzetta_sample_data.tpch_100g.nation n ON c.c_nationkey = n.n_nationkey JOIN clickzetta_sample_data.tpch_100g.region r ON n.n_regionkey = r.r_regionkey WHERE r.r_name = 'ASIA' AND o.o_orderdate >= '1994-01-01' AND o.o_orderdate < '1995-01-01' GROUP BY n.n_name ORDER BY revenue DESC;

tpcds_10tb

Data source: TPC-DS is a more complex retail benchmark than TPC-H, simulating multi-channel (store, catalog, web) sales scenarios with 24 tables and 99 standard queries.

Scale: 10 TB, 24 tables total.

TableRow CountDescription
store_sales28.8 billionStore sales line items
catalog_sales14.4 billionCatalog sales line items
web_sales7.2 billionWeb sales line items
store_returns2.88 billionStore return records
inventory1.31 billionInventory records
customer65 millionCustomer information
item402,000Product information
date_dim73,000Date dimension
Other 16 tablesDimension tables (store, promotion, address, etc.)

Use cases:

  • Large-scale stress testing to validate query performance at the 10 TB level
  • Test execution plans for complex multi-table JOINs and subqueries
  • Compare different VCluster sizes under heavy queries

Example query: Calculate quarterly sales by store

SELECT s.s_store_name, d.d_year, d.d_qoy, SUM(ss.ss_net_paid) AS total_sales FROM clickzetta_sample_data.tpcds_10tb.store_sales ss JOIN clickzetta_sample_data.tpcds_10tb.store s ON ss.ss_store_sk = s.s_store_sk JOIN clickzetta_sample_data.tpcds_10tb.date_dim d ON ss.ss_sold_date_sk = d.d_date_sk WHERE d.d_year = 2001 GROUP BY s.s_store_name, d.d_year, d.d_qoy ORDER BY d.d_qoy, total_sales DESC LIMIT 20;

ecommerce_events_history

Data source: From a Kaggle public dataset, recording user behavior events on a multi-category e-commerce platform, including browsing, adding to cart, and purchasing.

Scale: 2 tables.

TableRow CountDescription
ecommerce_events_multicategorystore110 millionHistorical event snapshot table
ecommerce_events_multicategorystore_live370 millionContinuously updated live event table with change_tracking enabled; supports creating a Table Stream

Field descriptions:

FieldTypeDescription
event_timevarcharEvent occurrence time
event_timestamptimestamp_ltzEvent timestamp
event_typevarcharEvent type: view / cart / purchase
product_idvarcharProduct ID
category_idvarcharCategory ID
category_codevarcharCategory path, e.g. electronics.smartphone
brandvarcharBrand
pricedecimal(10,2)Product price
user_idvarcharUser ID
user_sessionvarcharSession ID
event_datedateEvent date (partition key)

Use cases:

  • Funnel analysis (view → add to cart → purchase conversion rate)
  • User retention and repeat purchase analysis
  • Category and brand sales ranking
  • Incremental data processing practice based on Table Stream

Example query: Calculate the conversion funnel by event type

SELECT event_type, COUNT(DISTINCT user_id) AS users, COUNT(*) AS events FROM clickzetta_sample_data.ecommerce_events_history.ecommerce_events_multicategorystore GROUP BY event_type ORDER BY events DESC;

nyc_taxi_tripdata

Data source: Publicly released trip data from the New York City Taxi and Limousine Commission (TLC), covering ride-hailing trips from platforms such as Uber and Lyft (FHVHV: For-Hire Vehicle High Volume).

Scale: 1 table, 1.49 billion rows.

Key fields:

FieldTypeDescription
hvfhs_license_numvarcharPlatform license number (HV0003=Uber, HV0005=Lyft)
pickup_datetimetimestamp_ltzPickup time
dropoff_datetimetimestamp_ltzDropoff time
trip_milesdoubleTrip distance (miles)
trip_timebigintTrip duration (seconds)
base_passenger_faredoubleBase passenger fare
tipsdoubleTips
driver_paydoubleDriver earnings
shared_request_flagvarcharShared ride request (Y/N)
wav_request_flagvarcharWheelchair-accessible vehicle request (Y/N)

Use cases:

  • Time-series aggregation analysis (trip volume by hour, by day of week)
  • Large-table aggregation performance testing
  • Platform comparison analysis (Uber vs Lyft)
  • Geospatial data analysis (combined with pickup/dropoff zone fields)

Example query: Average trip distance and fare by platform

SELECT hvfhs_license_num, COUNT(*) AS trips, ROUND(AVG(trip_miles), 2) AS avg_miles, ROUND(AVG(trip_time) / 60, 1) AS avg_minutes, ROUND(AVG(base_passenger_fare), 2) AS avg_fare FROM clickzetta_sample_data.nyc_taxi_tripdata.fhvhv_tripdata GROUP BY hvfhs_license_num ORDER BY trips DESC;

clickzetta_doc_kb

Data content: A vector knowledge base of Singdata Lakehouse product documentation. Document content is converted into 1024-dimensional vectors using Alibaba Cloud DashScope's text embedding model, for use in semantic retrieval and AI question answering.

Scale: 1 table, dashscope_clickzetta_elements.

Key fields:

FieldTypeDescription
idstringUnique record ID
typestringElement type (Title / NarrativeText / Table, etc.)
filenamestringSource document filename
textstringRaw text content
embeddingsvector(float, 1024)1024-dimensional vector representation of the text
element_typestringDocument element classification
documents_sourcestringDocument source identifier
date_processedtimestamp_ltzVector processing timestamp

Use cases:

  • Experience vector similarity retrieval (the cosine_distance function)
  • Build a RAG (Retrieval-Augmented Generation) Q&A system based on product documentation
  • Learn how to use the AI_EMBEDDING function together with vector indexes

Example query: Retrieve document fragments most relevant to "dynamic table" using vector similarity

SELECT filename, type, text, cosine_distance(embeddings, AI_EMBEDDING('ai_gateway_conn:text-embedding-v4', 'What is a dynamic table')) AS distance FROM clickzetta_sample_data.clickzetta_doc_kb.dashscope_clickzetta_elements ORDER BY distance ASC LIMIT 5;