Lance External Tables
Overview
Lance is a columnar storage format designed for machine learning, with native support for vector data and ANN (Approximate Nearest Neighbor) indexes. Lakehouse reads Lance datasets stored on OSS directly via the external table mechanism, enabling vector similarity search without any data migration.
Use cases: vector retrieval, image/text semantic search, RAG knowledge base queries, recommendation system candidate recall.
SQL Commands Involved
| Command / Function | Purpose | When to Use |
|---|---|---|
CREATE STORAGE CONNECTION | Create OSS access credentials | Run once when first connecting to OSS |
CREATE EXTERNAL TABLE ... USING LANCE | Mount a Lance dataset as an external table | Run once per Lance dataset |
DESCRIBE TABLE | View column definitions and vector dimensions | Verify that the schema matches the Lance file |
SHOW CREATE TABLE | Reverse-engineer the full DDL | Troubleshoot LOCATION / CONNECTION configuration |
cosine_distance(a, b) | Cosine distance, range [0, 2], smaller means more similar | Text/image semantic similarity search |
l2_distance(a, b) | Euclidean distance, range [0, +∞) | Spatial distance, image feature matching |
Prerequisites
All examples in this guide are based on the following test dataset:
- OSS path:
oss://lakehouse-hz-daily/clickzetta/lance/test_lance_1024_cosine_small.lance - Row count: 8,192 rows
- Vector dimensions: 1,024 (float32)
- Index type: cosine ANN
Scenario 1: Create a Lance External Table
Create a Storage Connection
Verify the connection was created successfully:
Sample output:
Create the Lance External Table
Key parameter notes:
USING LANCE: Specifies the format as Lance.LOCATION: The OSS path pointing to the root of the.lancedirectory (which must contain a_versions/subdirectory).CONNECTION: References the name of an already-created Storage Connection.vector(float, N): The vector column type.Nmust exactly match the dimension of thefixed_size_listin the Lance file; otherwise queries will return NULL.
Verify the external table was created successfully:
Sample output:
Scenario 2: Basic Data Reading
Row Count
Sample output:
Read Scalar Columns
Sample output:
Read the Vector Column
Sample output (vector values truncated):
Filter by Scalar Column
Sample output:
Group-by Aggregation
Sample output:
10 source groups, 8,192 total rows, with a fairly even distribution (approximately 800 rows per group).
Scenario 3: Vector Similarity Search
Top-K Cosine Distance Search
Query the 20 most similar records to a given vector using the cosine_distance function — smaller values indicate greater similarity.
WHERE cosine_distance(...) <= 1000.0 is a loose threshold used to trigger Lance's ANN index acceleration. In production, you can tighten the threshold based on your similarity requirements (e.g., <= 0.3).
Verify Vector Distance Function Accuracy
Use self-similarity to verify correctness (the cosine distance between a vector and itself should be close to 0):
Sample output:
All self-similarity absolute values are below 1e-6, which is consistent with float32 precision expectations (theoretical value is 0; floating-point error is within the normal range).
Scenario 4: Hybrid Filter Queries
Vector distance filtering and scalar column filtering can be combined. Use scalar conditions to narrow the candidate set first, then sort by vector distance.
Sample output:
Only rows with source = 'source_1' are returned, confirming that both the scalar filter and the vector distance filter are applied correctly.
Important Notes
-
Vector dimensions must match exactly: The
Ninvector(float, N)in the DDL must exactly match the length offixed_size_listin the Lance file. If dimensions do not match, queries will not error but the vector distance column will return NULL. -
Timestamp column type mapping: The
timestamptype in Lance may be mapped tobigint(Unix timestamp in seconds) in Lakehouse. The actual type shown bySHOW CREATE TABLEis determined by system inference; declaringtimestampin the DDL does not affect data reading. -
Lance version: The current supported format is Lance v2 (
TBLPROPERTIES('lance.version'='2')). -
Cross-cloud networking: The Lakehouse cluster and the OSS bucket must be on the same cloud provider. Cross-cloud access (e.g., a Tencent Cloud cluster accessing Alibaba Cloud OSS) will cause query timeouts (default 300s) due to network connectivity issues.
