VECTOR Type
VECTOR is the native vector data type in Singdata Lakehouse, used for storing and computing high-dimensional floating-point arrays. It is the core data structure for AI applications — text, images, and audio processed by embedding models are converted into vectors, stored in VECTOR columns, and then retrieved via vector similarity search to find semantically similar content.
Feature Overview
When to use VECTOR:
- Building RAG (Retrieval-Augmented Generation) knowledge bases to store document embeddings
- Image/audio similarity search
- User/item feature vectors in recommendation systems
- Any query scenario requiring "semantic similarity" rather than "exact match"
Differences from ARRAY:
| Feature | VECTOR | ARRAY |
|---|---|---|
| Purpose | Optimized for vector similarity search | General-purpose collection type |
| Dimension | Fixed dimension | Dynamic length |
| Element type | Numeric only (float, int, tinyint) | Any type |
| Index support | Supports HNSW vector index acceleration | Not supported |
Syntax and Definition
scalar_type: vector element type, optional, defaults tofloat. Supportstinyint,int,floatdimension: vector dimension (number of elements), required
Table creation example:
Vector Operations and Functions
Lakehouse provides a set of built-in functions for vector computation.
1. Distance and Similarity Calculation
This is the core of vector search, used to measure how close two vectors are.
| Function | Description | Use Case |
|---|---|---|
L2_DISTANCE(v1, v2) | Euclidean distance; smaller value means more similar | Image, audio feature matching |
COSINE_DISTANCE(v1, v2) | Cosine distance; smaller value means more similar | Text semantic similarity (most common) |
INNER_PRODUCT(v1, v2) | Inner product; larger value means more similar | Similarity of normalized vectors |
Example:
2. Vector Normalization
l2_normalize(v) scales a vector to a unit vector (magnitude of 1).
3. Dot Product
dot_product(v1, v2) computes the dot product of two vectors.
Advanced Query Examples
Scenario 1: Semantic Similarity Search
Find the Top K documents most similar to a query vector.
Scenario 2: Extracting Vectors from JSON
If embedding results are stored or transmitted in JSON format, use json_extract to extract and convert to VECTOR.
Scenario 3: Real-time Generation with AI_EMBEDDING
Use the ai_embedding function to convert text directly to vectors for end-to-end semantic search.
Type Conversion
| Conversion Direction | Method | Notes |
|---|---|---|
| STRING → VECTOR | CAST('[1,2,3]' AS vector(3)) | String format is [v1, v2, ...]; extra spaces are ignored |
| ARRAY → VECTOR | CAST(arr AS vector(n)) | Array length must match dimension n; otherwise returns NULL |
| VECTOR → ARRAY | Implicit conversion | Can be passed directly to functions that accept ARRAY arguments |
| VECTOR → STRING | CAST(vec AS STRING) | Output format is [1, 2, 3] (comma followed by space) |
Conversion examples:
Vector Index Acceleration
For large-scale vector data, it is recommended to create a vector index to accelerate ANN (Approximate Nearest Neighbor) search:
See Create Vector Index for details.
Performance Optimization Tips
- Quantized storage: For storage-sensitive scenarios, use
tinyinttype quantized vectors (e.g.,vector(tinyint, 128)). Storage is only 1/4 offloat, and computation is faster. - Scalar pre-filtering: Before vector search, apply scalar conditions first (e.g.,
WHERE category = 'AI') to reduce the number of vectors participating in distance computation and improve query efficiency. - Query limitations:
- VECTOR columns do not support
ORDER BY,GROUP BY, orDISTINCT. - Directly selecting a vector column (
SELECT vector_col) may raise an error; useCAST(vector_col AS STRING)to view data.
- VECTOR columns do not support
