VECTOR Type

VECTOR is the native vector data type in Singdata Lakehouse, used for storing and computing high-dimensional floating-point arrays. It is the core data structure for AI applications — text, images, and audio processed by embedding models are converted into vectors, stored in VECTOR columns, and then retrieved via vector similarity search to find semantically similar content.

Feature Overview

When to use VECTOR:

  • Building RAG (Retrieval-Augmented Generation) knowledge bases to store document embeddings
  • Image/audio similarity search
  • User/item feature vectors in recommendation systems
  • Any query scenario requiring "semantic similarity" rather than "exact match"

Differences from ARRAY:

FeatureVECTORARRAY
PurposeOptimized for vector similarity searchGeneral-purpose collection type
DimensionFixed dimensionDynamic length
Element typeNumeric only (float, int, tinyint)Any type
Index supportSupports HNSW vector index accelerationNot supported

Syntax and Definition

vector(scalar_type, dimension) vector(dimension)

  • scalar_type: vector element type, optional, defaults to float. Supports tinyint, int, float
  • dimension: vector dimension (number of elements), required

Table creation example:

CREATE TABLE doc_embeddings ( doc_id BIGINT, content STRING, vec_1536 vector(float, 1536), -- OpenAI text-embedding-3-small output dimension vec_512 vector(512), -- Default float, dimension 512 vec_int8 vector(tinyint, 128) -- Quantized vector, saves storage );


Vector Operations and Functions

Lakehouse provides a set of built-in functions for vector computation.

1. Distance and Similarity Calculation

This is the core of vector search, used to measure how close two vectors are.

FunctionDescriptionUse Case
L2_DISTANCE(v1, v2)Euclidean distance; smaller value means more similarImage, audio feature matching
COSINE_DISTANCE(v1, v2)Cosine distance; smaller value means more similarText semantic similarity (most common)
INNER_PRODUCT(v1, v2)Inner product; larger value means more similarSimilarity of normalized vectors

Example:

SELECT l2_distance(vector(1, 2), vector(3, 4)) AS l2_dist, -- Result: 2.828 cosine_distance(vector(1, 2), vector(3, 4)) AS cos_dist; -- Result: 0.016

2. Vector Normalization

l2_normalize(v) scales a vector to a unit vector (magnitude of 1).

-- Correct: ensure input is float type SELECT l2_normalize(vector(3.0, 4.0)); -- Result: [0.6, 0.8] -- Incorrect: int vector normalizes to [0, 0] SELECT l2_normalize(vector(3, 4));

3. Dot Product

dot_product(v1, v2) computes the dot product of two vectors.

SELECT dot_product(vector(1, 2), vector(3, 4)); -- Result: 1*3 + 2*4 = 11


Advanced Query Examples

Find the Top K documents most similar to a query vector.

-- Assume query_vec is the vector produced by embedding the user's input SELECT doc_id, content, COSINE_DISTANCE(vec_1536, CAST('[0.12, 0.34, ...]' AS vector(1536))) AS dist FROM doc_embeddings WHERE COSINE_DISTANCE(vec_1536, CAST('[0.12, 0.34, ...]' AS vector(1536))) < 0.3 ORDER BY dist LIMIT 5;

Scenario 2: Extracting Vectors from JSON

If embedding results are stored or transmitted in JSON format, use json_extract to extract and convert to VECTOR.

-- Note: directly querying a VECTOR column may raise an error; cast to STRING to view SELECT CAST( CAST(json_extract_string(parse_json('{"vec": [0.1, 0.2, 0.3]}'), '$.vec') AS vector(3)) AS STRING) AS vec_from_json;

Scenario 3: Real-time Generation with AI_EMBEDDING

Use the ai_embedding function to convert text directly to vectors for end-to-end semantic search.

-- Compute the query text vector in real time and perform search SELECT doc_id, content, COSINE_DISTANCE(vec_1536, ai_embedding('endpoint:my_model', 'user query text')) AS dist FROM doc_embeddings ORDER BY dist LIMIT 5;


Type Conversion

Conversion DirectionMethodNotes
STRING → VECTORCAST('[1,2,3]' AS vector(3))String format is [v1, v2, ...]; extra spaces are ignored
ARRAY → VECTORCAST(arr AS vector(n))Array length must match dimension n; otherwise returns NULL
VECTOR → ARRAYImplicit conversionCan be passed directly to functions that accept ARRAY arguments
VECTOR → STRINGCAST(vec AS STRING)Output format is [1, 2, 3] (comma followed by space)

Conversion examples:

-- ARRAY → VECTOR (dimension matches) SELECT CAST(array(1.0, 2.0, 3.0) AS vector(3)); -- Success -- ARRAY → VECTOR (dimension mismatch, returns NULL) SELECT CAST(array(1.0, 2.0) AS vector(3)); -- NULL -- VECTOR implicitly converted to ARRAY for computation SELECT array_append(vector(1, 2, 3), 4); -- [1, 2, 3, 4]


Vector Index Acceleration

For large-scale vector data, it is recommended to create a vector index to accelerate ANN (Approximate Nearest Neighbor) search:

-- Create HNSW vector index CREATE VECTOR INDEX idx_vec ON doc_embeddings (vec_1536) USING HNSW PROPERTIES ( "metric_type" = "cosine" );

See Create Vector Index for details.


Performance Optimization Tips

  1. Quantized storage: For storage-sensitive scenarios, use tinyint type quantized vectors (e.g., vector(tinyint, 128)). Storage is only 1/4 of float, and computation is faster.
  2. Scalar pre-filtering: Before vector search, apply scalar conditions first (e.g., WHERE category = 'AI') to reduce the number of vectors participating in distance computation and improve query efficiency.
  3. Query limitations:
    • VECTOR columns do not support ORDER BY, GROUP BY, or DISTINCT.
    • Directly selecting a vector column (SELECT vector_col) may raise an error; use CAST(vector_col AS STRING) to view data.