AI Data Preparation

Singdata Lakehouse unifies vector search, full-text search, and structured data analysis on a single platform. AI applications can run retrieval and computation directly where the data lives, with no need to move data to an external vector database or search engine.


Selection Guide

| What you need to do | Recommended approach |
| --- | --- |
| Semantic similarity search, RAG retrieval, image search | Vector Search |
| Keyword search, log retrieval, Chinese tokenized search | Full-Text Search |
| Vector + keyword hybrid search to improve recall quality | Hybrid Search (RRF) |
| Vector search + structured filtering (e.g., time range, category tags) on the same table | Multi-modal Data Retrieval |

Core Capabilities

Vector Search

Create a vector index (HNSW) on a table to support approximate nearest neighbor (ANN) retrieval. Suitable for semantic search, knowledge base Q&A, image similarity, and similar scenarios.

    -- Create a table with a vector column
    CREATE TABLE docs (
        id BIGINT,
        content STRING,
        embedding VECTOR(1536)
    );

    -- Create a vector index
    CREATE VECTOR INDEX idx_vec ON TABLE docs (embedding)
    PROPERTIES ("scalar.type" = "f32", "distance.function" = "cosine_distance");

    -- Semantic search: find the 5 most similar documents
    -- endpoint:my_embedding is the embedding endpoint name you configured in AI Gateway
    SELECT id, content
    FROM docs
    ORDER BY cosine_distance(embedding, AI_EMBEDDING('endpoint:my_embedding', 'user question')) ASC
    LIMIT 5;
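To make the ordering in the query above concrete, here is a minimal Python sketch of what ORDER BY cosine_distance(...) ASC LIMIT k computes logically. It assumes the usual definition of cosine distance (1 minus cosine similarity). The platform's exact definition is not stated on this page, and an HNSW index returns approximate results rather than the exact brute-force ranking shown here. The helper names and the toy 3-dimensional vectors are illustrative only.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity (assumed definition); smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query_vec, rows, k):
    """Exact (brute-force) counterpart of ORDER BY cosine_distance(...) ASC LIMIT k."""
    ranked = sorted(rows, key=lambda row: cosine_distance(row["embedding"], query_vec))
    return [row["id"] for row in ranked[:k]]


# Toy 3-dimensional embeddings; real ones would come from an embedding model.
docs = [
    {"id": 1, "embedding": [1.0, 0.0, 0.0]},
    {"id": 2, "embedding": [0.9, 0.1, 0.0]},
    {"id": 3, "embedding": [0.0, 1.0, 0.0]},
]

nearest = top_k([1.0, 0.0, 0.0], docs, k=2)
print(nearest)
```

A vector index exists to avoid this full scan: it trades a small amount of recall for much lower query latency on large tables.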

Full Vector Search Guide


Full-Text Search

Built on an inverted index, full-text search supports Chinese and English tokenization, BM25 relevance ranking, and phrase matching. Suitable for document search, log retrieval, comment analysis, and similar scenarios.

    -- Create an inverted index
    CREATE INVERTED INDEX idx_content ON TABLE docs (content)
    PROPERTIES('analyzer'='chinese');

    -- Full-text search
    SELECT id, content
    FROM docs
    WHERE match_any(content, 'vector search RAG');
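For intuition about BM25 relevance ranking, the sketch below scores pre-tokenized documents with the textbook Okapi BM25 formula. The parameter values (k1 = 1.2, b = 0.75) and the IDF variant are common defaults chosen for illustration. They are assumptions, not the platform's documented settings, and the function name is hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with textbook Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    terms = set(query_terms)
    # Document frequency: number of documents containing each query term.
    df = {t: sum(1 for d in docs if t in d) for t in terms}
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in terms:
            if tf[t] == 0:
                continue  # a term absent from the document contributes nothing
            idf = math.log(1.0 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
            # Length normalization: longer-than-average documents are penalized.
            norm = tf[t] + k1 * (1.0 - b + b * len(d) / avg_len)
            score += idf * tf[t] * (k1 + 1.0) / norm
        scores.append(score)
    return scores


docs = [
    ["vector", "search", "with", "hnsw"],
    ["full", "text", "search", "and", "rag"],
    ["data", "lake", "storage"],
]
scores = bm25_scores(["vector", "search", "rag"], docs)
print(scores)
```

Rare terms receive a higher IDF weight than common ones, and a document that matches no query term scores zero, which matches the any-term behavior the match_any name suggests.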

Full Full-Text Search Guide · BM25 Parameter Tuning


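Hybrid Search (RRF)

Merges vector search and full-text search results using Reciprocal Rank Fusion (RRF), balancing semantic relevance and exact keyword matching. Each document's fused score is the sum of 1 / (k + rank) over the result lists it appears in (the example uses k = 60), so documents ranked highly by both methods rise to the top. Recall quality is typically better than with either approach alone.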

    -- Vector search results + full-text search results → RRF merged ranking
    WITH vec_results AS (
        SELECT id,
               ROW_NUMBER() OVER (ORDER BY cosine_distance(embedding, AI_EMBEDDING('endpoint:my_embedding', 'query')) ASC) AS rk
        FROM docs
        ORDER BY rk  -- keep the 20 best-ranked rows; LIMIT without ORDER BY may return arbitrary rows
        LIMIT 20
    ),
    fts_results AS (
        SELECT id,
               ROW_NUMBER() OVER (ORDER BY SCORE() DESC) AS rk
        FROM docs
        WHERE match_any(content, 'query')
        ORDER BY rk
        LIMIT 20
    )
    -- 60 is the RRF smoothing constant k
    SELECT id, SUM(1.0 / (60 + rk)) AS rrf_score
    FROM (
        SELECT * FROM vec_results
        UNION ALL
        SELECT * FROM fts_results
    ) AS merged
    GROUP BY id
    ORDER BY rrf_score DESC
    LIMIT 5;
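The fusion step of the query above can be traced by hand with a small Python sketch. It mirrors the SUM(1.0 / (60 + rk)) expression using two ranked ID lists. The function name, the tie-break by ID, and the sample IDs are illustrative assumptions.

```python
from collections import defaultdict


def rrf_merge(ranked_lists, k=60, top_n=5):
    """Fuse ranked ID lists: each list adds 1 / (k + rank) to a document's score."""
    scores = defaultdict(float)
    for ranked_ids in ranked_lists:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for a deterministic order.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ordered[:top_n]


vec_ids = [101, 102, 103]  # hypothetical vector-search ranking, best first
fts_ids = [103, 101, 104]  # hypothetical full-text ranking, best first

merged = rrf_merge([vec_ids, fts_ids])
for doc_id, score in merged:
    print(doc_id, round(score, 6))
```

Documents 101 and 103 appear in both lists, so they outrank 102 and 104, which appear in only one. Because RRF uses ranks rather than raw scores, the cosine distances and BM25 scores never need to be put on a common scale.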

Hybrid Search Best Practices


Multi-modal Data Retrieval

Build both a vector index and an inverted index on the same table, supporting combined filtering of vector similarity and structured conditions (time, category, tags) without cross-table JOINs.

    -- Semantic search + structured filtering
    -- Assumes docs also has category and create_time columns
    SELECT id, content
    FROM docs
    WHERE category = 'tech'
      AND create_time >= '2024-01-01'
    ORDER BY cosine_distance(embedding, AI_EMBEDDING('endpoint:my_embedding', 'machine learning')) ASC
    LIMIT 10;
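Logically, the query above keeps only rows that pass the structured predicates and then ranks the survivors by vector distance. The Python sketch below shows that semantics on in-memory rows. How the engine actually combines the filter with the vector index is not described on this page, so treat the filter-then-rank order as an assumption. The row data and function names are hypothetical.

```python
import math
from datetime import date


def cosine_distance(a, b):
    """Return 1 - cosine similarity (assumed definition); smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def filtered_search(rows, query_vec, category, since, k):
    """Apply the WHERE predicates first, then order the remaining rows by distance."""
    candidates = [
        r for r in rows
        if r["category"] == category and r["create_time"] >= since
    ]
    candidates.sort(key=lambda r: cosine_distance(r["embedding"], query_vec))
    return [r["id"] for r in candidates[:k]]


rows = [
    {"id": 1, "category": "tech", "create_time": date(2024, 3, 1), "embedding": [0.9, 0.1]},
    {"id": 2, "category": "tech", "create_time": date(2023, 6, 1), "embedding": [1.0, 0.0]},
    {"id": 3, "category": "news", "create_time": date(2024, 5, 1), "embedding": [1.0, 0.0]},
    {"id": 4, "category": "tech", "create_time": date(2024, 7, 1), "embedding": [0.2, 0.8]},
]

hits = filtered_search(rows, [1.0, 0.0], "tech", date(2024, 1, 1), k=10)
print(hits)
```

Rows 2 and 3 are the closest vectors to the query but fail the date and category predicates respectively, so they never reach the ranking step.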

Multi-modal Data Retrieval Guide


Typical Scenarios

RAG Knowledge Base Q&A: Ingest documents → vectorize with AI_EMBEDDING → vector index → ANN retrieval on user query → generate answer with AI_COMPLETE

Enterprise Search: Chinese tokenized inverted index + vector index hybrid search, balancing exact matching and semantic understanding

Recommendation Systems: Vectorize user behavior → ANN retrieval of similar users or items

Image Retrieval: Store image feature vectors in a VECTOR column → ANN search for similar images

Related Documentation


| Document | Description |
| --- | --- |
| Vector Search | Vector index creation, ANN search, distance functions |
| Full-Text Search | Inverted index, tokenizers, MATCH queries |
| Hybrid Search Best Practices | Complete RRF fusion ranking example |
| Multi-modal Data Retrieval | Vector + structured filtering combination |
| AI Functions | Built-in SQL AI functions such as AI_EMBEDDING and AI_COMPLETE |
| Vector Index | Vector index DDL syntax reference |
| Inverted Index | Inverted index DDL syntax reference |