Scenario 1: Enterprise Knowledge Base Search System
Common approaches in Hive/Spark:
-- Text search in Hive
SELECT * FROM knowledge_base
WHERE content LIKE '%Artificial Intelligence%'
OR content LIKE '%Machine Learning%'
OR content LIKE '%Deep Learning%';
-- Performance issues:
-- 1. Full table scan, cannot use indexes
-- 2. Only exact matching, no semantic search
-- 3. Chinese word segmentation difficulty, poor search results
Technical challenges faced:
Performance limitations: LIKE queries require full table scans, extremely poor performance with large data volumes
Functional limitations: No word segmentation search, only exact string matching
Maintenance complexity: Manual management of search keywords and synonyms required
Scalability difficulties: Multi-language support requires additional development
-- 1. Create a table with an inverted index
CREATE TABLE documents (
id BIGINT,
title STRING,
content STRING,
created_at TIMESTAMP,
-- Define inverted index at table creation
INDEX title_idx (title) INVERTED PROPERTIES('analyzer'='keyword'),
INDEX content_idx (content) INVERTED PROPERTIES('analyzer'='chinese', 'mode'='smart')
);
-- 2. Insert data
INSERT INTO documents VALUES
(1, 'Singdata Technical Guide', 'Singdata Lakehouse supports vector search and full-text search features', CURRENT_TIMESTAMP());
-- 3. Build index for existing data
BUILD INDEX content_idx ON documents;
-- 4. Execute full-text search query
SELECT id, title FROM documents
WHERE MATCH_ANY(content, 'full-text search vector search', map('analyzer', 'auto'));
-- 5. Use phrase matching
SELECT id, title FROM documents
WHERE MATCH_PHRASE(content, 'full-text search features', map('analyzer', 'auto'));
Technical Comparison: Traditional Approach vs Native Full-Text Search
Traditional LIKE Query Approach
-- Text search using LIKE
SELECT * FROM documents
WHERE content LIKE '%Artificial Intelligence%'
OR content LIKE '%Machine Learning%'
OR content LIKE '%Deep Learning%';
-- Existing problems:
-- 1. Full table scan, poor performance
-- 2. No word segmentation, only exact matching
-- 3. Case sensitive
-- 4. Cannot handle synonyms
External Search Engine Approach
Requires maintaining an additional Elasticsearch cluster:
-- Test different tokenizer effects
SELECT
'keyword' as analyzer,
TOKENIZE('Machine Learning Algorithm', map('analyzer', 'keyword')) as tokens
UNION ALL
SELECT
'chinese',
TOKENIZE('Machine Learning Algorithm', map('analyzer', 'chinese'))
UNION ALL
SELECT
'unicode',
TOKENIZE('Machine Learning Algorithm', map('analyzer', 'unicode'));
MATCH_ALL - Match All Function
MATCH_ALL(column, query, option)
Purpose: Requires documents to contain all tokenized results from the query text
Logic: AND relationship, all words must be present
Usage example:
-- Find documents containing both "Machine Learning" and "Algorithm"
SELECT id, title FROM documents
WHERE MATCH_ALL(content, 'Machine Learning Algorithm', map('analyzer', 'auto'));
MATCH_ANY - Match Any Function
MATCH_ANY(column, query, option)
Purpose: Documents need only contain any single token from the query text
Logic: OR relationship, any word present is sufficient
Usage example:
-- Find documents containing "AI" or "Machine Learning" or "Deep Learning"
SELECT id, title FROM documents
WHERE MATCH_ANY(content, 'AI Machine Learning Deep Learning', map('analyzer', 'auto'));
MATCH_PHRASE - Phrase Matching Function
MATCH_PHRASE(column, query, option)
Purpose: Requires documents to contain the query phrase, with word order consistent and continuous
Characteristics:
Order-sensitive
Requires consecutive occurrence
Case-insensitive
Usage example:
-- Find documents containing the phrase "Natural Language Processing"
SELECT id, title FROM documents
WHERE MATCH_PHRASE(content, 'Natural Language Processing', map('analyzer', 'auto'));
-- Counterexample: will not match "Natural Language Processing" (not consecutive)
-- Counterexample: will not match "Language Natural Processing" (wrong order)
MATCH_PHRASE_PREFIX - Phrase Prefix Matching
MATCH_PHRASE_PREFIX(column, query, option)
Purpose: First n-1 words matched as phrase, last word matched as prefix
-- Find phrases starting with "Data An" (e.g., "Data Analysis", "Data Annotation")
SELECT id, title FROM documents
WHERE MATCH_PHRASE_PREFIX(content, 'Data An', map('analyzer', 'auto'));
MATCH_REGEXP - Regex Matching Function
MATCH_REGEXP(column, query, option)
Purpose: Perform regex matching on tokenization results
Usage example:
-- Find documents containing words ending with "ing"
SELECT id, title FROM documents
WHERE MATCH_REGEXP(content, '.*ing', map('analyzer', 'auto'));
-- Find documents containing digits
SELECT id, title FROM documents
WHERE MATCH_REGEXP(content, '.*[0-9].*', map('analyzer', 'auto'));
Important Parameter Notes
analyzer Option
Recommended auto parameter:
map('analyzer', 'auto') -- Automatically matches the column's tokenizer settings
⚠️ Note: The tokenizer in the function must match the tokenizer used when creating the index; otherwise, the index cannot be leveraged for acceleration!
Index Management
Index Lifecycle
1. Create Index
-- Method 1: Create at table creation
CREATE TABLE documents (
id BIGINT,
content STRING,
INDEX content_idx (content) INVERTED PROPERTIES('analyzer'='chinese')
);
-- Method 2: Create for existing table
CREATE INVERTED INDEX content_idx ON TABLE documents(content)
PROPERTIES('analyzer'='chinese', 'mode'='smart');
2. Build Index (Important!)
-- Build index for existing data (synchronous task)
BUILD INDEX content_idx ON documents;
-- Build index by partition
BUILD INDEX content_idx ON documents
WHERE partition_date >= '2024-01-01';
⚠️ Note:
CREATE INDEX only takes effect for new data
Existing data requires BUILD INDEX to leverage the index
Building an index consumes compute resources; it is recommended to perform during off-peak hours
3. View Index
-- List all indexes on a table
SHOW INDEX FROM table_name;
-- View index details (if supported by the environment)
DESC INDEX index_name;
DESC INDEX EXTENDED index_name; -- Includes size information
4. Drop Index
DROP INDEX index_name ON table_name;
Drop characteristics:
Metadata deleted immediately
Index files cleaned up asynchronously
Does not affect the data itself
Index Performance Tuning
Partitioned Table Index Strategy
-- Recommended approach for large table partition building
BUILD INDEX content_idx ON large_table
WHERE year = '2024' AND month = '01';
BUILD INDEX content_idx ON large_table
WHERE year = '2024' AND month = '02';
-- Build partition by partition to avoid excessive resource consumption
Storage Cost Optimization
-- View index storage occupancy (if supported)
DESC INDEX EXTENDED index_name;
-- Choose appropriate tokenizer based on business needs
-- keyword: Least storage, exact match only
-- chinese: Medium storage, supports intelligent tokenization
-- unicode: Most storage, supports all languages
Application Scenario Design
Enterprise Knowledge Base System
-- Knowledge base table design
CREATE TABLE knowledge_base (
doc_id BIGINT PRIMARY KEY,
title STRING,
content STRING,
category STRING,
tags STRING,
created_at TIMESTAMP,
updated_at TIMESTAMP,
-- Multi-level index design
INDEX title_keyword_idx (title) INVERTED PROPERTIES('analyzer'='keyword'),
INDEX content_chinese_idx (content) INVERTED PROPERTIES('analyzer'='chinese', 'mode'='smart'),
INDEX tags_unicode_idx (tags) INVERTED PROPERTIES('analyzer'='unicode')
);
-- Multi-dimensional search query
SELECT doc_id, title FROM knowledge_base
WHERE MATCH_ANY(content, 'AI Machine Learning', map('analyzer', 'auto'))
OR MATCH_ANY(tags, 'AI ML', map('analyzer', 'auto'))
ORDER BY updated_at DESC LIMIT 20;
Log Analysis System
-- Application log table
CREATE TABLE application_logs (
log_id BIGINT,
timestamp TIMESTAMP,
level STRING,
message STRING,
stack_trace STRING,
source_ip STRING,
INDEX message_idx (message) INVERTED PROPERTIES('analyzer'='english'),
INDEX stack_trace_idx (stack_trace) INVERTED PROPERTIES('analyzer'='keyword')
) PARTITIONED BY (DATE(timestamp));
-- Troubleshooting query
SELECT log_id, timestamp, message FROM application_logs
WHERE DATE(timestamp) >= CURRENT_DATE() - INTERVAL '7' DAY
AND level = 'ERROR'
AND MATCH_ANY(message, 'timeout connection database', map('analyzer', 'auto'))
ORDER BY timestamp DESC;
E-Commerce Product Search
-- Product information table
CREATE TABLE products (
product_id BIGINT,
name STRING,
description STRING,
brand STRING,
category STRING,
tags STRING,
price DECIMAL(10,2),
INDEX name_chinese_idx (name) INVERTED PROPERTIES('analyzer'='chinese', 'mode'='max_word'),
INDEX desc_chinese_idx (description) INVERTED PROPERTIES('analyzer'='chinese', 'mode'='max_word'),
INDEX brand_keyword_idx (brand) INVERTED PROPERTIES('analyzer'='keyword')
);
-- Product search query
SELECT product_id, name, price FROM products
WHERE MATCH_ANY(name, 'phone smart camera', map('analyzer', 'auto'))
OR MATCH_ANY(description, 'phone smart camera', map('analyzer', 'auto'))
ORDER BY price;
Content Management System
-- Article content table
CREATE TABLE articles (
article_id BIGINT,
title STRING,
content STRING,
author STRING,
publish_date DATE,
status STRING,
INDEX title_chinese_idx (title) INVERTED PROPERTIES('analyzer'='chinese'),
INDEX content_chinese_idx (content) INVERTED PROPERTIES('analyzer'='chinese', 'mode'='smart'),
INDEX author_keyword_idx (author) INVERTED PROPERTIES('analyzer'='keyword')
);
-- Content search
SELECT article_id, title, author FROM articles
WHERE status = 'published'
AND publish_date >= CURRENT_DATE() - INTERVAL '30' DAY
AND MATCH_PHRASE(content, 'technology development trends', map('analyzer', 'auto'))
ORDER BY publish_date DESC;
Important Pitfall Avoidance Guide
Tokenizer Consistency Issue
Common Error: Tokenizer Mismatch
-- ❌ Error example: Index uses chinese tokenizer, query uses english tokenizer
CREATE TABLE docs (
content STRING,
INDEX content_idx (content) INVERTED PROPERTIES('analyzer'='chinese')
);
SELECT * FROM docs
WHERE MATCH_ANY(content, 'test', map('analyzer', 'english'));
-- Result: Cannot leverage index, poor performance
Correct Approach:
-- ✅ Solution 1: Use auto parameter (recommended)
SELECT * FROM docs
WHERE MATCH_ANY(content, 'test', map('analyzer', 'auto'));
-- ✅ Solution 2: Manually match the index tokenizer
SELECT * FROM docs
WHERE MATCH_ANY(content, 'test', map('analyzer', 'chinese'));
Index Build Pitfalls
The Necessity of BUILD INDEX
-- ❌ Common error: Forgetting to build index
CREATE INVERTED INDEX content_idx ON TABLE existing_table(content)
PROPERTIES('analyzer'='chinese');
-- Query directly (only effective for new data, existing data cannot leverage index)
SELECT * FROM existing_table
WHERE MATCH_ANY(content, 'keyword', map('analyzer', 'auto'));
Correct Index Build Process
-- ✅ Complete process
-- 1. Create index
CREATE INVERTED INDEX content_idx ON TABLE existing_table(content)
PROPERTIES('analyzer'='chinese');
-- 2. Build index (required!)
BUILD INDEX content_idx ON existing_table;
-- 3. Verify index
SHOW INDEX FROM existing_table;
-- 4. Execute query
SELECT * FROM existing_table
WHERE MATCH_ANY(content, 'keyword', map('analyzer', 'auto'));
Query Performance Pitfalls
Performance Limitations for Sub-Second Queries
According to the official documentation: In most cases, inverted indexes do not significantly improve performance for queries with sub-second execution times.
-- ❌ Not ideal for inverted index scenarios
-- 1. Small tables (fewer than hundreds of rows)
-- 2. Queries already executing in <1 second
-- 3. Simple equality queries
-- ✅ Suitable for inverted index scenarios
-- 1. Large data tables (10,000+ rows)
-- 2. Complex text searches
-- 3. Multi-keyword matching
Unsupported Query Patterns
-- ❌ Not supported: Type conversion on the column
SELECT * FROM docs
WHERE MATCH_ANY(CAST(id AS STRING), '123', map('analyzer', 'auto'));
-- ✅ Supported: Type conversion on the query value
SELECT * FROM docs
WHERE MATCH_ANY(content, CAST(123 AS STRING), map('analyzer', 'auto'));
-- ❌ Not supported: External tables
-- Inverted indexes do not support external tables
Storage Cost Pitfalls
Index Storage Overhead
-- Inverted indexes create additional index files
-- Storage cost = Original data + Index files
-- Optimization suggestions:
-- 1. Only create indexes for columns that truly need searching
-- 2. Choose appropriate tokenizer based on query patterns
-- 3. Regularly clean up unnecessary indexes
Tokenizer Selection Strategy
Tokenizer
Index Size
Query Performance
Feature Richness
Recommended Scenario
keyword
Smallest
Fastest
Exact match
IDs, statuses, tags
english
Small
Fast
English tokenization
Pure English content
chinese
Medium
Medium
Chinese tokenization
Chinese documents
unicode
Largest
Relatively slow
Full language support
Multi-language mixed
Performance Optimization
Query Optimization Strategies
1. Choose Query Functions Wisely
-- Choose appropriate functions based on business needs
-- Exact search: Use MATCH_ALL
SELECT * FROM docs WHERE MATCH_ALL(content, 'Machine Learning Algorithm', map('analyzer', 'auto'));
-- Broad search: Use MATCH_ANY
SELECT * FROM docs WHERE MATCH_ANY(content, 'AI Machine Learning Deep Learning', map('analyzer', 'auto'));
-- Phrase search: Use MATCH_PHRASE
SELECT * FROM docs WHERE MATCH_PHRASE(content, 'Natural Language Processing', map('analyzer', 'auto'));
2. Combined Query Optimization
-- ✅ Recommended: Use full-text search as the primary filter condition
SELECT * FROM documents
WHERE MATCH_ANY(content, 'keyword', map('analyzer', 'auto'))
AND category = 'tech' -- Secondary filtering on top of full-text search
AND created_at >= '2024-01-01';
3. Paginated Query Optimization
-- ✅ Efficient paginated query
SELECT id, title FROM documents
WHERE MATCH_ANY(content, 'keyword', map('analyzer', 'auto'))
ORDER BY id -- Sort by primary key
LIMIT 20 OFFSET 0;
-- -- Avoid deep pagination
-- -- LIMIT 20 OFFSET 10000; -- Performance degrades as offset increases
Index Optimization Strategies
1. Partitioned Table Index Management
-- Build index gradually by partition
BUILD INDEX content_idx ON large_table
WHERE year = '2024' AND month = '06';
-- Monitor build progress
-- Job Profile can be used to view progress
2. Index Maintenance Strategy
-- Regularly check index status
SHOW INDEX FROM table_name;
-- For frequently updated tables, index rebuild may be needed
DROP INDEX old_idx ON table_name;
CREATE INVERTED INDEX new_idx ON TABLE table_name(column)
PROPERTIES('analyzer'='chinese');
BUILD INDEX new_idx ON table_name;
System Resource Optimization
1. Compute Resource Configuration
Index building: Use larger Virtual Clusters for BUILD INDEX operations
Query execution: Use smaller Virtual Clusters for daily queries
Mixed workloads: Separate index building and query workloads
2. Storage Resource Optimization
-- Monitor index storage occupancy (if supported)
DESC INDEX EXTENDED index_name;
-- Clean up unnecessary indexes
DROP INDEX unused_idx ON table_name;
Migration Strategy
Migrating from Traditional LIKE Queries
Migration Assessment
Scenarios suitable for migration:
Frequent text search queries
Large data tables (10,000+ rows)
Complex multi-keyword searches
Scenarios requiring Chinese tokenization
Scenarios not in a hurry to migrate:
Small data tables
Occasional text queries
Simple queries already fast (<1 second)
Migration Steps
-- 1. Analyze existing query patterns
-- Identify frequent LIKE queries
SELECT query_text, execution_count
FROM query_logs
WHERE query_text LIKE '%LIKE%'
ORDER BY execution_count DESC;
-- 2. Create test table to verify effects
CREATE TABLE docs_test AS SELECT * FROM docs_original LIMIT 1000;
-- 3. Add inverted index
CREATE INVERTED INDEX content_idx ON TABLE docs_test(content)
PROPERTIES('analyzer'='chinese');
BUILD INDEX content_idx ON docs_test;
-- 4. Performance comparison test
-- Original LIKE query
SELECT COUNT(*) FROM docs_test WHERE content LIKE '%keyword%';
-- Full-text search query
SELECT COUNT(*) FROM docs_test
WHERE MATCH_ANY(content, 'keyword', map('analyzer', 'auto'));
-- 5. Gradually migrate to production environment
Migrating from External Search Engines
Elasticsearch Migration Mapping
Elasticsearch Query
Lakehouse Full-Text Search
Description
match_all
MATCH_ALL
Match all keywords
match
MATCH_ANY
Match any keyword
match_phrase
MATCH_PHRASE
Phrase matching
match_phrase_prefix
MATCH_PHRASE_PREFIX
Phrase prefix matching
regexp
MATCH_REGEXP
Regex matching
Migration Benefits Assessment
Architecture simplification benefits:
Eliminate 1 separate search engine system
Unified data storage and retrieval platform
Reduced operational complexity
Cost optimization benefits:
Reduced additional storage and compute resources
No data sync ETL processes needed
Lower total cost of ownership
Data consistency benefits:
Strong consistency guarantees
No data sync delays
Real-time search of latest data
Hybrid Solution Design
For complex enterprise environments, a hybrid approach can be adopted:
-- Approach 1: Separate by data type
-- Structured queries + Simple text search: Use Lakehouse full-text search
-- Complex semantic search + Recommendation algorithms: Keep Elasticsearch
-- Approach 2: Separate by real-time requirements
-- Real-time search: Use Lakehouse full-text search
-- Offline analysis: Use Elasticsearch
-- Approach 3: Separate by data volume
-- Large data volumes: Use Lakehouse full-text search
-- Small volume rapid prototyping: Use Elasticsearch
Best Practices Summary
Index Design Best Practices
1. Tokenizer Selection Principles
-- Choose tokenizer based on data characteristics
CREATE TABLE multilingual_docs (
id BIGINT,
title_cn STRING, -- Chinese title
title_en STRING, -- English title
content STRING, -- Mixed content
tags STRING, -- Tags (exact match)
INDEX title_cn_idx (title_cn) INVERTED PROPERTIES('analyzer'='chinese', 'mode'='smart'),
INDEX title_en_idx (title_en) INVERTED PROPERTIES('analyzer'='english'),
INDEX content_idx (content) INVERTED PROPERTIES('analyzer'='unicode'),
INDEX tags_idx (tags) INVERTED PROPERTIES('analyzer'='keyword')
);
2. Index Maintenance Strategy
-- Production environment index maintenance process
-- 1. Create index during off-peak hours
CREATE INVERTED INDEX content_idx ON TABLE large_table(content)
PROPERTIES('analyzer'='chinese');
-- 2. Build index in batches
BUILD INDEX content_idx ON large_table
WHERE partition_date = '2024-06-01';
-- 3. Verify index effectiveness
SELECT COUNT(*) FROM large_table
WHERE MATCH_ANY(content, 'test keyword', map('analyzer', 'auto'));
-- 4. Monitor index status
SHOW INDEX FROM large_table;
Query Pattern Best Practices
1. Query Function Selection Guide
Business Scenario
Recommended Function
Example Query
Exact search with multiple keywords
MATCH_ALL
Must contain "AI" and "Machine Learning"
Broad topic search
MATCH_ANY
Contains "AI" or "ML" or "Deep Learning"
Professional term search
MATCH_PHRASE
Exact match for "Natural Language Processing"
Search suggestions / Autocomplete
MATCH_PHRASE_PREFIX
Phrases starting with "Data An"
Pattern matching
MATCH_REGEXP
Contains digits or specific formats
2. Performance-Optimized Query Patterns
-- Efficient query patterns
-- 1. Leverage partition filtering
SELECT * FROM logs
WHERE log_date >= '2024-01-01'
AND MATCH_ANY(message, 'error timeout', map('analyzer', 'auto'));
-- 2. Combine index filtering
SELECT * FROM documents
WHERE MATCH_ALL(content, 'keyword1 keyword2', map('analyzer', 'auto'))
AND status = 'active';
Monitoring and Diagnostics
1. Index Effectiveness Verification
-- Verify if index is being used (via execution plan)
EXPLAIN SELECT * FROM documents
WHERE MATCH_ANY(content, 'keyword', map('analyzer', 'auto'));
-- Compare LIKE query performance
-- Method: Record query execution time, compare optimization effects
2. Common Issue Diagnosis
-- Check tokenization results
SELECT TOKENIZE('test text', map('analyzer', 'auto'));
-- Check index status
SHOW INDEX FROM table_name;
-- Verify query syntax
SELECT MATCH_ANY('test text', 'keyword', map('analyzer', 'chinese'));
Known Limitations and Considerations
Functional Limitations
1. Unsupported Scenarios
External tables: Inverted indexes do not support external tables
Column type conversion: Forced type conversion on table columns is not supported
2. Query Limitations
-- ❌ Unsupported query patterns
-- 1. Type conversion on table columns
WHERE MATCH_ANY(CAST(column AS STRING), 'value', map('analyzer', 'auto'))
-- 2. Creating inverted index on external table
CREATE INVERTED INDEX idx ON EXTERNAL_TABLE table_name(column)
PROPERTIES('analyzer'='chinese');
Performance Considerations
1. Index Build Cost
BUILD INDEX is a synchronous operation that consumes compute resources
Large tables are recommended to be built partition by partition
Recommended to perform during off-peak hours
2. Storage Cost
Inverted indexes create additional index files
Index size depends on data volume and tokenizer type
Need to balance query performance and storage cost
Operational Considerations
1. Version Compatibility
Recommend thorough validation in test environments
Monitor version updates and feature improvements
2. Backup and Recovery
Index metadata is backed up along with the table structure
Index files need to be rebuilt
BUILD INDEX needs to be executed after recovery
Quick Reference Card
Tokenizer Selection Quick Reference
-- Exact matching (IDs, statuses, etc.)
'analyzer'='keyword'
-- Pure English content
'analyzer'='english'
-- Chinese content (recommended)
'analyzer'='chinese', 'mode'='smart'
-- Multi-language mixed
'analyzer'='unicode'
-- Auto matching (recommended)
'analyzer'='auto'
Common Query Templates
-- 1. Basic text search
SELECT * FROM table_name
WHERE MATCH_ANY(column, 'keyword', map('analyzer', 'auto'));
-- 2. Exact phrase search
SELECT * FROM table_name
WHERE MATCH_PHRASE(column, 'exact phrase', map('analyzer', 'auto'));
-- 3. Multi-condition combined search
SELECT * FROM table_name
WHERE MATCH_ALL(content, 'keyword1 keyword2', map('analyzer', 'auto'))
AND category = 'target_category';
-- 4. Paginated query
SELECT id, title FROM table_name
WHERE MATCH_ANY(content, 'keyword', map('analyzer', 'auto'))
ORDER BY id LIMIT 20 OFFSET 0;
Index Management Commands
-- Create index
CREATE INVERTED INDEX idx_name ON TABLE table_name(column)
PROPERTIES('analyzer'='chinese');
-- Build index
BUILD INDEX idx_name ON table_name;
-- View index
SHOW INDEX FROM table_name;
-- Drop index
DROP INDEX idx_name ON table_name;
Note: This document is based on Lakehouse product documentation as of June 2025. It is recommended to regularly check the official documentation for the latest updates. Before using in a production environment, be sure to verify the correctness and performance impact of all operations in a test environment.