Dify and Singdata Lakehouse Integration Overview

Background

Dify Platform and Provider Architecture Introduction

Dify is an open-source large language model (LLM) application development platform that combines Backend-as-a-Service (BaaS) and LLMOps principles. The name is a blend of "Define" and "Modify", and the platform aims to help developers quickly build production-grade generative AI applications.

Dify Core Features:

🚀 Rapid Development: Visual prompt orchestration interface, supports hundreds of AI models, comprehensive knowledge base management
🔧 Complete Tech Stack: Built-in high-quality RAG engine, powerful Agent framework, and flexible workflows
🌐 Open Ecosystem: Open-source under an Apache License 2.0 based license, supports self-hosted deployment and full control over data
🛠️ Rich Features: Supports multiple built-in tools and deployment options (Docker, Helm)

Dify Provider Architecture Core Concepts

Dify adopts the Provider architecture pattern to achieve decoupled integration with external services. This is a mature software design pattern, analogous to the "service provider" concept in cloud computing. In the Dify architecture, Providers assume the following key roles:

Provider Categories and Responsibilities:

🤖 Model Providers: Provide AI model services (LLM, Embedding, TTS, etc.)
- Examples: OpenAI, Anthropic, Azure OpenAI, local models
🔍 Vector Database Providers: Provide vector storage and retrieval services
- Examples: Qdrant, Milvus, Weaviate, Singdata (Singdata Lakehouse)
💾 Storage Providers: Provide file storage services
- Examples: AWS S3, Azure Blob, Alibaba Cloud OSS, Singdata Volume (Singdata Lakehouse Volume)
🛠️ Tool Providers: Provide functional tool collections
- Examples: search tools, API call tools, data processing tools

Provider Architecture Advantages:

✅ Standardized Interface: Unified integration approach, reducing development and maintenance costs
✅ Pluggable Design: Providers can be freely switched without modifying core code
✅ Strong Extensibility: New Providers can be integrated non-invasively into existing systems
✅ Flexible Configuration: Provider parameters managed flexibly via environment variables and configuration files
✅ Vendor Neutrality: Avoids vendor lock-in, supports multi-cloud and hybrid cloud deployment
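
The pluggable design described above can be illustrated with a small registry. This is a minimal sketch, not Dify's actual code: the class names `VectorStoreProvider`, `InMemoryProvider`, and `EchoProvider` and the function `load_provider` are hypothetical, and only the `VECTOR_STORE` setting name comes from the configuration examples in this document.

```python
import os
from abc import ABC, abstractmethod


class VectorStoreProvider(ABC):
    """Hypothetical standardized interface; not Dify's real base class."""

    @abstractmethod
    def search(self, query_vector, top_k):
        """Return up to top_k result identifiers for query_vector."""


class InMemoryProvider(VectorStoreProvider):
    """Toy provider that exists only to make the sketch runnable."""

    def search(self, query_vector, top_k):
        return [f"memory-hit-{i}" for i in range(top_k)]


class EchoProvider(VectorStoreProvider):
    """Second toy provider, used to show switching by configuration."""

    def search(self, query_vector, top_k):
        return [f"echo-hit-{i}" for i in range(top_k)]


# Registry: maps a configuration value to a provider class.
PROVIDERS = {
    "memory": InMemoryProvider,
    "echo": EchoProvider,
}


def load_provider(env=None):
    """Pick a provider from the VECTOR_STORE setting (defaults to 'memory')."""
    env = os.environ if env is None else env
    name = env.get("VECTOR_STORE", "memory")
    if name not in PROVIDERS:
        raise ValueError(f"unknown vector store provider: {name!r}")
    return PROVIDERS[name]()


provider = load_provider({"VECTOR_STORE": "echo"})
print(provider.search([0.1, 0.2], top_k=2))
```

Calling code depends only on the abstract interface, so switching Providers is a configuration change rather than a code change.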

The Unique Positioning of Singdata Lakehouse as a Provider

Unlike traditional single-function Providers, Singdata Lakehouse acts as a multi-role Provider in Dify:

1. Vector Database Provider: Provides HNSW vector indexing, inverted index, and hybrid search

2. Storage Provider: Provides Volume storage, multi-tenant isolation, and cross-cloud compatibility

3. Compute Provider: Provides elastic computing, SQL querying, and data processing capabilities

This "one Provider, multiple roles" design is the core value proposition of the lakehouse architecture.
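
As a rough illustration of one backend filling several roles, the sketch below defines two role interfaces and a single class that satisfies both. All names here (`StorageRole`, `VectorRole`, `UnifiedProvider`) are hypothetical and chosen for the example; they are not Dify or Singdata class names.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class StorageRole(Protocol):
    """Hypothetical file-storage role."""

    def save(self, path, data): ...
    def load(self, path): ...


@runtime_checkable
class VectorRole(Protocol):
    """Hypothetical vector-index role."""

    def add_vector(self, doc_id, vector): ...


class UnifiedProvider:
    """Single in-memory backend that fills both roles (illustrative only)."""

    def __init__(self):
        self._files = {}
        self._vectors = {}

    def save(self, path, data):
        self._files[path] = data

    def load(self, path):
        return self._files[path]

    def add_vector(self, doc_id, vector):
        self._vectors[doc_id] = list(vector)


provider = UnifiedProvider()
provider.save("docs/a.txt", b"hello")
provider.add_vector("a", [0.1, 0.2])

# One object can be handed to code that expects either role.
print(isinstance(provider, StorageRole), isinstance(provider, VectorRole))
```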

Singdata Lakehouse Introduction

Singdata Lakehouse is a cloud-native lakehouse data platform delivered as a fully managed service. It was designed from the ground up for the cloud and runs on a self-developed, next-generation SQL compute engine.

Singdata Lakehouse Core Advantages:

🏗️ Lakehouse Unification: Seamlessly integrates data warehouse, data lake, real-time processing, and business intelligence
⚡ Industry-Leading Performance: 9.84x the performance of Trino on the TPC-H 100 GB benchmark
🌊 Extreme Elasticity: Serverless architecture, second-level start/stop, on-demand scaling, per-second billing
🤖 AI-Native: Natively supports vector search, HNSW indexing, and AI model integration
🔓 Open Architecture: Supports mainstream open-source formats, no vendor lock-in
☁️ Multi-Cloud Support: Covers Alibaba Cloud, Tencent Cloud, AWS, GCP, and other major cloud platforms

Singdata Lakehouse Provider's Integration Value and Technical Comparison

An analysis of the existing Dify Provider ecosystem shows where the Singdata Lakehouse Provider differs. Dify currently supports 32 vector database Providers (including Qdrant, Milvus, Weaviate, and pgvector) and 17 storage Providers (including AWS S3, Azure Blob, and Alibaba Cloud OSS), but each of them covers a single role, so storage and vector retrieval are always configured as separate Providers.

Traditional Separated Provider Architecture vs. Singdata Unified Provider Architecture:

Technical Challenges of Existing Solutions:

Key Technical Difference Analysis:

1. Provider Integration Complexity

Traditional approach: Requires configuring a Storage Provider (S3/OSS) plus a Vector Provider (Qdrant/Milvus) and maintaining the mapping between the two data stores
Singdata Lakehouse approach: A single Provider supplies both Volume storage and vector indexing, with automatic metadata synchronization

2. Unified Retrieval Capability Comparison

Traditional Separated Approach:

Vector search: Qdrant/Milvus:

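vector_results = qdrant_client.search(collection_name=collection_name, query_vector=query_vector, limit=top_k)  # collection_name is a placeholder for the target collection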

Full-text search: requires Elasticsearch:

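text_results = es_client.search(index=index_name, query={"match": {"content": text_query}}, size=top_k)  # es_client, index_name, and the "content" field are placeholders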

Hybrid search: requires application-level fusion:

combined_results = merge_and_rank(vector_results, text_results)
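
The source does not define `merge_and_rank`. One common way to do application-level fusion is reciprocal rank fusion (RRF), sketched below as an assumption about what such a helper could look like. It works on ranked lists of document IDs and ignores the raw scores, which avoids comparing vector distances with text relevance scores directly. The function name, the constant `k=60`, and the sample IDs are illustrative.

```python
def reciprocal_rank_fusion(result_lists, k=60, top_k=10):
    """Fuse ranked lists of document IDs (best match first in each list)."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            # Each list contributes 1 / (k + rank) for every document it contains.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ranked[:top_k]]


vector_ids = ["doc-a", "doc-b", "doc-c"]  # ranking from the vector store
text_ids = ["doc-b", "doc-d", "doc-a"]    # ranking from the full-text engine
fused = reciprocal_rank_fusion([vector_ids, text_ids], top_k=3)
print(fused)  # ['doc-b', 'doc-a', 'doc-d']
```

Documents that rank well in both lists rise to the top, which is why `doc-b` overtakes `doc-a` here.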

Singdata Lakehouse Unified Approach:

Vector search - HNSW index:

# score is a cosine distance here, so ascending order returns the closest rows first
vector_results = client.execute(
    """
    SELECT content, metadata,
           COSINE_DISTANCE(vector, CAST(? AS VECTOR(1536))) AS score
    FROM dataset_table
    ORDER BY score
    LIMIT ?
    """,
    [query_vector, top_k],
)
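
To make the ordering in the query above concrete, here is a brute-force sketch of what a cosine-distance ranking computes. It assumes `COSINE_DISTANCE` follows the usual definition, 1 minus cosine similarity; that is an assumption about the engine, not something the source states. An HNSW index returns approximately the same neighbours without scanning every row. The helper names are hypothetical.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity: 0 for the same direction, 2 for opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def top_k_by_distance(query, rows, top_k):
    """rows: iterable of (content, vector). Smallest distance first."""
    scored = [(content, cosine_distance(query, vec)) for content, vec in rows]
    scored.sort(key=lambda pair: pair[1])
    return scored[:top_k]


rows = [
    ("same direction", [2.0, 0.0]),
    ("orthogonal", [0.0, 1.0]),
    ("opposite", [-1.0, 0.0]),
]
for content, distance in top_k_by_distance([1.0, 0.0], rows, top_k=2):
    print(content, round(distance, 3))
```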

Full-text search - inverted index (Chinese word segmentation):

text_results = client.execute(
    """
    SELECT content, metadata, SCORE(content) AS score
    FROM dataset_table
    WHERE MATCH_ALL(content, ?)
    ORDER BY score DESC
    LIMIT ?
    """,
    [text_query, top_k],
)

SQL Like search - fallback option:

like_results = client.execute(
    """
    SELECT content, metadata, 0.5 AS score
    FROM dataset_table
    WHERE content LIKE ?
    LIMIT ?
    """,
    [f"%{text_query}%", top_k],
)
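
When the LIKE fallback receives raw user input, any `%` or `_` in the query acts as a wildcard. A small helper can escape those characters before the pattern is built. This is a sketch under an assumption: whether the engine treats a backslash as the LIKE escape character by default, or needs an explicit `ESCAPE` clause, should be checked against the Singdata SQL reference. The helper names are hypothetical.

```python
def escape_like(term, escape_char="\\"):
    """Escape LIKE wildcards so the user's text is matched literally."""
    return (
        term.replace(escape_char, escape_char * 2)  # escape the escape character first
        .replace("%", escape_char + "%")
        .replace("_", escape_char + "_")
    )


def like_pattern(term):
    """Build a 'contains' pattern around the escaped term."""
    return f"%{escape_like(term)}%"


print(like_pattern("50%_off"))
```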

Hybrid search - unified single SQL processing:

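# COSINE_DISTANCE is lower-is-better, so it is flipped to a similarity (1 - distance)
# before being blended with SCORE, which is higher-is-better.
hybrid_results = client.execute(
    """
    SELECT content, metadata,
           ((1 - COSINE_DISTANCE(vector, CAST(? AS VECTOR(1536)))) * 0.7
            + SCORE(content) * 0.3) AS combined_score
    FROM dataset_table
    WHERE MATCH_ALL(content, ?)
    ORDER BY combined_score DESC
    LIMIT ?
    """,
    [query_vector, text_query, top_k],
)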
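
A weighted blend only ranks sensibly when both inputs point the same way: a cosine distance is smaller for better matches, while a text relevance score is larger for better matches. The sketch below shows the arithmetic with the 0.7 / 0.3 weights used in this section. It assumes the text score is already scaled to roughly the 0 to 1 range, which the source does not confirm; the function name and sample rows are illustrative.

```python
def hybrid_score(cos_distance, text_score, vector_weight=0.7, text_weight=0.3):
    """Blend a cosine distance and a text relevance score; higher is better."""
    similarity = 1.0 - cos_distance  # flip the distance so larger means closer
    return vector_weight * similarity + text_weight * text_score


rows = [
    # (content, cosine distance, text score)
    ("close vector, weak text", 0.1, 0.2),
    ("far vector, strong text", 0.8, 0.9),
]
ranked = sorted(rows, key=lambda r: hybrid_score(r[1], r[2]), reverse=True)
for content, distance, text in ranked:
    print(content, round(hybrid_score(distance, text), 2))
```

With these weights the vector term dominates, so the row with the closer vector ranks first even though its text score is lower.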

3. Performance Optimization Strategy Comparison

pgvector: HNSW and IVFFlat indexes are limited to 2,000 dimensions for the vector type, and the index lives on a single PostgreSQL instance
Milvus: Distributed architecture, requires independent cluster operations
Singdata: CRU-based elastic computing with automatic resource scheduling; 9.84x the performance of Trino on the TPC-H 100 GB benchmark

4. Provider Configuration Cross-Cloud Compatibility

Traditional separated Provider approach: cross-cloud migration requires modifying multiple Provider configs:

Migrating from AWS S3 to Alibaba Cloud OSS:

STORAGE_TYPE = 's3'      # needs to be changed to 'aliyun-oss'
AWS_ACCESS_KEY = '...'   # needs to be changed to ALIYUN_OSS_ACCESS_KEY
AWS_SECRET_KEY = '...'   # needs to be changed to ALIYUN_OSS_SECRET_KEY
VECTOR_STORE = 'qdrant'  # Vector Provider may also need adjustment

Singdata unified Provider approach: Volume abstraction layer shields cloud storage differences:

STORAGE_TYPE = 'clickzetta-volume'  # unchanged across clouds
VECTOR_STORE = 'clickzetta'         # unchanged across clouds
CLICKZETTA_VOLUME_TYPE = 'user'     # unchanged across clouds

The Singdata Provider backend adapts to the underlying cloud storage automatically.
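
One way to see the difference is to count the settings that change during a migration. The sketch below compares the configurations from this section as plain dictionaries. The setting names are taken from the examples above, the `<key>` and `<secret>` values are placeholders, and `changed_keys` is a hypothetical helper written for the illustration.

```python
def changed_keys(before, after):
    """Return the sorted setting names that differ between two configs."""
    keys = set(before) | set(after)
    return sorted(k for k in keys if before.get(k) != after.get(k))


# Separated Providers: storage settings are cloud-specific.
separated_aws = {
    "STORAGE_TYPE": "s3",
    "AWS_ACCESS_KEY": "<key>",
    "AWS_SECRET_KEY": "<secret>",
    "VECTOR_STORE": "qdrant",
}
separated_aliyun = {
    "STORAGE_TYPE": "aliyun-oss",
    "ALIYUN_OSS_ACCESS_KEY": "<key>",
    "ALIYUN_OSS_SECRET_KEY": "<secret>",
    "VECTOR_STORE": "qdrant",
}

# Unified Provider: the same settings are used on every cloud.
unified_aws = {
    "STORAGE_TYPE": "clickzetta-volume",
    "VECTOR_STORE": "clickzetta",
    "CLICKZETTA_VOLUME_TYPE": "user",
}
unified_aliyun = dict(unified_aws)

print(changed_keys(separated_aws, separated_aliyun))
print(changed_keys(unified_aws, unified_aliyun))
```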

Goals

Primary Goals:

🎯 Provide Dify with a unified lakehouse storage solution
🚀 Achieve high-performance vector retrieval and full-text search
📈 Support storage and processing of large-scale datasets
🔧 Provide a complete Provider integration solution
🛡️ Ensure enterprise-grade security and reliability

Technical Goals:

Achieve high-performance vector retrieval on Singdata Lakehouse elastic computing
Use cloud object storage to hold massive document collections
Provide multi-tenant data isolation and permission management
Support hybrid search (vector + full-text)
Offer serverless elastic scaling with pay-as-you-go billing

Core Value

For Users:

📚 Unified Knowledge Base: File storage, vector retrieval, and full-text search integrated into one
🔍 Multi-Mode Retrieval: HNSW vector search, inverted-index full-text search, SQL LIKE, and hybrid search, covering Dify's retrieval scenarios
⚡ Elastic Computing: CRU-based Serverless computing, second-level start/stop and scaling
🔒 Enterprise-Grade Security: Complete permission management and data isolation
💰 Cost Optimization: Storage-compute separation, per-second billing based on actual usage
🌐 Cross-Cloud Compatibility: Lakehouse Volume shields cloud storage differences, zero code changes for cross-cloud migration