Singdata Lakehouse Volume Best Practices Guide

Content Overview

This document provides best practice guidance for Singdata Lakehouse Volume features, helping you efficiently manage data files, optimize storage performance, reduce costs, and improve data operation efficiency. Key topics include:

  • Volume Basics: Core concepts, applicable scenarios, and creation methods for different Volume types
  • Daily Operations Guide: Common file management commands and format settings
  • Performance Optimization Strategies: Storage efficiency improvements and access speed optimization methods
  • Automated Workflows: Using advanced features such as PIPE and DIRECTORY
  • Security and Cost Control: Permission management and cost optimization strategies
  • Troubleshooting: Solutions to common issues and best practices

All SQL examples have been validated in actual environments and can be directly applied in production.


Volume Overview

Volume is the core storage abstraction of Singdata Lakehouse, providing unified file management and data access capabilities. With Volume, you can:

  • Unified File Management: Consistent operation experience across different storage systems
  • Secure Access Control: Permission-based file access management
  • Automated Data Pipelines: Automatic data import and cleanup via PIPE
  • Metadata Queries: Detailed file information via the DIRECTORY function
  • Intelligent Optimization: Automatic small file compaction and storage optimization
  • Multi-format Support: Full support for CSV, Parquet, JSON, TEXT, and more

Volume Types and Selection

User Volume

Applicable Scenarios: Temporary data storage, intermediate results, personal workspace

-- View User Volume contents SHOW USER VOLUME DIRECTORY; -- List all files LIST USER VOLUME; -- Filter by regular expression LIST USER VOLUME REGEXP = '.*\.csv'; -- View by subdirectory LIST USER VOLUME SUBDIRECTORY 'temp/';

Characteristics:

  • Ready to use, no configuration required
  • Suitable for temporary data storage
  • Users have management privileges by default
  • DIRECTORY function not supported
  • PIPE direct monitoring not supported

Table Volume

Applicable Scenarios: Table-level data management, ETL intermediate results, table-related file storage

-- View Table Volume directory SHOW TABLE VOLUME DIRECTORY table_name; -- List Table Volume files LIST TABLE VOLUME table_name; -- Filter by regular expression LIST TABLE VOLUME table_name REGEXP = '.*\.parquet'; -- View specific subdirectory LIST TABLE VOLUME table_name SUBDIRECTORY 'backup/';

Characteristics:

  • Each table automatically has an independent Volume space
  • Bound to table lifecycle and permissions
  • Supports standard Volume operations
  • Convenient for table-level data management and backup

External Volume

Applicable Scenarios: Formal data lake, production data storage, cross-system data sharing

Creating External Volume

-- Step 1: Create Storage Connection CREATE STORAGE CONNECTION my_oss_connection TYPE = 'OSS' -- Supports multiple storage types: OSS/S3/COS/GCS PROPERTIES = ( 'access_key' = 'your_access_key', 'secret_key' = 'your_secret_key', 'endpoint' = 'oss-region.aliyuncs.com' ); -- Step 2: Create External Volume CREATE VOLUME my_external_volume WITH CONNECTION = my_oss_connection LOCATION = 'oss://bucket-name/path/' DIRECTORY = (enable = TRUE) -- Enable DIRECTORY feature COMMENT = 'External storage volume'; -- Step 3: Set access permissions GRANT READ VOLUME ON VOLUME my_schema.my_external_volume TO ROLE data_analyst; GRANT WRITE VOLUME ON VOLUME my_schema.my_external_volume TO ROLE data_engineer;

Using External Volume

-- View all External Volumes SHOW VOLUMES; -- Check Volume configuration DESC VOLUME schema_name.volume_name; -- List Volume files LIST VOLUME schema_name.volume_name; -- Advanced filtering LIST VOLUME schema_name.volume_name SUBDIRECTORY 'data/2024/' REGEXP = '.*\.parquet';

Characteristics:

  • Supports DIRECTORY function (requires enabling)
  • Supports PIPE automatic monitoring
  • Production-grade data management
  • Multi-cloud storage support (OSS/COS/S3/GCS)
  • Can integrate with external data lakes

Basic File Management

File List Query

-- Basic list query LIST USER VOLUME; LIST TABLE VOLUME my_table; LIST VOLUME mcp_demo.data_volume; -- Subdirectory query LIST USER VOLUME SUBDIRECTORY 'reports/2024/'; LIST TABLE VOLUME my_table SUBDIRECTORY 'backups/'; LIST VOLUME mcp_demo.data_volume SUBDIRECTORY 'csv_data/'; -- Regular expression filtering LIST USER VOLUME REGEXP = '.*\.parquet'; LIST TABLE VOLUME my_table REGEXP = '.*backup.*'; LIST VOLUME mcp_demo.data_volume REGEXP = '.*month=0[1-5].*'; -- Combined query: subdirectory + regular expression LIST VOLUME mcp_demo.data_volume SUBDIRECTORY 't_search_log' REGEXP = '.*c000';

File Deletion Management (REMOVE Command)

-- Delete a single file REMOVE USER VOLUME FILE 'temp/data.csv'; REMOVE TABLE VOLUME my_table FILE 'backup/old_data.parquet'; REMOVE VOLUME my_external_vol FILE 'processed/result.json'; -- Delete multiple files (requires multiple REMOVE command executions) -- Note: The current version does not support comma-separated multi-file deletion REMOVE USER VOLUME FILE 'temp/file1.csv'; REMOVE USER VOLUME FILE 'temp/file2.csv'; -- Delete entire directory (recursive deletion) REMOVE USER VOLUME SUBDIRECTORY 'temp/'; REMOVE TABLE VOLUME my_table SUBDIRECTORY 'old_backups/'; REMOVE VOLUME my_external_vol SUBDIRECTORY 'archive/2023/'; -- Note: REMOVE will also delete the actual files in storage; this action is irreversible

Directory Operation Best Practices

-- 1. Organize directory structure by business scenario -- Recommended directory structure: -- /raw/ # Raw data -- /processed/ # Processed data -- /temp/ # Temporary data -- /backup/ # Backup data -- /archive/ # Archived data -- 2. Organize by time-based partitions LIST USER VOLUME REGEXP = '.*2024.*'; -- By year LIST TABLE VOLUME my_table REGEXP = '.*202405.*'; -- By year-month LIST VOLUME external_vol REGEXP = '.*daily_20240529.*'; -- By date -- 3. Categorize by file type LIST USER VOLUME REGEXP = '.*\.csv'; -- CSV files LIST TABLE VOLUME my_table REGEXP = '.*\.parquet'; -- Parquet files LIST VOLUME external_vol REGEXP = '.*\.json'; -- JSON files

File Management Notes

  1. PUT/GET Command Limitations: Only available in clients such as SQLLine; Studio web UI does not support these commands
  2. File Size Limit: Single file must not exceed 5 GB
  3. File Overwrite Risk: Files with the same name will be automatically overwritten; exercise caution
  4. Permission Dependencies: Operation permissions depend on Volume type (see Permission Management section)

File Format and Compression Optimization

Supported File Formats

Singdata Lakehouse supports multiple file formats, each with its optimal usage scenarios:

-- CSV format: For data exchange, human-readable -- Parquet format: For analytical scenarios, columnar storage -- JSON format: For semi-structured data (output in JSON LINE format) -- TEXT format: For plain text data -- ORC format: Optimized row-column hybrid format -- BSON format: Binary JSON format

Compression Format Configuration

-- Parquet export (recommended, built-in compression) COPY INTO USER VOLUME SUBDIRECTORY 'parquet_data/' FROM my_table FILE_FORMAT = (TYPE = PARQUET); -- Automatically applies optimal compression -- TEXT format export (plain text) COPY INTO USER VOLUME SUBDIRECTORY 'text_output/' FROM my_table FILE_FORMAT = (TYPE = TEXT); -- Compression parameter settings for import (in OPTIONS) COPY INTO target_table FROM USER VOLUME USING CSV OPTIONS('header'='true') FILES ('data.csv');

File Format Parameter Notes

-- CSV import parameter settings COPY INTO target_table FROM USER VOLUME USING CSV OPTIONS( 'header'='true', -- Includes header row 'sep'='|' -- Delimiter ) FILES ('special_format.csv');

Format Selection Advice

FormatCompression RatioQuery PerformanceApplicable Scenario
Parquet (recommended)HighBestAnalytical queries, long-term storage
CSVMediumMediumData exchange, human-readable
JSONMediumMediumSemi-structured data, API interfaces
TEXTLowLowLog files, simple text

Error Handling and Data Quality

Error Handling Strategy

-- Import error handling -- When encountering errors, the system returns detailed error messages by default COPY INTO target_table FROM USER VOLUME USING CSV OPTIONS('header'='true') FILES ('import_data.csv'); -- For potential format errors, first import a small sample for validation COPY INTO target_table FROM USER VOLUME USING CSV OPTIONS('header'='true') FILES ('sample_data.csv');

PURGE Auto-Cleanup

The PURGE feature can automatically delete source files after a successful data import, saving storage space:

-- Automatically delete source files after successful import COPY INTO target_table FROM USER VOLUME (id INT, name STRING, value DOUBLE) USING CSV FILES ('import_data.csv') PURGE = TRUE; -- Source file will be permanently deleted after successful import -- Complete import example COPY INTO sales_data FROM VOLUME data_lake_volume (order_id BIGINT, customer_name STRING, amount DECIMAL(10,2), order_date DATE) USING CSV OPTIONS('header'='true', 'sep'=',') SUBDIRECTORY 'daily_sales/' PURGE = TRUE; -- Clean up on success

Data Quality Assurance Advice

  1. Small Batch Testing First: Use small datasets to validate format and parameters
  2. Error Log Analysis: Carefully analyze error causes and adjust
  3. Backup Source Files: Ensure backups exist before using PURGE
  4. Format Validation: Verify file format consistency before importing

Presigned URL Best Practices

Basic URL Generation

-- Generate presigned URL for User Volume files SELECT GET_PRESIGNED_URL(USER VOLUME, 'data/report.csv', 3600) AS url; -- Generate presigned URL for Table Volume files SELECT GET_PRESIGNED_URL(TABLE VOLUME my_table, 'backup/data.parquet', 3600) AS url; -- Generate presigned URL for External Volume files SELECT GET_PRESIGNED_URL(VOLUME mcp_demo.data_volume, 'path/file.csv', 3600) AS url; -- Generate presigned URL for Named Volume files SELECT GET_PRESIGNED_URL(VOLUME my_schema.my_named_volume, 'shared/report.pdf', 7200) AS url;

External URL Control

-- Force generation of externally accessible URLs SET cz.sql.function.get.presigned.url.force.external=true; -- Generate external URL (recommended for sharing) SELECT GET_PRESIGNED_URL(USER VOLUME, 'report.pdf', 7200) AS external_url; -- Batch URL generation (for file sharing lists) SELECT relative_path, size, GET_PRESIGNED_URL(USER VOLUME, relative_path, CASE WHEN relative_path LIKE '%temp%' THEN 1800 -- Temporary files: 30 minutes WHEN relative_path LIKE '%archive%' THEN 86400 -- Archived files: 24 hours ELSE 3600 -- Default: 1 hour END ) AS access_url FROM (SHOW USER VOLUME DIRECTORY) WHERE relative_path LIKE '%.csv';

Expiration Management Strategy

-- Short-term access (30 minutes) - for temporary downloads SELECT GET_PRESIGNED_URL(USER VOLUME, 'temp.csv', 1800) AS short_url; -- Standard access (1 hour) - for routine operations SELECT GET_PRESIGNED_URL(USER VOLUME, 'data.csv', 3600) AS standard_url; -- Long-term access (12 hours) - for large file downloads SELECT GET_PRESIGNED_URL(USER VOLUME, 'large_archive.zip', 43200) AS long_url; -- Ultra-long access (7 days) - for cross-timezone collaboration SELECT GET_PRESIGNED_URL(USER VOLUME, 'shared_report.xlsx', 604800) AS week_url;

Security Best Practices

  1. Least Privilege Principle: Set the shortest feasible expiration time
  2. Access Logging: Record access purpose and expiration time when generating URLs
  3. Regular Cleanup: Periodically clean up expired presigned URL records
  4. Anomaly Monitoring: Monitor abnormal URL access patterns
  5. External Access Control: Only enable external access when necessary

DIRECTORY Function Advanced Usage

Prerequisite Check

-- Check whether the Volume has DIRECTORY enabled DESC VOLUME schema_name.volume_name; -- Confirm in the result: directory_enabled: 'true' -- If not enabled, specify at creation time: -- CREATE VOLUME my_volume ... DIRECTORY = (enable = TRUE);

Basic Metadata Queries

-- Get basic file information SELECT relative_path, size, last_modified_time FROM DIRECTORY(VOLUME mcp_demo.data_volume) LIMIT 10; -- Statistical analysis by file type SELECT CASE WHEN relative_path LIKE '%.csv' THEN 'CSV' WHEN relative_path LIKE '%.parquet' THEN 'Parquet' WHEN relative_path LIKE '%.json' THEN 'JSON' WHEN relative_path LIKE '%.txt' THEN 'TEXT' ELSE 'Other' END AS file_type, COUNT(*) as file_count, SUM(CAST(size AS BIGINT)) as total_size_bytes, ROUND(SUM(CAST(size AS BIGINT))/1024/1024, 2) as total_size_mb, AVG(CAST(size AS BIGINT)) as avg_size_bytes FROM DIRECTORY(VOLUME mcp_demo.data_volume) GROUP BY CASE WHEN relative_path LIKE '%.csv' THEN 'CSV' WHEN relative_path LIKE '%.parquet' THEN 'Parquet' WHEN relative_path LIKE '%.json' THEN 'JSON' WHEN relative_path LIKE '%.txt' THEN 'TEXT' ELSE 'Other' END ORDER BY total_size_bytes DESC;

Advanced Combined Usage

-- Combine with presigned URL to generate file access catalog SELECT relative_path, ROUND(CAST(size AS BIGINT)/1024/1024, 2) as size_mb, last_modified_time, GET_PRESIGNED_URL(VOLUME mcp_demo.data_volume, relative_path, 3600) AS access_url FROM DIRECTORY(VOLUME mcp_demo.data_volume) WHERE relative_path LIKE '%.csv' AND CAST(size AS BIGINT) > 1000000 -- Filter files larger than 1 MB ORDER BY last_modified_time DESC LIMIT 10; -- Generate data directory report SELECT CASE WHEN instr(relative_path, '/') > 0 THEN substr(relative_path, 1, instr(relative_path, '/') - 1) ELSE 'root' END as directory, COUNT(*) as file_count, SUM(CAST(size AS BIGINT)) as total_size, MAX(last_modified_time) as latest_modified FROM DIRECTORY(VOLUME mcp_demo.data_volume) WHERE relative_path LIKE '%/%' GROUP BY CASE WHEN instr(relative_path, '/') > 0 THEN substr(relative_path, 1, instr(relative_path, '/') - 1) ELSE 'root' END ORDER BY total_size DESC;

Data Governance and Analysis Applications

-- File aging analysis SELECT CASE WHEN last_modified_time >= CURRENT_DATE - INTERVAL '7' DAY THEN 'Recent (7d)' WHEN last_modified_time >= CURRENT_DATE - INTERVAL '30' DAY THEN 'Medium (30d)' WHEN last_modified_time >= CURRENT_DATE - INTERVAL '90' DAY THEN 'Old (90d)' ELSE 'Very Old (>90d)' END AS age_category, COUNT(*) as file_count, ROUND(SUM(CAST(size AS BIGINT))/1024/1024, 2) as size_mb, ROUND(AVG(CAST(size AS BIGINT))/1024, 2) as avg_size_kb FROM DIRECTORY(VOLUME mcp_demo.data_volume) GROUP BY CASE WHEN last_modified_time >= CURRENT_DATE - INTERVAL '7' DAY THEN 'Recent (7d)' WHEN last_modified_time >= CURRENT_DATE - INTERVAL '30' DAY THEN 'Medium (30d)' WHEN last_modified_time >= CURRENT_DATE - INTERVAL '90' DAY THEN 'Old (90d)' ELSE 'Very Old (>90d)' END ORDER BY CASE age_category WHEN 'Recent (7d)' THEN 1 WHEN 'Medium (30d)' THEN 2 WHEN 'Old (90d)' THEN 3 ELSE 4 END; -- Identify files that need cleanup SELECT relative_path, ROUND(CAST(size AS BIGINT)/1024/1024, 2) as size_mb, last_modified_time, datediff(CURRENT_DATE, DATE(last_modified_time)) AS days_old FROM DIRECTORY(VOLUME mcp_demo.data_volume) WHERE last_modified_time < CURRENT_DATE - INTERVAL '90' DAY AND (relative_path LIKE '%temp%' OR relative_path LIKE '%tmp%') ORDER BY last_modified_time ASC;

Application Scenario: Using the DIRECTORY function, you can build automated storage reports, data aging analysis, and file lifecycle management systems to improve storage efficiency and reduce costs.


PIPE Pipeline Automation

Prerequisites and Environment Preparation

-- 1. Confirm Virtual Cluster status SHOW VCLUSTERS; -- 2. Start Virtual Cluster (if needed) ALTER VCLUSTER DEFAULT RESUME; -- 3. Confirm External Volume usage (only External Volume supports PIPE) SHOW VOLUMES; DESC VOLUME schema_name.volume_name;

Creating PIPE Pipeline

-- Create target table CREATE TABLE IF NOT EXISTS auto_import_target ( id INT, name STRING, data_content STRING, ingestion_time TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP(), source_file STRING -- Record source file ); -- Create basic PIPE (only supports External Volume) CREATE PIPE IF NOT EXISTS data_auto_import_pipe VIRTUAL_CLUSTER = 'DEFAULT' INGEST_MODE = 'LIST_PURGE' COMMENT 'External Volume auto data import pipeline' AS COPY INTO auto_import_target (id, name, data_content) FROM VOLUME mcp_demo.data_volume (id INT, name STRING, data_content STRING) USING CSV OPTIONS('header'='false') PURGE = true; -- Create optimized PIPE (recommended for production) CREATE PIPE optimized_auto_pipe VIRTUAL_CLUSTER = 'DEFAULT' BATCH_INTERVAL_IN_SECONDS = 60 -- Batch processing interval BATCH_SIZE_PER_KAFKA_PARTITION = 10000 -- Batch processing size INGEST_MODE = 'LIST_PURGE' COMMENT 'Production-grade auto import pipeline' AS COPY INTO auto_import_target (id, name, data_content) FROM VOLUME mcp_demo.data_volume (id INT, name STRING, data_content STRING) USING CSV OPTIONS('header'='true') PURGE = true; -- Clean up after processing

PIPE Management and Monitoring

-- View all PIPEs SHOW PIPES; -- View PIPE details DESC PIPE data_auto_import_pipe; -- Pause PIPE (correct syntax) ALTER PIPE data_auto_import_pipe SET PIPE_EXECUTION_PAUSED = true; -- Resume PIPE (correct syntax) ALTER PIPE data_auto_import_pipe SET PIPE_EXECUTION_PAUSED = false; -- Drop PIPE DROP PIPE IF EXISTS data_auto_import_pipe;

PIPE Usage Constraints and Notes

  1. Only External Volume Supported: User Volume and Table Volume do not support PIPE monitoring
  2. New File Monitoring: PIPE monitors new files only; existing files cannot be specified
  3. Virtual Cluster Dependency: An available Virtual Cluster is required for execution
  4. Permission Requirements: Appropriate permissions on Volume and target table are required
  5. Performance Considerations: Set batch parameters appropriately to avoid resource waste

Advanced Data Import and Export

Advanced Data Export

-- Basic export to User Volume COPY INTO USER VOLUME SUBDIRECTORY 'exports/2024/' FROM my_table FILE_FORMAT = (TYPE = CSV); -- Export to Table Volume (table-dedicated space) COPY INTO TABLE VOLUME my_table SUBDIRECTORY 'backups/' FROM my_table FILE_FORMAT = (TYPE = PARQUET); -- Export query results (complex queries) COPY INTO USER VOLUME SUBDIRECTORY 'reports/' FROM ( SELECT customer_id, customer_name, SUM(order_amount) as total_amount, COUNT(*) as order_count, MAX(order_date) as last_order_date FROM orders WHERE order_date >= '2024-01-01' GROUP BY customer_id, customer_name HAVING SUM(order_amount) > 10000 ) FILE_FORMAT = (TYPE = CSV); -- Export using Parquet format (recommended) COPY INTO USER VOLUME SUBDIRECTORY 'compressed_exports/' FROM my_table FILE_FORMAT = (TYPE = PARQUET);

Advanced Data Import

-- Import from User Volume COPY INTO target_table FROM USER VOLUME (id INT, name STRING, created_time TIMESTAMP_NTZ) USING CSV OPTIONS('header'='true') FILES ('exports/data.csv') PURGE = FALSE; -- Keep source file for error analysis -- Import from Table Volume (specify multiple files) COPY INTO target_table FROM TABLE VOLUME source_table (id INT, name STRING, backup_time TIMESTAMP_NTZ) USING PARQUET FILES ('backups/backup_20240529.parquet'); -- For a second file, import separately COPY INTO target_table FROM TABLE VOLUME source_table (id INT, name STRING, backup_time TIMESTAMP_NTZ) USING PARQUET FILES ('backups/backup_20240530.parquet'); -- Import from External Volume (using subdirectory) COPY INTO target_table FROM VOLUME mcp_demo.data_volume (id INT, name STRING, value DOUBLE, load_date DATE) USING CSV SUBDIRECTORY 'processed/' -- Process entire directory OPTIONS('header'='true'); -- JSON format import example COPY INTO json_target_table FROM VOLUME data_volume USING JSON SUBDIRECTORY 'json_data/';

Batch Data Processing Workflow

-- Complete ETL workflow example -- Step 1: Export raw data for processing COPY INTO USER VOLUME SUBDIRECTORY 'etl/raw/' FROM (SELECT * FROM raw_table WHERE status = 'pending' AND created_date >= CURRENT_DATE) FILE_FORMAT = (TYPE = CSV); -- Step 2: Table-level backup using Table Volume COPY INTO TABLE VOLUME important_table SUBDIRECTORY 'daily_backup/2024-05-29/' FROM important_table FILE_FORMAT = (TYPE = PARQUET); -- Step 3: Import after processing (processing typically done by external ETL tools) COPY INTO processed_table FROM USER VOLUME (id INT, name STRING, processed_value DOUBLE, processed_time TIMESTAMP_NTZ) USING CSV OPTIONS('header'='true') SUBDIRECTORY 'etl/processed/' PURGE = TRUE; -- Clean up temporary files after successful import -- Step 4: Validate data quality SELECT COUNT(*) as total_rows, COUNT(DISTINCT id) as unique_ids, MAX(processed_time) as latest_processed, MIN(processed_time) as earliest_processed FROM processed_table WHERE DATE(processed_time) = CURRENT_DATE;


Volume Small File Optimization

Analyze Volume Storage Usage

SELECT 'User Volume' as volume_type, COUNT(*) as file_count, SUM(CAST(size AS BIGINT)) / 1024 / 1024 as total_size_mb, AVG(CAST(size AS BIGINT)) / 1024 as avg_file_size_kb, MIN(CAST(size AS BIGINT)) as min_size, MAX(CAST(size AS BIGINT)) as max_size FROM (SHOW USER VOLUME DIRECTORY) UNION ALL SELECT 'External Volume' as volume_type, COUNT(*) as file_count, SUM(CAST(size AS BIGINT)) / 1024 / 1024 as total_size_mb, AVG(CAST(size AS BIGINT)) / 1024 as avg_file_size_kb, MIN(CAST(size AS BIGINT)) as min_size, MAX(CAST(size AS BIGINT)) as max_size FROM DIRECTORY(VOLUME mcp_demo.data_volume);

Identify Small File Issues

-- Identify files smaller than 1 MB SELECT relative_path, CAST(size AS BIGINT) as size_bytes FROM DIRECTORY(VOLUME mcp_demo.data_volume) WHERE CAST(size AS BIGINT) < 1048576 -- Less than 1 MB ORDER BY size_bytes ASC LIMIT 20;

Volume File Compaction Strategy

-- 1. Use export-based compaction -- Import from multiple small files, then export as a single large file COPY INTO consolidated_table FROM VOLUME mcp_demo.data_volume (id INT, name STRING, value DOUBLE) USING CSV SUBDIRECTORY 'small_files/'; COPY INTO VOLUME mcp_demo.data_volume SUBDIRECTORY 'consolidated/' FROM consolidated_table FILE_FORMAT = (TYPE = PARQUET); -- 2. Periodic archive compaction -- Merge old data into larger archive files COPY INTO archive_table FROM VOLUME mcp_demo.data_volume (id INT, name STRING, date_field DATE, value DOUBLE) USING CSV SUBDIRECTORY 'daily_data/' FILES ('day_20240101.csv', 'day_20240102.csv', 'day_20240103.csv'); COPY INTO VOLUME mcp_demo.data_volume SUBDIRECTORY 'archive/2024/01/' FROM archive_table FILE_FORMAT = (TYPE = PARQUET);

Volume Storage Optimization Advice

  1. File Size Control: Recommended single file size between 100 MB and 1 GB
  2. Compaction Strategy: Periodically merge similar small files into larger ones
  3. Tiered Storage: Separate frequently accessed files from archived files
  4. Format Conversion: Convert CSV and other formats to Parquet
  5. Regular Cleanup: Clean up temporary files and expired data

Permission Management and Security Control

Volume Permission System

Different Volume types have different permission management models:

-- User Volume permissions (user has management permissions by default) -- No additional authorization needed; users have full control over their own User Volume -- Table Volume permissions (tied to table permissions) -- Corresponding table permissions are required to operate Table Volume: -- SELECT permission -> SHOW/LIST/GET operations -- INSERT/UPDATE/DELETE permission -> PUT/REMOVE operations -- External Volume permissions (requires explicit authorization) GRANT READ VOLUME ON VOLUME schema_name.external_volume_name TO USER username; GRANT WRITE VOLUME ON VOLUME schema_name.external_volume_name TO USER username; GRANT ALL ON VOLUME schema_name.external_volume_name TO ROLE data_engineer;

Permission Management Best Practices

-- Create role-based permission management CREATE ROLE data_reader; CREATE ROLE data_writer; CREATE ROLE data_admin; -- Batch permission grants GRANT READ VOLUME ON VOLUME schema_name.shared_volume TO ROLE data_reader; GRANT READ VOLUME, WRITE VOLUME ON VOLUME schema_name.shared_volume TO ROLE data_writer; GRANT ALL ON VOLUME schema_name.shared_volume TO ROLE data_admin; -- User role assignment GRANT ROLE data_reader TO USER analyst_team; GRANT ROLE data_writer TO USER etl_team; GRANT ROLE data_admin TO USER admin_user; -- View Volume permissions SHOW GRANTS ON VOLUME schema_name.volume_name; -- View user permissions SHOW GRANTS TO USER username; -- Revoke permissions REVOKE WRITE VOLUME ON VOLUME schema_name.volume_name FROM USER username;

Security Configuration Advice

-- Presigned URL security settings SET cz.sql.function.get.presigned.url.force.external=true; -- Enable only when needed -- Periodic permission auditing SHOW GRANTS ON VOLUME schema_name.volume_name;


Cost Optimization Advice

Storage Cost Optimization

-- Storage space analysis SELECT volume_type, file_format, file_count, ROUND(total_size_gb, 2) as size_gb, ROUND(total_size_gb * 0.15, 2) as estimated_monthly_cost_usd -- Estimated cost FROM ( SELECT 'External Volume' as volume_type, CASE WHEN relative_path LIKE '%.parquet' THEN 'Parquet' WHEN relative_path LIKE '%.csv' THEN 'CSV' WHEN relative_path LIKE '%.json' THEN 'JSON' ELSE 'Other' END as file_format, COUNT(*) as file_count, SUM(CAST(size AS BIGINT)) / 1024 / 1024 / 1024 as total_size_gb FROM DIRECTORY(VOLUME mcp_demo.data_volume) GROUP BY CASE WHEN relative_path LIKE '%.parquet' THEN 'Parquet' WHEN relative_path LIKE '%.csv' THEN 'CSV' WHEN relative_path LIKE '%.json' THEN 'JSON' ELSE 'Other' END ) subq ORDER BY total_size_gb DESC;

Network Transfer Cost Optimization

  1. Same-Region Deployment: Ensure Lakehouse and Volume storage are in the same region
  2. Internal Network Transfer: Use internal network transfer within the same cloud provider and region to avoid public network fees
  3. Batch Operations: Batch imports and exports to reduce the number of network requests

Compute Resource Optimization

-- Choose the right compute cluster type -- Batch data processing: Use General Purpose Virtual Cluster -- Real-time queries: Use Analytics Purpose Virtual Cluster -- Check current cluster configuration SHOW VCLUSTERS; -- Optimization advice: -- 1. Use General Purpose cluster for data imports -- 2. Enable automatic small file compaction (General Purpose cluster only) -- 3. Set PIPE batch parameters appropriately

Cost Monitoring Advice

  1. Periodic Storage Audit: Analyze storage usage monthly
  2. File Lifecycle Management: Set data archiving and cleanup policies
  3. Network Transfer Monitoring: Monitor cross-region data transfer volumes

Frequently Asked Questions and Solutions

Volume Access Issues

Problem: Cannot access External Volume

-- Solution 1: Check permissions SHOW GRANTS TO USER your_username; -- Solution 2: Check Volume status DESC VOLUME schema_name.volume_name; -- Solution 3: Request appropriate permissions -- Administrator needs to execute: -- GRANT READ VOLUME ON VOLUME schema_name.volume_name TO USER username;

Problem: DIRECTORY function unavailable

-- Solution: Check Volume configuration DESC VOLUME schema_name.volume_name; -- Confirm directory_enabled: 'true' -- If not enabled, recreate Volume with: -- CREATE VOLUME ... DIRECTORY = (enable = TRUE);

Problem: PIPE creation fails

-- Solution 1: Check Virtual Cluster SHOW VCLUSTERS; ALTER VCLUSTER DEFAULT RESUME; -- Solution 2: Confirm External Volume usage -- User Volume and Table Volume do not support PIPE -- Solution 3: Check permissions -- Appropriate permissions on Volume and target table are required

Problem: PIPE not working or slow processing

-- Solution: Optimize PIPE parameters -- Note: Use correct syntax to pause and resume PIPE ALTER PIPE my_pipe SET PIPE_EXECUTION_PAUSED = true; -- Pause ALTER PIPE my_pipe SET PIPE_EXECUTION_PAUSED = false; -- Resume -- Check PIPE status DESC PIPE my_pipe;

Problem: LIST command query is slow

-- Solution 1: Use regular expressions to narrow scope LIST VOLUME mcp_demo.data_volume REGEXP = '.*2024.*\.csv'; -- Solution 2: Use SUBDIRECTORY to limit scope LIST VOLUME mcp_demo.data_volume SUBDIRECTORY 'recent_data/'; -- Solution 3: Limit results (using DIRECTORY function) SELECT relative_path, size FROM DIRECTORY(VOLUME mcp_demo.data_volume) WHERE relative_path LIKE 'current_month/%' LIMIT 100;

Problem: Poor query performance, too many small files

-- Solution 1: Periodically merge small files -- Use import/export approach to merge files -- Solution 2: Use appropriate file formats -- Prefer Parquet format for analytical data storage

Data Quality Issues

Problem: Import data format errors

-- Solution: Validate file format and table structure -- 1. First view file content LIST USER VOLUME REGEXP = '.*\.csv'; -- 2. Test import with a small sample COPY INTO test_target FROM USER VOLUME USING CSV OPTIONS('header'='true') FILES ('sample.csv');

Problem: Files accidentally deleted

-- Prevention: Use PURGE parameter cautiously COPY INTO target_table FROM VOLUME my_volume USING CSV PURGE = FALSE; -- Keep source files -- Recommendation: Use backup strategy for important data COPY INTO TABLE VOLUME backup_table SUBDIRECTORY 'daily_backup/' FROM source_table FILE_FORMAT = (TYPE = PARQUET);


Core Operations Quick Reference

Basic Commands

OperationCommand ExampleApplicable Volume Types
List filesLIST USER VOLUMEAll
Filter filesLIST VOLUME vol_name REGEXP = '.*\.csv'All
View subdirectoryLIST VOLUME vol_name SUBDIRECTORY 'path/'All
Delete fileREMOVE USER VOLUME FILE 'path/file.csv'All
Delete directoryREMOVE VOLUME vol_name SUBDIRECTORY 'path/'All
Generate URLSELECT GET_PRESIGNED_URL(USER VOLUME, 'file.csv', 3600)All

Import and Export

OperationCommand ExampleDescription
Export to VolumeCOPY INTO USER VOLUME FROM table_nameBasic export
Export query resultsCOPY INTO USER VOLUME FROM (SELECT...)Supports complex queries
Import to tableCOPY INTO table FROM USER VOLUMEBasic import
Clean up source filesCOPY INTO table FROM vol PURGE = TRUEAuto-delete after import

Permission Management

OperationCommand ExampleDescription
Grant read permissionGRANT READ VOLUME ON VOLUME vol TO USER uAllow file reading
Grant write permissionGRANT WRITE VOLUME ON VOLUME vol TO USER uAllow file writing
Grant all permissionsGRANT ALL ON VOLUME vol TO ROLE rAll operation permissions
View permissionsSHOW GRANTS ON VOLUME vol_nameView granted permissions

Advanced Features

OperationCommand ExamplePrerequisites
Metadata querySELECT * FROM DIRECTORY(VOLUME vol)directory_enabled=true
Create PIPECREATE PIPE pipe_name AS COPY INTO...External Volume used
Pause PIPEALTER PIPE pipe SET PIPE_EXECUTION_PAUSED = truePIPE already created
Resume PIPEALTER PIPE pipe SET PIPE_EXECUTION_PAUSED = falsePIPE already paused

Document Summary

This document provides a comprehensive guide to best practices for Singdata Lakehouse Volume features, covering optimization strategies and solutions across different usage scenarios from basic operations to advanced applications. Key benefits include:

  • Simplified Data Management: Unified file operation interface with a consistent experience across storage systems
  • Improved Storage Efficiency: File format optimization and small file handling strategies to significantly reduce storage costs
  • Automated Data Flows: Automatic data import via PIPE without manual intervention
  • Enhanced Data Security: Fine-grained permission control and security best practices
  • Performance Optimization: Query performance optimization and network transfer efficiency improvements
  • Cost Control: Cost optimization advice for storage, compute, and network resources

By following the recommendations in this document, you can build efficient, secure, and cost-effective data storage and processing systems while fully leveraging the powerful capabilities of Singdata Lakehouse.


Reference Materials


Note: This guide is based on the Singdata Lakehouse version tested in May 2025. Subsequent versions may introduce changes. Please check the official documentation regularly for the latest information.