Description
Loads data from files in object storage into a table. The files must already exist in object storage before the import.
- Currently supported object storage locations are Tencent Cloud COS, Alibaba Cloud OSS, and AWS S3. Only Volume objects are currently supported as the source.
- Cross-cloud-vendor import is not supported at the moment, for example when your Lakehouse service runs on Alibaba Cloud but the object storage is on Tencent Cloud.
Syntax
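The syntax block is not reproduced in this copy of the page. The outline below is a rough sketch reconstructed from the parameters described under Instructions; the exact keywords, clause order, and the USING/OPTIONS spellings are assumptions and should be verified against the product reference.

```sql
COPY { INTO | OVERWRITE } <table_name>
FROM { USER VOLUME | TABLE VOLUME <table_name> | VOLUME <volume_name> }
  [ ( <column_name> <column_type> [, ...] ) ]   -- optional; automatic schema recognition is recommended
USING { csv | json | parquet | orc | bson }
[ OPTIONS ( 'key' = 'value' [, ...] ) ]         -- FileFormatParams, e.g. 'sep' = ',', 'header' = 'true'
[ FILES ( '<file>' [, ...] )
  | SUBDIRECTORY '<sub_path>'
  | REGEXP '<pattern>' ]
[ PURGE = TRUE ]
[ ON_ERROR = { CONTINUE | ABORT } ];
```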
Instructions
Within the COPY INTO statement, you can directly use the volume query syntax to transform data during the import process. Refer to volume query for structured and semi-structured data analysis.
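A minimal sketch of such a transformation, assuming a SELECT over the volume can serve as the source of COPY INTO; the table name, volume name, and column aliases are hypothetical:

```sql
-- Hypothetical objects: target table sales_target, external volume sales_vol.
-- The SELECT reads CSV files from the volume and transforms the
-- auto-recognized fields (f0, f1, ...) before they are written to the table.
COPY INTO sales_target
FROM (
  SELECT f0                          AS order_id,
         CAST(f1 AS DECIMAL(10, 2))  AS amount,   -- cast the string field to a numeric type
         UPPER(f2)                   AS region    -- simple transformation during load
  FROM VOLUME sales_vol
  USING csv
  OPTIONS ('sep' = ',', 'header' = 'false')
);
```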
Parameter Description:
- OVERWRITE|INTO:
  - INTO: Append mode. When the INTO clause is used for the import, new data is appended to the target table; existing data in the table is not deleted or modified.
  - OVERWRITE: Overwrite mode. When the OVERWRITE clause is used for the import, existing data in the target table is cleared first and the new data is then imported. This mode suits scenarios where new data must completely replace old data.
- column_name and column_type: Optional. Lakehouse supports automatic schema recognition for files, so it is recommended to leave them out. When specified, the column names and types must match the columns actually present in the file.
- Automatic schema recognition for CSV files generates field names automatically, starting from f0. The automatically recognized types are currently int, double, string, and bool.
- For Parquet and ORC formats, the field names and types stored in the file are recognized automatically. If the specified files have inconsistent columns, Lakehouse attempts to merge their schemas; if they cannot be merged, an error is reported.
- FileFormatParams: Multiple parameters are separated by commas, each in the string form 'key'='value'. See the combined example after this parameter list.
  - CSV format: supports the following file parameters
    - sep: Column separator, default is ",". Supports a maximum length of 2 characters, for example: 'sep'=',' or 'sep'='||'.
    - compression: Configures the file compression format. Supported compression formats are gzip/zstd/zlib. For example: 'compression'='gzip'.
    - lineSep: Line separator; by default both \r\n and \n are recognized as newline characters. Supports a maximum length of 2 characters. For example: 'lineSep'='$' uses $ as the line separator; 'lineSep'='\r\n' uses the Windows-style carriage return and line feed.
    - quote: Sets the single character used to quote values. The default value is the double quote ", for example: 'quote'='"'.
    - header: Whether to parse the header, default is false. Boolean type, for example: 'header'='true'.
    - timeZone: Configures the time zone, no default value. Used to specify the time zone for time values in the file. For example: 'timeZone'='Asia/Shanghai'.
    - escape: Used to escape quotes within quoted values, default value is \, for example: 'escape'='\'.
    - nullValue: Determines which content is treated as NULL, default value is "", for example: 'nullValue'='\\\N' or 'nullValue'=r'\N'.
  - JSON format:
    - compression: Whether the source/target file is compressed; the default is no compression. For example: 'compression'='gzip'.
    - explodeArray: Default value is true; the optional value is false. Used when the JSON content starts with an array. When true, the schema is taken to be the schema of a single element of the array; when false, the schema is taken to be the schema of the array itself.
  - Parquet, ORC, and BSON formats
- FILES: Specifies the exact files to be read. Multiple files of the same format can be specified. The file path is relative to the path specified when the Volume was created. For example: files('part-00002.snappy.parquet','part-00003.snappy.parquet').
- SUBDIRECTORY: Specifies the sub-path to be read. All files in this directory are loaded recursively. For example: subdirectory 'month=02'.
- REGEXP <pattern>: Regular expression matching. For example: regexp 'part-.*.parquet' matches files with a parquet suffix whose names start with part-. Another example: regexp 'NYC/YellowTaxiTripRecords/parquet/yellow_tripdata_2022.*.parquet' matches all files with a parquet suffix whose names start with yellow_tripdata_2022 under the sub-path NYC/YellowTaxiTripRecords/parquet.
- PURGE=TRUE: When this parameter is set, the source files in object storage are deleted after the data import succeeds. This helps save storage space, especially with large amounts of data. If the import fails, the source files are not deleted.
- ON_ERROR=CONTINUE|ABORT: Controls the error-handling strategy during data loading. Adding this parameter also returns the list of imported files.
  - CONTINUE: Skips erroneous rows and continues loading the remaining data. Use this when partial errors can be tolerated and you want to maximize the amount of data loaded. Currently, only file format mismatches (for example, specifying zip compression in the command while encountering zstd-compressed files) are ignored.
  - ABORT: Immediately terminates the entire COPY operation. Use this in scenarios with strict data quality requirements where any error requires manual inspection.
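As referenced in the FileFormatParams item above, a combined sketch of these parameters; the table, volume, and file names are hypothetical and the clause spellings follow the assumed outline in the Syntax section:

```sql
-- Hypothetical objects: target table events, external volume events_vol.
COPY INTO events
FROM VOLUME events_vol
USING csv
OPTIONS (
  'sep' = '|',                  -- column separator
  'header' = 'true',            -- first line is a header row
  'compression' = 'gzip',       -- source files are gzip-compressed
  'timeZone' = 'Asia/Shanghai', -- time zone used to interpret time values
  'nullValue' = r'\N'           -- treat \N as NULL
)
REGEXP 'events/2024-.*\.csv\.gz'    -- select files by regular expression
PURGE = TRUE                        -- delete source files after a successful load
ON_ERROR = CONTINUE;                -- skip tolerated errors and return the file list
```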
Notes
- It is recommended to use a General Purpose Virtual Cluster (GENERAL PURPOSE VIRTUAL CLUSTER) for data import, as general-purpose compute resources are better suited to running batch and data-loading jobs.
- It is recommended to keep the Lakehouse service and the object storage in the same region when importing data, to avoid public network transmission costs. Data transmission within the same region of the same cloud provider can go over the intranet.
Example
- Load data from user volume
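A minimal sketch, assuming the user volume is addressed with a USER VOLUME clause; the table and file names are hypothetical:

```sql
-- Load a CSV file previously uploaded to the current user's volume.
COPY INTO my_table
FROM USER VOLUME
USING csv
OPTIONS ('header' = 'true', 'sep' = ',')
FILES ('upload/my_data.csv');
```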
- Load data from table volume
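A minimal sketch, assuming a table volume is referenced as TABLE VOLUME <table_name>; all object names are hypothetical:

```sql
-- Load Parquet files stored in the table volume attached to staging_table.
COPY INTO my_table
FROM TABLE VOLUME staging_table
USING parquet
FILES ('part-00002.snappy.parquet', 'part-00003.snappy.parquet');
```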
Load Data from External Volume
Before using an external volume, you must create the VOLUME and CONNECTION first. For the creation process, refer to CONNECTION creation and VOLUME creation.
- Import data from OSS
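A minimal sketch, assuming an external volume named oss_volume has already been created over an Alibaba Cloud OSS bucket; the volume, table, and path names are hypothetical:

```sql
-- Recursively load all Parquet files under one sub-directory of the OSS-backed volume.
COPY INTO taxi_trips
FROM VOLUME oss_volume
USING parquet
SUBDIRECTORY 'NYC/YellowTaxiTripRecords/parquet';
```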
- Import data from COS
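A similar sketch for a Tencent Cloud COS-backed external volume (hypothetical names), this time selecting files with a regular expression:

```sql
COPY INTO taxi_trips
FROM VOLUME cos_volume
USING parquet
REGEXP 'NYC/YellowTaxiTripRecords/parquet/yellow_tripdata_2022.*.parquet';
```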
- Import data from S3
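A sketch for an AWS S3-backed external volume (hypothetical names), loading gzip-compressed CSV and cleaning up the source files on success:

```sql
COPY INTO sales
FROM VOLUME s3_volume
USING csv
OPTIONS ('sep' = ',', 'header' = 'true', 'compression' = 'gzip')
FILES ('exports/sales_2024.csv.gz')
PURGE = TRUE;
```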
- Import a CSV file, parse it with the json function, and load it into a json table
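A sketch of this pattern, assuming each CSV line holds one JSON document, the file is read through the volume query syntax as a single string field, and a JSON parsing function (here called parse_json, an assumed name) converts it during the load; all object names are hypothetical:

```sql
-- Hypothetical: events_json has a single column of json type.
COPY INTO events_json
FROM (
  SELECT parse_json(f0) AS payload   -- f0 is the auto-recognized first CSV field
  FROM VOLUME json_lines_vol
  USING csv
  -- a separator that does not occur in the data keeps each line in one field
  OPTIONS ('sep' = '\t', 'header' = 'false')
);
```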
- Import BSON File
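A minimal sketch for BSON, which has no extra file format parameters in the description above; the volume, table, and file names are hypothetical:

```sql
COPY INTO mongo_dump
FROM VOLUME bson_volume
USING bson
FILES ('dump/users.bson');
```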
