Quick Analysis of Local CSV Files
Objective:
Through this document, you will be able to quickly analyze local CSV files using Lakehouse
Analyze CSV Format Files:
Data Preparation:
This article uses the Brazilian e-commerce public dataset as an example. In the local path /User/Downloads/brazil-ecommerce on the client, there are the following CSV files:
Using the Lakehouse JDBC client SQLLine (or DBeaver / Datagrip, etc.) to upload data to the Lakehouse User volume space (restriction: single file must be smaller than 5G):
(USER VOLUME is the file storage space that is automatically enabled for the current user in the current workspace, no need to create it in advance)
In the client tool, or in the Studio SQL task node, execute the following command to display the files that have been uploaded to the USER VOLUME:
Quickly Analyze Data with Lakehouse SQL
For example: Count the number of users in each state and sort in descending order by the number of users
Options parameters:
- sep: Column separator, default is ",". Supports a maximum length of 1 character, for example:
'sep'=',' - compression: Configures the file compression format. Supported compression formats are: gzip/zstd/zlib. For example:
'compression'='gzip' - header: Whether to parse the header, default is false. Boolean type, for example:
'header'='true'
In addition, the following parameters are also supported:
- timeZone: Configures the time zone, no default value. Used to specify the time zone for the time format in the file. For example:
'timeZone' = 'Asia/Shanghai' - escape: Used to escape quotes within quoted values, default value is "", for example:
'escape'='\' - lineSep: Line separator, default value is "\n". Supports a maximum length of 2 characters, for example:
'lineSep'='$' - quote: Sets a single character used to escape quote values. The default value is double quotes '"', for example:
'quote'='"'
