Using Data Sources in Python/Shell Tasks
Overview
Python/Shell tasks support the use of pre-configured data sources. Through the built-in clickzetta-dbutils package in the runtime environment, you can reuse connection configurations from Management → Data Sources directly in your tasks — no need to hard-code connection details in your code. This improves the security of sensitive credentials and simplifies both development and management.
Currently supported data source types:
- Lakehouse data source
- MySQL data source
- PostgreSQL data source
UI Guide
Selecting a Data Source in a Python/Shell Task
In the configuration panel of a Python/Shell task, you can select one or more data sources to use (make sure the data source has already been created and connection-tested under Management → Data Sources). This configuration applies to both direct runs and scheduled runs of the task:
Note: The default Lakehouse data source for the current workspace is accessible in code without needing to be added here.
Accessing a Data Source in Code
After adding a data source, you can start writing your Python/Shell task code. Call the get_active_engine("your_datasource_name") function from the clickzetta_dbutils library — just provide the data source name, with no need to specify the URL, password, or other connection details. You can also use the Builder pattern; see the API guide and code examples below.
API Reference
get_active_engine
A convenience function for creating a database engine in a Studio Python node. Currently supports MySQL, PostgreSQL, and Lakehouse data sources.
Function Signature
Parameters
ds_name(str): Data source name. Required. Must match the name of the data source in Management → Data Sources.vcluster(str, optional): Virtual cluster name for Singdata data sources. Required for Singdata data sources.workspace(str, optional): Workspace name. Defaults to the current workspace.schema(str, optional): Schema name to connect to. Defaults to'public'.options(dict, optional): Additional connection options.query(dict, optional): Additional query parameters for the SQLAlchemy URL.
Return Value
- A SQLAlchemy Engine instance.
Examples
- A Postgres data source named
"qiliang_test_pg"has been added under Management → Data Sources. - In the current Python node,
"qiliang_test_pg"has been added as a data source, with the selected database"answer"and schemapublic. - Access tables in
qiliang_test_pg → answer → publicdirectly viaget_active_engine:
Connect and run a query:
- Access a different database within
qiliang_test_pgby passing parameters (requires that the data source in Management → Data Sources is configured to allow access to other authorized databases — for example, the PostgreSQL database"sample"with schema"public"):
Connect and run a query:
get_active_lakehouse_engine
A convenience function for creating a database engine for a Lakehouse data source.
Function Signature
Parameters
vcluster(str, optional): Virtual cluster name for the Singdata data source. Required.workspace(str, optional): Workspace name. Defaults to the current workspace.schema(str, optional): Schema name to connect to. Defaults to'public'.options(dict, optional): Additional connection options.query(dict, optional): Additional query parameters for the SQLAlchemy URL.driver(str, optional): Driver name for the connection.
Return Value
- A SQLAlchemy Engine instance.
Exceptions
DatabaseConnectionError: Raised when no Lakehouse data source is found in the configuration.
Example
- The cluster to use is named
"default"(visible under Compute → Clusters). - The target data is in workspace
"ql_ws", schema"brazilianecommerce", table"olist_customers"(visible under Development → Data). - Access the table directly via
get_active_lakehouse_engine:
Connect and run a query:
DatabaseConnectionManager
A database connection manager that supports chained configuration of connection parameters. The actual SQLAlchemy connection is only established when build(self, *args, **kwargs) is called.
use_workspace
Sets the workspace for the connection. Only required for Lakehouse data sources.
use_schema
Sets the schema for the connection.
Note: Due to SQLAlchemy's design, for PostgreSQL, `` should be set to the database name.
use_vcluster
Sets the virtual cluster for the connection. Only required for Lakehouse data sources.
use_options
Sets additional connection options.
Note: Due to SQLAlchemy's design, the PostgreSQL schema should be set to undefined"})
use_query
Sets query parameters for the connection.
build
Creates a SQLAlchemy engine based on the data source name and optional configuration.
Usage Example
Chained call example:
Lakehouse connection example:
Code Examples
Using a PostgreSQL Data Source in a Python Node
This example retrieves all pg tables from the postgres_source_name data source:
Using the default schema:
Specifying a database and schema via options:
Using a MySQL Data Source in a Python Node
View all available data source configurations:
Create a connection with a specified schema:
Using a Lakehouse Data Source in a Python Node
Option 1: Using get_active_engine:
Option 2: Using get_active_lakehouse_engine:
Using a Data Source in a Shell Node
In a Shell node, you can use data sources by creating a Python script file:
Notes
- Data source configuration works in both ad-hoc execution and scheduled execution scenarios.
- Using a data source name that does not exist will cause an error. Make sure to select the corresponding data source under Development → Python Task → Data Sources before use. PostgreSQL and MySQL data sources must be created and connection-tested under Management → Data Sources.
- When using a Lakehouse data source, the
vclusterparameter is required. The Lakehouse data source uses the built-in Lakehouse data source visible under Management → Data Sources. - Data source connection details are handled securely to prevent plaintext password exposure.
- Multiple data sources of different types can be used within the same node.
