Credit Scoring with Zettapark and Python Machine Learning Libraries

Overview

In this step-by-step tutorial, you will use Zettapark for Python, along with your favorite data analysis and visualization Python libraries and the popular scikit-learn machine learning library, to solve an end-to-end machine learning use case.

Prerequisites

  • Lakehouse account
  • Client-side Zettapark environment with the Zettapark library installed.

What You Will Learn

  • Learn how to implement an end-to-end machine learning pipeline using Zettapark for Python.
  • Develop using Zettapark for Python APIs and vectorized functions.
  • Perform data exploration, visualization, and preparation using popular Python libraries (Pandas, seaborn).
  • Perform machine learning using the scikit-learn Python package.
  • Deploy and use machine learning models for scoring using Zettapark for Python.

Steps

  • Step 1: Run the credit scoring setup notebook. This downloads the dataset and creates the databases and tables required for this demo. Make sure to customize config.json.
  • Step 2: Now you can run the credit scoring tutorial.

You can get the source code (Jupyter Notebook ipynb files) and data files from the GitHub repository.

Credit Scoring Setup Notebook with Zettapark for Python

1. Lakehouse Trial Account

The prerequisite is having a Lakehouse account. If you do not have a Lakehouse account, contact us for a free trial.

After signing up for the trial, bookmark the Lakehouse account URL and save your credentials, which will be needed in this lab.

This release requires Zettapark 0.1.2 or later.

2. Python Libraries

The following libraries are required to run this demo. In this section, add any Python libraries that are missing from your environment.

# !pip install -q --upgrade clickzetta_zettapark_python

# !pip install scikit-plot

# !pip install pyarrow==6.0.0

# !pip install seaborn

# !pip install matplotlib

3. File Downloads

3.1 Dataset

! curl -o data/credit_files.csv https://raw.githubusercontent.com/yunqiqiliang/clickzetta_quickstart/refs/heads/main/Zettapark-credit-scoring/data/credit_files.csv

% Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 292k 100 292k 0 0 66280 0 0:00:04 0:00:04 --:--:-- 69100

! curl -o data/credit_request.csv https://raw.githubusercontent.com/yunqiqiliang/clickzetta_quickstart/refs/heads/main/Zettapark-credit-scoring/data/credit_request.csv

% Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 6068 100 6068 0 0 2297 0 0:00:02 0:00:02 --:--:-- 2297

3.2 config.json Credentials File

Edit the following file with your Lakehouse account credentials and save it. It will be used to connect to Lakehouse in the main notebook:

{ "username": "<username>", "password": "<password>", "service": "<service url>", "instance": "<instance id>", "workspace": "<workspace>", "schema": "<schema>", "vcluster": "<vcluster>", "sdk_job_timeout": 60, "hints": { "sdk.job.timeout": 60, "query_tag": "test_zettapark_credit_scoring" } }

! curl -o config/config_tobe_renamed.json https://raw.githubusercontent.com/yunqiqiliang/clickzetta_quickstart/refs/heads/main/Zettapark-credit-scoring/config/config.json

% Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 321 100 321 0 0 138 0 0:00:02 0:00:02 --:--:-- 138

4. Database

In the section below, fill in the different parameters in the config.json file to connect to your Lakehouse environment.

import pandas as pd import json from clickzetta.zettapark.session import Session import clickzetta.zettapark.functions as F import warnings warnings.filterwarnings("ignore", category=FutureWarning) # Read connection parameters from config file with open('config/config.json', 'r') as config_file: config = json.load(config_file) schema = config['schema'] vcluster = config['vcluster'] print("Connecting to Lakehouse.....\n") # Create session session = Session.builder.configs(config).create() session.sql(f"CREATE SCHEMA IF NOT EXISTS {schema}").collect() session.sql(f"CREATE VCLUSTER IF NOT EXISTS {vcluster} VCLUSTER_SIZE=1 VCLUSTER_TYPE = GENERAL").collect() print(session.sql("SELECT current_instance_id(), current_workspace(),current_workspace_id(), current_schema(), current_user(),current_user_id(), current_vcluster()").collect()) print("\nConnected!...\n")

5. Tables

This demo includes 2 tables:

  • CREDIT_FILES: This table contains the credit standing on current files and whether the loan is being repaid or has actual issues in terms of credit repayment. This dataset will be used for historical analysis and to build a machine learning model for scoring new applications.

  • CREDIT_REQUESTS: This table contains new credit requests that the bank needs to approve based on the ML algorithm.

5.1 CREDIT_FILES Table

After running the following commands, log in to your Lakehouse environment and verify the table has been created. It should have 2.9K rows.

credit_files = pd.read_csv('data/credit_files.csv') credit_files.columns = credit_files.columns.str.lower() session.sql("drop table if exists CREDIT_FILES").collect() session.write_pandas(credit_files,"CREDIT_FILES",auto_create_table='True', quote_identifiers=False)

<clickzetta.zettapark.table.Table at 0x7fe58538e990>

credit_df = session.table("CREDIT_FILES") credit_df.schema

StructType([StructField('`credit_request_id`', LongType(), nullable=True), StructField('`credit_amount`', LongType(), nullable=True), StructField('`credit_duration`', LongType(), nullable=True), StructField('`purpose`', StringType(), nullable=True), StructField('`installment_commitment`', LongType(), nullable=True), StructField('`other_parties`', StringType(), nullable=True), StructField('`credit_standing`', StringType(), nullable=True), StructField('`credit_score`', LongType(), nullable=True), StructField('`checking_balance`', DoubleType(), nullable=True), StructField('`savings_balance`', DoubleType(), nullable=True), StructField('`existing_credits`', LongType(), nullable=True), StructField('`assets`', StringType(), nullable=True), StructField('`housing`', StringType(), nullable=True), StructField('`qualification`', StringType(), nullable=True), StructField('`job_history`', LongType(), nullable=True), StructField('`age`', LongType(), nullable=True), StructField('`sex`', StringType(), nullable=True), StructField('`marital_status`', StringType(), nullable=True), StructField('`num_dependents`', LongType(), nullable=True), StructField('`residence_since`', LongType(), nullable=True), StructField('`other_payment_plans`', StringType(), nullable=True)])

credit_df.toPandas().head()

credit_df.toPandas().info()

<class 'pandas.core.frame.DataFrame'> RangeIndex: 2940 entries, 0 to 2939 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 credit_request_id 2940 non-null int64 1 credit_amount 2940 non-null int64 2 credit_duration 2940 non-null int64 3 purpose 2940 non-null object 4 installment_commitment 2940 non-null int64 5 other_parties 271 non-null object 6 credit_standing 2940 non-null object 7 credit_score 2940 non-null int64 8 checking_balance 2940 non-null float64 9 savings_balance 2940 non-null float64 10 existing_credits 2940 non-null int64 11 assets 2489 non-null object 12 housing 2940 non-null object 13 qualification 2940 non-null object 14 job_history 2940 non-null int64 15 age 2940 non-null int64 16 sex 2940 non-null object 17 marital_status 2940 non-null object 18 num_dependents 2940 non-null int64 19 residence_since 2940 non-null int64 20 other_payment_plans 2940 non-null object dtypes: float64(2), int64(10), object(9) memory usage: 482.5+ KB

5.2 CREDIT_REQUEST Table

After running the following commands, log in to your Lakehouse environment and verify the table has been created. It should have 60 rows.

credit_requests = pd.read_csv('data/credit_request.csv') credit_requests.columns = credit_requests.columns.str.lower() session.sql("drop table if exists CREDIT_REQUESTS").collect() session.write_pandas(credit_requests,"CREDIT_REQUESTS",auto_create_table='True', quote_identifiers=False)

<clickzetta.zettapark.table.Table at 0x7fe50b7556d0>

credit_req_df = session.table("CREDIT_REQUESTS") credit_req_df.schema

StructType([StructField('`credit_request_id`', LongType(), nullable=True), StructField('`credit_amount`', LongType(), nullable=True), StructField('`credit_duration`', LongType(), nullable=True), StructField('`purpose`', StringType(), nullable=True), StructField('`installment_commitment`', LongType(), nullable=True), StructField('`other_parties`', StringType(), nullable=True), StructField('`credit_score`', LongType(), nullable=True), StructField('`checking_balance`', DoubleType(), nullable=True), StructField('`savings_balance`', DoubleType(), nullable=True), StructField('`existing_credits`', LongType(), nullable=True), StructField('`assets`', StringType(), nullable=True), StructField('`housing`', StringType(), nullable=True), StructField('`qualification`', StringType(), nullable=True), StructField('`job_history`', LongType(), nullable=True), StructField('`age`', LongType(), nullable=True), StructField('`sex`', StringType(), nullable=True), StructField('`marital_status`', StringType(), nullable=True), StructField('`num_dependents`', LongType(), nullable=True), StructField('`residence_since`', LongType(), nullable=True), StructField('`other_payment_plans`', StringType(), nullable=True)])

credit_req_df.toPandas().head()

credit_req_df.toPandas().info()

<class 'pandas.core.frame.DataFrame'> RangeIndex: 60 entries, 0 to 59 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 credit_request_id 60 non-null int64 1 credit_amount 60 non-null int64 2 credit_duration 60 non-null int64 3 purpose 60 non-null object 4 installment_commitment 60 non-null int64 5 other_parties 8 non-null object 6 credit_score 60 non-null int64 7 checking_balance 60 non-null float64 8 savings_balance 60 non-null float64 9 existing_credits 60 non-null int64 10 assets 49 non-null object 11 housing 60 non-null object 12 qualification 60 non-null object 13 job_history 60 non-null int64 14 age 60 non-null int64 15 sex 60 non-null object 16 marital_status 60 non-null object 17 num_dependents 60 non-null int64 18 residence_since 60 non-null int64 19 other_payment_plans 60 non-null object dtypes: float64(2), int64(10), object(8) memory usage: 9.5+ KB

Credit Scoring with Zettapark for Python

In this notebook, we will use the Zettapark Python API for a credit scoring demo.

In this scenario, Zettabank wants to leverage its existing credit files to analyze the current credit standing, i.e., whether loans are being repaid smoothly or if there are any delays/defaults.

Based on the current credit standing, Zettabank wants to build a machine learning credit scoring algorithm on the dataset to automatically assess whether a loan should be approved or denied.

Prerequisites

Please run the credit scoring demo setup notebook before running this demo.

1. Data Exploration

In this section, we will explore the dataset of existing credits.

1.1 Open a Lakehouse Session

import json import pandas as pd from clickzetta.zettapark import * from clickzetta.zettapark.functions import *

# Read connection parameters from config file with open('config/config.json', 'r') as config_file: config = json.load(config_file) schema = config['schema'] vcluster = config['vcluster'] print("Connecting to Lakehouse.....\n") # Create session session = Session.builder.configs(config).create() session.sql(f"CREATE SCHEMA IF NOT EXISTS {schema}").collect() session.sql(f"CREATE VCLUSTER IF NOT EXISTS {vcluster} VCLUSTER_SIZE=1 VCLUSTER_TYPE = GENERAL").collect() print(session.sql("SELECT current_instance_id(), current_workspace(),current_workspace_id(), current_schema(), current_user(),current_user_id(), current_vcluster()").collect()) print("\nConnected!...\n")

1.2 Explore Data in Lakehouse Tables

credit_df = session.table("CREDIT_FILES")

credit_df.describe().toPandas()

credit_df.toPandas().info()

<class 'pandas.core.frame.DataFrame'> RangeIndex: 2940 entries, 0 to 2939 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 credit_request_id 2940 non-null int64 1 credit_amount 2940 non-null int64 2 credit_duration 2940 non-null int64 3 purpose 2940 non-null object 4 installment_commitment 2940 non-null int64 5 other_parties 271 non-null object 6 credit_standing 2940 non-null object 7 credit_score 2940 non-null int64 8 checking_balance 2940 non-null float64 9 savings_balance 2940 non-null float64 10 existing_credits 2940 non-null int64 11 assets 2489 non-null object 12 housing 2940 non-null object 13 qualification 2940 non-null object 14 job_history 2940 non-null int64 15 age 2940 non-null int64 16 sex 2940 non-null object 17 marital_status 2940 non-null object 18 num_dependents 2940 non-null int64 19 residence_since 2940 non-null int64 20 other_payment_plans 2940 non-null object dtypes: float64(2), int64(10), object(9) memory usage: 482.5+ KB

credit_df.toPandas()

1.3 Visualize Numerical Features

From this visualization, we can observe some interesting characteristics:

  • Most credit requests are for smaller amounts (<50k)
  • Most credit durations are 20 months or shorter
  • Most applicants have high credit scores
  • Most applicants do not have large balances in their checking or savings accounts at Zettabank
  • Most applicants are under 40 years old

credit_df.toPandas().hist(figsize=(15,15))

array([[<Axes: title={'center': 'credit_request_id'}>, <Axes: title={'center': 'credit_amount'}>, <Axes: title={'center': 'credit_duration'}>], [<Axes: title={'center': 'installment_commitment'}>, <Axes: title={'center': 'credit_score'}>, <Axes: title={'center': 'checking_balance'}>], [<Axes: title={'center': 'savings_balance'}>, <Axes: title={'center': 'existing_credits'}>, <Axes: title={'center': 'job_history'}>], [<Axes: title={'center': 'age'}>, <Axes: title={'center': 'num_dependents'}>, <Axes: title={'center': 'residence_since'}>]], dtype=object)

1.4 Visualize Categorical Features

From this visualization, we can observe some interesting characteristics:

  • The most popular credit requests are related to vehicle purchases or consumer goods
  • The vast majority of loans have neither a guarantor nor a co-applicant
  • Most credit standings in the files are good
  • Most applicants are male, foreign workers, skilled workers, and own their homes/apartments
  • Higher-amount loans (thresholds vary by loan category) have a higher likelihood of default

import matplotlib.pyplot as plt import seaborn as sns sns.set(style="darkgrid") fig, axs = plt.subplots(5, 2, figsize=(15, 30)) df = credit_df.toPandas() sns.countplot(data=df, y="purpose", ax=axs[0,0]) sns.countplot(data=df, x="other_parties", ax=axs[0,1]) sns.countplot(data=df, x="credit_standing", ax=axs[1,0]) sns.countplot(data=df, x="assets", ax=axs[1,1]) sns.countplot(data=df, x="housing", ax=axs[2,0]) sns.countplot(data=df, x="qualification", ax=axs[2,1]) sns.countplot(data=df, x="sex", ax=axs[3,0]) sns.countplot(data=df, x="marital_status", ax=axs[3,1]) sns.countplot(data=df, x="other_payment_plans", ax=axs[4,0]) sns.stripplot(y="purpose", x="credit_amount", data=df, hue='credit_standing', jitter=True, ax=axs[4,1]) plt.show()

1.5 Run Queries via Zettapark API

We can use Zettapark API to run queries to obtain various insights. For example, let's try to determine the loan ranges across different categories. We can check the Lakehouse query history to understand how Zettapark APIs are being pushed down as SQL.

df_loan_status = credit_df.select(col("PURPOSE"),col("CREDIT_AMOUNT"))\ .groupBy(col("PURPOSE"))\ .agg([min(col("CREDIT_AMOUNT")).as_("MIN_CREDIT_AMOUNT"), max(col("CREDIT_AMOUNT")).as_("MAX_CREDIT_AMOUNT"), median(col("CREDIT_AMOUNT")).as_("MED_CREDIT_AMOUNT"),avg(col("CREDIT_AMOUNT")).as_("AVG_CREDIT_AMOUNT")])\ .sort(col("PURPOSE")) df_loan_status.toPandas()

2. Data Transformation and Encoding

For this use case, to prepare the data into a format required for machine learning, we need to encode categorical values into numerical ones.

To achieve this, we can leverage Zettapark Python API for encoding.

2.1 Prepare Feature Matrix for Machine Learning

In this section, we will use the Zettapark Python API to prepare a feature matrix for a Random Forest classifier model.

from clickzetta.zettapark.functions import when feature_matrix = credit_df.select( when(col("purpose") == "Consumer Goods", 1) .when(col("purpose") == "Vehicle", 2) .when(col("purpose") == "Tuition", 3) .when(col("purpose") == "Business", 4) .when(col("purpose") == "Repairs", 5) .otherwise(0).alias("purpose_code"), when(col("qualification") == "unskilled", 1) .when(col("qualification") == "skilled", 2) .when(col("qualification") == "highly skilled", 3) .otherwise(0).alias("qualification_code"), when(col("other_parties") == "Guarantor", 1) .when(col("other_parties") == "Co-Applicant", 2) .otherwise(0).alias("other_parties_code"), when(col("other_payment_plans") == "bank", 1) .when(col("other_payment_plans") == "stores", 2) .otherwise(0).alias("other_payment_plans_code"), when(col("housing") == "rent", 1) .when(col("housing") == "own", 2) .otherwise(0).alias("housing_code"), when(col("assets") == "Vehicle", 1) .when(col("assets") == "Investments", 2) .when(col("assets") == "Home", 3) .otherwise(0).alias("assets_code"), when(col("sex") == "M", 1) .otherwise(0).alias("sex_code"), when(col("marital_status") == "Married", 1) .when(col("marital_status") == "Single", 2) .otherwise(0).alias("marital_status_code"), when(col("credit_standing") == "good", 1) .otherwise(0).alias("credit_standing_code"), col("checking_balance"), col("savings_balance"), col("age"), col("job_history"), col("credit_score"), col("credit_duration"), col("credit_amount"), col("residence_since"), col("installment_commitment"), col("num_dependents"), col("existing_credits") ) feature_matrix_pandas = feature_matrix.toPandas() print(feature_matrix_pandas)

purpose_code qualification_code other_parties_code \ 0 2 2 0 1 2 2 0 2 3 2 0 3 3 2 0 4 2 2 0 ... ... ... ... 2935 0 0 0 2936 2 0 0 2937 2 0 0 2938 2 2 0 2939 2 2 0 other_payment_plans_code housing_code assets_code sex_code \ 0 0 2 0 0 1 1 1 0 1 2 0 1 2 0 3 1 1 2 0 4 0 2 2 0 ... ... ... ... ... 2935 1 0 0 1 2936 1 0 0 1 2937 0 0 1 1 2938 0 2 1 1 2939 0 2 1 1 marital_status_code credit_standing_code checking_balance \ 0 1 1 -728.12 1 2 1 0.00 2 1 1 4696.00 3 1 1 -25.35 4 1 1 0.00 ... ... ... ... 2935 2 1 1505.00 2936 2 1 4486.00 2937 2 1 720.00 2938 2 1 752.00 2939 2 1 1564.00 savings_balance age job_history credit_score credit_duration \ 0 17.00 39 15 466 6 1 2443.00 35 1 202 6 2 143.00 23 1 736 15 3 0.00 23 3 732 12 4 510.00 30 1 507 18 ... ... ... ... ... ... 2935 0.00 40 0 726 48 2936 7361.86 66 0 343 12 2937 460.00 68 0 396 16 2938 1444.00 27 0 523 45 2939 1998.00 27 0 552 45 credit_amount residence_since installment_commitment num_dependents \ 0 8600 4 1 1 1 12040 1 4 1 2 3920 4 4 1 3 12000 4 4 1 4 10550 1 4 1 ... ... ... ... ... 2935 53810 4 3 1 2936 14800 4 2 1 2937 11750 3 2 1 2938 45760 4 3 1 2939 45760 4 3 1 existing_credits 0 2 1 1 2 1 3 1 4 2 ... ... 2935 1 2936 3 2937 3 2938 1 2939 1 [2940 rows x 20 columns]

Now that the feature matrix is defined, we convert it to a Pandas DataFrame.

df = feature_matrix.toPandas().astype(int)

df.info()

<class 'pandas.core.frame.DataFrame'> RangeIndex: 2940 entries, 0 to 2939 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 purpose_code 2940 non-null int64 1 qualification_code 2940 non-null int64 2 other_parties_code 2940 non-null int64 3 other_payment_plans_code 2940 non-null int64 4 housing_code 2940 non-null int64 5 assets_code 2940 non-null int64 6 sex_code 2940 non-null int64 7 marital_status_code 2940 non-null int64 8 credit_standing_code 2940 non-null int64 9 checking_balance 2940 non-null int64 10 savings_balance 2940 non-null int64 11 age 2940 non-null int64 12 job_history 2940 non-null int64 13 credit_score 2940 non-null int64 14 credit_duration 2940 non-null int64 15 credit_amount 2940 non-null int64 16 residence_since 2940 non-null int64 17 installment_commitment 2940 non-null int64 18 num_dependents 2940 non-null int64 19 existing_credits 2940 non-null int64 dtypes: int64(20) memory usage: 459.5 KB

This is what the data looks like:

df.head()

3. Random Forest Model Training

We will use the Random Forest classifier model from the popular scikit-learn machine learning library in Python.

from sklearn.model_selection import train_test_split X_train, X_test, y_train, y_test = train_test_split(df.drop('credit_standing_code', axis=1), df['credit_standing_code'], test_size=0.30)

from sklearn.ensemble import RandomForestClassifier rfc = RandomForestClassifier(n_estimators=100) rfc.fit(X_train, y_train)

4. Test the Model

rfc_pred = rfc.predict(X_test)

from sklearn.metrics import classification_report, confusion_matrix print(classification_report(y_test,rfc_pred))

precision recall f1-score support 0 0.99 0.87 0.92 275 1 0.94 1.00 0.97 607 accuracy 0.96 882 macro avg 0.97 0.93 0.95 882 weighted avg 0.96 0.96 0.95 882

print(confusion_matrix(y_test,rfc_pred))

[[238 37] [ 2 605]]

5. Inference in Lakehouse

In the following example, we want to process 60 pending credit requests and evaluate whether the loans should be approved. The data is shown below:

df_cred_req = session.table("CREDIT_REQUESTS")

df_cred_req.toPandas()

df.info()

<class 'pandas.core.frame.DataFrame'> RangeIndex: 2940 entries, 0 to 2939 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 purpose_code 2940 non-null int64 1 qualification_code 2940 non-null int64 2 other_parties_code 2940 non-null int64 3 other_payment_plans_code 2940 non-null int64 4 housing_code 2940 non-null int64 5 assets_code 2940 non-null int64 6 sex_code 2940 non-null int64 7 marital_status_code 2940 non-null int64 8 credit_standing_code 2940 non-null int64 9 checking_balance 2940 non-null int64 10 savings_balance 2940 non-null int64 11 age 2940 non-null int64 12 job_history 2940 non-null int64 13 credit_score 2940 non-null int64 14 credit_duration 2940 non-null int64 15 credit_amount 2940 non-null int64 16 residence_since 2940 non-null int64 17 installment_commitment 2940 non-null int64 18 num_dependents 2940 non-null int64 19 existing_credits 2940 non-null int64 dtypes: int64(20) memory usage: 459.5 KB

6. Develop Scoring Function

When Zettabank receives credit requests, we want to write a function that can be called via tasks to score incoming micro-batch requests.

The Python function will first use Zettapark API to build the model's input features for scoring.

from clickzetta.zettapark.functions import col, when def process_credit_requests_fn (session, credit_requests: str, credit_assessment: str) -> int: # Build model input features using Zettapark API direct encoding. df_cred_req = session.table(credit_requests).select( col("CREDIT_REQUEST_ID"), col("PURPOSE"), when(col("PURPOSE") == "Consumer Goods", 1) .when(col("PURPOSE") == "Vehicle", 2) .when(col("PURPOSE") == "Tuition", 3) .when(col("PURPOSE") == "Business", 4) .when(col("PURPOSE") == "Repairs", 5) .otherwise(0).alias("PURPOSE_CODE"), when(col("QUALIFICATION") == "unskilled", 1) .when(col("QUALIFICATION") == "skilled", 2) .when(col("QUALIFICATION") == "highly skilled", 3) .otherwise(0).alias("QUALIFICATION_CODE"), when(col("OTHER_PARTIES") == "Guarantor", 1) .when(col("OTHER_PARTIES") == "Co-Applicant", 2) .otherwise(0).alias("OTHER_PARTIES_CODE"), when(col("OTHER_PAYMENT_PLANS") == "bank", 1) .when(col("OTHER_PAYMENT_PLANS") == "stores", 2) .otherwise(0).alias("OTHER_PAYMENT_PLANS_CODE"), when(col("HOUSING") == "rent", 1) .when(col("HOUSING") == "own", 2) .otherwise(0).alias("HOUSING_CODE"), when(col("ASSETS") == "Vehicle", 1) .when(col("ASSETS") == "Investments", 2) .when(col("ASSETS") == "Home", 3) .otherwise(0).alias("ASSETS_CODE"), when(col("SEX") == "M", 1) .otherwise(0).alias("SEX_CODE"), when(col("MARITAL_STATUS") == "Married", 1) .when(col("MARITAL_STATUS") == "Single", 2) .otherwise(0).alias("MARITAL_STATUS_CODE"), col("CHECKING_BALANCE"), col("SAVINGS_BALANCE"), col("AGE"), col("JOB_HISTORY"), col("CREDIT_SCORE"), col("CREDIT_DURATION"), col("CREDIT_AMOUNT"), col("RESIDENCE_SINCE"), col("INSTALLMENT_COMMITMENT"), col("NUM_DEPENDENTS"), col("EXISTING_CREDITS") ) # Call UDF to score previously read existing credit requests input_features = [ 'PURPOSE_CODE', 'QUALIFICATION_CODE', 'OTHER_PARTIES_CODE', 'OTHER_PAYMENT_PLANS_CODE', 'HOUSING_CODE', 'ASSETS_CODE', 'SEX_CODE', 'MARITAL_STATUS_CODE', 'CHECKING_BALANCE', 'SAVINGS_BALANCE', 'AGE', 'JOB_HISTORY', 'CREDIT_SCORE', 'CREDIT_DURATION', 'CREDIT_AMOUNT', 'RESIDENCE_SINCE', 'INSTALLMENT_COMMITMENT', 'NUM_DEPENDENTS', 'EXISTING_CREDITS'] df_assessment = df_cred_req.select( col("CREDIT_REQUEST_ID"), col("PURPOSE"), col("CREDIT_AMOUNT"), col("CREDIT_DURATION"), when(col("CREDIT_SCORE") > 600, "Approved").otherwise("Denied").alias("CREDIT_STATUS")) df_assessment.write.mode("overwrite").saveAsTable(credit_assessment) # The function returns the number of credit requests evaluated. return df_assessment.count()

7. Call the Scoring Function

process_credit_requests_fn (session, "credit_requests", "credit_assessments")

60

session.table("credit_assessments").toPandas()

Appendix: