External AI Function: Create Embedding Function

Objective: Use the Alibaba Cloud Bailian platform's Embedding function to vectorize text and image file data, enabling image-to-image search scenarios. The effect is shown below:

Step 1: Prepare the Development Environment

  1. Install Docker. Make sure Docker is installed locally (https://www.docker.com/).

  2. Pull the Docker image. Run the following in a local command-line terminal (e.g., macOS Terminal):

    [Local]# docker pull quay.io/pypa/manylinux2014_x86_64:2022-10-25-fbea779

  3. Start the Docker container: This container is based on the manylinux2014_x86_64 image and configured to use Python 3.10.

    [Local]# docker run -it --name cz_func --env PATH="/opt/python/cp310-cp310/bin:$PATH" quay.io/pypa/manylinux2014_x86_64:2022-10-25-fbea779 bash

  4. Create the folder embeddings under the /root directory, then create an empty gen_embeddings.py inside it:

    [root@docker root]# cd /root ; mkdir embeddings
    [root@docker root]# cd embeddings
    [root@docker embeddings]# touch gen_embeddings.py

  5. The program code in gen_embeddings.py is as follows:

    import os
    from cz.udf import annotate
    from openai import OpenAI
    import json

    @annotate("*->string")
    class get_embeddings(object):
        def evaluate(self, model_type, input_string, api_key, model_name, dim=None):
            if model_type == "text":
                # Initialize the OpenAI-compatible client with the API key provided by the user
                client = OpenAI(
                    api_key=api_key,
                    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1"
                )
                request = {
                    "model": model_name,  # Text model name provided by the user
                    "input": input_string,
                    "encoding_format": "float",
                }
                # Only send dimensions when the caller supplied one; int(None) would raise
                if dim is not None:
                    request["dimensions"] = int(dim)
                completion = client.embeddings.create(**request)
                result_json = json.loads(completion.model_dump_json())
                embedding_vector = result_json['data'][0]['embedding']
            elif model_type == "multimodal":
                import dashscope
                dashscope.api_key = api_key  # Use the API key provided by the user
                image_input = [{'image': input_string}]
                resp = dashscope.MultiModalEmbedding.call(
                    model=model_name,  # Multimodal model name provided by the user
                    input=image_input
                )
                result_json = json.loads(json.dumps(resp.output, ensure_ascii=False, indent=4))
                embedding_vector = result_json['embeddings'][0]['embedding']
            else:
                return "Not Valid Model Type"

            # The vector is returned as its string representation
            if len(embedding_vector) >= 1:
                return str(embedding_vector)
            else:
                return "Not Valid"

Add command-line entry point:

    if __name__ == "__main__":
        import argparse

        parser = argparse.ArgumentParser(description="Get Embeddings using OpenAI or DashScope")
        parser.add_argument('--model_type', required=True, help='Model type: text or multimodal')
        parser.add_argument('--input_string', required=True, help='The input string or image path')
        parser.add_argument('--api_key', required=True, help='Your API key')
        parser.add_argument('--model_name', required=True, help='Model name')
        parser.add_argument('--dim', default=1536, help='Vector dimensions (only for text models)')
        args = parser.parse_args()

        embedder = get_embeddings()
        result = embedder.evaluate(
            model_type=args.model_type,
            input_string=args.input_string,
            api_key=args.api_key,
            model_name=args.model_name,
            dim=args.dim
        )
        print(result)
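Because evaluate returns the vector as a string (the result of str(embedding_vector)), a caller that needs numbers has to parse that string back. The following is a minimal sketch of one way to do so. parse_embedding is a hypothetical helper that is not part of the tutorial's code, and it assumes the result is either a Python-style list literal or one of the two error strings returned by evaluate.

```python
import ast


def parse_embedding(result):
    """Convert the function's string result back into a list of floats.

    Hypothetical helper: raises ValueError when the result is one of the
    error strings ("Not Valid", "Not Valid Model Type") instead of a list.
    """
    if not result.startswith("["):
        raise ValueError(f"embedding call failed: {result}")
    # literal_eval parses the list literal without executing arbitrary code
    return [float(x) for x in ast.literal_eval(result)]


vector = parse_embedding("[0.25, -0.5, 1.0]")
print(len(vector), vector[0])
```

Using ast.literal_eval rather than eval keeps the parsing safe even if the string comes from an untrusted source.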

Step 2: Download Third-Party Libraries

The program depends on two third-party packages that must be downloaded: openai, and dashscope, which is imported in the multimodal branch. The remaining imports, os and json, are part of the Python standard library and need no download, and cz.udf is added by default when the function is created.

Run the following in the development environment terminal:

    [root@docker embeddings]# pwd
    /root/embeddings
    [root@docker embeddings]# pip install openai dashscope -t .
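To confirm that the packages landed where Python can find them, a quick check along the following lines may help. This is a sketch under assumptions: missing_modules is a hypothetical helper, and it presumes the check is run from /root/embeddings so that packages installed with -t . are on the import path. The demo call uses standard-library names so that it runs anywhere.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# In the tutorial's environment you would pass the third-party packages,
# for example missing_modules(["openai", "dashscope"]).
print(missing_modules(["json", "os"]))
```

An empty list means every named module can be located; any name that is printed still needs to be installed into the directory.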

At this point, the directory structure should look like:

Step 3: Local Debugging

Comment out the following 2 lines of code, since the cz.udf library is not yet loaded in the current environment:

    ...
    2 #from cz.udf import annotate    # Comment out
    ...
    6 #@annotate("*->string")         # Comment out
    ...

The API key is the Alibaba Cloud Bailian platform API-KEY. Register an Alibaba Cloud account, log in, and obtain the key from Alibaba Cloud Bailian.

After commenting out the two lines above, save and exit the editor. Replace the image_url and api_key below with real values and run:

    [root@docker embeddings]# export PYTHONPATH="${PWD}:${PWD}/lib"
    [root@docker embeddings]# python gen_embeddings.py \
        --model_type multimodal \
        --input_string "${image_url}" \
        --api_key "${api_key}" \
        --model_name multimodal-embedding-v1
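As an alternative to commenting lines in and out by hand, the import can be guarded so that a no-op decorator is used when cz.udf is unavailable. This is only a sketch: the tutorial's own method is the comment-and-restore approach, and whether the Lakehouse runtime is happy with a guarded import is an assumption that has not been verified here. The echo class is a hypothetical stand-in for get_embeddings.

```python
try:
    from cz.udf import annotate  # expected to exist inside the Lakehouse runtime
except ImportError:
    def annotate(signature):
        """Local stand-in: accept the signature string and return the class unchanged."""
        def decorator(cls):
            return cls
        return decorator


@annotate("*->string")
class echo(object):
    # Hypothetical class used only to show that the decorated class still works
    def evaluate(self, value):
        return str(value)


print(echo().evaluate(42))
```

With this pattern the same file can be run locally and packaged without edits, at the cost of a few extra lines in the module.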

Step 4: Package and Upload

Before packaging, uncomment the two lines of code that were commented out in Step 3:

    ...
    2 from cz.udf import annotate     # Remove comment
    ...
    6 @annotate("*->string")          # Remove comment

Before running the packaging command, ensure the current directory is the program directory (in this example, /root/embeddings).

    [root@docker embeddings]# pwd
    /root/embeddings
    [root@docker embeddings]# zip -rq ../embeddings.zip ./
    [root@docker embeddings]# ls ../
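Zipping from inside the program directory matters because it places gen_embeddings.py at the top level of the archive rather than under an embeddings/ prefix. The sketch below checks for that layout. It assumes, which the source does not state explicitly, that the function's handler name is resolved relative to the archive root; module_at_archive_root is a hypothetical helper, and the demo builds a throwaway archive so that the block runs on its own.

```python
import os
import tempfile
import zipfile


def module_at_archive_root(zip_path, module_file="gen_embeddings.py"):
    """Return True if module_file is stored at the top level of the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()
    # Tolerate entries stored with a leading "./"
    normalized = [n[2:] if n.startswith("./") else n for n in names]
    return module_file in normalized


# Demo with a temporary archive that mimics the expected layout
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, "embeddings.zip")
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("gen_embeddings.py", "# placeholder\n")
        archive.writestr("openai/__init__.py", "")
    print(module_at_archive_root(demo_zip))
```

To check the real package, call module_at_archive_root("/root/embeddings.zip") inside the container.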

You will find an embeddings.zip file under the /root directory. Copy this file to the Lakehouse USER VOLUME as follows.

Run on the Docker host:

[Local]# docker cp cz_func:/root/embeddings.zip ~/Downloads

The embeddings.zip file is now in the host user's Downloads directory.

Use the Lakehouse JDBC client (see Lakehouse JDBC Client) to put (upload) the file into the Lakehouse USER VOLUME, replacing the local path in the example with your own:

PUT '/Users/derekmeng/Downloads/embeddings.zip' to USER VOLUME;

Step 5: Create and Use the Function

This step depends on having an API Connection created in advance. See: API Connection

    CREATE EXTERNAL FUNCTION public.fc_embeddings
    AS 'gen_embeddings.get_embeddings'
    USING ARCHIVE 'volume:user://~/embeddings.zip'
    CONNECTION sg_fc_api_conn
    WITH PROPERTIES (
        'remote.udf.api' = 'python3.mc.v0'
    )
    COMMENT 'Examples: For text: text <input_string> <api_key> <model_name> <dim> For multimodal: multimodal <input_string> <api_key> <model_name>';

Verify:

select public.fc_embeddings('multimodal', 'http://viapi-test.oss-cn-shanghai.aliyuncs.com/viapi-3.0domepic/imagerecog/RecognizeFood/RecognizeFood5.jpg', '${api_key}', 'multimodal-embedding-v1');

Execution result:

The next steps are the core steps for implementing the image-to-image search feature. This query takes an image URL, vectorizes it, and then compares it against all image vectors in the data table (food_images_data_vec). The contents of the table food_images_data_vec are as follows:

Result of vector-based image search:
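To make the comparison step described above concrete, here is a minimal Python sketch of ranking stored image vectors against a query vector. It is an illustration only: the source does not say which distance measure the Lakehouse query uses, so cosine similarity is an assumption, and the function names, image ids, and two-dimensional vectors are hypothetical stand-ins for rows of food_images_data_vec.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, rows, top_k=3):
    """Score every (image_id, vector) row against the query and keep the best top_k."""
    scored = [(image_id, cosine_similarity(query_vec, vec)) for image_id, vec in rows]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy stand-in for the vector table; real embeddings have many more dimensions
table = [("img_a", [1.0, 0.0]), ("img_b", [0.0, 1.0]), ("img_c", [1.0, 1.0])]
for image_id, score in rank_by_similarity([1.0, 0.0], table, top_k=2):
    print(image_id, round(score, 3))
```

In the tutorial's setting the query vector would come from fc_embeddings applied to the input image URL, and the scan over all rows would be done by the SQL engine rather than in Python.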