What is this?

openaivec is a Python library designed for efficient text processing using the OpenAI API, with seamless integration for both Pandas DataFrames and Apache Spark. It allows you to leverage the power of OpenAI models for tasks like generating embeddings or text responses directly within your data processing workflows.

Let's dive into Generative Mutation for tabular data!

The full API reference is available in the API Reference.

Here is some simple dummy data in a pd.Series.

animals: pd.Series = pd.Series(["panda", "koala", "python", "dog", "cat"])

You can mutate the column with natural language instructions.

# Translate animal names to Chinese
animals.ai.responses("Translate the animal names to Chinese.")

The result is ['熊猫', '考拉', '蟒蛇', '狗', '猫'] (I can't read Chinese, so I can't vouch for it).

This gives you an extremely fluent interface for data processing with pandas.

df = pd.DataFrame({"animal": ["panda", "koala", "python", "dog", "cat"]})
df.assign(
    zh=lambda df: df.animal.ai.responses("Translate the animal names to Chinese."),
    color=lambda df: df.animal.ai.responses("Translate the animal names to color."),
    is_technical_word=lambda df: df.animal.ai.responses("Is this related to python language? answer yes or no.").eq("yes"),
)

animal	zh	color	is_technical_word
panda	熊猫	black and white	False
koala	考拉	grey	False
python	蟒蛇	green	True
dog	狗	brown	False
cat	猫	orange	False

(Personally, I would expect the first and second rows of is_technical_word to be True...)

Do you want to use another LLM? I don't think so. OpenAI is all you need in this scenario.

Overview

This package provides a vectorized interface for the OpenAI API, enabling you to process multiple inputs with a single API call instead of sending requests one by one. This approach helps reduce latency and simplifies your code.
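
To make the batching idea concrete, here is a minimal sketch of how several inputs could be packed into one request and unpacked in their original order. This is not openaivec's actual implementation: `pack_batch`, `unpack_batch`, and `fake_model` are hypothetical names, the JSON layout is an assumption chosen for illustration, and the fake model stands in for a real API call.

```python
import json


def pack_batch(inputs: list[str]) -> str:
    """Serialize all inputs into one JSON payload, tagging each with its position."""
    return json.dumps([{"id": i, "text": text} for i, text in enumerate(inputs)])


def unpack_batch(raw_response: str, expected: int) -> list[str]:
    """Restore the answers to input order using the ids echoed back by the model."""
    by_id = {item["id"]: item["text"] for item in json.loads(raw_response)}
    missing = [i for i in range(expected) if i not in by_id]
    if missing:
        raise ValueError(f"response is missing ids: {missing}")
    return [by_id[i] for i in range(expected)]


def fake_model(payload: str) -> str:
    """Hypothetical stand-in for a single API call.

    It upper-cases each input and returns the items in reverse order,
    to show that the original ordering is recovered from the ids.
    """
    items = json.loads(payload)
    answered = [{"id": item["id"], "text": item["text"].upper()} for item in items]
    return json.dumps(list(reversed(answered)))


inputs = ["panda", "rabbit", "koala"]
outputs = unpack_batch(fake_model(pack_batch(inputs)), expected=len(inputs))
print(outputs)  # ['PANDA', 'RABBIT', 'KOALA']
```

Tagging each input with an id means the caller does not have to trust the response order, and a missing id is reported as an error instead of silently shifting results onto the wrong rows.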

Additionally, it integrates effortlessly with Pandas DataFrames and Apache Spark UDFs, making it easy to incorporate into your data processing pipelines.

Features

- Vectorized API requests for processing multiple inputs at once.
- Seamless integration with Pandas DataFrames.
- A UDF builder for Apache Spark.
- Compatibility with multiple OpenAI clients, including Azure OpenAI.

Requirements

Python 3.10 or higher

Installation

Install the package with:

pip install openaivec

If you want to uninstall the package, you can do so with:

pip uninstall openaivec

Basic Usage

Synchronous:

from openai import OpenAI
from openaivec import BatchResponses


# Initialize the batch client with your system message and parameters
client = BatchResponses(
    client=OpenAI(),
    temperature=0.0,
    top_p=1.0,
    model_name="<your-model-name>",
    system_message="Please answer only with 'xx family' and do not output anything else."
)

result = client.parse(["panda", "rabbit", "koala"])
print(result)  # Expected output: ['bear family', 'rabbit family', 'koala family']

See examples/basic_usage.ipynb for a complete example.

Using with Pandas DataFrame

openaivec.pandas_ext extends pandas.Series with accessors ai.responses and ai.embeddings.

import pandas as pd
from openai import OpenAI
from openaivec import pandas_ext

# Set the OpenAI client (optional: this is the default client when the environment variable OPENAI_API_KEY is set)
pandas_ext.use(OpenAI())

# Set the models for responses and embeddings (optional: these are the default models)
pandas_ext.responses_model("gpt-4o-mini")
pandas_ext.embeddings_model("text-embedding-3-small")

df = pd.DataFrame({"name": ["panda", "rabbit", "koala"]})

df.assign(
    kind=lambda df: df.name.ai.responses("Answer only with 'xx family' and do not output anything else.")
)

Example output:

name	kind
panda	bear family
rabbit	rabbit family
koala	koala family

Using with Apache Spark UDFs

openaivec.spark provides builders (ResponsesUDFBuilder, EmbeddingsUDFBuilder) to create asynchronous Spark UDFs for interacting with OpenAI APIs. These UDFs leverage openaivec.aio.pandas_ext for efficient asynchronous processing within Spark.
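
As a rough illustration of what asynchronous processing buys here, the sketch below runs several simulated requests concurrently while capping how many are in flight. It uses only the standard library; `fake_request`, `process_all`, and the `max_concurrency` parameter are hypothetical and are not part of openaivec's API.

```python
import asyncio


async def fake_request(text: str) -> str:
    """Hypothetical stand-in for one asynchronous API call."""
    await asyncio.sleep(0.01)  # simulate network latency
    return f"{text} family"


async def process_all(texts: list[str], max_concurrency: int = 2) -> list[str]:
    """Run requests concurrently, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def limited(text: str) -> str:
        async with semaphore:
            return await fake_request(text)

    # gather returns results in the order the awaitables were passed in
    return await asyncio.gather(*(limited(t) for t in texts))


results = asyncio.run(process_all(["panda", "rabbit", "koala"]))
print(results)  # ['panda family', 'rabbit family', 'koala family']
```

The semaphore is the piece that keeps a large partition from opening an unbounded number of connections at once; the right limit depends on your rate limits and is left as a parameter.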

First, obtain a Spark session:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

Next, instantiate UDF builders using either OpenAI or Azure OpenAI credentials and register the UDFs.

import os
from openaivec.spark import ResponsesUDFBuilder, EmbeddingsUDFBuilder, count_tokens_udf
from pydantic import BaseModel

# --- Option 1: Using OpenAI ---
resp_builder_openai = ResponsesUDFBuilder.of_openai(
    api_key=os.getenv("OPENAI_API_KEY"),
    model_name="gpt-4o-mini", # Model for responses
)
emb_builder_openai = EmbeddingsUDFBuilder.of_openai(
    api_key=os.getenv("OPENAI_API_KEY"),
    model_name="text-embedding-3-small", # Model for embeddings
)

# --- Option 2: Using Azure OpenAI ---
# resp_builder_azure = ResponsesUDFBuilder.of_azure_openai(
#     api_key=os.getenv("AZURE_OPENAI_KEY"),
#     endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
#     api_version=os.getenv("AZURE_OPENAI_API_VERSION"),
#     model_name="<your-resp-deployment-name>", # Deployment for responses
# )
# emb_builder_azure = EmbeddingsUDFBuilder.of_azure_openai(
#     api_key=os.getenv("AZURE_OPENAI_KEY"),
#     endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
#     api_version=os.getenv("AZURE_OPENAI_API_VERSION"),
#     model_name="<your-emb-deployment-name>", # Deployment for embeddings
# )

# --- Register Responses UDF (String Output) ---
# Use the builder corresponding to your setup (OpenAI or Azure)
spark.udf.register(
    "parse_flavor",
    resp_builder_openai.build( # or resp_builder_azure.build(...)
        instructions="Extract flavor-related information. Return only the concise flavor name.",
        response_format=str, # Specify string output
    )
)

# --- Register Responses UDF (Structured Output with Pydantic) ---
class Translation(BaseModel):
    en: str
    fr: str
    ja: str

spark.udf.register(
    "translate_struct",
    resp_builder_openai.build( # or resp_builder_azure.build(...)
        instructions="Translate the text to English, French, and Japanese.",
        response_format=Translation, # Specify Pydantic model for structured output
    )
)

# --- Register Embeddings UDF ---
spark.udf.register(
    "embed_text",
    emb_builder_openai.build() # or emb_builder_azure.build()
)

# --- Register Token Counting UDF ---
spark.udf.register("count_tokens", count_tokens_udf("gpt-4o"))

You can now use these UDFs in Spark SQL:

-- Create a sample table (replace with your actual table)
CREATE OR REPLACE TEMP VIEW product_names AS SELECT * FROM VALUES
  ('4414732714624', 'Cafe Mocha Smoothie (Trial Size)'),
  ('4200162318339', 'Dark Chocolate Tea (New Product)'),
  ('4920122084098', 'Uji Matcha Tea (New Product)')
AS product_names(id, product_name);

-- Use the registered UDFs
SELECT
    id,
    product_name,
    parse_flavor(product_name) AS flavor,
    translate_struct(product_name) AS translation,
    embed_text(product_name) AS embedding,
    count_tokens(product_name) AS token_count
FROM product_names;

Example Output (structure might vary slightly):

id	product_name	flavor	translation	embedding	token_count
4414732714624	Cafe Mocha Smoothie (Trial Size)	Mocha	{en: ..., fr: ..., ja: ...}	[0.1, -0.2, ..., 0.5]	8
4200162318339	Dark Chocolate Tea (New Product)	Chocolate	{en: ..., fr: ..., ja: ...}	[-0.3, 0.1, ..., -0.1]	7
4920122084098	Uji Matcha Tea (New Product)	Matcha	{en: ..., fr: ..., ja: ...}	[0.0, 0.4, ..., 0.2]	8
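
The translation column is a struct because translate_struct was built with a Pydantic model as its response_format. The sketch below shows the general idea of turning a JSON response into a typed object, using a standard-library dataclass as a stand-in for the Pydantic model. `parse_translation` is a hypothetical helper, and the sample payload is hand-written, not real model output.

```python
import json
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Translation:
    """Standard-library stand-in for the Pydantic model used above."""

    en: str
    fr: str
    ja: str


def parse_translation(raw: str) -> Translation:
    """Parse a JSON object and check that it has exactly the expected string fields."""
    data = json.loads(raw)
    expected = {f.name for f in fields(Translation)}
    if not isinstance(data, dict) or set(data) != expected:
        raise ValueError(f"expected an object with keys {sorted(expected)}")
    if not all(isinstance(value, str) for value in data.values()):
        raise TypeError("every field must be a string")
    return Translation(**data)


# Hand-written sample payload, not real model output
sample = '{"en": "green tea", "fr": "thé vert", "ja": "緑茶"}'
translation = parse_translation(sample)
print(translation.fr)  # thé vert
```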

Building Prompts

Building a prompt is a crucial step in using LLMs. In particular, providing a few examples in a prompt can significantly improve an LLM's performance, a technique known as "few-shot learning." Typically, a few-shot prompt consists of a purpose, cautions, and examples.

FewShotPromptBuilder is a class that helps you build a few-shot learning prompt with a simple interface.

Basic Usage

FewShotPromptBuilder requires only a purpose, cautions, and examples. Its build method returns the rendered prompt in XML format.

Here is an example:

from openaivec.prompt import FewShotPromptBuilder

prompt: str = (
    FewShotPromptBuilder()
    .purpose("Return the smallest category that includes the given word")
    .caution("Never use proper nouns as categories")
    .example("Apple", "Fruit")
    .example("Car", "Vehicle")
    .example("Tokyo", "City")
    .example("Keiichi Sogabe", "Musician")
    .example("America", "Country")
    .build()
)
print(prompt)

The output will be:

<Prompt>
    <Purpose>Return the smallest category that includes the given word</Purpose>
    <Cautions>
        <Caution>Never use proper nouns as categories</Caution>
    </Cautions>
    <Examples>
        <Example>
            <Input>Apple</Input>
            <Output>Fruit</Output>
        </Example>
        <Example>
            <Input>Car</Input>
            <Output>Vehicle</Output>
        </Example>
        <Example>
            <Input>Tokyo</Input>
            <Output>City</Output>
        </Example>
        <Example>
            <Input>Keiichi Sogabe</Input>
            <Output>Musician</Output>
        </Example>
        <Example>
            <Input>America</Input>
            <Output>Country</Output>
        </Example>
    </Examples>
</Prompt>
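
For readers curious how a prompt with this layout could be produced, here is a minimal sketch using the standard library's xml.etree.ElementTree. It is not FewShotPromptBuilder's implementation: `render_prompt` is a hypothetical helper that only mirrors the element names shown in the output above.

```python
import xml.etree.ElementTree as ET


def render_prompt(purpose: str, cautions: list[str], examples: list[tuple[str, str]]) -> str:
    """Render a few-shot prompt as indented XML (hypothetical helper)."""
    root = ET.Element("Prompt")
    ET.SubElement(root, "Purpose").text = purpose

    cautions_el = ET.SubElement(root, "Cautions")
    for caution in cautions:
        ET.SubElement(cautions_el, "Caution").text = caution

    examples_el = ET.SubElement(root, "Examples")
    for source, target in examples:
        example_el = ET.SubElement(examples_el, "Example")
        ET.SubElement(example_el, "Input").text = source
        ET.SubElement(example_el, "Output").text = target

    ET.indent(root, space="    ")  # available since Python 3.9
    return ET.tostring(root, encoding="unicode")


rendered = render_prompt(
    purpose="Return the smallest category that includes the given word",
    cautions=["Never use proper nouns as categories"],
    examples=[("Apple", "Fruit"), ("Car", "Vehicle")],
)
print(rendered)
```

Building the tree with ElementTree instead of string concatenation means characters such as & and < in an example are escaped automatically.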

Improve with OpenAI

For most users, it can be challenging to write a prompt entirely free of contradictions, ambiguities, or redundancies. FewShotPromptBuilder provides an improve method to refine your prompt using OpenAI's API.

The improve method tries to eliminate contradictions, ambiguities, and redundancies in the prompt using OpenAI's API, repeating the process up to max_iter times.

Here is an example:

from openai import OpenAI
from openaivec.prompt import FewShotPromptBuilder

client = OpenAI(...)
model_name = "<your-model-name>"
improved_prompt: str = (
    FewShotPromptBuilder()
    .purpose("Return the smallest category that includes the given word")
    .caution("Never use proper nouns as categories")
    # Examples that contain contradictions, ambiguities, or redundancies
    .example("Apple", "Fruit")
    .example("Apple", "Technology")
    .example("Apple", "Company")
    .example("Apple", "Color")
    .example("Apple", "Animal")
    # Improve the prompt with OpenAI's API; max_iter is the maximum number of improvement iterations.
    .improve(client, model_name, max_iter=5)
    .build()
)
print(improved_prompt)

This returns an improved prompt with additional examples and a refined purpose and cautions:

<Prompt>
    <Purpose>Classify a given word into its most relevant category by considering its context and potential meanings.
        The input is a word accompanied by context, and the output is the appropriate category based on that context.
        This is useful for disambiguating words with multiple meanings, ensuring accurate understanding and
        categorization.
    </Purpose>
    <Cautions>
        <Caution>Ensure the context of the word is clear to avoid incorrect categorization.</Caution>
        <Caution>Be aware of words with multiple meanings and provide the most relevant category.</Caution>
        <Caution>Consider the possibility of new or uncommon contexts that may not fit traditional categories.</Caution>
    </Cautions>
    <Examples>
        <Example>
            <Input>Apple (as a fruit)</Input>
            <Output>Fruit</Output>
        </Example>
        <Example>
            <Input>Apple (as a tech company)</Input>
            <Output>Technology</Output>
        </Example>
        <Example>
            <Input>Java (as a programming language)</Input>
            <Output>Technology</Output>
        </Example>
        <Example>
            <Input>Java (as an island)</Input>
            <Output>Geography</Output>
        </Example>
        <Example>
            <Input>Mercury (as a planet)</Input>
            <Output>Astronomy</Output>
        </Example>
        <Example>
            <Input>Mercury (as an element)</Input>
            <Output>Chemistry</Output>
        </Example>
        <Example>
            <Input>Bark (as a sound made by a dog)</Input>
            <Output>Animal Behavior</Output>
        </Example>
        <Example>
            <Input>Bark (as the outer covering of a tree)</Input>
            <Output>Botany</Output>
        </Example>
        <Example>
            <Input>Bass (as a type of fish)</Input>
            <Output>Aquatic Life</Output>
        </Example>
        <Example>
            <Input>Bass (as a low-frequency sound)</Input>
            <Output>Music</Output>
        </Example>
    </Examples>
</Prompt>
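
The role of max_iter can be pictured as a bounded refinement loop. The sketch below is an assumption about the general shape of such a loop, not the library's code: `improve` here is a free function, `drop_duplicate_lines` is a hypothetical stand-in for the LLM call, and stopping early when a pass changes nothing is a choice made for this sketch (the text above only states that the process repeats up to max_iter times).

```python
from typing import Callable


def improve(prompt: str, refine_once: Callable[[str], str], max_iter: int = 5) -> tuple[str, int]:
    """Apply refine_once repeatedly, at most max_iter times.

    Returns the final prompt and the number of refinement calls made.
    Stopping early when nothing changes is an assumption of this sketch.
    """
    calls = 0
    for _ in range(max_iter):
        candidate = refine_once(prompt)
        calls += 1
        if candidate == prompt:  # converged: another pass would change nothing
            break
        prompt = candidate
    return prompt, calls


def drop_duplicate_lines(prompt: str) -> str:
    """Hypothetical stand-in for the LLM call: removes one repeated line per pass."""
    seen: set[str] = set()
    lines = prompt.splitlines()
    for index, line in enumerate(lines):
        if line in seen:
            return "\n".join(lines[:index] + lines[index + 1:])
        seen.add(line)
    return prompt


draft = "Apple -> Fruit\nApple -> Fruit\nCar -> Vehicle\nCar -> Vehicle"
final_prompt, calls = improve(draft, drop_duplicate_lines, max_iter=5)
print(final_prompt)
print(calls)  # 3: two passes that remove a duplicate, one that confirms convergence
```

A hard cap like max_iter matters because each pass costs an API call, and a refinement step is not guaranteed to converge on its own.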

Using with Microsoft Fabric

Microsoft Fabric is a unified, cloud-based analytics platform that seamlessly integrates data engineering, warehousing, and business intelligence to simplify the journey from raw data to actionable insights.

This section provides instructions on how to integrate and use openaivec within Microsoft Fabric. Follow these steps:

Create an Environment in Microsoft Fabric:
- In Microsoft Fabric, click on New item in your workspace.
- Select Environment to create a new environment for Apache Spark.
- Choose an environment name, e.g. openai-environment.
- Figure: Creating a new Environment in Microsoft Fabric.
Add openaivec to the Environment from Public Library:
- Once your environment is set up, go to the Custom Library section within that environment.
- Click on Add from PyPI and search for the latest version of openaivec.
- Save and publish to reflect the changes.
- Figure: Adding openaivec from PyPI to Public Library.
Use the Environment from a Notebook:
- Open a notebook within Microsoft Fabric.
- Select the environment you created in the previous steps.
- Figure: Using custom environment from a notebook.
- In the notebook, import and use openaivec.spark.ResponsesUDFBuilder as you normally would. For example, with Azure OpenAI credentials:
```
from openaivec.spark import ResponsesUDFBuilder

# Builder configured for an Azure OpenAI deployment
resp_builder = ResponsesUDFBuilder.of_azure_openai(
    api_key="<your-api-key>",
    api_version="2024-10-21",
    endpoint="https://<your-resource-name>.openai.azure.com",
    model_name="<your-deployment-name>"
)
```

Following these steps allows you to successfully integrate and use openaivec within Microsoft Fabric.

Contributing

We welcome contributions to this project! If you would like to contribute, please follow these guidelines:

- Fork the repository and create your branch from main.
- If you've added code that should be tested, add tests.
- Ensure the test suite passes.
- Make sure your code lints.

Installing Dependencies

To install the necessary dependencies for development, run:

uv sync --all-extras --dev

Code Formatting

To reformat the code, use the following command:

uv run ruff check . --fix

Community

Join our Discord community for developers: https://discord.gg/vbb83Pgn
