diff --git a/docs/docs/tutorials/audio/index.ipynb b/docs/docs/tutorials/audio/index.ipynb new file mode 100644 index 0000000000..204c7c1fd1 --- /dev/null +++ b/docs/docs/tutorials/audio/index.ipynb @@ -0,0 +1,722 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Tutorial: Using Audio in DSPy Programs\n", + "\n", + "This tutorial walks through building pipelines for audio-based applications using DSPy." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Install Dependencies\n", + "\n", + "Ensure you're using the latest DSPy version:\n", + "\n", + "```shell\n", + "pip install -U dspy\n", + "```\n", + "\n", + "To handle audio data, install the following dependencies:\n", + "\n", + "```shell\n", + "pip install soundfile torch==2.0.1+cu118 torchaudio==2.0.2+cu118\n", + "```\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Load the Spoken-SQuAD Dataset\n", + "\n", + "We'll use the Spoken-SQuAD dataset ([Official](https://github.com/Chia-Hsuan-Lee/Spoken-SQuAD) & [HuggingFace version](https://huggingface.co/datasets/AudioLLMs/spoken_squad_test) for tutorial demonstration), which contains spoken audio passages used for question-answering:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import random\n", + "import dspy\n", + "from dspy.datasets import DataLoader\n", + "\n", + "kwargs = dict(fields=(\"context\", \"instruction\", \"answer\"), input_keys=(\"context\", \"instruction\"))\n", + "spoken_squad = DataLoader().from_huggingface(dataset_name=\"AudioLLMs/spoken_squad_test\", split=\"train\", trust_remote_code=True, **kwargs)\n", + "\n", + "random.Random(42).shuffle(spoken_squad)\n", + "spoken_squad = spoken_squad[:100]\n", + "\n", + "split_idx = len(spoken_squad) // 2\n", + "trainset_raw, testset_raw = spoken_squad[:split_idx], spoken_squad[split_idx:]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Preprocess Audio Data\n", + "\n", + "The audio clips in the dataset require some preprocessing into byte arrays with their corresponding sampling rates." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def preprocess(x):\n", + " audio = dspy.Audio.from_array(x.context[\"array\"], x.context[\"sampling_rate\"])\n", + " return dspy.Example(\n", + " passage_audio=audio,\n", + " question=x.instruction,\n", + " answer=x.answer\n", + " ).with_inputs(\"passage_audio\", \"question\")\n", + "\n", + "trainset = [preprocess(x) for x in trainset_raw]\n", + "testset = [preprocess(x) for x in testset_raw]\n", + "\n", + "len(trainset), len(testset)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## DSPy program for spoken question answering\n", + "\n", + "Let's define a simple DSPy program that uses audio inputs to answer questions directly. 
This is very similar to the [BasicQA](https://dspy.ai/cheatsheet/?h=basicqa#dspysignature) task, with the only difference being that the passage context is provided as audio for the model to listen to before answering the question:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "class SpokenQASignature(dspy.Signature):\n", + " \"\"\"Answer the question based on the audio clip.\"\"\"\n", + " passage_audio: dspy.Audio = dspy.InputField()\n", + " question: str = dspy.InputField()\n", + " answer: str = dspy.OutputField(desc='factoid answer between 1 and 5 words')\n", + "\n", + "spoken_qa = dspy.ChainOfThought(SpokenQASignature)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now let's configure an LM that can process audio input:\n", + "\n", + "```python\n", + "dspy.settings.configure(lm=dspy.LM(model='gpt-4o-mini-audio-preview-2024-12-17'))\n", + "```\n", + "\n", + "Note: Using `dspy.Audio` in signatures lets you pass audio directly to the model." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Define Evaluation Metric\n", + "\n", + "We'll use the Exact Match metric (`dspy.evaluate.answer_exact_match`) to measure answer accuracy against the provided reference answers:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "evaluate_program = dspy.Evaluate(devset=testset, metric=dspy.evaluate.answer_exact_match, display_progress=True, num_threads=10, display_table=True)\n", + "\n", + "evaluate_program(spoken_qa)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Optimize with DSPy\n", + "\n", + "You can optimize this audio-based program as you would any other DSPy program, using any DSPy optimizer.\n", + "\n", + "Note: Audio tokens can be costly, so it is recommended to configure optimizers like `dspy.BootstrapFewShotWithRandomSearch` or `dspy.MIPROv2` conservatively, with 0-2 few-shot examples and fewer candidates/trials than the optimizer defaults." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "optimizer = dspy.BootstrapFewShotWithRandomSearch(metric=dspy.evaluate.answer_exact_match, max_bootstrapped_demos=2, max_labeled_demos=2, num_candidate_programs=5)\n", + "\n", + "optimized_program = optimizer.compile(spoken_qa, trainset=trainset)\n", + "\n", + "evaluate_program(optimized_program)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "prompt_lm = dspy.LM(model='gpt-4o-mini') #NOTE - this is the LLM guiding the MIPROv2 instruction candidate proposal\n", + "optimizer = dspy.MIPROv2(metric=dspy.evaluate.answer_exact_match, auto=\"light\", prompt_model=prompt_lm)\n", + "\n", + "#NOTE - MIPROv2's dataset summarizer cannot process the audio files in the dataset, so we turn off the data_aware_proposer\n", + "optimized_program = optimizer.compile(spoken_qa, trainset=trainset, max_bootstrapped_demos=2, max_labeled_demos=2, data_aware_proposer=False)\n", + "\n", + "evaluate_program(optimized_program)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "With this small subset, MIPROv2 led to a ~10% improvement over baseline performance."
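, + "\n", + "To sanity-check the optimized program, you can also run it on a single held-out example and inspect its prediction. A quick sketch (the choice of `testset[0]` is arbitrary):\n", + "\n", + "```python\n", + "example = testset[0]\n", + "pred = optimized_program(passage_audio=example.passage_audio, question=example.question)\n", + "print(pred.reasoning)  # chain-of-thought text produced before the answer\n", + "print(pred.answer)     # predicted short factoid answer\n", + "print(example.answer)  # gold reference answer\n", + "```"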
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now that we’ve seen how to use an audio-input-capable LLM in DSPy, let’s flip the setup.\n", + "\n", + "In this next task, we'll use a standard text-based LLM to generate prompts for a text-to-speech model and then evaluate the quality of the produced speech for a downstream task. This approach is generally more cost-effective than asking an LLM like `gpt-4o-mini-audio-preview-2024-12-17` to generate audio directly, while still enabling a pipeline that can be optimized for higher-quality speech output." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Load the CREMA-D Dataset\n", + "\n", + "We'll use the CREMA-D dataset ([Official](https://github.com/CheyneyComputerScience/CREMA-D) & [HuggingFace version](https://huggingface.co/datasets/myleslinder/crema-d) for tutorial demonstration), which includes audio clips of actors speaking the same lines, each with one of six target emotions: neutral, happy, sad, anger, fear, and disgust." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from collections import defaultdict\n", + "\n", + "label_map = ['neutral', 'happy', 'sad', 'anger', 'fear', 'disgust']\n", + "\n", + "kwargs = dict(fields=(\"sentence\", \"label\", \"audio\"), input_keys=(\"sentence\", \"label\"))\n", + "crema_d = DataLoader().from_huggingface(dataset_name=\"myleslinder/crema-d\", split=\"train\", trust_remote_code=True, **kwargs)\n", + "\n", + "def preprocess(x):\n", + " return dspy.Example(\n", + " raw_line=x.sentence,\n", + " target_style=label_map[x.label],\n", + " reference_audio=dspy.Audio.from_array(x.audio[\"array\"], x.audio[\"sampling_rate\"])\n", + " ).with_inputs(\"raw_line\", \"target_style\")\n", + "\n", + "random.Random(42).shuffle(crema_d)\n", + "crema_d = crema_d[:100]\n", + "\n", + "random.seed(42)\n", + "label_to_indices = defaultdict(list)\n", + "for idx, x in enumerate(crema_d):\n", + " label_to_indices[x.label].append(idx)\n", + "\n", + "per_label = 100 // len(label_map)\n", + "train_indices, test_indices = [], []\n", + "for indices in label_to_indices.values():\n", + " selected = random.sample(indices, min(per_label, len(indices)))\n", + " split = len(selected) // 2\n", + " train_indices.extend(selected[:split])\n", + " test_indices.extend(selected[split:])\n", + "\n", + "trainset = [preprocess(crema_d[idx]) for idx in train_indices]\n", + "testset = [preprocess(crema_d[idx]) for idx in test_indices]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## DSPy pipeline for generating TTS instructions for speaking with a target emotion\n", + "\n", + "We’ll now build a pipeline that generates emotionally expressive speech by prompting a TTS model with both a line of text and an instruction on how to say it. \n", + "The goal of this task is to use DSPy to generate prompts that guide the TTS output to match the emotion and style of reference audio from the dataset.\n", + "\n", + "First, let’s set up the TTS generator to produce spoken audio with a specified emotion or style. \n", + "We use `gpt-4o-mini-tts`, which accepts both the raw line to speak and an instruction on how to speak it, and returns audio that we save as a `.wav` file and wrap in `dspy.Audio`. \n", + "We also set up a cache for the TTS outputs."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "import base64\n", + "import hashlib\n", + "from openai import OpenAI\n", + "\n", + "CACHE_DIR = \".audio_cache\"\n", + "os.makedirs(CACHE_DIR, exist_ok=True)\n", + "\n", + "def hash_key(raw_line: str, prompt: str) -> str:\n", + " return hashlib.sha256(f\"{raw_line}|||{prompt}\".encode(\"utf-8\")).hexdigest()\n", + "\n", + "def generate_dspy_audio(raw_line: str, prompt: str) -> dspy.Audio:\n", + " client = OpenAI(api_key=os.environ[\"OPENAI_API_KEY\"])\n", + " key = hash_key(raw_line, prompt)\n", + " wav_path = os.path.join(CACHE_DIR, f\"{key}.wav\")\n", + " if not os.path.exists(wav_path):\n", + " response = client.audio.speech.create(\n", + " model=\"gpt-4o-mini-tts\",\n", + " voice=\"coral\", #NOTE - this can be configured to any of the 11 offered OpenAI TTS voices - https://platform.openai.com/docs/guides/text-to-speech#voice-options. \n", + " input=raw_line,\n", + " instructions=prompt,\n", + " response_format=\"wav\"\n", + " )\n", + " with open(wav_path, \"wb\") as f:\n", + " f.write(response.content)\n", + " with open(wav_path, \"rb\") as f:\n", + " encoded = base64.b64encode(f.read()).decode(\"utf-8\")\n", + " return dspy.Audio(data=encoded, format=\"wav\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now let's define the DSPy program for generating TTS instructions. For this program, we can use a standard text-based LLM again, since we're only generating instructions." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "class EmotionStylePromptSignature(dspy.Signature):\n", + " \"\"\"Generate an OpenAI TTS instruction that makes the TTS model speak the given line with the target emotion or style.\"\"\"\n", + " raw_line: str = dspy.InputField()\n", + " target_style: str = dspy.InputField()\n", + " openai_instruction: str = dspy.OutputField()\n", + "\n", + "class EmotionStylePrompter(dspy.Module):\n", + " def __init__(self):\n", + " super().__init__()\n", + " self.prompter = dspy.ChainOfThought(EmotionStylePromptSignature)\n", + "\n", + " def forward(self, raw_line, target_style):\n", + " out = self.prompter(raw_line=raw_line, target_style=target_style)\n", + " audio = generate_dspy_audio(raw_line, out.openai_instruction)\n", + " return dspy.Prediction(audio=audio)\n", + " \n", + "dspy.settings.configure(lm=dspy.LM(model='gpt-4o-mini'))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Define Evaluation Metric\n", + "\n", + "Comparing generated audio to a reference is generally non-trivial because judgments of speech, especially emotional expression, are subjective. For the purposes of this tutorial, we use an embedding-based similarity metric for objective evaluation: Wav2Vec 2.0 converts each clip into an embedding, and we compute the cosine similarity between the reference and generated audio. To evaluate audio quality more rigorously, human feedback or perceptual metrics would be more suitable.
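\n", + "\n", + "One practical caveat: `WAV2VEC2_BASE` expects 16 kHz input (`bundle.sample_rate` is 16000), and the generated TTS audio may not share the sampling rate of the CREMA-D references. If the rates differ, resample before extracting embeddings. A minimal sketch using `torchaudio` (the helper name and variables are illustrative):\n", + "\n", + "```python\n", + "import torch\n", + "import torchaudio\n", + "\n", + "def to_16k(array, rate):\n", + "    # Resample a float32 waveform array to the 16 kHz rate Wav2Vec 2.0 expects\n", + "    waveform = torch.tensor(array).unsqueeze(0)\n", + "    if rate != 16000:\n", + "        waveform = torchaudio.functional.resample(waveform, rate, 16000)\n", + "    return waveform\n", + "```\n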
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import torch\n", + "import torchaudio\n", + "import soundfile as sf\n", + "import io\n", + "\n", + "bundle = torchaudio.pipelines.WAV2VEC2_BASE\n", + "model = bundle.get_model().eval()\n", + "\n", + "def decode_dspy_audio(dspy_audio):\n", + " audio_bytes = base64.b64decode(dspy_audio.data)\n", + " array, _ = sf.read(io.BytesIO(audio_bytes), dtype=\"float32\")\n", + " return torch.tensor(array).unsqueeze(0)\n", + "\n", + "def extract_embedding(audio_tensor):\n", + " with torch.inference_mode():\n", + " return model(audio_tensor)[0].mean(dim=1)\n", + "\n", + "def cosine_similarity(a, b):\n", + " return torch.nn.functional.cosine_similarity(a, b).item()\n", + "\n", + "def audio_similarity_metric(example, pred, trace=None):\n", + " ref_audio = decode_dspy_audio(example.reference_audio)\n", + " gen_audio = decode_dspy_audio(pred.audio)\n", + "\n", + " ref_embed = extract_embedding(ref_audio)\n", + " gen_embed = extract_embedding(gen_audio)\n", + "\n", + " score = cosine_similarity(ref_embed, gen_embed)\n", + "\n", + " if trace is not None:\n", + " return score > 0.8 \n", + " return score\n", + "\n", + "evaluate_program = dspy.Evaluate(devset=testset, metric=audio_similarity_metric, display_progress=True, num_threads = 10, display_table=True)\n", + "\n", + "evaluate_program(EmotionStylePrompter())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can look at an example to see what instructions the DSPy program generated and the corresponding score:" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "\n", + "\n", + "\n", + "\u001b[34m[2025-05-15T22:01:22.667596]\u001b[0m\n", + "\n", + "\u001b[31mSystem message:\u001b[0m\n", + "\n", + "Your input fields are:\n", + "1. `raw_line` (str)\n", + "2. `target_style` (str)\n", + "Your output fields are:\n", + "1. `reasoning` (str)\n", + "2. `openai_instruction` (str)\n", + "All interactions will be structured in the following way, with the appropriate values filled in.\n", + "\n", + "[[ ## raw_line ## ]]\n", + "{raw_line}\n", + "\n", + "[[ ## target_style ## ]]\n", + "{target_style}\n", + "\n", + "[[ ## reasoning ## ]]\n", + "{reasoning}\n", + "\n", + "[[ ## openai_instruction ## ]]\n", + "{openai_instruction}\n", + "\n", + "[[ ## completed ## ]]\n", + "In adhering to this structure, your objective is: \n", + " Generate an OpenAI TTS instruction that makes the TTS model speak the given line with the target emotion or style.\n", + "\n", + "\n", + "\u001b[31mUser message:\u001b[0m\n", + "\n", + "[[ ## raw_line ## ]]\n", + "It's eleven o'clock\n", + "\n", + "[[ ## target_style ## ]]\n", + "disgust\n", + "\n", + "Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## openai_instruction ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`.\n", + "\n", + "\n", + "\u001b[31mResponse:\u001b[0m\n", + "\n", + "\u001b[32m[[ ## reasoning ## ]]\n", + "To generate the OpenAI TTS instruction, we need to specify the target emotion or style, which in this case is 'disgust'. 
We will use the OpenAI TTS instruction format, which includes the text to be spoken and the desired emotion or style.\n", + "\n", + "[[ ## openai_instruction ## ]]\n", + "\"Speak the following line with a tone of disgust: It's eleven o'clock\"\n", + "\n", + "[[ ## completed ## ]]\u001b[0m\n", + "\n", + "\n", + "\n", + "\n", + "\n" + ] + } + ], + "source": [ + "program = EmotionStylePrompter()\n", + "\n", + "pred = program(raw_line=testset[1].raw_line, target_style=testset[1].target_style)\n", + "\n", + "print(audio_similarity_metric(testset[1], pred)) #0.5725605487823486\n", + "\n", + "dspy.inspect_history(n=1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "TTS Instruction: \n", + "```text\n", + "Speak the following line with a tone of disgust: It's eleven o'clock\n", + "```\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + " \n", + " " + ], + "text/plain": [ + "" + ] + }, + "execution_count": 28, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from IPython.display import Audio\n", + "\n", + "audio_bytes = base64.b64decode(pred.audio.data)\n", + "array, rate = sf.read(io.BytesIO(audio_bytes), dtype=\"float32\")\n", + "Audio(array, rate=rate)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The instruction specifies the target emotion, but is not too informative beyond that. We can also see that the audio score for this sample is not too high. Let's see if we can do better by optimizing this pipeline." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Optimize with DSPy\n", + "\n", + "We can leverage `dspy.MIPROv2` to refine the downstream task objective and produce higher quality TTS instructions, leading to more accurate and expressive audio generations:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "prompt_lm = dspy.LM(model='gpt-4o-mini')\n", + "\n", + "teleprompter = dspy.MIPROv2(metric=audio_similarity_metric, auto=\"light\", prompt_model = prompt_lm)\n", + "\n", + "optimized_program = teleprompter.compile(EmotionStylePrompter(),trainset=trainset)\n", + "\n", + "evaluate_program(optimized_program)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's take a look at how the optimized program performs:" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "\n", + "\n", + "\n", + "\u001b[34m[2025-05-15T22:09:40.088592]\u001b[0m\n", + "\n", + "\u001b[31mSystem message:\u001b[0m\n", + "\n", + "Your input fields are:\n", + "1. `raw_line` (str)\n", + "2. `target_style` (str)\n", + "Your output fields are:\n", + "1. `reasoning` (str)\n", + "2. 
`openai_instruction` (str)\n", + "All interactions will be structured in the following way, with the appropriate values filled in.\n", + "\n", + "[[ ## raw_line ## ]]\n", + "{raw_line}\n", + "\n", + "[[ ## target_style ## ]]\n", + "{target_style}\n", + "\n", + "[[ ## reasoning ## ]]\n", + "{reasoning}\n", + "\n", + "[[ ## openai_instruction ## ]]\n", + "{openai_instruction}\n", + "\n", + "[[ ## completed ## ]]\n", + "In adhering to this structure, your objective is: \n", + " Generate an OpenAI TTS instruction that makes the TTS model speak the given line with the target emotion or style, as if the speaker is a [insert persona relevant to the task, e.g. \"irate customer\", \"angry boss\", etc.]. The instruction should specify the tone, pitch, and other characteristics of the speaker's voice to convey the target emotion.\n", + "\n", + "\n", + "\u001b[31mUser message:\u001b[0m\n", + "\n", + "[[ ## raw_line ## ]]\n", + "It's eleven o'clock\n", + "\n", + "[[ ## target_style ## ]]\n", + "disgust\n", + "\n", + "Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## openai_instruction ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`.\n", + "\n", + "\n", + "\u001b[31mResponse:\u001b[0m\n", + "\n", + "\u001b[32m[[ ## reasoning ## ]]\n", + "To convey disgust, the speaker's voice should be characterized by a high-pitched tone, a slightly nasal quality, and a sense of revulsion. The speaker's words should be delivered with a sense of distaste and aversion, as if the speaker is trying to convey their strong negative emotions.\n", + "\n", + "[[ ## openai_instruction ## ]]\n", + "Generate a text-to-speech synthesis of the input text \"It's eleven o'clock\" with the following characteristics: \n", + "- Tone: Disgusted\n", + "- Pitch: High-pitched, slightly nasal\n", + "- Emphasis: Emphasize the words to convey a sense of distaste and aversion\n", + "- Volume: Moderate to loud, with a sense of rising inflection at the end to convey the speaker's strong negative emotions\n", + "- Speaker: A person who is visibly and audibly disgusted, such as a character who has just been served a spoiled meal.\n", + "\n", + "[[ ## completed ## ]]\u001b[0m\n", + "\n", + "\n", + "\n", + "\n", + "\n" + ] + } + ], + "source": [ + "pred = optimized_program(raw_line=testset[1].raw_line, target_style=testset[1].target_style)\n", + "\n", + "print(audio_similarity_metric(testset[1], pred)) #0.6691027879714966\n", + "\n", + "dspy.inspect_history(n=1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "MIPROv2 Optimized Program Instruction: \n", + "```text \n", + "Generate an OpenAI TTS instruction that makes the TTS model speak the given line with the target emotion or style, as if the speaker is a [insert persona relevant to the task, e.g. \"irate customer\", \"angry boss\", etc.]. 
The instruction should specify the tone, pitch, and other characteristics of the speaker's voice to convey the target emotion.\n", + "```\n", + "\n", + "TTS Instruction: \n", + "```text\n", + "Generate a text-to-speech synthesis of the input text \"It's eleven o'clock\" with the following characteristics: \n", + "- Tone: Disgusted\n", + "- Pitch: High-pitched, slightly nasal\n", + "- Emphasis: Emphasize the words to convey a sense of distaste and aversion\n", + "- Volume: Moderate to loud, with a sense of rising inflection at the end to convey the speaker's strong negative emotions\n", + "- Speaker: A person who is visibly and audibly disgusted, such as a character who has just been served a spoiled meal.\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + " \n", + " " + ], + "text/plain": [ + "" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from IPython.display import Audio\n", + "\n", + "audio_bytes = base64.b64decode(pred.audio.data)\n", + "array, rate = sf.read(io.BytesIO(audio_bytes), dtype=\"float32\")\n", + "Audio(array, rate=rate)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "MIPROv2's instruction tuning added more detail to the overall task objective, spelling out criteria for how the TTS instruction should be written. In turn, the generated instruction addresses the various factors of speech prosody much more specifically and produces a higher similarity score." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "jun2024_py310", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.16" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/docs/docs/tutorials/index.md b/docs/docs/tutorials/index.md index 6504d547e1..95f90d5045 100644 --- a/docs/docs/tutorials/index.md +++ b/docs/docs/tutorials/index.md @@ -20,6 +20,7 @@ Welcome to DSPy tutorials! We've organized our tutorials into three main categor - [Multi-Hop RAG](/tutorials/multihop_search/) - [Privacy-Conscious Delegation](/tutorials/papillon/) - [Image Generation Prompt iteration](/tutorials/image_generation_prompting/) + - [Audio](/tutorials/audio/) - Optimize AI Programs with DSPy diff --git a/docs/mkdocs.yml b/docs/mkdocs.yml index 893221c79d..25d1c12cc3 100644 --- a/docs/mkdocs.yml +++ b/docs/mkdocs.yml @@ -34,6 +34,7 @@ nav: - Multi-Hop RAG: tutorials/multihop_search/index.ipynb - Privacy-Conscious Delegation: tutorials/papillon/index.md - Image Generation Prompt iteration: tutorials/image_generation_prompting/index.ipynb + - Audio: tutorials/audio/index.ipynb - Optimize AI Programs with DSPy: - Overview: tutorials/optimize_ai_program/index.md - Math Reasoning: tutorials/math/index.ipynb