
Added support for OpenAI Text to Audio (Speech API) #317


Closed
hemeda3 wants to merge 1 commit into main from the speech branch

Conversation

@hemeda3 hemeda3 commented Feb 12, 2024

How to use it:

    @Autowired
    public OpenAiAudioSpeechClient openAiAudioSpeechClient;

    byte[] responseAsBytes = openAiAudioSpeechClient.call("Hello, world!");
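
For example, the returned bytes can be written straight to an audio file. A minimal sketch using the JDK's Files API; the output path is illustrative:

    import java.nio.file.Files;
    import java.nio.file.Path;

    // Write the synthesized speech to disk; the .mp3 extension matches the
    // response format configured below.
    byte[] speech = openAiAudioSpeechClient.call("Hello, world!");
    Files.write(Path.of("speech.mp3"), speech);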
      

Config:

          # OpenAI API configuration
          spring.ai.openai.api-key=your_api_key
          spring.ai.openai.base-url=https://api.openai.com
          
          # Speech synthesis options
          spring.ai.openai.audio.speech.options.model=tts-1
          spring.ai.openai.audio.speech.options.voice=alloy
          spring.ai.openai.audio.speech.options.response-format=mp3
          spring.ai.openai.audio.speech.options.speed=0.75

Manual options with metadata/rate-limit info and prompt style:

    OpenAiAudioSpeechOptions options = OpenAiAudioSpeechOptions.builder()
        .withSpeed(0.25f)
        .withModel(OpenAiAudioApi.TtsModel.TTS_1.value)
        .build();
    SpeechPrompt speechPrompt = new SpeechPrompt("Hello, world!", options);
    SpeechResponse responseWithMetadata = openAiAudioSpeechClient.call(speechPrompt);
    OpenAiAudioSpeechResponseMetadata metadata = responseWithMetadata.getMetadata(); // rate-limit info

    byte[] responseAsBytes = responseWithMetadata.getResult().getOutput();
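
A hedged sketch of reading the rate-limit headers off that metadata, assuming OpenAiAudioSpeechResponseMetadata exposes a RateLimit accessor analogous to the chat client's metadata (the method names are assumptions, not confirmed by this PR):

    // Assumption: metadata.getRateLimit() returns Spring AI's RateLimit view
    // over the x-ratelimit-* response headers.
    RateLimit rateLimit = metadata.getRateLimit();
    System.out.println("Requests limit: " + rateLimit.getRequestsLimit());
    System.out.println("Requests remaining: " + rateLimit.getRequestsRemaining());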

Streaming speech audio directly from the OpenAI API:

    OpenAiAudioSpeechOptions speechOptions = OpenAiAudioSpeechOptions.builder()
        .withVoice(OpenAiAudioApi.SpeechRequest.Voice.ALLOY)
        .withSpeed(SPEED) // a float speed constant, e.g. 1.0f
        .withResponseFormat(OpenAiAudioApi.SpeechRequest.AudioResponseFormat.MP3)
        .withModel(OpenAiAudioApi.TtsModel.TTS_1.value)
        .build();
    SpeechPrompt speechPrompt = new SpeechPrompt("Today is a wonderful day to build something people love!",
            speechOptions);
    Flux<SpeechResponse> response = openAiAudioSpeechClient.stream(speechPrompt);
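
Each SpeechResponse in the Flux carries a chunk of the audio. One way to consume the stream is to accumulate the chunks back into a single byte array; a sketch using plain Reactor operators, with the terminal block() for illustration only:

    import java.io.ByteArrayOutputStream;

    // Concatenate the streamed MP3 chunks into one byte array.
    byte[] allBytes = response
        .map(speechResponse -> speechResponse.getResult().getOutput())
        .reduce(new ByteArrayOutputStream(), (baos, chunk) -> {
            baos.writeBytes(chunk); // ByteArrayOutputStream.writeBytes, Java 11+
            return baos;
        })
        .map(ByteArrayOutputStream::toByteArray)
        .block();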

@hemeda3 hemeda3 force-pushed the speech branch 3 times, most recently from d234402 to 43d2390 on February 12, 2024 08:33
@hemeda3 hemeda3 force-pushed the speech branch 2 times, most recently from 8a79494 to 2ce21fb on February 12, 2024 08:38
@markpollack markpollack added this to the 0.9.0 milestone Feb 12, 2024
@hemeda3 hemeda3 force-pushed the speech branch 2 times, most recently from 8039a5f to 4e4989c on February 15, 2024 00:42
@markpollack markpollack modified the milestones: 0.9.0, 0.8.1 Feb 29, 2024
@markpollack markpollack self-assigned this Feb 29, 2024
@tzolov tzolov added the enhancement (New feature or request) and model client labels Mar 1, 2024
@hemeda3 hemeda3 force-pushed the speech branch 3 times, most recently from 9198015 to eb766b1 on March 5, 2024 21:57
@hemeda3 hemeda3 commented Mar 5, 2024

@tzolov it seems you worked on the speech API code and merged it to main. Should I close this PR?

@tzolov tzolov commented Mar 6, 2024

Hi @hemeda3, thanks for reaching out.
I've worked on and merged the #300 PR.
In the process I realised that the low-level API is scattered between the #300 and your #317 PRs. Additionally, it was not completely covering the underlying OpenAI Audio API spec.
So I decided to adopt an old implementation I did for my Assistant AI explorations.

Next, I realised that until we have at least two text-to-speech and speech-to-text client implementations from different AI vendors, it is premature to create common model abstractions under spring-ai-core/model. The latter are meant to facilitate portability between vendors, but with a single implementation there is not enough data to decide what the common abstractions should look like.
Therefore I've moved the Audio Transcription prompt/response and the like inside the spring-ai-openai project (under the audio/transcription package). Later, when we have more audio clients, we can decide how to abstract those back into the core.
Finally, as you can see, #300 implements a client for the transcription endpoint, while your PR adds a speech generation client.

Having said this, would you be interested in re-working your PR after the refactoring I did? You would have to base your client on the OpenAiAudioApi low-level client and move the code from spring-ai-core to the spring-ai-openai .../audio/speech package (e.g. next to .../audio/transcription).
I would really appreciate your help. Do not hesitate to ask or suggest improvements.

@hemeda3 hemeda3 commented Mar 6, 2024

> Having said this, would you be interested in re-working your PR after the refactoring I did? You would have to base your client on the OpenAiAudioApi low-level client and move the code from spring-ai-core to the spring-ai-openai .../audio/speech package (e.g. next to .../audio/transcription). I would really appreciate your help. Do not hesitate to ask or suggest improvements.

@tzolov, thanks for the explanation 🙏. Actually, I was a bit confused, since both APIs (speech + transcription) share the same OpenAI audio API at a low level, but your changes have clarified things for me. I'm happy to re-work my PR based on your updates. If I have any questions, I'll reach out. Thanks for the opportunity to contribute and learn.

@hemeda3 hemeda3 force-pushed the speech branch 8 times, most recently from 4d6a718 to 684c81a on March 7, 2024 04:46
@hemeda3 hemeda3 commented Mar 7, 2024

  • Rebased the client on the OpenAiAudioApi low-level client
  • Moved the code from spring-ai-core to the spring-ai-openai .../audio/speech package
  • Added stream support (the OpenAI speech response can be received as a stream; see the sketch after this list)
  • Added metadata/rate-limit support using WebClient
  • Added tests for the speech properties and the stream/call methods
  • Updated the issue description with the new usage
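
For reference, streaming the speech bytes with WebClient could look roughly like this. An illustrative sketch only, not the exact code in this PR; the endpoint path and request body shape follow the public OpenAI speech API, and the variable names are assumptions:

    import java.util.Map;
    import org.springframework.web.reactive.function.client.WebClient;
    import reactor.core.publisher.Flux;

    // Sketch: stream the OpenAI /v1/audio/speech response body as byte chunks
    // using Spring's reactive WebClient.
    WebClient webClient = WebClient.builder()
        .baseUrl("https://api.openai.com")
        .defaultHeader("Authorization", "Bearer " + apiKey) // apiKey assumed in scope
        .build();

    Flux<byte[]> audioChunks = webClient.post()
        .uri("/v1/audio/speech")
        .bodyValue(Map.of(
            "model", "tts-1",
            "voice", "alloy",
            "input", "Hello, world!"))
        .retrieve()
        .bodyToFlux(byte[].class);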

@hemeda3 hemeda3 force-pushed the speech branch 3 times, most recently from b971a8e to 3a2edc0 on March 7, 2024 22:06
@hemeda3 hemeda3 commented Mar 11, 2024

Hi @tzolov, should I add the documentation to the same PR or open a new one?

@tzolov tzolov commented Mar 12, 2024

Hi @hemeda3, thanks for asking.
Sure, it would be nice to add the docs too.

@markpollack markpollack modified the milestones: 0.8.1, 1.0.0-M1 Mar 14, 2024
@hemeda3 hemeda3 force-pushed the speech branch 3 times, most recently from 14849b6 to 5de7b4a on March 17, 2024 00:11
@hemeda3 hemeda3 commented Mar 17, 2024

Added the speech API adoc:

  • Updated nav.adoc
  • Added speech.adoc


Review comment from a Member on this code:

    private OpenAiAudioSpeechOptions speechOptions;

    private final List<SpeechMessage> messages;

The underlying API only accepts a single string input, not a collection. So I think this should be ModelRequest<SpeechMessage> and not ModelRequest<List<SpeechMessage>>.
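
For illustration, the shape the reviewer suggests could look like this. A hypothetical sketch: SpeechMessage, OpenAiAudioSpeechOptions, and the ModelRequest contract come from the surrounding discussion; everything else is assumed, not the code merged in this PR:

    import org.springframework.ai.model.ModelOptions;
    import org.springframework.ai.model.ModelRequest;

    // Hypothetical sketch of the reviewer's suggestion: wrap a single
    // SpeechMessage, since the underlying endpoint accepts one string input.
    public class SpeechPrompt implements ModelRequest<SpeechMessage> {

        private final SpeechMessage message;
        private final OpenAiAudioSpeechOptions speechOptions;

        public SpeechPrompt(String instructions, OpenAiAudioSpeechOptions options) {
            this.message = new SpeechMessage(instructions);
            this.speechOptions = options;
        }

        @Override
        public SpeechMessage getInstructions() {
            return this.message;
        }

        @Override
        public ModelOptions getOptions() {
            return this.speechOptions;
        }
    }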

@markpollack markpollack commented

It took a while to get to, but it is now merged. Thanks, this was a great contribution!

Merged as 766b420
