
Commit 352926b

Merge pull request #47 from mutablelogic/v1
Added streaming responses
2 parents: d8b9a2f + 6c99e1d

File tree

13 files changed (+198 / -95 lines changed)


README.md

Lines changed: 4 additions & 4 deletions
@@ -38,7 +38,7 @@ available at `http://localhost:8080/v1` and it generally conforms to the
 In order to download a model, you can use the following command (for example):
 
 ```bash
-curl -X POST -H "Content-Type: application/json" -d '{"Path" : "ggml-tiny.en-q8_0.bin" }' localhost:8080/v1/models
+curl -X POST -H "Content-Type: application/json" -d '{"Path" : "ggml-medium-q5_0.bin" }' localhost:8080/v1/models
 ```
 
 To list the models available, you can use the following command:
@@ -50,19 +50,19 @@ curl -X GET localhost:8080/v1/models
 To delete a model, you can use the following command:
 
 ```bash
-curl -X DELETE localhost:8080/v1/models/ggml-tiny.en-q8_0
+curl -X DELETE localhost:8080/v1/models/ggml-medium-q5_0
 ```
 
 To transcribe a media file into it's original language, you can use the following command:
 
 ```bash
-curl -F "model=ggml-tiny.en-q8_0" -F "file=@samples/jfk.wav" localhost:8080/v1/audio/transcriptions
+curl -F model=ggml-medium-q5_0 -F file=@samples/jfk.wav localhost:8080/v1/audio/transcriptions
 ```
 
 To translate a media file into a different language, you can use the following command:
 
 ```bash
-curl -F "model=ggml-tiny.en-q8_0" -F "file=@samples/de-podcast.wav" -F "language=en" localhost:8080/v1/audio/transcriptions
+curl -F model=ggml-medium-q5_0 -F file=@samples/de-podcast.wav -F language=en localhost:8080/v1/audio/transcriptions\?stream=true
 ```
 
 There's more information on the API [here](doc/API.md).
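
As a companion to the curl examples above, here is a minimal sketch (not part of this commit) of the same model download from Go. The server address and model path are assumptions taken from the README example, and the lowercase `path` field follows the request body shown in doc/API.md below:

```go
package main

import (
    "bytes"
    "fmt"
    "io"
    "net/http"
)

func main() {
    // Request body naming the model file to download (assumed example path)
    body := bytes.NewBufferString(`{"path": "ggml-medium-q5_0.bin"}`)

    resp, err := http.Post("http://localhost:8080/v1/models", "application/json", body)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    // Per the API notes: 200 OK if the model was already downloaded,
    // 201 Created if it was fetched from the remote repository
    fmt.Println("status:", resp.Status)

    // Print the returned model object
    data, _ := io.ReadAll(resp.Body)
    fmt.Println(string(data))
}
```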

doc/API.md

Lines changed: 90 additions & 32 deletions
@@ -1,54 +1,112 @@
-# Draft API
+# Whisper server API
 
-(From OpenAPI docs)
+Based on OpenAPI docs
 
-## Create transcription - upload
+## Ping
 
 ```html
-POST /v1/audio/transcriptions
+GET /v1/ping
 ```
 
-Transcribes audio into the input language.
+Returns a OK status to indicate the API is up and running.
+
+## Models
+
+### List Models
+
+```html
+GET /v1/models
+```
 
-**body - Required**
-The audio file object (not file name) to transcribe, in one of these formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm.
+Returns a list of available models. Example response:
 
-**model - string - Required**
-ID of the model to use.
+```json
+{
+  "object": "list",
+  "models": [
+    {
+      "id": "ggml-large-v3",
+      "object": "model",
+      "path": "ggml-large-v3.bin",
+      "created": 1722090121
+    },
+    {
+      "id": "ggml-medium-q5_0",
+      "object": "model",
+      "path": "ggml-medium-q5_0.bin",
+      "created": 1722081999
+    }
+  ]
+}
+```
 
-**language - string - Optional**
-The language of the input audio. Supplying the input language in ISO-639-1 format will improve accuracy and latency.
+### Download Model
 
-**prompt - string - Optional**
-An optional text to guide the model's style or continue a previous audio segment. The prompt should match the audio language.
+```html
+POST /v1/models
+POST /v1/models?stream={bool}
+```
 
-**response_format - string - Optional**
-Defaults to json
-The format of the transcript output, in one of these options: json, text, srt, verbose_json, or vtt.
+The request should be a application/json, multipart/form-data or application/x-www-form-urlencoded request with the following fields:
 
-**temperature - number - Optional**
-Defaults to 0
-The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use log probability to automatically increase the temperature until certain thresholds are hit.
+```json
+{
+  "path": "ggml-large-v3.bin"
+}
+```
 
-**timestamp_granularities[] - array - Optional**
-Defaults to segment
-The timestamp granularities to populate for this transcription. response_format must be set verbose_json to use timestamp granularities. Either or both of these options are supported: word, or segment. Note: There is no additional latency for segment timestamps, but generating word timestamps incurs additional latency.
+Downloads a model from remote huggingface repository. If the optional `stream` argument is true,
+the progress is streamed back to the client as a series of [text/event-stream](https://html.spec.whatwg.org/multipage/server-sent-events.html) events.
 
-**Returns**
-The transcription object or a verbose transcription object.
+If the model is already downloaded, a 200 OK status is returned. If the model was downloaded, a 201 Created status is returned.
 
-### Example request
+### Delete Model
 
-```bash
-curl https://localhost/v1/audio/transcriptions \
-  -H "Authorization: Bearer $OPENAI_API_KEY" \
-  -H "Content-Type: multipart/form-data" \
-  -F file="@/path/to/file/audio.mp3" \
-  -F model="whisper-1"
+```html
+DELETE /v1/models/{model-id}
 ```
 
+Deletes a model by it's ID. If the model is deleted, a 200 OK status is returned.
+
+## Transcription and translation with file upload
+
+### Transcription
+
+This endpoint's purpose is to transcribe media files into text, in the language of the media file.
+
+```html
+POST /v1/audio/transcriptions
+POST /v1/audio/transcriptions?stream={bool}
+```
+
+The request should be a multipart/form-data request with the following fields:
+
 ```json
 {
-  "text": "Imagine the wildest idea that you've ever had, and you're curious about how it might scale to something that's a 100, a 1,000 times bigger. This is a place where you can get to do that."
+  "model": "<model-id>",
+  "file": "<binary data>",
+  "language": "<language-code>",
+  "response_format": "<response-format>",
 }
 ```
+
+Transcribes audio into the input language.
+
+`file` (required) The audio file object (not file name) to transcribe. This can be audio or video, and the format is auto-detected. The "best" audio stream is selected from the file, and the audio is converted to 16 kHz mono PCM format during transcription.
+
+`model-id` (required) ID of the model to use. This should have previously been downloaded.
+
+`language` (optional) The language of the input audio in ISO-639-1 format. If not set, then the language is auto-detected.
+
+`response_format` (optional, defaults to `json`). The format of the transcript output, in one of these options: json, text, srt, verbose_json, or vtt.
+
+If the optional `stream` argument is true, the segments of the transcription are returned as a series of [text/event-stream](https://html.spec.whatwg.org/multipage/server-sent-events.html) events. Otherwise, the full transcription is returned in the response body.
+
+### Translation
+
+This is the same as transcription (above) except that the `language` parameter is not optional, and should be the language to translate the audio into.
+
+```html
+POST /v1/audio/translations
+POST /v1/audio/translations?stream={bool}
+```
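
To make the multipart request format above concrete, the following is a minimal sketch (not part of this commit) of a non-streaming transcription call from Go. The model ID, sample file and server address are assumptions reused from the README examples:

```go
package main

import (
    "bytes"
    "fmt"
    "io"
    "mime/multipart"
    "net/http"
    "os"
)

func main() {
    // Build the multipart form with the "model" and "file" fields
    var buf bytes.Buffer
    form := multipart.NewWriter(&buf)
    if err := form.WriteField("model", "ggml-medium-q5_0"); err != nil {
        panic(err)
    }
    part, err := form.CreateFormFile("file", "jfk.wav")
    if err != nil {
        panic(err)
    }
    f, err := os.Open("samples/jfk.wav")
    if err != nil {
        panic(err)
    }
    if _, err := io.Copy(part, f); err != nil {
        panic(err)
    }
    f.Close()
    form.Close()

    // POST without ?stream=true, so the full transcription is returned as JSON
    resp, err := http.Post("http://localhost:8080/v1/audio/transcriptions", form.FormDataContentType(), &buf)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    body, _ := io.ReadAll(resp.Body)
    fmt.Println(resp.Status, string(body))
}
```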

go.mod

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@ require (
     github.com/go-audio/wav v1.1.0
     github.com/mutablelogic/go-client v1.0.9
     github.com/mutablelogic/go-media v1.6.9
-    github.com/mutablelogic/go-server v1.4.13
+    github.com/mutablelogic/go-server v1.4.14
     github.com/stretchr/testify v1.9.0
 )
 

go.sum

Lines changed: 4 additions & 4 deletions
@@ -26,10 +26,10 @@ github.com/mattn/go-runewidth v0.0.16 h1:E5ScNMtiwvlvB5paMFdw9p4kSQzbXFikJ5SQO6T
 github.com/mattn/go-runewidth v0.0.16/go.mod h1:Jdepj2loyihRzMpdS35Xk/zdY8IAYHsh153qUoGf23w=
 github.com/mutablelogic/go-client v1.0.9 h1:Eh4sjQOFDldP/L3IizqkcOD3WigZR+u1VaHTUM4ujYw=
 github.com/mutablelogic/go-client v1.0.9/go.mod h1:VLyB8j8IBJSK/FXvvqhmq93PRWDKkyLu8R7V2Vudb6A=
-github.com/mutablelogic/go-media v1.6.8 h1:3v4povSQlOnvg9mHx6Bp9NVdCCjrNdDCjMHBGFHnVE8=
-github.com/mutablelogic/go-media v1.6.8/go.mod h1:HulNT0yyH63a3FRlbuzNDakhOypYrmtFVkHEXZjDgAY=
-github.com/mutablelogic/go-server v1.4.13 h1:k5LJJ/pCvyiw34UX341vRhliBOS6i7V65U/UICcOJOA=
-github.com/mutablelogic/go-server v1.4.13/go.mod h1:9nenPAohKu8bFoRgwHJh+3s8h0kLFjUAb8KZvT1TQNU=
+github.com/mutablelogic/go-media v1.6.9 h1:jkmqrMo7yKXaYXkALBeyVGpV6tNNEf36tmxdOX06VXI=
+github.com/mutablelogic/go-media v1.6.9/go.mod h1:HulNT0yyH63a3FRlbuzNDakhOypYrmtFVkHEXZjDgAY=
+github.com/mutablelogic/go-server v1.4.14 h1:MsYyS9MjBoYtWfJo/iw6DnZ8slnhakWhPPqVCuzuaV8=
+github.com/mutablelogic/go-server v1.4.14/go.mod h1:9nenPAohKu8bFoRgwHJh+3s8h0kLFjUAb8KZvT1TQNU=
 github.com/pmezard/go-difflib v1.0.0 h1:4DBwDE0NGyQoBHbLQYPwSUPoCMWR5BEzIk/f1lZbAQM=
 github.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4=
 github.com/rivo/uniseg v0.2.0/go.mod h1:J6wj4VEh+S6ZtnVlnTBMWIodfgj8LQOQFoIToxlJtxc=

pkg/whisper/api/models.go

Lines changed: 14 additions & 20 deletions
@@ -2,7 +2,6 @@ package api
 
 import (
     "context"
-    "encoding/json"
     "errors"
     "fmt"
     "net/http"
@@ -49,15 +48,13 @@ func ListModels(ctx context.Context, w http.ResponseWriter, service *whisper.Whi
 }
 
 func DownloadModel(ctx context.Context, w http.ResponseWriter, r *http.Request, service *whisper.Whisper) {
-    // Get query
+    // Get query and body
     var query queryDownloadModel
+    var req reqDownloadModel
     if err := httprequest.Query(&query, r.URL.Query()); err != nil {
         httpresponse.Error(w, http.StatusBadRequest, err.Error())
         return
     }
-
-    // Get request body
-    var req reqDownloadModel
     if err := httprequest.Body(&req, r); err != nil {
         httpresponse.Error(w, http.StatusBadRequest, err.Error())
         return
@@ -69,34 +66,31 @@ func DownloadModel(ctx context.Context, w http.ResponseWriter, r *http.Request,
         return
     }
 
-    // If we're streaming, then set response to streaming
+    // Create a text stream
+    var stream *httpresponse.TextStream
     if query.Stream {
-        httpresponse.JSON(w, respDownloadModelStatus{
-            Status: fmt.Sprint("downloading ", req.Name()),
-        }, http.StatusProcessing, 0)
+        if stream = httpresponse.NewTextStream(w); stream == nil {
+            httpresponse.Error(w, http.StatusInternalServerError, "Cannot create text stream")
+            return
+        }
+        defer stream.Close()
     }
 
     // Download the model
     t := time.Now()
     model, err := service.DownloadModel(ctx, req.Name(), func(curBytes, totalBytes uint64) {
-        if time.Since(t) > time.Second && query.Stream {
+        if time.Since(t) > time.Second && stream != nil {
             t = time.Now()
-            json.NewEncoder(w).Encode(respDownloadModelStatus{
+            stream.Write("progress", respDownloadModelStatus{
                 Status:    fmt.Sprint("downloading ", req.Name()),
                 Total:     totalBytes,
                 Completed: curBytes,
             })
-            // Flush the response
-            if f, ok := w.(http.Flusher); ok {
-                f.Flush()
-            }
         }
     })
     if err != nil {
-        if query.Stream {
-            json.NewEncoder(w).Encode(respDownloadModelStatus{
-                Status: fmt.Sprint("error ", err.Error()),
-            })
+        if stream != nil {
+            stream.Write("error", err.Error())
         } else {
             httpresponse.Error(w, http.StatusBadGateway, err.Error())
         }
@@ -105,7 +99,7 @@ func DownloadModel(ctx context.Context, w http.ResponseWriter, r *http.Request,
 
     // Return the model information
     if query.Stream {
-        json.NewEncoder(w).Encode(model)
+        stream.Write("ok", model)
     } else {
         httpresponse.JSON(w, model, http.StatusCreated, 2)
     }
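
The `httpresponse.TextStream` type used above comes from the mutablelogic/go-server dependency and its implementation is not part of this commit. As a rough sketch of the pattern it replaces here (manual JSON encoding plus an explicit `http.Flusher`), a server-sent-events writer along these lines would behave similarly; the type, field names and method bodies below are assumptions for illustration, not the library's actual code:

```go
package sse

import (
    "encoding/json"
    "fmt"
    "net/http"
)

// TextStream writes named events to the response as text/event-stream.
// This is an illustrative stand-in, not the go-server implementation.
type TextStream struct {
    w http.ResponseWriter
    f http.Flusher
}

// NewTextStream returns nil when the response writer cannot stream.
func NewTextStream(w http.ResponseWriter) *TextStream {
    f, ok := w.(http.Flusher)
    if !ok {
        return nil
    }
    w.Header().Set("Content-Type", "text/event-stream")
    w.Header().Set("Cache-Control", "no-cache")
    return &TextStream{w: w, f: f}
}

// Write emits one event, JSON-encoding any payload values, then flushes
// so the client sees progress immediately.
func (s *TextStream) Write(event string, data ...any) error {
    if _, err := fmt.Fprintf(s.w, "event: %s\n", event); err != nil {
        return err
    }
    for _, d := range data {
        payload, err := json.Marshal(d)
        if err != nil {
            return err
        }
        fmt.Fprintf(s.w, "data: %s\n", payload)
    }
    fmt.Fprintln(s.w)
    s.f.Flush()
    return nil
}

// Close marks the end of the stream; a real implementation may emit a
// terminating event or simply stop writing.
func (s *TextStream) Close() error { return nil }
```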

pkg/whisper/api/transcribe.go

Lines changed: 49 additions & 13 deletions
@@ -31,6 +31,10 @@ type reqTranscribe struct {
     ResponseFmt *string `json:"response_format"`
 }
 
+type queryTranscribe struct {
+    Stream bool `json:"stream"`
+}
+
 const (
     minSegmentSize = 5 * time.Second
     maxSegmentSize = 10 * time.Minute
@@ -42,6 +46,11 @@ const (
 
 func TranscribeFile(ctx context.Context, service *whisper.Whisper, w http.ResponseWriter, r *http.Request, translate bool) {
     var req reqTranscribe
+    var query queryTranscribe
+    if err := httprequest.Query(&query, r.URL.Query()); err != nil {
+        httpresponse.Error(w, http.StatusBadRequest, err.Error())
+        return
+    }
     if err := httprequest.Body(&req, r); err != nil {
         httpresponse.Error(w, http.StatusBadRequest, err.Error())
         return
@@ -75,6 +84,16 @@ func TranscribeFile(ctx context.Context, service *whisper.Whisper, w http.Respon
         return
     }
 
+    // Create a text stream
+    var stream *httpresponse.TextStream
+    if query.Stream {
+        if stream = httpresponse.NewTextStream(w); stream == nil {
+            httpresponse.Error(w, http.StatusInternalServerError, "Cannot create text stream")
+            return
+        }
+        defer stream.Close()
+    }
+
     // Get context for the model, perform transcription
     var result *schema.Transcription
     if err := service.WithModel(model, func(task *task.Context) error {
@@ -97,35 +116,52 @@ func TranscribeFile(ctx context.Context, service *whisper.Whisper, w http.Respon
 
         // TODO: Set temperature, etc
 
+        // Create response
+        result = task.Result()
+        result.Task = "transcribe"
+        if translate {
+            result.Task = "translate"
+        }
+        result.Duration = schema.Timestamp(segmenter.Duration())
+        result.Language = task.Language()
+
+        // Output the header
+        if stream != nil {
+            stream.Write("task", result)
+        }
+
         // Read samples and transcribe them
         if err := segmenter.Decode(ctx, func(ts time.Duration, buf []float32) error {
             // Perform the transcription, return any errors
-            return task.Transcribe(ctx, ts, buf, req.OutputSegments(), func(segment *schema.Segment) {
-                fmt.Println("TODO: ", segment)
+            return task.Transcribe(ctx, ts, buf, req.OutputSegments() || stream != nil, func(segment *schema.Segment) {
+                if stream != nil {
+                    stream.Write("segment", segment)
+                }
             })
         }); err != nil {
             return err
         }
 
-        // End of transcription, get result
-        result = task.Result()
+        // Set the language
+        result.Language = task.Language()
 
         // Return success
         return nil
     }); err != nil {
-        httpresponse.Error(w, http.StatusInternalServerError, err.Error())
+        if stream != nil {
+            stream.Write("error", err.Error())
+        } else {
+            httpresponse.Error(w, http.StatusInternalServerError, err.Error())
+        }
         return
     }
 
-    // Set task, duration
-    result.Task = "transcribe"
-    if translate {
-        result.Task = "translate"
+    // Return transcription if not streaming
+    if stream == nil {
+        httpresponse.JSON(w, result, http.StatusOK, 2)
+    } else {
+        stream.Write("ok")
     }
-    result.Duration = segmenter.Duration()
-
-    // Return transcription
-    httpresponse.JSON(w, result, http.StatusOK, 2)
 }
 
 ///////////////////////////////////////////////////////////////////////////////
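
With `stream=true`, the handler above emits a `task` header event, one `segment` event per transcribed segment, and a final `ok` (or `error`) event. The following is a minimal client-side sketch (not part of this commit) that consumes such a stream; the model ID, sample file and the exact `event:`/`data:` line format are assumptions:

```go
package main

import (
    "bufio"
    "bytes"
    "fmt"
    "io"
    "mime/multipart"
    "net/http"
    "os"
    "strings"
)

func main() {
    // Build the multipart form and request a streamed response
    var buf bytes.Buffer
    form := multipart.NewWriter(&buf)
    form.WriteField("model", "ggml-medium-q5_0")
    part, err := form.CreateFormFile("file", "jfk.wav")
    if err != nil {
        panic(err)
    }
    f, err := os.Open("samples/jfk.wav")
    if err != nil {
        panic(err)
    }
    io.Copy(part, f)
    f.Close()
    form.Close()

    resp, err := http.Post("http://localhost:8080/v1/audio/transcriptions?stream=true", form.FormDataContentType(), &buf)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    // Read the text/event-stream line by line, tracking the current event name
    event := ""
    scanner := bufio.NewScanner(resp.Body)
    for scanner.Scan() {
        line := scanner.Text()
        switch {
        case strings.HasPrefix(line, "event:"):
            event = strings.TrimSpace(strings.TrimPrefix(line, "event:"))
        case strings.HasPrefix(line, "data:"):
            data := strings.TrimSpace(strings.TrimPrefix(line, "data:"))
            switch event {
            case "task":
                fmt.Println("header: ", data) // task, language and duration
            case "segment":
                fmt.Println("segment:", data) // one transcribed segment
            case "error":
                fmt.Println("error:  ", data)
            }
        case line == "" && event == "ok":
            return // the final "ok" event ends the stream
        }
    }
}
```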
