
Commit 352926b

Merge pull request #47 from mutablelogic/v1
Added streaming responses
2 parents: d8b9a2f + 6c99e1d

File tree

13 files changed (+198 / -95 lines changed)


README.md

Lines changed: 4 additions & 4 deletions
@@ -38,7 +38,7 @@ available at `http://localhost:8080/v1` and it generally conforms to the
 In order to download a model, you can use the following command (for example):
 
 ```bash
-curl -X POST -H "Content-Type: application/json" -d '{"Path" : "ggml-tiny.en-q8_0.bin" }' localhost:8080/v1/models
+curl -X POST -H "Content-Type: application/json" -d '{"Path" : "ggml-medium-q5_0.bin" }' localhost:8080/v1/models
 ```
 
 To list the models available, you can use the following command:
@@ -50,19 +50,19 @@ curl -X GET localhost:8080/v1/models
 To delete a model, you can use the following command:
 
 ```bash
-curl -X DELETE localhost:8080/v1/models/ggml-tiny.en-q8_0
+curl -X DELETE localhost:8080/v1/models/ggml-medium-q5_0
 ```
 
 To transcribe a media file into it's original language, you can use the following command:
 
 ```bash
-curl -F "model=ggml-tiny.en-q8_0" -F "file=@samples/jfk.wav" localhost:8080/v1/audio/transcriptions
+curl -F model=ggml-medium-q5_0 -F file=@samples/jfk.wav localhost:8080/v1/audio/transcriptions
 ```
 
 To translate a media file into a different language, you can use the following command:
 
 ```bash
-curl -F "model=ggml-tiny.en-q8_0" -F "file=@samples/de-podcast.wav" -F "language=en" localhost:8080/v1/audio/transcriptions
+curl -F model=ggml-medium-q5_0 -F file=@samples/de-podcast.wav -F language=en localhost:8080/v1/audio/transcriptions\?stream=true
 ```
 
 There's more information on the API [here](doc/API.md).
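
As a companion to the curl examples above, here is a minimal sketch (not part of this commit) of the same model download from Go. The server address and model path are assumptions taken from the README example, and the lowercase `path` field follows the request body shown in doc/API.md below:

```go
package main

import (
    "bytes"
    "fmt"
    "io"
    "net/http"
)

func main() {
    // Request body naming the model file to download (assumed example path)
    body := bytes.NewBufferString(`{"path": "ggml-medium-q5_0.bin"}`)

    resp, err := http.Post("http://localhost:8080/v1/models", "application/json", body)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    // Per the API notes: 200 OK if the model was already downloaded,
    // 201 Created if it was fetched from the remote repository
    fmt.Println("status:", resp.Status)

    // Print the returned model object
    data, _ := io.ReadAll(resp.Body)
    fmt.Println(string(data))
}
```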

doc/API.md

Lines changed: 90 additions & 32 deletions
@@ -1,54 +1,112 @@
-# Draft API
+# Whisper server API
 
-(From OpenAPI docs)
+Based on OpenAPI docs
 
-## Create transcription - upload
+## Ping
 
 ```html
-POST /v1/audio/transcriptions
+GET /v1/ping
 ```
 
-Transcribes audio into the input language.
+Returns a OK status to indicate the API is up and running.
+
+## Models
+
+### List Models
+
+```html
+GET /v1/models
+```
 
-**body - Required**
-The audio file object (not file name) to transcribe, in one of these formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm.
+Returns a list of available models. Example response:
 
-**model - string - Required**
-ID of the model to use.
+```json
+{
+  "object": "list",
+  "models": [
+    {
+      "id": "ggml-large-v3",
+      "object": "model",
+      "path": "ggml-large-v3.bin",
+      "created": 1722090121
+    },
+    {
+      "id": "ggml-medium-q5_0",
+      "object": "model",
+      "path": "ggml-medium-q5_0.bin",
+      "created": 1722081999
+    }
+  ]
+}
+```
 
-**language - string - Optional**
-The language of the input audio. Supplying the input language in ISO-639-1 format will improve accuracy and latency.
+### Download Model
 
-**prompt - string - Optional**
-An optional text to guide the model's style or continue a previous audio segment. The prompt should match the audio language.
+```html
+POST /v1/models
+POST /v1/models?stream={bool}
+```
 
-**response_format - string - Optional**
-Defaults to json
-The format of the transcript output, in one of these options: json, text, srt, verbose_json, or vtt.
+The request should be a application/json, multipart/form-data or application/x-www-form-urlencoded request with the following fields:
 
-**temperature - number - Optional**
-Defaults to 0
-The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use log probability to automatically increase the temperature until certain thresholds are hit.
+```json
+{
+  "path": "ggml-large-v3.bin"
+}
+```
 
-**timestamp_granularities[] - array - Optional**
-Defaults to segment
-The timestamp granularities to populate for this transcription. response_format must be set verbose_json to use timestamp granularities. Either or both of these options are supported: word, or segment. Note: There is no additional latency for segment timestamps, but generating word timestamps incurs additional latency.
+Downloads a model from remote huggingface repository. If the optional `stream` argument is true,
+the progress is streamed back to the client as a series of [text/event-stream](https://html.spec.whatwg.org/multipage/server-sent-events.html) events.
 
-**Returns**
-The transcription object or a verbose transcription object.
+If the model is already downloaded, a 200 OK status is returned. If the model was downloaded, a 201 Created status is returned.
 
-### Example request
+### Delete Model
 
-```bash
-curl https://localhost/v1/audio/transcriptions \
-  -H "Authorization: Bearer $OPENAI_API_KEY" \
-  -H "Content-Type: multipart/form-data" \
-  -F file="@/path/to/file/audio.mp3" \
-  -F model="whisper-1"
+```html
+DELETE /v1/models/{model-id}
 ```
 
+Deletes a model by it's ID. If the model is deleted, a 200 OK status is returned.
+
+## Transcription and translation with file upload
+
+### Transcription
+
+This endpoint's purpose is to transcribe media files into text, in the language of the media file.
+
+```html
+POST /v1/audio/transcriptions
+POST /v1/audio/transcriptions?stream={bool}
+```
+
+The request should be a multipart/form-data request with the following fields:
+
 ```json
 {
-  "text": "Imagine the wildest idea that you've ever had, and you're curious about how it might scale to something that's a 100, a 1,000 times bigger. This is a place where you can get to do that."
+  "model": "<model-id>",
+  "file": "<binary data>",
+  "language": "<language-code>",
+  "response_format": "<response-format>",
 }
 ```
+
+Transcribes audio into the input language.
+
+`file` (required) The audio file object (not file name) to transcribe. This can be audio or video, and the format is auto-detected. The "best" audio stream is selected from the file, and the audio is converted to 16 kHz mono PCM format during transcription.
+
+`model-id` (required) ID of the model to use. This should have previously been downloaded.
+
+`language` (optional) The language of the input audio in ISO-639-1 format. If not set, then the language is auto-detected.
+
+`response_format` (optional, defaults to `json`). The format of the transcript output, in one of these options: json, text, srt, verbose_json, or vtt.
+
+If the optional `stream` argument is true, the segments of the transcription are returned as a series of [text/event-stream](https://html.spec.whatwg.org/multipage/server-sent-events.html) events. Otherwise, the full transcription is returned in the response body.
+
+### Translation
+
+This is the same as transcription (above) except that the `language` parameter is not optional, and should be the language to translate the audio into.
+
+```html
+POST /v1/audio/translations
+POST /v1/audio/translations?stream={bool}
+```
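
To make the multipart request format above concrete, the following is a minimal sketch (not part of this commit) of a non-streaming transcription call from Go. The model ID, sample file and server address are assumptions reused from the README examples:

```go
package main

import (
    "bytes"
    "fmt"
    "io"
    "mime/multipart"
    "net/http"
    "os"
)

func main() {
    // Build the multipart form with the "model" and "file" fields
    var buf bytes.Buffer
    form := multipart.NewWriter(&buf)
    if err := form.WriteField("model", "ggml-medium-q5_0"); err != nil {
        panic(err)
    }
    part, err := form.CreateFormFile("file", "jfk.wav")
    if err != nil {
        panic(err)
    }
    f, err := os.Open("samples/jfk.wav")
    if err != nil {
        panic(err)
    }
    if _, err := io.Copy(part, f); err != nil {
        panic(err)
    }
    f.Close()
    form.Close()

    // POST without ?stream=true, so the full transcription is returned as JSON
    resp, err := http.Post("http://localhost:8080/v1/audio/transcriptions", form.FormDataContentType(), &buf)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    body, _ := io.ReadAll(resp.Body)
    fmt.Println(resp.Status, string(body))
}
```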

go.mod

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@ require (
     github.com/go-audio/wav v1.1.0
     github.com/mutablelogic/go-client v1.0.9
     github.com/mutablelogic/go-media v1.6.9
-    github.com/mutablelogic/go-server v1.4.13
+    github.com/mutablelogic/go-server v1.4.14
     github.com/stretchr/testify v1.9.0
 )
 

go.sum

Lines changed: 4 additions & 4 deletions
@@ -26,10 +26,10 @@ github.com/mattn/go-runewidth v0.0.16 h1:E5ScNMtiwvlvB5paMFdw9p4kSQzbXFikJ5SQO6T
 github.com/mattn/go-runewidth v0.0.16/go.mod h1:Jdepj2loyihRzMpdS35Xk/zdY8IAYHsh153qUoGf23w=
 github.com/mutablelogic/go-client v1.0.9 h1:Eh4sjQOFDldP/L3IizqkcOD3WigZR+u1VaHTUM4ujYw=
 github.com/mutablelogic/go-client v1.0.9/go.mod h1:VLyB8j8IBJSK/FXvvqhmq93PRWDKkyLu8R7V2Vudb6A=
-github.com/mutablelogic/go-media v1.6.8 h1:3v4povSQlOnvg9mHx6Bp9NVdCCjrNdDCjMHBGFHnVE8=
-github.com/mutablelogic/go-media v1.6.8/go.mod h1:HulNT0yyH63a3FRlbuzNDakhOypYrmtFVkHEXZjDgAY=
-github.com/mutablelogic/go-server v1.4.13 h1:k5LJJ/pCvyiw34UX341vRhliBOS6i7V65U/UICcOJOA=
-github.com/mutablelogic/go-server v1.4.13/go.mod h1:9nenPAohKu8bFoRgwHJh+3s8h0kLFjUAb8KZvT1TQNU=
+github.com/mutablelogic/go-media v1.6.9 h1:jkmqrMo7yKXaYXkALBeyVGpV6tNNEf36tmxdOX06VXI=
+github.com/mutablelogic/go-media v1.6.9/go.mod h1:HulNT0yyH63a3FRlbuzNDakhOypYrmtFVkHEXZjDgAY=
+github.com/mutablelogic/go-server v1.4.14 h1:MsYyS9MjBoYtWfJo/iw6DnZ8slnhakWhPPqVCuzuaV8=
+github.com/mutablelogic/go-server v1.4.14/go.mod h1:9nenPAohKu8bFoRgwHJh+3s8h0kLFjUAb8KZvT1TQNU=
 github.com/pmezard/go-difflib v1.0.0 h1:4DBwDE0NGyQoBHbLQYPwSUPoCMWR5BEzIk/f1lZbAQM=
 github.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4=
 github.com/rivo/uniseg v0.2.0/go.mod h1:J6wj4VEh+S6ZtnVlnTBMWIodfgj8LQOQFoIToxlJtxc=

pkg/whisper/api/models.go

Lines changed: 14 additions & 20 deletions
@@ -2,7 +2,6 @@ package api
 
 import (
     "context"
-    "encoding/json"
     "errors"
     "fmt"
     "net/http"
@@ -49,15 +48,13 @@ func ListModels(ctx context.Context, w http.ResponseWriter, service *whisper.Whi
 }
 
 func DownloadModel(ctx context.Context, w http.ResponseWriter, r *http.Request, service *whisper.Whisper) {
-    // Get query
+    // Get query and body
     var query queryDownloadModel
+    var req reqDownloadModel
     if err := httprequest.Query(&query, r.URL.Query()); err != nil {
         httpresponse.Error(w, http.StatusBadRequest, err.Error())
         return
     }
-
-    // Get request body
-    var req reqDownloadModel
     if err := httprequest.Body(&req, r); err != nil {
         httpresponse.Error(w, http.StatusBadRequest, err.Error())
         return
@@ -69,34 +66,31 @@ func DownloadModel(ctx context.Context, w http.ResponseWriter, r *http.Request,
         return
     }
 
-    // If we're streaming, then set response to streaming
+    // Create a text stream
+    var stream *httpresponse.TextStream
     if query.Stream {
-        httpresponse.JSON(w, respDownloadModelStatus{
-            Status: fmt.Sprint("downloading ", req.Name()),
-        }, http.StatusProcessing, 0)
+        if stream = httpresponse.NewTextStream(w); stream == nil {
+            httpresponse.Error(w, http.StatusInternalServerError, "Cannot create text stream")
+            return
+        }
+        defer stream.Close()
     }
 
     // Download the model
     t := time.Now()
     model, err := service.DownloadModel(ctx, req.Name(), func(curBytes, totalBytes uint64) {
-        if time.Since(t) > time.Second && query.Stream {
+        if time.Since(t) > time.Second && stream != nil {
             t = time.Now()
-            json.NewEncoder(w).Encode(respDownloadModelStatus{
+            stream.Write("progress", respDownloadModelStatus{
                 Status:    fmt.Sprint("downloading ", req.Name()),
                 Total:     totalBytes,
                 Completed: curBytes,
             })
-            // Flush the response
-            if f, ok := w.(http.Flusher); ok {
-                f.Flush()
-            }
         }
     })
     if err != nil {
-        if query.Stream {
-            json.NewEncoder(w).Encode(respDownloadModelStatus{
-                Status: fmt.Sprint("error ", err.Error()),
-            })
+        if stream != nil {
+            stream.Write("error", err.Error())
         } else {
             httpresponse.Error(w, http.StatusBadGateway, err.Error())
         }
@@ -105,7 +99,7 @@ func DownloadModel(ctx context.Context, w http.ResponseWriter, r *http.Request,
 
     // Return the model information
     if query.Stream {
-        json.NewEncoder(w).Encode(model)
+        stream.Write("ok", model)
     } else {
         httpresponse.JSON(w, model, http.StatusCreated, 2)
     }
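
The `httpresponse.TextStream` type used above comes from the mutablelogic/go-server dependency and its implementation is not part of this commit. As a rough sketch of the pattern it replaces here (manual JSON encoding plus an explicit `http.Flusher`), a server-sent-events writer along these lines would behave similarly; the type, field names and method bodies below are assumptions for illustration, not the library's actual code:

```go
package sse

import (
    "encoding/json"
    "fmt"
    "net/http"
)

// TextStream writes named events to the response as text/event-stream.
// This is an illustrative stand-in, not the go-server implementation.
type TextStream struct {
    w http.ResponseWriter
    f http.Flusher
}

// NewTextStream returns nil when the response writer cannot stream.
func NewTextStream(w http.ResponseWriter) *TextStream {
    f, ok := w.(http.Flusher)
    if !ok {
        return nil
    }
    w.Header().Set("Content-Type", "text/event-stream")
    w.Header().Set("Cache-Control", "no-cache")
    return &TextStream{w: w, f: f}
}

// Write emits one event, JSON-encoding any payload values, then flushes
// so the client sees progress immediately.
func (s *TextStream) Write(event string, data ...any) error {
    if _, err := fmt.Fprintf(s.w, "event: %s\n", event); err != nil {
        return err
    }
    for _, d := range data {
        payload, err := json.Marshal(d)
        if err != nil {
            return err
        }
        fmt.Fprintf(s.w, "data: %s\n", payload)
    }
    fmt.Fprintln(s.w)
    s.f.Flush()
    return nil
}

// Close marks the end of the stream; a real implementation may emit a
// terminating event or simply stop writing.
func (s *TextStream) Close() error { return nil }
```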

pkg/whisper/api/transcribe.go

Lines changed: 49 additions & 13 deletions
@@ -31,6 +31,10 @@ type reqTranscribe struct {
     ResponseFmt *string `json:"response_format"`
 }
 
+type queryTranscribe struct {
+    Stream bool `json:"stream"`
+}
+
 const (
     minSegmentSize = 5 * time.Second
     maxSegmentSize = 10 * time.Minute
@@ -42,6 +46,11 @@ const (
 
 func TranscribeFile(ctx context.Context, service *whisper.Whisper, w http.ResponseWriter, r *http.Request, translate bool) {
     var req reqTranscribe
+    var query queryTranscribe
+    if err := httprequest.Query(&query, r.URL.Query()); err != nil {
+        httpresponse.Error(w, http.StatusBadRequest, err.Error())
+        return
+    }
     if err := httprequest.Body(&req, r); err != nil {
         httpresponse.Error(w, http.StatusBadRequest, err.Error())
         return
@@ -75,6 +84,16 @@ func TranscribeFile(ctx context.Context, service *whisper.Whisper, w http.Respon
         return
     }
 
+    // Create a text stream
+    var stream *httpresponse.TextStream
+    if query.Stream {
+        if stream = httpresponse.NewTextStream(w); stream == nil {
+            httpresponse.Error(w, http.StatusInternalServerError, "Cannot create text stream")
+            return
+        }
+        defer stream.Close()
+    }
+
     // Get context for the model, perform transcription
     var result *schema.Transcription
     if err := service.WithModel(model, func(task *task.Context) error {
@@ -97,35 +116,52 @@ func TranscribeFile(ctx context.Context, service *whisper.Whisper, w http.Respon
 
         // TODO: Set temperature, etc
 
+        // Create response
+        result = task.Result()
+        result.Task = "transcribe"
+        if translate {
+            result.Task = "translate"
+        }
+        result.Duration = schema.Timestamp(segmenter.Duration())
+        result.Language = task.Language()
+
+        // Output the header
+        if stream != nil {
+            stream.Write("task", result)
+        }
+
         // Read samples and transcribe them
         if err := segmenter.Decode(ctx, func(ts time.Duration, buf []float32) error {
             // Perform the transcription, return any errors
-            return task.Transcribe(ctx, ts, buf, req.OutputSegments(), func(segment *schema.Segment) {
-                fmt.Println("TODO: ", segment)
+            return task.Transcribe(ctx, ts, buf, req.OutputSegments() || stream != nil, func(segment *schema.Segment) {
+                if stream != nil {
+                    stream.Write("segment", segment)
+                }
             })
         }); err != nil {
             return err
         }
 
-        // End of transcription, get result
-        result = task.Result()
+        // Set the language
+        result.Language = task.Language()
 
         // Return success
         return nil
     }); err != nil {
-        httpresponse.Error(w, http.StatusInternalServerError, err.Error())
+        if stream != nil {
+            stream.Write("error", err.Error())
+        } else {
+            httpresponse.Error(w, http.StatusInternalServerError, err.Error())
+        }
         return
     }
 
-    // Set task, duration
-    result.Task = "transcribe"
-    if translate {
-        result.Task = "translate"
+    // Return transcription if not streaming
+    if stream == nil {
+        httpresponse.JSON(w, result, http.StatusOK, 2)
+    } else {
+        stream.Write("ok")
     }
-    result.Duration = segmenter.Duration()
-
-    // Return transcription
-    httpresponse.JSON(w, result, http.StatusOK, 2)
 }
 
 ///////////////////////////////////////////////////////////////////////////////
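
With `stream=true`, the handler above emits a `task` header event, one `segment` event per transcribed segment, and a final `ok` (or `error`) event. The following is a minimal client-side sketch (not part of this commit) that consumes such a stream; the model ID, sample file and the exact `event:`/`data:` line format are assumptions:

```go
package main

import (
    "bufio"
    "bytes"
    "fmt"
    "io"
    "mime/multipart"
    "net/http"
    "os"
    "strings"
)

func main() {
    // Build the multipart form and request a streamed response
    var buf bytes.Buffer
    form := multipart.NewWriter(&buf)
    form.WriteField("model", "ggml-medium-q5_0")
    part, err := form.CreateFormFile("file", "jfk.wav")
    if err != nil {
        panic(err)
    }
    f, err := os.Open("samples/jfk.wav")
    if err != nil {
        panic(err)
    }
    io.Copy(part, f)
    f.Close()
    form.Close()

    resp, err := http.Post("http://localhost:8080/v1/audio/transcriptions?stream=true", form.FormDataContentType(), &buf)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    // Read the text/event-stream line by line, tracking the current event name
    event := ""
    scanner := bufio.NewScanner(resp.Body)
    for scanner.Scan() {
        line := scanner.Text()
        switch {
        case strings.HasPrefix(line, "event:"):
            event = strings.TrimSpace(strings.TrimPrefix(line, "event:"))
        case strings.HasPrefix(line, "data:"):
            data := strings.TrimSpace(strings.TrimPrefix(line, "data:"))
            switch event {
            case "task":
                fmt.Println("header: ", data) // task, language and duration
            case "segment":
                fmt.Println("segment:", data) // one transcribed segment
            case "error":
                fmt.Println("error:  ", data)
            }
        case line == "" && event == "ok":
            return // the final "ok" event ends the stream
        }
    }
}
```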
