Merge pull request #40 from mutablelogic/v1

djthorpe · web-flow · commit 0803dcb02f48 · 2024-07-30T10:24:54.000+02:00
Updated documentation
diff --git a/README.md b/README.md
@@ -3,20 +3,23 @@
 Speech-to-Text in golang. This is an early development version.
 
 * `cmd` contains an OpenAI-API compatible server
-* `pkg` contains the `whisper` service and http gateway
+* `pkg` contains the `whisper` service and client
 * `sys` contains the `whisper` bindings to the `whisper.cpp` library
 * `third_party` is a submodule for the whisper.cpp source
 
 ## Running
 
+(Note: Docker images are not created yet - this is some forward planning!)
+
 There are docker images for arm64 and amd64 (Intel). The arm64 image is built for
 Jetson GPU support specifically, but it will also run on Raspberry Pis.
 
 In order to utilize a NVIDIA GPU, you'll need to install the
 [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) first.
 
 A docker volume called "whisper" should be created for storing the Whisper language
-models. You can see which models are available to download locally [here](https://huggingface.co/ggerganov/whisper.cpp). The following command will run the server on port 8080:
+models. You can see which models are available to download locally [here](https://huggingface.co/ggerganov/whisper.cpp).
+The following command will run the server on port 8080:
 
 ```bash
 docker run \
@@ -47,33 +50,49 @@ curl -X GET localhost:8080/v1/models
 To delete a model, you can use the following command:
 
 ```bash
-curl -X DELETE localhost:8080/v1/models/ggml-tiny.en-q8_0.bin
+curl -X DELETE localhost:8080/v1/models/ggml-tiny.en-q8_0
 ```
 
-And to transcribe an audio file, you can use the following command:
+To transcribe a media file into its original language, you can use the following command:
 
 ```bash
-curl -F "model=ggml-tiny.en-q8_0.bin" -F "file=@samples/jfk.wav" -F "language=en" localhost:8080/v1/audio/transcriptions
+curl -F "model=ggml-tiny.en-q8_0" -F "file=@samples/jfk.wav" localhost:8080/v1/audio/transcriptions
+```
+
+To translate a media file into a different language, you can use the following command:
+
+```bash
+curl -F "model=ggml-tiny.en-q8_0" -F "file=@samples/de-podcast.wav" -F "language=en" localhost:8080/v1/audio/transcriptions
 ```
 
-Right now there's a limitation on the files: they must be mono WAV files at 16K sample rate.
 There's more information on the API [here](doc/API.md).
 
 ## Building
 
+The build dependencies are:
+
+* Go 1.22
+* C++ compiler
+* FFmpeg 6.1 libraries (see [here](doc/build.md) for more information)
+* For CUDA, you'll need the CUDA toolkit including the `nvcc` compiler
+
 If you want to build the server yourself for your specific combination of hardware,
-you can use the `Makefile` in the root directory. You'll need go 1.22, `make` and 
+you can use the `Makefile` in the root directory. You'll need go 1.22, `make` and
 a C++ compiler to build this project. The following `Makefile` targets can be used:
 
-* `make server` - creates the server binary, and places it in the `build` directory
-* `DOCKER_REGISTRY=docker.io/user make docker` - builds a docker container with the server binary
+* `make server` - creates the server binary, and places it in the `build` directory. Should
+  link to Metal on macOS
+* `GGML_CUDA=1 make server` - creates the server binary linked to CUDA, and places it
+  in the `build` directory. Should work for amd64 and arm64 (Jetson) platforms
+* `DOCKER_REGISTRY=docker.io/user make docker` - builds a docker container with the
+  server binary, tagged to a specific registry
 
 See all the other targets in the `Makefile` for more information.
 
 ## Status
 
-Still in development. It only accepts mono WAV files at 16K sample rate, for example. It also
-occasionally crashes, and the API is not fully implemented.
+Still in development. See this [issue](https://github.com/mutablelogic/go-whisper/issues/1) for
+remaining tasks to be completed.
 
 ## Contributing & Distribution
 
@@ -84,7 +103,7 @@ The license is Apache 2 so feel free to redistribute. Redistributions in either
 code or binary form must reproduce the copyright notice, and please link back to this
 repository for more information:
 
-> __go-media__\
+> __go-whisper__\
 > [https://github.com/mutablelogic/go-whisper/](https://github.com/mutablelogic/go-whisper/)\
 > Copyright (c) 2023-2024 David Thorpe, All rights reserved.
 >
diff --git a/doc/build.md b/doc/build.md
@@ -1,5 +1,6 @@
+# Notes on building
 
-# Package Config
+## Package Config
 
 libwhisper.pc
 
@@ -13,16 +14,6 @@ Cflags: -I${prefix}/third_party/whisper.cpp/include -I${prefix}/third_party/whis
 Libs: -L${prefix}/third_party/whisper.cpp -lwhisper -lggml -lm -lstdc++
 ```
 
-libwhisper-linux.pc
-
-```pkg-config
-prefix=/Users/djt/Projects/go-whisper/
-
-Name: libwhisper-linux
-Description: Whisper is a C/C++ library for speech transcription, translation and diarization.
-Version: 0.0.0
-```
-
 libwhisper-darwin.pc
 
 ```pkg-config
@@ -36,3 +27,10 @@ Libs: -framework Accelerate -framework Metal -framework Foundation -framework Co
 
 I don't know what the Windows one should be, as I don't have a Windows machine.
 
+## Ubuntu 22.04
+
+```bash
+sudo add-apt-repository -y ppa:ubuntuhandbook1/ffmpeg6
+sudo apt-get update
+sudo apt-get install -y libavcodec-dev libavdevice-dev libavfilter-dev libavutil-dev libswscale-dev libswresample-dev
+```
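
After installing the packages listed in the Ubuntu 22.04 section, it can help to confirm that pkg-config actually sees each library before running `make server`. The following is a minimal sketch, not part of the repository: the function name `check_ffmpeg_libs` is hypothetical, and the `PKG_CONFIG` override is an assumption of this sketch rather than something the project's `Makefile` is known to read. It uses only the standard `pkg-config --exists` and `--modversion` flags.

```bash
# Sketch: report which FFmpeg development libraries pkg-config can find.
# PKG_CONFIG optionally names an alternative pkg-config binary (assumption
# of this sketch). A missing pkg-config binary is reported as "missing" too.
check_ffmpeg_libs() {
  missing=0
  for lib in libavcodec libavdevice libavfilter libavutil libswscale libswresample; do
    if "${PKG_CONFIG:-pkg-config}" --exists "$lib" 2>/dev/null; then
      # --modversion prints the library's own version, not the FFmpeg release number
      echo "ok: $lib $("${PKG_CONFIG:-pkg-config}" --modversion "$lib")"
    else
      echo "missing: $lib"
      missing=1
    fi
  done
  # Exit status 0 when every library was found, 1 otherwise
  return "$missing"
}

# Usage: check_ffmpeg_libs && echo "FFmpeg libraries look usable"
```

The reported numbers are per-library versions, so they will not read "6.1" directly; as far as I know, a libavcodec major version of 60 is what the FFmpeg 6.x series ships, but treat that mapping as something to verify against the FFmpeg release notes.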