---
title: "How to Protect Your Data with Self-Hosted LLMs and OpenFaaS Edge"
description: "The rise of hosted LLMs has been meteoric, but many NDAs would prevent you from using them. Learn how to run LLMs locally with OpenFaaS Edge"
date: 2025-04-16
author_staff_member: alex
categories:
- functions
- llm
- privacy
- enterprise
dark_background: true
image: images/2025-04-private-llm/background.png
hide_header_image: true
---

The rise of hosted LLMs has been meteoric, but many Non-Disclosure Agreements (NDAs) would prevent you from using them. We explore how a self-hosted solution protects your data.

## Why Self-Hosted LLMs?

Self-hosted models are great for experimentation and exploring what is possible, without having to worry about how much your API calls are costing you ($$$). Practically speaking, they are the only option if you are dealing with Confidential Information covered by an NDA.

The definition of Confidential Information varies by NDA, but it usually includes any information that is not publicly available, and that you would not want to be made public. This could include customer data, employee names, organisational charts, source code, designs and schematics, trade secrets, or any other sensitive information.

Even if the data you want to process via an LLM is not protected under an NDA, if you work for a regulated company or a Fortune 500 enterprise, it's likely that you will be required to use models hosted on-premises or in a private cloud.

![Computer eating private data](/images/2025-04-private-llm/computer-eating.jpg)
> Pictured: A computer eating private data in a datacenter, generated by Grok.

There are pros and cons to both self-hosted and hosted LLMs for inference.

Pros for hosted models:

* Require no capital expenditure (CapEx) on GPUs, or dedicated hardware
* Can be invoked via API and paid for based upon tokens in/out
* You can use the largest models available, which would cost tens of thousands of dollars to run locally
* You get to access the best-in-class proprietary models, such as GPT-4, Claude, and Gemini

Downsides for hosted models:

* Costs can be unpredictable, and can spiral out of control
* You have no control over the model, and it can be changed or removed at any time
* You have no control over the data, and it can be used to train the model - opting out may require an enterprise agreement
* When used with customer data, it will almost certainly breach any NDA you have with your enterprise customers

Pros for self-hosted models:

* Tools such as [Ollama](https://ollama.com), [llama.cpp](https://github.com/ggml-org/llama.cpp), [LM Studio](https://lmstudio.ai) and [vLLM](https://github.com/vllm-project/vllm) make it trivial to run LLMs locally
* A modest investment in 1 or 2 NVIDIA GPUs such as the 3060 or 3090 can give you access to a wide range of models
* Running on your own hardware means there are no API costs - all you can eat
* You have full control over the model, and can choose to use open source models, or your own fine-tuned models
* You have full control over the data, and can choose to keep it on-premises or in a private cloud

Cons for self-hosted models:

* The GPUs will need a dedicated machine or server to be set up and managed
* The GPUs may become obsolete as the pace of innovation in LLMs accelerates, with the latest models requiring many more GB of VRAM
* The results of self-hosted models are nowhere near as good as the hosted models - which may also make tool calls to search the Internet and improve their results
* Tool calling is usually not available on smaller models, or works poorly

## Bill of materials for a PC

For our sister company [actuated.com](https://actuated.com), we built a custom PC to show [how to leverage GPUs and LLMs during CI/CD with GitHub Actions and GitLab CI](https://actuated.com/blog/ollama-in-github-actions).

The build uses an AMD Ryzen 9 5950X 16-Core CPU with 2x 3060 GPUs, 128GB of RAM, 1TB of NVMe storage, and a 1000W power supply.

![PC with 2x 3060 GPUs](https://actuated.com/images/2024-03-gpus/3060.jpg)

It made practical sense for us to build a PC with consumer components, however you could just as easily build an affordable server [using components from Supermicro](https://www.supermicro.com/en/support/resources/gpu), or even run a used PowerEdge server acquired from a reseller. Ampere's range of Arm servers and workstations [report good performance](https://amperecomputing.com/developers/power-your-ai) whilst running inference workloads purely on CPU.

Around 9 months later, we swapped the 2x 3060 GPUs for 2x 3090s, taking the VRAM from 24GB total to 48GB total when both GPUs are allocated.

For this post, we allocated one of the two 3090 cards to a microVM, then we installed OpenFaaS Edge.

At the time of writing, a brand-new NVIDIA 3060 card with 12GB of VRAM is available for around [250 GBP as a one-off cost from Amazon.co.uk](https://amzn.to/42tE1Xp). If you use it heavily, it will pay for itself in a short period of time compared to the cost of API credits.

## How to get started with OpenFaaS Edge

OpenFaaS Edge is a commercial distribution of [faasd](https://github.com/openfaas/faasd), which runs on a VM or bare-metal devices. It's easy to set up and operate because it doesn't include clustering or high-availability. Instead, it's designed for automation tasks, ETL, and edge workloads, which are often run on a single device.

Whilst there are various options for running a model locally, we chose Ollama because it comes with its own container image, and exposes a REST API which is easy to call from an OpenFaaS Function.

In our last post [Eradicate Cold Emails From Gmail for Good With OpenAI and OpenFaaS](https://www.openfaas.com/blog/filter-emails-with-openai/), we showed a workflow for Gmail / Google Workspace users to filter out unwanted emails using OpenAI's GPT-3.5 model. The content in the article could be used with OpenFaaS on Kubernetes, or OpenFaaS Edge.

We'll focus on the same use-case, and I'll show you a simplified function which receives an input and makes a call to the local model. It'll then be up to the reader to retrofit it into the existing solution, if that's what they wish to do. If you use another email provider which has an API, such as Hotmail, you can adapt the code accordingly.

### Install OpenFaaS Edge

Use the [official instructions to install OpenFaaS Edge](https://docs.openfaas.com/deployment/edge/) on your VM or bare-metal device. You can use any Linux distribution, but we recommend Ubuntu Server LTS.

Activate your license using your license key or GitHub Sponsorship.

### Install the NVIDIA Container Toolkit

Follow the instructions for your platform to install the NVIDIA Container Toolkit. This will allow you to run GPU workloads in containers.

[Installing the NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html)

You should be able to run `nvidia-smi` and see your GPUs detected.

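As a quick sanity check, something like the sequence below can confirm that both the driver and a container can see the GPU. The second step is optional and assumes Docker happens to be installed on the host; adapt it for whichever runtime you use.

```bash
# Confirm the driver can see the GPU(s) on the host
nvidia-smi

# Optional: confirm a container can also see the GPU.
# Assumes Docker is installed on this host - adapt for your own runtime.
sudo docker run --rm --gpus all ubuntu nvidia-smi
```
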
### Add Ollama to OpenFaaS Edge

Ollama is a stateful service which requires a large amount of storage for models. We can add it to the `docker-compose.yaml` file so that OpenFaaS Edge will start it up and manage it.

Edit `/var/lib/faasd/docker-compose.yaml` and add the following service:

```yaml
services:
  ollama:
    image: docker.io/ollama/ollama:latest
    command:
      - "ollama"
      - "serve"
    volumes:
      - type: bind
        source: ./ollama
        target: /root/.ollama
    ports:
      - "127.0.0.1:11434:11434"
    gpus: all
    deploy:
      restart: always
```

Restart faasd:

```bash
sudo systemctl daemon-reload
sudo systemctl restart faasd
```

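To check that the Ollama service has come up, you can make a quick request to the port published above. A plain GET to the root path is enough to see whether the API is reachable:

```bash
# The service should respond once faasd has restarted.
# Ollama typically replies with a short "Ollama is running" message.
curl -i http://127.0.0.1:11434/
```
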
You can perform a pre-pull of a model using the following, run directly on the host:

```bash
curl -X POST http://127.0.0.1:11434/api/pull -d '{"name": "gemma3:4b"}'
```

The [gemma3 model](https://ollama.com/library/gemma3) is known to work well on a single GPU. We've used the 4b version, but you can go smaller or larger if you like. Some experimentation may be required to find a model and parameter size that matches your specific needs.

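To confirm the pull completed, you can list the models Ollama has stored locally using its standard `/api/tags` endpoint, run against the same published port:

```bash
# List the models that have been pulled so far
curl -s http://127.0.0.1:11434/api/tags
```
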
If you wish to make this manual step a bit more automated, you can use an "init container" which runs after ollama has started.

```yaml
  ollama-init:
    image: docker.io/alpine/curl:latest
    command:
      - "curl"
      - "-X"
      - "POST"
      - "http://ollama:11434/api/pull"
      - "-d"
      - '{"name":"gemma3:4b"}'
    depends_on:
      - ollama
```

### Create a function to call the model

You can [use the documentation](https://docs.openfaas.com/languages/python/) to learn how to create a new function using a template such as Python.

We used Python in the previous post, but you can use any language that you like - with an existing template, or with one that you write for yourself.

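For example, scaffolding a new function with the python3-http template looks something like the following. The function name and registry prefix below are just placeholders for this post:

```bash
# Fetch the python3-http template from the template store
faas-cli template store pull python3-http

# Scaffold a new function (name and registry prefix are examples only)
faas-cli new filter-email --lang python3-http --prefix ghcr.io/example
```
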
Update requirements.txt with the `requests` HTTP client:

```
+requests
```

Then create a function handler which will call the model. The model is `gemma3:4b` in this example, but you can use any model that you have installed.

Here's an example handler:

```python
import requests
import json

def handle(event, context):
    url = "http://ollama:11434/api/generate"
    payload = {
        "model": "gemma3:4b",
        # Ask for a single JSON response, rather than a stream of tokens
        "stream": False,
        "prompt": str(event.body)
    }
    headers = {
        "Content-Type": "application/json"
    }

    response = requests.post(url, data=json.dumps(payload), headers=headers)

    # Parse the JSON response
    response_json = response.json()

    return {
        "statusCode": 200,
        "body": response_json["response"]
    }
```

As you can see, we can access Ollama via service discovery using its name as defined in `docker-compose.yaml`.

The example is simple: it takes the body, forms a request payload, and returns the result from the Ollama REST API.

However, you could take this in any direction you wish (see the sketch after this list):

* Include the API call in a workflow, or a chain of functions to decide the next action
* Trigger an API call to your email provider to mark the message as spam, or important
* Save the result to a database or into S3, to filter out future messages from the same sender
* Send a message with a confirmation button to Slack or Discord for final human approval

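As a rough illustration of the first two ideas, here is a minimal sketch of a helper that parses the model's JSON reply and decides what to do next. The `decide_action` function is hypothetical and not part of the original workflow; small models sometimes wrap their JSON in markdown fences, so it strips those before parsing.

```python
import json

def decide_action(llm_output: str) -> str:
    """Map the model's JSON reply onto a next step.

    llm_output is the text returned by the filter-email function,
    e.g. '{"categorization": -1, "reason": "..."}'.
    """
    cleaned = llm_output.strip().strip("`").strip()
    if cleaned.startswith("json"):
        # Remove a leading "json" language hint left over from a fenced block
        cleaned = cleaned[len("json"):].strip()

    result = json.loads(cleaned)
    score = result.get("categorization", 0)

    if score <= -1:
        return "archive"  # e.g. call your email provider's API to mark as spam
    if score >= 1:
        return "keep"     # leave the message in the inbox
    return "review"       # e.g. ask for human approval via Slack or Discord

if __name__ == "__main__":
    print(decide_action('{"categorization": -1, "reason": "Generic mass outreach."}'))
```
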
Email is just one use-case. Now that we have a working private function and a private self-hosted LLM, we can send it any kind of data.

An emerging use-case is to take podcast episodes, transcribe them, and then provide deep search and chat capabilities across episodes and topics.

### Deploy and invoke the function

You can now deploy and invoke the function:

```bash
faas-cli up
```

Here is the prompt, followed by a genuine cold outreach email I received:

```text
You are a founder of a startup, and a target for cold outreach email, spam, and nuisance messages. Use the best of your abilities to analyze this email, be skeptical, and ruthless. Respond in JSON with a categorization between -1 and 1, and a reason for your categorization in one sentence only. The categorization should be one of the following:

-1: Spam
0: Neutral
1: Legitimate

{
"categorization": 0,
"reason": "The email is a generic outreach message that does not provide any specific value or relevance to the recipient. It lacks personalization and seems to be part of a mass email campaign."
}


Subject: Alex, Quick idea for your LinkedIn

Body:
Hi Alex
Quick message to say hello 👋 and tell you about a new service for founders.

We can transform a 30-minute interview with you into a month of revenue-generating LinkedIn content through strategic repurposing. Here's what a month could like for you...

4 short video clips - 30-60 second highlights with captions
12 professionally written LinkedIn posts with images and graphics
1 long-form LinkedIn article - In-depth piece highlighting key insights

If you want to drive revenue with LinkedIn, that's what we do at Expert Takes

Reply if you'd like to learn more:)

Have a great day!
Bryan Collins
Director Expert Takes
No longer interested in these messages? Unsubscribe
```

Save the above as "email.txt", then invoke the function with the email as input:

```bash
cat ./email.txt | faas-cli invoke filter-email
```

Here's the result I received, which took `0m1.391s`:

```
{% raw %}
{
  "categorization": -1,
  "reason": "The email employs generic language, lacks specific details about the recipient's business, and utilizes a high-pressure, 'transformative' sales pitch, strongly indicating it's a spam or low-quality marketing message."
}
{% endraw %}
```

If you'd like to invoke the function via curl, run `faas-cli describe filter-email` to get the URL.

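Assuming the default gateway on port 8080 of the same host, a synchronous invocation would look something like this:

```bash
# Synchronous invocation via the gateway - blocks until the model has responded
curl -i http://127.0.0.1:8080/function/filter-email \
  --data-binary @./email.txt
```
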
Let's try another email. This time you'll need to repeat the prompt when you edit `email.txt`:

```
Subject: Refurbished Herman Miller Aeron Office Chairs for Openfaas Ltd

Body:

Dear Alex,

I am writing this email to introduce our wonderful deals on Refurbished Herman Miller Aeron Office Chairs, which we are discounted by up to 70% on the price of new ones!

Would you like to slash the cost of your office refurbishment by purchasing high quality chairs that will last for years?

The Aeron Office Chair is one of the best on the market and these are literally a fraction of the new price.

We have sizes A ,B & C in stock too with prices starting from just £450 each!

See our current stock here

All our chairs come with 12 months warranty on all parts, have a 14 day money-back guarantee and we provide a nationwide delivery service.

Discover more here

Kind Regards,

Michael Watkins

MW Office Furniture
```

Result:

```
{% raw %}
{
  "categorization": -1,
  "reason": "This email employs a highly generic sales pitch for refurbished furniture, lacks any specific connection to Openfaas Ltd, and uses common sales tactics likely associated with spam."
}
{% endraw %}
```

Whenever you're doing any kind of testing, it's just as important to do a negative test as a positive one.

So if you were planning on using this code, make sure that you get a categorization of 1 for a legitimate email from one of your customers.

### Invoke the function asynchronously for durability and scale

Many of us have grown used to API calls taking milliseconds to execute, particularly in response to events such as webhooks. However, LLMs can take seconds to minutes to respond to requests, especially if they involve a reasoning stage like DeepSeek R1.

One way to get around this is to invoke the function asynchronously, which will queue the request and return an immediate HTTP response to the caller, along with an X-Call-Id header.

You can register a one-time HTTP callback/webhook by passing in an additional `X-Callback-Url` header to the request. The X-Call-Id will be returned along with the status and body of the invocation.

Here's an example:

```bash
curl -i http://127.0.0.1:8080/async-function/filter-email \
  --data-binary @./email.txt \
  -H "X-Callback-Url: http://gateway.openfaas:8080/function/email-result"
```

Now, we could queue up hundreds or thousands of asynchronous invocations, and each will be processed as quickly as the function can handle them. The "email-result" function will receive the responses, and can correlate the X-Call-Id with the original request.

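For completeness, here is a minimal sketch of what such an "email-result" receiver could look like with the python3-http template. It's an illustrative example rather than code from the original workflow; it simply logs the correlation ID alongside the model's verdict.

```python
# Hypothetical "email-result" handler - a sketch of a callback receiver,
# not part of the original post's code.
def handle(event, context):
    # OpenFaaS adds an X-Call-Id header to the callback, which can be used
    # to correlate this result with the original async request
    call_id = event.headers.get("X-Call-Id", "unknown")

    # The body is whatever the filter-email function returned,
    # i.e. the model's JSON categorization
    print(f"Result for call {call_id}: {event.body}")

    return {
        "statusCode": 200,
        "body": "Accepted"
    }
```
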
If you'd like to try out an asynchronous invocation and don't have a receiver function, just remove the extra header:

```bash
curl -i http://127.0.0.1:8080/async-function/filter-email \
  --data-binary @./email.txt
```

Now look at the logs of the filter-email function to see the processing:

```bash
faas-cli logs filter-email
```

### Further work for the function

Our function was kept simple so that you can adapt it for your own needs, but if you were going to deploy this to production, you could improve the solution in a few ways:

* Index or save tokenized emails in a vector database for future reference and training
* Let the LLM perform RAG to check for similar emails in the past, increasing confidence
* Allow for a human-in-the-loop to approve or reject the categorization via a Slack or Discord message with a clickable button
* Run two small models at the same time, and get a consensus on the categorization by invoking both and combining the results

Whilst Ollama does not yet support multi-modal models which can process and produce images, audio and video, it is possible to run OpenAI's Whisper model to transcribe audio files, and then use the text output as input to a model.

You can deploy the [Whisper-based function we wrote previously on the blog](https://www.openfaas.com/blog/transcribe-audio-with-openai-whisper/) to OpenFaaS Edge as a core service, then send it HTTP requests just as we did with the Ollama service.

### Conclusion

The latest release of [OpenFaaS Edge](https://docs.openfaas.com/deployment/edge/) adds support for NVIDIA GPUs for core services defined in the `docker-compose.yaml` file. This makes it easy to run local LLMs using a tool like Ollama, then to call them for a wide range of tasks and workflows, whilst retaining data privacy and complete confidentiality.

The functions can be written in any language, and invoked either synchronously, or asynchronously for durability and scaling out.

Your function could respond to a webhook, an event such as an incoming email, or a cron schedule, processing data from a Google Sheet, S3 bucket, or database table.

If you'd like to discuss ideas and get a demo of anything we've talked about, feel free to [attend our weekly call](https://docs.openfaas.com/community/) - or reach out via our [pricing page](https://openfaas.com/pricing).

We've covered various AI/LLM related topics across our blog in the past:

* [Eradicate Cold Emails From Gmail for Good With OpenAI and OpenFaaS](https://www.openfaas.com/blog/filter-emails-with-openai/)
* [Scale to zero GPUs with OpenFaaS, Karpenter and AWS EKS](https://www.openfaas.com/blog/scale-to-zero-gpus/)
* [How to check for price drops with Functions, Cron & LLMs](https://www.openfaas.com/blog/checking-stock-price-drops/)
* [How to transcribe audio with OpenAI Whisper and OpenFaaS](https://www.openfaas.com/blog/transcribe-audio-with-openai-whisper/)

From our sister companies:

* Inlets - [Access local Ollama models from a cloud Kubernetes Cluster](https://inlets.dev/blog/2024/08/09/local-ollama-tunnel-k3s.html)
* Actuated - [Run AI models with ollama in CI with GitHub Actions](https://actuated.com/blog/ollama-in-github-actions)
* Actuated - [Accelerate GitHub Actions with dedicated GPUs](https://actuated.com/blog/gpus-for-github-actions)