This project serves as a proof-of-concept implementation of a Compute Engine job for application builders in Vana's Data Access architecture.
The worker reads the test SQLite DB mounted at `/mnt/input/query_results.db` (directory overridable via the `INPUT_PATH` env variable), which follows an unwrapped schema used for demoing the Query Engine until a consistent end-to-end workflow is in place. It processes the input data and writes a `stats.json` file to `/mnt/output/stats.json` (directory overridable via the `OUTPUT_PATH` env variable).
- Edit the `dummy_data.sql` script with the DLP data refiner schema, seed some dummy data, and add your query at the bottom to simulate the `results` table creation.
- Run `sqlite3 ./input/query_results.db < dummy_data.sql` to transform the seed data into an SQLite database that can be processed by the job.
- Update `worker.py` to add any processing logic for artifact generation.
- Have the worker write any artifacts your application needs to the output dir `os.getenv("OUTPUT_PATH", "/mnt/output")`.
- Run the `image-build.sh` and `image-run.sh` scripts to test your worker implementation (the full loop is sketched after this list). Make sure to set `DEV_MODE=1` to use the local database file without requiring a real query engine.
- Run the `image-export.sh` script to generate the `my-compute-job.tar` archive. Gzip this manually, or push your changes to `main` to build a release (with SHA256 checksum).
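Taken together, a typical local iteration might look like the following shell sketch (the argument-free script invocations and the `./output` inspection path are assumptions; adjust to your checkout):

```bash
# Local development loop (sketch)
sqlite3 ./input/query_results.db < dummy_data.sql   # seed the input DB
./image-build.sh                                    # build the worker image
DEV_MODE=1 ./image-run.sh                           # run against the local DB (assumes the script forwards DEV_MODE)
cat ./output/stats.json                             # inspect the generated artifact (path is an assumption)
```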
The worker supports two modes of operation:
- **Development Mode**: Set `DEV_MODE=1` to use a local database file without connecting to the Query Engine. This is useful for testing and development.

  ```bash
  # Example: Running in development mode
  docker run -e DEV_MODE=1 \
    -v /local/path/to/input:/mnt/input \
    -v /local/path/to/output:/mnt/output \
    my-compute-job
  ```
- **Production Mode**: The default mode connects to the Query Engine, using the `QUERY` and `QUERY_SIGNATURE` environment variables to execute the query first, then processes the results.

  ```bash
  # Example: Running in production mode
  docker run -e QUERY="SELECT user_id, locale FROM users" \
    -e QUERY_SIGNATURE="xyz123" \
    -e QUERY_ENGINE_URL="https://query.vana.org" \
    -v /local/path/to/output:/mnt/output \
    my-compute-job
  ```
Important: Docker images must be compatible with the AMD64 architecture to run properly in the Compute Engine's Trusted Execution Environments (TEEs). When building your Docker image:
- Ensure all dependencies and binaries are AMD64-compatible
- Build the Docker image on an AMD64 platform, or use the `--platform=linux/amd64` flag with Docker buildx
- Test the image in an AMD64 environment before submission
- Avoid using architecture-specific binaries or libraries when possible
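For example, when building on an ARM host (such as Apple Silicon), a cross-platform build might look like this (the `my-compute-job` tag is illustrative):

```bash
# Cross-build an AMD64 image with Docker buildx and load it into the local daemon
docker buildx build --platform linux/amd64 -t my-compute-job --load .
```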
These are convenience wrappers around docker commands that build, export, and run the worker image consistently, for simpler dev cycles and faster iteration.
The `image-export.sh` script builds an exportable `.tar` that can be uploaded to a remote service and registered with the Compute Engine / image registry contracts.
The `dummy_data.sql` script can be modified with the relevant schema and dummy data insertions. The query at the bottom of the script simulates the Query Engine's `results` table creation when processing queries.
To transform this dummy data into the input `query_results.db` SQLite DB, simply run `sqlite3 ./input/query_results.db < dummy_data.sql`.
Note: Only the `results` table will be available in production Compute Engine jobs; the other tables serve only to seed dummy data.
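A quick sanity check of the seeded database (the `LIMIT 5` query is illustrative; any read against `results` works):

```bash
sqlite3 ./input/query_results.db < dummy_data.sql                  # build the input DB
sqlite3 ./input/query_results.db "SELECT * FROM results LIMIT 5;"  # verify the results table
```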
- Compute Jobs are run as Docker containers inside of the Compute Engine TEE.
- Docker container images ("Compute Instructions") must be approved for a given Data Refiner id by DLP owners through the Compute Instruction Registry smart contract before being submitted for processing via the Compute Engine API.
- The Data Refiner id determines the schema that can be queried against, the granted permissions by the DLP owner, and the cost to access each queried data component (schema, table, column) of the query when running compute jobs.
- Individual queries to the Query Engine are run outside of the Compute Job by the Compute Engine directly before invoking the Compute Job.
- Input data is provided from the Compute Engine to the Compute Job container through a mounted `/mnt/input` directory.
  - This directory contains a single `query_results.db` SQLite file downloaded from the Query Engine after a query has been successfully processed.
  - A queryable `results` table is the only table in the mounted `query_results.db`. This table contains all of the queried data points of the query submitted to the Query Engine through the Compute Engine API.
  - Example:
    ```sql
    -- Refiner Schema:
    CREATE TABLE users (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT, locale TEXT, zip_code TEXT, city TEXT);

    -- Application Builder Query:
    SELECT id, name, locale FROM users;

    -- Query Engine outputs `query_results.db` with schema:
    CREATE TABLE results (id INTEGER, name TEXT, locale TEXT);

    -- Compute Job processing:
    SELECT id, name FROM results;
    SELECT locale FROM results;
    -- …
    ```
- Output data Artifacts are provided to the Compute Engine from the Compute Job container through a mounted `/mnt/output` directory.
- Any Artifact files generated in this directory by the Compute Job will later be available for consumption and download by the job owner (i.e., the application builder) through the Compute Engine API.
- Query data from the `results` table of `/mnt/input/query_results.db` with SQLite.
- Run custom logic to process (transform / aggregate / …) the query results.
- Write generated Artifacts to the `/mnt/output` directory for later download by the application builder / job owner wallet through the Compute Engine API (a minimal sketch of this flow follows).
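As a minimal sketch of that flow, assuming the example schema above (a `locale` column in `results`) and a `sqlite3` binary new enough to support `-json` output:

```bash
# Sketch: aggregate locales from the results table into a JSON artifact.
# INPUT_PATH / OUTPUT_PATH defaults match the mounted directories above.
INPUT_PATH="${INPUT_PATH:-/mnt/input}"
OUTPUT_PATH="${OUTPUT_PATH:-/mnt/output}"
sqlite3 -json "$INPUT_PATH/query_results.db" \
  "SELECT locale, COUNT(*) AS count FROM results GROUP BY locale;" \
  > "$OUTPUT_PATH/stats.json"
```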
- Build and export the Compute Job Docker image to a `.tar`, and `gzip` it to a `.tar.gz` (the first three steps are sketched after this list).
- Upload it to a publicly accessible URL for later retrieval by the Compute Engine.
- Calculate the SHA256 checksum of the image archive file and document it for use in on-chain registration. (Example: `sha256sum my-compute-job.tar.gz | cut -d' ' -f1`)
- Write the Compute Instruction on-chain to the ComputeInstructionRegistry smart contract via the `addComputeInstruction` function, with both the publicly available image URL and the SHA256 image checksum.
- Notify the relevant DLP owner to audit the Compute Instruction image and eventually approve it with the DLP owner wallet through the `updateComputeInstruction` ComputeInstructionRegistry smart contract function.
- Approval can be checked and verified on-chain with the `isApproved` ComputeInstructionRegistry smart contract function.
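In practice, the archive-and-checksum steps might look like this (file names follow the `image-export.sh` output described above):

```bash
./image-export.sh                                 # produces my-compute-job.tar
gzip my-compute-job.tar                           # produces my-compute-job.tar.gz
sha256sum my-compute-job.tar.gz | cut -d' ' -f1   # checksum for addComputeInstruction
```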