Commit 51d7dbc

Merge pull request #27 from semiotic-ai/install
install
2 parents 9978130 + a124c88 commit 51d7dbc

28 files changed: 169 additions & 380 deletions

.env.example

Lines changed: 1 addition & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -1,5 +1,5 @@
11
OPENAI_API_KEY=<your-openai-api-key>
2-
HF_DATASET_KEY=<your-huggingface-dataset-key>
2+
HF_DATASET_KEY=<can-be-empty-if-you-do-not-want-to-use-huggingface>
33
MLFLOW_TRACKING_URI=<your-mlflow-tracking-uri>
44
MLFLOW_TRACKING_USERNAME=admin
55
MLFLOW_TRACKING_PASSWORD=password
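
The new placeholder suggests that `HF_DATASET_KEY` may be left empty. A minimal sketch of how a consumer could treat an empty or missing value as "not configured" follows. `read_optional_env` is a hypothetical helper used only for illustration and is not taken from this repository:

```python
import os
from typing import Mapping, Optional


def read_optional_env(name: str, env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the stripped value of `name`, or None when it is missing or empty.

    Hypothetical helper: the repository may handle optional keys differently.
    """
    source = os.environ if env is None else env
    value = source.get(name, "").strip()
    return value or None


# Assumption: an empty HF_DATASET_KEY means "skip Hugging Face entirely".
hf_key = read_optional_env("HF_DATASET_KEY", {"HF_DATASET_KEY": ""})
use_huggingface = hf_key is not None
```

Passing the environment as a mapping keeps the helper easy to test without touching the real process environment.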

README.md

Lines changed: 30 additions & 2 deletions
@@ -11,11 +11,13 @@ We utilize [poetry](https://python-poetry.org/) for dependency management. Pleas
1111

1212
We utilize [commitizen](https://commitizen-tools.github.io/commitizen/) for commit messages and semantic versioning. Please run `cz commit` to commit your changes. Commitizen can be installed with `pip install commitizen` or `brew install commitizen`.
1313

14+
We utilize [docker](https://www.docker.com/) for managing the tracking of our service and associated experiments through [mlflow](https://mlflow.org/). In our docker image, we spin up [mlflow](https://mlflow.org/), [postgres](https://www.postgresql.org/), and [minio](https://min.io/) instances. This closely mirrors our production setup and allows for a smooth development flow between local and prod. Please ensure docker is installed and running in the background on your machine.
15+
1416
Here are some quick commands for getting started:
1517

1618
```bash
17-
brew add poetry
18-
brew add commitizen
19+
brew install poetry
20+
brew install commitizen
1921
```
2022

2123
```bash
@@ -26,11 +28,35 @@ cd ../mlflow-manager
2628
poetry install
2729
```
2830

31+
### .env
32+
33+
There are two `.env` files that we expect the user to set up, one for `mlflow-manager` and one for `graphdoc`. First, let's set up the `mlflow-manager` `.env` file. You can leave these values as they are, or modify them as you see fit:
34+
35+
```bash
36+
# navigate to the docker root
37+
cd mlflow-manager
38+
cd docker
39+
40+
# copy the .env.example for setup
41+
cp .env.example .env # set values directly in your newly created .env file
42+
```
43+
44+
Next, let's set up the `.env` file to be used by our `graphdoc` program.
45+
46+
```bash
47+
# navigate to the graphdoc root
48+
cd ../..
49+
50+
# copy the .env.example for setup
51+
cp .env.example .env # set values directly in your newly created .env file
52+
```
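
The two copy steps above can also be scripted. The following is a sketch only, assuming the layout implied by the commands (`mlflow-manager/docker/.env.example` and a `.env.example` at the repository root). `ensure_env_files` is a hypothetical helper and not part of the repository:

```python
import shutil
from pathlib import Path
from typing import Iterable, List

# Assumed locations of the two .env.example files, relative to the repository root.
DEFAULT_ENV_DIRS = (Path("mlflow-manager") / "docker", Path("."))


def ensure_env_files(repo_root: Path, env_dirs: Iterable[Path] = DEFAULT_ENV_DIRS) -> List[Path]:
    """Copy .env.example to .env in each directory, never overwriting an existing .env."""
    created = []
    for rel_dir in env_dirs:
        example = Path(repo_root) / rel_dir / ".env.example"
        target = example.with_name(".env")
        if example.exists() and not target.exists():
            shutil.copyfile(example, target)
            created.append(target)
    return created
```

Existing `.env` files are left untouched, so running the helper twice does not discard values you have already set.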
53+
2954
### run.sh
3055

3156
The `run.sh` script is a convenience script for development. It provides a few shortcuts for running useful commands.
3257

3358
```bash
59+
# make sure you are in the root of the repository
3460
# ensure that the script is executable
3561
chmod +x run.sh
3662

@@ -41,6 +67,8 @@ chmod +x run.sh
4167
To set up the mlflow-manager services, run the following command:
4268

4369
```bash
70+
# default username: admin
71+
# default password: password
4472
./run.sh mlflow-setup
4573
```
4674

graphdoc/assets/configs/single_prompt_doc_generator_module.yaml

Lines changed: 4 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -1,8 +1,5 @@
11
graphdoc:
22
log_level: INFO # The log level to use (DEBUG, INFO, WARNING, ERROR, CRITICAL)
3-
mlflow_tracking_uri: !env MLFLOW_TRACKING_URI # The tracking URI for MLflow
4-
mlflow_tracking_username: !env MLFLOW_TRACKING_USERNAME # The username for the mlflow tracking server
5-
mlflow_tracking_password: !env MLFLOW_TRACKING_PASSWORD # The password for the mlflow tracking server
63

74
mlflow:
85
mlflow_tracking_uri: !env MLFLOW_TRACKING_URI # The tracking URI for MLflow
@@ -26,6 +23,7 @@ data:
2623
evalset_ratio: 0.1 # The proportionate size of the evalset
2724
data_helper_type: generation # Type of data helper to use (quality, generation)
2825
seed: 42 # The seed for the random number generator
26+
2927
prompt:
3028
prompt: base_doc_gen # Which prompt signature to use
3129
class: DocGeneratorPrompt # Must be a child of SinglePrompt (we will use an enum to map this)
@@ -50,18 +48,15 @@ prompt_metric:
5048
trainer:
5149
class: DocGeneratorTrainer # The type of trainer to use (DocQualityTrainer)
5250
optimizer_type: miprov2 # The type of optimizer to use (miprov2, BootstrapFewShotWithRandomSearch)
53-
mlflow_tracking_uri: !env MLFLOW_TRACKING_URI # The tracking URI for MLflow
5451
mlflow_model_name: doc_generator_model # The name of the model in MLflow
5552
mlflow_experiment_name: doc_generator_experiment # The name of the experiment in MLflow
5653

5754
optimizer:
5855
optimizer_type: miprov2 # BootstrapFewShotWithRandomSearch, miprov2
5956
auto: light # miprov2 setting
60-
# student: this is the prompt.infer object
61-
# trainset: this is the dataset we are working with
62-
max_labeled_demos: 2
63-
max_bootstrapped_demos: 4
64-
num_trials: 2
57+
max_labeled_demos: 2 # max number of labeled demonstrations
58+
max_bootstrapped_demos: 4 # max number of bootstrapped demonstrations
59+
num_trials: 2 # number of trials
6560
minibatch: true # default true
6661

6762
module:

graphdoc/assets/configs/single_prompt_doc_generator_module_eval.yaml

Lines changed: 3 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -1,8 +1,5 @@
11
graphdoc:
22
log_level: INFO # The log level to use (DEBUG, INFO, WARNING, ERROR, CRITICAL)
3-
mlflow_tracking_uri: !env MLFLOW_TRACKING_URI # The tracking URI for MLflow
4-
mlflow_tracking_username: !env MLFLOW_TRACKING_USERNAME # The username for the mlflow tracking server
5-
mlflow_tracking_password: !env MLFLOW_TRACKING_PASSWORD # The password for the mlflow tracking server
63

74
mlflow:
85
mlflow_tracking_uri: !env MLFLOW_TRACKING_URI # The tracking URI for MLflow
@@ -51,18 +48,15 @@ prompt_metric:
5148
trainer:
5249
class: DocGeneratorTrainer # The type of trainer to use (DocQualityTrainer)
5350
optimizer_type: miprov2 # The type of optimizer to use (miprov2, BootstrapFewShotWithRandomSearch)
54-
mlflow_tracking_uri: !env MLFLOW_TRACKING_URI # The tracking URI for MLflow
5551
mlflow_model_name: doc_generator_model # The name of the model in MLflow
5652
mlflow_experiment_name: doc_generator_experiment # The name of the experiment in MLflow
5753

5854
optimizer:
5955
optimizer_type: miprov2 # BootstrapFewShotWithRandomSearch, miprov2
6056
auto: light # miprov2 setting
61-
# student: this is the prompt.infer object
62-
# trainset: this is the dataset we are working with
63-
max_labeled_demos: 2
64-
max_bootstrapped_demos: 4
65-
num_trials: 2
57+
max_labeled_demos: 2 # max number of labeled demonstrations
58+
max_bootstrapped_demos: 4 # max number of bootstrapped demonstrations
59+
num_trials: 2 # number of trials
6660
minibatch: true # default true
6761

6862
module:
@@ -72,7 +66,6 @@ module:
7266
fill_empty_descriptions: true # Whether to fill the empty descriptions in the schema
7367

7468
eval:
75-
mlflow_tracking_uri: !env MLFLOW_TRACKING_URI # The tracking URI for MLflow
7669
mlflow_experiment_name: doc_generator_eval # The name of the experiment in MLflow
7770
generator_prediction_field: documented_schema # The field in the generator prediction to use
7871
evaluator_prediction_field: rating # The field in the evaluator prediction to use

graphdoc/assets/configs/single_prompt_doc_generator_trainer.yaml

Lines changed: 8 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -1,17 +1,14 @@
11
graphdoc:
2-
log_level: INFO # The log level to use (DEBUG, INFO, WARNING, ERROR, CRITICAL)
3-
mlflow_tracking_uri: !env MLFLOW_TRACKING_URI # The tracking URI for MLflow
4-
mlflow_tracking_username: !env MLFLOW_TRACKING_USERNAME # The username for the mlflow tracking server
5-
mlflow_tracking_password: !env MLFLOW_TRACKING_PASSWORD # The password for the mlflow tracking server
2+
log_level: INFO # The log level to use (DEBUG, INFO, WARNING, ERROR, CRITICAL)
63

74
mlflow:
8-
mlflow_tracking_uri: !env MLFLOW_TRACKING_URI # The tracking URI for MLflow
5+
mlflow_tracking_uri: !env MLFLOW_TRACKING_URI # The tracking URI for MLflow
96
mlflow_tracking_username: !env MLFLOW_TRACKING_USERNAME # The username for the mlflow tracking server
107
mlflow_tracking_password: !env MLFLOW_TRACKING_PASSWORD # The password for the mlflow tracking server
118

129
language_model:
13-
model: openai/gpt-4o # Must be a valid dspy language model
14-
api_key: !env OPENAI_API_KEY # Must be a valid dspy language model API key
10+
model: openai/gpt-4o # Must be a valid dspy language model
11+
api_key: !env OPENAI_API_KEY # Must be a valid dspy language model API key
1512
cache: true # Whether to cache the calls to the language model
1613

1714
data:
@@ -26,6 +23,7 @@ data:
2623
evalset_ratio: 0.1 # The proportionate size of the evalset
2724
data_helper_type: generation # Type of data helper to use (quality, generation)
2825
seed: 42 # The seed for the random number generator
26+
2927
prompt:
3028
prompt: base_doc_gen # Which prompt signature to use
3129
class: DocGeneratorPrompt # Must be a child of SinglePrompt (we will use an enum to map this)
@@ -50,16 +48,13 @@ prompt_metric:
5048
trainer:
5149
class: DocGeneratorTrainer # The type of trainer to use (DocQualityTrainer)
5250
optimizer_type: miprov2 # The type of optimizer to use (miprov2, BootstrapFewShotWithRandomSearch)
53-
mlflow_tracking_uri: !env MLFLOW_TRACKING_URI # The tracking URI for MLflow
5451
mlflow_model_name: doc_generator_model # The name of the model in MLflow
5552
mlflow_experiment_name: doc_generator_experiment # The name of the experiment in MLflow
5653

5754
optimizer:
5855
optimizer_type: miprov2 # BootstrapFewShotWithRandomSearch, miprov2
5956
auto: light # miprov2 setting
60-
# student: this is the prompt.infer object
61-
# trainset: this is the dataset we are working with
62-
max_labeled_demos: 2
63-
max_bootstrapped_demos: 4
64-
num_trials: 2
57+
max_labeled_demos: 2 # max number of labeled demonstrations
58+
max_bootstrapped_demos: 4 # max number of bootstrapped demonstrations
59+
num_trials: 2 # number of trials
6560
minibatch: true # default true

graphdoc/assets/configs/single_prompt_doc_quality_trainer.yaml

Lines changed: 4 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -1,8 +1,5 @@
11
graphdoc:
22
log_level: INFO # The log level to use (DEBUG, INFO, WARNING, ERROR, CRITICAL)
3-
mlflow_tracking_uri: !env MLFLOW_TRACKING_URI # The tracking URI for MLflow
4-
mlflow_tracking_username: !env MLFLOW_TRACKING_USERNAME # The username for the mlflow tracking server
5-
mlflow_tracking_password: !env MLFLOW_TRACKING_PASSWORD # The password for the mlflow tracking server
63

74
mlflow:
85
mlflow_tracking_uri: !env MLFLOW_TRACKING_URI # The tracking URI for MLflow
@@ -26,6 +23,7 @@ data:
2623
evalset_ratio: 0.1 # The proportionate size of the evalset
2724
data_helper_type: quality # Type of data helper to use (quality, generation)
2825
seed: 42 # The seed for the random number generator
26+
2927
prompt:
3028
prompt: doc_quality # Which prompt signature to use
3129
class: DocQualityPrompt # Must be a child of SinglePrompt (we will use an enum to map this)
@@ -50,16 +48,13 @@ prompt_metric:
5048
trainer:
5149
class: DocQualityTrainer # The type of trainer to use (DocQualityTrainer)
5250
optimizer_type: miprov2 # The type of optimizer to use (miprov2, BootstrapFewShotWithRandomSearch)
53-
mlflow_tracking_uri: !env MLFLOW_TRACKING_URI # The tracking URI for MLflow
5451
mlflow_model_name: doc_quality_model # The name of the model in MLflow
5552
mlflow_experiment_name: doc_quality_experiment # The name of the experiment in MLflow
5653

5754
optimizer:
5855
optimizer_type: miprov2 # BootstrapFewShotWithRandomSearch, miprov2
5956
auto: light # miprov2 setting
60-
# student: this is the prompt.infer object
61-
# trainset: this is the dataset we are working with
62-
max_labeled_demos: 2
63-
max_bootstrapped_demos: 4
64-
num_trials: 2
57+
max_labeled_demos: 2 # max number of labeled demonstrations
58+
max_bootstrapped_demos: 4 # max number of bootstrapped demonstrations
59+
num_trials: 2 # number of trials
6560
minibatch: true # default true
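
Across the YAML configs above, `mlflow_tracking_uri` is removed from the `graphdoc`, `trainer`, and `eval` sections and kept only under `mlflow`. A minimal sketch of a check for leftover copies in an already-parsed config follows. `stray_tracking_uri_sections` is a hypothetical helper, and the sample dictionary is illustrative rather than loaded from the real files:

```python
from typing import Any, Dict, List


def stray_tracking_uri_sections(config: Dict[str, Any]) -> List[str]:
    """List top-level sections other than `mlflow` that still define mlflow_tracking_uri."""
    return sorted(
        name
        for name, section in config.items()
        if name != "mlflow"
        and isinstance(section, dict)
        and "mlflow_tracking_uri" in section
    )


# Illustrative pre-change layout: the URI is duplicated in three sections.
old_style = {
    "graphdoc": {"log_level": "INFO", "mlflow_tracking_uri": "http://localhost:5000"},
    "mlflow": {"mlflow_tracking_uri": "http://localhost:5000"},
    "trainer": {"class": "DocGeneratorTrainer", "mlflow_tracking_uri": "http://localhost:5000"},
}
leftovers = stray_tracking_uri_sections(old_style)
```

An empty result means the config already follows the consolidated layout.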
Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,8 @@
1+
graphdoc.modules.token\_tracker module
2+
======================================
3+
4+
.. automodule:: graphdoc.modules.token_tracker
5+
:members:
6+
:undoc-members:
7+
:show-inheritance:
8+
:noindex:

graphdoc/docs/index.rst

Lines changed: 0 additions & 1 deletion
@@ -26,6 +26,5 @@ Indices and tables
2626
==================
2727

2828
* :ref:`genindex`
29-
* :ref:`modindex`
3029
* :ref:`search`
3130

graphdoc/graphdoc/config.py

Lines changed: 8 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -116,7 +116,7 @@ def mlflow_data_helper_from_yaml(yaml_path: Union[str, Path]) -> MlflowDataHelpe
116116
.. code-block:: yaml
117117
118118
mlflow:
119-
mlflow_tracking_uri: !env MLFLOW_TRACKING_URI # The tracking URI for MLflow
119+
mlflow_tracking_uri: !env MLFLOW_TRACKING_URI # The tracking URI for MLflow
120120
mlflow_tracking_username: !env MLFLOW_TRACKING_USERNAME # The username for the mlflow tracking server
121121
mlflow_tracking_password: !env MLFLOW_TRACKING_PASSWORD # The password for the mlflow tracking server
122122
@@ -429,11 +429,15 @@ def single_trainer_from_dict(
429429
.. code-block:: python
430430
431431
{
432+
"mlflow": {
433+
"mlflow_tracking_uri": "http://localhost:5000",
434+
"mlflow_tracking_username": "admin",
435+
"mlflow_tracking_password": "password",
436+
},
432437
"trainer": {
433438
"class": "DocQualityTrainer",
434439
"mlflow_model_name": "doc_quality_model",
435440
"mlflow_experiment_name": "doc_quality_experiment",
436-
"mlflow_tracking_uri": "http://localhost:5000"
437441
},
438442
"optimizer": {
439443
"optimizer_type": "miprov2",
@@ -465,7 +469,7 @@ def single_trainer_from_dict(
465469
optimizer_kwargs=trainer_dict["optimizer"],
466470
mlflow_model_name=trainer_dict["trainer"]["mlflow_model_name"],
467471
mlflow_experiment_name=trainer_dict["trainer"]["mlflow_experiment_name"],
468-
mlflow_tracking_uri=trainer_dict["trainer"]["mlflow_tracking_uri"],
472+
mlflow_tracking_uri=trainer_dict["mlflow"]["mlflow_tracking_uri"],
469473
trainset=trainset,
470474
evalset=evalset,
471475
)
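
The hunk above moves the tracking URI lookup from `trainer_dict["trainer"]` to `trainer_dict["mlflow"]`. A minimal sketch of that lookup follows, using the dictionary shape from the updated docstring. `tracking_uri_from_config` is a hypothetical helper, and the explicit error for the old layout is an added assumption, not behavior confirmed by the source:

```python
from typing import Any, Dict


def tracking_uri_from_config(config: Dict[str, Any]) -> str:
    """Return the MLflow tracking URI from the shared top-level `mlflow` section."""
    try:
        return config["mlflow"]["mlflow_tracking_uri"]
    except KeyError as exc:
        # Assumption: flag configs that still use the pre-change `trainer` location.
        if "mlflow_tracking_uri" in config.get("trainer", {}):
            raise KeyError(
                "mlflow_tracking_uri must now live under the top-level 'mlflow' section"
            ) from exc
        raise


config = {
    "mlflow": {
        "mlflow_tracking_uri": "http://localhost:5000",
        "mlflow_tracking_username": "admin",
        "mlflow_tracking_password": "password",
    },
    "trainer": {
        "class": "DocQualityTrainer",
        "mlflow_model_name": "doc_quality_model",
        "mlflow_experiment_name": "doc_quality_experiment",
    },
}
uri = tracking_uri_from_config(config)
```

Configs written for the old layout would fail at this lookup, so existing dictionaries and YAML files need the `mlflow` section before upgrading.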
@@ -631,7 +635,7 @@ def doc_generator_eval_from_yaml(yaml_path: Union[str, Path]) -> DocGeneratorEva
631635
.. code-block:: yaml
632636
633637
mlflow:
634-
mlflow_tracking_uri: !env MLFLOW_TRACKING_URI # The tracking URI for MLflow
638+
mlflow_tracking_uri: !env MLFLOW_TRACKING_URI # The tracking URI for MLflow
635639
mlflow_tracking_username: !env MLFLOW_TRACKING_USERNAME # The username for the mlflow tracking server
636640
mlflow_tracking_password: !env MLFLOW_TRACKING_PASSWORD # The password for the mlflow tracking server
637641
@@ -663,7 +667,6 @@ def doc_generator_eval_from_yaml(yaml_path: Union[str, Path]) -> DocGeneratorEva
663667
fill_empty_descriptions: true # Whether to fill the empty descriptions in the schema
664668
665669
eval:
666-
mlflow_tracking_uri: !env MLFLOW_TRACKING_URI # The tracking URI for MLflow
667670
mlflow_experiment_name: doc_generator_eval # The name of the experiment in MLflow
668671
generator_prediction_field: documented_schema # The field in the generator prediction to use
669672
evaluator_prediction_field: rating # The field in the evaluator prediction to use

graphdoc/graphdoc/prompts/schema_doc_quality.py

Lines changed: 6 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -30,9 +30,9 @@ class DocQualitySignature(dspy.Signature):
3030
""" # noqa: B950
3131

3232
database_schema: str = dspy.InputField()
33-
category: Literal[
34-
"perfect", "almost perfect", "poor but correct", "incorrect"
35-
] = dspy.OutputField()
33+
category: Literal["perfect", "almost perfect", "poor but correct", "incorrect"] = (
34+
dspy.OutputField()
35+
)
3636
rating: Literal[4, 3, 2, 1] = dspy.OutputField()
3737

3838

@@ -69,9 +69,9 @@ class DocQualityDemonstrationSignature(dspy.Signature):
6969
""" # noqa: B950
7070

7171
database_schema: str = dspy.InputField()
72-
category: Literal[
73-
"perfect", "almost perfect", "poor but correct", "incorrect"
74-
] = dspy.OutputField()
72+
category: Literal["perfect", "almost perfect", "poor but correct", "incorrect"] = (
73+
dspy.OutputField()
74+
)
7575
rating: Literal[4, 3, 2, 1] = dspy.OutputField()
7676

7777
