A generic "generate" goal for version-controlled generated files #18235

cognifloyd · 2023-02-13T19:29:18Z

cognifloyd
Feb 13, 2023
Collaborator

For a variety of reasons (many of which could probably be summarized as "legacy"), some generated files need to be committed to version control.

Since these files need to be in version control, it would be great if pants could make it easier to manage them. Here are the goals I have for whatever solution we use:

Generated files should be up-to-date.
If the files are out-of-date (e.g. one of the inputs used to generate them has changed), then the user needs an error or warning with an explanation of how to regenerate the files.

A survey of current pants goals

goal: `export-codegen`

At first blush, codegen seems ideal for making generated files. But codegen in pants is not designed for version controlled files.

When pants generates code, it is considered an internal by-product as described in the PR that introduced the export-codegen goal:

It's important that we write to dist/, rather than saving to the build root, as these are generated files that should be kept out of version control. Also, we'd have major issues if Pants starting thinking those generated files were actual production files, e.g. a [target]'s default sources field claiming ownership of those files.

The export-codegen goal makes these generated files available in dist/ for those who want to:

use the results of codegen outside of Pants, including:

Inspect the files for debugging purposes.

Save the files for IDEs to use.

So, we need a different form of codegen in pants for files that should be version controlled.

goals: lint, fmt, fix

Version-controlled files are the bread and butter of lint, fmt, and fix. That is the opposite of export-codegen, which is for files that are not version controlled.

Fixers and formatters also run during the lint goal so that lint fails if they need to fix or format a file.

Code generation of files that are version controlled is remarkably similar to fmt and fix. Files that need to be changed are advertised via the lint goal. Logically, fmt and fix are very different than codegen - but the process, the UX, and the underlying rules seem quite similar to me.

In the StackStorm project, I've shoved my (version controlled) file generation under the fmt umbrella even though it's not really a formatter. Logically this generation is also not a fixer. But one of the benefits of doing this is that lint will fail if the generator needs to regenerate the file(s), and it will provide an error message that explains how to run ./pants fmt ... to resolve the lint error.

I've heard of others who create a "test" that fails if the generated file(s) were not properly regenerated. I really like using lint for this, to take advantage of the fine-grained caching and to reuse the cache for both lint and fmt/fix.

goal: generate-lockfiles

Another goal that is very similar to the file generation feature I'm looking for is the generate-lockfiles goal. The warnings/errors saying to regenerate can be triggered in many places where rules need to use the lockfiles. So, there is no need to integrate that warning in a lint backend/rule. The goal also provides a fairly ergonomic interface for regenerating those.

generate-lockfiles looks up the resolves, translates them into lockfile paths, and then generates the lockfile at that path.

Proposal for a new "generate" goal

If there were a generic "generate" goal, it could go in the opposite direction of generate-lockfiles: given the path of a lockfile, look up the resolve associated with that lockfile, and then generate the lockfile at that path.
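A minimal sketch of that reverse lookup, assuming resolves are available as a mapping from resolve name to lockfile path (the shape of the `[python].resolves` option). The function name `resolve_for_lockfile` and the example paths are hypothetical, not a Pants API.

```python
from __future__ import annotations


def resolve_for_lockfile(resolves: dict[str, str], lockfile_path: str) -> str:
    """Return the name of the resolve whose lockfile lives at `lockfile_path`."""
    # Invert the name -> path mapping, remembering every resolve that claims a path.
    owners: dict[str, list[str]] = {}
    for name, path in resolves.items():
        owners.setdefault(path, []).append(name)

    names = owners.get(lockfile_path, [])
    if not names:
        raise KeyError(f"no resolve is configured with lockfile {lockfile_path!r}")
    if len(names) > 1:
        raise ValueError(f"{lockfile_path!r} is claimed by several resolves: {sorted(names)}")
    return names[0]


# Hypothetical configuration, shaped like a resolve-name -> lockfile-path table.
RESOLVES = {
    "python-default": "3rdparty/python/default.lock",
    "tools": "3rdparty/python/tools.lock",
}
```

With this, a path given on the command line can be turned back into the resolve to regenerate, which is the opposite direction from what generate-lockfiles does today.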

Similarly, that "generate" goal could take a path to another generated file, look up the owning target for it, and, if that target is explicitly for generated files, run the plugin/subsystem/tool that encompasses the codegen logic for that target.

This goal could be only for non-lockfile generated files. If we did include lockfile generation in this goal, then the owning target would probably be a synthetic target synthesized from the config in pants.toml -- so it is explicitly defined (even though not defined in a BUILD file).

This goal should be modeled on fix/fmt (not modeled on export-codegen). This would extend the lint request/result classes (like fix/fmt do) so that file generation happens in a sandbox during lint, failing lint if something should be regenerated. When lint fails here, the error message would prompt the user to run ./pants generate path/to/generated/file/or/dir.
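The lint/generate pairing described above could look roughly like this. It is a sketch under stated assumptions, not how Pants implements fix/fmt: `generate` is a hypothetical stand-in for the sandboxed generator, and both function names are made up for illustration.

```python
from __future__ import annotations

from pathlib import Path
from typing import Callable


def check_generated(path: Path, generate: Callable[[], str]) -> str | None:
    """Return None if `path` is up to date, else a lint-style failure message.

    `generate` stands in for the sandboxed generator: it returns the fresh
    content and never touches the workspace.
    """
    expected = generate()
    actual = path.read_text() if path.exists() else None
    if actual == expected:
        return None
    # The hint mirrors the proposal: tell the user which goal fixes the failure.
    return f"{path} is out of date. Run `./pants generate {path}` to regenerate it."


def write_generated(path: Path, generate: Callable[[], str]) -> None:
    """The `generate` side: materialize the fresh content in the workspace."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(generate())
```

Both functions call the same generator, which is what would let the check and the write share one cached result.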

For a plugin to provide this generate rule, it would either have to add a custom target, or add a field to an existing target to allow flagging it as owning generated files. This would be a codified assumption built into the generate goal: all generated files must be owned by exactly one file-generating target. If unowned, they will not be materialized in the workspace; if owned by more than one file-generating target, an error is raised. Generating a file that matches a file glob for that target counts as ownership.
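The "exactly one owner" rule can be sketched as follows. `GeneratingTarget` and `owning_target` are hypothetical names, and `fnmatch` is a simplification of Pants glob matching (its `*` also matches path separators).

```python
from __future__ import annotations

from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass(frozen=True)
class GeneratingTarget:
    """Hypothetical stand-in for a target flagged as owning generated files."""

    address: str
    globs: tuple[str, ...]


def owning_target(path: str, targets: list[GeneratingTarget]) -> GeneratingTarget | None:
    """Apply the 'exactly one owner' rule to a generated file path.

    Returns the single owner, or None when the file is unowned (in which case
    it would not be materialized in the workspace). Raises if ownership is ambiguous.
    """
    owners = [t for t in targets if any(fnmatch(path, g) for g in t.globs)]
    if len(owners) > 1:
        addresses = ", ".join(sorted(t.address for t in owners))
        raise ValueError(f"{path} is owned by more than one file-generating target: {addresses}")
    return owners[0] if owners else None
```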

Then, if other linters can also work with the generated files, the standard skip_tool fields provide an escape hatch when the generated code does not need to meet the same standards as other files. Or, for StackStorm, I can add an extra linter specific to the generated file that checks more than just whether or not it needs to be regenerated.

Goal: run

A lot of work has gone into runnable targets like adhoc_tool and shell_command. If combined with the generate goal, those could allow repos to use light-weight run targets and file generator targets instead of writing medium- or heavy-weight pants plugins.

So, a runnable target could be tied to the file generating target such that the adhoc_tool or pex_binary is used to actually do the generation when a user does ./pants generate path/to/generated/file.
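A rough sketch of that tie-in, assuming the generator prints the file content to stdout. The `GENERATORS` registry and `generate` function are hypothetical; in Pants the association would live on targets, not in a dict.

```python
from __future__ import annotations

import subprocess
import sys
from pathlib import Path

# Hypothetical registry: generated file -> argv of the runnable that produces it.
GENERATORS: dict[str, list[str]] = {
    "version.txt": [sys.executable, "-c", "print('1.2.3')"],
}


def generate(path: str, root: Path) -> Path:
    """Regenerate `path` by running its registered generator and saving stdout."""
    try:
        argv = GENERATORS[path]
    except KeyError:
        raise KeyError(f"no generator is registered for {path}") from None
    result = subprocess.run(argv, check=True, capture_output=True, text=True)
    destination = root / path
    destination.parent.mkdir(parents=True, exist_ok=True)
    destination.write_text(result.stdout)
    return destination
```

The caller names only the file to regenerate; the command and its arguments are recorded once alongside it.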

For the run goal, the UX focuses on the generator script/target. For this "generate" goal, the UX focuses on the files to generate. So they are complementary.

goal: go-generate

#16909

This seems very similar and could perhaps be rolled into the new "generate" goal. go-generate references the go modules that define the generation. I'm looking at something that targets the generated files. But, often go:generate commands work within the same module where they are defined, so maybe it is effectively the same thing.

But this is something that would not run via lint. So, that would probably be a go backend option to skip running for lint if the goals were combined.

discussion

Actually naming the new "generate" goal is a different discussion. When we discuss that, we'll need to make sure it can be distinguished from the codegen that already exists in pants.

So, would a generic generate goal be helpful for anyone else? Does a generate goal that follows a process similar to fix/fmt make sense for other projects?

huonw · 2023-02-14T20:41:01Z

huonw
Feb 14, 2023
Collaborator

Native ability to define simple targets that write back to the non-sandboxed filesystem (with built-in up-to-date-ness checking) would be great!

Some use-cases I've encountered:

files that are read by external systems straight from the repo, without the ability to execute code. Examples: GitHub Actions workflows (e.g. Pants' build-support/bin/generate_github_workflows.py), dependabot configuration.
"Snapshot" testing of critical derived data. For instance, the API schema derived from an API server checked in means any changes in the schema will be directly visible during code review, increasing the chance of catching issues like backwards incompatible changes.
"Legacy"-ish integration with non-pants parts of a repo. For instance: exporting a schema from a (Python) API and doing code generation for (TypeScript) clients.
Relatedly, even if the clients were fully integrated into pants, I'd be pretty tempted to keep checking in the generated clients, because my feeling is it generally integrates better with editors/IDEs etc. to have the source files just in the right place with all the other files, rather than needing to manually keep exporting and configuring the editors (but maybe we just haven't worked out our editors well enough).

Some small/bikeshed-y thoughts:

1. Subsuming generate-lockfiles seems like it'd be tricky, as I suspect you're aware:
   - The current behaviour of generate-lockfiles is to always update to the latest versions that satisfy constraints (e.g. Add ability to extend a PEX lockfile without modifying existing entries #15704), and it'd be very annoying if any run of pants generate :: did that too (I think that'd effectively mean pants generate :: would be unusable as is, and would always require filtering/exclusions/more specific targets)
   - Expanding on that, there are two meanings of "up to date" for lockfiles: (1) consistent with the declared requirements (a "local"/static property), and (2) using the latest versions satisfying the requirements (a "global"/mutable property: new versions of dependencies published to PyPI will break this check).
   - There are a few potentially related issues: New goal for managing 3rd party dependencies #12880, Rename generate-lockfiles to lock #15568, Add ability to extend a PEX lockfile without modifying existing entries #15704, Add ability to check if lockfiles are up-to-date with all declared requirements #15723 (maybe more...?)
2. It seems a bit weird to have so many goals that mutate files (fmt, fix, generate, and even run in some cases)?
3. Some generation processes might both be fairly heavyweight and have to trigger often. This means I'm not 100% sure running on lint would be a great user experience. It might be better to run on check or test (related to 2, there's a lot of goals that check files...).
   - For example, generating a schema from an API server may require starting that API, and pants will, correctly, rerun for any code change (even editing a comment in some tiny utility file that's a deep transitive dependency), even though most of them won't affect the schema.
   - allow users to assign tools to fmt and fix goals #18239 is related, since different generators might want to run at different "levels".
4. Allow using a shell command/adhoc tool as part of any goal (check, lint, fmt, fix, deploy, ...) #17729 is related, in that it's other thinking of how shell_command and/or adhoc_tool might mutate files directly.
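The two meanings of "up to date" for lockfiles described above can be told apart with a toy model. This is a sketch only: requirements are simplified to minimum versions, versions to tuples, and all names and data are hypothetical.

```python
from __future__ import annotations


def is_consistent(minimums: dict[str, tuple], locked: dict[str, tuple]) -> bool:
    """Meaning (1): every declared requirement is satisfied by its locked version.

    This depends only on files in the repo, so it cannot change by itself.
    """
    return all(
        name in locked and locked[name] >= minimum
        for name, minimum in minimums.items()
    )


def is_latest(minimums: dict[str, tuple], locked: dict[str, tuple], published: dict[str, list]) -> bool:
    """Meaning (2): every locked version is the newest published one that satisfies."""
    for name, minimum in minimums.items():
        newest = max((v for v in published.get(name, []) if v >= minimum), default=None)
        if locked.get(name) != newest:
            return False
    return True


minimums = {"requests": (2, 28)}
locked = {"requests": (2, 28, 1)}
published = {"requests": [(2, 27, 0), (2, 28, 1)]}

print(is_consistent(minimums, locked), is_latest(minimums, locked, published))

# A new upstream release changes meaning (2) without any change to the repo.
published["requests"].append((2, 31, 0))
print(is_consistent(minimums, locked), is_latest(minimums, locked, published))
```

Only the first check is a property of the repository alone, which is why it is the one that could sensibly run on every lint.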

3 replies

cognifloyd Feb 15, 2023
Collaborator Author

Subsuming generate-lockfiles seems like it'd be tricky, as I suspect you're aware:

Yes. I agree that subsuming generate-lockfiles would be very problematic. But, I found it helpful to think through the generation workflow to find similarities. There are enough differences that it should probably remain separate.

It seems a bit weird to have so many goals that mutate files (fmt, fix, generate, and even run in some cases)?

Yeah. That's frustrating.
- I started out just reusing fmt for this, but the changes to support fmt/fix batching removed the ability for fmt plugins to add new files. It's also conceptually different. I think the saving grace here is that running lint tells you which goal to run if something can be fixed automatically.
- When can run mutate files? adhoc_tools runs in a sandbox, and makes its output available via standard codegen (ephemeral non-version-controlled). I believe shell_command is similar.
- If run can have un-sandboxed side effects, then perhaps we could use that instead, but people will have to copy/paste all the args when regenerating something which feels error prone. run feels too ad-hoc for that.

Some generation processes might both be fairly heavyweight and have to trigger often. This means I'm not 100% sure running on lint would be a great user experience. It might be better to run on check or test (related to 2, there's a lot of goals that check files...).

Oh. I forgot about check. I have only used it in the pants repo, so I don't think of it much. It's sounding more and more like the generated target will need some kind of knob to say whether the output should be subject to lint or check or nothing (kind of like how go-generate is not currently tied into lint).

Allow using a shell command/adhoc tool as part of any goal (check, lint, fmt, fix, deploy, ...) #17729 is related, in that it's other thinking of how shell_command and/or adhoc_tool might mutate files directly.

Yes these are very interrelated, but the current goals don't logically match with something that generates files. So, we need something else as well.

huonw Feb 15, 2023
Collaborator

Yes. I agree that subsuming generate-lockfiles would be very problematic. But, I found it helpful to think through the generation workflow to find similarities. There are enough differences that it should probably remain separate.

👍

When can run mutate files? adhoc_tools runs in a sandbox, and makes its output available via standard codegen (ephemeral non-version-controlled). I believe shell_command is similar.

I believe something supporting run is executed with its working directory as the true repo root (although the sources/deps etc. are all in a sandbox) and thus has full access to mutate things. E.g. pants run whatever.py with the file below creates (or updates) last_run.txt.

# whatever.py
from datetime import datetime
import os
import sys

print(os.getcwd())
print(sys.path)

with open("last_run.txt", "w") as f:
    f.write(str(datetime.now()))

This seems to be how a generator like https://github.com/pantsbuild/pants/blob/99d56537fa8b9020c107ab347e9e6b3416fc52a8/build-support/bin/generate_github_workflows.py is recommended to be used.

(I think adhoc_tool and shell_command don't directly support the run goal.)

If run can have un-sandboxed side effects, then perhaps we could use that instead, but people will have to copy/paste all the args when regenerating something which feels error prone. run feels too ad-hoc for that.

Yeah, fully agreed that run isn't right. For instance, we have an alias like the following so one can just run pants quick-checks and get all the obvious stuff fixed/checked, and it'd be great if codegen tasks could happen as part of it too:

[cli.alias]
quick-checks = "tailor update-build-files fmt lint check ::"

A new generate goal as you've designed (or reusing the fix goal, or whatever) presumably can just be added, but run path/to/generator.py cannot (with Pants' current behaviour: the target selection doesn't work out).

cognifloyd Feb 15, 2023
Collaborator Author

Hmm. I guess my mental model around run and adhoc_tools wasn't quite accurate. Thank you for also confirming (err agreeing with me) that it isn't a good fit for code generation.

huonw · 2023-08-07T07:30:28Z

huonw
Aug 7, 2023
Collaborator

Just recording the workaround we use for this for now:

- have the codegen as a normal shell_command or adhoc_tool or whatever target that generates the sources
- a separate run_shell_command target that copies the generated files from the {chroot} sandbox to the right place on disk, which can be run like pants run path/to:target
- a files (or resources) target for the files on disk
- an experimental_test_shell_command that diffs the generated files and the ones on disk, to confirm that the codegen matches what's checked in

E.g. for a file foo.txt something vaguely like: (can work for a directory too, with some variations like using rsync instead of cp and diff -r)

# path/to/BUILD

shell_command(
    name="generate-foo",
    command="echo abc > foo-generated.txt",
    tools=["echo"],
    output_files=["foo-generated.txt"],
)

files(name="foo", sources=["foo.txt"])
run_shell_command(
    name="write-foo",
    # different file name so that they can be compared in the test
    command="cp {chroot}/path/to/foo-generated.txt foo.txt",
    execution_dependencies=[":generate-foo"]
)

experimental_test_shell_command(
    # if this fails, run `pants run path/to:write-foo`
    name="check-foo-is-up-to-date",
    command="diff foo.txt foo-generated.txt",
    # only tools listed here are available on the sandbox's PATH
    tools=["diff"],
    execution_dependencies=[":generate-foo", ":foo"],
)

1 reply

huonw Jul 11, 2025
Collaborator

Now packaged up into a macro:

def checked_in_codegen(
    *,
    name,
    output_file = None,
    output_file_from_stdout=False,
    output_directory = None,
    help_text_on_failure,
    codegen_target,
    codegen_target_fields,
    extra_test_target_fields=None,
):
    """A target that packages up code generation and checking it into the repo.

    Exactly one of output_file or output_directory must be specified.

    This creates various sub-targets. The interesting ones are:

    - `{name}`: a file or files target referring to the checked-in file(s) on disk
    - `{name}-write`: a `pants run`-able target that runs the code generation and updates the file(s) on disk
    - `{name}-is-up-to-date`: a `pants test`-able target that checks the files on disk match the result of re-running codegen.

    FIXME: https://github.com/pantsbuild/pants/discussions/18235 this codegen should be maintained
    as part of a native goal, e.g. a `save="."` field to `adhoc_tool` or similar.

    Fields:
      name:
        The base name to use (also ends up as the name of the 'file' target for the result of the generation)

      output_file:
        The name of the file that should be saved to the repository (becomes a sibling of the BUILD file
        calling this macro). Mutually exclusive with `output_directory`.

      output_file_from_stdout:
        If True, the `output_file` is created from what the script prints; if False (default), the
        `output_file` is created on disk by executing `codegen_target(...)` within its sandbox, and
        captured from that. Must not be set when using `output_directory`.

      output_directory:
        The name of the directory that should be saved to the repository (the _directory_ becomes a
        sibling of the BUILD file calling this macro, i.e. any files within the directory are niblings of the
        BUILD file, not siblings). Mutually exclusive with `output_file`.

      help_text_on_failure:
        Informative text to print if the "up to date?" test fails, usually "how to run the codegen generation"

      codegen_target:
        A Pants target value to run the codegen, usually either `shell_command` or `adhoc_tool`.

      codegen_target_fields:
        Any fields required for `codegen_target`; used like `codegen_target(..., **codegen_target_fields)`.

      extra_test_target_fields:
        Any additional fields to pass to the "up to date?" test `test_shell_command` target.

    """
    # There are three basic steps to this:
    #
    # 1. doing the codegen
    # 2. writing it to disk (on demand)
    # 3. verifying that what's on disk is up to date (in CI etc.)

    # Step 1: codegen
    output_fields = {}

    # validate that the output configuration makes sense: either one output file
    # (potentially from stdout) or one output directory, but not both and not neither
    if output_file is not None:
        if output_directory is not None:
            raise ValueError(f"{build_file_dir()}:{name}: expected one of output_file or output_directory, found both {output_file=} and {output_directory=}")

        if output_file_from_stdout:
            output_fields["stdout"] = output_file
        else:
            output_fields["output_files"] = [output_file]

        output_pathname = output_file

    elif output_directory is not None:
        if output_file_from_stdout:
            raise ValueError(f"{build_file_dir()}:{name}: cannot set output_file_from_stdout when using output_directory")

        output_fields["output_directories"] = [output_directory]
        # NB. the trailing / is required for the rsync below, to get uniform handling between
        # output_file vs. output_directory: suppose we have output_directory="X", with contents
        # file.txt (i.e. X/file.txt exists):
        #
        # - `rsync X Y`, will result in Y/X/file.txt ❌
        # - `rsync X/ Y`, will result in Y/file.txt ✅
        output_pathname = f"{output_directory}/"

    else:
        raise ValueError(f"{build_file_dir()}:{name}: expected one of output_file or output_directory, found neither")


    raw_codegen_tgt = codegen_target(
        name=f"{name}-codegen-raw",
        **codegen_target_fields,
        **output_fields,
    )
    raw_codegen = f":{raw_codegen_tgt.name}"

    # NB. the codegened file name and what's on disk are the same, but we need to be able to tell
    # them apart, for the test, so we put the codegen into a dedicated directory, as an
    # implementation detail of this target
    codegen_dir = build_file_dir() / "_codegen"
    codegen_tgt = relocated_files(
        name=f"{name}-codegen",
        files_targets=[raw_codegen],
        src=str(build_file_dir()),
        dest=str(codegen_dir),
    )
    codegen = f":{codegen_tgt.name}"

    # Step 2: writing it to disk
    run_shell_command(
        name=f"{name}-write",
        command=f"""
        rsync --verbose --recursive --archive {{chroot}}/{codegen_dir}/{output_pathname} ./{output_pathname}
        """,
        execution_dependencies=[codegen],
    )

    if output_file is not None:
        assert output_directory is None, "checked above"
        checked_in_tgt = file(name=name, source=output_file)
    else:
        assert output_directory is not None, "checked above"
        checked_in_tgt = files(name=name, sources=[f"{output_directory}/**"])
    checked_in = f":{checked_in_tgt.name}"

    # Step 3: verifying up to date-ness

    if extra_test_target_fields is None:
        extra_test_target_fields = {}

    test_shell_command(
        name=f"{name}-is-up-to-date",
        tools=["diff"],
        command=f"""
        if ! diff --unified=1 {{chroot}}/{codegen_dir / output_pathname} {output_pathname}; then
            echo
            echo '{help_text_on_failure}'
            exit 1
        fi
        """,
        execution_dependencies=[codegen, checked_in],
        **extra_test_target_fields,
    )

We use it, for instance, for generating models using https://github.com/koxudaxi/datamodel-code-generator/, so we end up with say:

python_requirement(
    name="_datamodel-code-generator",
    requirements=["datamodel-code-generator==0.31.2"],
    # separate resolve to avoid infecting the runtime libraries with this build-time tool's
    # dependency constraints
    resolve="datamodel-codegen",
)
# FIXME (https://github.com/pantsbuild/pants/issues/20108): running the python_requirement directly
# generally works... except in CI where the "named caches" aren't preserved across runs, so we
# work around this by reifying an explicit PEX. Once running a requirement directly is reliable, this
# can be removed.
pex_binary(
    name="datamodel-codegen",
    dependencies=[":_datamodel-code-generator"],
    script="datamodel-codegen",
    resolve="datamodel-codegen",
    layout="packed",
)

file(name="json-schema", source="schema.json")

checked_in_codegen(
    name="model-codegen",
    output_file="__init__.py",
    codegen_target=adhoc_tool,
    codegen_target_fields=dict(
        runnable=":datamodel-codegen",
        args=[
            "--input=schema.json",
            "--input-file-type=jsonschema",
            "--output=__init__.py",
...
        ],
        execution_dependencies=[":json-schema"],
    ),
    help_text_on_failure="❌ Models not matched, run `pants run path/to/here:model-codegen-write` to update",
)

This isn't as automatic as would be nice, but it does at least have the tests that run on pants test :: to check for up-to-date-ness and then print out the command to run.

For common codegen tasks like this, we create aliases, e.g. pants custom-model-codegen (and then have the help_text_on_failure reference that, rather than the full pants run ...):

# pants.toml
[cli.alias]
custom-model-codegen = "run //path/to/here:model-codegen-write"

salotz · 2025-06-13T19:11:08Z

salotz
Jun 13, 2025

I would also add that it would be useful to be able to write out the outputs of tests as well. For example, some of them run long-running computations that I can't rerun all the time to get the outputs. Normally I write them out as one "test" and then have other acceptance tests that just assume the data was already generated locally. (Additionally, I version control them with a data version control system (dvc).)
I get that I can have different targets for this, but it would duplicate a lot since the data generation is a test as well.

0 replies
