Skip to content

[mlgo-utils] Hoist entrypoint scripts to mlgo-utils directory #146981

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

thevinster
Copy link
Contributor

@thevinster thevinster commented Jul 4, 2025

These scripts belong in the mlgo-utils directory when directly used with python3. But since they are also used to package with pip, hoist the entrypoint scripts to mlgo-utils directory. Adjust the bazel paths to account for this as well. This follows the same structure as lit.

$ python3 ./llvm/utils/mlgo-utils/extract_ir.py --help                                                                                                        1s
usage: extract_ir.py [-h] [--input INPUT] [--input_type [{json,params,directory,bazel_aquery}]] [--output_dir OUTPUT_DIR] [--num_workers [NUM_WORKERS]] [--llvm_objcopy_path [LLVM_OBJCOPY_PATH]]
                     [--obj_base_dir [OBJ_BASE_DIR]] [--cmd_filter [CMD_FILTER]] [--thinlto_build [{distributed,local}]] [--cmd_section_name [CMD_SECTION_NAME]]
                     [--bitcode_section_name [BITCODE_SECTION_NAME]] [--verbosity [{DEBUG,INFO,WARNING,ERROR}]]
...

$ python3 ./llvm/utils/mlgo-utils/combine_training_corpus.py --help                                                                                           0s
usage: combine_training_corpus.py [-h] [--root_dir ROOT_DIR] [--verbosity [{DEBUG,INFO,WARNING,ERROR}]]
...

$ python3 ./llvm/utils/mlgo-utils/make_corpus.py --help                                                                                                      0s
usage: make_corpus.py [-h] [--input_dir INPUT_DIR] [--output_dir OUTPUT_DIR] [--default_args [DEFAULT_ARGS]]
...

Verified that I was also able to build the package successfully and use the script.

@thevinster thevinster requested a review from boomanaiden154 July 4, 2025 01:28
@llvmbot llvmbot added mlgo bazel "Peripheral" support tier build system: utils/bazel labels Jul 4, 2025
@thevinster thevinster requested a review from smeenai July 4, 2025 01:28
@llvmbot
Copy link
Member

llvmbot commented Jul 4, 2025

@llvm/pr-subscribers-mlgo

Author: Vincent Lee (thevinster)

Changes

These scripts belong in the mlgo-utils directory when directly used with python3. Adjust the bazel paths to account for this as well. This follows the same structure as lit.

$ python3 ./llvm/utils/mlgo-utils/extract_ir.py --help                                                                                                        1s
usage: extract_ir.py [-h] [--input INPUT] [--input_type [{json,params,directory,bazel_aquery}]] [--output_dir OUTPUT_DIR] [--num_workers [NUM_WORKERS]] [--llvm_objcopy_path [LLVM_OBJCOPY_PATH]]
                     [--obj_base_dir [OBJ_BASE_DIR]] [--cmd_filter [CMD_FILTER]] [--thinlto_build [{distributed,local}]] [--cmd_section_name [CMD_SECTION_NAME]]
                     [--bitcode_section_name [BITCODE_SECTION_NAME]] [--verbosity [{DEBUG,INFO,WARNING,ERROR}]]
...

$ python3 ./llvm/utils/mlgo-utils/combine_training_corpus.py --help                                                                                           0s
usage: combine_training_corpus.py [-h] [--root_dir ROOT_DIR] [--verbosity [{DEBUG,INFO,WARNING,ERROR}]]
...

$ python3 ./llvm/utils/mlgo-utils/make_corpus.py --help                                                                                                      0s
usage: make_corpus.py [-h] [--input_dir INPUT_DIR] [--output_dir OUTPUT_DIR] [--default_args [DEFAULT_ARGS]]
...

Full diff: https://github.com/llvm/llvm-project/pull/146981.diff

4 Files Affected:

  • (renamed) llvm/utils/mlgo-utils/combine_training_corpus.py ()
  • (renamed) llvm/utils/mlgo-utils/extract_ir.py ()
  • (renamed) llvm/utils/mlgo-utils/make_corpus.py ()
  • (modified) utils/bazel/llvm-project-overlay/llvm/BUILD.bazel (+3-3)
diff --git a/llvm/utils/mlgo-utils/mlgo/corpus/combine_training_corpus.py b/llvm/utils/mlgo-utils/combine_training_corpus.py
similarity index 100%
rename from llvm/utils/mlgo-utils/mlgo/corpus/combine_training_corpus.py
rename to llvm/utils/mlgo-utils/combine_training_corpus.py
diff --git a/llvm/utils/mlgo-utils/mlgo/corpus/extract_ir.py b/llvm/utils/mlgo-utils/extract_ir.py
similarity index 100%
rename from llvm/utils/mlgo-utils/mlgo/corpus/extract_ir.py
rename to llvm/utils/mlgo-utils/extract_ir.py
diff --git a/llvm/utils/mlgo-utils/mlgo/corpus/make_corpus.py b/llvm/utils/mlgo-utils/make_corpus.py
similarity index 100%
rename from llvm/utils/mlgo-utils/mlgo/corpus/make_corpus.py
rename to llvm/utils/mlgo-utils/make_corpus.py
diff --git a/utils/bazel/llvm-project-overlay/llvm/BUILD.bazel b/utils/bazel/llvm-project-overlay/llvm/BUILD.bazel
index b618c74c19da1..db8a92fd25de6 100644
--- a/utils/bazel/llvm-project-overlay/llvm/BUILD.bazel
+++ b/utils/bazel/llvm-project-overlay/llvm/BUILD.bazel
@@ -5210,8 +5210,8 @@ py_binary(
 py_binary(
     name = "extract_ir",
     srcs = [
+        "utils/mlgo-utils/extract_ir.py",
         "utils/mlgo-utils/mlgo/__init__.py",
-        "utils/mlgo-utils/mlgo/corpus/extract_ir.py",
         "utils/mlgo-utils/mlgo/corpus/extract_ir_lib.py",
         "utils/mlgo-utils/mlgo/corpus/flags.py",
     ],
@@ -5221,8 +5221,8 @@ py_binary(
 py_binary(
     name = "combine_training_corpus",
     srcs = [
+        "utils/mlgo-utils/combine_training_corpus.py",
         "utils/mlgo-utils/mlgo/__init__.py",
-        "utils/mlgo-utils/mlgo/corpus/combine_training_corpus.py",
         "utils/mlgo-utils/mlgo/corpus/combine_training_corpus_lib.py",
         "utils/mlgo-utils/mlgo/corpus/flags.py",
     ],
@@ -5232,8 +5232,8 @@ py_binary(
 py_binary(
     name = "make_corpus",
     srcs = [
+        "utils/mlgo-utils/make_corpus.py",
         "utils/mlgo-utils/mlgo/__init__.py",
-        "utils/mlgo-utils/mlgo/corpus/make_corpus.py",
         "utils/mlgo-utils/mlgo/corpus/make_corpus_lib.py",
     ],
     imports = ["utils/mlgo-utils"],

Copy link
Contributor

@boomanaiden154 boomanaiden154 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you tested that the binaries still work like this when installed through the pip package? That's the primary way I've been using them.

Why exactly does this need to be changed though? I don't know if the external definitions have ever been tested, and the internal (to Google) rules for these binaries are a bit different because you can't import a python project with a hyphen in it (like llvm-project), but it works just fine after symlinking all the files to a different directory in the same structure.

@thevinster
Copy link
Contributor Author

Have you tested that the binaries still work like this when installed through the pip package? That's the primary way I've been using them.

I didn't realize these could also be installed through pip. From the looks of it, there's a top level wrapper that I can see that essentially just imports the main file and then executes it. This is the model that lit has well, with the exception that the source tree doesn't contain that wrapper in llvm that pip has created.

Why exactly does this need to be changed though? I don't know if the external definitions have ever been tested, and the internal (to Google) rules for these binaries are a bit different because you can't import a python project with a hyphen in it (like llvm-project), but it works just fine after symlinking all the files to a different directory in the same structure.

The purpose of moving is not related to the script itself but more related to the structure of those files. The context here is that there's another build system (buck) which has very similar internal mechanics to bazel (both operate on Starlark). We're trying to experiment by converting bazel rules to buck rules so we can also be able to build llvm and friends with buck. It would be great if we can rely on the bazel definitions as is to build the same target. Unfortunately, with buck, we'd have to execute the binary as if it was invoked by the interpreter directly (which you can see in the summary).

This PR would allow us to converge on one implementation that bazel and buck could use. If this is breaking some other mechanism, we can always do other internal hacks to get it working. I figured I'd try to see if we can have harmony first between the build systems first before resorting to other solutions.

@boomanaiden154
Copy link
Contributor

I didn't realize these could also be installed through pip. From the looks of it, there's a top level wrapper that I can see that essentially just imports the main file and then executes it. This is the model that lit has well, with the exception that the source tree doesn't contain that wrapper in llvm that pip has created.

Can you build the package locally, install it, and test that the binaries work as expected?

This PR would allow us to converge on one implementation that bazel and buck could use. If this is breaking some other mechanism, we can always do other internal hacks to get it working. I figured I'd try to see if we can have harmony first between the build systems first before resorting to other solutions.

I don't think this should break anything on the bazel/blaze side. I'm fine with this patch going in if it doesn't break any functionality in the pip package.

Copy link

github-actions bot commented Jul 4, 2025

✅ With the latest revision this PR passed the Python code formatter.

@thevinster
Copy link
Contributor Author

@boomanaiden154 To avoid breaking any of the pip packaging functionality, I added the same top level wrapper referencing those same files that pip install also does. With this, the bazel changes will reference these top level wrappers instead which fixes our side without needing to worry if it breaks pip packaging.

@thevinster thevinster changed the title [mlgo-utils] Hoist entry script out to the correct directory [mlgo-utils] Create top level entrypoint scripts for mlgo-utils Jul 7, 2025
@MatzeB
Copy link
Contributor

MatzeB commented Jul 7, 2025

Isn't the "pythonic" way to run this: python3 -m llvm.utils.mlgo-utils.extract_ir --help etc?

ignore this, missing context here for a quick answer :) (it's not a problem of sub-packages given the directory name has a slash inside which cannot be a python package name...)

@boomanaiden154
Copy link
Contributor

To avoid breaking any of the pip packaging functionality, I added the same top level wrapper referencing those same files that pip install also does. With this, the bazel changes will reference these top level wrappers instead which fixes our side without needing to worry if it breaks pip packaging.

I would prefer to not have the wrappers in tree if possible.

Unfortunately, with buck, we'd have to execute the binary as if it was invoked by the interpreter directly (which you can see in the summary).

I'm not really understanding how this constraint leads to these changes. Are there constraints on the CWD of the interpreter or something?

@thevinster
Copy link
Contributor Author

I would prefer to not have the wrappers in tree if possible.

An alternative is to have a symlink of the script to the mlgo-utils directory. This avoids the need to have to change two places if the script were to change. Otherwise, I'm open to other ideas on how we can fix it.

I'm not really understanding how this constraint leads to these changes. Are there constraints on the CWD of the interpreter or something?

It's not the cwd of the interpreter. It's just placed incorrectly if invoked directly from python3. The file imports from mlgo and the script lives inside it when it should live beside it. The test commands I have are repros of the issue, if that helps.

@boomanaiden154
Copy link
Contributor

n alternative is to have a symlink of the script to the mlgo-utils directory

I think a symlink would be fine for now.

It's not the cwd of the interpreter. It's just placed incorrectly if invoked directly from python3. The file imports from mlgo and the script lives inside it when it should live beside it.

Buck doesn't do anything to $PYTHONPATH? The following works just fine for invoking one of the scripts:

cd llvm/utils/mlgo-utils
PYTHONPATH=$PYTHONPATH:. python3 ./mlgo/corpus/extract_ir.py

@thevinster
Copy link
Contributor Author

I think a symlink would be fine for now.

With the symlink solution, I needed to move the scripts to mlgo-utils directory. I created respective symlinks within the mlgo/corpus. Verified by building the package and was still able to execute successfully. Let me know if there's some other path I need to test as well.

Buck doesn't do anything to $PYTHONPATH?

Due to constraints out of my reach, I don't have the ability to modify the environment. Even if I could, it's generally discouraged as build systems are meant to handle paths and dependencies internally.

@thevinster thevinster changed the title [mlgo-utils] Create top level entrypoint scripts for mlgo-utils [mlgo-utils] Hoist entrypoint scripts to mlgo-utils directory Jul 9, 2025
Copy link
Contributor

@boomanaiden154 boomanaiden154 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please leave the files in their original locations and create the symlinks in the root directory.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bazel "Peripheral" support tier build system: utils/bazel mlgo
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants