@astefanutti
Contributor

What this PR does / why we need it:

This PR writes the train function script into a tmp directory so it can be read when TrainJob containers run as non-root.
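As a rough illustration of the idea (not the code in this PR), the sketch below builds a shell snippet that writes the function script into a freshly created temporary directory and then runs it. The helper name `build_command`, the use of `mktemp -d`, and the `shlex.quote` escaping are all assumptions made for this example.

```python
import shlex


def build_command(script: str, func_file: str, entrypoint: str) -> str:
    """Return a shell snippet that writes `script` to a temp dir and runs it.

    Hypothetical helper: the name and the quoting strategy are assumptions.
    """
    quoted_file = shlex.quote(func_file)
    return (
        # mktemp -d creates a directory owned by the current (possibly random) UID.
        "program_path=$(mktemp -d)\n"
        # Write the script verbatim; shlex.quote protects quotes and newlines.
        f'printf "%s" {shlex.quote(script)} > "$program_path"/{quoted_file}\n'
        # Run the entrypoint (e.g. python or torchrun) on the written file.
        f'{entrypoint} "$program_path"/{quoted_file}'
    )


if __name__ == "__main__":
    print(build_command("print('hello')\n", "train.py", "python"))
```

Because the directory is created by the container's own user at run time, it should be writable regardless of which UID the container is assigned.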

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):

Fixes #2372.

Checklist:

  • Docs included if any changes are user-facing

@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign tenzen-y for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coveralls

coveralls commented Jan 7, 2025

Pull Request Test Coverage Report for Build 12695578194

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 100.0%

Totals Coverage Status
Change from base Build 12692161092: 0.0%
Covered Lines: 85
Relevant Lines: 85

💛 - Coveralls

Comment on lines 169 to 170
printf "%s" \"$SCRIPT\" > \"$program_path\"/\"{func_file}\"
{entrypoint} \"$program_path\"/\"{func_file}\""""
Member

@andreyvelich andreyvelich Jan 7, 2025

If we create this file under the HOME directory, will it still fail for non-root containers?

I intentionally write the script into the same file so that the file system layout is the same in the user's workspace and in the Kubernetes Pod. This means users can use standard Python imports while developing ML training code, similar to what we showcased in the KubeCon demo: https://youtu.be/Lgy4ir1AhYw?t=446

Related: https://github.com/kubeflow/training-operator/issues/2347.

cc @shravan-achar

Contributor Author

With the stricter security policy, containers run with a random UID and have no fixed HOME directory.

I need to research the other possible options a bit more, and I'll get back to you.

Contributor Author

I've given this another cycle and it doesn't seem possible to change the working directory permissions easily.

On the other hand, mounting an emptyDir volume as the /workspace directory in the training runtime works as expected with the stricter security policy.

My understanding is that this approach aligns well with the initializer and worker pods coordinating via PVCs. The default emptyDir fits the initializers currently running as init containers, and it can be swapped for a PVC mount once they run as separate Jobs.
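To make the emptyDir option above concrete, here is a minimal sketch that adds such a volume to a pod spec expressed as a plain Python dict. The helper name `with_workspace_volume`, the volume name `workspace`, and the container name `trainer` are assumptions for illustration; only the `volumes`, `emptyDir`, `volumeMounts`, and `mountPath` fields come from the Kubernetes Pod spec.

```python
import copy


def with_workspace_volume(
    pod_spec: dict, container_name: str, mount_path: str = "/workspace"
) -> dict:
    """Return a copy of `pod_spec` with an emptyDir mounted at `mount_path`.

    Hypothetical helper: the volume name "workspace" is an assumption.
    """
    spec = copy.deepcopy(pod_spec)  # leave the caller's spec untouched
    # Declare the pod-level emptyDir volume.
    spec.setdefault("volumes", []).append({"name": "workspace", "emptyDir": {}})
    # Mount it into the matching container only.
    for container in spec.get("containers", []):
        if container.get("name") == container_name:
            container.setdefault("volumeMounts", []).append(
                {"name": "workspace", "mountPath": mount_path}
            )
    return spec


if __name__ == "__main__":
    base = {"containers": [{"name": "trainer", "image": "example-image"}]}
    print(with_workspace_volume(base, "trainer"))
```

Replacing the `emptyDir` entry with a `persistentVolumeClaim` reference would be the natural adaptation if the initializers later run as separate Jobs.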

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
@github-actions

github-actions bot commented Apr 9, 2025

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@andreyvelich
Member

/remove-lifecycle stale

@astefanutti
Contributor Author

Let's close this for now. We'll revisit it with the approach based on PVCs.

Development

Successfully merging this pull request may close these issues.

Permission denied when reading TrainJob function script when run as non-root user

3 participants