Add e2e test for train API #2199
Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
Pull Request Test Coverage Report for Build 12449362407
💛 - Coveralls
@andreyvelich I've separated the e2e test for the train API, and it works now. Please review when you have time.
Thank you for this. Overall LGTM, just a small comment.
/assign @deepanker13 @kubeflow/wg-training-leads @Electronic-Waste
/lgtm
I've updated the Kubernetes version to |
Basically LGTM. I left some comments for you @helenxie-bit
strategy:
  fail-fast: false
  matrix:
    kubernetes-version: ["v1.31.4"]
Shall we change the Kubernetes version to align with the other CI tests? Like:
Just to save compute resources, I think for now we can run this test on a single k8s version, since we run the rest of the E2E tests on all versions.
WDYT @Electronic-Waste ?
Yeah, I agree. Maybe we can select one k8s version from this list:)
OK, let me change the version to v1.30.6. And we can update it if needed in the future.
Given that we support 1.28-1.31, I would suggest that we run our integration tests on 1.29, 1.30, and 1.31. We can update that in a following PR.
For the train API tests, I think running them on 1.31 should be sufficient.
Sounds good. So I think we will still keep the v1.31.4 version.
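For illustration, the outcome of this thread could be sketched as the workflow fragment below. This is only a sketch: the job names are hypothetical, the patch versions for 1.29, 1.30, and 1.31 in the integration-test matrix are placeholders that would need real values, and `runs-on` and `steps` are left out. Only the single-version matrix with `v1.31.4` comes from the diff quoted above.

```yaml
# Sketch only: job names and the "x" patch versions are placeholders,
# not values taken from this PR.
jobs:
  integration-test:              # hypothetical job name
    strategy:
      fail-fast: false
      matrix:
        # Suggested for a follow-up PR: three of the supported minor versions.
        kubernetes-version: ["v1.29.x", "v1.30.x", "v1.31.x"]

  train-api-e2e-test:            # hypothetical job name
    strategy:
      fail-fast: false
      matrix:
        # One version only, to save compute resources.
        kubernetes-version: ["v1.31.4"]
```

Keeping the train API test on a one-entry matrix means a later change of version stays a one-line edit.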
Thank you for this effort @helenxie-bit!
/lgtm
/approve
/hold cancel
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: andreyvelich, tenzen-y
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
* add e2e test for train API
* fix peft import error
* update settings of the job
* fix format
* fix format
* fix error detection
* resolve conflict
* resolve conflict
* resolve conflict
* fix format
* fix NoneType error
* fix format
* test bug
* find bug
* find bug
* find bug
* add storage_config
* fix format
* reduce pvc size
* set storage_config
* set storage_config
* set storage_config
* set storage_config
* use gpu
* use gpu
* use gpu
* fix 'set_device' error
* add timeout error
* fix format
* fix format
* fix format
* fix typo
* update e2e test for train api
* add num_labels
* update pip install
* check disk space
* change sequence of e2e tests
* add clean-up after each e2e test of pytorchjob
* update cleanup function
* update cleanup function
* update cleanup function-add check disk
* check docker volumes
* update cleanup function
* update cleanup function
* check docker directory
* update pip install and 'num_workers'
* update pip install and 'num_workers'
* update pip install
* change the value of 'clean_pod_policy'
* change the value of 'update cleanup function
* update cleanup function
* update cleanup function
* check docker volumes
* check docker volumes
* stop the controller and restart it again to clean up
* update cleanup function
* update cleanup function
* update cleanup function
* separate e2e test for train api
* fix format
* fix parameter of namespace
* fix format
* reduce resources
* separate e2e test for train API
* remove go setup
* adjust the version of k8s
* move test file to new place
* fix typos
* rerun tests
* update install packages
* build and verify images of storage-intializer and trainer
* fix image build error
* fix image build error
* check disk space
* make 'setup-storage-initializer-and-trainer' executable
* separate step of loading images
* check disk space after loading image
* clean up and check disk space
* prune docker build cache
* prune docker build cache
* adjust sequence of building and loading images
* move working directory
* delete moving working directory
* fix format
* use 'docker system prune'
* make the format of the commands to be consistent
* update base image
* update base image
* update base image
* delete unnecessary space clear and check code
* merge e2e test for train api into integration tests
* check for timeout error
* fix name of trainer image
* fix env of building storage initializer image
* clean format
* skip e2e test for train API when use scheduling
* Update name of fileholder
  Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
  Signed-off-by: Hezhi (Helen) Xie <hezxie@ucdavis.edu>
* fix format
* separate e2e test for train API
* fix format
* move test script
* update path to test script
* update path to test script
* rerun tests
* rerun tests
* rerun tests
* update kubernetes version
* update kubernetes version
* rerun tests
* rerun tests
* adjust kubernetes version to 1.30.6
* adjust kubernetes version to 1.31.4

---------

Signed-off-by: helenxie-bit <helenxiehz@gmail.com>
Signed-off-by: Hezhi (Helen) Xie <hezxie@ucdavis.edu>
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* add e2e test for train API Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * fix peft import error Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * update settings of the job Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * fix format Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * fix format Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * fix error detection Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * resolve conflict Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * resolve conflict Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * resolve conflict Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * fix format Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * fix NoneType error Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * fix format Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * test bug Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * find bug Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * find bug Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * find bug Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * add storage_config Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * fix format Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * reduce pvc size Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * set storage_config Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * set storage_config Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * set storage_config Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * set storage_config Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * use gpu Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * use gpu Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * use gpu Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * fix 'set_device' error Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * add timeout error Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * fix format Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * fix format 
Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * fix format Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * fix typo Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * update e2e test for train api Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * add num_labels Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * update pip install Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * check disk space Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * change sequence of e2e tests Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * add clean-up after each e2e test of pytorchjob Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * update cleanup function Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * update cleanup function Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * update cleanup function-add check disk Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * check docker volumes Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * update cleanup function Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * update cleanup function Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * check docker directory Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * update pip install and 'num_workers' Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * update pip install and 'num_workers' Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * update pip install Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * change the value of 'clean_pod_policy' Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * change the value of 'update cleanup function Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * update cleanup function Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * update cleanup function Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * check docker volumes Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * check docker volumes Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * stop the controller and restart it again to clean 
up Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * update cleanup function Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * update cleanup function Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * update cleanup function Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * separate e2e test for train api Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * fix format Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * fix parameter of namespace Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * fix format Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * reduce resources Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * separate e2e test for train API Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * remove go setup Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * adjust the version of k8s Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * move test file to new place Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * fix typos Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * rerun tests Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * update install packages Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * build and verify images of storage-intializer and trainer Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * fix image build error Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * fix image build error Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * check disk space Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * make 'setup-storage-initializer-and-trainer' executable Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * separate step of loading images Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * check disk space after loading image Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * clean up and check disk space Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * prune docker build cache Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * prune docker build cache Signed-off-by: helenxie-bit 
<helenxiehz@gmail.com> * adjust sequence of building and loading images Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * move working directory Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * delete moving working directory Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * fix format Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * use 'docker system prune' Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * make the format of the commands to be consistent Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * update base image Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * update base image Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * update base image Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * delete unnecessary space clear and check code Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * merge e2e test for train api into integration tests Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * check for timeout error Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * fix name of trainer image Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * fix env of building storage initializer image Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * clean format Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * skip e2e test for train API when use scheduling Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * Update name of fileholder Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Signed-off-by: Hezhi (Helen) Xie <hezxie@ucdavis.edu> * fix format Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * separate e2e test for train API Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * fix format Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * move test script Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * update path to test script Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * update path to test script Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * rerun tests Signed-off-by: 
helenxie-bit <helenxiehz@gmail.com> * rerun tests Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * rerun tests Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * update kubernetes version Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * update kubernetes version Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * rerun tests Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * rerun tests Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * adjust kubernetes version to 1.30.6 Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * adjust kubernetes version to 1.31.4 Signed-off-by: helenxie-bit <helenxiehz@gmail.com> --------- Signed-off-by: helenxie-bit <helenxiehz@gmail.com> Signed-off-by: Hezhi (Helen) Xie <hezxie@ucdavis.edu> Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* add e2e test for train API Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * fix peft import error Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * update settings of the job Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * fix format Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * fix format Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * fix error detection Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * resolve conflict Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * resolve conflict Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * resolve conflict Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * fix format Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * fix NoneType error Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * fix format Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * test bug Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * find bug Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * find bug Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * find bug Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * add storage_config Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * fix format Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * reduce pvc size Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * set storage_config Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * set storage_config Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * set storage_config Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * set storage_config Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * use gpu Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * use gpu Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * use gpu Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * fix 'set_device' error Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * add timeout error Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * fix format Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * fix format 
Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * fix format Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * fix typo Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * update e2e test for train api Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * add num_labels Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * update pip install Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * check disk space Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * change sequence of e2e tests Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * add clean-up after each e2e test of pytorchjob Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * update cleanup function Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * update cleanup function Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * update cleanup function-add check disk Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * check docker volumes Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * update cleanup function Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * update cleanup function Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * check docker directory Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * update pip install and 'num_workers' Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * update pip install and 'num_workers' Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * update pip install Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * change the value of 'clean_pod_policy' Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * change the value of 'update cleanup function Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * update cleanup function Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * update cleanup function Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * check docker volumes Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * check docker volumes Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * stop the controller and restart it again to clean 
up Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * update cleanup function Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * update cleanup function Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * update cleanup function Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * separate e2e test for train api Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * fix format Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * fix parameter of namespace Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * fix format Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * reduce resources Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * separate e2e test for train API Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * remove go setup Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * adjust the version of k8s Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * move test file to new place Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * fix typos Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * rerun tests Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * update install packages Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * build and verify images of storage-intializer and trainer Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * fix image build error Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * fix image build error Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * check disk space Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * make 'setup-storage-initializer-and-trainer' executable Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * separate step of loading images Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * check disk space after loading image Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * clean up and check disk space Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * prune docker build cache Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * prune docker build cache Signed-off-by: helenxie-bit 
<helenxiehz@gmail.com> * adjust sequence of building and loading images Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * move working directory Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * delete moving working directory Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * fix format Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * use 'docker system prune' Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * make the format of the commands to be consistent Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * update base image Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * update base image Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * update base image Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * delete unnecessary space clear and check code Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * merge e2e test for train api into integration tests Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * check for timeout error Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * fix name of trainer image Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * fix env of building storage initializer image Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * clean format Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * skip e2e test for train API when use scheduling Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * Update name of fileholder Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Signed-off-by: Hezhi (Helen) Xie <hezxie@ucdavis.edu> * fix format Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * separate e2e test for train API Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * fix format Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * move test script Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * update path to test script Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * update path to test script Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * rerun tests Signed-off-by: 
helenxie-bit <helenxiehz@gmail.com> * rerun tests Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * rerun tests Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * update kubernetes version Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * update kubernetes version Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * rerun tests Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * rerun tests Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * adjust kubernetes version to 1.30.6 Signed-off-by: helenxie-bit <helenxiehz@gmail.com> * adjust kubernetes version to 1.31.4 Signed-off-by: helenxie-bit <helenxiehz@gmail.com> --------- Signed-off-by: helenxie-bit <helenxiehz@gmail.com> Signed-off-by: Hezhi (Helen) Xie <hezxie@ucdavis.edu> Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
What this PR does / why we need it:
Add an e2e test for the train API in `test_e2e_train_api.py`.
Which issue(s) this PR fixes (optional, in `Fixes #<issue number>, #<issue number>, ...` format, will close the issue(s) when PR gets merged):
Fixes #
Checklist: