-
Notifications
You must be signed in to change notification settings - Fork 28
Added indexed option for k8s jobs #597
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added indexed option for k8s jobs #597
Conversation
Can you provide a Also thanks for the contribution! |
First, create a persistent volume with ReadWriteMany. For example: kind: PersistentVolume
apiVersion: v1
metadata:
name: pv-storage
labels:
type: local
spec:
capacity:
storage: 4Gi
accessModes:
- ReadWriteMany
hostPath:
path: /tpi-data Then use this main.tf terraform {
required_providers {
iterative = {
source = "github.com/iterative/iterative"
}
}
}
provider "iterative" {}
resource "iterative_task" "example" {
cloud = "k8s"
machine = "s"
disk_size = 1
parallelism = 3
indexed = false
storage {
workdir = "."
output = "results"
}
script = <<-END
#!/bin/bash
mkdir -p results
printenv > results/out-$JOB_COMPLETION_INDEX
END
} |
related #585 /CC @0x2b3bfa0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome feature, @sjawhar! 😍 It would be nice to enable IndexedCompletion
automatically when parallelism
is greater than 1
instead of exposing a separate option that has no equivalent on other providers.
task/k8s/resources/resource_job.go
Outdated
@@ -119,6 +121,13 @@ func (j *Job) Create(ctx context.Context) error { | |||
jobCompletions := int32(j.Attributes.Task.Parallelism) | |||
jobParallelism := int32(j.Attributes.Task.Parallelism) | |||
|
|||
var jobCompletionMode kubernetes_batch.CompletionMode | |||
if j.Attributes.Indexed { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if j.Attributes.Indexed { | |
if j.Attributes.Parallelism > 1 { |
Thanks for the quick review! I also didn't like having an option that only applies to k8s, but I don't quite agree with your suggestion. Let me try to explain, though bear with me because I'm coming from a very cloud-heavy background and am relatively new to k8s. K8s has two ways of running parallel jobs: indexed and nonindexed. Nonindexed means the jobs are identical, start up N, they do their thing. Indexed means that k8s expects a success for at least one job of each index 1-N. In other words, if job 3 fails, it'll keep retrying 3 specifically. They're useful for different things: you might use nonindexed to spawn N workers to accomplish M jobs by reading from a task queue, whereas you might use indexed to accomplish the same thing without the queue (e.g. by giving the full task list to each job, and then it assigns itself work based on its index). tl;dr: both indexed and nonindexed are useful ways of running parallel jobs in k8s, and I don't think the user should be forced to use one or the other. Now here's an extra twist: to accomplish the indexed example I described above (N workers for M jobs, where M > N), we need to be able to set Thoughts on all this? UPDATE: What if we removed |
if we could make index less k8s specific like somehow populate an env like |
I added the |
For the use cases we1 have in mind, the following should be enough:
The following scenarios are still not supported:
Out of curiosity, can you share a bit more of information on your use case? I guess that you may want to restrict the parallelism so running a massively parallel task doesn't exhaust your cluster's resources. 🤔 Footnotes
|
I think you still might not be grokking the purpose of non/indexed. They're both valid modes of parallel operation, it's not about parallelism = 1 or not.
You got it. If I need to run 100 things, I don't want to spawn 100 pods. This seems to me like a necessary consideration for supporting on-prem/k8s. As you can see from the diff, this is a very minor change that opens up a lot of functionality in k8s. I'd be grateful if you'd reconsider the pros/cons here. 🙏 |
What you propose is similar to As per your suggestion, the current
By default, 2 should be equal to 1 if not specified by users. |
Sorry, I explained myself badly. 🙈 Traditionally, Kubernetes Jobs didn't have an This project tries to simplify the user's workflow as much as possible: the My insistence of using |
@iterative/cml
Error: "completions not supported for non-k8s run-time"? (I am allergic to referring to k8s as a cloud provider 😓 )
|
introducing a resource "iterative_task" "example" {
cloud = "k8s"
parallelism = ... # warn iff defined on `cloud = "k8s"`?
# k8s-specific block
k8s {
parallelism = 3 # in the k8s sense, not in the TPI sense
completions = 5
}
} |
I'd avoid introducing backend-specific details, especially when the outer |
@sjawhar, would it be possible to keep the original scope of this pull request (enable |
I really don't think that's a good idea. Users should not be forced to use indexed mode. I went that route initially just to make it work for my use case as quickly as possible, but this would be both a change from the current behavior and unnecessarily limiting. Are you sure you want to do that? |
I thought that enabling the indexed mode doesn't have any significant effect beyond exposing a couple of additional environment variables. Which kind of unnecessary limitations do you foresee? |
It's not just about environment variables. From the docs:
These are two very different behaviors. Your suggestion work perfectly well for my use case (Indexed is what I need), but it seems unnecessarily limiting. The spec already has provider-specific flags (e.g. region does not apply to k8s), so we're not breaking new ground by adding another one. Nonetheless, if you're sure then I'll make the change. |
Indexed mode is precisely the kind of behavior this tool is meant to have; see #585. Unless there is any practical use case that requires the
Ouch! 😅 Indeed, although there were plans to “rename |
I'm with @sjawhar on this one - it makes no sense taking away this choice for the perceived "simplicity" of merely saving a field here (1:1 with k8s field, explainability & docs built in). If we will get here the power of
I'd advise against 😨 (commented on #412 ) |
Citation needed. 😅 If we provide “powerful” options it should be for a reason, methinks. |
We may end up exposing backend-specific details when shoehorning them under a common specification is proven to be difficult or impractical. 👍🏼 Although I wonder if this is the case. |
When terminology is [susceptible of being] generalizable and different backends overload the same names with different meanings, I believe that it makes sense to run the extra mile and “generalize” on our side a bit more. I'd argue that this particular functionality should be generalized, although in some other cases (e.g. placement) we end up resorting to specific options. |
Taking away some choices isn't implicitly negative; quite the opposite. Unless we find any supported use case that benefits particularly from |
We discussed this offline a bit, and specifically - I find it hard to defend the choice for |
6f5ab2a
to
174f56b
Compare
Ok, I'll make the changes sometime in the next few days |
I came across this PR while looking for this feature with AWS EC2. I think the ability to operate parallel instances with regular cloud providers and have some sort of indexing, or any mechanism, to dispatch work to the different instances can greatly help small teams and individual developers who don't have resources to manage k8s. |
@redabuspatrol, can you please comment on #585? In the meanwhile, you may also want to try having a separate |
ae37043
to
f6d569b
Compare
Sorry for the long delay. I've reverted most of the changes, and implemented @0x2b3bfa0 's preferred solution. |
Thank you very much, @sjawhar! 🙏🏼 |
Thanks so much @sjawhar! Sorry for the delay and (perhaps excessive) hesitancy from our end too. The reverted changes still look worth pursuing to us, just in separate PRs. We'd probably also be much quicker at understanding & merging said PRs if you could share an example of their use too... I assume related to things like master...sjawhar:terraform-provider-iterative:feature/nfs-volume and iterative/dvc@main...sjawhar:dvc:feature/parallel-repro? |
I can open separate issues for each of the following features
From the links above, you'll see I've already implemented all of these in the |
Purpose
Parallel K8s jobs have two modes: Indexed and NonIndexed (default). Indexed jobs have some nice abilities, for example each job gets a
JOB_COMPLETION_INDEX
environment variable so it could assign itself work without needing a separate task queue. There's currently no way to specify the k8s completion mode.Approach
Added an optional boolean
indexed
attributeExample when set to

false
or ommitted:Example when set to

true
TODO