Skip to content

Conversation

schongloo
Copy link
Contributor

@schongloo schongloo commented Apr 8, 2025

What is the purpose of the change

This pull request adds basic support for Blue/Green deployments as outlined by FLIP-503 (https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=337677648).

Brief change log

  • A new FlinkBlueGreenDeploymentController controller, with the capability of managing the lifecycle of these deployments.
  • A new corresponding CRD is introduced
  • The new FlinkBlueGreenDeploymentController will manage this CRD and hide from the user the details of the actual Blue/Green (Active/StandBy) jobs.
  • Delegate the lifecycle of the actual Jobs to the existing FlinkDeployment controller.

Verifying this change

  • Test to verify a single basic deployment
  • Test to verify a proper transition Blue -> Green
  • Test to verify correct behavior when encountering an error before a transition
  • Test to verify correct behavior when encountering an error during a transition

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changes to the CustomResourceDescriptors: no
  • Core observer or reconciler logic that is regularly executed: no

Documentation

  • Does this pull request introduce a new feature? yes
  • If yes, how is the feature documented? docs / JavaDocs. UML diagrams for all scenarios added under docs/content/docs

Overview Diagram

image

@davidradl
Copy link

I am concerned about the number of TODOs. Things like dealing with errors and timing conditions should be in place in the initial drop of code - unless we can mark this capability as beta or the like. WDYT @gyfora ?

@schongloo schongloo requested a review from davidradl April 11, 2025 17:41
@schongloo
Copy link
Contributor Author

@davidradl thanks for your review/comments. I've cleaned up all the outdated/invalid TODOs which pointed me 1 missing unit test, added. At this point the few remaining TODOs do not interfere with the main functionality and are not critical/open for discussion.

@schongloo schongloo marked this pull request as draft April 15, 2025 18:06
@hansh0801
Copy link

👍

@hansh0801
Copy link

when this feature gonna be released?

@schongloo schongloo marked this pull request as ready for review April 30, 2025 13:22
@schongloo
Copy link
Contributor Author

Hi @hansh0801, trying to release it ASAP but I also want to make sure the logic and contracts are as robust as possible to minimize changes later. Looks pretty stable in my current testing and I've reopened it for review. Thanks!

@hansh0801
Copy link

cool ! and I'm also waiting for blue green phase 2 :)
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=337677650

@schongloo
Copy link
Contributor Author

@hansh0801 indeed... Phase 2 will take a little bit longer due to its complexity but already working on it.

@schongloo schongloo changed the base branch from release-1.11 to main April 30, 2025 21:14
@schongloo schongloo changed the base branch from main to release-1.11 April 30, 2025 22:33
@maheepm-lyft
Copy link

maheepm-lyft commented Jun 16, 2025

Hi @schongloo @davidradl @gyfora , I hope you're doing well. I've been following this PR as blue-green deployments would solve a critical use case for our production needs. I noticed there have been no updates to this PR for a few weeks. Was wondering if there is anything blocking the release of this and if there is something myself or the OSS community could help with (testing, fixing any changes, etc). Thanks for all the work so far! 🙏

@gyfora
Copy link
Contributor

gyfora commented Jun 19, 2025

Hi @schongloo @davidradl @gyfora , I hope you're doing well. I've been following this PR as blue-green deployments would solve a critical use case for our production needs. I noticed there have been no updates to this PR for a few weeks. Was wondering if there is anything blocking the release of this and if there is something myself or the OSS community could help with (testing, fixing any changes, etc). Thanks for all the work so far! 🙏

@maheepm-lyft , @schongloo is currently on holiday but once he is back we can try to merge this as soon as reasonably possible :)

In the meantime you can help us by trying out the current version and by providing a code review for the PR, that can definitely speed things up.

@gyfora
Copy link
Contributor

gyfora commented Jun 23, 2025

@schongloo can you please re-open this PR against the main branch?

@maheepm-lyft
Copy link

Hi @schongloo @davidradl @gyfora , I hope you're doing well. I've been following this PR as blue-green deployments would solve a critical use case for our production needs. I noticed there have been no updates to this PR for a few weeks. Was wondering if there is anything blocking the release of this and if there is something myself or the OSS community could help with (testing, fixing any changes, etc). Thanks for all the work so far! 🙏

@maheepm-lyft , @schongloo is currently on holiday but once he is back we can try to merge this as soon as reasonably possible :)

In the meantime you can help us by trying out the current version and by providing a code review for the PR, that can definitely speed things up.

Great thanks! Taking us a bit of time to test this out in our stack, but we're working on it and I'll post any comments here.

@schongloo schongloo changed the base branch from release-1.11 to main July 1, 2025 18:36
@schongloo schongloo requested review from davidradl and gyfora July 1, 2025 18:37
.durationType()
.defaultValue(Duration.ofSeconds(15))
.withDescription(
"Configurable delay in milliseconds to use when the operator reschedules a reconciliation.");

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest including some advice as to what to set this to in different circumstances. Same for deployment-deletion.delay

*
* <p>Deployment States
*
* <p>1. INITIALIZING_BLUE - First-time deployment setup 2. ACTIVE_BLUE - Blue environment serving

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit maybe use a html list for this list

@schongloo
Copy link
Contributor Author

@davidradl thank you for all your feedback. The past few weeks I was focused on mostly the high level refactoring, I'll address your comments ASAP.

@schongloo schongloo force-pushed the release-1.11-bluegreen-flip503 branch from bb9d576 to 7ed0e6f Compare September 1, 2025 18:18
@schongloo schongloo force-pushed the release-1.11-bluegreen-flip503 branch 2 times, most recently from 38cee0e to 0398adc Compare September 3, 2025 06:38
String wrapperKey,
Class<T> valueType) {
String serializedSpec = SpecUtils.writeSpecAsJSON(spec, wrapperKey);
String replacedSerializedSpec = serializedSpec.replace(deploymentName, childDeploymentName);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will do replacement of the image tag as well. We would probably need to exclude that.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to what vikas said. This will change the name in places that are problematic. In my testing, I am facing problems so far in expected configmap name, savepoint path, and metrics reporters.

@schongloo schongloo force-pushed the release-1.11-bluegreen-flip503 branch from 88394be to 68e5ecc Compare September 17, 2025 06:37
@schongloo schongloo closed this Sep 17, 2025
@schongloo schongloo force-pushed the release-1.11-bluegreen-flip503 branch from 68e5ecc to a57579c Compare September 17, 2025 17:06
@schongloo schongloo reopened this Sep 17, 2025
@gyfora gyfora merged commit 331ac47 into apache:main Sep 18, 2025
122 of 133 checks passed
private UpdateControl<FlinkBlueGreenDeployment> patchFlinkDeployment(
BlueGreenContext context, BlueGreenDeploymentType blueGreenDeploymentTypeToPatch) {

String childDeploymentName =
Copy link

@maheepm-lyft maheepm-lyft Oct 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @schongloo , sorry to leave a comment after the PR is merged. I have been doing some testing with BlueGreen deployments, and I think this is replacing the name in several places where it should not be. For example, if I am using an initialSavepointPath with savepoint path in S3, it is replacing the app name in the path with app name + "-blue" or "-green", and the app is unable to find the corresponding savepoint. E.g. app_name becomes app_name-blue. If the savepoint path already has -blue in it (e.g. app_name-blue, then it is updating it to app_name-blue-blue. I am seeing similar issues related to a few other things like configmap names, although I haven't followed that code path all the way.

EDIT: added a comment elsewhere after looking into the bug a little more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants