Options to deploy Nebari to existing AWS VPC/subnets? #2559

mwengren · 2024-07-11T02:21:56Z

mwengren
Jul 11, 2024

We have some particular networking/security requirements that compel us to use a pre-existing VPC and associated public/private subnets in our AWS account. Similarly, we can't create a new Internet Gateway (IGW) in our public subnet; we must use an existing networking setup for outgoing traffic from AWS to our network or for any public traffic.

Is there a way to configure such a deployment by editing the nebari-config.yaml file appropriately? Essentially to pass IDs for existing AWS resources (VPC, subnet, etc) for Nebari to deploy components to? I can't find documentation for this type of deployment scenario in the Nebari docs, so I assume it would require some manual editing of Nebari internals beyond just the options provided in nebari-config?

TIA!

viniciusdc · 2024-07-12T15:28:47Z

viniciusdc
Jul 12, 2024
Maintainer

Hi @mwengren, Nebari supports existing AWS subnets if you pass the associated security group ID and existing subnet IDs in nebari-config.yaml. However, I am not sure whether Nebari creates an IGW at all as part of its deployment process, so I guess it would default to whatever configuration is already present in your infrastructure.

Here's an example of how that can be set in the config:

amazon_web_services:
  terraform_overrides:
    existing_subnet_ids: ["subnet-05b2b1f41f0b1d8a6", "subnet-017efc3309fbca2da"]
    existing_security_group_id: "sg-0e2c865bdd8824b1a"
  region: ...
  <other_attributes>: ...

I'm cc'ing @aktech and @Adam-D-Lewis, as both have more experience with AWS deployments, and Adam worked on adding support for this.

2 replies

mwengren Jul 12, 2024
Author

Thanks @viniciusdc. Do you know if it's possible to deploy the Load Balancer in one subnet (our public subnet) and the k8s cluster and other resources like EFS in another (private) subnet?

We have a small public subnet (/27) and a larger private subnet (/23). Networking is provided between the two subnets via pre-configured Route Table.
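To put rough numbers on those two subnet sizes, here is a minimal sketch using Python's standard ipaddress module. The CIDR blocks are placeholders that only share the prefix lengths mentioned above (/27 and /23); the one AWS-specific fact it relies on is that AWS reserves five addresses in every subnet.

```python
import ipaddress

# AWS reserves five addresses in every subnet (the first four and the last one).
AWS_RESERVED_PER_SUBNET = 5


def usable_addresses(cidr: str) -> int:
    """Return how many addresses AWS leaves assignable in a subnet of this size."""
    network = ipaddress.ip_network(cidr, strict=True)
    return max(network.num_addresses - AWS_RESERVED_PER_SUBNET, 0)


# Placeholder CIDRs with the same prefix lengths as the subnets described above.
for label, cidr in [("public /27", "10.0.0.0/27"), ("private /23", "10.0.2.0/23")]:
    print(f"{label}: {usable_addresses(cidr)} assignable addresses")
```

Addresses already taken by other resources in the subnet (a NAT Gateway, for example) reduce the available count further.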

viniciusdc Jun 28, 2025
Maintainer

Unfortunately, this is not supported. Sorry I missed your ping.

mwengren · 2024-07-13T14:43:09Z

mwengren
Jul 13, 2024
Author

@viniciusdc When I add the terraform_overrides YAML I get a validation error running nebari deploy:

ValidationError: 1 validation error for ConfigSchema
amazon_web_services.terraform_overrides
  Extra inputs are not permitted [type=extra_forbidden, input_value={'existing_subnet_ids': [... 'sg-xxxxxxxxxxxxxx'}, input_type=CommentedMap]
    For further information visit https://errors.pydantic.dev/2.4/v/extra_forbidden

Do I need a custom validation configuration as well?

0 replies

aktech · 2024-07-13T14:59:15Z

aktech
Jul 13, 2024
Collaborator

The syntax is as follows:

amazon_web_services:
  existing_subnet_ids:
    - subnet-xxxxxxxxxxxx
    - subnet-yyyyyyyyyyyy
  existing_security_group_id: sg-kkkkkkkkkkkk

1 reply

mwengren Jul 13, 2024
Author

Thanks @aktech, I modified the formatting of my config file, but I am still getting the same validation error.

Config section:

amazon_web_services:
  terraform_overrides:
       existing_subnet_ids:
           - subnet-xxxxxxxxxxxxxxxx
           - subnet-yyyyyyyyyyyyyyy
        existing_security_group_id: sg-zzzzzzzzzzzzzzz

Output:

ValidationError: 1 validation error for ConfigSchema
amazon_web_services.terraform_overrides
  Extra inputs are not permitted [type=extra_forbidden, input_value={'existing_subnet_ids': [...net-xxxxxxxxxxxxxx']}, input_type=CommentedMap]
    For further information visit https://errors.pydantic.dev/2.4/v/extra_forbidden

Is it required to include two subnets? What if I would like to deploy Nebari to only a single subnet?

More accurately, I have a public/private subnet configuration (see details above), but I don't know if this is an architecture that's supported by Nebari 'out of the box' or not? We need the end-user access point (Load Balancer?) to be in our public subnet, and the k8s cluster in the private subnet, as we only have 16 IP addresses available in our public subnet.

I have several other questions if you can assist:

What does passing an existing security group do? I have a security group that manages permissions to our public subnet, so I planned to try passing that in hoping Nebari would adopt the same permissions rather than create a new permission set that is more open than I want.

Are these terraform overrides documented anywhere or can you point me to the right section of the code to poke around to see what other override options are available, if any?

mwengren · 2024-07-13T21:43:33Z

mwengren
Jul 13, 2024
Author

I figured out the answer to the nebari-config.yaml question above; however, I'm hitting an issue in the deployment process. This is the YAML config I used to override and use my existing subnets (same as above but without the terraform_overrides node):

existing_subnet_ids:
           - subnet-xxxxxxxxxxxxxxxx
           - subnet-yyyyyyyyyyyyyyy
existing_security_group_id: sg-zzzzzzzzzzzzzzz

Adding these resulted in the Nebari deploy starting.

Is the purpose of the security group documented anywhere? I found in this part of the TF code that both the existing_subnet_ids and existing_security_group_id parameters are needed for the networking override to occur. But how do I know what my existing_security_group_id needs to include for Nebari to deploy successfully?
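As a rough illustration of that "both or nothing" condition, here is a minimal Python sketch that checks an already-parsed nebari-config.yaml (a plain dict). The function name and messages are hypothetical and this is not Nebari's own validation; it only mirrors what this thread describes: the two keys sit directly under amazon_web_services, and both have to be set for the existing network to be used.

```python
from typing import Any, Dict, List

NETWORK_KEYS = ("existing_subnet_ids", "existing_security_group_id")


def check_existing_network(config: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means both override keys are usable."""
    aws = config.get("amazon_web_services") or {}
    problems = []

    # Keys nested under terraform_overrides are what triggered extra_forbidden above.
    nested = aws.get("terraform_overrides") or {}
    for key in NETWORK_KEYS:
        if key in nested:
            problems.append(f"move {key} out of terraform_overrides")

    # Both values have to be present for the existing network to be used.
    present = [key for key in NETWORK_KEYS if aws.get(key)]
    if len(present) == 1:
        problems.append(f"{present[0]} is set but its companion key is missing")
    elif not present and not problems:
        problems.append("neither key is set; Nebari would create its own network")
    return problems
```

An empty result only says the keys are in the right place; it says nothing about whether the security group's rules are sufficient.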

Asking because it's not clear to me why my general k8s node group failed to initialize. The error I got was:

[terraform]: │ Error: waiting for EKS Node Group (nebari-dev-dev:general) create: unexpected state 'CREATE_FAILED', wanted target 'ACTIVE'. last error: i-08575544df69eef5c: NodeCreationFailure: Instances failed to join the kubernetes cluster

Looking in the AWS console, it appears the desired size of the general node group is 1, so presumably it failed when the instance failed to join, which must be a security group or other permission-related issue?

Thanks for any advice!

0 replies

mwengren · 2024-07-14T16:49:16Z

mwengren
Jul 14, 2024
Author

I've tried redeploying a few times, but each time the node instance (which is created successfully as an m5.2xlarge type per nebari-config.yml) isn't able to join the general k8s cluster. I also get the following message at the top of the nebari-dev-dev cluster page in the console:

Your current IAM principal doesn’t have access to Kubernetes objects on this cluster.
This may be due to the current user or role not having Kubernetes RBAC permissions to describe cluster resources or not having an entry in the cluster’s auth config map

It's odd because I have AdministratorAccess role for my account which appears to have full access as far as I can tell for IAM permissions, so I'm not sure where to go to troubleshoot further. Are there other common reasons why the instance might not be able to connect to the node group?

Thanks in advance!

6 replies

mwengren Apr 28, 2025
Author

@viniciusdc, @aktech

Thanks for both of your comments.

I had to pause on pursuing this until now, but I'd like to pick it back up again if you can still provide some assistance.

I attempted to re-deploy the most recent Nebari version, 2025.4.1, on the same AWS account and had more or less the same results as last July (the k8s cluster is created but the EC2 instance is unable to join the 'general' node group):

Error: waiting for EKS Node Group (<eks_cluster_name>:general) create: unexpected state 'CREATE_FAILED', wanted target 'ACTIVE'. last error: i-06dfbc2cfffbe1a3d: NodeCreationFailure: Instances failed to join the kubernetes cluster

I still need to investigate if there's anything useful in CloudTrail as far as diagnosing the deployment errors, but I've also been looking into the Nebari TF code as well to understand it better and I can also give some details about our environment, in hopes you can advise me (apart from any IAM permission errors I'm having) on how I should attempt to deploy Nebari successfully:

VPC Overview:
'Public' subnet: xyz.xyz.xyz.xyz/27 -> VPGW -> public internet
2 Private subnets: xyz.xyz.xyz.xyz/23 -> NAT Gateway (within 'public' subnet) -> VPGW -> public internet

Our 'public' subnet is also firewalled to only be reachable from our internal network.

So, I guess my main question is: what can I do in my nebari-config.yaml settings, apart from specifying the existing_subnet_ids and existing_security_group_id to enable the override_network option, to make it successfully deploy in the above VPC?

Also, can/how can I configure k8s to span both of the private subnets but be available via an IP address in our 'public' subnet space?

With the default networking, Nebari deploys an IGW and presumably the k8s cluster leverages that in its networking to be reachable (don't fully understand k8s networking).

There aren't enough IP addresses available in the 'public' subnet to be appropriate for Nebari, but we need some sort of resource deployed to an IP address in our 'public' subnet in order for users to connect to it.

Thanks for any assistance!

dcmcand Apr 29, 2025
Maintainer

Hi @mwengren,
I have been working on allowing deploying nodes into private subnets while allowing the load balancer to be in a public subnet. You can see the work on the branch https://github.com/nebari-dev/nebari/tree/aws-no-public-ip-by-default. It won't work well yet on upgrading an existing deployment because of other issues (which is why we haven't merged it yet) but it should work fine for a new install. If you would like to give it a try, it may help your issues.

mwengren Apr 29, 2025
Author

@dcmcand That is some of the best news I've heard in a while, thanks! I will take a look at the branch & documentation and give it a try.

If you know of any info or documentation about required permissions to successfully deploy Nebari on AWS (without account-wide admin rights), that would be really helpful for my situation. I've already found #1366 which addresses that somewhat, but since it's > 2 years old I thought the situation might have changed and I couldn't find specific permissions or a policy available anywhere in the Nebari docs.

I still think I'm running into limits of my permission set when I try to deploy in my existing VPC/account related to the k8s 'general' cluster and the instance join error.

marcelovilla Apr 30, 2025
Maintainer

Hey @mwengren, in addition to what @dcmcand already suggested, you could also try to use https://github.com/iann0036/iamlive when deploying Nebari to an account where you have full permissions and see what the resulting policy looks like.

mwengren Apr 30, 2025
Author

@dcmcand Can you help me out as far as how to use Nebari from a git branch like the one you suggested? Is it just a process like:

clone https://github.com/nebari-dev/nebari locally to my install server
git checkout aws-no-public-ip-by-default
pip install -e .

Or is there a better/recommended way I'm missing? Thanks!

dcmcand · 2025-04-29T17:22:46Z

dcmcand
Apr 29, 2025
Maintainer

Unfortunately, we don't have a complete list of the required permissions yet. It is on our to-do list, but no one has picked it up. If you have access to CloudTrail, that will tell you which permissions are missing, as you will see permission denied errors there. All that being said, please feel free to create an issue about the minimal permissions. People creating issues is one of the things that we use when prioritizing work.

0 replies

dcmcand · 2025-04-30T19:33:11Z

dcmcand
Apr 30, 2025
Maintainer

Yep, that's it. Check out the branch and `pip install .`
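For what it's worth, the clone and checkout can also be collapsed into a single pip VCS install. A minimal sketch, assuming git is available on the install server and you are working in a fresh virtual environment; the helper names are only illustrative, while the repository URL and branch name are the ones mentioned in this thread.

```python
import subprocess
import sys

REPO_URL = "https://github.com/nebari-dev/nebari"
BRANCH = "aws-no-public-ip-by-default"


def branch_requirement(repo_url: str = REPO_URL, branch: str = BRANCH) -> str:
    """Build a pip VCS requirement that pins the install to one branch."""
    return f"git+{repo_url}@{branch}"


def install_from_branch() -> None:
    """Install into the current interpreter's environment (ideally a fresh virtualenv)."""
    subprocess.run(
        [sys.executable, "-m", "pip", "install", branch_requirement()],
        check=True,
    )
```

Calling install_from_branch() is equivalent to running pip install with the requirement string returned by branch_requirement(). The local clone plus `pip install -e .` route is still the better fit if you plan to modify the code.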

0 replies

mwengren · 2025-05-02T18:23:43Z

mwengren
May 2, 2025
Author

@dcmcand I've looked over the changes in https://github.com/nebari-dev/nebari/tree/aws-no-public-ip-by-default and generally it looks fairly similar to our existing network layout (as I described above) with a few differences:

We route our outgoing traffic from AWS through a VPGW rather than through an IGW like the one Nebari deploys as part of the 'network' module.
We have a single public subnet rather than the one per AZ that your branch creates.

I also see in your branch that there are additional endpoints created in the network module (presumably for the private subnets to access external AWS services that are needed?):
main...aws-no-public-ip-by-default#diff-84a3bb327637d640fbb30ff76e399655337fbf8ef6170be4cae7452d335a7acfR139.

My main question is how to make Nebari work for my situation, which requires using existing VPC and networking resources in my account?

I can't deploy Nebari's network module as is, essentially. So, as a result, the testing I've done involves passing both existing_subnet_ids and existing_security_group_id in my nebari-config.yaml file to override the network module deployment and to use our existing VPC and public/private subnets. So, the additional VPC endpoint resources in your branch won't be created, of course.

I see two options:

1. Create the necessary endpoints manually in my account following your code and test deploying Nebari again, skipping the 'network' module.
2. Override the Nebari TF main and network module code myself so that, rather than skipping the network module, Nebari deploys it but is passed the existing VPC, subnet IDs, etc. as TF variables to use. I've taken this approach before for another project, but I'm not sure if this is a good practice or an anti-pattern I should avoid (relatively new to TF).

I think 2 would be possible but I'm still not sure if my overall architecture is likely to work for Nebari or not with the VPGW and other factors like the 1:2 public:private subnet architecture. Either way, I'm sure it'll take more testing. Appreciate any guidance, thx!

6 replies

mwengren May 6, 2025
Author

@dcmcand

I'm trying to pursue Option 2 to update Nebari TF code, but I'm a bit stumped on how to add new parameters for nebari-config.yml to accept.

I'd like to pass each of: vpc_id, public_subnet_ids, and private_subnet_ids to the nebari render and nebari deploy commands, but I get a Pydantic validation error:

amazon_web_services.vpc_id
  Extra inputs are not permitted [type=extra_forbidden, input_value='vpc-...', input_type=str]
    For further information visit https://errors.pydantic.dev/2.9/v/extra_forbidden
amazon_web_services.public_subnet_ids
  Extra inputs are not permitted [type=extra_forbidden, input_value=['subnet-...'], input_type=CommentedSeq]
    For further information visit https://errors.pydantic.dev/2.9/v/extra_forbidden
amazon_web_services.private_subnet_ids
  Extra inputs are not permitted [type=extra_forbidden, input_value=['subnet-...bnet-...'], input_type=CommentedSeq]

I found what I think is the right place to modify the validation rules, and added each to the AWSInputVars class in:

https://github.com/mwengren/nebari/blob/mw-aws-no-public-ip/src/_nebari/stages/infrastructure/__init__.py#L181-L183

but still receive the extra inputs error above when running the render command. Appreciate any advice on what I'm missing...

mwengren May 6, 2025
Author

Never mind, with a little more sleuthing, I found a few more places where I'd missed the necessary Pydantic model updates for Nebari to accept my added params.

mwengren@44dd4eb

Nebari is deploying now with the new parameters passed for vpc_id, public_subnet_ids, and private_subnet_ids. I'll post the results when I have them.

mwengren May 7, 2025
Author

@dcmcand With the changes that I added to yours in this branch (main...mwengren:nebari:mw-aws-no-public-ip), I was able to deploy the cluster to both of my private subnets. However, I still encounter some permission issues that I think have more to do with our AWS Organization policies, which I'll have to run down.

This is progress, though, so I'm in a better position to troubleshoot further and to try to run down the additional permissions I need to make the cluster fully deploy.

I've never gotten past the point of getting the 'general' node group to be properly created and the one EC2 instance created to be able to join (either before or after these new code changes). Here's some error info copied from nebari deploy FWIW:

[tofu]: module.kubernetes.aws_eks_node_group.main[0]: Still creating... [33m0s elapsed]
[tofu]: module.kubernetes.aws_eks_node_group.main[0]: Still creating... [33m10s elapsed]
[tofu]: module.kubernetes.aws_eks_node_group.main[0]: Still creating... [33m20s elapsed]
[tofu]: ╷
[tofu]: │ Error: waiting for EKS Node Group (nebari-coastalsb-dev:general) create: unexpected state 'CREATE_FAILED', wanted target 'ACTIVE'. last error: i-07fa341d6c9dc2d4c: NodeCreationFailure: Instances failed to join the kubernetes cluster
[tofu]: │
[tofu]: │   with module.kubernetes.aws_eks_node_group.main[0],
[tofu]: │   on modules/kubernetes/main.tf line 86, in resource "aws_eks_node_group" "main":
[tofu]: │   86: resource "aws_eks_node_group" "main" {

In addition, with the VPC endpoint code now running as part of the deploy, I get additional errors that are clearly permission-related and have to do with my Org policies that I'll have to sort out with my IT department.

Any advice on where I should look to troubleshoot the 'general' cluster and instance join failure messages would be fantastic. I still need to look around the console to see if I can find anything obvious that might be causing this issue.

Either way, a few steps closer, thanks!

mwengren May 8, 2025
Author

A quick update: I discovered I'd used the wrong existing security group ID for my most recent test deployment (with the changes in my branch above to pass vpc_id, public_subnet_ids, and private_subnet_ids params).

Passing a proper SG resulted in the node groups deploying and the EKS cluster being created successfully, so now I just need to resolve my AWS Org permission limit issues to deploy the new VPC endpoints for the private subnets and I think I might be good to go. Pretty pleased with this progress so far, thanks!

marcelovilla May 26, 2025
Maintainer

Hey @mwengren, we're working on updating our docs with examples of narrower sets of permissions needed to deploy (and destroy) Nebari on different cloud providers. Here's the PR in our docs repo and the preview of the docs with these additions.

Hope this helps you resolve your permission limit issues.

mwengren · 2025-05-02T20:05:45Z

mwengren
May 2, 2025
Author

This is a rough-cut edit of some of the relevant TF files in the main and network modules to illustrate what I mean:

https://github.com/mwengren/nebari/tree/mw-aws-no-public-ip

I haven't run this to test and I'm sure I missed some things, but hopefully this gets the idea across of what I had in mind for Option 2 above.

0 replies

mwengren · 2025-06-06T21:45:06Z

mwengren
Jun 6, 2025
Author

@dcmcand I've reached a point where my EKS cluster is deployed successfully, but I'm running into a problem where the EKS OIDC provider doesn't seem to be properly created. Here's the error message from nebari deploy:

[tofu]: │ Error: Failed to identify fetch peer certificates
[tofu]: │
[tofu]: │   with module.kubernetes.data.tls_certificate.this,
[tofu]: │   on modules/kubernetes/main.tf line 185, in data "tls_certificate" "this":
[tofu]: │  185: data "tls_certificate" "this" {
[tofu]: │
[tofu]: │ failed to fetch certificates from URL 'https': Get
[tofu]: │ "https://oidc.eks.us-east-2.amazonaws.com:443/id/7BBFFEA6C2CC58FFE69229AB66CDCD9C":
[tofu]: │ dial tcp: lookup oidc.eks.us-east-2.amazonaws.com on xxx.xxx.xxx.xxx:53: no such
[tofu]: │ host

I checked in the console, and a previous deploy I'd run successfully created the aws_iam_openid_connect_provider per https://github.com/nebari-dev/nebari/blob/main/src/_nebari/stages/infrastructure/template/aws/modules/kubernetes/main.tf#L198-L207 (note that I am using my own branch as mentioned above, so it's possible something I introduced in my changes could be the problem).

Googling led me to a few good posts about the same or a similar problem, but I was hoping this had been encountered by someone else previously who could point me in the right direction. These posts mostly describe my problem:

https://repost.aws/knowledge-center/eks-troubleshoot-oidc-and-irsa (in my case the IAM OIDC provider is missing in Step 2)
dial tcp: lookup oidc.eks.eu-central-1.amazonaws.com on 10.144.197.2:53: no such host terraform-aws-modules/terraform-aws-eks#2992 (comment) which links to:
https://docs.aws.amazon.com/eks/latest/userguide/enable-iam-roles-for-service-accounts.html#_create_oidc_provider_eksctl

I think it's failing before the aws_iam_openid_connect_provider is created, because it can't obtain the tls_cert from https://oidc.eks.us-east-2.amazonaws.com:443/id/7BBFFEA6C2CC58FFE69229AB66CDCD9C to pass to the OIDC provider, due to a networking issue or something else. I'm just not sure where the networking issue originated from.
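One way to narrow this down is to check, from the same machine that runs nebari deploy, whether the OIDC hostname resolves at all. A minimal sketch using only the standard library; resolve_host and its injectable lookup parameter are hypothetical helpers for illustration, and a successful lookup only rules out DNS, not routing or firewall problems.

```python
import socket
from typing import Callable, List


def resolve_host(hostname: str, port: int = 443,
                 lookup: Callable = socket.getaddrinfo) -> List[str]:
    """Return the sorted unique addresses a hostname resolves to, or [] if lookup fails."""
    try:
        results = lookup(hostname, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # Same class of failure as the "no such host" in the tofu output.
        return []
    # Each result is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({entry[4][0] for entry in results})
```

Calling resolve_host("oidc.eks.us-east-2.amazonaws.com") and getting an empty list would point at DNS resolution (for example a VPC endpoint's private DNS overriding the public name) rather than at IAM permissions.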

From post 2 above, the EKS VPC endpoint might be the culprit, not sure.

Looking in AWS Console, my EKS cluster includes the OpenID Connect Provider with the same URL as above, so it seems it's created properly from EKS' standpoint.

I'm also wondering if a failed manual cleanup from a previous nebari deploy might be the issue, since I've had to do that a number of times.

Is there a published inventory of resources to clean up when nebari deploy fails? I've started my own, but there's a good chance I've missed a few things.

3 replies

dcmcand Jun 9, 2025
Maintainer

@mwengren I haven't run into that one before. I think a previous failed deploy messing things up is a real possibility. You could try changing your deploy name, or just go into the tag editor, search for all resources in the region that you deployed to, and make sure everything is properly removed (this is what I have to do when I am developing an AWS feature).

mwengren Jun 13, 2025
Author

I believe the issue here was the EKS VPC endpoint's private_dns_enabled=True setting causing a DNS failure when creating the OIDC identity provider, as mentioned in terraform-aws-modules/terraform-aws-eks#2992 (comment).

In any case, I tested by setting it to false, as in:

nebari/src/_nebari/stages/infrastructure/template/aws/modules/network/main.tf

Line 258 in 0eee754

private_dns_enabled = false

and was able to get past the OIDC identity provider step to creating the module.kubernetes-ingress.kubernetes_service.main, which timed out. So more troubleshooting from that point on, but made some progress at least.

mwengren Jun 16, 2025
Author

@dcmcand I tried a few different options here, including trying to create an OIDC-specific VPC endpoint, which you can see in my commented-out code here: https://github.com/mwengren/nebari/blob/b2efe86f2d113f2850dbba154e5b00422cb5fb5a/src/_nebari/stages/infrastructure/template/aws/modules/network/main.tf#L264-L275 (in combination with setting the above EKS VPC private_dns_enabled to true), but the only way I could get past the certificate error mentioned above was to set private_dns_enabled to false for the existing EKS VPC endpoint as in above.

Good news is that with that change it progresses to the kubernetes-ingress stage now, where I'm having some domain or DNS-related errors. I'm going to see if there's anything pertinent already in GitHub and if not post further about it below.

mwengren · 2025-06-16T20:58:47Z

mwengren
Jun 16, 2025
Author

I've reached the kubernetes-ingress stage of the deployment, but I'm now blocked at module.kubernetes-ingress.kubernetes_service.main:

I'm getting a timeout here:

[tofu]: ╷
[tofu]: │ Error: client rate limiter Wait returned an error: context deadline exceeded
[tofu]: │
[tofu]: │   with module.kubernetes-ingress.kubernetes_service.main,
[tofu]: │   on modules/kubernetes/ingress/main.tf line 114, in resource "kubernetes_service" "main":
[tofu]: │  114: resource "kubernetes_service" "main" {
[tofu]: │
[tofu]: ╵

My best guess is that it's hung up on wait_for_load_balancer. For my setup (details above), I have a small CIDR range of non-AWS-managed IP addresses I can deploy services to in my public subnet. We are not using an AWS Internet Gateway due to TIC restrictions, and instead route our public traffic through our own network. Not sure if that could be the source of the current issue or not, but sharing those details again anyway.

Digging around the docs, I found the ingress config section. Do I need to specify an IP address for load-balancer-ip within my public subnet range, or does it choose one from the subnet range automatically if empty?

I also had a hard time identifying what parameters are passed to the kubernetes-ingress stage (e.g. load-balancer-ip, which seems highly relevant). That's probably just me being a novice with the Nebari code; presumably there's a correspondence between the config and the module inputs, but I wasn't able to find where in the code the parameters are passed to the kubernetes-ingress module.

Also, for testing purposes, I created a Route 53 private hosted zone that I'll use in the domain config option in nebari-config.yml per https://github.com/orgs/nebari-dev/discussions/2942. This is mostly because I don't have ready access to DNS to assign a valid public A record to the load balancer IP, so I figured I could use a local /etc/hosts alias for testing in the meantime, as long as Nebari was happy with the domain setting I used.

Presently, I have no value for domain in nebari-config.yml because when running nebari init in interactive mode it indicated an IP address would be used as a default within Nebari, which sounded ok, but I'm not sure how that maps to the URLs listed at the end of the Nebari AWS deploy docs, for example, or if that could be contributing to why my kubernetes ingress is failing, along with missing an explicit load-balancer-ip.

For now, I'm assuming a blank value doesn't work for domain despite what nebari init indicates, so I'll try making both changes above and redeploy.

The docs are helpful, but the answers I need seem to be in several places and it's a little hard to diagnose. Any advice appreciated!

1 reply

viniciusdc Jun 28, 2025
Maintainer

Hey @mwengren, I've read all your comments. Sorry for leaving you hanging last time. I didn't see the pings and considered this a separate discussion when I finally returned to it. A few things: usually, when you have a setup like yours, I would recommend having the cluster fully deployed into two private subnets in the same security group to avoid network issues when the services start communicating with each other, which you ended up experiencing, and then work on exposing the internal IP to an external gateway later. Though I am glad you were able to persevere through that.

Now, regarding the issue with the OIDC provider, I've a suspicion that it was a policy in your security group that could have been restricting communication with the AWS OIDC tenant. Checking the CloudTrail for any errors could lead to a solution.

Assuming we leave that disabled for now, I think the cert issue you are now seeing is also due to restricted outbound access, so check your permissions there. I would enable OpenTofu trace logs (export TF_LOG=trace) to inspect the AWS Terraform provider's internal logs, so that you can see what request is being sent and what the response is (usually the error shows up in there too), but keep in mind that the amount of logs is colossal.
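Because of that log volume, it can help to send the trace output to a file instead of the terminal. A minimal sketch, assuming Nebari passes its environment through to the tofu subprocess and that TF_LOG_PATH is honoured there; the helper names and default file names are hypothetical.

```python
import os
import subprocess
from typing import Dict, List, Mapping


def trace_env(base: Mapping[str, str], log_path: str) -> Dict[str, str]:
    """Copy an environment and switch on trace logging, written to a file."""
    env = dict(base)
    env["TF_LOG"] = "trace"        # verbosity suggested above
    env["TF_LOG_PATH"] = log_path  # keeps the very large log out of the terminal
    return env


def deploy_with_trace(config_file: str = "nebari-config.yaml",
                      log_path: str = "tofu-trace.log") -> int:
    """Run nebari deploy with trace logging enabled and return its exit code."""
    command: List[str] = ["nebari", "deploy", "-c", config_file]
    return subprocess.run(command, env=trace_env(os.environ, log_path)).returncode
```

Searching the resulting file for the failing hostname or for HTTP status codes is usually quicker than reading it top to bottom.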

Regarding your question of how the input vars are consumed, the idea is:

Each stage has a Nebari Stage child class, which internally handles transforming all contents of the input_vars methods into an input_vars temp file, which is passed down to the tofu deploy command line.

nebari/src/_nebari/stages/kubernetes_ingress/__init__.py

Lines 169 to 183 in d680ca8

    
    class KubernetesIngressStage(NebariTerraformStage):
        name = "04-kubernetes-ingress"
        priority = 40

        input_schema = InputSchema
        output_schema = OutputSchema

        def tf_objects(self) -> List[Dict]:
            return [
                NebariTerraformState(self.name, self.config),
                NebariKubernetesProvider(self.config),
                NebariHelmProvider(self.config),
            ]

        def input_vars(self, stage_outputs: Dict[str, Dict[str, Any]]):

So basically the correspondence is: the Python Nebari stage class's input_vars method, and then the associated variable names inside the actual Terraform module, e.g.
https://github.com/nebari-dev/nebari/blob/main/src/_nebari/stages/kubernetes_ingress/template/variables.tf
Once in there, they are captured by the respective modules like this:

nebari/src/_nebari/stages/kubernetes_ingress/template/main.tf

Lines 1 to 19 in d680ca8

    
    module "kubernetes-ingress" {
      source = "./modules/kubernetes/ingress"

      namespace = var.environment

      node-group = var.node_groups.general

      traefik-image = var.traefik-image

      certificate-service       = var.certificate-service
      acme-email                = var.acme-email
      acme-server               = var.acme-server
      acme-challenge-type       = var.acme-challenge-type
      cloudflare-dns-api-token  = var.cloudflare-dns-api-token
      certificate-secret-name   = var.certificate-secret-name
      load-balancer-annotations = var.load-balancer-annotations
      load-balancer-ip          = var.load-balancer-ip
      additional-arguments      = var.additional-arguments
    }
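To make that flow concrete, here is a simplified Python sketch of the idea: a stage's input_vars output is written to a temporary variables file, and the file is handed to tofu on the command line. ExampleStage, write_var_file and apply_command are hypothetical stand-ins, not Nebari's real classes or the exact command it builds; only the -var-file flag and the variable names from the module above are taken as given.

```python
import json
import os
import tempfile
from typing import Any, Dict, List


class ExampleStage:
    """Hypothetical stand-in for a Nebari stage class; not the real base class."""

    def __init__(self, overrides: Dict[str, Any]):
        self.overrides = overrides

    def input_vars(self) -> Dict[str, Any]:
        # Keys are named after entries in the stage's variables.tf.
        return {
            "load-balancer-ip": self.overrides.get("load-balancer-ip"),
            "load-balancer-annotations": self.overrides.get("load-balancer-annotations", {}),
        }


def write_var_file(stage: ExampleStage) -> str:
    """Dump the stage's input vars to a temporary JSON variables file."""
    handle, path = tempfile.mkstemp(suffix=".tfvars.json")
    with os.fdopen(handle, "w") as var_file:
        json.dump(stage.input_vars(), var_file)
    return path


def apply_command(var_file: str) -> List[str]:
    """Build (but do not run) the tofu command that consumes the variables file."""
    return ["tofu", "apply", f"-var-file={var_file}"]
```

So to trace a setting such as load-balancer-ip, start from the key returned by input_vars, find the variable of the same name in variables.tf, and follow var.load-balancer-ip into the module call.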

mwengren · 2025-07-01T22:05:47Z

mwengren
Jul 1, 2025
Author

@viniciusdc Thanks. The security group suggestion made me revisit the settings I was using in AWS, and I'm hoping that will lead somewhere.

Early on when testing this, I manually created a security group in the AWS console to test with, primarily because in the infrastructure module, as written currently, the only way to trigger the override_network local variable is to pass both the existing_security_group_id and existing_subnet_ids parameters, per: https://github.com/nebari-dev/nebari/blob/main/src/_nebari/stages/infrastructure/template/aws/main.tf#L12.

I noticed when checking my SG rules, which I tried to match against those in https://github.com/nebari-dev/nebari/blob/main/src/_nebari/stages/infrastructure/template/aws/modules/network/main.tf#L51-L75, I inadvertently selected 'All TCP' rather than 'All traffic'.
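A quick way to catch that kind of mismatch is to inspect the rule's protocol field rather than the console label. The sketch below works on hand-written dicts in the shape returned by EC2's DescribeSecurityGroups, where an IpProtocol of "-1" means "All traffic". It assumes the rule being matched is a self-referencing "All traffic" ingress rule; the function name and the sg-example ID are made up for illustration.

```python
from typing import Any, Dict


def allows_all_traffic_from_self(group: Dict[str, Any]) -> bool:
    """True if an ingress rule permits every protocol from the group itself."""
    group_id = group["GroupId"]
    for rule in group.get("IpPermissions", []):
        # "-1" is how the EC2 API encodes "All traffic"; "tcp" alone is narrower.
        if rule.get("IpProtocol") != "-1":
            continue
        sources = [pair.get("GroupId") for pair in rule.get("UserIdGroupPairs", [])]
        if group_id in sources:
            return True
    return False


# Hand-written examples in the shape returned by EC2 DescribeSecurityGroups.
all_tcp = {
    "GroupId": "sg-example",
    "IpPermissions": [{
        "IpProtocol": "tcp", "FromPort": 0, "ToPort": 65535,
        "UserIdGroupPairs": [{"GroupId": "sg-example"}],
    }],
}
all_traffic = {
    "GroupId": "sg-example",
    "IpPermissions": [{
        "IpProtocol": "-1",
        "UserIdGroupPairs": [{"GroupId": "sg-example"}],
    }],
}

print(allows_all_traffic_from_self(all_tcp))      # False
print(allows_all_traffic_from_self(all_traffic))  # True
```

An "All TCP" rule still blocks UDP and ICMP between members of the group, and cluster DNS commonly uses UDP, which may be enough to affect node and service communication.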

So, I'm going to try to redeploy and see if this helps things. Will report back.

However, I like the way I've modified the TF code in my development branch. At least for my situation, it is more flexible: it allows a user to pass existing resources they don't want created by TF (existing public/private subnets and security group ID), and the Nebari TF adapts to use them accordingly. It also removes the override_network local variable as a flag that decides whether to skip deploying the network submodule entirely. My version always deploys the network module; it just varies based on what's passed in.

Obviously, it doesn't work yet for me, and I can only test it for my existing setup, but it gets me a lot closer to being able to use Nebari in my AWS environment than the code currently in main.

0 replies

Replies: 12 comments · 19 replies
