Handle helm upgrade of AWS VPC CNI more gracefully #57
Comments
For sure, but how? To my knowledge, it's not possible to blindly overwrite existing resources without errors. But the behaviour has also changed in the last couple of releases of Helm. |
For a freshly set-up EKS cluster, the above quoted statement:
is just not true:
The chart simply cannot be installed into an existing EKS cluster. |
Thanks for the heads up @3oris, this is embarrassing indeed. I will work on getting an upgrade job placed into the VPC CNI system once we move the chart sources in there. |
I removed the daemon set object & service account before applying the change and managing CNI with helm. |
@mohsen0 for me it doesn't. Helm complains about the pre-existing CRD next. I know that I could delete all related resources and then apply the chart. For me the question is more or less: should an EKS cluster come with a pre-installed CNI plugin if, after all, the CNI plugin won't be part of the "managed service" promise? Now the thing comes with an unmanaged component out of the box that leaves me to manage it myself, but adequate means of automation (in terms of CD) cannot be applied, and I have to work around that. |
The CNI chart is not compatible with Helm v3; we should move the CRD to the crds directory. |
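For context, Helm v3 only installs CRDs placed in a chart's top-level crds directory and never templates or upgrades them, so the fix would be a layout roughly like this (file names here are illustrative, not the chart's actual ones):
aws-vpc-cni/
  Chart.yaml
  crds/
    eniconfig.yaml        # ENIConfig CRD, applied once on install by Helm v3
  templates/
    daemonset.yaml
    serviceaccount.yaml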
@3oris I understand there is room for more improvement and clarity, but I suggest a more constructive, positive approach; then your contributions will be much more appreciated. |
CRD issue aside, you can just install the chart, it fails, then purge it, then reinstall, and all is done. Obviously this deletes all the resources and then recreates them with the chart. I've done this method many times now and it works fine without any interruption. |
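For reference, with Helm v2 that install/purge/reinstall sequence would look roughly like this; the eks repo alias and the aws-vpc-cni release name are assumptions, not taken from the comment:
helm install --name aws-vpc-cni --namespace kube-system eks/aws-vpc-cni   # fails with: "aws-node" already exists
helm delete --purge aws-vpc-cni                                           # deletes the release and the resources it created
helm install --name aws-vpc-cni --namespace kube-system eks/aws-vpc-cni   # clean reinstall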
This was true for older versions of Helm. With 2.16 it doesn't work, and probably not with Helm 3 either. |
Here is how we work around this in Terraform for now: we remove all aws-node related pre-deployed resources as soon as the control plane becomes available (and before the worker nodes become ready) and then apply the Helm chart, roughly as sketched below.
This can also be applied to an existing cluster. As @max-rocket-internet stated, there should be no noticeable interruption. Now you could annotate the aws-node service account via the |
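A minimal kubectl/Helm sketch of that remove-then-install flow (Helm v3 syntax), assuming the standard aws-node resource names and an eks repo alias; both are assumptions, so verify against your cluster before deleting anything:
kubectl -n kube-system delete daemonset aws-node
kubectl -n kube-system delete serviceaccount aws-node
kubectl delete clusterrole aws-node
kubectl delete clusterrolebinding aws-node
kubectl delete crd eniconfigs.crd.k8s.amazonaws.com
helm install aws-vpc-cni eks/aws-vpc-cni --namespace kube-system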
Note that as of Helm 3.2.0, there is theoretically a way to adopt existing resources into helm if you add appropriate annotations and labels to it. helm/helm#7649 Just add the following annotations and labels to existing objects:
In theory, you can add the annotations and labels via kubectl annotate/kubectl label to the objects that AWS pre-creates, then run Helm. I haven't tested this yet, and this is still ugly if you're using something like Terraform, but in theory you don't have delete the Kubernetes objects now. |
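As an illustration, a sketch of that adoption step for the aws-node DaemonSet, assuming the chart will be installed as a release named aws-vpc-cni in kube-system; the same annotate/label commands would be repeated for the ServiceAccount, ClusterRole, ClusterRoleBinding and CRD:
kubectl -n kube-system annotate --overwrite daemonset aws-node \
  meta.helm.sh/release-name=aws-vpc-cni \
  meta.helm.sh/release-namespace=kube-system
kubectl -n kube-system label --overwrite daemonset aws-node \
  app.kubernetes.io/managed-by=Helm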
The above suggestion does not work, as matchLabels on the DaemonSet is an immutable field. I am going to attempt to allow us to override that in the chart and make a PR, which may be feasible if it does not try to modify any other immutable fields. |
Now that #192 has been merged with documentation on how to adopt the EKS-deployed resources into the Helm chart, is there anything more to do here? |
What happens when you just delete the existing resources and install the CNI helm chart a few seconds later? Does everything recover, or do you break your cluster and existing nodes? I did that yesterday and everything seemed to continue working just fine until I had replaced all the nodes in my cluster. The AWS NLB stopped working and I only got an "Error empty reply from Server" until I deleted the LB (deleted the ingress-nginx helm chart) and created a new one. Update: I did the exact same thing on another cluster and everything went well; the only difference was that I was a bit faster doing everything. Weird. Maybe it was just an AWS EKS bug, since I did both changes together with an EKS upgrade. |
Would it be an option to use |
I don't think so; I don't think Helm will run the pre-install hooks if it's not going to do the install because there are conflicting resources it cannot adopt. |
VPC-CNI is fixing the CRDs issue for Helm v3 in aws/amazon-vpc-cni-k8s#2144 |
If someone runs into problems with @3oris's great workaround (see above), you can try replacing the shell script with this one. Some things have changed since he posted his solution, so I applied some fixes. This works for me now; thanks a lot @3oris for your solution! Fixes:
#!/bin/bash
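# Deletes the pre-installed aws-node resources using the matching upstream manifest,
# so that the Helm chart can be installed cleanly afterwards.
# Required environment variables (exported by the caller, e.g. the Terraform local-exec provisioner below):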
#CLUSTER_ENDPOINT
#CLUSTER_CA
#CLUSTER_TOKEN
#REGION
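# Wait until the cluster API server responds to health checks.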
until curl -k -s $CLUSTER_ENDPOINT/healthz >/dev/null; do sleep 4; done
MANIFEST_FILE=`mktemp -t manifest_`
CONFIG_FILE=`mktemp -t config_`
CA_FILE=`mktemp -t ca_`
trap "{ rm -f $MANIFEST_FILE $CONFIG_FILE $CA_FILE; }" EXIT
echo $CLUSTER_CA | base64 -d > $CA_FILE
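# Derive the release line (e.g. 1.7) from the amazon-k8s-cni image tag of the running aws-node DaemonSet.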
VERSION=$(kubectl get ds aws-node -n kube-system -o yaml \
--kubeconfig $CONFIG_FILE \
--server $CLUSTER_ENDPOINT \
--certificate-authority $CA_FILE \
--token "$CLUSTER_TOKEN" \
| grep -i "image: [0-9]" | grep amazon-k8s-cni: | cut -d "/" -f 2 | cut -d ":" -f 2 | sed -E 's#^v([[:digit:]]+)\.([[:digit:]]+)\..+$#\1.\2#g')
curl -sqL \
https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/release-$VERSION/config/master/aws-k8s-cni.yaml \
| sed -e "s#us-west-2#$REGION#" > $MANIFEST_FILE
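# Delete every object defined in the upstream manifest for that release and wait for the deletion to finish.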
kubectl delete \
--kubeconfig $CONFIG_FILE \
--server $CLUSTER_ENDPOINT \
--certificate-authority $CA_FILE \
--token "$CLUSTER_TOKEN" \
--filename $MANIFEST_FILE \
--wait
seems like TF doesn't like
resource "null_resource" "remove_aws_vpc_cni_plugin" {
provisioner "local-exec" {
environment = {
CLUSTER_ENDPOINT = data.aws_eks_cluster.cluster.endpoint
CLUSTER_CA = data.aws_eks_cluster.cluster.certificate_authority.0.data
CLUSTER_TOKEN = data.aws_eks_cluster_auth.cluster.token
}
command = format("%s/remove-aws-vpc-cni-plugin.sh", path.module)
}
}
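# Install the chart only once the pre-installed resources have been removed (enforced via depends_on).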
resource "helm_release" "vcni" {
depends_on = [null_resource.remove_aws_vpc_cni_plugin]
repository = "https://aws.github.io/eks-charts"
name = "aws-vpc-cni"
chart = "aws-vpc-cni"
namespace = "kube-system"
set {
name = "env.ENABLE_PREFIX_DELEGATION"
value = true
}
} |
Closing as the |
According to the documentation in the aws-vpc-cni Helm chart,
“If you receive an error similar to Error: release aws-vpc-cni failed: "aws-node" already exists, simply rerun the above command.”
This causes issues with automation. This should be handled automatically in the Helm chart and not require a rerun of the command by the user.
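For reference, the documented flow amounts to running the chart's install command and, if that error appears, running it again; a rough sketch, where the eks repo alias and release name are assumptions rather than quotes from the chart README:
helm repo add eks https://aws.github.io/eks-charts
helm install aws-vpc-cni eks/aws-vpc-cni --namespace kube-system
# if this fails with: Error: release aws-vpc-cni failed: "aws-node" already exists
# the chart documentation says to simply rerun the same command
helm install aws-vpc-cni eks/aws-vpc-cni --namespace kube-system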