Fixing Ansible AWX ingress when using the AVI load balancer

The Grand Plan for my homelab involves moving as many services as I can into a cloud-native format, specifically Kubernetes. This week it was the turn of AWX, which is built on top of open-source Ansible and is one of the upstream projects for Red Hat Ansible Automation Platform. So of course I want it in my lab! The big problem, however, was the ingress: something wasn’t working as it should, and I just couldn’t log in to AWX.

I’m not going to cover the installation of AWX in any detail. Briefly, the process involves deploying the “awx-operator” and then using a Custom Resource in Kubernetes to tell the operator how to deploy and configure AWX. The resource is pretty simple: you’re effectively passing it some secrets to consume and it does the rest. Here’s the resource that I started with:

apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
  namespace: awx
spec:
  admin_user: admin
  admin_email: awx@lab.mpoore.io
  admin_password_secret: awx-admin-password
  bundle_cacert_secret: awx-custom-certs
  ingress_type: ingress
  hostname: awx.lab.mpoore.io
  ingress_tls_secret: awx.lab.mpoore.io
  postgres_configuration_secret: awx-postgres-configuration
  projects_persistence: true
  projects_existing_claim: awx-data
  replicas: 3
  secret_key_secret: awx-secret-key
  service_type: NodePort

After giving the awx-operator time to do its thing, you get a few more pods:

Screenshot showing the AWX pods created by the awx-operator.
Figure 1: The expected AWX pods are deployed.

And some services:

Screenshot showing the AWX services created by the awx-operator.
Figure 2: An AWX service is created.

The Avi Kubernetes Operator (AKO), being our ingress provider of choice, picks up the service and automatically creates some objects around it. Without too much apparent difficulty, AWX is available via an HTTPS URL with a trusted certificate on it. However…

Screenshot showing a generic error message when logging into AWX.
Figure 3: AWX login fails with an unhelpful error.

I tried to log in using the default admin credentials that I had defined and got the error above. Initially I thought that I must have made a mistake with the credentials, so I checked them all again… and again. Still no dice!

It turns out that my friend Mark Brookfield had the same issue and had been unable to solve it either. There were some hints online that CSRF (Cross-Site Request Forgery) cookies and the Django web framework that AWX uses were part of the problem, and we tried some of the suggested remedies to no avail. Using the Developer Tools in my browser, I could see that a CSRF token was being sent to the browser, but I couldn’t tell whether it was being sent back.

Screenshot showing the browser Developer Tools open with the CSRF cookie and token highlighted.
Figure 4: A CSRF cookie and token were being sent to my browser.

The next stop for troubleshooting was to fiddle with the ingress settings passed to the awx-operator. Lots of googling, lots of experimentation. The only two outcomes were not being able to reach AWX at all, or getting that error during login.

After that, I checked the logs of the “awx-web--” pods to see whether there was an error there. The easiest way to do this was to reduce the number of replicas to 1 by modifying and re-applying the Custom Resource from above.

Screenshot showing the number of replica pods reduced down to 1.
Figure 5: Working with a single replica is easier because you know which pod will be used.

With the number of replicas down to 1, it was easy to tail the logs using the following command:

kubectl logs -f -n awx awx-web-6f8bcbd854-pmzct

Well, will you look at that! A warning about the CSRF token being missing and a 403 returned.

Screenshot showing a snippet of log output from the pod.
Figure 6: Looks like the CSRF token isn’t getting through to the backend.

For the benefit of the internet’s search engines, here is the error:

[pid: 22|app: 0|req: 1/4] 192.168.15.1 () {60 vars in 1052 bytes} [Mon Sep 16 10:54:42 2024] GET /api/v2/auth/ => generated 2 bytes in 208 msecs (HTTP/1.1 200) 11 headers in 379 bytes (1 switches on core 0)
192.168.15.1 - - [16/Sep/2024:10:54:57 +0000] "GET /api/login/ HTTP/1.1" 200 5754 "https://awx.lab.mpoore.io/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36" "172.16.100.2"
[pid: 24|app: 0|req: 2/5] 192.168.15.1 () {60 vars in 1048 bytes} [Mon Sep 16 10:54:57 2024] GET /api/login/ => generated 5754 bytes in 106 msecs (HTTP/1.1 200) 10 headers in 463 bytes (1 switches on core 0)
2024-09-16 10:54:57,805 WARNING  [cdd247bc4ab540b3b195978565d5496f] django.security.csrf Forbidden (CSRF token missing.): /api/login/
[pid: 20|app: 0|req: 3/6] 192.168.15.1 () {66 vars in 1203 bytes} [Mon Sep 16 10:54:57 2024] POST /api/login/ => generated 1019 bytes in 10 msecs (HTTP/1.1 403) 7 headers in 271 bytes (1 switches on core 0)
192.168.15.1 - - [16/Sep/2024:10:54:57 +0000] "POST /api/login/ HTTP/1.1" 403 1019 "https://awx.lab.mpoore.io/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36" "172.16.100.2"
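With several requests interleaved, the one line that matters is easy to miss. As a small illustrative sketch (not part of the original troubleshooting), the following Python pulls only the Django CSRF rejections out of log lines like those above. The function name `csrf_rejections` is hypothetical, and the pattern assumes the warning format shown in this log; the second sample line is a shortened copy of the access-log entry.

```python
import re

# Matches Django's CSRF rejection warnings in the format seen in the log above.
CSRF_WARNING = re.compile(
    r"django\.security\.csrf Forbidden \((?P<reason>.*)\): (?P<path>\S+)"
)


def csrf_rejections(log_lines):
    """Return (reason, path) pairs for every CSRF rejection in the given lines."""
    found = []
    for line in log_lines:
        match = CSRF_WARNING.search(line)
        if match:
            found.append((match.group("reason"), match.group("path")))
    return found


sample = [
    "2024-09-16 10:54:57,805 WARNING  [cdd247bc4ab540b3b195978565d5496f] "
    "django.security.csrf Forbidden (CSRF token missing.): /api/login/",
    # Shortened access-log line; it should be ignored by the filter.
    '192.168.15.1 - - [16/Sep/2024:10:54:57 +0000] "POST /api/login/ HTTP/1.1" 403 1019',
]

print(csrf_rejections(sample))
```

The reason text in the parentheses is what changes between the two failures in this article, so it is the part worth watching.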

So the problem was identified, but not yet the root cause. As a simple test, I changed the ‘service_type’ to “ClusterIP” and used a port forward to bypass the load balancer. The CSRF token made it through, meaning that the cause was probably somewhere in Avi.

After some reading around, I found a page in the AKO documentation covering HostRules. The fact that you could use these Custom Resources to manipulate the configuration in Avi seemed like a winner. I decided to make some changes to the Application Profile, because the default one in use (‘System-Secure-HTTP’) has quite a lot of security settings in it.

The first step was to copy (manually, as there’s no UI option to duplicate) the ‘System-Secure-HTTP’ Application Profile to a new one called ‘Custom-Secure-HTTP’. I then disabled many of the security settings in ‘Custom-Secure-HTTP’.

Next, I needed to create a HostRule as described in the documentation above. For this I only wanted to change the Application Profile being used, so I ended up with the following Custom Resource to give to AKO:

apiVersion: ako.vmware.com/v1beta1
kind: HostRule
metadata:
  name: host-to-awx
  namespace: awx
spec:
  virtualhost:
    fqdn: awx.lab.mpoore.io
    applicationProfile: Custom-Secure-HTTP

On line 8 above, ‘fqdn’ specifies the FQDN of the Virtual Service that this HostRule will be applied to. On line 9, ‘applicationProfile’ specifies the name of the Application Profile that we want to use.

With that applied, I reverted all of the other changes made in the deployment of AWX and waited. Once again with the logs being tailed, look what happened!

[pid: 24|app: 0|req: 5/10] 192.168.15.1 () {60 vars in 1073 bytes} [Mon Sep 16 11:41:04 2024] GET /api/v2/auth/ => generated 2 bytes in 161 msecs (HTTP/1.1 200) 11 headers in 379 bytes (1 switches on core 0)
192.168.15.1 - - [16/Sep/2024:11:41:20 +0000] "GET /api/login/ HTTP/1.1" 200 5754 "https://awx.lab.mpoore.io/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36" "172.16.100.2"
[pid: 23|app: 0|req: 1/11] 192.168.15.1 () {60 vars in 1069 bytes} [Mon Sep 16 11:41:20 2024] GET /api/login/ => generated 5754 bytes in 413 msecs (HTTP/1.1 200) 10 headers in 463 bytes (1 switches on core 0)
2024-09-16 11:41:20,824 WARNING  [bbdccb79dd614138abd91c61307fa442] django.security.csrf Forbidden (Origin checking failed - https://awx.lab.mpoore.io does not match any trusted origins.): /api/login/
[pid: 24|app: 0|req: 6/12] 192.168.15.1 () {66 vars in 1224 bytes} [Mon Sep 16 11:41:20 2024] POST /api/login/ => generated 1019 bytes in 22 msecs (HTTP/1.1 403) 7 headers in 271 bytes (1 switches on core 0)
192.168.15.1 - - [16/Sep/2024:11:41:20 +0000] "POST /api/login/ HTTP/1.1" 403 1019 "https://awx.lab.mpoore.io/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36" "172.16.100.2"

Yes, it still failed, but the cookie actually got through! And, based on my prior research into the issue, the error told me pretty much everything that I needed to fix it. All that was needed was a modification to the resource given to the awx-operator to deploy AWX. A list of trusted origins can be specified during the deployment.

apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
  namespace: awx
spec:
  admin_user: admin
  admin_email: svc_awx@lab.mpoore.io
  admin_password_secret: awx-admin-password
  bundle_cacert_secret: awx-custom-certs
  ingress_type: ingress
  hostname: awx.lab.mpoore.io
  ingress_tls_secret: awx.lab.mpoore.io
  postgres_configuration_secret: awx-postgres-configuration
  projects_persistence: true
  projects_existing_claim: awx-data
  replicas: 1
  secret_key_secret: awx-secret-key
  service_type: NodePort
  extra_settings:
  - setting: CSRF_TRUSTED_ORIGINS
    value:
      - https://awx.lab.mpoore.io

The ‘extra_settings’ added by lines 20-23 do exactly that. Once the awx-operator had time to reconfigure AWX, I was able to log in without issue!
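For some intuition on why listing the origin clears the 403, here is a simplified Python sketch of the kind of origin check Django applies to unsafe requests such as a POST. It is not Django’s actual code: the function name `origin_is_trusted` is hypothetical, and wildcard subdomain entries are left out. The idea that the backend sees the request as plain HTTP because TLS ends at the load balancer is my assumption, not something confirmed above.

```python
def origin_is_trusted(request_origin, host, is_secure, trusted_origins):
    """Simplified origin check: same-origin first, then the trusted list."""
    # The expected origin is rebuilt from how the backend itself sees the request.
    scheme = "https" if is_secure else "http"
    if request_origin == f"{scheme}://{host}":
        return True
    # Otherwise the Origin header must appear in CSRF_TRUSTED_ORIGINS.
    return request_origin in trusted_origins


origin = "https://awx.lab.mpoore.io"
host = "awx.lab.mpoore.io"

# Assumption: the backend sees HTTP, so the rebuilt origin is http://... and does not match.
print(origin_is_trusted(origin, host, is_secure=False, trusted_origins=[]))        # False
# With the origin listed explicitly, the same request is accepted.
print(origin_is_trusted(origin, host, is_secure=False, trusted_origins=[origin]))  # True
```

Under that assumption, the browser sends an ‘https’ origin while the backend expects an ‘http’ one, and the explicit list is what bridges the gap.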

Now, which Application Profile setting was it that caused the problem? I didn’t know for sure, so I had to make incremental changes to the profile to see which one ‘broke’ AWX login. In the end, it was “HTTP-only Cookies” that made the difference.

Screenshot showing the Avi Application Profile security settings that work.
Figure 7: Leaving ‘HTTP-only Cookies’ switched off allowed AWX login to work.

Leaving that turned off sorted the login issue and I was able to scale out the replicas again.
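I haven’t confirmed exactly why that setting matters. One plausible reading, offered as an assumption, is that the AWX UI reads the CSRF cookie from JavaScript so it can send the token back with the login request, and a cookie flagged ‘HttpOnly’ is hidden from scripts. A quick way to see whether the flag is present is to inspect the ‘Set-Cookie’ header that reaches the browser. The sketch below does that with Python’s standard library; the function name `cookie_is_http_only` and the sample header values are hypothetical.

```python
from http.cookies import SimpleCookie


def cookie_is_http_only(set_cookie_header, name="csrftoken"):
    """Report whether the named cookie in a Set-Cookie header carries HttpOnly."""
    jar = SimpleCookie()
    jar.load(set_cookie_header)
    if name not in jar:
        raise KeyError(f"cookie {name!r} not found in header")
    # The attribute is True when the flag is present and an empty string otherwise.
    return bool(jar[name]["httponly"])


# Hypothetical header values, as might be seen without and with the flag added.
direct = "csrftoken=abc123; Path=/; SameSite=Lax; Secure"
via_load_balancer = "csrftoken=abc123; Path=/; SameSite=Lax; Secure; HttpOnly"

print(cookie_is_http_only(direct))             # False
print(cookie_is_http_only(via_load_balancer))  # True
```

Comparing the header seen through a port forward with the one seen through the load balancer would show whether the flag is being added along the way.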
