I’m not going to cover the installation of AWX in any detail. Briefly, the process involves deploying the “awx-operator” and then using a Custom Resource in Kubernetes to tell the operator how to deploy and configure AWX. The resource is pretty simple: you’re effectively passing it some secrets to consume and it does the rest. Here’s the resource that I started with:
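As a rough illustration only (not the exact manifest used here), a minimal AWX resource for the awx-operator might look like the sketch below. The resource name, namespace, secret names, replica count and service/ingress values are all assumptions, and the fields available depend on the awx-operator version in use.

```shell
# Write a hypothetical AWX custom resource to awx.yaml.
# Every name and value below is an assumption for illustration.
cat > awx.yaml <<'EOF'
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
  namespace: awx
spec:
  admin_user: admin
  admin_password_secret: awx-admin-password                  # assumed secret name
  postgres_configuration_secret: awx-postgres-configuration  # assumed secret name
  service_type: NodePort                                     # assumed starting value
  ingress_type: ingress
  hostname: awx.lab.mpoore.io
  ingress_tls_secret: awx-tls                                # assumed secret name
  replicas: 2                                                # assumed replica count
EOF

# With cluster access, hand it to the operator with:
#   kubectl apply -f awx.yaml
```

The secrets referenced in the sketch would need to exist in the same namespace before the operator can consume them.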
After giving the awx-operator time to do its thing, you get a few more pods:
And some services:
The Avi Kubernetes Operator (AKO), being our ingress provider of choice, picks up the service and creates some objects around it automatically. Without too much apparent difficulty, AWX is available via an HTTPS URL with a trusted certificate. However…
I tried to log in using the default admin credentials that I defined and got the error above. Initially I thought that I must have made a mistake with the credentials, so I checked it all again… and again. Still no dice!
It turns out that my friend Mark Brookfield had the same issue and had been unable to solve it either. There were some hints online about CSRF (Cross-Site Request Forgery) cookies and the Django web framework that AWX uses being part of the problem, and we tried some of the suggested remedies to no avail. Using the Developer Tools in my browser, I could see that a CSRF token was being sent to the browser, but I couldn’t tell if it was being sent back.
The next stop for troubleshooting was to fiddle with the ingress settings passed to the awx-operator. Lots of googling, lots of experimentation. The only two outcomes were not being able to reach AWX at all, or getting that error during login.
The next step was to check the logs of the “awx-web-*” pods to see whether they recorded an error. The easiest way to do this was to reduce the number of replicas to 1 by modifying and re-applying the Custom Resource from above.
With the number of replicas down to 1, it was easy to tail the logs using the following command:
kubectl logs -f -n awx awx-web-6f8bcbd854-pmzct
Well, will you look at that! A warning about the CSRF token being missing and a 403 returned.
For the benefit of the internet’s search engines, here is the error:
[pid: 22|app: 0|req: 1/4] 192.168.15.1 () {60 vars in 1052 bytes} [Mon Sep 16 10:54:42 2024] GET /api/v2/auth/ => generated 2 bytes in 208 msecs (HTTP/1.1 200) 11 headers in 379 bytes (1 switches on core 0)
192.168.15.1 - - [16/Sep/2024:10:54:57 +0000] "GET /api/login/ HTTP/1.1" 200 5754 "https://awx.lab.mpoore.io/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36" "172.16.100.2"
[pid: 24|app: 0|req: 2/5] 192.168.15.1 () {60 vars in 1048 bytes} [Mon Sep 16 10:54:57 2024] GET /api/login/ => generated 5754 bytes in 106 msecs (HTTP/1.1 200) 10 headers in 463 bytes (1 switches on core 0)
2024-09-16 10:54:57,805 WARNING [cdd247bc4ab540b3b195978565d5496f] django.security.csrf Forbidden (CSRF token missing.): /api/login/
[pid: 20|app: 0|req: 3/6] 192.168.15.1 () {66 vars in 1203 bytes} [Mon Sep 16 10:54:57 2024] POST /api/login/ => generated 1019 bytes in 10 msecs (HTTP/1.1 403) 7 headers in 271 bytes (1 switches on core 0)
192.168.15.1 - - [16/Sep/2024:10:54:57 +0000] "POST /api/login/ HTTP/1.1" 403 1019 "https://awx.lab.mpoore.io/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36" "172.16.100.2"
So, the problem is identified, but we don’t know the root cause yet. As a quick test, the ‘service_type’ was changed to “ClusterIP” and a port forward was used to bypass the load balancer. The CSRF token made it through, meaning that the cause was probably somewhere in Avi.
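A port forward of this kind can be sketched as a small shell helper. The function name is hypothetical, and the service name `awx-service` and port 80 are assumptions based on an AWX resource named `awx`; `kubectl get svc -n awx` shows the real values.

```shell
# Forward a local port straight to the AWX web service, bypassing Avi.
# "awx-service" and port 80 are assumptions; confirm them with:
#   kubectl get svc -n awx
forward_awx() {
  local_port="${1:-8080}"
  kubectl port-forward -n awx svc/awx-service "${local_port}:80"
}

# Usage (needs cluster access, runs until interrupted):
#   forward_awx 8080
# then browse to http://localhost:8080 and try logging in.
```

Because the browser then talks to the pod directly, anything the load balancer adds to or strips from the response is taken out of the picture.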
After some reading around, I found a page in the AKO documentation covering HostRules. The fact that you could use these Custom Resources to manipulate the configuration in Avi seemed like a winner. I decided to make some changes to the Application Profile, as the default one that was being used (‘System-Secure-HTTP’) has quite a lot of security settings in it.
The first step was to copy (manually, as there’s no UI option to duplicate) the ‘System-Secure-HTTP’ Application Profile to a new one called ‘Custom-Secure-HTTP’. I then disabled many of the security settings in ‘Custom-Secure-HTTP’.
Next, I needed to create a HostRule as described in the documentation above. For this I only wanted to change the Application Profile being used, so I ended up with the following Custom Resource to give to AKO:
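For illustration, a HostRule that only swaps the Application Profile could look roughly like this sketch. The rule name and namespace are assumptions, and the `apiVersion` depends on the AKO release (older releases use `ako.vmware.com/v1alpha1`), so check the CRDs installed in your cluster.

```shell
# Write a hypothetical HostRule to awx-hostrule.yaml.
# The apiVersion is an assumption; it varies between AKO releases.
cat > awx-hostrule.yaml <<'EOF'
apiVersion: ako.vmware.com/v1beta1
kind: HostRule
metadata:
  name: awx-hostrule          # assumed name
  namespace: awx              # assumed namespace
spec:
  virtualhost:
    fqdn: awx.lab.mpoore.io
    applicationProfile: Custom-Secure-HTTP
EOF

# With cluster access:
#   kubectl apply -f awx-hostrule.yaml
```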
In the HostRule, the ‘fqdn’ field specifies the FQDN of the Virtual Service that the rule will be applied to, and the ‘applicationProfile’ field specifies the name of the Application Profile that we want to use.
With that applied, I reverted all of the other changes made in the deployment of AWX and waited. Once again with the logs being tailed, look what happened!
[pid: 24|app: 0|req: 5/10] 192.168.15.1 () {60 vars in 1073 bytes} [Mon Sep 16 11:41:04 2024] GET /api/v2/auth/ => generated 2 bytes in 161 msecs (HTTP/1.1 200) 11 headers in 379 bytes (1 switches on core 0)
192.168.15.1 - - [16/Sep/2024:11:41:20 +0000] "GET /api/login/ HTTP/1.1" 200 5754 "https://awx.lab.mpoore.io/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36" "172.16.100.2"
[pid: 23|app: 0|req: 1/11] 192.168.15.1 () {60 vars in 1069 bytes} [Mon Sep 16 11:41:20 2024] GET /api/login/ => generated 5754 bytes in 413 msecs (HTTP/1.1 200) 10 headers in 463 bytes (1 switches on core 0)
2024-09-16 11:41:20,824 WARNING [bbdccb79dd614138abd91c61307fa442] django.security.csrf Forbidden (Origin checking failed - https://awx.lab.mpoore.io does not match any trusted origins.): /api/login/
[pid: 24|app: 0|req: 6/12] 192.168.15.1 () {66 vars in 1224 bytes} [Mon Sep 16 11:41:20 2024] POST /api/login/ => generated 1019 bytes in 22 msecs (HTTP/1.1 403) 7 headers in 271 bytes (1 switches on core 0)
192.168.15.1 - - [16/Sep/2024:11:41:20 +0000] "POST /api/login/ HTTP/1.1" 403 1019 "https://awx.lab.mpoore.io/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36" "172.16.100.2"
Yes, it still failed, but the cookie actually got through! And the error told me pretty much everything that I needed to fix it, based on my prior research into the issue. All that was needed was a modification to the resource given to the awx-operator to deploy AWX. A list of trusted origins can be specified during the deployment.
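As a hedged sketch of that change, the fragment below shows how a trusted origin can be passed through the awx-operator’s ‘extra_settings’ list. It is a fragment to merge into the full AWX resource rather than a complete manifest, and the file name is hypothetical.

```shell
# Write a fragment of the AWX resource spec to a scratch file.
# Merge it into the full resource; do not apply this file on its own.
cat > awx-extra-settings-fragment.yaml <<'EOF'
spec:
  extra_settings:
    - setting: CSRF_TRUSTED_ORIGINS
      value:
        - https://awx.lab.mpoore.io
EOF
```

The origin has to include the scheme and must match what the browser sends in the Origin header, which is the value quoted in the “does not match any trusted origins” warning.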
The ‘extra_settings’ block added to the resource does exactly that. Once the awx-operator had time to reconfigure AWX, I was able to log in without issue!
Now, which Application Profile setting was it that caused the problem? Well, that I didn’t know for sure, so I finally had to make incremental changes to the profile to see which one ‘broke’ AWX login. In the end, it was “HTTP-only Cookies” that made the difference.
Leaving that turned off sorted the login issue and I was able to scale out the replicas again.
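One way to see the effect of that setting from the command line is to look at the cookie flags in the response. By default Django issues the csrftoken cookie without the HttpOnly flag so that client-side JavaScript can read it and send it back in a request header. If the load balancer adds HttpOnly, that would plausibly explain the “CSRF token missing” warning, although I have not confirmed that this is the exact mechanism in AWX. The helper below is a sketch with a hypothetical name; it only inspects response headers passed on stdin.

```shell
# Succeeds (exit 0) if the response headers on stdin set a csrftoken
# cookie that carries the HttpOnly flag. Helper name is hypothetical.
csrf_cookie_is_httponly() {
  grep -i '^set-cookie:[[:space:]]*csrftoken=' | grep -qi 'httponly'
}

# Usage (needs network access to AWX):
#   curl -sk -D - -o /dev/null https://awx.lab.mpoore.io/api/login/ \
#     | csrf_cookie_is_httponly && echo "csrftoken is HttpOnly"
```

Running the check once through the load balancer and once through a port forward makes it easy to compare what each path does to the cookie.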