This is the tale of how bugs get diagnosed in the world of FOSS software. Join us on a journey from “Man, why does this not work?" through “This makes no sense at all!" to final “All is good now."
It all started on a sunny afternoon when our administrators replaced the old firewall with a brand new one. And because the new firewall requires a different client to establish the VPN connection, we also started updating the documentation on how to get the VPN working again. So far so good.
After we installed the required plugin for NetworkManager and created new the VPN configuration, we tried to establish a connection and were greeted by an error message, informing us, that something went wrong. But what?!?
Dragging the bug out into the open
Since the pop-up error message was not informative enough to be of much help, we opened our good-ol terminal emulator and ran
$ journalctl -u NetworkManager -e
Thankfully, developers of NetworkManager, NetworkManager-fortisslvpn and openfortivpn added enough diagnostic messages to the log that we were able to determine the main cause of the connection issue: the VPN client does not like our server certificate. The exact error message was:
Gateway certificate validation failed, and the certificate digest in not in the local whitelist. If you trust it, rerun with: --trusted-cert 3c8678a17367d9a5c... or add this line to your config file: trusted-cert = 3c8678a17367d9a5c...
Openfortivpn developers rock! They even included a fix/workaround for our problem in the error message.
At this point, we added the workaround into the documentation and published the docs, since getting people a working copy of an ugly procedure now allows them to resume working (perfection is a productivity killer in such situations).
Inspecting the bug
Temporary workaround that we placed in the documentation will work for a couple of months, but the certificate renewal will break the instructions, so we decided to find a more permanent solution to this problem. And so we started to dig deeper.
What could be wrong with the server certificate? Browsers all display the happy green lock on that URL, but just to be safe, we decided to check with OpenSSL’s client.
$ openssl s_client -connect <REDACTED>.xlab.si:443 < /dev/null depth=0 C = SI, L = Ljubljana, O = XLAB d.o.o., CN = *.xlab.si verify error:num=20:unable to get local issuer certificate verify return:1 depth=0 C = SI, L = Ljubljana, O = XLAB d.o.o., CN = *.xlab.si verify error:num=21:unable to verify the first certificate verify return:1 DONE CONNECTED(00000003) --- Certificate chain 0 s:/C=SI/L=Ljubljana/O=XLAB d.o.o./CN=*.xlab.si i:/C=US/O=DigiCert Inc/CN=DigiCert SHA2 Secure Server CA --- ...
Well, it turns out that browsers sometimes do not tell the whole truth, since they happily tolerated the missing intermediary certificate. While we were updating the certificate that is served to the client, we started asking ourself: “Could this be it? Did we just fixed the issue for good?”
The bug behind the bug
To test the fix, we removed the trusted digest from the VPN connection settings and tried to establish the connection again. And it failed. With the same error message!?!
And this is how dreams get crushed. Some people would get discouraged by this, but not us. We were determined to get to the bottom of this issue. And since openfortivpn is FOSS, we checked-out the sources from GitHub and started taking things apart.
First thing that we did was to simplify our test by removing the NetworkManager from the testing procedure. What we ended up with was the next command that reproduced the output that we previously found int the log:
# openfortivpn -c k.cfg --no-dns --no-routes <REDACTED>.xlab.si:443
Next, we built the latest development release of the openfortivpn client (from the master branch):
$ ./autogen.sh && ./configure && make
Then we checked if the error still persisted … and it did. Being completely new to the codebase, we decided to run
$ git grep "Gateway certificate validation fail" src/tunnel.c:577: log_debug("Gateway certificate validation ... src/tunnel.c:603: log_error("Gateway certificate validation ...
Bingo! Two hits relatively close by indicated that
contained the functionality that was broken. And after we opened the file,
those two locations were in a function called
ssl_verify_cert. To get to
the exact point of the failure, we ran the connect command under
# gdb --args \ ./openfortivpn -c k.cfg --no-dns --no-routes <REDACTED>.xlab.si:443 ... (gdb) b ssl_verify_cert Breakpoint 1 at 0x40abe1: file src/tunnel.c, line 545. (gdb) r ... Breakpoint 1, ssl_connect (tunnel=0x7fffffffbe80) at src/tunnel.c:754 754 if (ssl_verify_cert(tunnel)) (gdb) s (gdb) n ## Repeat this 10 times 565 && strncasecmp(common_name, tunnel->config->gateway_host, (gdb) n 577 log_debug("Gateway certificate validation failed.\n");
This way we found out that hostname of the VPN endpoint and common name in
certificate do not match, which was no surprise, since we are using wildcard
Now we needed to find a fix for this. Inspecting a
once again, we noticed that line 555 contained a preprocessor
conditional that tested for macro
HAVE_X509_CHECK_HOST and that we were
using the else branch in our
The other branch of the conditional statement used
perform the validation and a few web searches later we were certain that using
this function would resolve our issue. But the build system was convinced that
we do not have this function even though our OpenSSL library was recent
HAVE_SOME_FUNC macros are usually defined by the
autoconf as a result
AC_CHECK_FUNCS([some_func]) macro call in
configure.ac. And indeed,
configure.ac file for openfortivpn contained that macro on the line 42.
But why did this test fail to find the
X509_check_host function? To
determine the reason, we opened the
config.log file that has been produced
./configure call. Searching for the
X509_check_host string brought
us to the next section of the log file:
configure:5102: checking for X509_check_host configure:5102: gcc -o conftest -g -O2 conftest.c -lutil -lpthread >&5 /tmp/cc46x0eH.o: In function `main': conftest.c:184: undefined reference to `X509_check_host'
Test failed because there are some linker flags missing (
to be precise). When using
autoconf, these flags are usually added to the
compile commads by the
AC_CHECK_LIB macro. But the OpenSSL linker flags are
obtained from the
pkg-config command (
PKG_CHECK_MODULES macro in
configure.ac file), which does not add them to the compile command.
The extermination of the bug
And this is how we got to the bottom of the problem. Interested to see how we fixed this? Join us in the second part of the journey (that is almost as long as the first part, despite the Tolkien’s stories that are concluded fairly quickly;).