May 14, 2018

The Bug Hunt in FOSS: There and Back Again

WORDS BY   Tadej Borovšak

POSTED IN   autotools | community | foss


This is the tale of how bugs get diagnosed in the world of FOSS software. Join us on a journey from "Man, why does this not work?" through "This makes no sense at all!" to final "All is good now."

Being bugged

It all started on a sunny afternoon when our administrators replaced the old firewall with a brand new one. And because the new firewall requires a different client to establish the VPN connection, we also started updating the documentation on how to get the VPN working again. So far so good.

After we installed the required plugin for NetworkManager and created new the VPN configuration, we tried to establish a connection and were greeted by an error message, informing us, that something went wrong. But what?!?

Dragging the bug out into the open

Since the pop-up error message was not informative enough to be of much help, we opened our good-ol terminal emulator and ran

$ journalctl -u NetworkManager -e

Thankfully, developers of NetworkManager, NetworkManager-fortisslvpn and openfortivpn added enough diagnostic messages to the log that we were able to determine the main cause of the connection issue: the VPN client does not like our server certificate. The exact error message was:

Gateway certificate validation failed, and the certificate digest in not
  in the local whitelist. If you trust it, rerun with:
    --trusted-cert 3c8678a17367d9a5c...
or add this line to your config file:
    trusted-cert = 3c8678a17367d9a5c...

Openfortivpn developers rock! They even included a fix/workaround for our problem in the error message.

At this point, we added the workaround into the documentation and published the docs, since getting people a working copy of an ugly procedure now allows them to resume working (perfection is a productivity killer in such situations).

Inspecting the bug

Temporary workaround that we placed in the documentation will work for a couple of months, but the certificate renewal will break the instructions, so we decided to find a more permanent solution to this problem. And so we started to dig deeper.

What could be wrong with the server certificate? Browsers all display the happy green lock on that URL, but just to be safe, we decided to check with OpenSSL's client.

$ openssl s_client -connect <REDACTED>.xlab.si:443 < /dev/null
depth=0 C = SI, L = Ljubljana, O = XLAB d.o.o., CN = *.xlab.si
verify error:num=20:unable to get local issuer certificate
verify return:1
depth=0 C = SI, L = Ljubljana, O = XLAB d.o.o., CN = *.xlab.si
verify error:num=21:unable to verify the first certificate
verify return:1
DONE
CONNECTED(00000003)
---
Certificate chain
 0 s:/C=SI/L=Ljubljana/O=XLAB d.o.o./CN=*.xlab.si
   i:/C=US/O=DigiCert Inc/CN=DigiCert SHA2 Secure Server CA
---
...

Well, it turns out that browsers sometimes do not tell the whole truth, since they happily tolerated the missing intermediary certificate. While we were updating the certificate that is served to the client, we started asking ourself: "Could this be it? Did we just fixed the issue for good?"

The bug behind the bug

To test the fix, we removed the trusted digest from the VPN connection settings and tried to establish the connection again. And it failed. With the same error message!?!

And this is how dreams get crushed. Some people would get discouraged by this, but not us. We were determined to get to the bottom of this issue. And since openfortivpn is FOSS, we checked-out the sources from GitHub and started taking things apart.

First thing that we did was to simplify our test by removing the NetworkManager from the testing procedure. What we ended up with was the next command that reproduced the output that we previously found int the log:

# openfortivpn -c k.cfg --no-dns --no-routes <REDACTED>.xlab.si:443

Next, we built the latest development release of the openfortivpn client (from the master branch):

$ ./autogen.sh && ./configure && make

Then we checked if the error still persisted ... and it did. Being completely new to the codebase, we decided to run

$ git grep "Gateway certificate validation fail"
src/tunnel.c:577:       log_debug("Gateway certificate validation ...
src/tunnel.c:603:       log_error("Gateway certificate validation ...

Bingo! Two hits relatively close by indicated that src/tunnel.c file contained the functionality that was broken. And after we opened the file, those two locations were in a function called ssl_verify_cert. To get to the exact point of the failure, we ran the connect command under gdb:

# gdb --args \
    ./openfortivpn -c k.cfg --no-dns --no-routes <REDACTED>.xlab.si:443
...
(gdb) b ssl_verify_cert
Breakpoint 1 at 0x40abe1: file src/tunnel.c, line 545.
(gdb) r
...
Breakpoint 1, ssl_connect (tunnel=0x7fffffffbe80) at src/tunnel.c:754
754                     if (ssl_verify_cert(tunnel))
(gdb) s
(gdb) n  ## Repeat this 10 times
565             && strncasecmp(common_name, tunnel->config->gateway_host,
(gdb) n
577         log_debug("Gateway certificate validation failed.\n");

This way we found out that hostname of the VPN endpoint and common name in certificate do not match, which was no surprise, since we are using wildcard certificate for *.xlab.si.

Now we needed to find a fix for this. Inspecting a ssl_verify_cert function once again, we noticed that line 555 contained a preprocessor conditional that tested for macro HAVE_X509_CHECK_HOST and that we were using the else branch in our gdb session.

The other branch of the conditional statement used X509_check_host to perform the validation and a few web searches later we were certain that using this function would resolve our issue. But the build system was convinced that we do not have this function even though our OpenSSL library was recent enough.

Now, HAVE_SOME_FUNC macros are usually defined by the autoconf as a result of a AC_CHECK_FUNCS([some_func]) macro call in configure.ac. And indeed, configure.ac file for openfortivpn contained that macro on the line 42.

But why did this test fail to find the X509_check_host function? To determine the reason, we opened the config.log file that has been produced by the ./configure call. Searching for the X509_check_host string brought us to the next section of the log file:

configure:5102: checking for X509_check_host
configure:5102: gcc -o conftest -g -O2   conftest.c -lutil -lpthread  >&5
/tmp/cc46x0eH.o: In function `main':
conftest.c:184: undefined reference to `X509_check_host'

Test failed because there are some linker flags missing (-crypt and -lssl to be precise). When using autoconf, these flags are usually added to the compile commads by the AC_CHECK_LIB macro. But the OpenSSL linker flags are obtained from the pkg-config command (PKG_CHECK_MODULES macro in configure.ac file), which does not add them to the compile command.

The extermination of the bug

And this is how we got to the bottom of the problem. Interested to see how we fixed this? Join us in the second part of the journey (that is almost as long as the first part, despite the Tolkien's stories that are concluded fairly quickly;).