How deep can a bug be?

"setup_range_conditions": [], "analyzing_range_alternatives": { "range_scan_alternatives": [ { "index": "PRIMARY", "ranges": ["(id) <= (3)"], "rowid_ordered": false, "using_mrr": false, "index_only": false, "rows": 2, "cost": 0.004554858, "chosen": false, "cause": "cost" } ], ... "considered_access_paths": [ { "access_type": "scan", "rows": 7, "rows_after_filter": 2, "rows_out": 2, "cost": 0.113371474, "index_only": false, "chosen": true } ],

Engineering effort makes open source happen, smoothly and often unnoticed. That effort, paid for via donations, sponsorship, or whatever funding model a project has, is the only way a project survives.

diff --git a/sql-common/client.c b/sql-common/client.c
index e6ad0e97c18..7cb8c070f1c 100644
--- a/sql-common/client.c
+++ b/sql-common/client.c
@@ -2089,7 +2089,7 @@ static int send_client_reply_packet(MCPVIO_EXT *mpvio,
   if (mpvio->db)
     mysql->client_flag|= CLIENT_CONNECT_WITH_DB;

-  if (vio_type == VIO_TYPE_NAMEDPIPE)
+  if (vio_type == VIO_TYPE_NAMEDPIPE || vio_type == VIO_TYPE_SOCKET)
   {
     mysql->server_capabilities&= ~CLIENT_SSL;
     mysql->options.use_ssl= 0;

The SQL query below was executed with an incorrect query plan. There’s no ORDER BY in the query, so the rows returned were actually a valid answer, just not produced by the plan the test expected.

It’s IBM’s generosity in being a Foundation Sponsor, a major sponsor at that, that enables me to dive into these rabbit holes. Sometimes that can create very tangible benefits for the entire ecosystem, and in this case to many of their customers. The stability of year on year sponsorship facilitates stable employment where engineers can grow skills, fix bugs, and occasionally blog about them.

+      "range_scan_alternatives": [
+        {
+          "index": "PRIMARY",
+          "ranges": ["(id) <= (3)"],
+          "rowid_ordered": false,
+          "using_mrr": false,
+          "index_only": false,
+          "rows": 2,
+          "cost": 0.004554858,
+          "chosen": true
+        }

Any POWER/ppc64le system with an application calling into OpenSSL that’s using a specific set of VSX or floating point registers might be exhibiting unexpected behavior because of this. Yes, yikes, code that has been core in OpenSSL for almost 4 years and I’m the first to find it?

if (read_time > found_read_time ....)
  ...
  trace_idx.add("chosen", true);
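To make that comparison concrete, here is a minimal Python sketch of the decision, not the server's actual code. The function name `choose_range_scan` and the trace dictionary layout are hypothetical; the two costs are taken from the optimizer trace in this post. The `0.0` standing in for a corrupted `read_time` is purely an assumption for illustration, as the actual clobbered value isn't known.

```python
def choose_range_scan(read_time, found_read_time):
    """Model one range scan alternative's trace entry (hypothetical helper).

    read_time:       best cost found so far (name borrowed from the snippet)
    found_read_time: cost of the range scan being considered
    """
    trace = {"cost": found_read_time}
    if read_time > found_read_time:
        # The range scan is cheaper than the best plan so far.
        trace["chosen"] = True
    else:
        # Rejected because it does not beat the current best cost.
        trace["chosen"] = False
        trace["cause"] = "cost"
    return trace


# Costs copied from the optimizer trace in this post.
FULL_SCAN_COST = 0.113371474
RANGE_SCAN_COST = 0.004554858

# With intact values the cheaper range scan wins, as the test expects.
expected = choose_range_scan(FULL_SCAN_COST, RANGE_SCAN_COST)

# Assumption: simulate a corrupted floating point value with 0.0.
# The range scan is now rejected with cause "cost", like the failing trace.
corrupted = choose_range_scan(0.0, RANGE_SCAN_COST)
```

The point of the sketch is only that a single floating point comparison decides the plan, so a wrong value on either side is enough to flip "chosen" without any error being raised.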

So down the rabbit hole we go, looking at what’s going on. As this test failure was exclusively on ppc64le with a particular OS, getting that environment was key. IBM is a platinum sponsor, which helps pay the entire foundation staff to pursue general openness, adoption and continuity, and which translated into me working of my own volition on solving this bug. They also sponsor Buildbot hardware, which allowed us to see this bug and gave us a remotely accessible test environment to debug it.

I feel I need to start this story by justifying why it was left so long. I’ll contain this to a paragraph. In summary, the test failure wasn’t dependent on the continuous integration worker (so not a hardware fault), was exclusively on RHEL 9 and CentOS Stream 9, and only on POWER (ppc64le). It hit two tests, and was in a MariaDB Connector, so my assumption was a quirky, one-off, low-impact compiler bug. It also started in MariaDB-11.4, for reasons that will remain a bit of a mystery until later.

Now, after validating the fix by Danny Tsen, I’m just waiting for the fix to be reviewed and merged into OpenSSL. After that it gets backported, picked up by RHEL (+ FIPS re-validation?) and other distributions, and finally comes back to MariaDB Foundation for us to update our Buildbot containers. After that we’ll have at least two tests green again. In the meantime, I’m enjoying a coffee, about to start the next rabbit hole adventure.

Last year I filed a bug report, MDEV-33603, on what looked like a benign problem with the optimizer taking a different code path in a particular trivial looking test. Its benign looking nature led to me not looking at it until last week. The “benign” bug, as it turned out, is a bug in an OpenSSL optimization on IBM POWER, which may not be the lowest level of “How deep”, but it’s certainly a long way from the high level (above storage engines) optimizer decisions in MariaDB.

On ppc64le systems, applications may exhibit undefined behavior because the ppc_aes_gcm_encrypt and ppc_aes_gcm_decrypt functions overwrite certain floating-point and VSX registers. As a result, values stored in these registers by calling functions may be replaced with intermediate AES-GCM computation data upon return, leading to data corruption or unpredictable results.
ref: https://github.com/openssl/openssl/pull/28990#issuecomment-3458481314
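As a rough illustration of what "overwriting registers" means for a caller, the Python sketch below models a register file as a dictionary. Everything here is hypothetical: the register name `saved_reg`, the helper names, and the placeholder value are assumptions made for illustration, and none of it is the OpenSSL assembly or its fix. It only shows the contract: a callee that uses a register the caller relies on must restore it before returning.

```python
SCRATCH = 0xDEADBEEF  # placeholder for intermediate computation data


def clobbering_callee(regs):
    """Uses a register the caller relies on and never restores it."""
    regs["saved_reg"] = SCRATCH  # intermediate data left behind on return


def well_behaved_callee(regs):
    """Saves the register on entry and restores it before returning."""
    saved = regs["saved_reg"]
    try:
        regs["saved_reg"] = SCRATCH  # free to use it while running
    finally:
        regs["saved_reg"] = saved    # caller's value is back in place


def caller(callee):
    """Keeps a value in a register across a call, then reads it back."""
    regs = {"saved_reg": 0.113371474}  # e.g. a cost held by the caller
    callee(regs)
    return regs["saved_reg"]


after_good = caller(well_behaved_callee)
after_bad = caller(clobbering_callee)
```

In the model, `after_good` still holds the caller's value while `after_bad` holds leftover scratch data, which is the kind of silent corruption the PR description warns about.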

So a little benign bug from last year is now described on the PR (by me) as:

Our Buildbot environment is container based, so a pull of quay.io/mariadb-foundation/bb-worker:centosstream9 and I’ve got the environment covered. A quick clone and build of the server code, a run of the test connect.mysql_index, and luckily it turned out to be reliably repeatable. The wrong result didn’t come from the first query in the test, so I simplified the test case to the minimal SQL setup plus the query that generated the wrong result.

IBM people inadvertently created this bug in software for their hardware; they are only human, and they submitted a fix rapidly when advised. That’s the nature of open source: quick bug fixes, without contracts, just communication and rapid evolution. It’s IBM’s provisioning of hardware to open source projects like MariaDB that made this bug visible and debuggable, and many will benefit.

SELECT * FROM t2 WHERE id <= 3

Danny Tsen responds and comes back with a fix in PR #28990 two days later. Thank you Danny ❤️.