Heya party people! This is a tricky bit of news to share, but it needs to be done.
If you're not in the UK you probably didn't know, but two weeks ago we had some down-time. 😬 This was our first significant period of down-time in the history of Finger-Ink. The entire platform was unusable for a number of hours.
Even worse, a couple of days later there were further issues for the same group of customers. Web forms were still available, but a large part of Finger-Ink's functionality was offline for a number of hours.
I've heard from many of you about how frustrating this was. To everyone who experienced this outage — I am truly sorry.
As of writing this, all major contributing issues have been addressed, and Finger-Ink is in the best shape it's been since we introduced the portal.
If you're interested in understanding what went wrong, and what's been done to address this, please read on.
Occasionally, the servers that run Finger-Ink need patching. Security and stability updates are released, and these need to be applied in order to ensure our servers are not vulnerable.
Our hosting provider ensures that our servers are up-to-date. They do this by applying these patches for us, as required. At the time, however, it also meant that our servers would be restarted automatically after these patches were applied.
Prior to this event, all patching had been done during Finger-Ink's office hours (or close enough to them). This was important because, when our application came back online after a restart, some manual steps had to be performed to bring our encryption module online. This encryption module is used throughout Finger-Ink — nothing can operate without it.
Unfortunately, on this day, the patching and restarting was done well outside of our regular hours. When the application came back online and sent a notification that the manual steps were required, no one was around to receive it. This meant the encryption module remained offline until I checked in on things at 5:30am the following morning and brought it online.
Following the first outage, on that same day, we deployed a change to ensure that the encryption module could come online all by itself — without requiring any manual steps.
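To give a sense of the shape of that change, here's a simplified sketch of an encryption module that initialises itself at startup. This isn't our actual code; the names, the key-retrieval step, and the retry behaviour are all illustrative:

```typescript
// A simplified sketch of bringing the encryption module online
// automatically at startup. Names here are illustrative, not real.

interface EncryptionModule {
  init(masterKey: Buffer): void;
}

// Stand-in for however the master key is actually retrieved
// (a secrets manager, an encrypted store, etc.)
declare function fetchMasterKey(): Promise<Buffer>;

async function bringEncryptionOnline(
  module: EncryptionModule,
  maxAttempts = 5,
): Promise<void> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      module.init(await fetchMasterKey());
      console.log("Encryption module online.");
      return; // no human required
    } catch (err) {
      console.warn(`Attempt ${attempt} failed; retrying shortly.`);
      await new Promise((resolve) => setTimeout(resolve, attempt * 2_000));
    }
  }
  // Only page a human if automation has genuinely failed.
  throw new Error("Encryption module could not come online automatically.");
}
```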
This ran well until another patch was applied a day or two later. The server was restarted again, after hours. The encryption module came back online as intended, but the key server (our fancy new tech that ensures API keys aren't removed the moment the Cliniko API reports them as invalid) did not, because it ran into database connection limits.
While the portal remained usable, web forms couldn't match to patients, and workflow couldn't run.
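The underlying fix for this class of problem is to cap each service's connections well below the database's overall limit, and to retry connectivity on boot rather than giving up. Here's a rough illustration of the idea (again, not our actual code; it uses node-postgres purely as an example, and the numbers are made up):

```typescript
import { Pool } from "pg"; // node-postgres, used here purely as an example

// Cap this service's pool well below the database's global connection
// limit, leaving headroom for the portal, web forms, and workflow.
const pool = new Pool({
  max: 10,                        // hard cap on concurrent connections
  idleTimeoutMillis: 30_000,      // release idle connections promptly
  connectionTimeoutMillis: 5_000, // fail fast instead of hanging forever
});

// On boot, verify connectivity with retries, so an after-hours restart
// doesn't leave the service offline just because the database was busy.
async function waitForDatabase(maxAttempts = 10): Promise<void> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await pool.query("SELECT 1");
      return; // connected; good to serve traffic
    } catch (err) {
      console.warn(`Database not ready (attempt ${attempt} of ${maxAttempts})`);
      await new Promise((resolve) => setTimeout(resolve, attempt * 1_000));
    }
  }
  throw new Error("Could not reach the database after restarting.");
}
```

A small, fixed pool also means connections get reused rather than created per request, which is part of why capping them can make things feel faster overall.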
Since then, our top priority has been making sure that this does not happen again. In addition to ensuring that the encryption module always comes back online, we've also taken further steps to improve reliability.
Regarding unplanned patches & server restarts:
Regarding database connectivity & capacity:
Other things we've done:
All is going well thus far. The new connection limits have increased the overall snappiness of the portal interface, and we haven't had any further issues.
There are some additional changes coming to the platform, which will reduce memory usage even further.
Thank you all for your continued support. 🙏
Cover image by Ian Taylor.