Heya party people! This is a tricky bit of news to share, but it needs to be done.
If you're not in the UK you probably didn't know, but two weeks ago we had some down-time. 😬 This was our first significant period of down-time in the history of Finger-Ink. The entire platform was unusable for a number of hours.
Even worse, a couple of days later there were further issues for the same group of customers. Web forms were still available, but a large part of Finger-Ink's functionality was offline for a number of hours.
I've heard from many of you about how frustrating this was. To everyone who experienced this outage — I am truly sorry.
As of writing this, all major contributing issues have been addressed, and Finger-Ink is in the best shape it's been since we introduced the portal.
If you're interested in understanding what went wrong, and what's been done to address this, please read on.
Occasionally, the servers that run Finger-Ink need patching. Security and stability updates are released, and these need to be applied in order to ensure our servers are not vulnerable.
Our hosting provider ensures that our servers are up-to-date. They do this by applying these patches for us, as required. At the time, however, it also meant that our servers would be restarted automatically after these patches were applied.
Prior to this event, all patching had been done during Finger-Ink's office hours (or close enough to them). This was important because, when our application came back online after a restart, some manual steps had to be performed to bring our encryption module online. This encryption module is used throughout Finger-Ink — nothing can operate without it.
Unfortunately, on this day, the patching and restarting was done well outside of our regular hours. When the application came back online and sent a notification that the manual steps were required, no one was around to receive it. This meant the encryption module remained offline until I checked in on things at 5:30am the following morning and brought it online.
Following the first outage, on that same day, we deployed a change to ensure that the encryption module could come online all by itself — without requiring any manual steps.
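To give a sense of the shape of that change, here's a simplified sketch of an encryption module that initialises itself at startup. This isn't our actual code; the names, the key-retrieval step, and the retry behaviour are all illustrative:

```typescript
// A simplified sketch of bringing the encryption module online
// automatically at startup. Names here are illustrative, not real.

interface EncryptionModule {
  init(masterKey: Buffer): void;
}

// Stand-in for however the master key is actually retrieved
// (a secrets manager, an encrypted store, etc.)
declare function fetchMasterKey(): Promise<Buffer>;

async function bringEncryptionOnline(
  module: EncryptionModule,
  maxAttempts = 5,
): Promise<void> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      module.init(await fetchMasterKey());
      console.log("Encryption module online.");
      return; // no human required
    } catch (err) {
      console.warn(`Attempt ${attempt} failed; retrying shortly.`);
      await new Promise((resolve) => setTimeout(resolve, attempt * 2_000));
    }
  }
  // Only page a human if automation has genuinely failed.
  throw new Error("Encryption module could not come online automatically.");
}
```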
This ran well until another patch was applied a day or two later. The server was restarted again, after hours. The encryption module came back online as intended, but the key server (our fancy new tech that ensures API keys aren't removed the moment the Cliniko API reports them as invalid) did not, because it ran into database connection limits.
While the portal remained usable, web forms couldn't match to patients, and workflow couldn't run.
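The underlying fix for this class of problem is to cap each service's connections well below the database's overall limit, and to retry connectivity on boot rather than giving up. Here's a rough illustration of the idea (again, not our actual code; it uses node-postgres purely as an example, and the numbers are made up):

```typescript
import { Pool } from "pg"; // node-postgres, used here purely as an example

// Cap this service's pool well below the database's global connection
// limit, leaving headroom for the portal, web forms, and workflow.
const pool = new Pool({
  max: 10,                        // hard cap on concurrent connections
  idleTimeoutMillis: 30_000,      // release idle connections promptly
  connectionTimeoutMillis: 5_000, // fail fast instead of hanging forever
});

// On boot, verify connectivity with retries, so an after-hours restart
// doesn't leave the service offline just because the database was busy.
async function waitForDatabase(maxAttempts = 10): Promise<void> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await pool.query("SELECT 1");
      return; // connected; good to serve traffic
    } catch (err) {
      console.warn(`Database not ready (attempt ${attempt} of ${maxAttempts})`);
      await new Promise((resolve) => setTimeout(resolve, attempt * 1_000));
    }
  }
  throw new Error("Could not reach the database after restarting.");
}
```

A small, fixed pool also means connections get reused rather than created per request, which is part of why capping them can make things feel faster overall.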
Since then, our top priority has been making sure that this does not happen again. In addition to ensuring that the encryption module always comes back online, we've also taken further steps to improve reliability.
Regarding unplanned patches & server restarts:
Regarding database connectivity & capacity:
Other things we've done:
All is going well thus far. The new connection limits have increased the overall snappiness of the portal interface, and we haven't had any further issues.
There are some additional changes coming to the platform, which will reduce memory usage even further.
Thank you all for your continued support. 🙏
Cover image by Ian Taylor.