At Sixfold we provide real-time visibility for supply chains; you can read more at sixfold.com.
This is the first post in a series in which we dig deeper into how we ensure the scalability of our service and the technical challenges involved. Follow us on Medium to get notified when a new post is published.
Before we dive into the details, a bit of background. As part of the Sixfold product, we ingest quite a lot of data from different sources. The two biggest categories are what we call transport data and telemetry data.
The transport data gives us an idea as to what the planned route is: where and when to pick up or deliver the goods.
The telemetry data is what allows us to give real-time estimates of the pick-up and delivery times. This data comes from GPS units mounted on the vehicles executing the plan.
Simply put, merging these two data streams allows us to figure out what is going on with the transport, in real-time.
But this also means that as our customer base grows, our product scales along two axes: the number of transports and the number of vehicles we are connected to.
“We are anticipating a significantly increased amount of business coming from Transporeon starting September 1st.”
These were roughly the marching orders that the engineering team received at the start of August 2020 from the executive team.
Behind the scenes, work had started on an initiative that would eventually lead to Sixfold and Transporeon joining forces to shake up the visibility market. With about 30 days until the release, our goal was to make sure the engineering side was up to the task.
The difference in scale we are talking about here is going from hundreds of thousands of transports per month to millions of transports per month: an order of magnitude.
Planning for scale
Although we had built our product to be resilient and scalable from day one, we knew some trouble was bound to pop up.
As this goal involved all aspects of our business, we assembled a working group specifically for it, including people from various engineering teams as well as from business operations and customer success.
We also realised that, as the deadline was quite close, the group should work and act as any regular team would: frequent standups to share the current status, and a dedicated communications channel, in our case a channel in Slack.
After a brief discussion, we realised that we shouldn't guess which parts of the system could cause trouble; we should test for it.
One of the perks of hosting our entire service in the cloud on Kubernetes is the speed at which we can set up a dedicated environment that closely resembles production (apart from the data). We used such a dedicated environment for this load test, where we could simulate the increase in load.
Quite early on at Sixfold we had built a set of in-house tools to generate realistic-looking, but fake, data. Originally this was intended as a sales tool, so that we could demo our product. Over time, however, the tooling had become much more useful, allowing us to test new features as well as build further tools, such as an end-to-end blackbox test of our entire product.
This tooling now came in incredibly handy, as it allowed us to simulate the increase in load by simply generating the data ourselves.
With the system under load, we could proceed to figuring out which parts needed a helping hand.
Identifying and eliminating the bottlenecks
Our architecture is built around Apache Kafka, meaning various microservices communicate with each other via message-passing.
One of the main metrics we use to gauge the stability of our system is how well the services are processing these messages — also known as lag. If a service is unable to process messages at appropriate speeds, it will start falling behind, which in turn surfaces as lag — a clear identifier that a service is not able to operate at the required scale/speed.
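As a rough illustration of what lag means here, a minimal sketch with made-up offset numbers (in production these figures would come from a Kafka admin client, not hard-coded data):

```typescript
// Consumer lag per partition is the gap between the newest offset the broker
// holds and the offset the consumer group has committed. The data shapes and
// numbers below are hypothetical stand-ins for what Kafka would report.
interface PartitionOffsets {
  partition: number;
  latestOffset: number;    // high-water mark on the broker
  committedOffset: number; // last offset the consumer group committed
}

function totalLag(partitions: PartitionOffsets[]): number {
  return partitions.reduce(
    (sum, p) => sum + Math.max(0, p.latestOffset - p.committedOffset),
    0,
  );
}

const snapshot: PartitionOffsets[] = [
  { partition: 0, latestOffset: 1_000, committedOffset: 990 },
  { partition: 1, latestOffset: 2_000, committedOffset: 1_950 },
];

console.log(totalLag(snapshot)); // 60
```

A healthy service keeps this number close to zero; a sustained upward trend is the signal that it cannot keep up.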
One of the first, perhaps unsurprising, findings was that services in general were doing fine. As the load increased, so did the number of processed messages.
We did notice, however, that if certain services underwent new deployments, or other situations in which not all instances were available at the same time (for example, a partial outage), they would fall behind on messages and never quite catch up: they were operating at their limit.
In case you are new to Kubernetes: a service (or microservice) can have multiple replicas, or instances, running; these are called pods. During a rollout of a new version there is a window in which older pods are terminated and newer ones are spawned. With Kafka, this forces the consumers to rebalance, which can cause additional load on individual pods.
At this point, our leading assumption was that processing of messages was not fast enough. We were hopeful that we could increase throughput by optimising code before resorting to more hardware. The leading indicator pointing towards this was that we were seeing an elevated CPU usage on the service pods, but not on the database behind them, hinting that a lot of work was happening in the service code.
Profiling data-heavy services
How do you figure out where these CPU cycles are going? You profile the running process! Our services are all running on Node.js, so there are various tools out there. We ended up using the V8 Inspector Protocol, which is conveniently built into Chrome DevTools. So what did we find?
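A minimal sketch of how such a profile can be captured in-process using Node's built-in inspector module, which speaks the same V8 protocol as Chrome DevTools (the function names here are our own; only the `Session` API and `Profiler` protocol methods are from Node itself):

```typescript
// Capture a CPU profile of a hot code path and write it to disk. The
// resulting .cpuprofile file can be loaded straight into Chrome DevTools.
import { Session } from "node:inspector";
import { writeFileSync } from "node:fs";

// Small promise wrapper around the callback-based Session.post API.
function post(session: Session, method: string): Promise<any> {
  return new Promise((resolve, reject) =>
    session.post(method, (err, result) => (err ? reject(err) : resolve(result))),
  );
}

async function captureCpuProfile(work: () => void): Promise<any> {
  const session = new Session();
  session.connect();
  await post(session, "Profiler.enable");
  await post(session, "Profiler.start");
  work(); // the hot path you want to inspect
  const { profile } = await post(session, "Profiler.stop");
  session.disconnect();
  return profile;
}

captureCpuProfile(() => {
  // stand-in for real message-processing work
  let s = 0;
  for (let i = 0; i < 1e6; i++) s += Math.sqrt(i);
}).then((profile) => writeFileSync("hot-path.cpuprofile", JSON.stringify(profile)));
```

In production you would typically trigger this from an admin endpoint rather than at startup, so a live pod can be profiled on demand.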
Regexes can be quite expensive! We identified that TypeORM, the OSS package we use for DB abstraction, had some performance issues. Luckily we could work around these (and they have since been fixed, so kudos to the TypeORM team!).
Similarly, we found that parsing dates in node-postgres could be improved, once again by improving the regexes and their logic.
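To illustrate the general shape of this kind of win (this is not the actual node-postgres patch, just a toy version of the idea): for a fixed-format string like a Postgres timestamp, plain string slicing can replace a regex entirely on a very hot path.

```typescript
// Two ways to parse a fixed-format timestamp such as "2020-09-01 12:34:56".
// The regex version is what you might write first; the slicing version
// avoids regex-engine overhead when the format is guaranteed.
const TS_RE = /^(\d{4})-(\d{2})-(\d{2}) (\d{2}):(\d{2}):(\d{2})$/;

function parseWithRegex(ts: string) {
  const m = TS_RE.exec(ts);
  if (!m) throw new Error(`bad timestamp: ${ts}`);
  return {
    year: +m[1], month: +m[2], day: +m[3],
    hour: +m[4], minute: +m[5], second: +m[6],
  };
}

function parseWithSlices(ts: string) {
  // Positions are fixed: YYYY-MM-DD HH:MM:SS
  return {
    year: +ts.slice(0, 4), month: +ts.slice(5, 7), day: +ts.slice(8, 10),
    hour: +ts.slice(11, 13), minute: +ts.slice(14, 16), second: +ts.slice(17, 19),
  };
}

console.log(parseWithSlices("2020-09-01 12:34:56"));
// { year: 2020, month: 9, day: 1, hour: 12, minute: 34, second: 56 }
```

On a single call the difference is negligible; multiplied by millions of rows per day, it adds up.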
All of this resulted in quite a lot of performance wins across services that do a lot of DB querying. Combined with some DB detective work, improving query plans via indices and batching queries together via dataloaders, we managed to increase throughput quite significantly.
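The batching idea behind dataloaders can be sketched in a few lines (our services use the actual dataloader package; this `TinyLoader` is a deliberately minimal stand-in to show the mechanism):

```typescript
// Individual load(id) calls made in the same tick are collected and resolved
// with a single batched query, turning N point lookups into one round trip.
type BatchFn<K, V> = (keys: K[]) => Promise<V[]>;

class TinyLoader<K, V> {
  private queue: { key: K; resolve: (v: V) => void }[] = [];
  constructor(private batchFn: BatchFn<K, V>) {}

  load(key: K): Promise<V> {
    return new Promise((resolve) => {
      this.queue.push({ key, resolve });
      // Flush once the current tick's load() calls have all been queued.
      if (this.queue.length === 1) process.nextTick(() => this.flush());
    });
  }

  private async flush() {
    const batch = this.queue;
    this.queue = [];
    const values = await this.batchFn(batch.map((item) => item.key));
    batch.forEach((item, i) => item.resolve(values[i]));
  }
}

// Stand-in for a batched query like "SELECT * FROM users WHERE id = ANY($1)".
const loader = new TinyLoader<number, string>(async (ids) => {
  console.log(`one batched query for ids: ${ids.join(", ")}`);
  return ids.map((id) => `user-${id}`);
});

Promise.all([loader.load(1), loader.load(2), loader.load(3)]).then(console.log);
// one batched query for ids: 1, 2, 3
// [ 'user-1', 'user-2', 'user-3' ]
```

The real package adds per-request caching and error handling on top, but the core win is the same: fewer, larger queries.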
Improving GraphQL APIs
While all of this profiling of internal services was ongoing, we also conducted some load tests for the APIs that serve our frontend applications. These APIs are built on top of a GraphQL stack that uses apollo-server at its core.
Once again, thanks to some profiling, we identified that apollo itself could be tuned via its configuration: for example, by disabling plugin support, which we didn't really need anyway.
We also noticed something about our GraphQL schema — it is quite deeply nested.
Due to the way apollo works, this means many processes are serial in nature — you first fetch the user, then you fetch the company, then you fetch the transport.
By analysing the abstract syntax tree of an incoming GraphQL query, we could predict the kinds of data we would need to fetch, and prime the underlying dataloaders by already initiating the asynchronous DB queries.
This resulted in quite a few of our serial DB queries getting turned into parallel queries — shaving off many valuable milliseconds from the API requests.
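A sketch of the priming idea, using a hand-built stand-in for the GraphQL AST (the real implementation walks the parsed query document, and the field names and fetchers below are hypothetical):

```typescript
// Instead of resolving user -> company -> transport serially as apollo walks
// the query, we predict from the query shape which entities will be needed
// and start all of the fetches up front, in parallel.
interface Selection { field: string; children?: Selection[] }

function collectFields(sel: Selection, out: string[] = []): string[] {
  out.push(sel.field);
  for (const child of sel.children ?? []) collectFields(child, out);
  return out;
}

// Hypothetical fetchers keyed by field name; in production these would be
// dataloader-backed DB queries.
const fetchers: Record<string, () => Promise<string>> = {
  user: async () => "user row",
  company: async () => "company row",
  transport: async () => "transport row",
};

async function primeAndResolve(query: Selection) {
  const fields = collectFields(query).filter((f) => f in fetchers);
  // Fire all queries in parallel instead of one after another.
  const rows = await Promise.all(fields.map((f) => fetchers[f]()));
  return Object.fromEntries(fields.map((f, i) => [f, rows[i]]));
}

primeAndResolve({
  field: "user",
  children: [{ field: "company", children: [{ field: "transport" }] }],
}).then(console.log);
```

With three levels of nesting, three sequential round trips collapse into one round-trip's worth of latency.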
Reducing the size of periodic jobs
As the month progressed, we also started noticing a common pattern in our service monitoring — we were seeing larger spikes of load around specific time intervals.
Historically, some processes in our stack have run periodically (think cronjob); these are mostly non-business-critical, or act as fallbacks in case the real-time data has turned stale. We noticed that these periodic jobs were now struggling: they had more and more things to work on, but the same limited amount of time to do so.
Our solution to this was to spread out the periodic nature of these jobs by introducing a different way of thinking about them.
In order to gain the improvement we needed, the scheduling was refactored from “this process needs to happen every X” to “this particular transport needs to have something happen to it at X”.
Each of these individual processing jobs takes much less time, and more importantly, can be scheduled independently of others. This allowed us to introduce jitter into the scheduling, spreading out the load from the spikes across a much longer period of time.
In essence we converted a constant number of periodic jobs into a dynamic one based on need, where even if the new total is higher, the individual jobs are faster — resulting in a globally more uniform spread of load across computing resources.
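The per-transport scheduling can be sketched roughly like this (function and parameter names are our own illustration, not the production code):

```typescript
// Instead of one cron firing for every transport at once, each transport gets
// its own next-run timestamp, offset by a random jitter so that the load
// spreads out over a window rather than spiking at a fixed moment.
function nextRunAt(
  baseIntervalMs: number,
  maxJitterMs: number,
  rng: () => number = Math.random, // injectable for testing
): number {
  // now + interval + uniform jitter in [0, maxJitterMs)
  return Date.now() + baseIntervalMs + Math.floor(rng() * maxJitterMs);
}

// Example: re-check each transport roughly every 10 minutes, with the runs
// spread across a 2-minute window.
const TEN_MIN = 10 * 60 * 1000;
const TWO_MIN = 2 * 60 * 1000;
console.log(new Date(nextRunAt(TEN_MIN, TWO_MIN)).toISOString());
```

Each job then reschedules itself on completion, so a slow transport cannot delay the others.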
In parallel, the non-technical side of the working group focused on the processes that needed to be improved.
As expected, with an increased number of transports being tracked, we would also have an increased number of users asking questions about them, being onboarded onto the product, and in general needing support.
This led us to improve some of the more critical parts of our product: streamlining onboarding and trying to make it as self-service as possible, as well as putting together a comprehensive set of help articles to guide both new and existing users in case they run into unknowns.
We also re-evaluated how we offer support, and how to avoid support requests being escalated to engineering. This meant building additional documentation and tooling, all of which we are continuing to expand to support our users in a more self-service manner.
Launch and learnings
When September 1st came around, we were eagerly waiting for the flood-gates to open. We had spent time identifying and fixing problems, improving various processes — we were ready.
For a large portion of the company, the day was in essence just a really nice launch party. The system worked as expected; we started getting more and more data, and everything just worked.
However, not everything was perfect. There is a third kind of data we work with: data about the companies that deal with all of these transports, including all of the relationships between those companies. Throughout the entire month of August, we had missed the fact that we would now also get a lot more of this company data from Transporeon.
This resulted in a small-scale incident — a period of time when the quality of service is degraded — as the service that was handling the relationships was not ready for the increased traffic.
For us it served as a nice reminder of why it was a good idea to test our ability to handle the load, as the incident looked very similar to what we had seen with other services. In fact, we applied essentially the same improvements as elsewhere: introducing batch processing, improving queries and indices. We had established a playbook of common issues and how to solve them.
So what did we learn from all of this? Attacking problems related to scaling is always best when you have concrete proof that something is not working, or could be improved.
Similarly, making sure that the team has the ability to address the entire scope of the project is key. However, likely the biggest thing we learned was that no matter how perfect a plan is, there will be surprises — you just have to keep calm and carry on.
What next? Well, we are already testing an increase in the second type of data ingress mentioned above: telemetry data. This time the scale we are aiming for is even bigger; there are more than 6 million trucks on the roads in the EU alone.