Why Serverless
First up, yes yes, “serverless” is a stupid name.
It’s also a very ambiguous name. One person’s “serverless” is another person’s “cloud” is another person’s “normal”.
But assuming we are talking here about the broad collection of resources more correctly referred to as “fully-managed, hyper-ephemeral compute”, it’s easy to see why a snappier name made it into the collective consciousness, accuracy be damned!
It’s also interesting to note how often the “hur hur ‘serverless’ is just somebody else’s server hur hur” quip comes up, delivered not only as clearly the most hilarious thing ever said, but also — in the mind of the keen-witted utterer of this rapier-sharp phrase — as a cast-iron reason why Serverless (with a big S) just isn’t worth the servers it’s sprinted on. It’s just a fad. A gimmick.
And here are seven reasons why those sadly uninitiated folks are just plain wrong.
Reason 1: It redefines expectations of delivery times
You may have already heard the quote by Yan Cui of Real World Serverless: “What used to take teams weeks or months to deliver can now be done within days or sometimes even on the same day”.
Quotes are nice but let me give you a real-life example from my own recent past which validates this bold claim.
In 2022 I was working with a client who operated in the FinTech space. For one reason and another the company had gone into an unexpected and very sudden wind-down. The COO came to me with one simple requirement: “Can you write something to perform a nightly job which extracts payment data from GoCardless and summarises it for our Accounts Team?”
There was an existing system in place but this was deemed both too expensive to keep alive and also (sadly) too error-prone to entrust to a maintenance team of zero people, following the dissolution of their core tech operations.
My replacement system was duly delivered (using AWS Step Functions, Lambda and EventBridge) and immediately accepted.
It was replacing an existing system (as mentioned), a Python Django behemoth, the corresponding component of which had taken two developers approx two months to develop.
My serverless standalone replacement? One developer (me), FOUR DAYS.
“Hur hur” indeed.
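For flavour, the core of a job like that can be sketched in a handful of lines. (This is an illustrative sketch, not the actual client code: `fetch_payments` stands in for the real GoCardless extract, and the function names and event shape are made up.)

```python
# Hypothetical sketch of a nightly payments-summary Lambda.
# EventBridge fires the handler on a schedule; Step Functions could
# orchestrate extract -> summarise -> deliver as separate states.
from collections import defaultdict
from datetime import date
from decimal import Decimal


def fetch_payments(day):
    """Placeholder for the GoCardless extract step (not implemented here)."""
    raise NotImplementedError


def summarise(payments):
    """Roll up raw payment records into per-status totals."""
    totals = defaultdict(Decimal)
    for p in payments:
        totals[p["status"]] += Decimal(p["amount"])
    return dict(totals)


def handler(event, context):
    # The incoming event shape is invented for illustration.
    day = event.get("date", date.today().isoformat())
    payments = fetch_payments(day)
    return {"date": day, "summary": summarise(payments)}
```

The point is less the code than the shape: one scheduled trigger, one small function, no servers to keep warm in between.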
Reason 2: It requires smaller, leaner squads
Heck, no, it demands smaller, leaner squads. Put an end to the world of lofty architects pontificating in insular whiteboard sessions and filling the world with entity diagrams and (often) unattainable ideals, leading developers into a granular and confusing mosaic of engineering tasks.
Many people going nowhere, slowly, and at great expense (to paraphrase the late, great Mike Peyton).
When you are implementing in Serverless you absolutely need to be deep in the detail as part of the larger solution design — big picture and small picture at the same time — which naturally means a smaller number of people working on each solution and (crucially) each of them owning much more of the whole solution from top to bottom.
I’d typically say that you want one, max two backend folks, one frontend or mobile developer and perhaps the part-time assistance of an expert practitioner to ensure the idea stays on the rails.
This squad would be highly performant, shipping significant feature increments every week.
Compare this with (no word of a lie) a team I was once in with two frontenders, two backend developers and one ops guy, who took three days to implement one Facebook “Like” button (in the days when we cared about Facebook “Like” buttons. Ask your parents.)
It follows that if you have smaller squads then you can have MORE squads than you currently do while keeping the same FTE headcount, and THIS means that you can accelerate both growth and innovation with very little cost change (save a spot of learning time).
Reason 3: Be Scalable from the Get-Go
In projects based on Traditional (always-on) Compute, you often start out by standing up a little virtual machine server for a few dollars a month. So far, so good.
Your little VM starts to sweat it a bit when you start to add in the usual extra baggage, such as:
- An enlarged development team
- Weighty regular background jobs (data pulls, API syncs, etc)
- An increased browser code footprint as your JavaScript front-end starts to run into the megabytes (the delivery of which slows down your single-node service every time a user refreshes the page)
“Easy,” says your dev team, “we will implement caching”.
But this only fends off the performance problems for so long, and so the decision is made to multiply the service into several instances (so-called “horizontal scaling”), and also to beef each of those up a tiny bit (vertical scaling).
So these multiple instances need something directing traffic to them in turn, so you have some form of load balancer.
Perhaps at this point you even decide to move the “megabytes of JavaScript” problem into its own “static” service. Now you have two load balancers.
This works really well — it was the basis of internet service delivery for DECADES — but it has two main problems:
Firstly, it is EXPENSIVE: each of these extra instances needs to run on something, each of the load balancers is also a “something” to keep running, and in doing all of this shoring up you’ve probably had to amp up your database instance size / power too. Those things are not cheap (especially not if you have multi-region redundancy, which you really should).
The second (less immediately clear) problem is that you can only scale in this manner for SO long before the structure itself starts to be unsuitable. So now, the tech team tells you, we need to do some hefty re-architecting. For a few months.
And if you hit THAT problem, then you’ve hit it because you have hundreds of thousands of users (well done!) but the LAST things you want when you have hundreds of thousands of users are: 1) an obviously flaky system, and 2) an extremely high-risk service restructure project.
So. How do you avoid this?
You guessed it. Serverless. When you make the decision to adopt Serverless early on you’re forced to break down your system’s structure into the separate purposes and lifecycles of each piece of data. From the start. The natural structural rules that developers place on code (grouping together by business function, solving General Cases when really a Specific Case would do just fine), these things don’t come naturally in a Serverless system (assuming you’re doing it right).
Responsibilities are naturally separated. Components will have only one singular purpose each. The mesh of behaviours in your system will be both resilient and elastic from the outset. Sure, it’s a different mindset that’s required to achieve this technically but the mental shift is not hard.
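To make “one singular purpose each” concrete, here’s a hedged sketch of two such components (the handler names and event payloads are invented for illustration, not taken from any real system):

```python
# Each function owns exactly one job; they communicate via events,
# not via shared code paths or a common framework.

def validate_order(event, context):
    """Single purpose: reject malformed orders, pass on the rest."""
    order = event["detail"]
    if not order.get("items"):
        raise ValueError("order has no items")
    return order


def price_order(event, context):
    """Single purpose: compute a total. No validation, no persistence."""
    order = event["detail"]
    total = sum(i["unit_price"] * i["qty"] for i in order["items"])
    return {**order, "total": total}
```

Neither function knows the other exists; the event bus between them is where the architecture lives, and each one can be replaced or scaled independently.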
Reason 4: Goodbye, security patches!
When your “classic compute” system starts out, it’s likely just one repo, one language, one framework, one package manager, and this seems quite manageable.
Then you start to think in system architecture and realise maybe it should be two or three services.
Then you decide you need Staging, Development, UAT and Production environments.
Suddenly you have 12 different resources all susceptible to emergency patching, regular dependency updates (oh HI Node projects with your 4,396 dependencies for a single simple service) and the DREADED horror of a language major version upgrade with the typical unavoidable breaking changes. Well that’s the next four sprints spoken for already. At best after those eight weeks you will have exactly the same feature set that you have right now.
Let’s compare that with the amount of patching required in fully managed environments: Zero.
OK, computer.
Reason 5: Serverless forces you to IAC
If you value your sanity, you will want the declarative certainty of Terraform on your side while you are spinning up your Serverless resources. Infrastructure as Code is one of the greatest developments (another non-pun) of the modern tech world and in Serverless you should never ever set out without it. Put down that mouse. Step away from the console.
I say this again and again to teams: ALWAYS BE TERRAFORMING. Even for Alphas / experiments. It’s just such a productivity multiplier and makes it WAY easier to reason about your infrastructure as a whole.
Basically if an approach allows me to answer all my unknowns using egrep then it’s a winner in my book.
An immediate positive side-effect of this is that your solution is portable / repeatable / testable.
Reason 6: The end of Dev vs Ops Silos
For a long time, development culture has catered to comfortable habits. These commonly include:
- Engineers (over) engineering in an unrealistic laboratory environment
- Desktop dev environments being massively customised
- Test databases living for a long time, because they’re hard to reproduce
- Hack-arounds and cheats (dotenv, anyone?) holding the configuration and interoperation of the various service components together on the engineer’s laptop
- Arcane and inscrutable (often unnoticed) translation and transpilation processes producing the actual executed code
- Hardcoded service URLs
- Different code / config paths for “development” mode during bootstrapping of the app
The list goes on, and on.
NATURALLY, when the time came to get this beautiful system off the engineer’s laptop and onto some form of internet-based service, many great and unexpected adventures would often ensue.
Various approaches and methodologies have been devised over the years to address this but the point remains that Devs Will Be Devs and this is just the way many of them work, especially in larger teams and particularly especially during long-established projects.
Needless to say this culture is not helpful when it comes to smooth continuous delivery and — crucially — predictability of behaviour for the released system. For example, if one of your hardcoded configs is only used to access an API very infrequently — say at 2am on the 1st of every month — guess when you’re going to find out about that misconfiguration?
Uh huh.
But as stated above: because the team for Serverless is necessarily smaller, because the delivery of Serverless is itself as much about configuration as it is about coding, and — in particular — because the solution should be expressed in Infrastructure as Code anyway (Terraform), the elevation of the solution onto the internet is already part of the development process.
Reason 7: Less code, like WAY less code
Arguably, in a clean Serverless implementation the ONLY actual code you ought to be writing is Business Logic which is specific to your product. Everything else should be abstracted away into two types of element:
- Interconnection of the various resources you are employing in your solution (the architecture)
- The configuration you apply to those interconnecting resources in order to get them working in harmony
So now, it’s all about configuration, not code.
This has some excellent side-effects in terms of system reliability and solution certainty:
- Less code means fewer logical assumptions in code
- Fewer logical assumptions in code means less scope for unforeseen logical loopholes (or bugs) to creep in
- The configuration can be sense-checked against the official documentation for the resources — quite probably (one day) opening the door to much more rigorous automated solution checking by built-in cloud provider processes / AI. The nuances buried in bespoke code are often hard for even other humans to understand, whereas configuration is a much more mechanical idea
- Much lower levels of test coverage are required, and particularly way fewer unit tests (thank goodness) because you don’t need to test the components (they’re fully managed); you just need to test the overall behaviour of the system and that’s much more like end-to-end or full-stack testing. Doing away with the meaningless reams of unit test code will be a very welcome change in the world of software.
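To illustrate that last point: when the only hand-written code is business logic, the tests that remain are plain input/output checks, with no mocks of queues, tables or HTTP anywhere in sight. (The VAT rule below is a made-up example, not from any system described above.)

```python
# If the only bespoke code is a pure business rule, testing it needs
# nothing beyond calling it. Everything else is managed configuration.
from decimal import Decimal, ROUND_HALF_UP


def add_vat(net: Decimal, rate: Decimal = Decimal("0.20")) -> Decimal:
    """Pure business rule: gross price from net, rounded to pence."""
    gross = net * (1 + rate)
    return gross.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
```

No fixtures, no patching, no test doubles: just values in, values out.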
Caveat: Bad Things about Serverless
Yep, ok, if you’ve made it this far you’re excused for thinking this is just a shameless ad for Serverless. I mean, it’s not (it’s a shameless ad for my Serverless consulting services — https://www.mcbh.co.uk ) but whatever it is, it is admittedly one-sided and positive so far.
So let’s just take a look at a few areas in which Serverless isn’t your friend.
Long-running processes
Sigh. If only “hyper-ephemeral” had stuck, this wouldn’t even need saying. But we are where we are, technically and linguistically, and so it does need saying: Long-running processes should not be done in Serverless. Whether that’s a single process that repeatedly calls heavy-latency downstream APIs in a single run, or whether it’s a process that needs to stay alive all the time in order to be hyper-responsive and highly available.
The current collection of Serverless tools do not perform well in either of these scenarios, and neither ought they. There’s still room in this world for (say) Docker tasks staying up and awaiting requests, or even full-fat Compute Instances. For some component of your architecture, either of these may be the most suitable.
But really these should be the exceptions, the really special cases. Overall you want to try to start with Serverless and diversify only when needed.
Challenges of Testing
If you’re used to the world of code-based testing (Jest, Pytest, etc) then the leap to testing Serverless will be… hard.
Firstly, what do you test? You can’t just import the system and do all dependency injection / mocking / patching on it (and rightly so — some awful testing antipatterns have become established practice in this space).
So what you’re going to need to do is actually spin up a version (or part of a version) of the System Under Test. (YET another reason why “ALWAYS BE TERRAFORMING!” is something I say a lot.) Your whole approach to fixtures management / reference data management is going to have to shift, and you will likely end up writing quite a lot of supporting test bootstrapping / setup / teardown code yourself.
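Much of that supporting code ends up being the same shape: deploy, trigger the thing, then poll the real deployed resource for the expected side-effect. A generic polling helper might look like this (an illustrative sketch, not tied to any particular provider SDK):

```python
import time


def wait_for(check, timeout=30.0, interval=1.0,
             clock=time.monotonic, sleep=time.sleep):
    """Poll `check` until it returns a truthy value or `timeout` elapses.

    In a Serverless end-to-end test, `check` would query the real
    deployed resource (a table, a queue, a log group) for the expected
    side-effect. `clock` and `sleep` are injectable so the helper
    itself is trivially testable.
    """
    deadline = clock() + timeout
    while True:
        result = check()
        if result:
            return result
        if clock() >= deadline:
            raise TimeoutError("condition not met within %.1fs" % timeout)
        sleep(interval)
```

One helper like this tends to get reused across many different end-to-end tests, which softens the setup cost considerably.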
This is one area in which the relative recency of Serverless is really felt. Because so much of Serverless is proprietary (it’s “someone else’s server, hur hur”, remember?) then you’re quite limited in how you can manipulate the resources to behave unnaturally.
I don’t know what the future of Serverless Testing looks like but I am confident it will get easier.
Plus — and it’s a big plus — don’t lose sight of the fact that the move to Serverless does away with so many types of testing anyway and so the testing you will care about will be more along the “does the whole thing do what users expect?” testing. So you’ll probably find that one chunk of setup tooling will service many many different kinds of tests.
Resolution Challenges
This is a toughie. “It’s not working!” Um… OK. Somewhere within “the GraphQL API Mutation call puts a new record into the Dynamo table which in turn creates a Dynamo Stream event that’s picked up by a Lambda function which publishes a message to an SNS Topic which is picked up by three SQS queues, one of which feeds into a Lambda which starts a Step Function which orders me a pizza”, something failed because I HAVE NO PIZZA.
Debugging and investigation is hard. You’re relying on logging, correlation IDs, and of course REALLY knowing how the system works.
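As a sketch of what the correlation-ID part might look like (the message shape and function names here are invented for illustration): every handler copies the ID from the incoming message into whatever it publishes and into its log lines, so one egrep across your logs ties the whole chain together.

```python
import uuid


def with_correlation_id(message):
    """Attach a correlation ID if the message doesn't carry one yet."""
    message.setdefault("correlation_id", str(uuid.uuid4()))
    return message


def log_line(message, text):
    """One consistent log shape makes the whole chain searchable."""
    return "[cid=%s] %s" % (message["correlation_id"], text)
```

The first component in the chain mints the ID; everything downstream just propagates it.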
The good news is that with an appropriately hard separation of concerns between your components, it is a lot easier to reason about what might have gone wrong than it is with, say, a five-level deep, multiple inheritance object model which calls default methods on its third superclass using computed properties on its final class which are chosen by a badly-named function in the base class which is basically just an extension of the language’s own Array (or List) class.
That kind of stuff will drive you crazy and the mental models you have to construct to understand it are weighty and tiring if you’re not the person who wrote it.
The separation of responsibility you get in Serverless means that the focus is very much more on the messaging between components and really that’s a lot more simple to unpick than the complete psyche of Vlad who left the project three years ago.
Conclusion
So no, it’s not all perfect. There will be tears. There will be howls of anguish. But the up-sides are incredible. This is really Why Serverless.
Author: Mark Henwood | Mark’s LinkedIn