A Cloud Platform can deliver cost savings, efficiency improvements and higher-quality service by consolidating an organisation’s front-line technology services. The result of this consolidation is what we call a ‘Cloud Platform’. The process of getting there is called ‘Platform Engineering’.
Building platforms on cloud services can be challenging because there are so many ways your organisation can perform the migration. However, a deep understanding of the cloud vendor and practical experience with its various services both help.
Avoid decision fatigue with the following recommendations.
- Use managed services as much as possible; for example, select Managed Database services instead of managing your own databases. With a managed service, your technology teams can focus on more strategic aspects of the Platform, like software development, monitoring, the underlying front-line infrastructure and customer experience.
- Avoid duplicating tooling; we often see teams that run tools like Grafana and Prometheus but still rely on CloudWatch, which makes operations more complex. With multiple tools that perform the same task, no one knows where to look, and when they do, it takes longer to derive any valuable conclusion.
- Integrate testing into the Platform; if a service doesn’t perform as expected, be prepared to find alternatives. As an organisation, you take responsibility for how you use cloud services: the terms of service for all public cloud vendors put the burden of availability and reliability on the customer.
Cloud-based or DIY
Cloud Services like those offered by Amazon Web Services, Google Cloud Platform and Microsoft Azure are the best foundations for Platforms, but it is still possible to build bespoke platforms in your own data centres. If you go that route, we recommend looking at solutions offered by VMware and the various bare-metal virtualisation and Kubernetes distributions. At Servana, we have used Rancher extensively on bare-metal platforms, and the experience was good. Kubernetes does a great job of normalising the Runtime and enabling heterogeneous workloads to run simultaneously on a single platform.
Runtimes are the unit of computing on which you build your Platform. The Runtime can be a feature of a Platform, or it could be the whole purpose of the Platform. Today, there are various runtimes, ranging from virtual servers and container orchestration to the runtimes included in serverless and Platform-as-a-Service solutions (e.g. AWS Lambda, Google Cloud Run, AWS Elastic Beanstalk and Google App Engine).
The Runtime is essential because it determines how portable your software is. Building an application for a specific runtime limits your ability to move it to a different runtime without making changes first. While these changes can be small, they are rarely small enough to ignore.
Container Orchestration creates extensible Platforms because it normalises the Runtime to the degree that the underlying runtime becomes irrelevant. The best example of this is Kubernetes, which is probably the best software on which to base a Platform because of its rich ecosystem of tooling and its ability to classify, partition and allocate heterogeneous workloads across a homogeneous platform.
What should a Cloud Platform do?
Platforms support the whole information technology organisation and should be able to provide all stakeholders with integration points to support all strategic activities. A few key areas are software development, testing and automation, security, compliance and operations.
Platforms should enable your technology teams to onboard quickly and be productive. A critical component in making this happen is an optimised onboarding process that lets software developers become productive fast. Likewise, when a software developer leaves the organisation, there should be enough housekeeping in place to tidy up any resources that are no longer needed. Build bespoke Pipelines for the Platform to support the various stages in the software development lifecycle. Integrate testing, deployment, release management and patching into the pipelines to ensure that there are enough capabilities to manage the application in production.
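As a rough illustration of a pipeline that integrates testing, deployment, release and patching, here is a minimal sketch in Python. The `Stage` and `run_pipeline` names and the lambda actions are placeholders we made up for this example; a real pipeline would call your own test runner and deployment tooling.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Stage:
    """One step in the pipeline; `action` returns True when the step succeeds."""
    name: str
    action: Callable[[], bool]


@dataclass
class PipelineResult:
    completed: List[str] = field(default_factory=list)
    failed_at: Optional[str] = None


def run_pipeline(stages: List[Stage]) -> PipelineResult:
    """Run the stages in order and stop at the first failure."""
    result = PipelineResult()
    for stage in stages:
        if not stage.action():
            result.failed_at = stage.name
            break
        result.completed.append(stage.name)
    return result


# Hypothetical stages; the lambdas stand in for real tooling.
stages = [
    Stage("test", lambda: True),
    Stage("deploy", lambda: True),
    Stage("release", lambda: False),  # simulate a failed release
    Stage("patch", lambda: True),
]
result = run_pipeline(stages)
print(result.completed, result.failed_at)
```

Stopping at the first failed stage is the design choice that matters here: nothing reaches release or patching unless the earlier stages have passed.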
Testing all components is critical to the reliability of automation and the stability of the overall Platform. Testing integrated into the pipelines is essential for the reliability of the applications your customers use, but testing the various components that make up the Platform is just as essential. For example, infrastructure as code requires testing and staging before use in production. Testing in a non-production version of the Platform allows you to test the actual changes while also testing the change management process.
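One way to picture testing infrastructure as code before it reaches production is a set of checks that run against the resource definitions in a non-production environment. The sketch below is only an assumption of what such checks might look like: the resource dictionary format and both rules (an `owner` tag, no public buckets) are hypothetical, not taken from any particular tool.

```python
from typing import Dict, List


def validate_resources(resources: List[Dict]) -> List[str]:
    """Return a list of rule violations for the given resource definitions."""
    violations = []
    for res in resources:
        name = res.get("name", "<unnamed>")
        # Hypothetical rule 1: every resource must declare an owner.
        if not res.get("tags", {}).get("owner"):
            violations.append(f"{name}: missing 'owner' tag")
        # Hypothetical rule 2: storage buckets must not be public.
        if res.get("type") == "bucket" and res.get("public", False):
            violations.append(f"{name}: bucket must not be public")
    return violations


def ready_for_production(resources: List[Dict]) -> bool:
    """Allow promotion only when validation finds no violations."""
    return not validate_resources(resources)


staging = [
    {"name": "assets", "type": "bucket", "public": True, "tags": {"owner": "web"}},
    {"name": "api-db", "type": "database", "tags": {}},
]
for problem in validate_resources(staging):
    print(problem)
print("promote:", ready_for_production(staging))
```

Running the same checks as a gate in the change management process means the process itself gets exercised every time a change is staged.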
Automation is one feature that provides the scalability we need in platforms today. Automation is a broad classification that collectively optimises the different kinds of work on the Platform. Automation helps scale the Platform Engineering capability across an organisation while also enabling a bespoke experience for each team. No turnkey or cookie-cutter solutions are necessary with the right automation.
Types of Automation
Automation can be event-driven, meta-data driven, rules-based or procedural. For example, software development pipelines are a form of procedural automation, and they usually form the backbone for Platform automation. Some infrastructure automation is also procedural, although when the automation describes a desired end state rather than a sequence of steps, the practice is usually called declarative. Kubernetes provides a rich capability to support meta-data driven automation through its labelling system. For example, meta-data driven automation enables us to automate the creation of monitoring and alerting capabilities simply by pushing new services into production or non-production environments with the correct labels to signal to the Kubernetes Operators in charge of monitoring. Event-driven automation provides an even better integration point for platforms because it is more aware of the context of the Platform. For example, different events could signal different kinds of automation. Where meta-data driven automation usually happens only on create events, event-driven automation happens more frequently, reflecting the various lifecycle stages of resources in Platforms.
While we could probably point to the advantages and disadvantages of each approach, combining them is the best way to build scalable platforms today.
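To make the event-driven idea more concrete, here is a minimal sketch of a dispatcher that triggers different automation for different lifecycle events. The `EventBus` class, the event names and the handler actions are all illustrative assumptions; in practice the events would come from your platform, for example from a Kubernetes Operator watching resources.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Handler = Callable[[Dict], str]


class EventBus:
    """Maps lifecycle event types to the automation that should run for them."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Handler]] = defaultdict(list)

    def on(self, event_type: str, handler: Handler) -> None:
        self._handlers[event_type].append(handler)

    def emit(self, event_type: str, resource: Dict) -> List[str]:
        # Unknown event types simply trigger no automation.
        return [handler(resource) for handler in self._handlers.get(event_type, [])]


bus = EventBus()
# Hypothetical automation for each lifecycle stage of a service.
bus.on("created", lambda r: f"provision dashboard for {r['name']}")
bus.on("updated", lambda r: f"refresh alert thresholds for {r['name']}")
bus.on("deleted", lambda r: f"remove dashboard for {r['name']}")

service = {"name": "checkout"}
actions = []
for event in ("created", "updated", "deleted"):
    actions.extend(bus.emit(event, service))
print(actions)
```

Unlike automation that fires only when a resource is created, the same service here triggers work at every stage of its lifecycle.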
Integrating security into the Platform is the only way to ensure that security is enforceable. Security covers a large area, so we treat it as something to layer into the Platform, covering web application access, user identity, user access control, secrets encryption, runtime hardening, network and software development.
Implementing these categories of security improves the overall compliance of the whole Platform. Each one is a standalone piece of work, but together they improve the Platform’s safety.
Change management, releases and triage are all essential activities of the development and operations teams. The Platform requires a considerable amount of operations planning, with the most crucial aspect being change management. How do you go about making significant platform changes once in production? We have witnessed how taxing change management can be if not approached correctly. At Servana, we believe a combination of semantic versioning and GitOps provides enough capability to make change management less of an effort. We have examples of GitOps going back to 2016.
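As a small illustration of how semantic versioning can inform change management, the sketch below classifies a proposed platform version against the current one. It assumes plain `MAJOR.MINOR.PATCH` versions, such as Git tags on the repository that a GitOps workflow deploys from, and it ignores pre-release and build suffixes. The function names and the review policy are our own assumptions, not part of any standard tool.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Parse 'MAJOR.MINOR.PATCH', with an optional leading 'v'."""
    major, minor, patch = version.lstrip("v").split(".")
    return int(major), int(minor), int(patch)


def classify_change(current: str, proposed: str) -> str:
    """Return 'major', 'minor' or 'patch' for an upgrade from current to proposed."""
    cur, new = parse_version(current), parse_version(proposed)
    if new <= cur:
        raise ValueError(f"{proposed} is not newer than {current}")
    if new[0] > cur[0]:
        return "major"
    if new[1] > cur[1]:
        return "minor"
    return "patch"


# Hypothetical policy: how much review each kind of change needs.
REVIEW_POLICY = {
    "major": "manual approval and staged rollout",
    "minor": "peer review",
    "patch": "automatic rollout",
}

kind = classify_change("v1.4.2", "v2.0.0")
print(kind, "->", REVIEW_POLICY[kind])
```

The version number carries the risk of the change, so the pipeline can decide how much ceremony a release needs without anyone having to ask.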
To support triage, a Platform needs to centralise its logs. Log aggregation and log search are critical components of a good logging platform. We use whatever is available and makes sense: Elasticsearch, CloudWatch Logs and Google Stackdriver are all good at providing this capability.
Monitoring is another critical capability to provide on any platform. For many organisations, monitoring is a challenge and adds to operational complexity and costs, especially in an environment where an application can be online for only a few minutes or for much longer. Setting monitoring up by hand for every change is cumbersome and noisy; integrating monitoring into the Platform guarantees a repeatable process. When covering automation, we mentioned how well metadata-driven automation works with the Kubernetes platform. For example, with Kubernetes you can define a service with a label to completely automate the provisioning of the monitoring and alerting for that service.
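The label-driven approach can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the label key `platform.example/monitoring`, the service dictionary shape and the `/metrics` path are all hypothetical, and a real setup would read services from the Kubernetes API rather than from a list.

```python
from typing import Dict, List

# Hypothetical label key and value that opt a service in to monitoring.
MONITORING_LABEL = "platform.example/monitoring"


def scrape_targets(services: List[Dict]) -> List[Dict]:
    """Build monitoring targets for every service that carries the label."""
    targets = []
    for svc in services:
        if svc.get("labels", {}).get(MONITORING_LABEL) == "enabled":
            targets.append({
                "job": svc["name"],
                "address": f"{svc['name']}:{svc['port']}",
                "path": "/metrics",  # assumed metrics endpoint
            })
    return targets


services = [
    {"name": "checkout", "port": 8080, "labels": {MONITORING_LABEL: "enabled"}},
    {"name": "batch-job", "port": 9000, "labels": {}},
]
targets = scrape_targets(services)
print(targets)
```

Only the labelled service ends up with a monitoring target, so adding monitoring to a new service is a matter of setting one label when it is deployed.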
What is Platform Engineering?
Platform Engineering is what we would call the development of capabilities and tools to build Cloud Platforms. The Platform Engineering team could be virtual, consisting of members of your different technology functions (e.g. someone from Security, Network, DevOps, SRE etc.). The mission of the Platform Engineering team is to use automation to combine their capabilities to deliver a single product. Platform teams can have Delivery Managers and even Product Owners to ensure effective communication with the broader organisation around their Platform’s capabilities.
A Platform team has many customers, including the software development teams, operations teams and central information technology teams. Depending on how your organisation is structured, it might make sense to integrate Platform functions within the software development teams but consolidate the governance and compliance functions centrally.
For platforms to be scalable, certain practices and standards need to be in place that enable the Platform team to scale its capabilities; otherwise, the team can become a source of bottlenecks and frustration.