A private social-network platform was hampered by infrastructure performance, scalability, security, reliability and cost issues that threatened to prevent the business from realising its substantial potential.
Called in to help, our team quickly stabilised the situation, optimised the infrastructure and reconfigured engineering operations to put the business back on a path to growth.
Making a social-network platform fit for growth
A provider of private social-networks had developed a platform to allow organisations to create customised digital communities.
When a new CTO joined a private-social-network company, it quickly became clear to him that they would be unable to execute their strategy unless action was taken to address their AWS infrastructure. The platform, which had been developed to allow organisations to create customised digital communities, was struggling on multiple counts:
- Reliability — the platform was prone to frequent service outages;
- Scalability — the platform struggled to cope with peak workloads, to the point where it was becoming unusable even during planned events;
- Cost — platform costs were exceptionally high, and unsustainable if they were to make a success of the planned mass-market business model;
- Analytics — only limited monitoring and business intelligence existed. This not only affected platform management, but, at an application level, there was no visibility of metrics that would enable it to deliver product-led growth;
- Security — the measures in place to protect user data fell significantly short of where they should be.
Recognising that specialist expertise was needed, the CTO called on our team to help.
Working with the CTO and the CEO, we agreed a three-phase plan of action to optimise the infrastructure and operations:
- Identify quick wins — i.e. immediate opportunities to stabilise platform performance to protect user experience and, secondarily, to reduce platform costs;
- Deeper-dive analysis and recommendations — a comprehensive assessment of infrastructure fit against business need and priorities, root-cause analysis of performance deficiencies and design-optimisation recommendations;
- Implementation acceleration — work with the client’s development team to accelerate implementation of the planned infrastructure changes and establish operational best practices, including deployment of platform monitoring and customer-behaviour analytics.
Phase 1: Identify quick wins
The team quickly ascertained that a significant cause of poor service reliability was operations-related. Creating greater separation of duties across the technology team and introducing more effective change-control processes made many of the problems disappear. In addition, event logging and monitoring using Splunk was set up, so that when issues did arise they could be diagnosed faster.
In parallel, the team analysed infrastructure cost effectiveness. Workloads, platform-service usage and billing were analysed. As a result, services were rationalised and ‘right-sized’, leading to a 75% cost saving within 12 weeks of project commencement — worth $400k per year.
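As a rough illustration of what these figures imply, the arithmetic can be sketched as below. It assumes that the $400k is the annual saving and that the 75% is measured against the pre-project run rate. The function name is hypothetical, and the derived baseline and remaining cost are back-of-envelope values, not figures reported by the client.

```python
def implied_costs(annual_saving: float, saving_fraction: float) -> tuple[float, float]:
    """Return (baseline annual cost, remaining annual cost) implied by a saving."""
    if not 0 < saving_fraction <= 1:
        raise ValueError("saving_fraction must be in (0, 1]")
    # Assumption: saving = baseline * fraction, so baseline = saving / fraction.
    baseline = annual_saving / saving_fraction
    return baseline, baseline - annual_saving


baseline, remaining = implied_costs(400_000, 0.75)
print(f"Implied baseline cost:  ${baseline:,.0f} per year")
print(f"Implied remaining cost: ${remaining:,.0f} per year")
```

Under those assumptions, the platform would have been costing roughly $533k per year before the work and about $133k per year afterwards.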
Security was also strengthened through the implementation of a range of improved identity and access-management measures.
Phase 2: Deeper-dive analysis and recommendations
The company’s product line had evolved over time into three separate application stacks, which created significant inefficiencies and inflexibilities across engineering and commercial operations. A plan was therefore agreed to consolidate them into one multi-tenant platform based on a microservices architecture that could scale with the business’s planned growth.
An important part of the architectural review was selecting appropriate cloud applications and services to use. In such a rapidly evolving technology landscape, it was important to choose wisely, assessing not just the immediate capability of a given product but also its longevity from a support and skill-pool perspective.
Areas the team considered in depth included container orchestration, IaC (infrastructure-as-code), platform monitoring and application intelligence. After evaluating the options, it was decided to run microservices on AWS’s Kubernetes implementation (EKS), to use Terraform for IaC, and to use a combination of Splunk and Signal FX for infrastructure and security event management and for application observability and tracing.
Phase 3: Implementation acceleration
Work then moved on to rebuilding the infrastructure and setting up new devsecops practices.
Pent-up demand for the service meant that speed of execution was vital. Infrastructure refactoring proceeded hand in hand with the software-engineering team, who were tasked with transforming their code into a more robust microservices-based stack.
Although the software team was relatively large and experienced, it was spread across one in-house and two outsourced teams based in the UK, India, Belarus, Switzerland and the USA. To maximise momentum across such a dispersed team, a cross-functional leadership ‘pod’ was established. This ensured that the entire team had a clear understanding of the vision and that software and infrastructure development could be tightly coordinated, underpinned by standardised working processes.
With so many moving parts (upwards of a thousand elements across microservices, security groups, IAM roles and AWS services), enforcing common terminology was a small but critical task: it avoided confusion and made for unambiguous documentation and auditable processes.
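One way to keep terminology consistent at this scale is to check resource names automatically. The sketch below is illustrative only: the `<environment>-<service>-<resource type>` convention, the allowed values and the example names are all assumptions made for the example, not the client's actual scheme.

```python
import re

# Hypothetical convention: <environment>-<service>-<resource type>
ENVIRONMENTS = ("dev", "staging", "prod")
RESOURCE_TYPES = ("svc", "sg", "role", "queue", "bucket")

_env = "|".join(ENVIRONMENTS)
_types = "|".join(RESOURCE_TYPES)
NAME_PATTERN = re.compile(rf"(?:{_env})-[a-z][a-z0-9]*-(?:{_types})")


def nonconforming(names):
    """Return the names that do not follow the assumed convention."""
    return [name for name in names if not NAME_PATTERN.fullmatch(name)]


# Made-up resource names, for illustration only.
names = [
    "prod-feed-svc",
    "prod-feed-sg",
    "Prod_Feed_Role",      # wrong case and separators
    "staging-chat-queue",
    "chat-lambda",         # missing environment, unknown resource type
]
print(nonconforming(names))
```

A check of this kind could run in a CI pipeline or against an infrastructure inventory, so that naming drift is caught before it reaches documentation or audit reports.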
Security played a key part in the team’s methodology: in cloud-component choices, in platform design and in operations. For example, a devsecops approach ensured that CI/CD (continuous integration, continuous delivery) pipelines had appropriate security measures and auditability built in. Full rollout of Splunk, combined with Signal FX, supported not only infrastructure monitoring but also threat detection, prevention and incident response, with Signal FX providing the traceability and observability that is essential in a complex microservices environment.
Leveraging Splunk to enable product-led growth
The company’s business model put its product, private social networks, at the heart of its go-to-market strategy: product experience drives user acquisition and usage, which in turn drive monetisation through premium-feature adoption and ad revenues. A cornerstone of this “product-led-growth” strategy was effective customer-behaviour analytics that could provide decision support to prioritise feature development and optimise the customer journey and advertising.
As Splunk was rolled out to support platform management, it quickly became clear to the commercial team how it could support their goals. Dashboards and reports were set up to show which user types were using a given network, and how and when they were using it.
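The kind of aggregation that sits behind such a dashboard can be sketched in a few lines. This is an illustration in plain Python rather than Splunk's own search language, and the event fields (`network`, `user_type`, `action`, `timestamp`) and the sample data are assumed for the example.

```python
from collections import Counter
from datetime import datetime

# Made-up events; a real deployment would read these from the logging platform.
events = [
    {"network": "alumni", "user_type": "admin", "action": "post", "timestamp": "2021-03-01T09:15:00"},
    {"network": "alumni", "user_type": "member", "action": "view", "timestamp": "2021-03-01T09:40:00"},
    {"network": "alumni", "user_type": "member", "action": "view", "timestamp": "2021-03-01T20:05:00"},
    {"network": "clubs", "user_type": "member", "action": "post", "timestamp": "2021-03-01T20:30:00"},
]


def usage_by_type_and_hour(events, network):
    """Count events for one network by (user type, action, hour of day)."""
    counts = Counter()
    for event in events:
        if event["network"] != network:
            continue
        hour = datetime.fromisoformat(event["timestamp"]).hour
        counts[(event["user_type"], event["action"], hour)] += 1
    return counts


usage = usage_by_type_and_hour(events, "alumni")
for (user_type, action, hour), count in sorted(usage.items()):
    print(f"{user_type:<8}{action:<6}{hour:02d}:00  {count}")
```

Grouping by user type, action and hour answers the "who, how and when" questions in their simplest form; a production dashboard would add time ranges, cohorts and per-feature breakdowns.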
Technology security in line with ISO27001 and Cyber Essentials Plus
AWS EKS, EC2, RDS, S3, SQS, Cognito, Lambda
With platform reliability and performance stabilised, the infrastructure properly autoscaling, security-hardened and costing 75% less to run — and a clear line of sight into user behaviour — the business was primed for growth and fit for a bright future.