APIGEE X SITE RELIABILITY ENGINEER - VOIS
Who we are
As the largest shared services organisation in the global telco industry, with 30,000 FTE, we design our portfolio of next-generation solutions and services in partnership with customers across Vodafone Group, local markets, and partner markets to simplify and drive growth. With our strategic partner Accenture, we work alongside our Vodafone customers and other telco and tech companies to drive transformation, meet the challenges of our industry, and ensure we stay relevant and resilient. This partnership is a unique, industry-first model that brings together the best of in-house and third-party capability.
We work with customers across 28 countries from 10 VOIS locations: Albania, Egypt, Hungary, India, Romania, Spain, Turkey, UK, Germany, Ireland, and with a network of teams in Czech Republic, Italy, Greece, and Portugal.
#VOIS #BeUnrivalled #CreateTheFuture
About this Role
What you will do
- Define, implement, and maintain Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for Apigee-backed services, including availability, latency, error rates, and throughput
- Establish SLO targets, manage error budgets, and own reliability reporting cadence
- Design, implement, and continuously tune alerting strategies across the API platform to reduce noise and improve actionability
- Classify and route alerts by severity (P1/P2/P3) based on customer impact and SLO burn rates
- Implement alert correlation patterns, including authentication failures, quota spikes, and backend target failures
- Own and enhance operational dashboards covering Golden Signals and dependency health using Datadog, with adaptability to future observability tools
- Build and maintain dashboards for traffic, latency, error rates, backend dependencies, DNS health, certificate expiry, and authentication providers
- Create SLO burn-rate views and identify top impacted API proxies
- Proactively identify anomalies and performance degradation trends such as p95 latency drift, rising 429 responses, backend timeouts, and token failures
- Analyse seasonality patterns and establish intelligent baseline thresholds
- Produce weekly and monthly reliability reports covering SLO performance, major incidents, recurring root causes, change failure rate, and MTTR
- Implement and maintain synthetic monitoring and user journey checks for critical API flows, including authentication, API invocation, and backend dependencies
- Participate in 24x7 on-call rotations and lead incident response and problem management activities
Who you are
- An experienced reliability or production support professional with strong hands-on expertise in the Apigee platform, particularly Apigee X
- Proficient in custom reporting and advanced debugging within Apigee environments
- Experienced with APM and observability tools, including creating dashboards, alerts, and monitors (Datadog preferred)
- Comfortable operating in production environments and responding to incidents with a structured, customer-impact-focused approach
- Knowledgeable in modern cloud technologies and distributed systems
- Familiar with Agile ways of working and collaborative, cross-functional delivery
- Educated to bachelor’s degree level in Computer Science or Computer Engineering, or with equivalent practical experience
What’s in it for you
- The opportunity to work on large-scale, business-critical API platforms supporting high-impact customer journeys
- Exposure to advanced reliability engineering practices within a global technology organisation
- Collaboration with diverse, cross-functional teams across markets and partners
- A role with clear ownership, influence, and measurable outcomes in platform reliability and resilience
What skills you will learn
- Advanced SRE practices including error budgets, burn-rate alerting, and reliability governance
- Deep operational insight into Apigee X runtime behaviour and API performance optimisation
- Enhanced observability and monitoring design skills across complex, distributed systems
- Incident leadership, problem management, and continuous improvement techniques at scale
VOIS Equal Opportunity Employer Commitment
Join Us
We challenge and innovate in order to connect people, businesses, and communities across the world. Delighting our customers and earning their loyalty drive us, and we experiment, learn fast and get it done, together.
With us, you can truly be yourself and belong, share inspiration, embrace new opportunities, thrive, and make a real difference.
Follow us on social media and #StayConnected
- LinkedIn: https://www.linkedin.com/company/vois/
- Facebook: https://www.facebook.com/voisglobal
- Instagram: https://www.instagram.com/voisglobal