Site Reliability Engineer – 12 Month FTC (we have office locations in Cambridge, Leeds and London)

Genomics England

London, United Kingdom

About

Company Description

  • Genomics England partners with the NHS to provide whole genome sequencing diagnostics. We also equip researchers to find the causes of disease and develop new treatments – with patients and participants at the heart of it all.
  • Our mission is to continue refining, scaling, and evolving our ability to enable others to deliver genomic healthcare and conduct genomic research.
  • We are accelerating our impact and working with patients, doctors, scientists, government and industry to improve genomic testing, and help researchers access the health data and technology they need to make new medical discoveries and create more effective, targeted medicines for everybody.

Job Description

  • Are you driven by a deep curiosity about how complex distributed systems work and, more importantly, how they fail? Do you believe reliability is the most critical feature of any service?
  • At Genomics England, we’re pushing the boundaries of science and technology to transform patient outcomes, and our platform underpins it all.
  • We're looking for a Site Reliability Engineer to ensure our platform is not just running, but is sustainably reliable, scalable, and resilient. As an SRE advocate, you will actively collaborate with engineering squads to cultivate a culture of reliability. You will play a pivotal role in driving our technical evolution, influencing and shaping platform practices across the organisation.
  • Your responsibilities will include automating and optimising infrastructure to improve workload throughput. You will focus on implementing proactive measures to anticipate and address potential issues before they impact our users. You can’t fix what you don’t measure, so there will be a focus on developing monitoring and metrics that teams will rely on day to day. Through this approach, you will help create a platform that is not only scalable and resilient but also ready to meet the demands of our mission.

What You'll Be Doing Day-to-Day

Your work will be a balance of proactive engineering and thoughtful operational practice. You'll move between different modes, from deep project work and strategic initiatives to collaboration and incident response. Your primary mission will be to:

  • Champion Reliability: Work with engineering teams to define and measure what matters to our users, establishing and monitoring SLIs, SLOs, and error budgets that drive data-informed decisions.
  • Learn from Failure: Be involved in blameless post-incident reviews that focus on identifying contributing factors, ensuring we turn every failure into a valuable opportunity for systemic improvement.
  • Eliminate Toil: Systematically identify and automate repetitive, manual, and tactical operational processes. You'll reduce operational load by building solutions with enduring value.
  • Build Resilient Systems: Design, build, and maintain robust infrastructure across AWS and on-prem environments using Infrastructure as Code and automation. You'll also drive performance tuning, capacity planning, and cost optimisation.
  • Enable Developer Velocity: Develop CI/CD pipelines, release automation, and platform tooling that help our engineering squads deploy changes safely and efficiently, without sacrificing reliability.
  • Share Your Knowledge: Create clear, usable documentation and act as a consultant and advocate for SRE and DevOps best practices, helping to improve resilience across the entire organisation.

What You’ll Bring

  • We're looking for someone who not only advocates for the SRE mindset but can also implement it with robust code, thoughtful automation, and scalable architecture.

Mindset & Approach

  • Deep-Seated Curiosity: You're driven to understand how systems truly behave in production, not just how they are supposed to work.
  • A Systems Thinker: You can zoom out to see the big picture and zoom in to troubleshoot the details, understanding that reliability is an emergent property of the entire system.
  • Relentlessly Collaborative: You see reliability as a shared responsibility, actively seeking out different perspectives and treating SRE as a dialogue. You're open to new ideas, welcome diverse viewpoints, and thrive on teaching, learning, and driving initiatives with colleagues across various teams.
  • Incident Responder: You remain calm under pressure, applying a structured approach to troubleshooting when the pager rings. You know how to take charge of an incident, coordinate a response, and mitigate issues efficiently.
  • Views Failure as an Opportunity: You champion blameless post-incident reviews as a core learning mechanism, focusing on process and technology, not people.
  • Customer-Focused: You understand that reliability must be measured from the customer's perspective to be meaningful.

Technical Experience

  • Experience applying Site Reliability Engineering principles in a production environment.
  • Strong hands-on experience with AWS services across compute, storage, networking, and security.
  • Deep understanding of distributed systems and their common failure modes, including issues related to latency, data consistency, and fault tolerance.
  • Experience with capacity planning, performance engineering, and designing systems that scale to meet traffic demands and remain fault-tolerant under pressure.
  • Excellent Infrastructure as Code skills (Terraform essential).
  • Solid scripting and software engineering fundamentals in languages like Python or Bash, with an ability to debug code, handle errors, and understand system architecture.
  • Experience with observability and alerting tools (e.g., Datadog, CloudWatch, Opsgenie) and a passion for turning data into actionable insights.
  • Knowledge of CI/CD tools (e.g., GitLab CI, Jenkins) and release engineering best practices.
  • Familiarity with container orchestration (ECS, Kubernetes) and running production-grade infrastructure at scale.
  • A good understanding of networking fundamentals (DNS, TCP/IP, HTTP) and their practical application, including load balancing and traffic management.
  • Familiarity with relational (e.g., PostgreSQL) and NoSQL databases.

Nice to Haves

  • Exposure to new tech evaluation, lean experimentation, or platform tooling decisions.
  • Experience mentoring or sharing knowledge across teams.
  • Understanding of genomics, HPC, data-heavy workloads, or regulated environments.

Qualifications

  • Formal qualifications are not mandatory. We value practical experience, a curious mind, and a passion for reliability. Relevant certifications in AWS, Terraform, or other technologies are welcome and highly beneficial.

Additional Information

  • Closing Date: Monday 15th October at 23:00 (UK time)
  • Salary From: £71,300

Being an integral part of such a meaningful mission is extremely rewarding in itself, but to support our people, we’re continually improving our benefits package. We pride ourselves on investing in our people and supporting them to achieve their career goals, as well as offering a benefits package including:

  • Generous Leave: 30 days’ holiday plus bank holidays, additional leave for long service, and the option to apply for up to 30 days of remote working abroad annually (approval required).
  • Family-Friendly: Blended working arrangements, flexible working, enhanced maternity, paternity and shared parental leave benefits.
  • Pension & Financial: Defined contribution pension (Genomics England double-matches up to 10%, and you can contribute more if you wish), Life Assurance (3x salary), and a Give As You Earn scheme.
  • Learning & Development: Individual learning budgets, support for training and certifications, and reimbursement for one annual professional subscription (approval required).
  • Recognition & Rewards: Employee recognition programme and referral scheme.
  • Health & Wellbeing: Subsidised gym membership, a free Headspace account, eye tests, flu jabs, and access to an Employee Assistance Programme.

Equal opportunities and our commitment to a diverse and inclusive workplace

  • Genomics England is actively committed to providing and supporting an inclusive environment that promotes equity, diversity and inclusion best practice both within our community and in any other area where we have influence. We are proud of our diverse community where everyone can come to work and feel welcomed and treated with respect regardless of any disability, ethnicity, gender, gender identity, religion, sexual orientation, or social background.
  • Genomics England’s policies of non-discrimination and equity will be applied fairly to all people, regardless of age, disability, gender identity or reassignment, marital or civil partnership status, being pregnant or recently becoming a parent, race, religion or beliefs, sex or sexual orientation, length of service, whether full or part-time or employed under a permanent or a fixed-term contract, or any other relevant factor.
  • Genomics England does not tolerate any form of discrimination, harassment, victimisation or bullying at work. Such behaviour is contrary to our virtues, undermines our mission and core values and diminishes the dignity, respect and integrity of all parties. Our People policies outline our commitment to inclusivity.
  • We aim to remove barriers in our recruitment processes and to be flexible with our interview processes. Should you require any adjustments that may help you to fully participate in the recruitment process, we encourage you to discuss this with us.

Blended working model

  • Genomics England operates a blended working model as we know our people appreciate the flexibility that hybrid working can bring. We expect most people to come into the office a minimum of 2 times each month. However, this will vary according to role and will be agreed with your team leader. There is no expectation that people will return to the office full time unless they want to; however, some of our roles require full-time on-site attendance (e.g., lab teams and the reception team).
  • Our teams and squads have reflected, and will continue to reflect, on what works best for them to work together successfully, and have the freedom to design working patterns to suit, beyond the minimum. Our office locations are Canary Wharf, Cambridge and Leeds.

Onboarding background checks

  • As part of our recruitment process, all successful candidates are subject to a Standard Disclosure and Barring Service (DBS) check. We therefore require applicants to disclose any previous offences at point of application, as some unspent convictions may mean we are unable to proceed with your application due to the nature of our work in healthcare.