Site Reliability Engineer

Remotive

Remote

8 hours ago

About

This description is a summary of our understanding of the job description. Click the 'Apply' button to find out more.

Role Description

Be part of a global team that ensures the performance, scalability, and reliability of critical cloud-based applications. As part of the Global Investor and Distribution Solutions (GIDS) Platform Services team, you’ll play a key role in keeping our systems running smoothly and efficiently—while helping shape the future of our platform.

  • Collaborate with global teams as part of a follow-the-sun support model.
  • Respond to, troubleshoot, and resolve Level 2 application incidents.
  • Ensure critical applications are effectively monitored using tools like Prometheus and Grafana.
  • Create and maintain dashboards and alerts to enhance visibility into application health.
  • Define, implement, and track key SRE metrics (SLOs, SLIs, error budgets).
  • Partner with development teams to improve application reliability and resilience.
  • Analyze incident trends and recommend improvements to reduce recurrence.
  • Automate repetitive support tasks to improve efficiency.
  • Participate in post-incident reviews and drive reliability initiatives.
  • Perform infrastructure and application patching as part of regular maintenance cycles.
  • Support security vulnerability remediation efforts across both infrastructure and application layers.

Qualifications

  • Bachelor’s degree in Computer Science, Computer Engineering, IT, or related field.
  • 5+ years of experience for senior roles; fresh graduates welcome for junior roles.
  • Proficiency in one or more programming languages, preferably Java, JavaScript or Python.
  • Proven ability to troubleshoot complex systems.
  • Skilled in debugging, code optimization, and automation.
  • Experience with relational databases and data analysis.
  • Experience working in Site Reliability Engineer (SRE) roles or incident response environments.
  • Hands-on experience with cloud infrastructure, preferably AWS.
  • Familiarity with observability tools such as Grafana, ELK Stack, or similar.
  • Experience deploying and managing applications on Kubernetes platforms.
  • Strong skills in analyzing and troubleshooting issues in large-scale, distributed systems.
  • Familiarity with PostgreSQL and its performance tuning, monitoring, and troubleshooting.

Benefits

  • Flexibility: Hybrid Work Model & a Business Casual Dress Code, including jeans.
  • Your Future: RRSP Matching Program, Professional Development Reimbursement.
  • Work/Life Balance: Flexible Personal/Vacation Time Off, Sick Leave, Paid Holidays.
  • Your Wellbeing: Medical, Dental, Vision, Employee Assistance Program, Parental Leave.
  • Diversity & Inclusion: Committed to Welcoming, Celebrating and Thriving on Diversity.
  • Training: Hands-On, Team-Customized, including SS&C Learning Institute.
  • Extra Perks: Discounts on fitness clubs, travel and more!
  • Wide-Ranging Perspectives: Committed to Celebrating the Variety of Backgrounds, Talents and Experiences of Our Employees.