The company is a global leader in partnering with businesses to transform and manage their operations by leveraging the power of technology. The group is driven daily by its purpose of unleashing human energy through technology for an inclusive and sustainable future. It is a responsible and diverse organization with 350,000 members in over 50 countries.
The role:
The role involves enhancing and sustaining the reliability, scalability, and performance of our cloud infrastructure and services. Collaborating with our cloud and engineering teams, you will be instrumental in architecting, building, and maintaining systems on the AWS platform, ensuring that our applications achieve high availability and resilience.
Essential Job Functions:
● Oversee the maintenance and enhancement of system availability, latency, performance, and efficiency through robust monitoring/observability practices, emergency response protocols, capacity planning, and the setting and upkeep of Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Error Budgets, including the creation of comprehensive dashboards.
● Ensure the stability, performance, and reliability of our platform, maintaining consistent uptime for all production environments.
● Take a lead role in evolving and refining our Observability strategy and roadmap to align with best practices and organizational goals.
● Drive the implementation and continuous improvement of SRE practices to develop a fully automated observability framework that boosts the availability, scalability, and monitoring capabilities of our systems, prioritizing operational excellence.
● Diagnose, troubleshoot, and address operational issues to meet established SLOs effectively.
● Collaborate with Cloud Operations and Engineering teams, delivering observability insights that address key business challenges and support system integrity.
● Facilitate operational efficiency by integrating and streamlining Observability and monitoring tools into a cohesive framework.
● Champion automation initiatives to minimize toil, thereby enhancing development productivity and system reliability.
● Propose architecture modifications for the platform based on data-driven analysis to improve reliability, performance, and availability.
● Develop and maintain essential technical documentation, including design specifications, user guides, runbooks, and best practices, to ensure knowledge sharing and system understanding.
● Proactively seek out and implement improvements to system availability and performance by leveraging insights gained from monitoring and observation.
● Govern infrastructure management through automation and Infrastructure as Code (IaC) practices.
● Engage actively in Agile processes, including sprint planning, daily stand-ups, and retrospectives, to foster continuous improvement and team collaboration.
Requirements:
● At least 3 years of experience in Site Reliability Engineering (SRE) and observability fields.
● Possession of a Bachelor’s degree in Computer Science, Data Science, or a related technology discipline.
● AWS Certified: Possession of an AWS certification (e.g., AWS Certified Solutions Architect, AWS Certified Developer, or AWS Certified SysOps Administrator) is required.
● Profound understanding and expertise in the culture and principles of site reliability, with a proven track record of applying SRE practices to applications or platforms.
● Comprehensive knowledge of SRE concepts such as Service Level Indicators (SLI), Service Level Objectives (SLO), Service Level Agreements (SLA), Error Budgets, Mean Time to Detect (MTTD), Mean Time to Repair (MTTR), Mean Time Between Failures (MTBF), toil, capacity planning, Observability, monitoring/alerting, release engineering, and incident management including blameless post-mortems.
● In-depth experience with various observability/monitoring tools, including but not limited to Datadog, New Relic, Sumo Logic, CloudWatch, CloudTrail, Dynatrace, Elastic, Prometheus, Splunk, and Grafana.
● Advanced skills in AWS cloud infrastructure, including proficiency in Infrastructure as Code (IaC) methodologies.
● Solid background in engineering and architectural practices within AWS environments.
● Active engagement in agile software development teams, contributing to the development process (shift-left).
● Ability to design, develop, and maintain infrastructure through popular Infrastructure as Code (IaC) tools like Terraform and/or AWS CDK.
● Strong expertise in cloud containerization technologies, particularly Kubernetes (EKS).
● Robust understanding of Linux, Windows, software development, system architecture, networking, and cloud computing principles.
● Hands-on experience in implementing and adhering to CI/CD best practices.
● Excellent communication skills, capable of mentoring and educating team members on SRE principles and practices, as well as influencing and driving vision across teams by engaging with peers organization-wide.
● Advanced analytical and programming abilities, with proficiency in languages such as Python, PowerShell, and Bash.
Benefits:
● OSDE 210 family health plan.
● Birthday day off.
● Continuous training through content platforms.
And more!
Apply for this position
If you are already talking to a recruiter from CONEXIONHR, do not fill out the form.