The Site Reliability Engineer (SRE) will be responsible for implementing and managing observability across the company’s hybrid application and data landscape. This role ensures system reliability, scalability, and proactive monitoring to reduce downtime and shorten mean time to recovery (MTTR).
Key Responsibilities:
● Implement telemetry (logs, metrics, traces, events) for applications and data systems across on-prem and cloud.
● Design and configure dashboards, alerts, and monitoring tools aligned to SLAs and SLOs.
● Collaborate with Tier 1 and Tier 2 support teams to streamline case and incident management.
● Analyze observability data to identify anomalies, bottlenecks, and root causes.
● Provide proactive monitoring and support AIOps-based predictive insights.
● Document runbooks and workflows, and provide knowledge transfer to client teams.
Required Skills:
● Experience with observability tools (e.g., Prometheus, Grafana, ELK, AppDynamics, Dynatrace).
● Strong understanding of distributed systems, cloud environments (Azure, AWS, GCP), and containers/Kubernetes.
● Knowledge of SRE principles, SLIs, SLOs, and error budgets.
● Hands-on experience with CI/CD pipelines and their integration with ITSM tools (ServiceNow, ADO, Jira).
● Strong troubleshooting and root cause analysis skills.
● Excellent communication and collaboration abilities.
Benefits:
● Family health plan.
● Birthday day off.
● Continuous training through content platforms.
And more!
Apply for this position
If you are already talking to a recruiter from CONEXIONHR, do not fill out the form.