Site Reliability Engineer


NOV
Posted date: 12 hours ago
Minimum level: N/A
Employment type: Full-time
JOB DESCRIPTION

Overview

We are seeking a highly motivated and experienced Site Reliability Engineer (SRE) with a specialization in Application Performance Monitoring (APM) to join our team. You will be a key player in ensuring the reliability, performance, and scalability of our mission-critical applications and systems. You will work closely with software engineering and operations teams to proactively identify, analyze, and resolve performance issues. The ideal candidate is a creative problem-solver with deep expertise in APM tools, particularly the Elastic Stack, and a passion for designing and implementing innovative solutions to complex technical challenges.

Responsibilities

  • APM Strategy: Design, implement, and manage our Application Performance Monitoring strategy using tools like Elastic APM, Datadog, Dynatrace, or similar platforms.
  • Deep Performance Analysis: Use APM tools to conduct in-depth performance analysis: trace distributed requests, identify bottlenecks, and optimize application code and infrastructure.
  • Dashboarding and Alerting: Develop and maintain comprehensive dashboards, visualizations, and intelligent alerting systems in Grafana, Kibana, or other platforms to provide real-time insights into application health and performance.
  • Proactive Issue Resolution: Monitor systems to detect and respond swiftly to performance degradations, security threats, and system failures before they impact users.
  • Define and Track SLOs: Measure and optimize system performance by establishing and tracking key Service Level Objectives (SLOs) and Service Level Agreements (SLAs).
  • Root Cause Analysis (RCA): Lead post-incident investigations to analyze the root cause of production issues, quantify business impact, and implement corrective actions to prevent recurrence.
  • Automation: Automate repetitive tasks, monitoring setups, and incident response processes to enhance efficiency and reduce manual intervention.
  • Collaboration: Partner with software engineering and operations teams to embed reliability and performance best practices into the entire development lifecycle.
  • Continuous Improvement: Continuously refine our systems, processes, and APM tooling to elevate reliability, performance, and observability.
  • Stakeholder Engagement: Engage with business stakeholders to understand key application pain points and solicit feedback to inform the platform roadmap.


Requirements

  • Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
  • 5+ years of experience in a Site Reliability, DevOps, or Performance Engineering role.
  • Proven, hands-on experience with Application Performance Monitoring (APM) tools such as Elastic APM, Datadog, Dynatrace, New Relic, or AppDynamics.
  • Expertise in the Elastic Stack (Elasticsearch, Logstash, Kibana, Beats) for logging, monitoring, and APM.
  • Strong understanding of SRE principles, Production Support Operations, DevOps, and CI/CD methodologies.
  • Proficiency in scripting languages such as Python, Bash, or PowerShell for automation and data analysis.
  • Solid understanding of Linux/Unix systems, networking fundamentals, and distributed systems architecture.
  • Experience with containerization and orchestration technologies, specifically Docker and Kubernetes.
  • Excellent problem-solving skills with the ability to perform deep-dive analysis and think creatively.
  • Strong communication and interpersonal skills, with the ability to collaborate effectively in a global, cross-functional team environment.


Desired Skills

  • Experience with Infrastructure as Code (IaC) automation tools like Ansible, Terraform, or Chef.
  • Knowledge of cloud-native services and serverless architectures (e.g., AWS Lambda, Azure Functions).
  • Familiarity with modern CI/CD tools and environments (e.g., GitHub, Azure DevOps, Jenkins).
  • Experience with other observability pillars, including metrics (Prometheus) and logging.
  • Knowledge of agile development methodologies.

JOB SUMMARY
Site Reliability Engineer
Company: NOV
Location: Monterrey
Posted date: 12 hours ago
Minimum level: N/A
Employment type: Full-time
