Site Reliability Engineer

NOV
Posted date: 12 hours ago
Minimum level: N/A
Job category: Engineering
JOB DESCRIPTION
Overview
We are seeking a highly motivated and experienced Site Reliability Engineer (SRE) with a specialization in Application Performance Monitoring (APM) to join our team. You will be a key player in ensuring the reliability, performance, and scalability of our mission-critical applications and systems. You will work closely with software engineering and operations teams to proactively identify, analyze, and resolve performance issues. The ideal candidate is a creative problem-solver with deep expertise in APM tools, particularly the Elastic Stack, and a passion for designing and implementing innovative solutions to complex technical challenges.
Responsibilities
- APM Strategy: Design, implement, and manage our Application Performance Monitoring strategy using tools like Elastic APM, Datadog, Dynatrace, or similar platforms.
- Deep Performance Analysis: Utilize APM tools to conduct in-depth performance analysis, tracing distributed requests, identifying bottlenecks, and optimizing application code and infrastructure.
- Dashboarding and Alerting: Develop and maintain comprehensive dashboards, visualizations, and intelligent alerting systems in Grafana, Kibana, or other platforms to provide real-time insights into application health and performance.
- Proactive Issue Resolution: Monitor systems to detect and respond swiftly to performance degradations, security threats, and system failures before they impact users.
- Define and Track SLOs: Measure and optimize system performance by establishing and tracking key Service Level Objectives (SLOs) and Service Level Agreements (SLAs).
- Root Cause Analysis (RCA): Lead post-incident investigations to analyze the root cause of production issues, quantify business impact, and implement corrective actions to prevent recurrence.
- Automation: Automate repetitive tasks, monitoring setups, and incident response processes to enhance efficiency and reduce manual intervention.
- Collaboration: Partner with software engineering and operations teams to embed reliability and performance best practices into the entire development lifecycle.
- Continuous Improvement: Continuously refine our systems, processes, and APM tooling to elevate reliability, performance, and observability.
- Stakeholder Engagement: Engage with business stakeholders to understand key application pain points and solicit feedback to inform the platform roadmap.
Requirements
- Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
- 5+ years of experience in a Site Reliability, DevOps, or Performance Engineering role.
- Proven, hands-on experience with Application Performance Monitoring (APM) tools such as Elastic APM, Datadog, Dynatrace, New Relic, or AppDynamics.
- Expertise in the Elastic Stack (Elasticsearch, Logstash, Kibana, Beats) for logging, monitoring, and APM.
- Strong understanding of SRE principles, Production Support Operations, DevOps, and CI/CD methodologies.
- Proficiency in scripting languages such as Python, Bash, or PowerShell for automation and data analysis.
- Solid understanding of Linux/Unix systems, networking fundamentals, and distributed systems architecture.
- Experience with containerization and orchestration technologies, specifically Docker and Kubernetes.
- Excellent problem-solving skills with the ability to perform deep-dive analysis and think creatively.
- Strong communication and interpersonal skills, with the ability to collaborate effectively in a global, cross-functional team environment.
Desired Skills
- Experience with Infrastructure as Code (IaC) automation tools like Ansible, Terraform, or Chef.
- Knowledge of cloud-native services and serverless architectures (e.g., AWS Lambda, Azure Functions).
- Familiarity with modern CI/CD tools and environments (e.g., GitHub, Azure DevOps, Jenkins).
- Experience with other observability pillars, including metrics (Prometheus) and logging.
- Knowledge of agile development methodologies.
JOB SUMMARY
Site Reliability Engineer
NOV
Monterrey
Full-time