Site Reliability Engineer

NOV

Location: Monterrey
Posted date: 13 days ago
Minimum level: N/A
Employment type: Full-time
Job category: Engineering

JOB DESCRIPTION

Overview

We are seeking a highly motivated and experienced Site Reliability Engineer (SRE) with a specialization in Application Performance Monitoring (APM) to join our team. You will be a key player in ensuring the reliability, performance, and scalability of our mission-critical applications and systems. You will work closely with software engineering and operations teams to proactively identify, analyze, and resolve performance issues. The ideal candidate is a creative problem-solver with deep expertise in APM tools, particularly the Elastic Stack, and a passion for designing and implementing innovative solutions to complex technical challenges.

Responsibilities

APM Strategy: Design, implement, and manage our Application Performance Monitoring strategy using tools like Elastic APM, Datadog, Dynatrace, or similar platforms.
Deep Performance Analysis: Use APM tools to conduct in-depth performance analysis: trace distributed requests, identify bottlenecks, and optimize application code and infrastructure.
Dashboarding and Alerting: Develop and maintain comprehensive dashboards, visualizations, and intelligent alerting systems in Grafana, Kibana, or other platforms to provide real-time insights into application health and performance.
Proactive Issue Resolution: Monitor systems to detect and respond swiftly to performance degradations, security threats, and system failures before they impact users.
Define and Track SLOs: Measure and optimize system performance by establishing and tracking key Service Level Objectives (SLOs) and Service Level Agreements (SLAs).
Root Cause Analysis (RCA): Lead post-incident investigations to analyze the root cause of production issues, quantify business impact, and implement corrective actions to prevent recurrence.
Automation: Automate repetitive tasks, monitoring setups, and incident response processes to enhance efficiency and reduce manual intervention.
Collaboration: Partner with software engineering and operations teams to embed reliability and performance best practices into the entire development lifecycle.
Continuous Improvement: Continuously refine our systems, processes, and APM tooling to elevate reliability, performance, and observability.
Stakeholder Engagement: Engage with business stakeholders to understand key application pain points and solicit feedback to inform the platform roadmap.

Requirements

Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
5+ years of experience in a Site Reliability, DevOps, or Performance Engineering role.
Proven, hands-on experience with Application Performance Monitoring (APM) tools such as Elastic APM, Datadog, Dynatrace, New Relic, or AppDynamics.
Expertise in the Elastic Stack (Elasticsearch, Logstash, Kibana, Beats) for logging, monitoring, and APM.
Strong understanding of SRE principles, production support operations, DevOps, and CI/CD methodologies.
Proficiency in scripting languages such as Python, Bash, or PowerShell for automation and data analysis.
Solid understanding of Linux/Unix systems, networking fundamentals, and distributed systems architecture.
Experience with containerization and orchestration technologies, specifically Docker and Kubernetes.
Excellent problem-solving skills with the ability to perform deep-dive analysis and think creatively.
Strong communication and interpersonal skills, with the ability to collaborate effectively in a global, cross-functional team environment.

Desired Skills

Experience with Infrastructure as Code (IaC) automation tools like Ansible, Terraform, or Chef.
Knowledge of cloud-native services and serverless architectures (e.g., AWS Lambda, Azure Functions).
Familiarity with modern CI/CD tools and environments (e.g., GitHub, Azure DevOps, Jenkins).
Experience with observability pillars beyond APM, including metrics (e.g., Prometheus) and logging.
Knowledge of agile development methodologies.
