Site Reliability Engineer


NOV
Posted date: 2 days ago
Minimum level: N/A
Employment type: Full-time
JOB DESCRIPTION

Site Reliability Engineer (SRE) - Application Performance Monitoring (APM)

Location: Monterrey, Nuevo León, Mexico (Hybrid - candidates must reside in Monterrey or the metropolitan area)

Language requirement: Fluent English (spoken and written)

About the Role

We're looking for a Site Reliability Engineer (SRE) with a passion for Application Performance Monitoring (APM) and system optimization.

In this role, you'll be at the heart of ensuring the reliability, scalability, and performance of NOV's mission-critical applications. You'll work closely with software engineering and operations teams to design monitoring strategies, analyze performance, and proactively prevent issues before they affect users.

If you thrive in fast-paced environments, love solving complex technical challenges, and enjoy turning data into insight, this is the role for you.

What You'll Do
  • Design and manage APM strategies using tools like Elastic APM, Datadog, Dynatrace, or similar platforms.
  • Perform deep performance analysis, tracing distributed requests and identifying bottlenecks in both code and infrastructure.
  • Build real-time dashboards and alerting systems using Grafana, Kibana, or equivalent tools to visualize system health.
  • Proactively monitor systems to detect performance degradations, security threats, and system failures before users are impacted.
  • Define and track Service Level Objectives (SLOs) and Service Level Agreements (SLAs) to continuously improve reliability.
  • Lead Root Cause Analysis (RCA) sessions after incidents and implement corrective actions to prevent recurrence.
  • Automate repetitive tasks and monitoring setups using Python, Bash, or PowerShell.
  • Collaborate with cross-functional teams to embed reliability, performance, and observability best practices into every stage of development.
  • Continuously refine tools, processes, and APM strategies to enhance efficiency, reliability, and visibility across platforms.
  • Engage with stakeholders to understand performance challenges and shape the platform roadmap.
What You Bring
  • Bachelor's or Master's degree in Computer Science, Engineering, or related field.
  • 5+ years of experience in Site Reliability, DevOps, or Performance Engineering roles.
  • Proven hands-on experience with APM tools such as Elastic APM, Datadog, Dynatrace, New Relic, or AppDynamics.
  • Expertise in the Elastic Stack (Elasticsearch, Logstash, Kibana, Beats) for logging, monitoring, and APM.
  • Deep understanding of SRE principles, DevOps methodologies, and Production Support operations.
  • Strong scripting ability in Python, Bash, or PowerShell for automation and analysis.
  • Solid grasp of Linux/Unix systems, networking fundamentals, and distributed system architecture.
  • Experience with containerization (Docker) and orchestration (Kubernetes).
  • Excellent analytical, problem-solving, and collaboration skills, with the ability to communicate effectively in a global team.
  • Fluent English, spoken and written (mandatory).
Preferred Skills
  • Experience with Infrastructure as Code (IaC) tools such as Terraform, Ansible, or Chef.
  • Familiarity with cloud-native services (AWS, Azure, or GCP) and serverless architectures (AWS Lambda, Azure Functions).
  • Knowledge of CI/CD tools like GitHub Actions, Azure DevOps, or Jenkins.
  • Understanding of other observability pillars, including metrics (Prometheus) and logging.
  • Experience working in agile environments.
Why NOV

At NOV, we combine over 150 years of innovation with cutting-edge technology to power the global energy industry.

You'll join a global engineering team that values collaboration, curiosity, and continuous improvement, giving you the opportunity to make a real impact on systems that matter.