We are looking for an experienced Site Reliability Engineer to join the RingCentral Operations Observability team. In this role, you will be responsible for the availability and performance of our in-house monitoring platform and its infrastructure.
Our team provides the mission-critical operational insights used across RingCentral, managing everything from high-scale data collection to our proprietary alert correlation and processing engine. You will play a crucial role in ensuring the reliability and uptime of these systems by identifying bottlenecks, automating recovery, and proactively scaling the environment. The ideal candidate is a Linux-focused SRE who enjoys working on custom-built internal products and has a strong background in distributed systems, containerization, and data-driven observability.
Responsibilities:
Maintain and Support Platform Availability: Act as the primary owner for the uptime and health of our internal monitoring and alerting infrastructure.
Incident Management: Represent the team in global incident resolution and participate in a sustainable on-call rotation.
Evolution of Custom Tooling: Design and implement improvements to the monitoring stack to meet evolving business needs.
Lifecycle Integration: Collaborate with Dev and Ops teams to integrate our custom observability solutions into the global software development lifecycle.
Capacity Management: Stay ahead of growth requirements in a high-concurrency, fast-growing SaaS environment.
Code-Level Contributions: Actively work with the team’s codebase (Go/Python) to extend system integrations and automate routine operational "toil."
Auditing & Standards: Conduct regular assessments of the monitoring systems to ensure they meet performance benchmarks and security standards.
Skills:
Experience: 4+ years as an SRE or Systems Engineer in a production environment.
Linux Expertise: Strong Linux administration and performance tuning skills.
Problem Solving: A methodical approach to troubleshooting complex, distributed system failures.
Programming: Experience with at least one language (Go or Python preferred) to interact with our custom-built codebase.
Observability Mindset: Deep understanding of the monitoring domain, SaaS telemetry, and alerting theory.
Cloud Platforms: Hands-on experience with AWS, GCP, or a similar cloud provider.
Scalability: Proven experience running systems in large-scale, heterogeneous environments (a major plus).
Communication: Ability to work with globally distributed teams and communicate technical issues clearly.
Preferred technology stack:
OS: Linux (CentOS/Red Hat/Oracle Linux).
Languages: Go, Python, JavaScript/TypeScript.
Cloud & Containers: AWS, Kubernetes, Docker.
Data Pipelines: Message brokers, distributed logs, and time-series databases (TSDBs).
Observability: Custom Alert Processors, Zabbix, Prometheus, Grafana.
Databases: ClickHouse, VictoriaMetrics, MongoDB, PostgreSQL.
Automation: Ansible, Terraform, GitLab CI, ArgoCD.
Qualifications:
B.S. in Computer Engineering, Computer Science, or a related field with 5+ years of relevant experience.
We offer:
A well-coordinated, professional team
Cutting-edge technologies, interesting and challenging tasks, a dynamic project, and strong opportunities for professional and career growth
Additional Health and Life Insurance Package
Employee Assistance Program
25 vacation days
