Responsibilities
1. Team Leadership and Process Management (Operations & People Management) (70% of working time)
Team Leadership: Manage a small team of DBRE engineers (3-5 people): setting goals, delegating tasks, monitoring execution, and developing and mentoring team members.
Operational Excellence: Establish, implement, and maintain effective operational processes, including Incident Management, Root Cause Analysis (Post-Mortems/RCA), and problem management.
Incident Management: Lead and coordinate actions during critical incidents, ensuring rapid communication, thorough documentation, and Post-Mortem execution.
Engineering Collaboration: Work closely with Development teams to ensure the operational readiness of new products (Production Readiness) and implement Shift-Left methodologies.
SRE Processes: Implement and enforce key SRE practices, such as SLI/SLO definition, Error Budget management, and automation of manual tasks (Toil Reduction).
2. Technical Expertise and Observability (30% of working time)
Monitoring and Alerting (Key Focus): Lead the configuration and optimization of monitoring and alerting systems. Ensure Actionable Alerting: alerts must signal user impact (SLI/SLO) rather than resource saturation alone.
On-Call Management: Manage alerting processes and rotations (On-Call Rotation), mitigating Pager Fatigue by continuously improving alert quality.
Cloud Platforms: Proven experience with cloud providers, preferably AWS, at a level sufficient to advise the team on technical decisions.
Containerization: Understanding of and hands-on experience with Kubernetes (k8s): deployment, administration, monitoring, and troubleshooting.
Automation: Programming experience in Python or Go for automating operational tasks.
Infrastructure as Code (IaC): Experience using Terraform for infrastructure and configuration management.
Qualifications
Experience
Minimum of 2-3 years of proven experience in a Team Lead or similar leadership role in DBRE, DBA, or SRE operations.
Experience managing operational processes and production incident resolution workflows.
Experience operating mission-critical, high-availability systems.
Technical Skills
MySQL: Solid understanding of building reliable MySQL clusters, covering sharding, replication, backups, performance, and query tuning. Experience building efficient monitoring, alerting, and reporting dashboards, and managing database capacity.
MongoDB: Administration experience, with a solid understanding of reliable configuration and efficient performance tuning. Understanding of cloud-hosted solutions such as DocumentDB, the migration process, and its pros and cons.
Kubernetes (k8s): Proficient working knowledge.
AWS: Experience with key cloud compute services and advanced experience with AWS managed database services, especially for MySQL. Understanding of AWS cost management processes.
IaC: Experience with Terraform.
Programming: Development skills in Python or Go (mid-level engineer).
Monitoring: Deep experience configuring systems (Prometheus, Grafana, Datadog) and implementing Observability practices (metrics, logs, traces).
Personal and Leadership Qualities
Management Skills: Ability to inspire and motivate engineers, develop their individual competencies, provide feedback, and set clear expectations. Understanding of project management basics.
Communication: Excellent written and verbal communication skills for interacting with technical teams, management, and leadership. Understanding of cultural differences and the ability to unite people distributed around the world.
Systemic Thinking: Ability to see the big picture, analyze complex operational issues, and propose long-term, scalable solutions.
We offer:
Well-coordinated professional team
Cutting-edge technologies, interesting and challenging tasks, a dynamic project, and great opportunities for self-realization and professional and career growth
Additional Health and Life Insurance Package
Employee Assistance Program
25 vacation days
ReBenefit Platform Account
This role requires on-site presence at our office 4 days a week to support effective collaboration and teamwork.
