EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multi-national teams, contribute to a myriad of innovative projects that deliver the most creative and cutting-edge solutions, and have an opportunity to continuously learn and grow. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.
We are seeking a talented and motivated Senior Site Reliability Engineer (SRE) to join our organization.
The experienced SRE will play a crucial role in ensuring the reliability, scalability, and performance of our infrastructure and applications, including capacity planning. The ideal candidate will have a strong background in software engineering, system administration, containerization, and cloud technologies.

Responsibilities

Ensure system stability and high availability by proactively monitoring performance and troubleshooting issues
Design, build and maintain efficient, reliable, and scalable cloud-based infrastructure and services
Automate repetitive tasks and workflows using scripting and programming languages to improve efficiency and reduce errors
Implement and manage observability tools for comprehensive monitoring, alerting, and logging
Develop and execute automation strategies using tools like Jenkins, GitLab, and Ansible/Chef
Define and oversee SLIs, SLOs, SLAs, and Error Budgets to maintain service quality
Provide on-call support for incident management and participate actively in response activities

Requirements

5 to 8 years of relevant experience
Proficiency in scripting/programming languages (Python/Bash/PowerShell, etc.) to automate manual work, particularly within cloud environments
Proficiency with observability tools (Grafana, Splunk, Dynatrace) for monitoring, alerting, and logging, used to identify and address potential issues, especially in cloud infrastructure
Working experience with automation tools (Jenkins, GitLab, Ansible/Chef for configuration management) and processes to streamline deployment, monitoring, and management of systems and applications in the cloud
Hands-on experience with containerization and orchestration technologies such as Docker, Kubernetes, or similar, particularly in cloud-native environments
Solid understanding of SLI, SLO, SLA, and Error Budget concepts and their implementation; willingness to provide on-call support and participate in incident management and response activities as needed

We offer

Opportunity to work on technical challenges that may have an impact across geographies
Vast opportunities for self-development: online university, global knowledge sharing, and learning through external certifications
Opportunity to share your ideas on international platforms
Sponsored Tech Talks & Hackathons
Unlimited access to LinkedIn learning solutions
Possibility to relocate to any EPAM office for short and long-term projects
Focused individual development
Benefit package:
- Health benefits
- Retirement benefits
- Paid time off
- Flexible benefits
Forums to explore passions beyond work (CSR, photography, painting, sports, etc.)
