EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multi-national teams, contribute to a myriad of innovative projects that deliver the most creative and cutting-edge solutions, and have an opportunity to continuously learn and grow. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.
We are seeking a talented and motivated Senior Site Reliability Engineer (SRE) to join our organization.
The experienced SRE will play a crucial role in ensuring the reliability, scalability, capacity planning, and performance of our infrastructure and applications. The ideal candidate will have a strong background in software engineering, system administration, containerization, and cloud technologies.
Want more jobs like this?
Get jobs in Hyderabad, India delivered to your inbox every week.
#LI-DNI
Responsibilities
- Ensure system stability and high availability by proactively monitoring performance and troubleshooting issues
- Design, build and maintain efficient, reliable, and scalable cloud-based infrastructure and services
- Automate repetitive tasks and workflows to improve efficiency and reduce error using scripting and programming languages
- Implement and manage observability tools for comprehensive monitoring, alerting, and logging
- Develop and execute automation strategies using tools like Jenkins, GitLab, and Ansible/Chef
- Define and oversee SLI, SLO, SLA, and Error Budget to maintain service quality
- Provide on-call support for incident management and participate actively in response activities
- Should have 5 to 8 years of experience
- Well-versed with scripting/programming languages (Python/Bash/PowerShell, etc.) to automate manual work, particularly within cloud environments
- Well-versed with Observability tools (Grafana, Splunk, Dynatrace) for monitoring, alerting, and logging solutions to identify and address potential issues, especially in cloud infrastructure
- Working experience with automation tools (Jenkins, GitLab, Ansible/Chef for configuration management) and processes to streamline deployment, monitoring, and management of systems and applications in the cloud
- Hands-on experience with containerization and orchestration technologies such as Docker, Kubernetes, or similar, particularly in cloud-native environments
- Well aware of SLI, SLO, SLA, and Error Budget concepts and their implementations; provide on-call support and participate in incident management & response activities as needed
- Opportunity to work on technical challenges that may impact across geographies
- Vast opportunities for self-development: online university, knowledge sharing opportunities globally, learning opportunities through external certifications
- Opportunity to share your ideas on international platforms
- Sponsored Tech Talks & Hackathons
- Unlimited access to LinkedIn learning solutions
- Possibility to relocate to any EPAM office for short and long-term projects
- Focused individual development
- Benefit package:
- Health benefits
- Retirement benefits
- Paid time off
- Flexible benefits
- Forums to explore beyond work passion (CSR, photography, painting, sports, etc.)