Senior Site Reliability Engineer

AT EPAM Systems

Chennai, India

EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multi-national teams, contribute to a myriad of innovative projects that deliver the most creative and cutting-edge solutions, and have an opportunity to continuously learn and grow. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.
We are seeking a talented and motivated Senior Site Reliability Engineer (SRE) to join our organization.
The experienced SRE will play a crucial role in ensuring the reliability, scalability, capacity planning, and performance of our infrastructure and applications. The ideal candidate will have a strong background in software engineering, system administration, containerization, and cloud technologies.

Responsibilities
  • Ensure system stability and high availability by proactively monitoring performance and troubleshooting issues
  • Design, build and maintain efficient, reliable, and scalable cloud-based infrastructure and services
  • Automate repetitive tasks and workflows with scripting and programming languages to improve efficiency and reduce errors
  • Implement and manage observability tools for comprehensive monitoring, alerting, and logging
  • Develop and execute automation strategies using tools like Jenkins, GitLab, and Ansible/Chef
  • Define and oversee SLIs, SLOs, SLAs, and error budgets to maintain service quality
  • Provide on-call support for incident management and participate actively in response activities
Requirements
  • 5 to 8 years of experience
  • Well-versed in scripting/programming languages (Python/Bash/PowerShell, etc.) to automate manual work, particularly within cloud environments
  • Well-versed in observability tools (Grafana, Splunk, Dynatrace) for monitoring, alerting, and logging solutions to identify and address potential issues, especially in cloud infrastructure
  • Working experience with automation tools (Jenkins, GitLab, Ansible/Chef for configuration management) and processes to streamline deployment, monitoring, and management of systems and applications in the cloud
  • Hands-on experience with containerization and orchestration technologies such as Docker, Kubernetes, or similar, particularly in cloud-native environments
  • Solid understanding of SLI, SLO, SLA, and error budget concepts and their implementation; willingness to provide on-call support and participate in incident management and response activities as needed
We offer
  • Opportunity to work on technical challenges whose impact may span geographies
  • Vast opportunities for self-development: online university, knowledge sharing opportunities globally, learning opportunities through external certifications
  • Opportunity to share your ideas on international platforms
  • Sponsored Tech Talks & Hackathons
  • Unlimited access to LinkedIn learning solutions
  • Possibility to relocate to any EPAM office for short and long-term projects
  • Focused individual development
  • Benefit package:
    • Health benefits
    • Retirement benefits
    • Paid time off
    • Flexible benefits
  • Forums to explore passions beyond work (CSR, photography, painting, sports, etc.)

Client-provided location(s): Chennai, Tamil Nadu, India
Job ID: EPAM-epamgdo_blt8c3f948f6d679b1a_en-us_Chennai_India
Employment Type: Other