EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multi-national teams, contribute to a myriad of innovative projects that deliver the most creative and cutting-edge solutions, and have an opportunity to continuously learn and grow. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.
We are excited to invite applications for the role of Lead Site Reliability Engineer. The selected candidate will be instrumental in optimizing our infrastructure and application performance through proactive system management, automation, and monitoring. This role is ideal for individuals with deep knowledge of cloud architectures and a knack for ensuring uninterrupted services.

Responsibilities

Design, build, and maintain scalable, reliable, and efficient cloud infrastructure across platforms like AWS and Azure
Automate repetitive tasks and system deployments in cloud environments using Python, Bash, or PowerShell
Implement and manage automation tools such as Jenkins, GitLab, and Ansible/Chef for seamless deployment, monitoring, and management of systems
Monitor overall system performance, proactively troubleshooting to ensure high availability and optimal functioning
Utilize tools like Grafana, New Relic, Splunk, or Dynatrace for effective monitoring, alerting, and logging to preemptively resolve potential issues in cloud infrastructure
Handle containerization and orchestration technologies including Docker and Kubernetes within cloud-native environments
Understand and apply concepts of SLI, SLO, SLA, and Error Budgets in day-to-day operations
Provide necessary on-call support and contribute to incident management and response initiatives as required

Requirements

8+ years of relevant working experience
At least 1 year of relevant leadership experience
Proficiency in managing cloud infrastructures, ideally on AWS or Azure
Competency in scripting and programming for cloud environments with Python, Bash, or PowerShell
Background in using automation and configuration management tools like Jenkins, GitLab, and Ansible/Chef
Familiarity with observability and monitoring solutions such as Grafana, New Relic, Splunk, or Dynatrace
Expertise in deploying and managing containerized applications using Docker and Kubernetes
Knowledge of applying SLI, SLO, SLA, and Error Budget frameworks in operational settings

We offer

Opportunity to work on technical challenges that may have an impact across geographies
Vast opportunities for self-development: online university, knowledge sharing opportunities globally, learning opportunities through external certifications
Opportunity to share your ideas on international platforms
Sponsored Tech Talks & Hackathons
Unlimited access to LinkedIn learning solutions
Possibility to relocate to any EPAM office for short and long-term projects
Focused individual development
Benefit package:
- Health benefits
- Retirement benefits
- Paid time off
- Flexible benefits
Forums to explore passions beyond work (CSR, photography, painting, sports, etc.)
