EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multi-national teams, contribute to a myriad of innovative projects that deliver the most creative and cutting-edge solutions, and have an opportunity to continuously learn and grow. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.
We are excited to invite applications for the role of Lead Site Reliability Engineer. The selected candidate will be instrumental in optimizing our infrastructure and application performance through proactive system management, automation, and monitoring. This role is well suited to individuals with deep knowledge of cloud architectures and a knack for keeping services running without interruption.
Responsibilities
- Design, build, and maintain scalable, reliable, and efficient cloud infrastructure across platforms like AWS and Azure
- Automate repetitive tasks and system deployments using Python, Bash, or PowerShell in cloud settings
- Implement and manage automation tools such as Jenkins, GitLab, and Ansible/Chef for seamless deployment, monitoring, and management of systems
- Monitor overall system performance, proactively troubleshooting to ensure high availability and optimal functioning
- Utilize tools like Grafana, New Relic, Splunk, or Dynatrace for effective monitoring, alerting, and logging to preemptively resolve potential issues in cloud infrastructure
- Handle containerization and orchestration technologies including Docker and Kubernetes within cloud-native environments
- Understand and apply concepts of SLI, SLO, SLA, and Error Budgets in day-to-day operations
- Provide necessary on-call support and contribute to incident management and response initiatives as required
Requirements
- 8+ years of relevant working experience
- At least 1 year of relevant leadership experience
- Proficiency in managing cloud infrastructures, ideally on AWS or Azure
- Competency in scripting and programming with Python, Bash, or PowerShell specifically tailored for cloud environments
- Background in using automation and configuration management tools like Jenkins, GitLab, and Ansible/Chef
- Familiarity with observability and monitoring solutions such as Grafana, New Relic, Splunk, or Dynatrace
- Expertise in deploying and managing containerized applications using Docker and Kubernetes
- Knowledge of employing SLI, SLO, SLA, and Error Budget frameworks in operational settings
We offer
- Opportunity to work on technical challenges that may have an impact across geographies
- Vast opportunities for self-development: online university, global knowledge sharing, and learning through external certifications
- Opportunity to share your ideas on international platforms
- Sponsored Tech Talks & Hackathons
- Unlimited access to LinkedIn Learning solutions
- Possibility to relocate to any EPAM office for short and long-term projects
- Focused individual development
- Benefit package:
  - Health benefits
  - Retirement benefits
  - Paid time off
  - Flexible benefits
- Forums to explore passions beyond work (CSR, photography, painting, sports, etc.)