We are seeking a talented Senior Site Reliability Engineer to join our team and contribute to the development and maintenance of highly reliable and scalable systems. In this role, you will focus on optimizing infrastructure, automating processes, and ensuring system performance across distributed systems and cloud platforms. You will work closely with cross-functional teams, lead technical initiatives, and mentor team members to promote a culture of operational excellence and continuous improvement.
EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multi-national teams, contribute to a myriad of innovative projects that deliver the most creative and cutting-edge solutions, and have an opportunity to continuously learn and grow. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.

Responsibilities

Optimize Linux-based operating systems to enhance the performance of internet-facing production services and distributed systems
Implement advanced telemetry solutions using tools like Splunk, Grafana, and Prometheus to improve monitoring and organizational capabilities
Troubleshoot complex issues in Kubernetes and establish best practices and standards for the team
Develop and maintain automation scripts using Bash and Python to streamline operational workflows
Build and manage container orchestration platforms such as Kubernetes, including Amazon EKS, sharing expertise with team members
Design and maintain high-performance cloud infrastructure with AWS to ensure reliability and scalability
Drive automation initiatives to reduce manual work and increase team efficiency
Provide leadership by fostering collaboration, clear communication, and ownership within the team
Promote continuous learning and professional growth within the team, encouraging innovation and improvement
Offer technical mentorship and guidance to team members to improve clarity and communication
Manage disaster recovery strategies and capacity planning to ensure system resilience and scalability
Automate deployment processes using tools like Terraform or CloudFormation to improve reliability and productivity
Integrate and leverage open-source technologies such as Cassandra, Kafka, Solr, Postgres, and Redis to enhance SRE practices

Requirements

Bachelor's degree in Computer Science, a related technical field, or equivalent practical experience
At least three years of hands-on experience as a Site Reliability Engineer
Proficiency in Bash for scripting and automation to improve workflows
Experience using Grafana for monitoring and visualization of system performance
Strong expertise in Linux systems and their optimization for production environments
Knowledge of Microsoft Internet Information Services (IIS) for managing and maintaining web server infrastructure
Proficiency in Prometheus for monitoring and alerting in distributed systems
Experience using Python for automation and operational improvements
English proficiency at a B2 level or higher, with strong verbal and written communication skills

Nice to have

Experience designing scalable solutions with Amazon Web Services (AWS)
Familiarity with cloud platforms and their integration into system architecture
Expertise in Kubernetes for managing and orchestrating containerized applications
Experience using Splunk for log management and advanced telemetry
Knowledge of Terraform and Terraform Cloud for infrastructure automation and deployment
Strong troubleshooting skills for identifying and resolving complex technical issues

We offer

Connectivity Bonus (15,000 ARS paid at the end of each month, shown on the salary receipt as a non-remunerative item)
Medicina Prepaga (private health coverage for the employee and their immediate family)
Paternity Leave (two days added to the leave established by law, for a total of 4 days)
Discount card
English Training (English lessons, twice per week)
Training Program (Access to multiple customized training plans according to the needs of each role within the company)
Marriage bonus (The company doubles the allowance established by law that ANSES offers)
Referral Program (a referral bonus is paid when a candidate referred by an employee joins the company)
External Agreements and Discounts
Vacations: 14 calendar days a year

By applying to our role, you agree that your personal data may be used as set out in EPAM's Privacy Notice and Policy.
