We are seeking a talented Senior Site Reliability Engineer to join our team and contribute to the development and maintenance of highly reliable and scalable systems. In this role, you will focus on optimizing infrastructure, automating processes, and ensuring system performance across distributed systems and cloud platforms. You will work closely with cross-functional teams, lead technical initiatives, and mentor team members to promote a culture of operational excellence and continuous improvement.
Responsibilities
- Optimize Linux-based operating systems to enhance the performance of internet-facing production services and distributed systems
- Implement advanced telemetry solutions using tools like Splunk, Grafana, and Prometheus to improve monitoring and organizational capabilities
- Troubleshoot complex issues in Kubernetes and establish best practices and standards for the team
- Develop and maintain automation scripts using Bash and Python to streamline operational workflows
- Build and manage container orchestration systems such as Kubernetes or EKS, sharing expertise with team members
- Design and maintain high-performance cloud infrastructure with AWS to ensure reliability and scalability
- Drive automation initiatives to reduce manual work and increase team efficiency
- Provide leadership by fostering collaboration, clear communication, and ownership within the team
- Promote continuous learning and professional growth within the team, encouraging innovation and improvement
- Offer technical mentorship and guidance to team members to improve clarity and communication
- Manage disaster recovery strategies and capacity planning to ensure system resilience and scalability
- Automate deployment processes using tools like Terraform or CloudFormation to improve reliability and productivity
- Integrate and leverage open-source technologies such as Cassandra, Kafka, Solr, Postgres, and Redis to enhance SRE practices
Requirements
- Bachelor's degree in Computer Science, a related technical field, or equivalent practical experience
- At least three years of hands-on experience as a Site Reliability Engineer
- Proficiency in Bash for scripting and automation to improve workflows
- Experience using Grafana for monitoring and visualization of system performance
- Strong expertise in Linux systems and their optimization for production environments
- Knowledge of Microsoft Internet Information Services (IIS) for managing and maintaining web server infrastructure
- Proficiency in Prometheus for monitoring and alerting in distributed systems
- Experience using Python for automation and operational improvements
- English proficiency at a B2 level or higher, with strong verbal and written communication skills
- Experience designing scalable solutions with Amazon Web Services (AWS)
- Familiarity with cloud platforms and their integration into system architecture
- Expertise in Kubernetes for managing and orchestrating containerized applications
- Experience using Splunk for log management and advanced telemetry
- Knowledge of Terraform and Terraform Cloud for infrastructure automation and deployment
- Strong troubleshooting skills for identifying and resolving complex technical issues
Benefits
- Career plan and real growth opportunities
- Unlimited access to LinkedIn Learning solutions
- International Mobility Plan within 25 countries
- Constant training, mentoring, online corporate courses, eLearning and more
- English classes with a certified teacher
- Support for employee initiatives (Algorithms club, Toastmasters, Agile club and more)
- Enjoyable working environment (gaming room, napping area, amenities, events, sports teams and more)
- Flexible work schedule and dress code
- Collaborate in a multicultural environment and share best practices from around the globe
- Direct hire by EPAM, 100% on payroll
- Benefits required by law (IMSS, INFONAVIT, 25% vacation bonus)
- Insurance: life and major medical expenses, with dental and vision coverage (for the employee and direct family members)
- 13% employee savings fund, capped at the legal limit
- Grocery coupons
- 30-day December bonus
- Employee Stock Purchase Plan
- 12 vacation days plus 4 floating days
- Official Mexican holidays, plus 5 extra holidays (Maundy Thursday, Good Friday, November 2nd, December 24th and 31st)
- Monthly non-taxable amount for the electricity and internet bills
By applying to our role, you are agreeing that your personal data may be used as set out in EPAM's Privacy Notice and Policy.