Lead Site Reliability Engineer

AT EPAM Systems

Río Grande, Mexico

We are looking for an experienced Lead Site Reliability Engineer to join our team and play a key role in building and maintaining robust, scalable, and efficient systems. This position focuses on improving infrastructure, streamlining processes through automation, and ensuring optimal performance across distributed systems and cloud platforms. You will collaborate with diverse teams, lead technical projects, and mentor team members to foster a culture of innovation and operational excellence.

Responsibilities

  • Enhance the performance of Linux-based operating systems for production services and distributed systems
  • Develop and implement advanced monitoring solutions using tools like Grafana, Prometheus, and Splunk to improve system observability
  • Resolve complex Kubernetes-related issues and establish team-wide best practices and standards
  • Create and maintain automation scripts with Bash and Python to streamline operational processes
  • Build and manage container orchestration platforms such as Kubernetes or EKS, sharing knowledge with the team
  • Design and manage reliable and scalable cloud infrastructure on AWS to ensure system availability
  • Lead initiatives to automate repetitive processes and drive efficiency across the team
  • Provide leadership and promote a collaborative work environment through effective communication and ownership
  • Encourage continuous learning and development among team members to foster a culture of growth and curiosity
  • Offer technical guidance and mentorship to team members to improve communication and operational efficiency
  • Plan and manage disaster recovery strategies and capacity planning to ensure system resilience and scalability
  • Automate deployment workflows using tools like Terraform or CloudFormation to improve reliability and productivity
  • Incorporate open-source technologies such as Cassandra, Kafka, Postgres, Solr, and Redis to advance SRE methodologies
Requirements

  • A bachelor's degree in Computer Science, a related technical field, or equivalent practical experience
  • A minimum of five years of hands-on experience as a Site Reliability Engineer
  • At least one year of experience in a leadership or team management role
  • Proficiency in Bash for scripting and process automation
  • Experience with Grafana for system monitoring and visualization
  • Strong expertise in Linux systems and their optimization for high-performance environments
  • Familiarity with Microsoft Internet Information Services (IIS) for managing web servers
  • Knowledge of Prometheus for monitoring and alerting in distributed environments
  • Proficiency in Python for creating automation solutions and improving operational workflows
  • Fluency in English at a B2 level or higher, with strong verbal and written communication skills
Nice to have
  • Experience designing scalable solutions with Amazon Web Services (AWS)
  • Familiarity with cloud platforms and their integration into system designs
  • Advanced knowledge of Kubernetes for managing containerized applications
  • Experience using Splunk for log management and advanced telemetry
  • Expertise with Terraform and Terraform Cloud for infrastructure automation
  • Strong skills in troubleshooting and resolving complex technical challenges
We offer
  • Career plan and real growth opportunities
  • Unlimited access to LinkedIn learning solutions
  • International Mobility Plan within 25 countries
  • Constant training, mentoring, online corporate courses, eLearning and more
  • English classes with a certified teacher
  • Support for employee initiatives (algorithms club, Toastmasters, agile club, and more)
  • Enjoyable working environment (gaming room, napping area, amenities, events, sports teams, and more)
  • Flexible work schedule and dress code
  • Collaborate in a multicultural environment and share best practices from around the globe
  • Hired directly by EPAM & 100% under payroll
  • Benefits required by law (IMSS, INFONAVIT, 25% vacation bonus)
  • Insurance: life and major medical expenses, with dental and vision coverage (for the employee and direct family members)
  • 13% employee savings fund, capped at the legal limit
  • Grocery coupons
  • 30-day December bonus
  • Employee Stock Purchase Plan
  • 12 vacation days plus 4 floating days
  • Official Mexican holidays, plus 5 extra holidays (Maundy Thursday and Good Friday, November 2nd, December 24th & 31st)
  • Monthly non-taxable amount for the electricity and internet bills
EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multi-national teams, contribute to a myriad of innovative projects that deliver the most creative and cutting-edge solutions, and have an opportunity to continuously learn and grow. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.
By applying to our role, you are agreeing that your personal data may be used as set out in EPAM's Privacy Notice and Policy.

Client-provided location(s): Mexico
Job ID: EPAM-epamgdo_blt0664a559e79a9a39_en-us_Other_Mexico
Employment Type: Other