Senior Site Reliability Engineer

AT EPAM Systems

Soacha, Colombia (posted 6 days ago)

Viewed on April 16, 2025

We are looking for a Senior Site Reliability Engineer to join our team and play a key role in ensuring the reliability, scalability, and performance of our systems. This position involves working across the entire service lifecycle, from design and deployment to monitoring and optimization. You will collaborate with global teams, tackle complex challenges, and implement automation strategies to improve system resilience and efficiency. Your expertise will be instrumental in maintaining the stability of critical systems and driving continuous improvement.
We accept CVs in English only.

Responsibilities

  • Participate in and enhance the full lifecycle of services, including design, deployment, operation, and refinement
  • Analyze ITSM activities for the platform and provide feedback to development teams to address operational gaps and improve resiliency
  • Support services pre-launch through system design consultation, capacity planning, and launch reviews
  • Monitor live services by tracking availability, latency, and overall system health
  • Scale systems sustainably through automation and advocate for changes that enhance reliability and velocity
  • Lead application automation efforts to validate and promote software across environments while adhering to best practices
  • Practice incident response with a focus on sustainable solutions and conduct blameless postmortems
  • Take a proactive approach to problem-solving, connecting insights across the technology stack during production events to minimize recovery time
  • Collaborate with global teams across multiple regions and time zones to ensure consistent support and operations
  • Share expertise and provide mentorship to junior team members
Requirements

  • Bachelor's degree in Computer Science, or a related technical field involving coding (e.g., physics or mathematics), or equivalent practical experience
  • At least three years of hands-on experience as a Site Reliability Engineer
  • Experience with technologies such as COBOL, JCL, VSAM, DB2, CICS, and MQ
  • Strong knowledge of algorithms, data structures, scripting, pipeline management, and software design
  • A systematic approach to problem-solving combined with excellent communication skills and a strong sense of ownership and drive
  • Proficiency in debugging and optimizing code, as well as automating routine tasks
  • Experience working with diverse stakeholders and handling urgent situations while making effective decisions
  • Interest and expertise in designing, analyzing, and troubleshooting large-scale distributed systems
  • English proficiency at a B2 level or higher, with strong verbal and written communication skills
Nice to have
  • Familiarity with cloud-native tools and platforms for enhancing system performance and scalability
  • Experience implementing observability solutions to monitor and optimize distributed systems
  • Knowledge of containerization and orchestration tools such as Docker and Kubernetes for managing application environments
We offer
  • Learning Culture - We want you to be the best version of yourself; that is why we offer unlimited access to learning platforms, a wide range of internal courses, and all the knowledge you need to grow professionally
  • Health Coverage - Health and wellness are important; that is why we cover you and up to four family members under a premier health plan. We offer several options, so you can choose what is best for you and your family
  • Visual Benefit - Seeing your work for us would be a sight for sore eyes. We want your vision to always be at 100%, which is why we offer up to $200.000 COP for any visual health expenses
  • Life Insurance Plan - We have partnered with MetLife to offer a full-coverage life insurance plan, so your family is covered even if you are gone
  • Medical Leave Coverage - We are one of the few companies that cover 100% of your medical leave, for up to 90 days. Your health is the most important thing to us
  • Professional Growth Opportunities - We have designed a highly competitive and complete development process, where you will have all the tools to get where you have always wanted to be, personally and professionally
  • Stock Option Purchase Plan - As an EPAMer you can be more than just an employee, you will also have the opportunity to purchase stock at a reduced price and become a part owner of our organization
  • Additional Income - Besides your regular salary, you will also have the chance to earn extra income by referring talent, serving as a technical interviewer, and in many other ways
  • Community Benefit - You will be part of a worldwide community of over 50,000 employees, where you can learn, challenge yourself, stand out, and share your knowledge and experience with multicultural teams!
Please note that even though you are applying for this position, you may be offered other projects to join within EPAM.
EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multi-national teams, contribute to a myriad of innovative projects that deliver the most creative and cutting-edge solutions, and have an opportunity to continuously learn and grow. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.

Client-provided location(s): Colombia
Job ID: EPAM-epamgdo_blt4b8ef4cc497f614c_en-us_Other_Colombia
Employment Type: Other