We are looking for a Senior Site Reliability Engineer to join our team and play a key role in ensuring the reliability, scalability, and performance of our systems. This position involves working across the entire service lifecycle, from design and deployment to monitoring and optimization. You will collaborate with global teams, tackle complex challenges, and implement automation strategies to improve system resilience and efficiency. Your expertise will be instrumental in maintaining the stability of critical systems and driving continuous improvement.

Responsibilities

Participate in and enhance the full lifecycle of services, including design, deployment, operation, and refinement
Analyze ITSM activities for the platform and provide feedback to development teams to address operational gaps and improve resiliency
Support services pre-launch through system design consultation, capacity planning, and launch reviews
Monitor live services by tracking availability, latency, and overall system health
Scale systems sustainably through automation and advocate for changes that enhance reliability and velocity
Lead application automation efforts to validate and promote software across environments while adhering to best practices
Practice incident response with a focus on sustainable solutions and conduct blameless postmortems
Take a proactive approach to problem-solving, connecting insights across the technology stack during production events to minimize recovery time
Collaborate with global teams across multiple regions and time zones to ensure consistent support and operations
Share expertise and provide mentorship to junior team members

Requirements

Bachelor's degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), or equivalent practical experience
At least three years of hands-on experience as a Site Reliability Engineer
Experience with technologies such as COBOL, JCL, VSAM, DB2, CICS, and MQ
Strong knowledge of algorithms, data structures, scripting, pipeline management, and software design
A systematic approach to problem-solving combined with excellent communication skills and a strong sense of ownership and drive
Proficiency in debugging and optimizing code, as well as automating routine tasks
Experience working with diverse stakeholders and handling urgent situations while making effective decisions
Interest and expertise in designing, analyzing, and troubleshooting large-scale distributed systems
English proficiency at a B2 level or higher, with strong verbal and written communication skills

Nice to have

Familiarity with cloud-native tools and platforms for enhancing system performance and scalability
Experience implementing observability solutions to monitor and optimize distributed systems
Knowledge of containerization and orchestration tools such as Docker and Kubernetes for managing application environments

We offer

International projects with top brands
Work with global teams of highly skilled, diverse peers
Healthcare benefits
Employee financial programs
Paid time off and sick leave
Upskilling, reskilling and certification courses
Unlimited access to the LinkedIn Learning library and 22,000+ courses
Global career opportunities
Volunteer and community involvement opportunities
EPAM Employee Groups
Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn
