We are looking for a Senior Site Reliability Engineer to join our team and play a key role in ensuring the reliability, scalability, and performance of our systems. This position involves working across the entire service lifecycle, from design and deployment to monitoring and optimization. You will collaborate with global teams, tackle complex challenges, and implement automation strategies to improve system resilience and efficiency. Your expertise will be instrumental in maintaining the stability of critical systems and driving continuous improvement.
Responsibilities
- Participate in and enhance the full lifecycle of services, including design, deployment, operation, and refinement
- Analyze ITSM activities for the platform and provide feedback to development teams to address operational gaps and improve resiliency
- Support services pre-launch through system design consultation, capacity planning, and launch reviews
- Monitor live services by tracking availability, latency, and overall system health
- Scale systems sustainably through automation and advocate for changes that enhance reliability and velocity
- Lead application automation efforts to validate and promote software across environments while adhering to best practices
- Practice incident response with a focus on sustainable solutions and conduct blameless postmortems
- Take a proactive approach to problem-solving, connecting insights across the technology stack during production events to minimize recovery time
- Collaborate with global teams across multiple regions and time zones to ensure consistent support and operations
- Share expertise and provide mentorship to junior team members
Requirements
- Bachelor's degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), or equivalent practical experience
- At least three years of hands-on experience as a Site Reliability Engineer
- Experience with technologies such as COBOL, JCL, VSAM, DB2, CICS, and MQ
- Strong knowledge of algorithms, data structures, scripting, pipeline management, and software design
- A systematic approach to problem-solving combined with excellent communication skills and a strong sense of ownership and drive
- Proficiency in debugging and optimizing code, as well as automating routine tasks
- Experience working with diverse stakeholders and handling urgent situations while making effective decisions
- Interest and expertise in designing, analyzing, and troubleshooting large-scale distributed systems
- English proficiency at a B2 level or higher, with strong verbal and written communication skills
- Familiarity with cloud-native tools and platforms for enhancing system performance and scalability
- Experience implementing observability solutions to monitor and optimize distributed systems
- Knowledge of containerization and orchestration tools such as Docker and Kubernetes for managing application environments
Benefits
- International projects with top brands
- Work with global teams of highly skilled, diverse peers
- Healthcare benefits
- Employee financial programs
- Paid time off and sick leave
- Upskilling, reskilling and certification courses
- Unlimited access to the LinkedIn Learning library and 22,000+ courses
- Global career opportunities
- Volunteer and community involvement opportunities
- EPAM Employee Groups
- Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn