Introduction
At IBM, work is more than a job - it's a calling: To build. To design. To code. To consult. To think along with clients and sell. To make markets. To invent. To collaborate. Not just to do something better, but to attempt things you've never thought possible. Are you ready to lead in this new era of technology and solve some of the world's most challenging problems? If so, let's talk.
Your Role and Responsibilities
IBM WebSphere Liberty Site Reliability Engineering (SRE) is a discipline that combines software and systems engineering to build and run large-scale, distributed, fault-tolerant systems. SRE is responsible for the availability and reliability of the IBM WebSphere Liberty Service and ensures that it meets the requirements of both internal and external users. We look for engineers who are motivated to collaborate with our Development squads to build and run sustainable production systems, and who can evolve and adapt to changes in our fast-paced, worldwide environment. You should have a strong desire to work within a CI/CD environment and a passion for embracing new cloud technologies and working with our customers to ensure they are successful. You need to be collaborative, able to handle responsibility, and eager to learn new techniques and tools. There is no requirement to be an expert in any one language or technology. However, knowledge of Go, Bash, Python, ArgoCD, Jenkins, Docker, Kubernetes, OpenShift, or IBM Cloud/AWS/MS Azure would be useful. Experience operating highly available, zero-downtime production environments would also be beneficial. The key requirement is a passion for supporting, operating, and developing a high-quality, highly available service.
Broadly, responsibilities include:
• Scale and manage the service in cloud environments
• Create sustainable systems and services through automation
• Ensure a healthy production environment by monitoring availability and applying fault diagnosis and remediation as appropriate
• Drive the incident management process and support a blameless post-mortem culture
• Partner with development teams to improve services via rigorous testing and release procedures
• Participate in system design consulting, platform management, and capacity planning
• Ensure systems are secure and compliant
Required Technical and Professional Expertise
• Working knowledge of IBM WebSphere Liberty
• Experience in Cloud platforms
• Ability to code; primarily a software developer at heart
• Ability to write complex automation scripts
• Proficiency in one or more of the following: Go, Python, C, C++, Ansible or shell scripting
• BS degree in Computer Science or a related technical field involving coding and/or systems engineering
• Experience with operating system internals and/or networking
• Proactive with a 'can do' attitude and an ability to adapt to new technologies
Preferred Technical and Professional Expertise
Engineers that have been successful in this area typically have:
• Track record of understanding the customer problem, taking ownership, and collaborating with others to troubleshoot and define options for overcoming it
• Track record in performing live debugging (or historical debugging for RCAs) in production systems through log, metrics, infrastructure, and code analysis
• Great communication skills, with the ability to summarize the customer problem and the steps already tried for others, and to take recommendations back to the customer
• Ability to self-prioritize issues based on customer impact
• Ability to bring back lessons learned and advocate for or implement improvements to troubleshooting documentation or software
• Strong teamwork and a willingness to turn their hand to whatever the highest priority issue of the day happens to be
• A solid engineering underpinning to understand what high availability means
• Understanding of and skill in software component development, container-based deployments, and best practices in security, compliance, high availability, and resilience
• Ability to understand and contribute to a delivery pipeline that takes code from development through to production in minimal time and with little or no impact on the customer, and to design and implement tools for automated deployment and monitoring of multiple environments
• Experience working in collaborative, agile environments, with an understanding of all the key aspects of delivering high-quality software services
• Fluent level of English