Introduction
We are seeking a highly skilled and proactive Site Reliability Engineer (SRE) to join our global team. The ideal candidate will play a critical role in ensuring the availability, scalability, and performance of our systems. You will work on troubleshooting, monitoring, root cause analysis, and incident management to support our SaaS environments and other critical infrastructure.
Your Role and Responsibilities
- Troubleshoot, monitor, and support critical production systems.
- Perform root cause analysis and manage incidents to ensure timely resolution.
- Provision and deploy environments in a cloud infrastructure (preferably IBM Cloud).
- Handle initial intake for Salesforce-related customer cases, ensuring SLA commitments are met.
- Provide on-call support, sharing rotation duties with global team members (including Poland and Costa Rica), to minimize Mean Time to Recovery (MTTR).
- Manage workloads and resources to maintain commitments and prevent SLA breaches.
Required Technical and Professional Expertise
- Strong working knowledge of Kubernetes and cloud infrastructures, with a preference for IBM Cloud.
- Expertise in administration, configuration, and management of MS SQL Server 2022.
- Proven experience providing on-call support for critical production systems, with a focus on root cause analysis (RCA).
- Expertise in automation platforms such as AWX.
- Proficiency in scripting languages like Python and related tools.
- Strong problem-solving skills and attention to detail.
- Fluent English required.
Preferred Technical and Professional Expertise
- Familiarity with Salesforce infrastructure and case management processes.
- Experience with monitoring tools and incident management platforms.
- Ability to work efficiently in a global, distributed team environment.