Introduction
The Site Reliability Engineering (SRE) team ensures the service is highly available and fully optimized in a 24/7 environment. As an SRE you will play a crucial role in ensuring the reliability and resiliency of our systems. If you are passionate about optimizing, building automation, solving problems, testing, deploying and managing highly scalable environments, this is the perfect opportunity for you.
In this role, you will be part of a global SRE team that works closely with our development and product teams to increase the quality and reliability of our products and services, and to deploy and manage Kubernetes clusters on IBM Cloud and other cloud platforms (AWS, Azure). As an SRE you must be willing to work in a fast-paced cloud environment, share rotational on-call duty coverage with the global Ops team and support the back-end cloud infrastructure components.

Your Role and Responsibilities
- Maintain highly available products and services on the cloud
- Identify issues, drive them to resolution and ensure minimal downtime
- Automate repetitive tasks using scripts and tools to reduce manual intervention
- Collaborate with development teams to roll out new services and ensure their stability and reliability
- Improve operational practices to ensure efficiency and innovation
- Share knowledge, ideas and solutions with the global team

Required Technical and Professional Expertise
5+ years of experience in a software development and delivery role
5+ years of experience in Cloud/DevOps engineering and/or Linux administration
Experience with at least one major public cloud provider or a large-scale private/hybrid cloud using container orchestration
Experience with a modern configuration management framework (Puppet, Ansible, Chef, etc.)
Production experience with one or more monitoring frameworks (Nagios, Prometheus, etc.)
Strong scripting skills in at least one language (BASH, Python, Ruby, etc.)
Experience with source control management tools (Git, Subversion, etc.)
Understanding of software development life cycle and delivery process
Ability to manage multiple projects while ensuring that commitments and timetables are met
Ability to partner with internal stakeholders to design operational solutions
Goal-oriented forward thinker who can provide solutions for complex technical problems
Production Kubernetes/OpenShift experience strongly preferred
Experience with change management workflows
Experience with the ELK/EFK stack (Elasticsearch, Logstash/Fluentd, and Kibana)
Experience with SQL and/or NoSQL datastores
Familiarity with application load balancing concepts (F5, ELB, etc.)

Preferred Technical and Professional Expertise
Working knowledge of Azure, Amazon Web Services, or IBM Cloud is an asset
Knowledge of HTTP/HTTPS, DNS, CDN technologies, and networking fundamentals
Experience in enterprise-related development and deployment (scalability, performance)
Experience building applications on cloud infrastructure
Experience working in an agile team, e.g., Kanban
Experience with the ELK/EFK stack (Elasticsearch, Logstash/Fluentd, and Kibana)

Site Reliability Engineer