Responsibilities
About the role
As a Site Reliability Engineer (Cloud), you will join and reinforce the TCM Kondor modernization and cloud enablement team, whose primary objective is to act as a central team that accelerates the Cloud Transformation journey across our core systems.
We are looking for a curious and enthusiastic Site Reliability Engineer to join our team and design, implement, optimize, observe, and maintain our organization's cloud-based systems.
A Site Reliability Engineer's responsibilities include designing, deploying, and debugging systems, as well as executing new cloud initiatives.
Ultimately, you will work with different IT professionals and teams to ensure our cloud computing systems meet the needs of our organization and customers.
Objectives of this Role
- Work in tandem with our engineering team to identify and implement optimal cloud-based solutions for the company.
- Define and document best practices and strategies regarding application deployment and infrastructure maintenance.
- Provide guidance, thought leadership, and mentorship to development teams to build cloud competencies.
- Ensure application performance, uptime, and scale, maintaining high standards of code quality and thoughtful design.
- Manage cloud environments in accordance with company security guidelines.
- Stay current with industry trends, making recommendations as needed to help the organization innovate and excel.
Responsibilities
- Develop, deploy and maintain infrastructure on Azure using Docker and Kubernetes.
- Implement automation tools and frameworks (CI/CD pipelines).
- Collaborate with team members to improve the company's engineering tools, systems and procedures, and data security.
- Optimize the company's computing architecture.
- Conduct systems tests for security, performance, and availability.
- Develop and maintain design and troubleshooting documentation.
- Collaborate with the engineering teams to enable their applications to run on Cloud infrastructure.
- Debug technical issues inside a complex stack involving virtualization, containers, microservices, etc.
- Troubleshoot incidents, identify root cause, fix and document problems, and implement preventive measures.
- Employ exceptional problem-solving skills, with the ability to see and solve issues before they snowball into problems.
Requirements
- Bachelor's degree in computer science, information technology, or mathematics.
- 5+ years of proven experience as a Site Reliability Engineer or in a similar role in software development and system administration.
- Experience with Docker for containerization and application deployment.
- Experience with Kubernetes and Helm for orchestration of Docker containers.
- Experience with Azure cloud services and understanding of their offerings and architecture.
- Knowledge of databases and operating systems.
- Ability to troubleshoot complex software and hardware issues.
- Knowledge of best practices related to data encryption and cybersecurity.
- Excellent problem-solving and communication skills.
- Experience in network, server, and application-status monitoring.
- Operating systems - any Linux/Unix flavor
- Monitoring - Prometheus, Grafana
Nice to Have
- Relevant certifications such as Certified Kubernetes Administrator (CKA), Certified Kubernetes Application Developer (CKAD), or Azure Certifications (AZ-104, AZ-204, AZ-400, etc.).
- Experience with other cloud platforms like AWS or GCP.
- CI/CD - Jenkins (Groovy)
- Exposure to Azure Pipelines
- Knowledge of Git version control
- Scripting