Introduction
In this role, you'll work in one of our IBM Consulting Client Innovation Centers (Delivery Centers), where we deliver deep technical and industry expertise to a wide range of public and private sector clients around the world. Our delivery centers offer our clients locally based skills and technical expertise to drive innovation and adoption of new technology.
At IBM, work is more than a job - it's a calling: To build. To design. To code. To consult. To think along with clients and sell. To make markets. To invent. To collaborate. Not just to do something better, but to attempt things you've never thought possible. Are you ready to lead in this new era of technology and solve some of the world's most challenging problems? If so, let's talk.
Your Role and Responsibilities
The Site Reliability Engineer (SRE) is a critical role in cloud-based projects. An SRE works with the development squads to build platform and infrastructure management/provisioning automation and service monitoring, using the same methods used in software development to support application development. SREs create a bridge between development and operations by applying a software engineering mindset to system administration topics. They split their time between operations/on-call duties and developing systems and software that help increase site reliability and performance.
Required Technical and Professional Expertise
- 10+ years of experience in a senior SRE-related role. Deep understanding of AWS platform technologies and capabilities to support site reliability goals.
- Responsible for identifying points of failure and performance bottlenecks and providing feedback to the architecture teams. Identifies the tools best suited for integration into the CI/CD pipeline to measure performance, code quality, and code coverage. Defines the quality gates in the CI/CD pipeline by working with the application architects.
- Defines and implements the strategy for availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning. Identifies and implements methods for scaling the applications, as well as tools for logging, monitoring, alerting, and runbook automation for auto-remediation (self-healing).
- Works with the application and support teams during critical situations to identify the root cause of failures and helps fix them. Incorporates aspects of software engineering and applies them to IT operations problems.
- Applies aspects of software engineering to operations with the goal of creating software systems that are highly scalable and reliable.
Preferred Technical and Professional Expertise
- Smart Monitoring & Alerting for resilience, capacity & performance optimization
- Good understanding of applying infrastructure as code to control and automate build processes, Windows servers and infrastructure, and hybrid identity using Azure AD Connect for SSO integration
- Understanding of DevOps and CI/CD tools (such as Jenkins, Ansible, Packer, Docker)