Introduction
A career in IBM Software means you'll be part of a team that transforms our customers' challenges into solutions.
Seeking new possibilities and always staying curious, we are a team dedicated to creating the world's leading AI-powered, cloud-native software solutions for our customers. Our renowned legacy creates endless global opportunities for our IBMers, so the door is always open for those who want to grow their career.
IBM's product and technology landscape includes Research, Software, and Infrastructure. Entering this domain positions you at the heart of IBM, where growth and innovation thrive.

Your Role and Responsibilities
As an Architect for Site Reliability Engineering, your focus is to ensure that the designed solution meets non-functional requirements such as reliability, availability, performance, security, and maintainability. You will work closely with the development team and with the related release and extended support teams.

You will bring a strong engineering focus to operations, applying your leadership to identify methods for preventing incidents and for improving observability, automation frameworks, self-service infrastructure, logging and metrics, and operational reporting.
You will be expected to use tools for logging, monitoring, event management, notification, runbook automation, ChatOps, and root cause analysis.
You will work with automation engineers, QA engineers, and the development team to ensure seamless delivery of our service offerings.
You will build sufficient expertise in the IBM Cloud control plane to create automated monitoring processes.

In this role, you will lead the problem resolution process for our clients, from analysis and troubleshooting to deploying the latest software updates and fixes.

Your primary responsibilities include:

24x7 Observability: Be part of a worldwide team that monitors the health of production systems and services around the clock, ensuring continuous reliability and optimal customer experience.
Cross-Functional Troubleshooting: Collaborate with engineering teams to provide initial assessments and possible workarounds for production issues. Troubleshoot and resolve production issues effectively.
Deployment and Configuration: Leverage Continuous Integration and Continuous Delivery (CI/CD) tools to deploy services and configuration changes at enterprise scale.
Security and Compliance Implementation: Implement security measures that meet or exceed industry standards and regulations such as GDPR, SOC2, ISO 27001, PCI, HIPAA, and FBA.
Maintenance and Support
Keeping your assigned site or service up and running, or restoring it quickly when a failure occurs
Working closely with internal partners and teams to ensure that our infrastructure meets security, SLA, and performance requirements
Writing, updating, and using documentation, including runbooks/playbooks
Automating work including infrastructure needs, testing, failover solutions, failure mitigation, and much more
Debugging complex problems across an entire stack and creating solid solutions
Developing CI/CD processes to improve cadence
Persistently testing application and infrastructure resiliency under a variety of error conditions
Partnering with security engineers and developing plans and automation to aggressively and safely respond to new risks and vulnerabilities.
Develop, communicate, and monitor standard processes to promote the long-term health and sustainability of operational development tasks
Stand up and maintain pre-production and developer environments to support the entire development organization and improve overall team velocity
Use metrics and analytics to determine reliability issues and remove them through automation and tooling
Be an advocate for our customers, providing them with self-diagnostic tools to resolve common issues that arise in the field

Required Technical and Professional Expertise

10+ years of SRE/Level 3 support experience
A solid understanding of Cloud infrastructure/operations
Expertise in Linux internals
Experience debugging complex problems
Experience designing, building, and operating large-scale production systems
Expertise in Ansible, Bash, and core Python development
Strong familiarity with at least one of C, C++, Go, Python, or Java
Experience with container technologies such as Docker and Kubernetes
Experience with standard industry tools for monitoring and observability
Experience automating infrastructure, configuration management, testing, and deployments using tools like Ansible and Chef, and the ability to explain the Infrastructure as Code paradigm
A strong understanding of diverse infrastructure platforms and infrastructure concepts
Hands-on experience using source control and feature branching strategies
An understanding of networking and messaging, especially between services
Solid experience in infrastructure operations automation and IT service management, with hands-on exposure to data center administration, configuration, incident management, and support
Strong communication skills

Preferred Technical and Professional Expertise

IBM Cloud API knowledge
Behavior Driven Development
Experience in Software Development Life Cycle, Test Driven Development, Continuous Integration and Continuous Delivery
Familiarity with cloud deployment tooling such as Razee and LaunchDarkly

SRE Architect
