Introduction
A career in IBM Software means you'll be part of a team that transforms our customers' challenges into solutions.
Seeking new possibilities and always staying curious, we are a team dedicated to creating the world's leading AI-powered, cloud-native software solutions for our customers. Our renowned legacy creates endless global opportunities for our IBMers, so the door is always open for those who want to grow their career.
IBM's product and technology landscape includes Research, Software, and Infrastructure. Entering this domain positions you at the heart of IBM, where growth and innovation thrive.
Your Role and Responsibilities
As an Architect for Site Reliability Engineering, your focus is to ensure that the designed solution meets non-functional requirements such as reliability, availability, performance, security, and maintainability. You will work closely with the development team and other related Release and extended support teams.
- You will bring a strong engineering focus to operations, using your leadership to identify methods for preventing incidents and for improving observability, automation frameworks, self-service infrastructure, logging and metrics, and operational reports.
- You will be expected to use tools for logging, monitoring, event management, notification, Runbook Automation, ChatOps, and Root Cause Analysis.
- You will work with Automation Engineers, QA Engineers, and the development team to ensure seamless delivery of our service offerings.
- You will build sufficient expertise in the IBM Cloud control plane to create automated monitoring processes.
Your primary responsibilities include:
- 24x7 Observability: Be part of a worldwide team that monitors the health of production systems and services around the clock, ensuring continuous reliability and optimal customer experience.
- Cross-Functional Troubleshooting: Collaborate with engineering teams to provide initial assessments and possible workarounds for production issues. Troubleshoot and resolve production issues effectively.
- Deployment and Configuration: Leverage Continuous Integration and Continuous Delivery (CI/CD) tools to deploy services and configuration changes at enterprise scale.
- Security and Compliance Implementation: Implement security measures that meet or exceed industry standards and regulations such as GDPR, SOC2, ISO 27001, PCI, HIPAA, and FBA.
- Maintenance and Support
- Keeping your assigned site or service up and running or getting it back up and running quickly when failure occurs
- Working closely with internal partners and teams to ensure that our infrastructure meets security, SLA, and performance requirements
- Writing, updating, and using documentation, including runbooks/playbooks
- Automating work including infrastructure needs, testing, failover solutions, failure mitigation, and much more
- Debugging complex problems across an entire stack and creating solid solutions
- Developing CI/CD processes to improve cadence
- Persistently testing application and infrastructure resiliency under a variety of error conditions
- Partnering with security engineers and developing plans and automation to aggressively and safely respond to new risks and vulnerabilities.
- Develop, communicate, and monitor standard processes to promote the long-term health and sustainability of operational development tasks
- Stand up and maintain pre-production and developer environments to support the entire development organization and improve overall team velocity
- Use metrics and analytics to determine reliability issues and remove them through automation and tooling
- Be an advocate for our customers, providing them self-diagnosing tools to resolve common issues that arise in the field
Required Technical and Professional Expertise
- 10+ years of SRE/Level 3 support experience
- A solid understanding of Cloud infrastructure/operations
- Expertise in Linux internals
- Experience debugging complex problems
- Experience designing, building, and operating large-scale production systems
- Expertise in Ansible, Bash, and core Python development
- Strong familiarity with one of C, C++, Go, Python, or Java
- Experience with container technologies such as Docker and Kubernetes
- Experience with standard industry tools for monitoring and observability
- Experience automating infrastructure, configuration management, testing, and deployments using tools like Ansible and Chef, and the ability to explain the Infrastructure as Code paradigm
- A strong understanding of diverse infrastructure platforms and infrastructure concepts
- Hands-on experience using source control and feature branching strategies
- An understanding of networking and messaging, especially between services
- Solid experience in infrastructure operations automation and IT Service Management, with hands-on exposure to data center administration, configuration, incident management, and support
- Strong communication skills
Preferred Technical and Professional Expertise
- IBM Cloud API knowledge
- Behavior Driven Development
- Experience with the Software Development Life Cycle, Test Driven Development, Continuous Integration, and Continuous Delivery
- Familiarity with cloud deployment tooling such as Razee and LaunchDarkly