Introduction
At IBM, we are driven to shift our technology to an as-a-service model and to help our clients transform themselves to take full advantage of the cloud. With industry leadership in AI, analytics, security, commerce, and quantum computing and with unmatched hardware and software design and industrial research capabilities, no other company is as well positioned to address the full opportunity of enterprise cloud computing. We are looking for a lead SRE architect to join our IBM Cloud VPC Observability team. This team is dedicated to ensuring that IBM Cloud is at the forefront of reliable enterprise cloud technology. We are building Observability platforms to deliver performance, reliability and predictability for our customers' most demanding workloads, at global scale and with leadership efficiency, resiliency and security.
Your Role and Responsibilities
- Implement and administer infrastructure and solutions that support the IBM Cloud VPC.
- Support the compliance and security integrity of the environment through your work
- Partner with other teams, functional managers and program managers to deliver mission-critical services to the market
- Support development of new capabilities and enhancement of existing ones for our compute, storage and network services
- Adopt and build on automation solutions governed by SRE principles, including CI/CD pipelines, configuration management, immutable infrastructure deployment, auto-healing systems, etc.
- Provide technical escalation support for other Infrastructure Operations teams
- Conceptualize, design, implement and manage reliable, highly performant, scalable automation solutions that build consistency across our infrastructure
- Work with and adopt open source technologies as well as participate in new IBM innovations across IaaS
- Bring a self-driven attitude to proposing, testing and implementing solutions and improvements for review and consideration by your peers
Required Technical and Professional Expertise
- 5+ years of experience in data center infrastructure or relevant work experience
- 5+ years of experience in large-scale infrastructure design, engineering, and support
- 5+ years of experience in IT change, incident, problem and asset management
- 5+ years of infrastructure engineering with a proven record of delivering high-quality, large-scale solutions. Experience designing architectures for scale and performance
- 5+ years of practical experience with one or more operating systems: Ubuntu (preferred), CentOS, RHEL or Debian Linux, and Windows Server.
- 5+ years of experience debugging issues across a Linux environment with network, storage, compute and orchestration components. Does not need to be code debugging.
- Development experience with one or more programming languages: PowerShell, Python (preferred), or Ruby
- 2+ years of practical experience with orchestration that uses desired state models and/or finite state machine models of orchestration: Kubernetes (preferred), OpenShift, etc.
- 5+ years of practical experience with containerization and container orchestration: Docker (preferred), Kubernetes (preferred), OpenShift, Rancher, Docker Swarm, Docker Compose
- 5+ years of experience with monitoring technologies: Sysdig (preferred), Grafana, Nagios, Zenoss, ELK, Splunk, Zabbix, etc.
- Familiarity with OpenTelemetry concepts, tracing, metrics, events and other observability principles
- 2+ years of experience with one or more virtualization technologies: Citrix Xen Hypervisor (preferred), KVM (also preferred), libvirt, QEMU, VMware vSphere, etc.
- 5+ years of experience with one or more automation and configuration management tools/solutions: Ansible & Terraform (preferred), Chef, Python, Bash, Puppet, Rundeck, etc.
- 2+ years of experience with version control systems: GitHub (preferred), GitLab, Subversion, etc.
- Basic experience with databases, both RDBMS like MySQL or PostgreSQL, as well as other data stores such as etcd, TimescaleDB, InnoDB, etc. Not a DBA role.
- Working knowledge of network and storage technologies
- Working knowledge of ServiceNow, JIRA, Confluence, and GitHub
- ITIL Foundation V4 certification is a plus
Preferred Technical and Professional Expertise
- Excellent verbal and written communication skills
- Highly responsible, motivated, able to work with little direction
- Experience with design and development of complex systems
- Ability to troubleshoot complex problems and customer issues
- Working knowledge of Linux clustering, HA, and fault-tolerant system implementations: active/active, active/passive, Pacemaker, Keepalived, HAProxy, Corosync, LVM
- 2+ years of experience with complex systems and layered architecture models: OSI, Kubernetes, virtualization, TCP/IP, etc.
- Working knowledge of TCP/IP, BGP, sockets, routing protocols, routes and Keepalived, and of the role they play in debugging and in highly available systems at scale.
- Ability to debug an issue across the entire OSI stack of a typical Linux environment across storage, network, compute, OS, system tuning, orchestration.
- Ability to follow stack traces to particular libraries in code and to identify root causes.
- Working knowledge of a message bus and message queues: Kafka (preferred), Spark, RabbitMQ, Redis, etc.
- Extensive experience with databases and debugging their usage with application stacks
- Experience with and understanding of the interaction and dependencies of a typical three tier model of application stacks, as well as cloud