NVIDIA's Infrastructure, Planning and Processes (IPP) organization is seeking a hard-working and experienced Site Reliability/DevOps Engineer with a strong background in Infrastructure Management, Monitoring, Automation, & System Administration to join our GVS Operations Team in Pune. The IPP Org provides Infrastructure, Products & Services for multiple software teams, including the GPU, Mobile, and Automotive divisions working on NVIDIA's extraordinary products & services.
The team is responsible for hosting, enabling & running the large-scale private cloud systems & services for our in-house Build & Test framework. The cloud hosts a heterogeneous mix of machines and devices with various operating systems (Windows, Linux, Android, etc.), running on NVIDIA GPUs and Tegra processors.
What you'll be doing:
- Create resilient, scalable, and efficient build and deployment pipelines.
- Design and implement complex automation platforms to identify & resolve operational inefficiencies.
- Triage software, hardware, and infrastructure issues and maintain high availability for our infrastructure & services.
- Deploy & monitor critical high-performance, large-scale services running on geo-distributed systems.
- Continuously strive for efficient utilization & management of the infrastructure.
- Automate processes that enable developers to adopt self-service practices, while ensuring compliance with security standards.
- Work with architects and engineers across the teams to review the designs & solutions during development and deployment phases.
- Collaborate with our other engineering teams to deliver reliable, robust, and high-performance underlying infrastructure.
- Mine & analyze data from multiple sources to identify scaling & optimization opportunities.
What we need to see:
- Bachelor's or Master's degree in Computer Science, Software Engineering, or equivalent experience, with 8+ years of experience in a DevOps environment.
- Strong hands-on experience in configuring, maintaining, and building upon deployments of industry-standard tools (e.g. Kubernetes, Jenkins, Docker, CMake, GitLab, Jira, etc.).
- Working experience in monitoring & maintaining large-scale infrastructure applications running in a microservice-based architecture.
- Proficiency in virtualization architecture, with strong experience in Kubernetes, VMs, & Docker.
- Experience with CI/CD and infrastructure-as-code tools and practices such as GitLab, GitOps, Jenkins, Terraform, etc.
- Experience setting up data pipelines with Kafka and Filebeat, and visualizing data with tools like Elasticsearch, Grafana, Kibana, and Tableau.
- Strong Python scripting skills, with a proven background in using and writing JSON/REST APIs.
- Fluency in writing queries for MySQL or equivalent SQL or NoSQL databases.
- Solid understanding of configuration management tools like Chef, Puppet, Ansible, etc.
- Working experience with Perforce, Git, or any other version control system.
- Experience with telemetry and alerting systems such as Kibana, Elasticsearch, Zabbix, Grafana, and Prometheus to create rich visualizations of system health over time.
- Ability to self-manage, show leadership, mentor others and communicate well.
Ways to stand out from the crowd:
- Understanding of networking concepts like TCP/IP and firewall management.
- Exposure to web apps/dashboards on frameworks like Django, AngularJS, VueJS, etc.
- High-level understanding of Build and Test systems.
- Experience in building regression detection systems by analyzing real-time production data, emphasizing important metrics.
- Innovating with industry-standard tools and collaborating with the open source community.
- Outstanding interpersonal and communication skills.