NVIDIA seeks a data center infrastructure optimization and resiliency team manager to join its infrastructure specialist team. Academic and commercial groups worldwide use NVIDIA products to redefine deep learning and data analytics and to power data centers. Join the team building many of the world's largest and fastest data centers!
NVIDIA is looking for someone who can lead a customer team responsible for production AI infrastructure and workflow optimization. The work includes solving complex, customer-focused operational optimization problems; planning, facilitating, and executing continuous improvement events using NVIDIA telemetry tools; and interfacing with company stakeholders and management, which requires excellent interpersonal skills. This role involves interacting with customers, partners, and internal teams to analyze, define, and implement large-scale data center infrastructure optimization. It calls for hands-on leadership experience with data center systems, networks, cloud operation and orchestration, AI workload resiliency, and performance optimization, together with a commitment to continual and efficient planning, operation, validation services, and team performance.
What you will be doing:
- Manage regional, customer-dedicated teams focused on optimizing customer infrastructure and enhancing resiliency.
- Lead a team that inspects and observes infrastructure and AI workloads to ensure system health and performance.
- Establish and refine optimization workflows, collaborate with customers and analytics partners, and analyze results to improve AI workload production processes.
- Offer technical guidance and oversight for systems and networking activities.
- Ensure that all tasks are completed to a high standard, without negative impact on internal or external users or on business operations.
- Plan and implement ongoing improvements to deployment methodologies for greater effectiveness and efficiency.
- Ensure all operational KPIs and metrics are tracked and met while maintaining a strong commitment to service quality and user experience.
- Supervise, evaluate, mentor, and coach team members, fostering a culture of continuous learning and professional growth.
What we need to see:
- At least 5 years of team management experience.
- Ten years of demonstrable service operations management experience in enterprise-level data centers, with continual infrastructure and service improvement.
- Certification related to data centers, servers, and networks (preferred).
- Bachelor's degree or equivalent experience.
- Fluency in both Japanese and English.
- In-depth practical knowledge and experience of data center environments, servers, network equipment, operations, and services.
- Extensive experience in installing, monitoring, and maintaining data center equipment.
- Analytical attitude and problem solving - able to analyze information, problems, situations, practices, and procedures; collect and interpret data; reason logically; establish facts; identify and define existing and potential issues; recognize the interrelationships among elements; draw valid conclusions; develop recommendations and alternative courses of action; select the appropriate course; and follow up and evaluate the outcome.
- Exceptional ability to work as part of a team, provide IT support, and resolve errors.
- Organization and time management - able to plan, schedule, and organize job-related tasks to achieve goals within or ahead of established time frames.
- Willingness to travel (25%).
Ways to stand out from the crowd:
- Experience in data center operations processes and in safety and security measures.
- Knowledge of data center infrastructure.
- Outstanding social skills.