
AI/HPC Systems Production Engineer

At Meta

London, United Kingdom

Meta's AI Training and Inference Infrastructure is growing exponentially to support the ever-increasing use cases of AI. We need to build and evolve the network infrastructure that connects vast numbers of training accelerators, such as GPUs. In addition, we need to ensure that the network runs smoothly and meets the stringent performance, availability and reliability requirements of RDMA workloads, which expect a lossless fabric interconnect. To improve the performance of these systems, we constantly look for opportunities across the stack: network fabric and host networking, communication libraries and scheduling infrastructure.

AI/HPC Systems Production Engineer Responsibilities:

  • Responsible for the overall reliability of the communication system, including monitoring, troubleshooting and proactive identification of production issues.
  • Develop, extend and maintain CI/CD and testing pipelines for host components of the training stack infrastructure, e.g. collective communication libraries (NCCL, RCCL) and RDMA host stack dependencies.
  • Act as an active member of a multi-disciplinary team developing solutions for large-scale training systems. Work with performance engineers to ensure safe and robust rollout of new features.
Minimum Qualifications:

  • BS/MS/PhD in a relevant field (EE, CS), with 4+ years of work experience.
  • Python, C/C++ coding skills
  • Knowledge of Linux and foundational networking principles
Preferred Qualifications:
  • Experience working with up-to-date AI training workload packaging, CI/CD and distribution processes, containerization principles.
  • Understanding of RDMA network stack principles and pain points on InfiniBand and RoCE networks. Experience developing systems and applications that use RDMA technologies. Experience using communication libraries such as MPI and the NVIDIA Collective Communication Library (NCCL).
  • Experience with GPU accelerator development frameworks, for example CUDA, OpenCL
  • Experience in developing and troubleshooting system level software
About Meta:

Meta builds technologies that help people connect, find communities, and grow businesses. When Facebook launched in 2004, it changed the way people connect. Apps like Messenger, Instagram and WhatsApp further empowered billions around the world. Now, Meta is moving beyond 2D screens toward immersive experiences like augmented and virtual reality to help build the next evolution in social technology. People who choose to build their careers by building with us at Meta help shape a future that will take us beyond what digital connection makes possible today: beyond the constraints of screens, the limits of distance, and even the rules of physics.

Individual compensation is determined by skills, qualifications, experience, and location. Compensation details listed in this posting reflect the base hourly rate, monthly rate, or annual salary only, and do not include bonus, equity or sales incentives, if applicable. In addition to base compensation, Meta offers benefits. Learn more about benefits at Meta.

Client-provided location(s): London, UK
Job ID: a1KDp00000E2TA7MAN
Employment Type: Other

Perks and Benefits

  • Health and Wellness

    • Health Insurance
    • Health Reimbursement Account
    • Dental Insurance
    • Vision Insurance
    • Life Insurance
    • Short-Term Disability
    • Long-Term Disability
    • FSA
    • FSA With Employer Contribution
    • HSA
    • HSA With Employer Contribution
    • Fitness Subsidies
    • On-Site Gym
    • Mental Health Benefits
  • Parental Benefits

    • Birth Parent or Maternity Leave
    • Non-Birth Parent or Paternity Leave
    • Fertility Benefits
    • Adoption Assistance Program
    • Family Support Resources
  • Work Flexibility

    • Flexible Work Hours
    • Remote Work Opportunities
    • Hybrid Work Opportunities
  • Office Life and Perks

    • Commuter Benefits Program
    • Casual Dress
    • Happy Hours
    • Snacks
    • Some Meals Provided
    • Company Outings
    • On-Site Cafeteria
    • Holiday Events
  • Vacation and Time Off

    • Paid Vacation
    • Unlimited Paid Time Off
    • Paid Holidays
    • Personal/Sick Days
    • Sabbatical
    • Leave of Absence
  • Financial and Retirement

    • 401(K)
    • 401(K) With Company Matching
    • Pension
    • Company Equity
    • Performance Bonus
    • Relocation Assistance
    • Financial Counseling
  • Professional Development

    • Learning and Development Stipend
    • Promote From Within
    • Mentor Program
    • Shadowing Opportunities
    • Access to Online Courses
    • Lunch and Learns
    • Internship Program
  • Diversity and Inclusion

    • Employee Resource Groups (ERG)
