Why Work at Lenovo
We are Lenovo. We do what we say. We own what we do. We WOW our customers.
Lenovo is a US$57 billion revenue global technology powerhouse, ranked #248 in the Fortune Global 500, and serving millions of customers every day in 180 markets. Focused on a bold vision to deliver Smarter Technology for All, Lenovo has built on its success as the world's largest PC company with a full-stack portfolio of AI-enabled, AI-ready, and AI-optimized devices (PCs, workstations, smartphones, tablets), infrastructure (server, storage, edge, high performance computing and software defined infrastructure), software, solutions, and services. Lenovo's continued investment in world-changing innovation is building a more equitable, trustworthy, and smarter future for everyone, everywhere. Lenovo is listed on the Hong Kong stock exchange under Lenovo Group Limited (HKSE: 992) (ADR: LNVGY).
To find out more, visit www.lenovo.com and read the latest news via our StoryHub.
Description and Requirements
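Responsibilities:
1. Design a highly available fault-tolerant training system for large models, supporting pre-training of models at the hundred-billion-parameter scale
2. Optimize checkpointing for fault-tolerant large-model training, improving checkpoint read/write and recovery performance
3. Develop an elastic training framework for large models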
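Requirements:
1. Full-time master's degree or above in computer science and technology, artificial intelligence, or a related field;
2. Proficiency in C++/Python, data structures, and computer architecture, with experience in AI model performance tuning and strong engineering implementation skills;
3. Familiarity with common distributed training techniques in AI, including but not limited to data parallelism, pipeline parallelism, and tensor parallelism, with corresponding project experience;
4. Familiarity with at least one AI framework (PyTorch/TensorFlow/Paddle/DeepSpeed, etc.), and the ability to use and debug it proficiently;
5. Familiarity with GPU hardware architecture and CUDA computing principles, experience developing and debugging CUDA operators, and some knowledge of NCCL/cuDNN and similar libraries;
6. Good understanding of large-scale pre-trained models, including the architectures, training methods, and optimization techniques of common pre-trained models (such as GPT and BERT);
7. Excellent problem-solving ability and innovative thinking, with the ability to analyze and resolve complex training issues and propose improvements and optimizations;
8. Strong team spirit, with the ability to work closely with cross-departmental teams to drive projects to success.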
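Preferred qualifications:
1. Experience in large-model development and distributed training;
2. Familiarity with Kubernetes architecture and fault-tolerant systems for large-model training;
3. High-quality publications in the AI or HPC field.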
Additional Locations:
* China - Beijing - 北京(Beijing)