Large Model Distributed Training Researcher

AT Lenovo

Beijing, China

Why Work at Lenovo

We are Lenovo. We do what we say. We own what we do. We WOW our customers.

Lenovo is a US$57 billion revenue global technology powerhouse, ranked #248 in the Fortune Global 500, and serving millions of customers every day in 180 markets. Focused on a bold vision to deliver Smarter Technology for All, Lenovo has built on its success as the world's largest PC company with a full-stack portfolio of AI-enabled, AI-ready, and AI-optimized devices (PCs, workstations, smartphones, tablets), infrastructure (server, storage, edge, high performance computing and software defined infrastructure), software, solutions, and services. Lenovo's continued investment in world-changing innovation is building a more equitable, trustworthy, and smarter future for everyone, everywhere. Lenovo is listed on the Hong Kong stock exchange under Lenovo Group Limited (HKSE: 992) (ADR: LNVGY).

This transformation, together with Lenovo's world-changing innovation, is building a more inclusive, trustworthy, and smarter future for everyone, everywhere. To find out more, visit www.lenovo.com and read the latest news via our StoryHub.

Description and Requirements

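Responsibilities:

1. Design and develop the architecture of distributed training systems for large deep learning models, optimizing training efficiency and resource utilization;

2. Research and implement parallel training algorithms for large-scale deep learning models on GPUs and other high-performance computing platforms;

3. Deeply customize and extend existing deep learning and distributed training frameworks (such as PyTorch, TensorFlow, DeepSpeed, Colossal-AI, and Megatron) to meet the needs of large-scale model training;

4. Work closely with the algorithm team to resolve performance bottlenecks that arise when training models on very large datasets;

5. Design and implement a model training monitoring system covering, among other things, training progress, resource usage, and visualization of training results;

6. Continuously track the latest developments in distributed training technology and apply cutting-edge research results to real projects.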

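Qualifications:

1. Master's degree or above in computer science or a related field, with at least 3 years of work experience in deep learning; experience at a large internet company or AI lab is preferred;

2. Proficiency in at least one deep learning framework and distributed training framework (such as PyTorch or TensorFlow), with extensive model development and training experience;

3. Mastery of distributed systems principles and familiarity with common distributed computing frameworks (such as MPI, DeepSpeed, Colossal-AI, and OneFlow), with experience in large-scale parallel computing and in developing distributed training systems;

4. Solid algorithmic foundation, with in-depth understanding of and hands-on experience in optimizing deep learning model training, including but not limited to gradient compression, communication optimization, and asynchronous training;

5. Theoretical and practical experience in distributed training of large models, and familiarity with mainstream foundation models in China and abroad;

6. Excellent analytical and problem-solving skills, with the ability to independently diagnose and resolve complex problems;

7. Working knowledge of computer architecture, operating systems, network programming, and related areas;

8. Strong English reading and writing skills, with the ability to quickly read English papers and technical documentation and to follow the latest international research and technology trends.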

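Preferred qualifications:

1. Papers on distributed training or deep learning published at top conferences or in top journals (such as NeurIPS, ICML, ICLR, or JMLR);

2. Participation in open-source distributed training projects, with significant contributions.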

Additional Locations:
* China

Client-provided location(s): Beijing, China; China
Job ID: Lenovo-100015311
Employment Type: Other