
How Meta trains large language models at scale #106

@YeonwooSung


Notes on the Meta engineering blog post of the same name.

  • Meta requires massive computational power to train large language models (LLMs).
  • Traditionally, AI training involved a large number of models that each needed a relatively small number of GPUs.
  • With the advent of generative AI (GenAI), there are fewer jobs, but each one is extremely large.

Challenges of training large-scale models

  • Hardware reliability: Rigorous testing and quality control are needed to minimize training disruptions caused by hardware failures.
  • Fast recovery on failure: When hardware does fail, training must resume quickly, which means low rescheduling overhead and fast re-initialization.
  • Efficient preservation of training state: Training state must be saved and restored efficiently so that a failure loses as little work as possible (a minimal checkpointing sketch follows this list).
  • Optimal connectivity between GPUs: Data transfer between GPUs is critical at this scale, which requires a high-speed network fabric and efficient data transfer protocols.
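
A minimal sketch of the checkpointing idea behind "efficient preservation of training state", assuming a plain PyTorch model and optimizer. This is a generic illustration, not Meta's actual checkpointing system; the file path and function names are placeholders.

```python
import os
import torch

def save_checkpoint(model, optimizer, step, path="checkpoint.pt"):
    """Persist everything needed to resume training after a failure."""
    torch.save(
        {
            "step": step,
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
        },
        path,
    )

def load_checkpoint(model, optimizer, path="checkpoint.pt"):
    """Restore training state if a checkpoint exists; otherwise start from step 0."""
    if not os.path.exists(path):
        return 0
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"]
```

At large scale the same idea is applied with sharded, asynchronous writes so that checkpointing does not stall training, which this sketch does not attempt to show.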

Improving all layers of the infrastructure stack is critical

Training software

  • Enable researchers to move quickly from research to production using open-source frameworks such as PyTorch (a minimal distributed-training sketch follows this list).
  • Develop new algorithms and techniques for large-scale training, and integrate new software tools and frameworks.
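
A hedged sketch of what "research to production with PyTorch" can look like at small scale: the same training loop runs on one GPU or many once the model is wrapped in DistributedDataParallel. The model, dataset, and hyperparameters below are placeholders, not anything from the blog post.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK / WORLD_SIZE / LOCAL_RANK for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda()    # placeholder model
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(100):                       # placeholder training loop
        x = torch.randn(32, 1024, device="cuda")  # placeholder batch
        loss = model(x).pow(2).mean()
        loss.backward()                           # DDP all-reduces gradients here
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with something like `torchrun --nproc_per_node=8 train.py`. At Meta's scale this is layered with model sharding and custom infrastructure beyond plain DDP.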

Scheduling

  • Allocate and dynamically schedule resources based on the needs of each job, using sophisticated algorithms to optimize utilization (a toy placement example follows).
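
The post does not describe the scheduler's internals, so the following is only a toy illustration of the underlying idea: placing jobs onto hosts according to how many GPUs each job needs. Job and host names are invented.

```python
from dataclasses import dataclass, field

@dataclass
class Host:
    name: str
    free_gpus: int
    jobs: list = field(default_factory=list)

def schedule(jobs, hosts):
    """Greedy first-fit: place each job on the first host with enough free GPUs."""
    placements = {}
    for job_name, gpus_needed in sorted(jobs.items(), key=lambda kv: -kv[1]):
        for host in hosts:
            if host.free_gpus >= gpus_needed:
                host.free_gpus -= gpus_needed
                host.jobs.append(job_name)
                placements[job_name] = host.name
                break
        else:
            placements[job_name] = None  # no capacity: job waits in the queue
    return placements

hosts = [Host("node-a", 8), Host("node-b", 8)]
jobs = {"llm-pretrain": 8, "eval": 2, "ablation": 4}
print(schedule(jobs, hosts))
```

A production scheduler also accounts for topology, preemption, and failures, which this greedy sketch ignores.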

Hardware

  • High-performance hardware is required to handle large-scale model training.
  • Meta optimized its existing hardware and modified the Grand Teton platform with NVIDIA H100 GPUs, raising the GPU TDP to 700 W and moving to HBM3 memory (a small power-limit check follows this list).
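
To ground the 700 W figure, here is a small hedged check that reads each GPU's configured power limit via nvidia-smi's standard query fields. The 700 W threshold is just the number quoted above, not a recommendation.

```python
import subprocess

def gpu_power_limits():
    """Return the configured power limit (watts) for each visible GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,power.limit",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return {int(idx): float(limit)
            for idx, limit in (line.split(", ") for line in out.strip().splitlines())}

for idx, watts in gpu_power_limits().items():
    note = "" if watts >= 700 else "  <-- below the 700 W figure cited above"
    print(f"GPU {idx}: {watts:.0f} W{note}")
```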

Data Center Placement

  • Placed GPUs and supporting systems in the data center to make optimal use of resources such as power, cooling, and networking.
  • Deployed as many GPU racks as possible to maximize compute density.

Reliability

  • Detection and recovery plans are in place to minimize downtime when hardware fails (a simple detection sketch follows this list).
  • Common failure modes: GPUs not being recognized by the host, DRAM and SRAM uncorrectable errors (UCEs), and faulty network cables.
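
One listed failure mode is a GPU the host no longer recognizes. A hedged detection sketch: compare the number of GPUs that `nvidia-smi -L` reports against the number the node is supposed to have (assumed to be 8 here) and flag the node for remediation. This is an illustration, not Meta's health-checking system.

```python
import subprocess

EXPECTED_GPUS = 8  # assumed per-node GPU count

def visible_gpu_count():
    """Count GPUs via `nvidia-smi -L`, which prints one line per detected GPU."""
    try:
        out = subprocess.run(["nvidia-smi", "-L"],
                             capture_output=True, text=True, check=True).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        return 0  # driver or tool failure: treat as no GPUs visible
    return sum(1 for line in out.splitlines() if line.startswith("GPU "))

if visible_gpu_count() < EXPECTED_GPUS:
    print("node unhealthy: missing GPUs, drain and schedule for repair")
else:
    print("node healthy")
```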

Network

  • High-speed network infrastructure and efficient data transfer protocols are required for large-scale model training.
  • Meta built two network clusters, one based on RoCE (RDMA over Converged Ethernet) and one on InfiniBand, to learn from operational experience with both fabrics (a sketch of fabric-agnostic collective code follows).
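
A hedged sketch of why the two fabrics can be compared with the same training code: the PyTorch/NCCL collective call is identical, and fabric selection happens through NCCL environment variables. The interface and device names below are placeholders for whatever a given cluster exposes.

```python
import os
import torch
import torch.distributed as dist

# Placeholder NCCL tuning: which NIC/HCA is used depends on the cluster fabric.
# NCCL_IB_HCA selects the RDMA device(s) on InfiniBand or RoCE;
# NCCL_SOCKET_IFNAME selects the interface used for bootstrap traffic.
os.environ.setdefault("NCCL_IB_HCA", "mlx5")          # placeholder device prefix
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")   # placeholder interface

dist.init_process_group(backend="nccl")               # same code for both fabrics
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# A single all-reduce; the bandwidth it achieves is determined by the fabric.
t = torch.ones(64 * 1024 * 1024, device="cuda")       # 256 MB of float32
dist.all_reduce(t)
torch.cuda.synchronize()
print(f"rank {dist.get_rank()}: all-reduce done, first element = {t[0].item()}")

dist.destroy_process_group()
```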

Storage

  • Invested in high-capacity, high-speed storage technologies for large-scale data storage and developed new storage solutions for specific workloads (a sketch of parallel, sharded checkpoint writes follows).
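
A hedged sketch of one way heavy checkpoint traffic lands on storage: each rank writes only its own shard to a shared filesystem, so writes happen in parallel and aggregate bandwidth scales with the number of ranks. The mount point is a placeholder, the function assumes an already-initialized process group (e.g. via torchrun), and this is not Meta's storage stack.

```python
import os
import torch
import torch.distributed as dist

CHECKPOINT_DIR = "/mnt/shared/checkpoints/step_001000"  # placeholder shared mount

def save_sharded(state_dict):
    """Each rank writes its own shard; all ranks write concurrently."""
    rank = dist.get_rank()
    os.makedirs(CHECKPOINT_DIR, exist_ok=True)
    shard_path = os.path.join(CHECKPOINT_DIR, f"shard_rank{rank}.pt")
    torch.save(state_dict, shard_path)
    dist.barrier()  # the checkpoint is complete only once every rank has written
```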

Looking ahead

  • Meta expects to use hundreds of thousands of GPUs, process more data, and deal with longer distances and higher latencies.
  • It plans to adopt new hardware technologies and GPU architectures and to evolve its infrastructure accordingly.
  • It will continue exploring the evolving AI landscape and pushing the boundaries of what is possible.
