# Distributed Training

Distributed training in PyTorch means training a model on multiple GPUs (or machines) in parallel. Each GPU runs its own process on a different slice of the data, and gradients are synchronized automatically, usually via DistributedDataParallel (DDP).

**When to use distributed training:**

1. Your **model or batch doesn't fit on one GPU**
2. You want **faster training** using multiple GPUs / nodes
3. You're training **large transformers / diffusion models / LLMs**

In PyTorch, the convention is **1 process per GPU**. Each process has its own copy of the model, works on different data, and syncs gradients with the other processes.

**PyTorch APIs:**

* `torch.nn.parallel.DistributedDataParallel` (DDP)
* `torch.distributed.fsdp` (FSDP)

**How DistributedDataParallel (DDP) works:**

1. Each GPU runs one process
2. Data is sharded across processes using `DistributedSampler`
3. Gradients are all-reduced (averaged across processes) automatically during the backward pass
4. The model stays fully replicated on every GPU

When the model does not fit in GPU memory (for example LLMs, ViTs, or diffusion models), move from DDP to FSDP (Fully Sharded Data Parallel), which shards parameters, gradients, and optimizer state across GPUs instead of replicating them.
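
The four DDP steps described above can be illustrated with a small plain-Python simulation. This is a conceptual sketch only and does not use the PyTorch API: the helper names (`shard_indices`, `local_gradient`, `all_reduce_mean`, `train`) are hypothetical, the "model" is a single scalar weight, and the ranks run in a loop rather than as separate processes. The strided, padded split mimics how `DistributedSampler` assigns indices when shuffling is off.

```python
import math


def shard_indices(dataset_len, world_size, rank):
    """Indices one rank sees: strided split, padded by wrapping around."""
    per_rank = math.ceil(dataset_len / world_size)
    total = per_rank * world_size
    indices = list(range(dataset_len))
    indices += indices[: total - dataset_len]  # pad so every rank gets per_rank items
    return indices[rank:total:world_size]


def local_gradient(w, xs, ys):
    """Gradient of the mean squared error of y_hat = w * x on one shard."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def all_reduce_mean(values):
    """Stand-in for an averaging all-reduce: every rank receives the mean."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


def train(world_size=2, steps=50, lr=0.01):
    xs = [1.0, 2.0, 3.0, 4.0]
    ys = [2.0, 4.0, 6.0, 8.0]          # generated with true weight w = 2
    weights = [0.0] * world_size       # one model replica per simulated rank
    for _ in range(steps):
        grads = []
        for rank in range(world_size):  # in real DDP these run as parallel processes
            idx = shard_indices(len(xs), world_size, rank)
            shard_x = [xs[i] for i in idx]
            shard_y = [ys[i] for i in idx]
            grads.append(local_gradient(weights[rank], shard_x, shard_y))
        synced = all_reduce_mean(grads)  # gradient synchronization step
        # Every replica applies the same averaged gradient, so replicas stay identical.
        weights = [w - lr * g for w, g in zip(weights, synced)]
    return weights


if __name__ == "__main__":
    print(shard_indices(5, 2, 0), shard_indices(5, 2, 1))
    print(train())
```

Because every replica starts from the same weight and applies the same averaged gradient, the replicas never diverge. In this sketch that is what "the model stays fully replicated" means; in real training, DDP and `DistributedSampler` handle the synchronization and sharding for you.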