fix(deps): update dependency pytorch-lightning to v2 [security] #33
Open
renovate wants to merge 1 commit into main from renovate/pypi-pytorch-lightning-vulnerability
Conversation
This PR contains the following updates:
pytorch-lightning: `^1.6.0` -> `^2.0.0`

GitHub Vulnerability Alerts
CVE-2024-8019
In lightning-ai/pytorch-lightning version 2.3.2, a vulnerability exists in the `LightningApp` when running on a Windows host. The vulnerability occurs at the `/api/v1/upload_file/` endpoint, allowing an attacker to write or overwrite arbitrary files by providing a crafted filename. This can lead to potential remote code execution (RCE) by overwriting critical files or placing malicious files in sensitive locations.

Release Notes
Lightning-AI/lightning (pytorch-lightning)
v2.4.0: Lightning v2.4 (Compare Source)
Lightning AI ⚡ is excited to announce the release of Lightning 2.4. This is mainly a compatibility upgrade for PyTorch 2.4 and Python 3.12, with a sprinkle of a few features and bug fixes.
Did you know? The Lightning philosophy extends beyond a boilerplate-free deep learning framework: We've been hard at work bringing you Lightning Studio. Code together, prototype, train, deploy, host AI web apps. All from your browser, with zero setup.
Changes
PyTorch Lightning
Added
- Added `dump_stats` flag to `AdvancedProfiler` (#19703)
- Added a `verbose` flag to the `seed_everything()` function (#20108)
- The `TQDMProgressBar` now provides an option to retain prior training epoch bars (#19578)
- Added the count of modules in train and eval mode to the `ModelSummary` table (#20159)

Changed
- Triggering `KeyboardInterrupt` (Ctrl+C) during `.fit()`, `.evaluate()`, `.test()` or `.predict()` now terminates all processes launched by the Trainer and exits the program (#19976)
- Changed how seeds are chosen for dataloader workers when using `seed_everything(..., workers=True)` (#20055)

Removed

Fixed
- Avoid `LightningCLI` saving hyperparameters with `class_path` and `init_args` since this would be a breaking change (#20068)
- Fixed an issue with `seed_everything()` (#20108)
- Fixed `_LoggerConnector`'s `_ResultMetric` to move all registered keys to the device of the logged value if needed (#19814)
- Fixed `_optimizer_to_device` logic for special 'step' key in optimizer state causing performance regression (#20019)
- Fixed the parameter count in `ModelSummary` when the model has distributed parameters (DTensor) (#20163)

Lightning Fabric
Added
- Added a `verbose` flag to the `seed_everything()` function (#20108)

Changed
- Changed how seeds are chosen for dataloader workers when using `seed_everything(..., workers=True)` (#20055)

Removed

Fixed
- Fixed an attribute error when loading a checkpoint using the `_lazy_load()` function (#20121)
- Fixed `_optimizer_to_device` logic for special 'step' key in optimizer state causing performance regression (#20019)

Full commit list: 2.3.0 -> 2.4.0
Contributors
We thank all our contributors who submitted pull requests for features, bug fixes and documentation updates.
New Contributors
Did you know?
Chuck Norris can solve NP-hard problems in polynomial time. In fact, any problem is easy when Chuck Norris solves it.
v2.3.3: Patch release v2.3.3 (Compare Source)
This release removes the code from the main `lightning` package that was reported in CVE-2024-5980.

v2.3.2: Patch release v2.3.2 (Compare Source)
Includes a minor bugfix that avoids a conflict between the entrypoint command and another package (#20041).
v2.3.1: Patch release v2.3.1 (Compare Source)
Includes minor bugfixes and stability improvements.
Full Changelog: Lightning-AI/pytorch-lightning@2.3.0...2.3.1
v2.3.0: Lightning v2.3: Tensor Parallelism and 2D Parallelism (Compare Source)
Lightning AI is excited to announce the release of Lightning 2.3 ⚡
Did you know? The Lightning philosophy extends beyond a boilerplate-free deep learning framework: We've been hard at work bringing you Lightning Studio. Code together, prototype, train, deploy, host AI web apps. All from your browser, with zero setup.
This release introduces experimental support for Tensor Parallelism and 2D Parallelism, PyTorch 2.3 support, and several bugfixes and stability improvements.
Highlights
Tensor Parallelism (beta)
Tensor parallelism (TP) is a technique that splits up the computation of selected layers across GPUs to save memory and speed up distributed models. To enable TP as well as other forms of parallelism, we introduce a `ModelParallelStrategy` for both the Lightning Trainer and Fabric. Under the hood, TP is enabled through new experimental PyTorch APIs like DTensor and `torch.distributed.tensor.parallel`.

PyTorch Lightning
Enabling TP in a model with PyTorch Lightning requires you to implement the `LightningModule.configure_model()` method, where you convert selected layers of the model into parallelized layers. This is an advanced feature, because it requires a deep understanding of the model architecture. Open the tutorial Studio to learn the basics of Tensor Parallelism. Full training example (requires at least 2 GPUs).
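The sketch below is illustrative only, not the code from the tutorial Studio: the toy two-layer model is made up, and the `device_mesh` attribute and `"tensor_parallel"` mesh dimension name follow the Lightning 2.3 documentation, so verify them against your installed version. The parallelization itself uses the public PyTorch DTensor APIs.

```python
import torch
import lightning.pytorch as L
from lightning.pytorch.strategies import ModelParallelStrategy
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)

class LitModel(L.LightningModule):
    def __init__(self):
        super().__init__()
        # Toy model; a real use case would parallelize large transformer blocks
        self.w1 = torch.nn.Linear(128, 256)
        self.w2 = torch.nn.Linear(256, 128)

    def configure_model(self):
        # Called by the Trainer once the device mesh is available; convert the
        # selected layers into tensor-parallel (sharded) layers here.
        tp_mesh = self.device_mesh["tensor_parallel"]  # attribute per the 2.3 TP docs
        plan = {"w1": ColwiseParallel(), "w2": RowwiseParallel()}
        parallelize_module(self, tp_mesh, plan)

trainer = L.Trainer(accelerator="gpu", devices=2, strategy=ModelParallelStrategy())
```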
Lightning Fabric
Applying TP in a model with Fabric requires you to implement a special function where you convert selected layers of the model into parallelized layers. This is an advanced feature, because it requires a deep understanding of the model architecture. Open the tutorial Studio to learn the basics of Tensor Parallelism.
Full training example (requires at least 2 GPUs).
2D Parallelism (beta)
Tensor Parallelism by itself can be very effective for efficient inference of very large models. For training, TP is typically combined with other forms of parallelism, such as FSDP, to increase throughput and scalability on large clusters with 100s of GPUs. The new `ModelParallelStrategy` in this release supports the combination of TP + FSDP, which is referred to as 2D parallelism. For an introduction to this feature, please also refer to the tutorial Studios (PyTorch Lightning, Lightning Fabric). At the moment, the PyTorch team is reimplementing FSDP under the name FSDP2 with the aim to make it compose well with other parallelisms such as TP. Therefore, for the experimental 2D parallelism support, you'll need to switch to using FSDP2 with the new `ModelParallelStrategy`. Please refer to our docs (PyTorch Lightning, Lightning Fabric) and stay tuned for future releases as these APIs mature.
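As a rough configuration sketch only: combining TP with FSDP2 is set up through the same strategy class. The `data_parallel_size` and `tensor_parallel_size` argument names below are taken from the 2.3 documentation and are an assumption to verify against your installed version.

```python
import lightning.pytorch as L
from lightning.pytorch.strategies import ModelParallelStrategy

# Shard each tensor-parallel group across 2 GPUs and apply FSDP2 across
# 2 data-parallel groups, i.e. 4 GPUs in total.
strategy = ModelParallelStrategy(data_parallel_size=2, tensor_parallel_size=2)
trainer = L.Trainer(accelerator="gpu", devices=4, strategy=strategy)
```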
Training Mode in Model Summary
The model summary table that gets displayed when you run `Trainer.fit()` now contains a new column "Mode" that shows the training mode each layer is in (#19468). A module in PyTorch is always either in `train` (default) or `eval` mode. This improvement should give users more visibility into the state of their model and help debug issues, for example when you need to make sure certain layers of the model are frozen.
Special Forward Methods in Fabric
Until now, Lightning Fabric warned the user in case the forward pass of the model or a subset of its modules was conducted through methods other than the dedicated `forward` method of the PyTorch module. The reason for this is that PyTorch needs to run special hooks in case of DDP/FSDP and other strategies to function properly, and not running through the real `forward` method would skip these hooks and lead to correctness issues. In Lightning Fabric 2.3, we added a feature to explicitly mark alternative forward methods so that Fabric can add the necessary rerouting behind the scenes:
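The sketch below is illustrative only: the toy module and its `generate()` method are made up, while `mark_forward_method()` is the new Fabric API (also listed in the changelog further down).

```python
import torch
from lightning.fabric import Fabric

class MyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(16, 4)

    def forward(self, x):
        return self.net(x)

    def generate(self, x):
        # An alternative entry point that also runs a forward pass
        return self.net(x).argmax(dim=-1)

fabric = Fabric(accelerator="cpu")
model = fabric.setup(MyModel())

# Tell Fabric that `generate` performs a forward pass so the strategy hooks
# (e.g. for DDP/FSDP) are routed correctly instead of triggering a warning.
model.mark_forward_method("generate")
```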
Find the full example and more details in our docs.
Notable Changes
The 2.0 series of Lightning releases guarantees core API stability: No name changes, argument renaming, hook removals etc. on core interfaces (Trainer, LightningModule, etc.) unless a feature is specifically marked experimental. Here we list a few behavioral changes made in places where the change was justified if it significantly improves the user experience, improves performance, or fixes the correctness of a feature. These changes will likely not impact most users.
Skipping the training step in DDP
It is no longer allowed to skip `training_step()` by returning `None` in distributed training (#19918). The following usage was previously possible but would result in unpredictable hangs and timeouts in distributed training:
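For example, a `training_step()` along these lines (with a hypothetical `compute_loss()` helper) used to be a common way to skip problematic batches on a single device:

```python
import torch
import lightning.pytorch as L

class LitModel(L.LightningModule):
    def training_step(self, batch, batch_idx):
        loss = self.compute_loss(batch)  # hypothetical loss helper, not a Lightning API
        if torch.isnan(loss):
            # Returning None skips the step on a single device, but in DDP the other
            # ranks keep waiting for a gradient sync, causing hangs and timeouts.
            return None
        return loss
```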
We decided to raise an error if the user attempts to return `None` when running in a multi-GPU setting.

Miscellaneous Changes
The `prepare_data()` hook in `LightningModule` and `LightningDataModule` is now subject to a barrier without timeout to avoid long-running tasks from being interrupted (#19448). Similarly, the `Fabric.rank_zero_first` context manager in Fabric now uses an infinite barrier (#19448).

CHANGELOG
PyTorch Lightning
Added
- The `ModelSummary` and `RichModelSummary` callbacks now display the training mode of each layer in the column "Mode" (#19468)
- Added `load_from_checkpoint` support for `LightningCLI` when using dependency injection (#18105)
- Added `on_exception` hook to `LightningDataModule` (#19601)
- Added `ModelParallelStrategy` to support 2D parallelism (#19878, #19888)
- `torch.distributed.destroy_process_group` is now called in an atexit handler if the process group needs destruction (#19931)
- Added the `FSDPStrategy(device_mesh=...)` argument (#19504)

Changed
- The `prepare_data()` hook in `LightningModule` and `LightningDataModule` is now subject to a barrier without timeout to avoid long-running tasks from being interrupted (#19448)
- Changed the handling of `drop_last` for prediction (#19678)
- It is no longer allowed to skip `training_step()` by returning `None` in distributed training (#19918)

Removed
- Removed the Bagua integration (`Trainer(strategy="bagua")`) (#19445)

Fixed
- Fixed `WandbLogger.log_hyperparameters()` raising an error if hyperparameters are not JSON serializable (#19769)
- Fixed an issue with the `ModelCheckpoint(save_last=...)` argument (#19808)
- Fixed `epoch_loop.restarting` to avoid a full validation run after `LearningRateFinder` (#19818)

Lightning Fabric
Added
- Added `fabric consolidate` in the new CLI (#19560)
- Added `_FabricModule.mark_forward_method()` (#19690)
- Added `ModelParallelStrategy` to support 2D parallelism (#19846, #19852, #19870, #19872)
- `torch.distributed.destroy_process_group` is now called in an atexit handler if the process group needs destruction (#19931)
- Added the `FSDPStrategy(device_mesh=...)` argument (#19504)

Changed
- Renamed `lightning run model` to `fabric run` (#19442, #19527)
- The `Fabric.rank_zero_first` context manager now uses a barrier without timeout to avoid long-running tasks from being interrupted (#19448)
- Fabric now raises an error if you forget to call `fabric.backward()` when it is needed by the strategy or precision selection (#19447, #19493)
- `_BackwardSyncControl` can now control what to do when gradient accumulation is disabled (#19577)

Removed
Fixed
Full commit list: 2.2.0 -> 2.3.0
Contributors
We thank all our contributors who submitted pull requests for features, bug fixes and documentation updates.
New Contributors
Did you know?
Chuck Norris is a big fan and daily user of Lightning Studio.
v2.2.5: Patch release v2.2.5 (Compare Source)
PyTorch Lightning + Fabric
Fixed
Full Changelog: Lightning-AI/pytorch-lightning@2.2.4...2.2.5
v2.2.4: Patch release v2.2.4 (Compare Source)
App
Fixed
PyTorch
No Changes.
Fabric
No Changes.
Full Changelog: Lightning-AI/pytorch-lightning@2.2.3...2.2.4
v2.2.3: Patch release v2.2.3 (Compare Source)
PyTorch
Fixed
- Fixed `WandbLogger.log_hyperparameters()` raising an error if hyperparameters are not JSON serializable (#19769)

Fabric
No Changes.
Full Changelog: Lightning-AI/pytorch-lightning@2.2.2...2.2.3
v2.2.2: Patch release v2.2.2 (Compare Source)
PyTorch
Fixed
- Fixed an issue when using `torch.compile` as a decorator (#19627)
- Fixed an issue with `save_weights_only=True` (#19524)

Fabric
Fixed
- Fixed an issue when using `torch.compile` as a decorator (#19627)
- Fixed an issue with `Fabric.setup()` when using FSDP (#19755)

Full Changelog: Lightning-AI/pytorch-lightning@2.2.1...2.2.2
Contributors
@ankitgola005 @awaelchli @Borda @carmocca @dmitsf @dvoytan-spark @fnhirwa
v2.2.1: Patch release v2.2.1 (Compare Source)
PyTorch
Fixed
- Fixed an issue with `Trainer.accumulate_grad_batches` and `Trainer.log_every_n_steps` in `ThroughputMonitor` (#19470)

Fabric
Fixed
Full Changelog: Lightning-AI/pytorch-lightning@2.2.0post...2.2.1
Contributors
@Raalsky @awaelchli @carmocca @Borda
If we forgot someone due to not matching commit email with GitHub account, let us know :]
v2.2.0: Lightning v2.2 (Compare Source)
Lightning AI is excited to announce the release of Lightning 2.2 ⚡
Did you know? The Lightning philosophy extends beyond a boilerplate-free deep learning framework: We've been hard at work bringing you Lightning Studio. Code together, prototype, train, deploy, host AI web apps. All from your browser, with zero setup.
While our previous release was packed with many big new features, this time around we're rolling out mainly improvements based on feedback from the community. And of course, as the name implies, this release fully supports the latest PyTorch 2.2 🎉
Highlights
Monitoring Throughput
Lightning now has built-in utilities to measure throughput metrics such as batches/sec, samples/sec and Model FLOP Utilization (MFU) (#18848).
Trainer:
For the Trainer, this comes in the form of a `ThroughputMonitor` callback. In order to track samples/sec, you need to provide a function that tells the monitor how to extract the batch dimension from your input. Furthermore, if you want to track MFU, you can provide a sample forward pass and the `ThroughputMonitor` will automatically estimate the utilization based on the hardware you are running on. The results get automatically sent to the logger if one is configured on the Trainer.
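A minimal sketch for tracking samples/sec only, assuming the batch is a tensor with the batch dimension first; the `batch_size_fn` argument name follows the 2.2 documentation for the callback and should be double-checked against your version.

```python
import lightning.pytorch as L
from lightning.pytorch.callbacks import ThroughputMonitor

# Report samples/sec by telling the monitor how to read the batch size.
monitor = ThroughputMonitor(batch_size_fn=lambda batch: batch.size(0))
trainer = L.Trainer(max_epochs=1, callbacks=[monitor])
```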
Fabric:
For Fabric, the `ThroughputMonitor` is a simple utility object on which you call `.update()` and `compute_and_log()` during the training loop. Check out our TinyLlama LLM pretraining script for a full example using Fabric's `ThroughputMonitor`. The throughput utilities can report metrics such as batches/sec, samples/sec, and Model FLOP Utilization (MFU).
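A rough sketch of the Fabric flavor; the import path and the exact `update()` keyword arguments follow the 2.2 documentation and are an assumption to verify against your installed version. The batch size of 32 and the loop body are placeholders.

```python
import time
from lightning.fabric import Fabric
from lightning.fabric.utilities.throughput import ThroughputMonitor

fabric = Fabric(accelerator="cpu")
throughput = ThroughputMonitor(fabric)

t0 = time.perf_counter()
for step in range(1, 101):
    ...  # your forward/backward/optimizer step goes here
    if step % 10 == 0:
        throughput.update(
            time=time.perf_counter() - t0,  # cumulative wall-clock time
            batches=step,
            samples=step * 32,              # assuming a batch size of 32
        )
        throughput.compute_and_log(step=step)
```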
Improved Handling of Evaluation Mode
When you train a model and have validation enabled, the Trainer automatically calls `.eval()` when transitioning to the validation loop, and `.train()` when validation ends. Until now, this had the unfortunate side effect that any submodules in your LightningModule that were in evaluation mode got reset to train mode. In Lightning 2.2, the Trainer now captures the mode of every submodule before switching to validation, and restores the mode the modules were in when validation ends (#18951). This improvement will help users avoid silent correctness bugs and removes boilerplate code for managing frozen layers. If you have overridden any of the `LightningModule.on_{validation,test,predict}_model_{eval,train}` hooks, they will still get called and execute your custom logic, but they are no longer required if you added them to preserve the eval mode of frozen modules.

Converting FSDP Checkpoints
In the previous release, we introduced distributed checkpointing with FSDP to speed up saving and loading checkpoints for big models. These checkpoints are in a special format, saved in a folder with shards from each GPU in a separate file. While these checkpoints can be loaded back with the Lightning Trainer or Fabric very easily, they aren't easy to load or process externally. In Lightning 2.2, we introduced a CLI utility that lets you consolidate the checkpoint folder into a single file that can be loaded in raw PyTorch with `torch.load()`, for example (#19213). Given you saved a distributed checkpoint, you can then convert it like so:
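A sketch of the workflow, assuming the Trainer saved a sharded checkpoint folder named `my-checkpoint.ckpt`; the module path of the consolidation utility and the `.consolidated` output suffix follow the 2.2 documentation, so verify both against the linked docs.

```python
# Step 1 (run from the shell): consolidate the sharded checkpoint folder.
#
#   python -m lightning.pytorch.utilities.consolidate_checkpoint my-checkpoint.ckpt
#
# Step 2: the consolidated file is a regular PyTorch checkpoint.
import torch

checkpoint = torch.load("my-checkpoint.ckpt.consolidated")
print(checkpoint.keys())
```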
Read more about distributed checkpointing in our documentation: Trainer, Fabric.
Improvements to Compiling DDP/FSDP in Fabric
PyTorch 2.0+ introduced `torch.compile`, a powerful tool to speed up your models without changing the code. We have now added a comprehensive guide on how to use `torch.compile` correctly, with tips and tricks to help you troubleshoot common issues. On top of that, `Fabric.setup()` will now reapply `torch.compile` on top of DDP/FSDP if you are enabling these strategies (#19280). You might see fewer graph breaks, but there won't be any significant speed-ups with this. We introduced this mainly to make Fabric ready for future improvements from PyTorch to optimize distributed operations.
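A minimal sketch of the call order; the linear layer is a stand-in for your own `nn.Module`, and the script would typically be launched through `fabric run` or another distributed launcher.

```python
import torch
from lightning.fabric import Fabric

fabric = Fabric(strategy="ddp", devices=2)
fabric.launch()

model = torch.nn.Linear(32, 32)  # stand-in for your own model
model = torch.compile(model)     # compile first ...
model = fabric.setup(model)      # ... Fabric reapplies torch.compile on top of DDP
```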
Saving and Loading DataLoader State
If you use a dataloader/iterable that implements the `.state_dict()` and `.load_state_dict()` interface, the Trainer will now automatically save and load its state in the checkpoint (#19361). Note that the standard PyTorch DataLoader does not support this stateful interface; this feature only works with loaders that implement these two methods. A dataloader that supports full fault tolerance will be included in our upcoming release of Lightning Data, a library to optimize data preprocessing and streaming in the cloud. Stay tuned!
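To illustrate what such an interface looks like, here is a minimal hand-rolled iterable (not a Lightning or PyTorch class) exposing the two methods the Trainer looks for.

```python
class StatefulIterable:
    """Toy iterable whose progress can be checkpointed and resumed."""

    def __init__(self, data):
        self.data = list(data)
        self.index = 0

    def __iter__(self):
        while self.index < len(self.data):
            item = self.data[self.index]
            self.index += 1
            yield item

    def state_dict(self):
        # Saved into the Trainer checkpoint automatically in Lightning 2.2+
        return {"index": self.index}

    def load_state_dict(self, state):
        # Restored by the Trainer when resuming from the checkpoint
        self.index = state["index"]
```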
Non-strict Checkpoint Loading in Trainer
A feature that has been requested for a long time by the community is non-strict checkpoint loading. By default, a checkpoint in PyTorch is loaded with `strict=True` to ensure all keys in the saved checkpoint match what's in the model's state dict. However, in some use cases it might make sense to exclude certain weights from being included in the checkpoint. When resuming training, the user would then be required to set `strict=False`, which wasn't configurable until now. You can now set the attribute `strict_loading=False` on your LightningModule if you want to allow loading partial checkpoints (#19404). Full documentation here.
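A minimal sketch; the checkpoint path is a placeholder, and the tiny model and dataset exist only to make the example self-contained.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import lightning.pytorch as L

class LitModel(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)
        self.strict_loading = False  # tolerate missing keys when loading a checkpoint

    def training_step(self, batch, batch_idx):
        (x,) = batch
        return self.layer(x).pow(2).mean()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

loader = DataLoader(TensorDataset(torch.randn(8, 4)), batch_size=4)
trainer = L.Trainer(max_epochs=1, logger=False, enable_checkpointing=False)
# Resuming from a checkpoint that contains only part of the weights no longer fails:
trainer.fit(LitModel(), loader, ckpt_path="partial.ckpt")  # placeholder path
```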
Notable Changes
The 2.0 series of Lightning releases guarantees core API stability: No name changes, argument renaming, hook removals etc. on core interfaces (Trainer, LightningModule, etc.) unless a feature is specifically marked experimental. Here we list a few behavioral changes made in places where the change was justified if it significantly improves the user experience, improves performance, or fixes the correctness of a feature. These changes will likely not impact most users.
ModelCheckpoint's save-last Feature
In Lightn
Configuration
📅 Schedule: Branch creation - "" in timezone Asia/Tokyo, Automerge - At any time (no schedule defined).
🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.
♻ Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.
🔕 Ignore: Close this PR and you won't be reminded about this update again.
This PR was generated by Mend Renovate. View the repository job log.