class ShardedModelV2(module, shard_strategy, process_group=None, reduce_scatter_process_group=None, reduce_scatter_bucket_size_mb=25, fp32_reduce_scatter=False, tensor_placement_policy='cuda', gradient_predivide_factor=1.0, reuse_fp16_shard=False, *args, **kwargs)

A wrapper for a PyTorch module that shards the model parameters across the memory of multiple GPUs. Only 1/#nproc of the parameters and gradients are stored in local CUDA memory, so forward and backward passes can be executed within a limited CUDA memory budget.
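To make the 1/#nproc figure concrete, here is a minimal sketch in plain Python (no PyTorch or distributed backend) of how a flattened parameter could be split evenly across ranks. The helper name `shard_flat_param` and the zero-padding scheme are assumptions made for illustration; they are not taken from the library's implementation.

```python
def shard_flat_param(flat_param, world_size):
    """Split a flat list of values into `world_size` equal-sized shards.

    Hypothetical helper: pads with zeros so that every rank holds the
    same number of elements, then returns one shard per rank.
    """
    shard_size = -(-len(flat_param) // world_size)  # ceiling division
    pad = shard_size * world_size - len(flat_param)
    padded = flat_param + [0.0] * pad
    return [padded[r * shard_size:(r + 1) * shard_size] for r in range(world_size)]


params = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]  # 7 parameter values
shards = shard_flat_param(params, world_size=4)

# Each rank keeps only its own shard in local memory.
for rank, shard in enumerate(shards):
    print(f"rank {rank} holds {len(shard)} elements: {shard}")
```

With 4 ranks, each rank holds 2 elements (the last shard carries one padding zero) instead of all 7.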


You must use ShardedModelV2 with ShardedOptimizerV2.


If you enable reuse_fp16_shard, make sure that you don’t use gradient accumulation and that your optimizer can work with fp16 gradients and fp32 parameters.

  • module (nn.Module) – A sharded module, which must be initialized by ZeroInitContext.

  • shard_strategy (BaseShardStrategy) – A shard strategy to manage shard behavior.

  • process_group (Optional[ProcessGroup], optional) – Data parallel process group. Defaults to None.

  • reduce_scatter_process_group (Optional[ProcessGroup], optional) – Reduce-scatter process group. Generally, it should be None, in which case it is the same as process_group. Defaults to None.

  • reduce_scatter_bucket_size_mb (int, optional) – Reduce-scatter bucket size in MB. Defaults to 25.

  • fp32_reduce_scatter (bool, optional) – If set to True, gradients are forced to FP32 before reduce-scatter. Defaults to False.

  • tensor_placement_policy (str) – The device on which held tensors are placed. It can be ‘cpu’, ‘cuda’ or ‘auto’. If it’s ‘cpu’, parameters, gradients and optimizer states are offloaded to CPU, which means the minimum amount of CUDA memory is used. If it’s ‘cuda’, they are not offloaded, which means the maximum amount of CUDA memory is used. If it’s ‘auto’, they are moved dynamically based on CPU and CUDA memory usage, so that the heterogeneous memory space is used evenly. Note that the ‘auto’ policy only works well when no other processes use CUDA during your training. Defaults to ‘cuda’.

  • gradient_predivide_factor (Optional[float], optional) – The gradient is divided by this value before reduce-scatter. Defaults to 1.0.

  • reuse_fp16_shard (bool, optional) – Whether to reuse fp16 shard for param and grad. Enabling this can reduce GPU memory usage, but you have to make sure you disable it when using gradient accumulation. In this mode, grad will be fp16. Make sure your optimizer supports mixed precision (fp32 param and fp16 grad). We find that PyTorch’s optimizers don’t support mixed precision, so we recommend you enable this only when using our CPUAdam with CPU offload. Defaults to False.
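The effect of gradient_predivide_factor on reduce-scatter can be sketched with plain Python lists standing in for per-rank gradients. The sketch assumes that, after the reduce-scatter, the result is divided by world_size / gradient_predivide_factor so that each rank ends up with the mean gradient. The documentation above only states the pre-division, so treat the post-division as an assumption. `simulate_reduce_scatter` is a hypothetical stand-in for the real collective operation.

```python
def simulate_reduce_scatter(per_rank_grads, predivide_factor=1.0):
    """Simulate pre-divide -> reduce-scatter -> post-divide on Python lists.

    per_rank_grads[r] is the full (unsharded) gradient computed on rank r.
    Returns a list whose entry r is the gradient shard kept by rank r.
    """
    world_size = len(per_rank_grads)
    numel = len(per_rank_grads[0])
    assert numel % world_size == 0, "sketch assumes an evenly divisible gradient"
    shard_size = numel // world_size

    # Assumption: the remaining division happens after the reduction.
    postdivide_factor = world_size / predivide_factor

    # 1. Every rank divides its local gradient before communication.
    scaled = [[g / predivide_factor for g in grads] for grads in per_rank_grads]
    # 2. Reduce: element-wise sum across ranks.
    summed = [sum(column) for column in zip(*scaled)]
    # 3. Scatter: rank r keeps only slice r, then applies the post-division.
    return [
        [v / postdivide_factor for v in summed[r * shard_size:(r + 1) * shard_size]]
        for r in range(world_size)
    ]


grads = [[1.0, 2.0, 3.0, 4.0],   # gradient on rank 0
         [3.0, 4.0, 5.0, 6.0]]   # gradient on rank 1

out_default = simulate_reduce_scatter(grads, predivide_factor=1.0)
out_prediv = simulate_reduce_scatter(grads, predivide_factor=2.0)
print(out_default)
print(out_prediv)
```

Under this assumption both settings yield the same mean gradient per shard. The difference is that pre-dividing keeps the intermediate sum smaller, which may help avoid overflow when gradients are reduced in fp16 (that is, when fp32_reduce_scatter is False).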


Dump the information collected by the memory tracer to a file. Example:

    try:
        # forward: model(inputs)
        # backward: optimizer.backward()
        ...
    except Exception as e:
        model.dump_memory_stats()
        exit(0)