class ShardedOptimizerV2(sharded_model, optimizer, gpu_margin_mem_ratio=0.0, initial_scale=4294967296, min_scale=1, growth_factor=2, backoff_factor=0.5, growth_interval=1000, hysteresis=2, max_scale=4294967296, dp_process_group=None, mp_process_group=None, verbose=False)[source]

A wrapper for an optimizer. ShardedOptimizerV2 together with ShardedModelV2 implements the Zero Redundancy Optimizer (ZeRO).

By default, the ZeRO stage 3 optimizer offloads optimizer states (OS) to the CPU.

We apply the Device-aware Operator Placement technique for OS placement, as described in the following paper:

PatrickStar: Parallel Training of Pre-trained Models via Chunk-based Memory Management

GPU margin space is the memory remaining after subtracting the peak non-model data from the total GPU memory; the peak is detected by a runtime memory tracer.

We place as many OS chunks in the margin space as possible.

The size of the margin space is controlled by gpu_margin_mem_ratio. If it is set to 0.0, the behavior is the same as the classical ZeRO optimizer.
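The margin-space arithmetic above can be sketched in a few lines. This is an illustrative calculation only; the function and variable names are hypothetical, not Colossal-AI's actual API.

```python
# Hypothetical sketch of the GPU margin-space computation for placing
# optimizer-state (OS) chunks. Names are illustrative, not library API.

def compute_margin_space(total_gpu_mem: int,
                         peak_non_model_data: int,
                         gpu_margin_mem_ratio: float) -> int:
    """Margin space = GPU memory left after peak non-model data,
    scaled by gpu_margin_mem_ratio (0.0 disables GPU placement of OS)."""
    margin = total_gpu_mem - peak_non_model_data
    return int(margin * gpu_margin_mem_ratio)

def num_os_chunks_on_gpu(margin_bytes: int, chunk_bytes: int) -> int:
    """Place as many whole OS chunks as fit in the margin."""
    return margin_bytes // chunk_bytes

# Example: 40 GB GPU, 24 GB peak non-model data, ratio 0.5, 1 GB chunks.
GB = 1024 ** 3
margin = compute_margin_space(40 * GB, 24 * GB, 0.5)   # 8 GB of margin
print(num_os_chunks_on_gpu(margin, 1 * GB))            # -> 8 chunks on GPU
```

With gpu_margin_mem_ratio=0.0 the margin is zero, so no OS chunks are placed on the GPU, matching the classical ZeRO behavior described above.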


You must use ShardedOptimizerV2 with ShardedModelV2.


Make sure you set tensor_placement_policy in ShardedModelV2 to "auto" if you set gpu_margin_mem_ratio > 0.

  • sharded_model (ShardedModelV2) – A sharded model initialized by class ShardedModelV2. The optimizer will use the shard strategy provided by the sharded model to shard the fp32 parameter tensors.

  • optimizer (Optimizer) – An Optimizer instance.

  • gpu_margin_mem_ratio (float, optional) – The ratio of GPU remaining memory (after the first forward-backward) which will be used when using hybrid CPU optimizer. This argument is meaningless when tensor_placement_policy of ShardedModelV2 is not “auto”. Defaults to 0.0.

  • initial_scale (float, optional) – Initial scale used by DynamicGradScaler. Defaults to 2**32.

  • min_scale (float, optional) – Min scale used by DynamicGradScaler. Defaults to 1.

  • growth_factor (float, optional) – growth_factor used by DynamicGradScaler. Defaults to 2.

  • backoff_factor (float, optional) – backoff_factor used by DynamicGradScaler. Defaults to 0.5.

  • growth_interval (int, optional) – growth_interval used by DynamicGradScaler. Defaults to 1000.

  • hysteresis (int, optional) – hysteresis used by DynamicGradScaler. Defaults to 2.

  • max_scale (int, optional) – max_scale used by DynamicGradScaler. Defaults to 2**32.

  • dp_process_group (Optional[ProcessGroup], optional) – data parallel process group. Defaults to None.

  • mp_process_group (Optional[ProcessGroup], optional) – model parallel process group. Defaults to None.
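The scaling parameters above describe a standard dynamic loss-scaling policy. The sketch below re-implements that general technique in plain Python to show how the parameters interact; it is not Colossal-AI's actual DynamicGradScaler implementation, and the class name is hypothetical.

```python
# Illustrative dynamic loss-scaling policy: back off on overflow
# (after `hysteresis` consecutive overflows), grow after
# `growth_interval` consecutive overflow-free steps, and clamp the
# scale to [min_scale, max_scale]. Not the library's real code.

class DynamicGradScalerSketch:
    def __init__(self, initial_scale=2**32, min_scale=1, growth_factor=2,
                 backoff_factor=0.5, growth_interval=1000, hysteresis=2,
                 max_scale=2**32):
        self.scale = initial_scale
        self.min_scale, self.max_scale = min_scale, max_scale
        self.growth_factor, self.backoff_factor = growth_factor, backoff_factor
        self.growth_interval = growth_interval
        self.hysteresis = hysteresis
        self._good_steps = 0
        self._hysteresis_left = hysteresis

    def update(self, found_overflow: bool) -> None:
        if found_overflow:
            self._good_steps = 0
            self._hysteresis_left -= 1
            if self._hysteresis_left <= 0:
                # Back off the scale, but never below min_scale.
                self.scale = max(self.scale * self.backoff_factor,
                                 self.min_scale)
                self._hysteresis_left = self.hysteresis
        else:
            self._good_steps += 1
            self._hysteresis_left = self.hysteresis
            if self._good_steps >= self.growth_interval:
                # Grow the scale, but never above max_scale.
                self.scale = min(self.scale * self.growth_factor,
                                 self.max_scale)
                self._good_steps = 0

scaler = DynamicGradScalerSketch(initial_scale=2**16, growth_interval=2)
scaler.update(found_overflow=False)
scaler.update(found_overflow=False)   # two good steps -> scale doubles
print(scaler.scale)                   # 2**17
scaler.update(found_overflow=True)
scaler.update(found_overflow=True)    # hysteresis=2 overflows -> backoff
print(scaler.scale)                   # back to 2**16
```

A large initial_scale with hysteresis > 1 lets training start aggressively while tolerating an occasional spurious overflow before the scale is reduced.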


Get the memory usage of the optimizer, including the master params (fp32 params), momentum (self.state[p]['exp_avg']) and variance (self.state[p]['exp_avg_sq']).


Returns:

CUDA and CPU memory usage in bytes.

Return type:

Tuple[int, int]

class LowLevelZeroOptimizer(optimizer, initial_scale=65536, min_scale=1, growth_factor=2.0, backoff_factor=0.5, growth_interval=2000, hysteresis=2, max_scale=16777216, clip_grad_norm=0.0, verbose=False, reduce_bucket_size=1048576, communication_dtype=None, overlap_communication=False, partition_grad=False, cpu_offload=False, forced_dtype=None)[source]

Optimizer used for ZeRO-1 and ZeRO-2.
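The core ZeRO-2 idea (enabled here via partition_grad) is that after gradients are reduced across the data-parallel group, each rank keeps only its own partition of the flat gradient, cutting per-rank gradient memory by the world size. The sketch below illustrates that partitioning in plain Python; the function name is hypothetical and this is not the library's implementation.

```python
# Illustrative ZeRO-2 gradient partitioning: each data-parallel rank
# owns one contiguous shard of the flattened gradient. Hypothetical
# helper, not Colossal-AI code.

def partition_flat_gradient(flat_grad: list, rank: int, world_size: int) -> list:
    """Return the slice of the flat gradient owned by `rank`.
    The shard size is rounded up so every element is owned by some rank."""
    shard = (len(flat_grad) + world_size - 1) // world_size
    return flat_grad[rank * shard:(rank + 1) * shard]

grads = [float(i) for i in range(8)]
print(partition_flat_gradient(grads, rank=1, world_size=4))  # [2.0, 3.0]
```

ZeRO-1 applies the same partitioning to optimizer states only; ZeRO-2 additionally partitions the gradients themselves.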


Set parameter gradients to zero. If set_to_none = True, gradients are instead set to None, which saves memory.


set_to_none (bool) – Whether to set gradients to None instead of zeroing them. Defaults to True.
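The difference between the two modes can be shown with a minimal stand-in for a parameter object. This is a plain-Python sketch of the general set_to_none semantics (as in PyTorch-style optimizers), with hypothetical names, not the library's code.

```python
# Sketch of the two zero_grad behaviors. `Param` is a hypothetical
# stand-in for a parameter holding a gradient buffer.

class Param:
    def __init__(self):
        self.grad = [1.0, 2.0]

def zero_grad(params, set_to_none=True):
    for p in params:
        if set_to_none:
            p.grad = None                  # frees the gradient buffer
        else:
            p.grad = [0.0] * len(p.grad)   # keeps the buffer, zero-filled

ps = [Param()]
zero_grad(ps)                  # default: gradient becomes None
print(ps[0].grad)              # None

ps = [Param()]
zero_grad(ps, set_to_none=False)
print(ps[0].grad)              # [0.0, 0.0]
```

Setting gradients to None avoids holding a zero-filled buffer between steps, at the cost of the next backward pass having to allocate the buffer again.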