colossalai.zero.sharded_optim.sharded_optim_v2

class colossalai.zero.sharded_optim.sharded_optim_v2.OptimState(value)[source]

An enumeration.

class colossalai.zero.sharded_optim.sharded_optim_v2.ShardedOptimizerV2(sharded_model, optimizer, gpu_margin_mem_ratio=0.0, initial_scale=4294967296, min_scale=1, growth_factor=2, backoff_factor=0.5, growth_interval=1000, hysteresis=2, max_scale=4294967296, dp_process_group=None, mp_process_group=None, verbose=False)[source]

A wrapper for an optimizer. Together, ShardedOptimizerV2 and ShardedModelV2 implement the Zero Redundancy Optimizer (ZeRO).

By default, the stage-3 ZeRO optimizer offloads the optimizer states (OS) to the CPU.

We apply the Device-aware Operator Placement technique for OS placement from the following paper.

PatrickStar: Parallel Training of Pre-trained Models via Chunk-based Memory Management

GPU margin space is the space remaining after subtracting the peak non-model data, as detected by a runtime memory tracer, from the overall GPU memory.

We place as many OS chunks in the margin space as possible.

The size of the margin space can be controlled by gpu_margin_mem_ratio. If it is set to 0.0, the behavior is identical to the classical ZeRO optimizer.
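To make the ratio concrete, here is an illustrative back-of-the-envelope calculation; the variable names and numbers are hypothetical and not part of the library:

    total_gpu_memory = 40 * 1024 ** 3      # e.g. a 40 GiB device
    peak_non_model_data = 12 * 1024 ** 3   # peak non-model data reported by the memory tracer
    margin_space = total_gpu_memory - peak_non_model_data

    gpu_margin_mem_ratio = 0.8             # fraction of the margin granted to OS chunks
    os_budget_on_gpu = int(margin_space * gpu_margin_mem_ratio)
    # gpu_margin_mem_ratio = 0.0 gives a zero budget, i.e. all optimizer states
    # stay on the CPU, matching the classical ZeRO optimizer described above.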

Note

You must use ShardedOptimizerV2 with ShardedModelV2.

Note

If you set gpu_margin_mem_ratio > 0, make sure tensor_placement_policy in ShardedModelV2 is set to “auto”.

Parameters:
  • sharded_model (ShardedModelV2) – A sharded model initialized by class ShardedModelV2. The optimizer will use the shard strategy provided by the sharded model to shard the fp32 parameter tensors.

  • optimizer (Optimizer) – An Optimizer instance.

  • gpu_margin_mem_ratio (float, optional) – The ratio of GPU remaining memory (after the first forward-backward) which will be used when using hybrid CPU optimizer. This argument is meaningless when tensor_placement_policy of ShardedModelV2 is not “auto”. Defaults to 0.0.

  • initial_scale (float, optional) – Initial scale used by DynamicGradScaler. Defaults to 2**32.

  • min_scale (float, optional) – Min scale used by DynamicGradScaler. Defaults to 1.

  • growth_factor (float, optional) – growth_factor used by DynamicGradScaler. Defaults to 2.

  • backoff_factor (float, optional) – backoff_factor used by DynamicGradScaler. Defaults to 0.5.

  • growth_interval (float, optional) – growth_interval used by DynamicGradScaler. Defaults to 1000.

  • hysteresis (float, optional) – hysteresis used by DynamicGradScaler. Defaults to 2.

  • max_scale (int, optional) – max_scale used by DynamicGradScaler. Defaults to 2**32.

  • dp_process_group (Optional[ProcessGroup], optional) – data parallel process group. Defaults to None.

  • mp_process_group (Optional[ProcessGroup], optional) – model parallel process group. Defaults to None.
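A minimal construction sketch follows, assuming a ShardedModelV2 instance named sharded_model has already been built (its construction is covered by the ShardedModelV2 documentation); the hyperparameter values shown are illustrative.

    import torch
    from colossalai.zero.sharded_optim.sharded_optim_v2 import ShardedOptimizerV2

    # `sharded_model` is assumed to be an already-built ShardedModelV2 instance.
    base_optimizer = torch.optim.Adam(sharded_model.parameters(), lr=1e-3)

    sharded_optim = ShardedOptimizerV2(
        sharded_model,
        base_optimizer,
        gpu_margin_mem_ratio=0.8,   # only meaningful when tensor_placement_policy is "auto"
        initial_scale=2 ** 32,
        growth_interval=1000,
    )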

get_memory_usage()[source]

Get the memory usage of the optimizer, including master_params (fp32 params), momentum (self.state[p]['exp_avg']), and variance (self.state[p]['exp_avg_sq']).

Returns:

CUDA and CPU memory usage in bytes.

Return type:

Tuple[int, int]
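Continuing the sketch above, the returned tuple can be inspected directly; sharded_optim is the hypothetical instance constructed earlier.

    cuda_bytes, cpu_bytes = sharded_optim.get_memory_usage()
    print(f"optimizer states: {cuda_bytes / 1024 ** 2:.1f} MiB on CUDA, "
          f"{cpu_bytes / 1024 ** 2:.1f} MiB on CPU")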