colossalai.zero.sharded_optim.sharded_optim_v2
- class colossalai.zero.sharded_optim.sharded_optim_v2.ShardedOptimizerV2(sharded_model, optimizer, gpu_margin_mem_ratio=0.0, initial_scale=4294967296, min_scale=1, growth_factor=2, backoff_factor=0.5, growth_interval=1000, hysteresis=2, max_scale=4294967296, dp_process_group=None, mp_process_group=None, verbose=False)[source]
A wrapper for an optimizer. ShardedOptimizerV2 and ShardedModelV2 together implement the Zero Redundancy Optimizer (ZeRO). By default, ZeRO stage 3 offloads Optimizer States (OS) to CPU.
We apply the Device-aware Operator Placement technique for OS placement from the following paper:
PatrickStar: Parallel Training of Pre-trained Models via Chunk-based Memory Management
GPU margin space is the memory remaining after subtracting the peak non-model data from the overall GPU memory; it is detected by a runtime memory tracer. We place as many OS chunks in the margin space as possible. The size of the margin space is controlled by gpu_margin_mem_ratio. If it is set to 0.0, the behavior is the same as the classical ZeRO optimizer.
Note
You must use ShardedOptimizerV2 with ShardedModelV2.
Note
Make sure you set tensor_placement_policy of ShardedModelV2 to “auto” if you set gpu_margin_mem_ratio > 0.
- Parameters:
sharded_model (ShardedModelV2) – A sharded model initialized by class ShardedModelV2. The optimizer will use the shard strategy provided by sharded model to shard param fp32 tensors.
optimizer (Optimizer) – An Optimizer instance.
gpu_margin_mem_ratio (float, optional) – The ratio of GPU remaining memory (after the first forward-backward) which will be used when using hybrid CPU optimizer. This argument is meaningless when tensor_placement_policy of ShardedModelV2 is not “auto”. Defaults to 0.0.
initial_scale (float, optional) – Initial scale used by DynamicGradScaler. Defaults to 2**32.
min_scale (float, optional) – Min scale used by DynamicGradScaler. Defaults to 1.
growth_factor (float, optional) – growth_factor used by DynamicGradScaler. Defaults to 2.
backoff_factor (float, optional) – backoff_factor used by DynamicGradScaler. Defaults to 0.5.
growth_interval (float, optional) – growth_interval used by DynamicGradScaler. Defaults to 1000.
hysteresis (float, optional) – hysteresis used by DynamicGradScaler. Defaults to 2.
max_scale (int, optional) – max_scale used by DynamicGradScaler. Defaults to 2**32.
dp_process_group (Optional[ProcessGroup], optional) – data parallel process group. Defaults to None.
mp_process_group (Optional[ProcessGroup], optional) – model parallel process group. Defaults to None.
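For orientation, here is a minimal usage sketch. The import paths, the TensorShardStrategy class, the shard_strategy and tensor_placement_policy arguments of ShardedModelV2, and the optimizer-driven backward/step calls are assumptions based on the legacy colossalai.zero layout and may differ between Colossal-AI releases; treat this as an illustration rather than a verified recipe.

    import torch
    import torch.nn as nn

    # NOTE: these import paths follow the legacy colossalai.zero layout and are
    # assumptions -- adjust them to the installed Colossal-AI release.
    from colossalai.zero.shard_utils import TensorShardStrategy
    from colossalai.zero.sharded_model import ShardedModelV2
    from colossalai.zero.sharded_optim import ShardedOptimizerV2

    # Toy model; real setups typically build the model inside a ZeRO init context
    # so that parameters are already sharded at construction time.
    model = nn.Linear(1024, 1024).cuda()

    # Wrap the model first. The optimizer reuses the model's shard strategy to
    # shard its fp32 master parameters, and tensor_placement_policy must be
    # "auto" whenever gpu_margin_mem_ratio > 0.
    sharded_model = ShardedModelV2(
        model,
        shard_strategy=TensorShardStrategy(),
        tensor_placement_policy="auto",
    )

    # Any torch.optim.Optimizer instance can be wrapped.
    base_optim = torch.optim.Adam(sharded_model.parameters(), lr=1e-3)

    sharded_optim = ShardedOptimizerV2(
        sharded_model,
        base_optim,
        gpu_margin_mem_ratio=0.3,  # use ~30% of the measured GPU margin for OS chunks
        initial_scale=2 ** 32,     # DynamicGradScaler settings (defaults shown)
        growth_interval=1000,
    )

    # One training step: the wrapper scales the loss and performs the sharded update.
    x = torch.randn(8, 1024, device="cuda")
    loss = sharded_model(x).sum()
    sharded_optim.backward(loss)
    sharded_optim.step()
    sharded_optim.zero_grad()

Setting gpu_margin_mem_ratio to a non-zero value only has an effect when the runtime memory tracer can measure the margin, which is why the “auto” placement policy is required in that case.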