colossalai.gemini

class colossalai.gemini.StatefulTensorMgr(tensor_placement_policy)[source]

Stateful Tensor Manager, inspired by PatrickStar

PatrickStar: Parallel Training of Pre-trained Models via Chunk-based Memory Management https://arxiv.org/abs/2108.05818

finish_iter()[source]

This function must be called when each iteration finishes

adjust_layout()[source]

Adjust the layout of stateful tensors according to the information provided by mem_stats_collector, which should belong to a sharded model.

class colossalai.gemini.GeminiManager(placement_policy, chunk_manager, memstats=None)[source]

Stateful Tensor Manager, inspired by PatrickStar

PatrickStar: Parallel Training of Pre-trained Models via Chunk-based Memory Management https://arxiv.org/abs/2108.05818

Parameters:
  • placement_policy (str) – Which device to place held tensors. It can be ‘cpu’, ‘cuda’ or ‘auto’. If it’s ‘cpu’, parameters, gradients and optimizer states are offloaded to CPU, so minimal CUDA memory is used. If it’s ‘cuda’, they are never offloaded, so maximal CUDA memory is used. If it’s ‘auto’, they are moved dynamically based on CPU and CUDA memory usage, so the heterogeneous memory space is utilized evenly. Note that the ‘auto’ policy only works well when no other process uses CUDA during your training. A construction sketch follows this list.

  • chunk_manager (ChunkManager) – A ChunkManager instance.

  • memstats (MemStats, optional) – memory statistics collected by a runtime memory tracer. If None, the GeminiManager collects them during a warmup iteration.
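
A minimal construction sketch, assuming a chunk configuration is already available. The inner ‘chunk_size’ keyword is an assumption for illustration; in practice, use the dict returned by search_chunk_configuration (documented at the end of this page):

    from colossalai.gemini import ChunkManager, GeminiManager

    # Illustrative config: dp_degree -> chunk init args. The 'chunk_size'
    # key is an assumption; prefer the dict produced by
    # search_chunk_configuration in real use.
    chunk_config = {1: dict(chunk_size=32 * 1024 ** 2)}
    chunk_manager = ChunkManager(chunk_config)

    # 'auto' migrates tensors between CPU and CUDA based on observed memory
    # usage; 'cpu' always offloads, 'cuda' never offloads.
    gemini_manager = GeminiManager('auto', chunk_manager)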

memstats()[source]

Get the memory statistics collected during training. The stats may be collected by a runtime memory tracer or by the GeminiManager itself. Note that in the latter case, you cannot access the memstats before the warmup iteration finishes.

post_iter()[source]

This function must be called when each iteration finishes

adjust_layout(chunks)[source]

Adjust the layout of stateful tensors according to the information provided by mem_stats_collector, which should belong to a sharded model.

class colossalai.gemini.TensorInfo(state: colossalai.gemini.chunk.chunk.TensorState, offset: int, end: int)[source]
class colossalai.gemini.TensorState(value)[source]

An enumeration.
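
A quick way to list the available states without assuming their names (this enumeration drives trans_tensor_state, documented below):

    from colossalai.gemini import TensorState

    # Print every state defined by the installed version of the library.
    for state in TensorState:
        print(state.name, state.value)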

class colossalai.gemini.ChunkManager(chunk_configuration, init_device=None)[source]

A manager class to manipulate the tensors in chunks.

Parameters:
  • chunk_configuration (Dict[int, Dict]) – the configuration of this chunk manager: a dict mapping each DP degree to the init args of its chunks.

  • init_device (torch.device, optional) – the device on which chunks are initialized. Defaults to None.

register_tensor(tensor, group_type, config_key, cpu_offload=False, pin_memory=False)[source]

Register a tensor with the chunk manager. Afterwards, the tensor can be accessed through get_chunks. A registration sketch follows the parameter list.

Parameters:
  • tensor – the tensor appended to the chunk

  • group_type – the data type of the group.

  • config_key – the key of the group’s name, i.e. the size of the DP world.

  • cpu_offload – if True, the chunk will be closed on the CPU.

  • pin_memory – whether the chunk is pinned in CPU memory.
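
A hedged registration sketch. The ‘fp16_param’ group name mirrors what the ZeRO wrapper passes in recent versions, the config key is the DP world size, and the ‘chunk_size’ init arg is assumed; all three are assumptions, and in real training the registered tensors are ColossalAI parameters rather than the plain tensor used here for brevity:

    import torch
    from colossalai.gemini import ChunkManager

    # 'chunk_size' is an assumed init arg; see search_chunk_configuration.
    chunk_manager = ChunkManager({1: dict(chunk_size=1024)})

    param = torch.nn.Parameter(torch.randn(256))
    # Register under the (assumed) 'fp16_param' group with DP world size 1,
    # then close the open chunks so they become usable.
    chunk_manager.register_tensor(param, 'fp16_param', 1, pin_memory=True)
    chunk_manager.close_all_groups()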

close_all_groups()[source]

Close all the chunks of all groups.

access_chunk(chunk)[source]

Make the chunk available for computation.

release_chunk(chunk)[source]

Scatter the chunk in CUDA.

move_chunk(chunk, device, force_copy=False)[source]

Move the shard of the chunk to the target device.
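
Continuing the registration sketch above, the typical lifecycle pairs these calls: gather a chunk before computing with its tensors, scatter it afterwards, and optionally move its shard off the GPU. The target device here is illustrative:

    chunk = chunk_manager.get_chunk(param)

    chunk_manager.access_chunk(chunk)    # gather: tensors become computable
    # ... run computation on the tensors owned by this chunk ...
    chunk_manager.release_chunk(chunk)   # scatter the chunk again

    chunk_manager.move_chunk(chunk, torch.device('cpu'))  # offload the shard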

trans_tensor_state(tensor, state)[source]

Transition the tensor’s state according to the pre-defined state machine.

reduce_chunk(chunk)[source]

Reduce or all-reduce the chunk.

copy_tensor_to_chunk_slice(tensor, data)[source]

Copy data to the chunk.

Parameters:
  • tensor (torch.Tensor) – the tensor used to retrieve meta information

  • data (torch.Tensor) – the tensor to be copied to the chunk

get_chunk(tensor)[source]

Return the chunk owning the tensor.

Parameters:

tensor (torch.Tensor) – a torch tensor object

get_cuda_movable_chunks()[source]

Get all chunks that can be moved.

get_chunks(tensors)[source]

Get all chunks owning the input tensors.

Parameters:

tensors (Iterable[torch.Tensor]) – the tensors used to look for chunks

add_extern_static_tensor(tensor)[source]

Add an extern static tensor to the chunk manager. Such tensors are not managed by the chunk manager, but we still want to monitor their memory usage. They are “static”: their shape, dtype and device never change, so their memory usage never changes either. A sketch follows the parameter below.

Parameters:

tensor (torch.Tensor) – An extern static tensor. E.g. optimizer state.
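
A short sketch of tracking an optimizer-state buffer that lives outside any chunk; the buffer shape and the ‘chunk_size’ init arg are illustrative assumptions:

    import torch
    from colossalai.gemini import ChunkManager

    chunk_manager = ChunkManager({1: dict(chunk_size=1024)})  # assumed init arg

    # The exp_avg buffer of an Adam-style optimizer never changes shape,
    # dtype or device, so the manager only needs to account for its memory.
    exp_avg = torch.zeros(1024)
    chunk_manager.add_extern_static_tensor(exp_avg)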

colossalai.gemini.search_chunk_configuration(model, search_range_mb, search_interval_byte, min_chunk_size_mb=32, filter_exlarge_params=True, strict_ddp_flag=False, memstas=None)[source]

Parameters:
  • model (nn.Module) – torch module

  • search_range_mb (float) – the search range in megabytes.

  • search_interval_byte (int) – the search interval in bytes.

  • min_chunk_size_mb (float, optional) – the minimum size of a distributed chunk, in megabytes. Defaults to 32.

  • filter_exlarge_params (bool, optional) – filter out extremely large parameters. Defaults to True.

  • strict_ddp_flag (bool, optional) – whether to enable strict DDP mode, in which all parameters are kept replicated. Defaults to False.

  • memstas (MemStats, optional) – memory statistics collected by a runtime memory tracer. Defaults to None.

Returns:

the chunk configuration (a dict mapping dp_degree to chunk init args) and the wasted chunk memory in bytes (see the usage sketch below).

Return type:

Tuple[Dict, int]
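
An end-to-end usage sketch, assuming a single-process setup whose model parameters have been initialized under ColossalAI (a plain module is used here for brevity); the search ranges are illustrative, not recommendations:

    import torch.nn as nn
    from colossalai.gemini import ChunkManager, search_chunk_configuration

    model = nn.Sequential(nn.Linear(1024, 1024), nn.Linear(1024, 1024))

    # Search chunk sizes above min_chunk_size_mb within a 64 MB range,
    # stepping 1 KB at a time. Returns the per-dp-degree chunk init args
    # and the wasted chunk memory in bytes.
    config_dict, wasted_bytes = search_chunk_configuration(
        model,
        search_range_mb=64,
        search_interval_byte=1024,
    )
    chunk_manager = ChunkManager(config_dict)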