colossalai.nn.optimizer

class colossalai.nn.optimizer.FusedLAMB(params, lr=0.001, bias_correction=True, betas=(0.9, 0.999), eps=1e-06, weight_decay=0.01, amsgrad=False, adam_w_mode=True, grad_averaging=True, set_grad_none=True, max_grad_norm=1.0, use_nvlamb=False)[source]

Implements LAMB algorithm.

FusedLAMB requires CUDA extensions which can be built during installation or runtime.

This version of fused LAMB implements 2 fusions.

  • Fusion of the LAMB update’s elementwise operations

  • A multi-tensor apply launch that batches the elementwise updates applied to all the model’s parameters into one or a few kernel launches.

colossalai.nn.optimizer.FusedLAMB’s usage is identical to any ordinary PyTorch optimizer.

colossalai.nn.optimizer.FusedLAMB may be used with or without Amp.

LAMB was proposed in Large Batch Optimization for Deep Learning: Training BERT in 76 minutes.

Parameters:
  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups.

  • lr (float, optional) – learning rate. (default: 1e-3)

  • betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its norm. (default: (0.9, 0.999))

  • eps (float, optional) – term added to the denominator to improve numerical stability. (default: 1e-6)

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0.01)

  • amsgrad (boolean, optional) – whether to use the AMSGrad variant of this algorithm from the paper On the Convergence of Adam and Beyond. NOT SUPPORTED in FusedLAMB! (default: False)

  • adam_w_mode (boolean, optional) – whether to apply decoupled weight decay (also known as AdamW) instead of L2 regularization. (default: True)

  • grad_averaging (bool, optional) – whether to apply (1 - beta2) to the gradient when computing running averages of the gradient. (default: True)

  • set_grad_none (bool, optional) – whether to set gradients to None when zero_grad() is called. (default: True)

  • max_grad_norm (float, optional) – value used to clip global grad norm (default: 1.0)

  • use_nvlamb (boolean, optional) – whether to apply the adaptive learning rate to parameters with 0.0 weight decay. (default: False)

step(closure=None)[source]

Performs a single optimization step.

Parameters:

closure (callable, optional) – A closure that reevaluates the model and returns the loss.
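
Since its usage is identical to an ordinary PyTorch optimizer, a minimal training step could look like the sketch below (the toy model, tensor shapes, and hyperparameters are illustrative assumptions; a CUDA device and the built extensions are required):

    import torch
    import torch.nn as nn
    from colossalai.nn.optimizer import FusedLAMB

    # Toy model; any nn.Module whose parameters live on a CUDA device works,
    # since FusedLAMB relies on fused CUDA kernels.
    model = nn.Linear(128, 10).cuda()
    optimizer = FusedLAMB(model.parameters(), lr=1e-3,
                          weight_decay=0.01, max_grad_norm=1.0)

    inputs = torch.randn(32, 128, device='cuda')
    targets = torch.randint(0, 10, (32,), device='cuda')
    criterion = nn.CrossEntropyLoss()

    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()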

class colossalai.nn.optimizer.FusedAdam(params, lr=0.001, bias_correction=True, betas=(0.9, 0.999), eps=1e-08, adamw_mode=True, weight_decay=0.0, amsgrad=False, set_grad_none=True)[source]

Implements Adam algorithm.

FusedAdam requires CUDA extensions which can be built during installation or runtime.

This version of fused Adam implements 2 fusions.

  • Fusion of the Adam update’s elementwise operations

  • A multi-tensor apply launch that batches the elementwise updates applied to all the model’s parameters into one or a few kernel launches.

colossalai.nn.optimizer.FusedAdam may be used as a drop-in replacement for torch.optim.AdamW, or torch.optim.Adam with adamw_mode=False

colossalai.nn.optimizer.FusedAdam may be used with or without Amp.

Adam was proposed in Adam: A Method for Stochastic Optimization.

Parameters:
  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups.

  • lr (float, optional) – learning rate. (default: 1e-3)

  • betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square. (default: (0.9, 0.999))

  • eps (float, optional) – term added to the denominator to improve numerical stability. (default: 1e-8)

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

  • amsgrad (boolean, optional) – whether to use the AMSGrad variant of this algorithm from the paper On the Convergence of Adam and Beyond. NOT SUPPORTED in FusedAdam! (default: False)

  • adamw_mode (boolean, optional) – whether to apply decoupled weight decay (also known as AdamW) instead of L2 regularization. (default: True)

  • set_grad_none (bool, optional) – whether to set gradients to None when zero_grad() is called. (default: True)

step(closure=None, grads=None, output_params=None, scale=None, grad_norms=None, div_scale=-1)[source]

Performs a single optimization step.

Parameters:

closure (callable, optional) – A closure that reevaluates the model and returns the loss.

The remaining arguments are deprecated, and are only retained (for the moment) for error-checking purposes.
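
As an illustration of the drop-in behaviour described above, the sketch below constructs FusedAdam the same way one would construct torch.optim.AdamW (the model and hyperparameters are arbitrary placeholders; a CUDA device and the built extensions are required):

    import torch
    import torch.nn as nn
    from colossalai.nn.optimizer import FusedAdam

    model = nn.Linear(64, 64).cuda()

    # adamw_mode=True (the default) mirrors torch.optim.AdamW;
    # adamw_mode=False mirrors torch.optim.Adam with L2 regularization.
    optimizer = FusedAdam(model.parameters(), lr=1e-3,
                          weight_decay=0.01, adamw_mode=True)

    loss = model(torch.randn(16, 64, device='cuda')).pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()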

class colossalai.nn.optimizer.FusedSGD(params, lr=<required parameter>, momentum=0, dampening=0, weight_decay=0, nesterov=False, wd_after_momentum=False)[source]

Implements stochastic gradient descent (optionally with momentum).

FusedSGD requires CUDA extensions which can be built during installation or runtime.

This version of fused SGD implements 2 fusions.

  • Fusion of the SGD update’s elementwise operations

  • A multi-tensor apply launch that batches the elementwise updates applied to all the model’s parameters into one or a few kernel launches.

colossalai.nn.optimizer.FusedSGD may be used as a drop-in replacement for torch.optim.SGD

colossalai.nn.optimizer.FusedSGD may be used with or without Amp.

Nesterov momentum is based on the formula from On the importance of initialization and momentum in deep learning.

Parameters:
  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups

  • lr (float) – learning rate

  • momentum (float, optional) – momentum factor (default: 0)

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

  • dampening (float, optional) – dampening for momentum (default: 0)

  • nesterov (bool, optional) – enables Nesterov momentum (default: False)

Note

The implementation of SGD with Momentum/Nesterov subtly differs from Sutskever et al. and implementations in some other frameworks. Considering the specific case of Momentum, the update can be written as

\[\begin{split}v = \rho * v + g \\ p = p - lr * v\end{split}\]

where p, g, v and \(\rho\) denote the parameters, gradient, velocity, and momentum respectively. This is in contrast to Sutskever et al. and other frameworks which employ an update of the form

\[\begin{split}v = \rho * v + lr * g \\ p = p - v\end{split}\]

The Nesterov version is analogously modified.
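
A short numerical sketch of where the two conventions diverge (plain Python; the constants are arbitrary illustrative values, and the difference only appears once the learning rate changes between steps):

    # Two momentum-SGD steps on a single scalar parameter, with a learning-rate drop.
    rho = 0.9
    grads = [1.0, 1.0]
    lrs = [0.1, 0.01]

    # Convention used here: lr scales the whole accumulated velocity.
    p_a, v_a = 1.0, 0.0
    for lr, g in zip(lrs, grads):
        v_a = rho * v_a + g
        p_a = p_a - lr * v_a      # the new lr also rescales the old momentum

    # Sutskever et al. convention: lr is folded into the velocity as it accumulates.
    p_b, v_b = 1.0, 0.0
    for lr, g in zip(lrs, grads):
        v_b = rho * v_b + lr * g
        p_b = p_b - v_b           # old momentum keeps the lr it was accumulated with

    print(p_a, p_b)               # ~0.881 vs ~0.80: identical first step, different second step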

step(closure=None)[source]

Performs a single optimization step.

Parameters:

closure (callable, optional) – A closure that reevaluates the model and returns the loss.
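
For completeness, the drop-in usage mirrors torch.optim.SGD; the snippet below is a minimal sketch with arbitrary hyperparameters (a CUDA device and the built extensions are required):

    import torch
    import torch.nn as nn
    from colossalai.nn.optimizer import FusedSGD

    model = nn.Linear(32, 32).cuda()
    # Same common options as torch.optim.SGD.
    optimizer = FusedSGD(model.parameters(), lr=0.1, momentum=0.9,
                         weight_decay=1e-4, nesterov=True)

    loss = model(torch.randn(8, 32, device='cuda')).sum()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()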

class colossalai.nn.optimizer.Lamb(params, lr=0.001, betas=(0.9, 0.999), eps=1e-06, weight_decay=0, adam=False)[source]

Implements Lamb algorithm. It has been proposed in Large Batch Optimization for Deep Learning: Training BERT in 76 minutes.

Parameters:
  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups

  • lr (float, optional) – learning rate (default: 1e-3)

  • betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))

  • eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-6)

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

  • adam (bool, optional) – always use trust ratio = 1, which turns this into Adam. Useful for comparison purposes. (default: False)

step(closure=None)[source]

Performs a single optimization step.

Parameters:

closure (callable, optional) – A closure that reevaluates the model and returns the loss.
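
Since step() accepts an optional closure, closure-style usage could look like the sketch below (the model, data, and hyperparameters are illustrative assumptions, and the sketch assumes this pure-Python Lamb implementation can run on CPU):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from colossalai.nn.optimizer import Lamb

    model = nn.Linear(16, 1)
    optimizer = Lamb(model.parameters(), lr=1e-3, weight_decay=0.01)

    x, y = torch.randn(4, 16), torch.randn(4, 1)

    def closure():
        # Re-evaluates the model and returns the loss, as step() expects.
        optimizer.zero_grad()
        loss = F.mse_loss(model(x), y)
        loss.backward()
        return loss

    loss = optimizer.step(closure)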

class colossalai.nn.optimizer.Lars(params, lr=0.001, momentum=0, eeta=0.001, weight_decay=0, epsilon=0.0)[source]

Implements the LARS optimizer from “Large batch training of convolutional networks”.

Parameters:
  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups

  • lr (float, optional) – learning rate (default: 1e-3)

  • momentum (float, optional) – momentum factor (default: 0)

  • eeta (float, optional) – LARS coefficient as used in the paper (default: 1e-3)

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

step(closure=None)[source]

Performs a single optimization step.

Parameters:

closure (callable, optional) – A closure that reevaluates the model and returns the loss.
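
A minimal usage sketch (hyperparameters are arbitrary, eeta is left at its 1e-3 default, and CPU execution is assumed as with Lamb above):

    import torch
    import torch.nn as nn
    from colossalai.nn.optimizer import Lars

    model = nn.Linear(16, 16)
    optimizer = Lars(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)

    loss = model(torch.randn(4, 16)).pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()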

class colossalai.nn.optimizer.CPUAdam(model_params, lr=0.001, bias_correction=True, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, adamw_mode=True, nvme_offload_fraction=0.0, nvme_offload_dir=None)[source]

Implements Adam algorithm.

Supports parameter updates on both GPU and CPU, depending on the device of the parameters. However, the parameters and gradients must be on the same device:

  • Parameters on CPU and gradients on CPU are allowed.

  • Parameters on GPU and gradients on GPU are allowed.

  • Parameters on GPU and gradients on CPU are not allowed.

CPUAdam requires CUDA extensions which can be built during installation or runtime.

This version of CPU Adam accelerates parameter updates on CPU with SIMD instructions. AVX2 or AVX512 support is required.

The GPU part is implemented in a naive way.

CPU Adam also supports mixed-precision computation, e.g. fp32 parameters and fp16 gradients.

colossalai.nn.optimizer.CPUAdam may be used as a drop-in replacement for torch.optim.AdamW, or torch.optim.Adam with adamw_mode=False

Adam was proposed in Adam: A Method for Stochastic Optimization.

Parameters:
  • model_params (iterable) – iterable of parameters or dicts defining parameter groups.

  • lr (float, optional) – learning rate. (default: 1e-3)

  • betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square. (default: (0.9, 0.999))

  • eps (float, optional) – term added to the denominator to improve numerical stability. (default: 1e-8)

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

  • amsgrad (boolean, optional) – whether to use the AMSGrad variant of this algorithm from the paper On the Convergence of Adam and Beyond. NOT SUPPORTED yet in CPUAdam! (default: False)

  • adamw_mode (boolean, optional) – whether to apply decoupled weight decay (also known as AdamW) instead of L2 regularization. (default: True)

  • simd_log (boolean, optional) – whether to print a message indicating whether SIMD acceleration is being used. (default: False)

  • nvme_offload_fraction (float, optional) – Fraction of optimizer states to be offloaded to NVMe. Defaults to 0.0.

  • nvme_offload_dir (Optional[str], optional) – Directory to save NVMe offload files. If it’s None, a random temporary directory will be used. Defaults to None.
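
The device constraint above is the main thing to respect in practice; the sketch below keeps both parameters and gradients on CPU (the model, shapes, and hyperparameters are illustrative; the built extensions and AVX support are required):

    import torch
    import torch.nn as nn
    from colossalai.nn.optimizer import CPUAdam

    # Keep the model on CPU so parameters and gradients share the same device.
    model = nn.Linear(64, 64)
    optimizer = CPUAdam(model.parameters(), lr=1e-3,
                        weight_decay=0.01, adamw_mode=True)

    loss = model(torch.randn(8, 64)).pow(2).mean()
    loss.backward()            # gradients are also on CPU
    optimizer.step()
    optimizer.zero_grad()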

class colossalai.nn.optimizer.HybridAdam(model_params, lr=0.001, bias_correction=True, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, adamw_mode=True, nvme_offload_fraction=0.0, nvme_offload_dir=None, **defaults)[source]

Implements Adam algorithm.

Supports parameter updates on both GPU and CPU, depending on the device of the parameters. However, the parameters and gradients must be on the same device:

  • Parameters on CPU and gradients on CPU are allowed.

  • Parameters on GPU and gradients on GPU are allowed.

  • Parameters on GPU and gradients on CPU are not allowed.

HybridAdam requires CUDA extensions which can be built during installation or runtime.

This version of Hybrid Adam is a hybrid of CPUAdam and FusedAdam.

  • For parameters updated on CPU, it uses CPUAdam.

  • For parameters updated on GPU, it uses FusedAdam.

  • Mixed-precision computation of fp16 and fp32 is supported, e.g. fp32 parameters and fp16 gradients.

colossalai.nn.optimizer.HybridAdam may be used as a drop-in replacement for torch.optim.AdamW, or torch.optim.Adam with adamw_mode=False

Adam was proposed in Adam: A Method for Stochastic Optimization.

Parameters:
  • model_params (iterable) – iterable of parameters or dicts defining parameter groups.

  • lr (float, optional) – learning rate. (default: 1e-3)

  • betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square. (default: (0.9, 0.999))

  • eps (float, optional) – term added to the denominator to improve numerical stability. (default: 1e-8)

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

  • amsgrad (boolean, optional) – whether to use the AMSGrad variant of this algorithm from the paper On the Convergence of Adam and Beyond. NOT SUPPORTED yet in HybridAdam! (default: False)

  • adamw_mode (boolean, optional) – whether to apply decoupled weight decay (also known as AdamW) instead of L2 regularization. (default: True)

  • simd_log (boolean, optional) – whether to print a message indicating whether SIMD acceleration is being used. (default: False)

  • nvme_offload_fraction (float, optional) – Fraction of optimizer states to be offloaded to NVMe. Defaults to 0.0.

  • nvme_offload_dir (Optional[str], optional) – Directory to save NVMe offload files. If it’s None, a random temporary directory will be used. Defaults to None.
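
A minimal sketch of the drop-in usage described above, with parameters on GPU so the FusedAdam path is taken (the model, shapes, and hyperparameters are illustrative; NVMe offloading is left at its 0.0 default and would need the optional NVMe backend if enabled):

    import torch
    import torch.nn as nn
    from colossalai.nn.optimizer import HybridAdam

    model = nn.Linear(64, 64).cuda()   # GPU parameters -> FusedAdam path

    # Drop-in replacement for torch.optim.AdamW (adamw_mode=False mirrors torch.optim.Adam).
    # A positive nvme_offload_fraction would offload part of the optimizer states to NVMe.
    optimizer = HybridAdam(model.parameters(), lr=1e-3, weight_decay=0.01)

    loss = model(torch.randn(8, 64, device='cuda')).pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()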