colossalai.nn.lr_scheduler

class colossalai.nn.lr_scheduler.CosineAnnealingLR(optimizer, total_steps, eta_min=0, last_epoch=-1, **kwargs)[source]

Set the learning rate of each parameter group using a cosine annealing schedule, where \(\eta_{max}\) is set to the initial lr and \(T_{cur}\) is the number of epochs since the last restart in SGDR:

\[\begin{split}\begin{aligned} \eta_t & = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})\left(1 + \cos\left(\frac{T_{cur}}{T_{max}}\pi\right)\right), & T_{cur} \neq (2k+1)T_{max}; \\ \eta_{t+1} & = \eta_{t} + \frac{1}{2}(\eta_{max} - \eta_{min}) \left(1 - \cos\left(\frac{1}{T_{max}}\pi\right)\right), & T_{cur} = (2k+1)T_{max}. \end{aligned}\end{split}\]

When last_epoch=-1, the initial lr is set to lr. Notice that because the schedule is defined recursively, the learning rate can be simultaneously modified outside this scheduler by other operators. If the learning rate is set solely by this scheduler, the learning rate at each step becomes:

\[\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})\left(1 + \cos\left(\frac{T_{cur}}{T_{max}}\pi\right)\right)\]

It has been proposed in SGDR: Stochastic Gradient Descent with Warm Restarts. Note that this only implements the cosine annealing part of SGDR, and not the restarts.

Parameters:
  • optimizer (torch.optim.Optimizer) – Wrapped optimizer.

  • total_steps (int) – Number of total training steps.

  • eta_min (float, optional) – Minimum learning rate, defaults to 0.

  • last_epoch (int, optional) – The index of the last epoch, defaults to -1. When last_epoch=-1, the schedule starts from the beginning and the initial lr is set to lr.
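
A minimal usage sketch, assuming the scheduler follows the standard PyTorch _LRScheduler stepping interface and is stepped once per training step (the model and loop below are placeholders):

    import torch
    from colossalai.nn.lr_scheduler import CosineAnnealingLR

    model = torch.nn.Linear(16, 4)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    total_steps = 1000
    scheduler = CosineAnnealingLR(optimizer, total_steps=total_steps, eta_min=1e-5)

    for step in range(total_steps):
        # ... forward / backward / optimizer.step() ...
        scheduler.step()  # anneal the lr from 0.1 toward eta_min over total_steps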

class colossalai.nn.lr_scheduler.CosineAnnealingWarmupLR(optimizer, total_steps, warmup_steps=0, eta_min=0.0, last_epoch=-1)[source]

Cosine annealing learning rate scheduler with learning rate warmup. A linear warmup schedule will be applied.

Parameters:
  • optimizer (torch.optim.Optimizer) – Wrapped optimizer.

  • total_steps (int) – Number of total training steps.

  • warmup_steps (int, optional) – Number of warmup steps, defaults to 0.

  • eta_min (float, optional) – Minimum learning rate, defaults to 0.

  • last_epoch (int, optional) – The index of the last epoch, defaults to -1. When last_epoch=-1, the schedule starts from the beginning and the initial lr is set to lr.
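
A short sketch of the warmup variant (values are illustrative; the optimizer setup mirrors the example above):

    import torch
    from colossalai.nn.lr_scheduler import CosineAnnealingWarmupLR

    optimizer = torch.optim.SGD(torch.nn.Linear(16, 4).parameters(), lr=0.1)

    # Linear warmup up to lr=0.1 over the first 100 steps,
    # then cosine decay toward eta_min over the remaining 900 steps.
    scheduler = CosineAnnealingWarmupLR(optimizer, total_steps=1000,
                                        warmup_steps=100, eta_min=1e-6)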

class colossalai.nn.lr_scheduler.FlatAnnealingLR(optimizer, total_steps, pct_start=0.72, last_epoch=-1, **kwargs)[source]

Flat and cosine annealing learning rate scheduler. The learning rate will be a fixed value before starting decay.

Parameters:
  • optimizer (torch.optim.Optimizer) – Wrapped optimizer.

  • total_steps (int) – Number of total training steps.

  • pct_start (float, optional) – Percent of steps before starting learning rate decay, defaults to 0.72.

  • last_epoch (int, optional) – The index of the last epoch, defaults to -1. When last_epoch=-1, the schedule starts from the beginning and the initial lr is set to lr.

class colossalai.nn.lr_scheduler.FlatAnnealingWarmupLR(optimizer, total_steps, warmup_steps=0, pct_start=0.72, eta_min=0, last_epoch=-1, **kwargs)[source]

Flat and cosine annealing learning rate scheduler with learning rate warmup. A linear warmup schedule will be applied, and then the learning rate will be a fixed value before starting decay.

Parameters:
  • optimizer (torch.optim.Optimizer) – Wrapped optimizer.

  • total_steps (int) – Number of total training steps.

  • warmup_steps (int, optional) – Number of warmup steps, defaults to 0.

  • pct_start (float, optional) – Percent of steps before starting learning rate decay, defaults to 0.72.

  • eta_min (float, optional) – Minimum learning rate, defaults to 0.

  • last_epoch (int, optional) – The index of the last epoch, defaults to -1. When last_epoch=-1, the schedule starts from the beginning and the initial lr is set to lr.
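
A sketch showing the three phases; the exact phase boundaries implied by pct_start are an assumption about the implementation:

    import torch
    from colossalai.nn.lr_scheduler import FlatAnnealingWarmupLR

    optimizer = torch.optim.SGD(torch.nn.Linear(16, 4).parameters(), lr=0.1)

    # Warm up for 50 steps, hold the peak lr flat for roughly the first 72%
    # of the remaining steps (pct_start), then cosine-anneal toward eta_min.
    scheduler = FlatAnnealingWarmupLR(optimizer, total_steps=1000,
                                      warmup_steps=50, pct_start=0.72, eta_min=0.0)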

class colossalai.nn.lr_scheduler.LinearWarmupLR(optimizer, total_steps, warmup_steps=0, last_epoch=-1, **kwargs)[source]

Linearly warm up the learning rate, then linearly decay it.

Parameters:
  • optimizer (torch.optim.Optimizer) – Wrapped optimizer.

  • total_steps (int) – Number of total training steps.

  • warmup_steps (int, optional) – Number of warmup steps, defaults to 0.

  • last_epoch (int, optional) – The index of the last epoch, defaults to -1. When last_epoch=-1, the schedule starts from the beginning and the initial lr is set to lr.
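
A sketch of the purely linear warmup-then-decay schedule (values are illustrative):

    import torch
    from colossalai.nn.lr_scheduler import LinearWarmupLR

    optimizer = torch.optim.SGD(torch.nn.Linear(16, 4).parameters(), lr=0.1)

    # Ramp the lr up linearly over the first 100 steps,
    # then decay it linearly over the remaining 900 steps.
    scheduler = LinearWarmupLR(optimizer, total_steps=1000, warmup_steps=100)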

class colossalai.nn.lr_scheduler.MultiStepLR(optimizer, total_steps, milestones=None, gamma=0.1, last_epoch=-1, **kwargs)[source]

Decays the learning rate of each parameter group by gamma once the number of epochs reaches one of the milestones. Notice that such decay can happen simultaneously with other changes to the learning rate from outside this scheduler. When last_epoch=-1, the initial lr is set to lr.

Parameters:
  • optimizer (torch.optim.Optimizer) – Wrapped optimizer.

  • total_steps (int) – Number of total training steps.

  • milestones (List[int], optional) – List of epoch indices. Must be increasing, defaults to None.

  • gamma (float, optional) – Multiplicative factor of learning rate decay, defaults to 0.1.

  • last_epoch (int, optional) – The index of the last epoch, defaults to -1. When last_epoch=-1, the schedule starts from the beginning and the initial lr is set to lr.
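
A sketch with explicit milestones (illustrative values; whether milestones are counted in epochs or steps follows the description above):

    import torch
    from colossalai.nn.lr_scheduler import MultiStepLR

    optimizer = torch.optim.SGD(torch.nn.Linear(16, 4).parameters(), lr=0.1)

    # Multiply the lr by gamma=0.1 at milestones 30 and 80: 0.1 -> 0.01 -> 0.001.
    scheduler = MultiStepLR(optimizer, total_steps=100, milestones=[30, 80], gamma=0.1)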

class colossalai.nn.lr_scheduler.MultiStepWarmupLR(optimizer, total_steps, warmup_steps=0, milestones=None, gamma=0.1, last_epoch=-1, **kwargs)[source]

Multistep learning rate scheduler with warmup.

Parameters:
  • optimizer (torch.optim.Optimizer) – Wrapped optimizer.

  • total_steps (int) – Number of total training steps.

  • warmup_steps (int, optional) – Number of warmup steps, defaults to 0.

  • milestones (List[int], optional) – List of epoch indices. Must be increasing, defaults to None.

  • gamma (float, optional) – Multiplicative factor of learning rate decay, defaults to 0.1.

  • num_steps_per_epoch (int, optional) – Number of steps per epoch, defaults to -1.

  • last_epoch (int, optional) – The index of the last epoch, defaults to -1. When last_epoch=-1, the schedule starts from the beginning and the initial lr is set to lr.

class colossalai.nn.lr_scheduler.OneCycleLR(optimizer, total_steps, pct_start=0.3, anneal_strategy='cos', cycle_momentum=True, base_momentum=0.85, max_momentum=0.95, div_factor=25.0, final_div_factor=10000.0, last_epoch=-1, **kwargs)[source]

Sets the learning rate of each parameter group according to the 1cycle learning rate policy. The 1cycle policy anneals the learning rate from an initial learning rate to some maximum learning rate, and then from that maximum learning rate to some minimum learning rate much lower than the initial learning rate. This policy was initially described in the paper Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates. The 1cycle learning rate policy changes the learning rate after every batch; step() should be called after a batch has been used for training. This scheduler is not chainable. Note also that the total number of steps in the cycle can be determined in one of two ways (listed in order of precedence):

  • A value for total_steps is explicitly provided.

  • A number of epochs (epochs) and a number of steps per epoch (steps_per_epoch) are provided. In this case, the number of total steps is inferred by total_steps = epochs * steps_per_epoch.

You must either provide a value for total_steps or provide a value for both epochs and steps_per_epoch. The default behaviour of this scheduler follows the fastai implementation of 1cycle, which claims that “unpublished work has shown even better results by using only two phases”. To mimic the behaviour of the original paper instead, set three_phase=True.

Parameters:
  • optimizer (torch.optim.Optimizer) – Wrapped optimizer.

  • total_steps (int) – Number of total training steps.

  • pct_start (float, optional) – The percentage of the cycle (in number of steps) spent increasing the learning rate, defaults to 0.3.

  • anneal_strategy (str, optional) – {‘cos’, ‘linear’}; specifies the annealing strategy: “cos” for cosine annealing, “linear” for linear annealing, defaults to ‘cos’.

  • cycle_momentum (bool, optional) – If True, momentum is cycled inversely to learning rate between ‘base_momentum’ and ‘max_momentum’, defaults to True.

  • base_momentum (float, optional) – Lower momentum boundaries in the cycle for each parameter group. Note that momentum is cycled inversely to learning rate; at the peak of a cycle, momentum is ‘base_momentum’ and learning rate is ‘max_lr’, defaults to 0.85.

  • max_momentum (float, optional) – Upper momentum boundaries in the cycle for each parameter group. Functionally, it defines the cycle amplitude (max_momentum - base_momentum). Note that momentum is cycled inversely to learning rate; at the start of a cycle, momentum is ‘max_momentum’ and learning rate is ‘base_lr’, defaults to 0.95.

  • div_factor (float, optional) – Determines the initial learning rate via initial_lr = max_lr/div_factor, defaults to 25.0.

  • final_div_factor (float, optional) – Determines the minimum learning rate via min_lr = initial_lr/final_div_factor, defaults to 10000.0.

  • last_epoch (int, optional) – The index of the last batch. This parameter is used when resuming a training job. Since step() should be invoked after each batch instead of after each epoch, this number represents the total number of batches computed, not the total number of epochs computed. When last_epoch=-1, the schedule is started from the beginning, defaults to -1.

The kwargs used to initialize torch.optim.lr_scheduler.OneCycleLR may include the parameters below:

epochs (int, optional, default=None)
steps_per_epoch (int, optional, default=None)
three_phase (bool, optional, default=False)
verbose (bool, optional, default=False)

More details about these kwargs can be found in the documentation of torch.optim.lr_scheduler.OneCycleLR.
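
A usage sketch; the values are illustrative, and the extra keyword arguments listed above are assumed to be forwarded to torch.optim.lr_scheduler.OneCycleLR via **kwargs:

    import torch
    from colossalai.nn.lr_scheduler import OneCycleLR

    optimizer = torch.optim.SGD(torch.nn.Linear(16, 4).parameters(), lr=0.1)

    scheduler = OneCycleLR(optimizer, total_steps=1000,
                           pct_start=0.3, anneal_strategy='cos',
                           div_factor=25.0, final_div_factor=1e4,
                           three_phase=False)   # forwarded via **kwargs

    for step in range(1000):
        # ... one training batch ...
        scheduler.step()  # 1cycle changes the lr after every batch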

class colossalai.nn.lr_scheduler.PolynomialLR(optimizer, total_steps, end_lr=0.0001, power=1.0, last_epoch=-1, **kwargs)[source]

Polynomial learning rate scheduler.

Parameters:
  • optimizer (torch.optim.Optimizer) – Wrapped optimizer.

  • total_steps (int) – Number of total training steps.

  • end_lr (float, optional) – Minimum learning rate, defaults to 0.0001.

  • power (float, optional) – The power of polynomial, defaults to 1.0.

  • last_epoch (int, optional) – The index of the last epoch, defaults to -1. When last_epoch=-1, the schedule starts from the beginning and the initial lr is set to lr.
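
A sketch of polynomial decay (power=2.0 is chosen for illustration; power=1.0 would give a linear decay from the initial lr to end_lr):

    import torch
    from colossalai.nn.lr_scheduler import PolynomialLR

    optimizer = torch.optim.SGD(torch.nn.Linear(16, 4).parameters(), lr=0.1)

    # Decay from 0.1 toward end_lr over total_steps following a quadratic curve.
    scheduler = PolynomialLR(optimizer, total_steps=1000, end_lr=1e-4, power=2.0)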

class colossalai.nn.lr_scheduler.PolynomialWarmupLR(optimizer, total_steps, warmup_steps=0, end_lr=0.0001, power=1.0, last_epoch=-1, **kwargs)[source]

Polynomial learning rate scheduler with warmup.

Parameters:
  • optimizer (torch.optim.Optimizer) – Wrapped optimizer.

  • total_steps (int) – Number of total training steps.

  • warmup_steps (int, optional) – Number of warmup steps, defaults to 0.

  • end_lr (float, optional) – Minimum learning rate, defaults to 0.0001.

  • power (float, optional) – The power of polynomial, defaults to 1.0.

  • last_epoch (int, optional) – The index of the last epoch, defaults to -1. When last_epoch=-1, the schedule starts from the beginning and the initial lr is set to lr.

class colossalai.nn.lr_scheduler.LambdaLR(optimizer, total_steps, lr_lambda=None, last_epoch=-1)[source]

Sets the learning rate of each parameter group to the initial lr times a given function. When last_epoch=-1, the initial lr is set to lr.

Parameters:
  • optimizer (torch.optim.Optimizer) – Wrapped optimizer.

  • total_steps (int) – Number of total training steps.

  • lr_lambda (Union[function, list[function]]) – A function which computes a multiplicative factor given an integer parameter epoch, or a list of such functions, one for each group in optimizer.param_groups, defaults to None.

  • last_epoch (int, optional) – The index of the last epoch, defaults to -1.
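
A sketch with a user-defined schedule; the lambda returns a factor that multiplies the initial lr (the inverse-square-root shape is just an example):

    import torch
    from colossalai.nn.lr_scheduler import LambdaLR

    optimizer = torch.optim.SGD(torch.nn.Linear(16, 4).parameters(), lr=0.1)

    # lr at step t is 0.1 * 1 / sqrt(1 + t).
    scheduler = LambdaLR(optimizer, total_steps=1000,
                         lr_lambda=lambda t: 1.0 / (1.0 + t) ** 0.5)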

class colossalai.nn.lr_scheduler.MultiplicativeLR(optimizer, total_steps, lr_lambda=None, last_epoch=-1)[source]

Multiply the learning rate of each parameter group by the factor given in the specified function. When last_epoch=-1, the initial lr is set to lr.

Parameters:
  • optimizer (torch.optim.Optimizer) – Wrapped optimizer.

  • total_steps (int) – Number of total training steps.

  • lr_lambda (Union[function, list[function]]) – A function which computes a multiplicative factor given an integer parameter epoch, or a list of such functions, one for each group in optimizer.param_groups, defaults to None.

  • last_epoch (int, optional) – The index of the last epoch, defaults to -1.
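
A sketch illustrating the difference from LambdaLR: here the returned factor multiplies the current lr at every step, so the decay compounds (a constant factor of 0.99 is used for illustration):

    import torch
    from colossalai.nn.lr_scheduler import MultiplicativeLR

    optimizer = torch.optim.SGD(torch.nn.Linear(16, 4).parameters(), lr=0.1)

    # After n calls to step(), the lr is approximately 0.1 * 0.99 ** n.
    scheduler = MultiplicativeLR(optimizer, total_steps=1000,
                                 lr_lambda=lambda t: 0.99)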

class colossalai.nn.lr_scheduler.StepLR(optimizer, total_steps, step_size=1, gamma=0.1, last_epoch=-1)[source]

Decays the learning rate of each parameter group by gamma every step_size epochs. Notice that such decay can happen simultaneously with other changes to the learning rate from outside this scheduler. When last_epoch=-1, the initial lr is set to lr.

Parameters:
  • optimizer (torch.optim.Optimizer) – Wrapped optimizer.

  • total_steps (int) – Number of total training steps.

  • step_size (int, optional) – Period of learning rate decay, defaults to 1.

  • gamma (float, optional) – Multiplicative factor of learning rate decay, defaults to 0.1.

  • last_epoch (int, optional) – The index of the last epoch, defaults to -1.
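
A sketch of a staircase schedule (illustrative values):

    import torch
    from colossalai.nn.lr_scheduler import StepLR

    optimizer = torch.optim.SGD(torch.nn.Linear(16, 4).parameters(), lr=0.1)

    # Multiply the lr by 0.5 every 100 epochs: 0.1, 0.05, 0.025, ...
    scheduler = StepLR(optimizer, total_steps=1000, step_size=100, gamma=0.5)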

class colossalai.nn.lr_scheduler.ExponentialLR(optimizer, total_steps, gamma=1.0, last_epoch=-1)[source]

Decays the learning rate of each parameter group by gamma every epoch. When last_epoch=-1, the initial lr is set to lr.

Parameters:
  • optimizer (Union[torch.optim.Optimizer, colossalai.nn.optimizer]) – Wrapped optimizer.

  • total_steps (int) – Number of total training steps.

  • gamma (float, optional) – Multiplicative factor of learning rate decay, defaults to 1.0.

  • last_epoch (int, optional) – The index of the last epoch, defaults to -1.
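
A sketch of exponential decay (the gamma below is illustrative; gamma=1.0, the default, keeps the lr constant):

    import torch
    from colossalai.nn.lr_scheduler import ExponentialLR

    optimizer = torch.optim.SGD(torch.nn.Linear(16, 4).parameters(), lr=0.1)

    # The lr after n epochs is 0.1 * 0.95 ** n.
    scheduler = ExponentialLR(optimizer, total_steps=1000, gamma=0.95)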