rannet.optimizer¶
Module Contents¶
Classes¶
AdamWarmup | Abstract optimizer base class. |
AdamWarmupTF | tf keras adam warmup |
Functions¶
piecewise_linear | piecewise linear |
Attributes¶
Abstract optimizer base class. |
|
- rannet.optimizer.symbolic¶
- rannet.optimizer.piecewise_linear(t: int, schedule: Dict[int, float], from_zero: bool = True)¶
piecewise linear modified from:
- Parameters:
t – int, iterations
schedule – Dict[int, float], mapping iteration breakpoints to ratios. E.g., for {1000: 1, 2000: 0.1}: when t ∈ [0, 1000], the ratio increases linearly from 0.0 to 1.0; when t ∈ [1000, 2000], the ratio decreases linearly from 1.0 to 0.1; when t > 2000, the ratio stays at 0.1.
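The schedule described above can be illustrated with a small pure-Python sketch. This is not rannet's implementation (which presumably operates on tensors); `piecewise_linear_sketch` is a hypothetical name, and the handling of `from_zero` (prepending a breakpoint of 0.0 at t = 0, and holding the first ratio when it is False) is an assumption based only on the parameter name and the example.

```python
from typing import Dict


def piecewise_linear_sketch(t: int, schedule: Dict[int, float], from_zero: bool = True) -> float:
    """Pure-Python illustration of the ratio described above (not rannet's code)."""
    if not schedule:
        raise ValueError("schedule must contain at least one breakpoint")
    points = sorted(schedule.items())
    # Assumption: from_zero=True means the ratio starts at 0.0 when t == 0.
    if from_zero and points[0][0] != 0:
        points = [(0, 0.0)] + points
    # Assumption: at or before the first breakpoint, hold the first ratio.
    if t <= points[0][0]:
        return float(points[0][1])
    for (t0, r0), (t1, r1) in zip(points, points[1:]):
        if t <= t1:
            # Linear interpolation between neighbouring breakpoints.
            return r0 + (r1 - r0) * (t - t0) / (t1 - t0)
    # After the last breakpoint, keep the final ratio.
    return float(points[-1][1])


schedule = {1000: 1, 2000: 0.1}
for step in (0, 500, 1000, 1500, 5000):
    print(step, piecewise_linear_sketch(step, schedule))
```

Multiplying the base learning rate by this ratio gives a warmup phase followed by a linear decay, which is how an `lr_schedule` of this shape would typically be used.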
- class rannet.optimizer.AdamWarmup(learning_rate: float = 0.001, beta_1: float = 0.9, beta_2: float = 0.999, amsgrad: bool = False, decay: float = 0.0, weight_decay: float = 0.0, epsilon: float = 1e-07, lr_schedule: Dict[int, float] | None = None, gradient_accumulation_steps: int = None, exclude_weight_decay_pattern: List[str] | None = None, include_weight_decay_pattern: List[str] | None = None, **kwargs)¶
Bases:
langml.keras.optimizers.Optimizer
Abstract optimizer base class.
This class supports distributed training. If you want to implement your own optimizer, please subclass this class instead of _BaseOptimizer.
### Usage
```python
import tensorflow as tf
from tensorflow import keras

# Create an optimizer with the desired parameters.
opt = keras.optimizers.SGD(learning_rate=0.1)
var1, var2 = tf.Variable(1.0), tf.Variable(2.0)
# `loss` is a callable that takes no argument and returns the value
# to minimize.
loss = lambda: 3 * var1 * var1 + 2 * var2 * var2
# Call minimize to update the list of variables.
opt.minimize(loss, var_list=[var1, var2])
```
### Processing gradients before applying them
Calling minimize() takes care of both computing the gradients and applying them to the variables. If you want to process the gradients before applying them, you can instead use the optimizer in three steps:
1. Compute the gradients with tf.GradientTape.
2. Process the gradients as you wish.
3. Apply the processed gradients with apply_gradients().
Example:
```python
# Create an optimizer.
opt = tf.keras.optimizers.experimental.SGD(learning_rate=0.1)
var1, var2 = tf.Variable(1.0), tf.Variable(2.0)

# Compute the gradients for a list of variables.
with tf.GradientTape() as tape:
    loss = 3 * var1 * var1 + 2 * var2 * var2
grads = tape.gradient(loss, [var1, var2])

# Process the gradients.
grads[0] = grads[0] + 1

# Ask the optimizer to apply the gradients on variables.
opt.apply_gradients(zip(grads, [var1, var2]))
```
### Dynamic learning rate
A dynamic learning rate can be achieved by setting learning_rate to a built-in or custom tf.keras.optimizers.schedules.LearningRateSchedule.
Example:
>>> var = tf.Variable(np.random.random(size=(1,)))
>>> learning_rate = tf.keras.optimizers.schedules.ExponentialDecay(
...     initial_learning_rate=.01, decay_steps=20, decay_rate=.1)
>>> opt = tf.keras.optimizers.experimental.SGD(learning_rate=learning_rate)
>>> loss = lambda: 3 * var
>>> opt.minimize(loss, var_list=[var])
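To make the schedule in the example above concrete, the following pure-Python sketch evaluates the formula that ExponentialDecay applies with its default `staircase=False`, namely `initial_learning_rate * decay_rate ** (step / decay_steps)`. The function name `exponential_decay_sketch` is hypothetical and the code does not call TensorFlow.

```python
def exponential_decay_sketch(step: int,
                             initial_learning_rate: float = 0.01,
                             decay_steps: int = 20,
                             decay_rate: float = 0.1) -> float:
    """Learning rate after `step` updates, without staircase rounding."""
    return initial_learning_rate * decay_rate ** (step / decay_steps)


# With the values from the example, the rate shrinks by a factor of 10
# every 20 steps.
for step in (0, 10, 20, 40):
    print(step, exponential_decay_sketch(step))
```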
### Gradient clipping
Users can clip the gradients before applying them to variables by setting clipnorm, clipvalue, or global_clipnorm. Note that only one of clipnorm and global_clipnorm can be set.
Example:
>>> opt = tf.keras.optimizers.experimental.SGD(learning_rate=1, clipvalue=1)
>>> var1, var2 = tf.Variable(2.0), tf.Variable(2.0)
>>> with tf.GradientTape() as tape:
...     loss = 2 * var1 + 2 * var2
>>> grads = tape.gradient(loss, [var1, var2])
>>> print([grads[0].numpy(), grads[1].numpy()])
[2.0, 2.0]
>>> opt.apply_gradients(zip(grads, [var1, var2]))
>>> # Without clipping, we should get [0, 0], but as gradients are clipped
>>> # to have max value 1, we get [1.0, 1.0].
>>> print([var1.numpy(), var2.numpy()])
[1.0, 1.0]
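The three clipping options differ in what they bound. As a rough pure-Python sketch on plain lists (the helper names are hypothetical and this is not the Keras code path): clipvalue clamps each element, clipnorm rescales each gradient by its own L2 norm, and global_clipnorm rescales all gradients by their joint L2 norm.

```python
import math
from typing import List


def clip_by_value(grad: List[float], clipvalue: float) -> List[float]:
    """Clamp every element to the range [-clipvalue, clipvalue]."""
    return [max(-clipvalue, min(clipvalue, g)) for g in grad]


def clip_by_norm(grad: List[float], clipnorm: float) -> List[float]:
    """Rescale one gradient so that its own L2 norm is at most clipnorm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= clipnorm:
        return list(grad)
    return [g * clipnorm / norm for g in grad]


def clip_by_global_norm(grads: List[List[float]], global_clipnorm: float) -> List[List[float]]:
    """Rescale all gradients together so their joint L2 norm is at most global_clipnorm."""
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    if global_norm <= global_clipnorm:
        return [list(grad) for grad in grads]
    scale = global_clipnorm / global_norm
    return [[g * scale for g in grad] for grad in grads]


grads = [[3.0, 4.0], [0.0, 12.0]]
print([clip_by_value(g, 1.0) for g in grads])   # element-wise clamp
print([clip_by_norm(g, 1.0) for g in grads])    # each gradient scaled separately
print(clip_by_global_norm(grads, 1.0))          # one shared scale factor
```

In the doctest above, each gradient is 2.0 and clipvalue=1, so the applied gradient is 1.0 and each variable moves from 2.0 to 2.0 - 1 * 1.0 = 1.0.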
### Using weight decay.
Weight decay can boost the model’s performance in certain scenarios. Keras has built-in support for weight decay in all optimizers. Users can apply weight decay by setting the weight_decay argument.
>>> opt = tf.keras.optimizers.experimental.SGD(1, weight_decay=0.004)
>>> grads, var1, var2 = tf.zeros(()), tf.Variable(2.0), tf.Variable(2.0)
>>> # You can exclude variables from weight decay, in this case we
>>> # exclude `var2`.
>>> opt.exclude_from_weight_decay(var_list=[var2])
>>> opt.apply_gradients(zip([grads, grads], [var1, var2]))
>>> print([var1.numpy(), var2.numpy()])
[1.992, 2.0]
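The numbers in the doctest above are consistent with a decoupled decay step of the form `value -= learning_rate * weight_decay * value`, applied before the gradient update. The sketch below reproduces that arithmetic in pure Python; the function names are hypothetical, and the exact order of operations inside Keras is an assumption here (with zero gradients the order does not affect the result).

```python
def apply_weight_decay_sketch(value: float, learning_rate: float, weight_decay: float) -> float:
    """Shrink a variable by learning_rate * weight_decay * value."""
    return value - learning_rate * weight_decay * value


def sgd_step_sketch(value: float, grad: float, learning_rate: float,
                    weight_decay: float, use_decay: bool = True) -> float:
    """One SGD step; variables excluded from decay skip the shrink step."""
    if use_decay:
        value = apply_weight_decay_sketch(value, learning_rate, weight_decay)
    return value - learning_rate * grad


# var1 is decayed, var2 is excluded, and both gradients are zero.
var1 = sgd_step_sketch(2.0, grad=0.0, learning_rate=1.0, weight_decay=0.004)
var2 = sgd_step_sketch(2.0, grad=0.0, learning_rate=1.0, weight_decay=0.004, use_decay=False)
print(var1, var2)
```

Here var1 becomes 2.0 - 1 * 0.004 * 2.0, matching the 1.992 in the doctest up to floating-point rounding, while var2 stays at 2.0.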
### Using exponential moving average.
Empirically, it has been found that using the exponential moving average (EMA) of the trained parameters of a deep network achieves better performance than using its trained parameters directly. Keras optimizers allow users to compute this moving average and overwrite the model variables at the desired time.
Example:
```python
# Create an SGD optimizer with EMA on. `ema_momentum` controls the decay
# rate of the moving average. `ema_momentum=1` means no decay and the stored
# moving average is always the model variable's initial value before training.
# Conversely, `ema_momentum=0` is equivalent to not using EMA.
# `ema_overwrite_frequency=3` means every 3 iterations, we overwrite the
# trainable variables with their moving average values.
opt = tf.keras.optimizers.experimental.SGD(
    learning_rate=1,
    use_ema=True,
    ema_momentum=0.5,
    ema_overwrite_frequency=3)

var1, var2 = tf.Variable(2.0), tf.Variable(2.0)
with tf.GradientTape() as tape:
    loss = var1 + var2
grads = tape.gradient(loss, [var1, var2])
# First iteration: [var1, var2] = [1.0, 1.0]
opt.apply_gradients(zip(grads, [var1, var2]))
print([var1, var2])

# Second iteration: [var1, var2] = [0.0, 0.0]
opt.apply_gradients(zip(grads, [var1, var2]))
print([var1, var2])

# Third iteration, without EMA, we should see [var1, var2] = [-1.0, -1.0],
# but overwriting results in [var1, var2] = [-0.125, -0.125]. The full
# calculation for the moving average of var1 is:
# var1=2*0.5**3+1*(1-0.5)*0.5**2+0*(1-0.5)*0.5**1+(-1)*(1-0.5)=-0.125.
opt.apply_gradients(zip(grads, [var1, var2]))
print([var1, var2])
```
When the optimizer is constructed with use_ema=True in a custom training loop, users can explicitly call finalize_variable_values() to overwrite trainable variables with their EMA values. By default, finalize_variable_values() is called at the end of model.fit().
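The arithmetic in the third-iteration comment of the EMA example can be checked with a pure-Python simulation. This is only a sketch of the update rule implied by that comment (average = ema_momentum * average + (1 - ema_momentum) * value, starting from the initial value); `simulate_ema_sketch` is a hypothetical helper, not part of Keras or rannet.

```python
from typing import List


def simulate_ema_sketch(initial: float, grad: float, learning_rate: float,
                        ema_momentum: float, ema_overwrite_frequency: int,
                        steps: int) -> List[float]:
    """Return the variable's value after each SGD step with EMA overwriting."""
    value = initial
    average = initial  # the moving average starts at the initial value
    history = []
    for iteration in range(1, steps + 1):
        value = value - learning_rate * grad  # plain SGD update
        average = ema_momentum * average + (1 - ema_momentum) * value
        if iteration % ema_overwrite_frequency == 0:
            value = average  # overwrite the variable with its EMA
        history.append(value)
    return history


print(simulate_ema_sketch(2.0, grad=1.0, learning_rate=1.0,
                          ema_momentum=0.5, ema_overwrite_frequency=3, steps=3))
# prints [1.0, 0.0, -0.125]
```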
### Use with tf.distribute.Strategy
This optimizer class is tf.distribute.Strategy aware, which means it automatically sums gradients across all replicas. To aggregate gradients yourself, call apply_gradients with skip_aggregate_gradients set to True. This is useful if you need to process aggregated gradients.
```python
# This example is not runnable, it consists of dummy code for simple
# tutorial.
strategy = tf.distribute.experimental.TPUStrategy()

with strategy.scope():
    opt = tf.keras.optimizers.experimental.SGD()
    model = magic_function_that_returns_model()
    gradients = magic_function_that_returns_gradients()
    # Custom logic to aggregate gradients.
    gradients = strategy.reduce("SUM", gradients, axis=None)
    opt.apply_gradients(zip(gradients, model.trainable_variables),
                        skip_aggregate_gradients=True)
```
### Creating a custom optimizer
If you intend to create your own optimization algorithm, please inherit from this class and override the following methods:
- build: Create your optimizer-related variables, such as momentums in the SGD optimizer.
- update_step: Implement your optimizer’s updating logic.
- get_config: Serialize the optimizer, including all hyperparameters.
Your optimizer will automatically be compatible with TensorFlow distributed training if you subclass optimizer_experimental.Optimizer.
- _get_updates(loss, params)¶
- get_updates(loss, params)¶
- _handle_weight_decay_pattern(w)¶
- get_config()¶
Returns the config of the optimizer.
An optimizer config is a Python dictionary (serializable) containing the configuration of an optimizer. The same optimizer can be reinstantiated later (without any saved state) from this configuration.
Subclasses should override this method to include their additional hyperparameters.
- Returns:
Python dictionary.
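The config round trip described above can be sketched without TensorFlow. The two classes below are hypothetical stand-ins, not rannet or Keras classes; they only illustrate the convention of a subclass extending the parent's `get_config()` and of rebuilding an instance through a `from_config()` classmethod, as Keras optimizers do.

```python
import json


class BaseOptimizerSketch:
    """Hypothetical base class holding a single hyperparameter."""

    def __init__(self, learning_rate: float = 0.001):
        self.learning_rate = learning_rate

    def get_config(self) -> dict:
        return {"learning_rate": self.learning_rate}

    @classmethod
    def from_config(cls, config: dict):
        # Rebuild an instance (without any saved state) from its config.
        return cls(**config)


class WarmupOptimizerSketch(BaseOptimizerSketch):
    """Hypothetical subclass that adds one more hyperparameter."""

    def __init__(self, learning_rate: float = 0.001, weight_decay: float = 0.0):
        super().__init__(learning_rate=learning_rate)
        self.weight_decay = weight_decay

    def get_config(self) -> dict:
        # Extend the parent's config instead of replacing it.
        config = super().get_config()
        config["weight_decay"] = self.weight_decay
        return config


opt = WarmupOptimizerSketch(learning_rate=0.01, weight_decay=0.01)
# The config is a plain dictionary, so it survives a JSON round trip.
config = json.loads(json.dumps(opt.get_config()))
restored = WarmupOptimizerSketch.from_config(config)
print(restored.get_config())
```

One caveat when applying this pattern to a dictionary with integer keys, such as an lr_schedule of type Dict[int, float]: JSON stores object keys as strings, so the keys would need converting back to int after loading.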
- static get_custom_objects()¶
- class rannet.optimizer.AdamWarmupTF(learning_rate: float = 0.001, beta_1: float = 0.9, beta_2: float = 0.999, epsilon: float = 1e-07, weight_decay: float = 0.0, bias_correction: float = True, lr_schedule: Dict[int, float] | None = None, gradient_accumulation_steps: int = None, exclude_weight_decay_pattern: List[str] | None = None, include_weight_decay_pattern: List[str] | None = None, name: str = 'AdamWarmupTF', **kwargs)¶
Bases:
tensorflow.keras.optimizers.Optimizer
tf keras adam warmup. Modified from: https://github.com/bojone/bert4keras/blob/master/bert4keras/optimizers.py#L14
- _create_slots(var_list)¶
- _do_resource_apply(grad, var, indices=None)¶
- _decayed_lr(var_dtype)¶
- _resource_apply(grad, var, indices=None)¶
- _handle_weight_decay_pattern(w)¶
- _resource_apply_dense(grad, var)¶
- _resource_apply_sparse(grad, var, indices)¶
- get_config()¶
Returns the config of the optimizer.
An optimizer config is a Python dictionary (serializable) containing the configuration of an optimizer. The same optimizer can be reinstantiated later (without any saved state) from this configuration.
Subclasses should override this method to include their additional hyperparameters.
- Returns:
Python dictionary.
- static get_custom_objects()¶
- rannet.optimizer.AdamWarmup¶
- rannet.optimizer.custom_objects¶