:py:mod:`rannet.optimizer`
==========================

.. py:module:: rannet.optimizer


Module Contents
---------------

Classes
~~~~~~~

.. autoapisummary::

   rannet.optimizer.AdamWarmup
   rannet.optimizer.AdamWarmupTF


Functions
~~~~~~~~~

.. autoapisummary::

   rannet.optimizer.piecewise_linear


Attributes
~~~~~~~~~~

.. autoapisummary::

   rannet.optimizer.symbolic
   rannet.optimizer.AdamWarmup
   rannet.optimizer.custom_objects


.. py:data:: symbolic


.. py:function:: piecewise_linear(t: int, schedule: Dict[int, float], from_zero: bool = True)

   Piecewise linear schedule.
   Modified from: https://github.com/bojone/bert4keras/blob/9c1c916def4d515a046c414c9849b2e7e11af1e3/bert4keras/backend.py#L73

   :param t: int, the iteration count.
   :param schedule: Dict[int, float], mapping an iteration to the ratio reached at that iteration.
                    For example, with {1000: 1, 2000: 0.1}, the ratio increases linearly
                    from 0.0 to 1.0 for t ∈ [0, 1000], decreases linearly from 1.0 to 0.1
                    for t ∈ [1000, 2000], and stays at 0.1 for t > 2000.


.. py:class:: AdamWarmup(learning_rate: float = 0.001, beta_1: float = 0.9, beta_2: float = 0.999, amsgrad: bool = False, decay: float = 0.0, weight_decay: float = 0.0, epsilon: float = 1e-07, lr_schedule: Optional[Dict[int, float]] = None, gradient_accumulation_steps: int = None, exclude_weight_decay_pattern: Optional[List[str]] = None, include_weight_decay_pattern: Optional[List[str]] = None, **kwargs)

   Bases: :py:obj:`langml.keras.optimizers.Optimizer`

   Abstract optimizer base class.

   This class supports distributed training. If you want to implement your own
   optimizer, please subclass this class instead of _BaseOptimizer.

   ### Usage

   ```python
   import tensorflow as tf
   from tensorflow import keras

   # Create an optimizer with the desired parameters.
   opt = keras.optimizers.SGD(learning_rate=0.1)
   var1, var2 = tf.Variable(1.0), tf.Variable(2.0)
   # `loss` is a callable that takes no argument and returns the value
   # to minimize.
   loss = lambda: 3 * var1 * var1 + 2 * var2 * var2
   # Call minimize to update the list of variables.
   opt.minimize(loss, var_list=[var1, var2])
   ```

   ### Processing gradients before applying them

   Calling `minimize()` takes care of both computing the gradients and
   applying them to the variables. If you want to process the gradients
   before applying them, you can instead use the optimizer in three steps:

   1.  Compute the gradients with `tf.GradientTape`.
   2.  Process the gradients as you wish.
   3.  Apply the processed gradients with `apply_gradients()`.

   Example:

   ```python
   import tensorflow as tf

   # Create an optimizer.
   opt = tf.keras.optimizers.experimental.SGD(learning_rate=0.1)
   var1, var2 = tf.Variable(1.0), tf.Variable(2.0)

   # Compute the gradients for a list of variables.
   with tf.GradientTape() as tape:
     loss = 3 * var1 * var1 + 2 * var2 * var2
   grads = tape.gradient(loss, [var1, var2])

   # Process the gradients.
   grads[0] = grads[0] + 1

   # Ask the optimizer to apply the gradients on variables.
   opt.apply_gradients(zip(grads, [var1, var2]))
   ```

   ### Dynamic learning rate

   A dynamic learning rate can be achieved by setting the learning rate to a
   built-in or custom `tf.keras.optimizers.schedules.LearningRateSchedule`.

   Example:

   >>> var = tf.Variable(np.random.random(size=(1,)))
   >>> learning_rate = tf.keras.optimizers.schedules.ExponentialDecay(
   ...   initial_learning_rate=.01, decay_steps=20, decay_rate=.1)
   >>> opt = tf.keras.optimizers.experimental.SGD(learning_rate=learning_rate)
   >>> loss = lambda: 3 * var
   >>> opt.minimize(loss, var_list=[var])

   ### Gradient clipping

   Users can clip the gradients before applying them to variables by setting
   `clipnorm`, `clipvalue`, or `global_clipnorm`. Note that only one of
   `clipnorm` and `global_clipnorm` can be set.

   Example:

   >>> opt = tf.keras.optimizers.experimental.SGD(learning_rate=1, clipvalue=1)
   >>> var1, var2 = tf.Variable(2.0), tf.Variable(2.0)
   >>> with tf.GradientTape() as tape:
   ...   loss = 2 * var1 + 2 * var2
   >>> grads = tape.gradient(loss, [var1, var2])
   >>> print([grads[0].numpy(), grads[1].numpy()])
   [2.0, 2.0]
   >>> opt.apply_gradients(zip(grads, [var1, var2]))
   >>> # Without clipping, we should get [0, 0], but as gradients are clipped
   >>> # to have max value 1, we get [1.0, 1.0].
   >>> print([var1.numpy(), var2.numpy()])
   [1.0, 1.0]

   ### Using weight decay.

   Weight decay can boost the model's performance in certain scenarios. Keras
   has built-in support for weight decay in all optimizers. Users can apply
   weight decay by setting the `weight_decay` argument.

   >>> opt = tf.keras.optimizers.experimental.SGD(1, weight_decay=0.004)
   >>> grads, var1, var2 = tf.zeros(()), tf.Variable(2.0), tf.Variable(2.0)
   >>> # You can exclude variables from weight decay, in this case we
   >>> # exclude `var2`.
   >>> opt.exclude_from_weight_decay(var_list=[var2])
   >>> opt.apply_gradients(zip([grads, grads], [var1, var2]))
   >>> print([var1.numpy(), var2.numpy()])
   [1.992, 2.0]

   ### Using exponential moving average.

   Empirically, it has been found that using the exponential moving average
   (EMA) of the trained parameters of a deep network achieves better
   performance than using its trained parameters directly. Keras optimizers
   allow users to compute this moving average and overwrite the model
   variables at the desired time.

   Example:

   ```python
   import tensorflow as tf

   # Create an SGD optimizer with EMA on. `ema_momentum` controls the decay
   # rate of the moving average. `ema_momentum=1` means no decay and the stored
   # moving average is always the model variable's initial value before training.
   # Conversely, `ema_momentum=0` is equivalent to not using EMA.
   # `ema_overwrite_frequency=3` means every 3 iterations, we overwrite the
   # trainable variables with their moving average values.
   opt = tf.keras.optimizers.experimental.SGD(
       learning_rate=1,
       use_ema=True,
       ema_momentum=0.5,
       ema_overwrite_frequency=3)
   var1, var2 = tf.Variable(2.0), tf.Variable(2.0)
   with tf.GradientTape() as tape:
     loss = var1 + var2
   grads = tape.gradient(loss, [var1, var2])
   # First iteration: [var1, var2] = [1.0, 1.0]
   opt.apply_gradients(zip(grads, [var1, var2]))
   print([var1, var2])

   # Second iteration: [var1, var2] = [0.0, 0.0]
   opt.apply_gradients(zip(grads, [var1, var2]))
   print([var1, var2])

   # Third iteration, without EMA, we should see [var1, var2] = [-1.0, -1.0],
   # but overwriting results in [var1, var2] = [-0.125, -0.125]. The full
   # calculation for the moving average of var1 is:
   # var1=2*0.5**3+1*(1-0.5)*0.5**2+0*(1-0.5)*0.5**1+(-1)*(1-0.5)=-0.125.
   opt.apply_gradients(zip(grads, [var1, var2]))
   print([var1, var2])
   ```

   When the optimizer is constructed with `use_ema=True`, users of a custom
   training loop can explicitly call `finalize_variable_values()` to overwrite
   trainable variables with their EMA values. `finalize_variable_values()` is
   called by default at the end of `model.fit()`.

   ### Use with `tf.distribute.Strategy`

   This optimizer class is `tf.distribute.Strategy` aware, which means it
   automatically sums gradients across all replicas. To aggregate gradients
   yourself, call `apply_gradients` with `skip_aggregate_gradients` set to
   True. This is useful if you need to process aggregated gradients.

   ```python
   # This example is not runnable; it is dummy code for a simple tutorial.
   strategy = tf.distribute.experimental.TPUStrategy()

   with strategy.scope():
     opt = tf.keras.optimizers.experimental.SGD()
     model = magic_function_that_returns_model()
     gradients = magic_function_that_returns_gradients()
     # Custom logic to aggregate gradients.
     gradients = strategy.reduce("SUM", gradients, axis=None)
     opt.apply_gradients(zip(gradients, model.trainable_variables),
         skip_aggregate_gradients=True)
   ```

   ### Creating a custom optimizer

   If you intend to create your own optimization algorithm, please inherit
   from this class and override the following methods:

   - `build`: Create your optimizer-related variables, such as `momentums` in
     the SGD optimizer.
   - `update_step`: Implement your optimizer's updating logic.
   - `get_config`: Serialization of the optimizer, including all
     hyperparameters.

   Your optimizer is automatically compatible with TensorFlow distributed
   training if you subclass `optimizer_experimental.Optimizer`.

   .. py:method:: _get_updates(loss, params)


   .. py:method:: get_updates(loss, params)


   .. py:method:: _handle_weight_decay_pattern(w)


   .. py:method:: get_config()

      Returns the config of the optimizer.

      An optimizer config is a serializable Python dictionary containing the
      configuration of an optimizer. The same optimizer can be reinstantiated
      later (without any saved state) from this configuration.

      Subclasses should override this method to include other
      hyperparameters.

      :returns: Python dictionary.


   .. py:method:: get_custom_objects()
      :staticmethod:



.. py:class:: AdamWarmupTF(learning_rate: float = 0.001, beta_1: float = 0.9, beta_2: float = 0.999, epsilon: float = 1e-07, weight_decay: float = 0.0, bias_correction: float = True, lr_schedule: Optional[Dict[int, float]] = None, gradient_accumulation_steps: int = None, exclude_weight_decay_pattern: Optional[List[str]] = None, include_weight_decay_pattern: Optional[List[str]] = None, name: str = 'AdamWarmupTF', **kwargs)

   Bases: :py:obj:`tensorflow.keras.optimizers.Optimizer`

   Adam optimizer with warmup for tf.keras.
   Modified from: https://github.com/bojone/bert4keras/blob/master/bert4keras/optimizers.py#L14

   .. py:method:: _create_slots(var_list)


   .. py:method:: _do_resource_apply(grad, var, indices=None)


   .. py:method:: _decayed_lr(var_dtype)


   .. py:method:: _resource_apply(grad, var, indices=None)


   .. py:method:: _handle_weight_decay_pattern(w)


   .. py:method:: _resource_apply_dense(grad, var)


   .. py:method:: _resource_apply_sparse(grad, var, indices)


   .. py:method:: get_config()

      Returns the config of the optimizer.

      An optimizer config is a serializable Python dictionary containing the
      configuration of an optimizer. The same optimizer can be reinstantiated
      later (without any saved state) from this configuration.

      Subclasses should override this method to include other
      hyperparameters.

      :returns: Python dictionary.


   .. py:method:: get_custom_objects()
      :staticmethod:



.. py:data:: AdamWarmup


.. py:data:: custom_objects
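
The schedule semantics described for ``piecewise_linear`` can be illustrated with a small pure-Python sketch. This is not the library's implementation, which operates on backend tensors. The name ``piecewise_linear_sketch`` is hypothetical, and the handling of ``from_zero`` (prepending a ``(0, 0.0)`` breakpoint when it is true) is an assumption inferred from the documented example, not confirmed by the source.

```python
from typing import Dict


def piecewise_linear_sketch(t: int, schedule: Dict[int, float], from_zero: bool = True) -> float:
    """Illustrative scalar version of a piecewise-linear ratio schedule.

    Hypothetical helper, not part of rannet.
    """
    if not schedule:
        raise ValueError("schedule must contain at least one breakpoint")

    # Sort breakpoints by iteration so segments are visited in order.
    points = sorted(schedule.items())

    # Assumption: from_zero=True means the ratio starts at 0.0 at iteration 0.
    if from_zero and points[0][0] != 0:
        points = [(0, 0.0)] + points

    prev_t, prev_v = points[0]
    if t <= prev_t:
        # Before the first breakpoint: hold the first value (assumption).
        return float(prev_v)

    for cur_t, cur_v in points[1:]:
        if t <= cur_t:
            # Linear interpolation inside the segment [prev_t, cur_t].
            frac = (t - prev_t) / (cur_t - prev_t)
            return prev_v + frac * (cur_v - prev_v)
        prev_t, prev_v = cur_t, cur_v

    # Past the last breakpoint: hold the final value.
    return float(prev_v)


schedule = {1000: 1, 2000: 0.1}
for step in (0, 500, 1000, 1500, 3000):
    print(step, piecewise_linear_sketch(step, schedule))
```

A dictionary of this shape matches the ``Dict[int, float]`` type that the ``lr_schedule`` argument of ``AdamWarmup`` and ``AdamWarmupTF`` is annotated with. How those classes combine the resulting ratio with ``learning_rate`` is not documented here.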