:py:mod:`rannet.optimizer`
==========================

.. py:module:: rannet.optimizer


Module Contents
---------------

Classes
~~~~~~~

.. autoapisummary::

   rannet.optimizer.AdamWarmup
   rannet.optimizer.AdamWarmupTF


Functions
~~~~~~~~~

.. autoapisummary::

   rannet.optimizer.piecewise_linear


Attributes
~~~~~~~~~~

.. autoapisummary::

   rannet.optimizer.symbolic
   rannet.optimizer.AdamWarmup
   rannet.optimizer.custom_objects


.. py:data:: symbolic

   
.. py:function:: piecewise_linear(t: int, schedule: Dict[int, float], from_zero: bool = True)

   Piecewise linear schedule.

   Modified from:
       https://github.com/bojone/bert4keras/blob/9c1c916def4d515a046c414c9849b2e7e11af1e3/bert4keras/backend.py#L73

   :param t: int, the current iteration (training step).
   :param schedule: Dict[int, float], a mapping from iteration to ratio. For example, with {1000: 1, 2000: 0.1}:
                    for t ∈ [0, 1000], the ratio increases linearly from 0.0 to 1.0;
                    for t ∈ [1000, 2000], the ratio decreases linearly from 1.0 to 0.1;
                    for t > 2000, the ratio stays at 0.1.
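
   The schedule in the example above can be sketched in plain Python. This is
   an illustration only, not the library's implementation: the helper name
   `piecewise_linear_sketch` is hypothetical, and the handling of `from_zero`
   (prepending the point `(0, 0.0)` when step 0 is not in the schedule) is an
   assumption inferred from the example.

   ```python
   from typing import Dict


   def piecewise_linear_sketch(t: int, schedule: Dict[int, float], from_zero: bool = True) -> float:
       """Hypothetical pure-Python illustration of a piecewise linear ratio."""
       points = sorted(schedule.items())
       # Assumption: with from_zero=True the curve starts at (0, 0.0)
       # unless step 0 is already part of the schedule.
       if from_zero and points[0][0] != 0:
           points = [(0, 0.0)] + points
       # At or before the first breakpoint, hold the first value.
       if t <= points[0][0]:
           return float(points[0][1])
       # Interpolate linearly between neighbouring breakpoints.
       for (t0, r0), (t1, r1) in zip(points, points[1:]):
           if t <= t1:
               return r0 + (r1 - r0) * (t - t0) / (t1 - t0)
       # Past the last breakpoint, hold the final value.
       return float(points[-1][1])


   schedule = {1000: 1, 2000: 0.1}
   ratios = [piecewise_linear_sketch(t, schedule) for t in (0, 500, 1000, 1500, 2000, 3000)]
   ```

   Under these assumptions, `ratios` is approximately 0.0, 0.5, 1.0, 0.55, 0.1
   and 0.1 (up to floating-point rounding), matching the three regimes
   described for `schedule`.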


.. py:class:: AdamWarmup(learning_rate: float = 0.001, beta_1: float = 0.9, beta_2: float = 0.999, amsgrad: bool = False, decay: float = 0.0, weight_decay: float = 0.0, epsilon: float = 1e-07, lr_schedule: Optional[Dict[int, float]] = None, gradient_accumulation_steps: int = None, exclude_weight_decay_pattern: Optional[List[str]] = None, include_weight_decay_pattern: Optional[List[str]] = None, **kwargs)


   Bases: :py:obj:`langml.keras.optimizers.Optimizer`

   Abstract optimizer base class.

   This class supports distributed training. If you want to implement your own
   optimizer, please subclass this class instead of _BaseOptimizer.

   ### Usage

   ```python
   import tensorflow as tf

   # Create an optimizer with the desired parameters.
   opt = tf.keras.optimizers.SGD(learning_rate=0.1)
   var1, var2 = tf.Variable(1.0), tf.Variable(2.0)
   # `loss` is a callable that takes no argument and returns the value
   # to minimize.
   loss = lambda: 3 * var1 * var1 + 2 * var2 * var2
   # Call minimize to update the list of variables.
   opt.minimize(loss, var_list=[var1, var2])
   ```

   ### Processing gradients before applying them

   Calling `minimize()` takes care of both computing the gradients and
   applying them to the variables. If you want to process the gradients
   before applying them you can instead use the optimizer in three steps:

   1.  Compute the gradients with `tf.GradientTape`.
   2.  Process the gradients as you wish.
   3.  Apply the processed gradients with `apply_gradients()`.

   Example:

   ```python
   import tensorflow as tf

   # Create an optimizer.
   opt = tf.keras.optimizers.experimental.SGD(learning_rate=0.1)
   var1, var2 = tf.Variable(1.0), tf.Variable(2.0)

   # Compute the gradients for a list of variables.
   with tf.GradientTape() as tape:
     loss = 3 * var1 * var1 + 2 * var2 * var2
   grads = tape.gradient(loss, [var1, var2])

   # Process the gradients.
   grads[0] = grads[0] + 1

   # Ask the optimizer to apply the gradients on variables.
   opt.apply_gradients(zip(grads, [var1, var2]))
   ```

   ### Dynamic learning rate

   A dynamic learning rate can be achieved by passing a built-in or custom
   `tf.keras.optimizers.schedules.LearningRateSchedule` as the learning rate.

   Example:

   >>> var = tf.Variable(np.random.random(size=(1,)))
   >>> learning_rate = tf.keras.optimizers.schedules.ExponentialDecay(
   ...   initial_learning_rate=.01, decay_steps=20, decay_rate=.1)
   >>> opt = tf.keras.optimizers.experimental.SGD(learning_rate=learning_rate)
   >>> loss = lambda: 3 * var
   >>> opt.minimize(loss, var_list=[var])

   ### Gradient clipping

   Users can clip gradients before they are applied to variables by setting
   `clipnorm`, `clipvalue`, or `global_clipnorm`. Note that only one of
   `clipnorm` and `global_clipnorm` may be set.

   Example:

   >>> opt = tf.keras.optimizers.experimental.SGD(learning_rate=1, clipvalue=1)
   >>> var1, var2 = tf.Variable(2.0), tf.Variable(2.0)
   >>> with tf.GradientTape() as tape:
   ...   loss = 2 * var1 + 2 * var2
   >>> grads = tape.gradient(loss, [var1, var2])
   >>> print([grads[0].numpy(), grads[1].numpy()])
   [2.0, 2.0]
   >>> opt.apply_gradients(zip(grads, [var1, var2]))
   >>> # Without clipping, we should get [0, 0], but as gradients are clipped
   >>> # to have max value 1, we get [1.0, 1.0].
   >>> print([var1.numpy(), var2.numpy()])
   [1.0, 1.0]

   ### Using weight decay

   Weight decay can improve a model's performance in certain scenarios. Keras
   has built-in support for weight decay in all optimizers. Users can apply
   weight decay by setting the `weight_decay` argument.

   >>> opt = tf.keras.optimizers.experimental.SGD(1, weight_decay=0.004)
   >>> grads, var1, var2 = tf.zeros(()), tf.Variable(2.0), tf.Variable(2.0)
   >>> # You can exclude variables from weight decay, in this case we
   >>> # exclude `var2`.
   >>> opt.exclude_from_weight_decay(var_list=[var2])
   >>> opt.apply_gradients(zip([grads, grads], [var1, var2]))
   >>> print([var1.numpy(), var2.numpy()])
   [1.992, 2.0]
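
   The value 1.992 in the output above follows from a decoupled decay step. A
   minimal plain-Python sketch of that arithmetic, assuming the decay term is
   `learning_rate * weight_decay * value` and is applied before the gradient
   update (the helper `sgd_step_with_weight_decay` is hypothetical and not
   part of Keras or rannet):

   ```python
   def sgd_step_with_weight_decay(value, grad, learning_rate, weight_decay, decay_enabled=True):
       """Hypothetical helper: one SGD step with decoupled weight decay."""
       if decay_enabled:
           # Decoupled decay: shrink the variable in proportion to its own value.
           value = value - learning_rate * weight_decay * value
       # Plain SGD update using the gradient.
       return value - learning_rate * grad


   # Zero gradients, as in the doctest above, so only the decay changes var1.
   var1 = sgd_step_with_weight_decay(2.0, 0.0, learning_rate=1.0, weight_decay=0.004)
   # var2 is excluded from weight decay, so it keeps its value.
   var2 = sgd_step_with_weight_decay(2.0, 0.0, learning_rate=1.0, weight_decay=0.004,
                                     decay_enabled=False)
   ```

   With a zero gradient, `var1` shrinks by `1 * 0.004 * 2.0 = 0.008`, while
   the excluded `var2` is unchanged.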


   ### Using exponential moving average

   Empirically, using the exponential moving average (EMA) of a deep
   network's trained parameters has been found to perform better than using
   the trained parameters directly. Keras optimizers allow users to compute
   this moving average and overwrite the model variables at a chosen time.

   Example:

   ```python
   import tensorflow as tf

   # Create an SGD optimizer with EMA on. `ema_momentum` controls the decay
   # rate of the moving average. `ema_momentum=1` means no decay, so the stored
   # moving average always equals the model variable's initial value before
   # training. Conversely, `ema_momentum=0` is equivalent to not using EMA.
   # `ema_overwrite_frequency=3` means every 3 iterations, we overwrite the
   # trainable variables with their moving average values.
   opt = tf.keras.optimizers.experimental.SGD(
       learning_rate=1,
       use_ema=True,
       ema_momentum=0.5,
       ema_overwrite_frequency=3)
   var1, var2 = tf.Variable(2.0), tf.Variable(2.0)
   with tf.GradientTape() as tape:
     loss = var1 + var2
   grads = tape.gradient(loss, [var1, var2])
   # First iteration: [var1, var2] = [1.0, 1.0]
   opt.apply_gradients(zip(grads, [var1, var2]))
   print([var1, var2])

   # Second iteration: [var1, var2] = [0.0, 0.0]
   opt.apply_gradients(zip(grads, [var1, var2]))
   print([var1, var2])

   # Third iteration, without EMA, we should see [var1, var2] = [-1.0, -1.0],
   # but overwriting results in [var1, var2] = [-0.125, -0.125]. The full
   # calculation for the moving average of var1 is:
   # var1=2*0.5**3+1*(1-0.5)*0.5**2+0*(1-0.5)*0.5**1+(-1)*(1-0.5)=-0.125.
   opt.apply_gradients(zip(grads, [var1, var2]))
   print([var1, var2])

   ```
   When the optimizer is constructed with `use_ema=True` in a custom training
   loop, users can explicitly call `finalize_variable_values()` to overwrite
   trainable variables with their EMA values. By default,
   `finalize_variable_values()` is called at the end of `model.fit()`.
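
   The moving-average arithmetic in the EMA example above can be checked
   without TensorFlow. The sketch below is plain Python; the helper
   `ema_trace` is hypothetical, and it assumes the average is initialised to
   the variable's initial value and updated as
   `momentum * average + (1 - momentum) * value` after every step.

   ```python
   def ema_trace(initial, updates, momentum):
       """Hypothetical helper: track a variable and its exponential moving average."""
       # Assumption: the moving average starts at the variable's initial value.
       value, average = initial, initial
       history = []
       for delta in updates:
           # The optimizer step changes the variable first.
           value = value + delta
           # Then the moving average is blended with the new value.
           average = momentum * average + (1.0 - momentum) * value
           history.append((value, average))
       return history


   # SGD with learning_rate=1 and a constant gradient of 1 subtracts 1 per step.
   trace = ema_trace(initial=2.0, updates=[-1.0, -1.0, -1.0], momentum=0.5)
   ```

   After the third step the variable itself is -1.0 while its moving average
   is -0.125, which is the value written back when
   `ema_overwrite_frequency=3`.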

   ### Use with `tf.distribute.Strategy`

   This optimizer class is `tf.distribute.Strategy` aware, which means it
   automatically sums gradients across all replicas. To aggregate gradients
   yourself, call `apply_gradients` with `skip_aggregate_gradients` set to
   True.  This is useful if you need to process aggregated gradients.

   ```python
   # This example is not runnable; it is placeholder code for illustration
   # only.
   strategy = tf.distribute.experimental.TPUStrategy()

   with strategy.scope():
     opt = tf.keras.optimizers.experimental.SGD()
     model = magic_function_that_returns_model()
     gradients = magic_function_that_returns_gradients()
     # Custom logic to aggregate gradients.
     gradients = strategy.reduce("SUM", gradients, axis=None)
     opt.apply_gradients(zip(gradients, model.trainable_variables),
         skip_aggregate_gradients=True)
   ```

   ### Creating a custom optimizer

   If you intend to create your own optimization algorithm, please inherit from
   this class and override the following methods:

     - `build`: Create your optimizer-related variables, such as `momentums` in
       SGD optimizer.
     - `update_step`: Implement your optimizer's updating logic.
     - `get_config`: Serialize the optimizer, including all
       hyperparameters.

   Your optimizer will automatically be compatible with TensorFlow distributed
   training if you subclass `optimizer_experimental.Optimizer`.


   .. py:method:: _get_updates(loss, params)


   .. py:method:: get_updates(loss, params)


   .. py:method:: _handle_weight_decay_pattern(w)


   .. py:method:: get_config()

      Returns the config of the optimizer.

      An optimizer config is a Python dictionary (serializable)
      containing the configuration of an optimizer.
      The same optimizer can be reinstantiated later
      (without any saved state) from this configuration.

      Subclasses should override this method to include any additional
      hyperparameters.

      :returns: Python dictionary.


   .. py:method:: get_custom_objects()
      :staticmethod:


.. py:class:: AdamWarmupTF(learning_rate: float = 0.001, beta_1: float = 0.9, beta_2: float = 0.999, epsilon: float = 1e-07, weight_decay: float = 0.0, bias_correction: float = True, lr_schedule: Optional[Dict[int, float]] = None, gradient_accumulation_steps: int = None, exclude_weight_decay_pattern: Optional[List[str]] = None, include_weight_decay_pattern: Optional[List[str]] = None, name: str = 'AdamWarmupTF', **kwargs)


   Bases: :py:obj:`tensorflow.keras.optimizers.Optimizer`

   Adam optimizer with learning-rate warmup for `tf.keras`.

   Modified from: https://github.com/bojone/bert4keras/blob/master/bert4keras/optimizers.py#L14

   .. py:method:: _create_slots(var_list)


   .. py:method:: _do_resource_apply(grad, var, indices=None)


   .. py:method:: _decayed_lr(var_dtype)


   .. py:method:: _resource_apply(grad, var, indices=None)


   .. py:method:: _handle_weight_decay_pattern(w)


   .. py:method:: _resource_apply_dense(grad, var)


   .. py:method:: _resource_apply_sparse(grad, var, indices)


   .. py:method:: get_config()

      Returns the config of the optimizer.

      An optimizer config is a Python dictionary (serializable)
      containing the configuration of an optimizer.
      The same optimizer can be reinstantiated later
      (without any saved state) from this configuration.

      Subclasses should override this method to include any additional
      hyperparameters.

      :returns: Python dictionary.


   .. py:method:: get_custom_objects()
      :staticmethod:


.. py:data:: AdamWarmup

   
.. py:data:: custom_objects