Base RL Q Model



polyaxon.models.rl.base.BaseQModel(mode, graph_fn, num_states, num_actions, loss_config=None, optimizer_config=None, eval_metrics_config=None, discount=0.97, exploration_config=None, use_target_graph=True, target_update_frequency=5, is_continuous=False, dueling='mean', use_expert_demo=False, summaries='all', clip_gradients=0.5, clip_embed_gradients=0.1, name='Model')

Base reinforcement learning model class.

  • Args:

    • mode: str. Specifies whether the model is used for training, evaluation or prediction. See Modes.
    • graph_fn: Graph function. Follows the signature:
      • Args:
        • mode: Specifies whether the graph is used for training, evaluation or prediction. See Modes.
        • inputs: the feature inputs.
    • loss_config: An instance of LossConfig.
    • num_states: int. The number of states.
    • num_actions: int. The number of actions.
    • optimizer_config: An instance of OptimizerConfig. Default value Adam.
    • eval_metrics_config: a list of MetricConfig instances.
    • discount: float. The discount factor on the target Q values.
    • exploration_config: An instance of ExplorationConfig.
    • use_target_graph: bool. Whether to use a second “target” network to compute the target Q values during updates.
    • target_update_frequency: int. The frequency at which the target graph is updated. Only used when use_target_graph is set to True.
    • is_continuous: bool. Whether the model is built for a continuous or a discrete space.
    • dueling: str or bool. Whether to compute the advantage and value functions separately.
      • Options:
        • True: creates the advantage and state value without any further computation.
        • mean, max, and naive: creates the advantage and state value, and computes Q = V(s) + A(s, a), where A = A - mean(A), A = A - max(A), or A = A, respectively.
    • use_expert_demo: bool. Whether to pretrain the model on human/expert data.
    • summaries: str or list. The verbosity of the tensorboard visualization. Possible values: all, activations, loss, learning_rate, variables, gradients
    • clip_gradients: float. Gradients clipping by global norm.
    • clip_embed_gradients: float. Embedding gradients clipping to a specified value.
    • name: str, the name of this model, everything will be encapsulated inside this scope.
  • Returns: EstimatorSpec
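
The dueling options listed above can be illustrated with a small sketch in plain Python. This is not the library's implementation; `dueling_q_values` is a hypothetical helper that applies Q = V(s) + A(s, a) for a single state, with the advantage adjusted according to the chosen mode.

```python
def dueling_q_values(state_value, advantages, mode="mean"):
    """Combine V(s) and A(s, a) into Q(s, a) for a single state.

    Hypothetical helper for illustration only; not part of polyaxon.
    """
    if mode == "mean":
        # A = A - mean(A)
        baseline = sum(advantages) / len(advantages)
    elif mode == "max":
        # A = A - max(A)
        baseline = max(advantages)
    elif mode == "naive":
        # A = A (no adjustment)
        baseline = 0.0
    else:
        raise ValueError("Unknown dueling mode: {}".format(mode))
    return [state_value + (a - baseline) for a in advantages]


q_mean = dueling_q_values(2.0, [1.0, 3.0, 5.0], mode="mean")
q_max = dueling_q_values(2.0, [1.0, 3.0, 5.0], mode="max")
q_naive = dueling_q_values(2.0, [1.0, 3.0, 5.0], mode="naive")
```

With the mean and max modes the best action keeps its rank, but the Q values are shifted so that the advantage stream is centered (mean) or non-positive (max).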



Creates the exploration op.

  • TODO: Decide whether the episode number should be passed here or handled internally by changing the optimize_loss function.



Create the chosen action with an exploration policy.

If inference mode is used, the actions are chosen directly without exploration.
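
As a rough sketch of this behaviour, the snippet below assumes an epsilon-greedy policy. The actual policy depends on the ExplorationConfig in use, and `choose_action`, `epsilon` and `is_inference` are hypothetical names introduced only for illustration.

```python
import random


def choose_action(q_values, epsilon, is_inference=False, rng=random):
    """Pick an action index from a list of Q values.

    Assumes epsilon-greedy exploration; hypothetical helper, not the
    polyaxon implementation.
    """
    if not is_inference and rng.random() < epsilon:
        # Explore: pick a random action.
        return rng.randrange(len(q_values))
    # Exploit (always the case in inference mode): pick the greedy action.
    return max(range(len(q_values)), key=lambda a: q_values[a])


greedy = choose_action([0.1, 0.9, 0.3], epsilon=1.0, is_inference=True)
explored = choose_action([0.1, 0.9, 0.3], epsilon=1.0, rng=random.Random(0))
```

In inference mode the value of `epsilon` is ignored, so `greedy` is always the index of the largest Q value.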



Create the new graph_fn based on the one specified by the user.

The structure of the graph is the following:

  1. Call the graph specified by the user.
  2. Create the advantage action probabilities and the state value.
  3. Return the probabilities; if a dueling method is specified, calculate the new probabilities.

  • Returns: function. The graph function. The graph function must return a QModelSpec.


_call_graph_fn(self, inputs)

Calls graph function.

First creates one or two graphs, i.e. the train graph and, optionally, the target graph. Returns the optimal action given an exploration policy.

If dueling is enabled, another layer is added that represents the state value.

  • Args:
    • inputs: Tensor or dict of tensors
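
To make the role of the target graph and the discount argument concrete, here is a minimal sketch of the standard one-step Q-learning target. It assumes that update rule; the source does not spell out the exact formula, and `q_learning_target` and `done` are hypothetical names.

```python
def q_learning_target(reward, next_q_values, done, discount=0.97):
    """One-step Q-learning target for a single transition.

    `next_q_values` stands for the Q values of the next state, as they
    would be computed by the target graph (or by the train graph when
    no target graph is used). Hypothetical helper for illustration.
    """
    if done:
        # Terminal transition: no future value to bootstrap from.
        return reward
    return reward + discount * max(next_q_values)


target = q_learning_target(1.0, [0.0, 2.0], done=False, discount=0.5)
terminal_target = q_learning_target(1.0, [0.0, 2.0], done=True, discount=0.5)
```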



Creates a copy operation from train graph to target graph.
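
Conceptually, the copy operation overwrites every target variable with the value of the matching train variable. The sketch below models variables as plain dictionaries of lists; `build_copy_op` and the variable names are hypothetical, and the real operation works on graph variables rather than Python objects.

```python
def build_copy_op(train_vars, target_vars):
    """Return a callable that copies train variables into target variables.

    Hypothetical stand-in for a graph copy op; variables are modelled as
    a dict mapping names to lists of floats.
    """
    def copy_op():
        for name, value in train_vars.items():
            # Copy the values so later train updates do not leak into the target.
            target_vars[name] = list(value)

    return copy_op


train_vars = {"dense/kernel": [0.5, -0.2], "dense/bias": [0.1]}
target_vars = {"dense/kernel": [0.0, 0.0], "dense/bias": [0.0]}
copy_op = build_copy_op(train_vars, target_vars)
copy_op()
```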


_build_train_op(self, loss)

Creates the training operation.

If use_target_graph is True, the target update op is also appended, taking target_update_frequency into account.
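
The scheduling can be pictured with the following sketch, which assumes the frequency is counted in training steps (the source does not say whether steps or episodes are used). `build_train_step`, `optimize_fn` and `copy_fn` are hypothetical names.

```python
def build_train_step(optimize_fn, copy_fn, use_target_graph=True,
                     target_update_frequency=5):
    """Return a train step that also refreshes the target graph periodically.

    Hypothetical sketch: assumes the update frequency counts training steps.
    """
    state = {"step": 0}

    def train_step():
        optimize_fn()
        state["step"] += 1
        if use_target_graph and state["step"] % target_update_frequency == 0:
            # Every `target_update_frequency` steps, sync the target graph.
            copy_fn()
        return state["step"]

    return train_step


calls = {"optimize": 0, "copy": 0}


def fake_optimize():
    calls["optimize"] += 1


def fake_copy():
    calls["copy"] += 1


train_step = build_train_step(fake_optimize, fake_copy)
for _ in range(12):
    train_step()
```

After 12 steps with the default frequency of 5, the optimizer has run 12 times and the copy has run twice (at steps 5 and 10).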


_preprocess(self, features, labels)

Model specific preprocessing.

  • Args:
    • features: array, Tensor or dict. The environment states. If a dict is passed, it must contain a state key.
    • labels: dict. A dictionary containing action, reward, advantage.
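
A minimal sketch of how the two accepted shapes of features could be normalized is shown below. It only covers the state lookup described above; `extract_state` is a hypothetical helper and does not reproduce the library's preprocessing.

```python
def extract_state(features):
    """Return the environment state from `features`.

    Hypothetical helper: accepts either the state itself or a dict that
    holds it under the "state" key.
    """
    if isinstance(features, dict):
        if "state" not in features:
            raise KeyError("features dict must contain a 'state' key")
        return features["state"]
    return features


state_from_dict = extract_state({"state": [0.0, 1.0, 0.0]})
state_from_array = extract_state([0.0, 1.0, 0.0])
```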