RL Q Models

[source]

DQNModel

polyaxon.models.rl.base.DQNModel(mode, graph_fn, num_states, num_actions, loss_config=None, optimizer_config=None, eval_metrics_config=None, discount=0.97, exploration_config=None, use_target_graph=True, target_update_frequency=5, is_continuous=False, dueling='mean', use_expert_demo=False, summaries='all', clip_gradients=0.5, clip_embed_gradients=0.1, name='Model')

Implements a deep Q model.

  • Args:

    • mode: str. Specifies if this is training, evaluation or prediction. See Modes.
    • graph_fn: Graph function. Follows the signature:
      • Args:
      • mode: Specifies if this is training, evaluation or prediction. See Modes.
      • inputs: the feature inputs.
    • loss_config: An instance of LossConfig.
    • num_states: int. The number of states.
    • num_actions: int. The number of actions.
    • optimizer_config: An instance of OptimizerConfig. Default value: Adam.
    • eval_metrics_config: A list of MetricConfig instances.
    • discount: float. The discount factor on the target Q values.
    • exploration_config: An instance of ExplorationConfig.
    • use_target_graph: bool. To use a second “target” network, which we will use to compute target Q values during our updates.
    • target_update_frequency: int. The frequency at which to update the target graph. Only used when use_target_graph is set to True.
    • is_continuous: bool. Is the model built for a continuous or discrete space.
    • dueling: str or bool. To compute separately the advantage and value functions.
      • Options:
      • True: creates advantage and state value without any further computation.
      • mean, max, and naive: creates advantage and state value, and computes Q = V(s) + A(s, a), where A is replaced by A - mean(A), A - max(A), or A unchanged, respectively.
    • use_expert_demo: bool. Whether to pretrain the model on human/expert data.
    • summaries: str or list. The verbosity of the tensorboard visualization. Possible values: all, activations, loss, learning_rate, variables, gradients
    • clip_gradients: float. Gradients clipping by global norm.
    • clip_embed_gradients: float. Embedding gradients clipping to a specified value.
    • name: str. The name of this model; everything will be encapsulated inside this scope.
  • Returns: EstimatorSpec
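To make the dueling option concrete, here is an illustrative sketch in plain NumPy (not Polyaxon code; `dueling_q` is a hypothetical helper) of how a state value V(s) and per-action advantages A(s, a) are combined into Q values under each setting:

```python
import numpy as np

def dueling_q(value, advantages, mode="mean"):
    """Combine V(s) and A(s, a) into Q(s, a), mirroring the `dueling` option."""
    advantages = np.asarray(advantages, dtype=float)
    if mode == "mean":
        # Q = V(s) + (A - mean(A)): the common, stable aggregation.
        adv = advantages - advantages.mean()
    elif mode == "max":
        # Q = V(s) + (A - max(A)): the best action's Q equals V(s).
        adv = advantages - advantages.max()
    elif mode == "naive":
        # Q = V(s) + A: no normalization of the advantages.
        adv = advantages
    else:
        raise ValueError("mode must be 'mean', 'max', or 'naive'")
    return value + adv

q = dueling_q(1.0, [2.0, 0.0, -2.0], mode="mean")  # mean(A) is 0 here
```

The mean and max variants remove the offset ambiguity between V and A, which is why `dueling='mean'` is the default above.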


[source]

DDQNModel

polyaxon.models.rl.base.DDQNModel(mode, graph_fn, num_states, num_actions, loss_config=None, optimizer_config=None, eval_metrics_config=None, discount=0.97, exploration_config=None, use_target_graph=True, target_update_frequency=5, is_continuous=False, dueling='mean', use_expert_demo=False, summaries='all', clip_gradients=0.5, clip_embed_gradients=0.1, name='Model')

Implements a double deep Q model.

  • Args:

    • mode: str. Specifies if this is training, evaluation or prediction. See Modes.
    • graph_fn: Graph function. Follows the signature:
      • Args:
      • mode: Specifies if this is training, evaluation or prediction. See Modes.
      • inputs: the feature inputs.
    • loss_config: An instance of LossConfig.
    • num_states: int. The number of states.
    • num_actions: int. The number of actions.
    • optimizer_config: An instance of OptimizerConfig. Default value: Adam.
    • eval_metrics_config: A list of MetricConfig instances.
    • discount: float. The discount factor on the target Q values.
    • exploration_config: An instance of ExplorationConfig.
    • use_target_graph: bool. To use a second “target” network, which we will use to compute target Q values during our updates.
    • target_update_frequency: int. The frequency at which to update the target graph. Only used when use_target_graph is set to True.
    • is_continuous: bool. Is the model built for a continuous or discrete space.
    • dueling: str or bool. To compute separately the advantage and value functions.
      • Options:
      • True: creates advantage and state value without any further computation.
      • mean, max, and naive: creates advantage and state value, and computes Q = V(s) + A(s, a), where A is replaced by A - mean(A), A - max(A), or A unchanged, respectively.
    • use_expert_demo: bool. Whether to pretrain the model on human/expert data.
    • summaries: str or list. The verbosity of the tensorboard visualization. Possible values: all, activations, loss, learning_rate, variables, gradients
    • clip_gradients: float. Gradients clipping by global norm.
    • clip_embed_gradients: float. Embedding gradients clipping to a specified value.
    • name: str. The name of this model; everything will be encapsulated inside this scope.
  • Returns: EstimatorSpec
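What distinguishes the double deep Q model is how the bootstrap target is formed when use_target_graph is enabled: the online network selects the next action and the target network evaluates it, which reduces the overestimation bias of plain Q-learning. A minimal NumPy sketch (illustrative only, not Polyaxon code; `double_dqn_target` is a hypothetical helper):

```python
import numpy as np

def double_dqn_target(reward, q_online_next, q_target_next,
                      discount=0.97, done=False):
    """Double-DQN target: y = r + discount * Q_target(s', argmax_a Q_online(s', a))."""
    if done:
        # Terminal transition: no bootstrapping.
        return reward
    best_action = int(np.argmax(q_online_next))      # online net selects the action
    return reward + discount * q_target_next[best_action]  # target net evaluates it
```

In plain DQN the same (target) network both selects and evaluates the action, i.e. the target uses max(q_target_next) directly.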


[source]

NAFModel

polyaxon.models.rl.naf.NAFModel(mode, graph_fn, loss_config, num_states, num_actions, optimizer_config=None, eval_metrics_config=None, discount=0.97, exploration_config=None, use_target_graph=True, target_update_frequency=5, is_continuous=True, use_expert_demo=False, summaries='all', clip_gradients=0.5, clip_embed_gradients=0.1, name='Model')

Implements a normalized advantage functions model.

  • Args:

    • mode: str. Specifies if this is training, evaluation or prediction. See Modes.
    • graph_fn: Graph function. Follows the signature:
      • Args:
      • mode: Specifies if this is training, evaluation or prediction. See Modes.
      • inputs: the feature inputs.
    • loss_config: An instance of LossConfig.
    • num_states: int. The number of states.
    • num_actions: int. The number of actions.
    • optimizer_config: An instance of OptimizerConfig. Default value: Adam.
    • eval_metrics_config: A list of MetricConfig instances.
    • discount: float. The discount factor on the target Q values.
    • exploration_config: An instance of ExplorationConfig.
    • use_target_graph: bool. To use a second “target” network, which we will use to compute target Q values during our updates.
    • target_update_frequency: int. The frequency at which to update the target graph. Only used when use_target_graph is set to True.
    • is_continuous: bool. Is the model built for a continuous or discrete space.
    • use_expert_demo: bool. Whether to pretrain the model on human/expert data.
    • summaries: str or list. The verbosity of the tensorboard visualization. Possible values: all, activations, loss, learning_rate, variables, gradients
    • clip_gradients: float. Gradients clipping by global norm.
    • clip_embed_gradients: float. Embedding gradients clipping to a specified value.
    • name: str. The name of this model; everything will be encapsulated inside this scope.
  • Returns: EstimatorSpec
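The normalized advantage functions model (note is_continuous defaults to True here) makes Q-learning tractable for continuous actions by restricting the advantage to a quadratic form, so the greedy action is simply the network's mean output mu(s). A plain NumPy sketch of the idea (illustrative only, not Polyaxon code; `naf_q` is a hypothetical helper):

```python
import numpy as np

def naf_q(value, mu, P, action):
    """NAF decomposition: Q(s, a) = V(s) - 1/2 (a - mu)^T P (a - mu).

    P is positive semi-definite (in NAF it is built as L L^T from a
    network output), so the advantage term is <= 0 and Q is maximized
    exactly at the continuous action a = mu(s).
    """
    delta = np.asarray(action, dtype=float) - np.asarray(mu, dtype=float)
    return float(value - 0.5 * delta @ P @ delta)
```

Because the maximizer is available in closed form, no discrete argmax over actions is needed when forming the discounted target Q values.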