Description
I'm implementing a continuous action space for TF-Agents, where I want the action to be a four-element array: the first element is bounded to [0, 10] and the remaining three elements, $dx$, $dy$, and $dz$, are each bounded to [-1, 1].
I'm then using RandomTFPolicy to test taking actions for a batch of 5 observations. This is what I'm getting every time I run my code:
[[ 3.1179297 -0.37641406 -0.37641406 -0.37641406]
[ 8.263412 0.65268254 0.65268254 0.65268254]
[ 6.849456 0.36989117 0.36989117 0.36989117]
[ 0.06709099 -0.9865818 -0.9865818 -0.9865818 ]
[ 7.8749514 0.5749903 0.5749903 0.5749903 ]]
My questions are:
- How come $dx$, $dy$, and $dz$ are the same float? Why aren't they sampled independently?
- How come I get the exact same action values every time I run my code? I'm not setting random seeds anywhere.
I'm on arm64 macOS with:
python==3.11.13
tf-agents==0.19.0
tensorflow==2.15.1
tensorflow-metal==1.1.0
tensorflow-probability==0.23.0
numpy==1.26.4
Interestingly, this does not happen on x86 macOS or on a Windows machine with the same package versions! There, all values are random:
[[ 2.9280972e+00 -3.7891769e-01 -7.7160120e-02 8.4350657e-01]
[ 2.3010242e+00 -1.9348240e-01 -6.7645931e-01 2.9825187e-01]
[ 6.4993248e+00 4.0297508e-03 -5.8490920e-01 -5.0786805e-01]
[ 9.6005363e+00 -2.8406858e-01 -7.8258038e-02 5.8963799e-01]
[ 1.4861953e+00 -8.2189059e-01 -2.9714632e-01 -5.1117587e-01]]
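Incidentally, the first column looks like the same underlying uniform rescaled to [0, 10] (e.g. 3.1179297 / 10 ≈ (-0.37641406 + 1) / 2 ≈ 0.3118), so it looks as if only one random value is drawn per row. Below is a TF-only sketch, separate from the full repro that follows (the bounds just mirror my action spec), that could help check whether tf.random.uniform itself shows the same pattern under tensorflow-metal; running it twice in fresh processes would also show whether the cross-run determinism comes from TF rather than TF-Agents:

import tensorflow as tf

# Mirror the action-spec bounds: first column in [0, 10], the rest in [-1, 1].
low = tf.constant([0.0, -1.0, -1.0, -1.0], dtype=tf.float32)
high = tf.constant([10.0, 1.0, 1.0, 1.0], dtype=tf.float32)

# Draw a (5, 4) batch of uniforms in [0, 1) and rescale per column.
u = tf.random.uniform(shape=(5, 4), minval=0.0, maxval=1.0, dtype=tf.float32)
samples = low + u * (high - low)

# On arm64 + tensorflow-metal, check whether all four columns of each row
# come from the same underlying uniform value.
print(u.numpy())
print(samples.numpy())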
My code:
import numpy as np
import tensorflow as tf
from tf_agents.specs import array_spec
from tf_agents.trajectories import time_step as ts
from tf_agents.policies import random_tf_policy
observation_spec = array_spec.BoundedArraySpec(
    shape=(64, 64, 2),
    dtype=np.float32,
    minimum=0.0,
    maximum=1.0,
    name="observation",
)
action_spec = array_spec.BoundedArraySpec(
    shape=(4,),
    dtype=np.float32,
    minimum=np.array([0.0, -1.0, -1.0, -1.0], dtype=np.float32),
    maximum=np.array([10.0, 1.0, 1.0, 1.0], dtype=np.float32),
    name="action",
)
time_step_spec = ts.time_step_spec(observation_spec)
policy = random_tf_policy.RandomTFPolicy(time_step_spec=time_step_spec,
                                         action_spec=action_spec)
obs = tf.random.uniform(shape=(5, 64, 64, 2), minval=0.0, maxval=1.0, dtype=tf.float32)
timestep = ts.restart(observation=obs, batch_size=5)
action_step = policy.action(timestep, seed=None)
actions = action_step.action
print(actions.numpy())

Moreover, I'm seeing that agent.collect_policy samples the same action for the same observation value, which is the behaviour I'd expect from agent.policy, not from collect_policy. My understanding is that collect_policy should always be stochastic? Here are some actions where the last 11 correspond to the agent seeing the exact same observation (note that the actions are deterministic at this point but should be stochastic):
<tf.Tensor: shape=(1, 20, 4), dtype=float32, numpy=
array([[[ 4.543779 , -0.7549822 , -0.98628926, -0.6924889 ],
[ 4.543779 , -0.75414133, -0.9862873 , -0.6924889 ],
[ 4.543778 , -0.7378491 , -0.98595184, -0.6924772 ],
[ 4.5437074 , -0.69559884, -0.98290896, -0.69219804],
[ 4.5435147 , -0.66731834, -0.97919464, -0.6916727 ],
[ 4.543341 , -0.6532383 , -0.9768447 , -0.6912718 ],
[ 4.5433545 , -0.6541612 , -0.9770225 , -0.6912997 ],
[ 4.543534 , -0.6692811 , -0.979528 , -0.69171715],
[ 4.5436664 , -0.6870763 , -0.9819559 , -0.69207287],
[ 4.5437007 , -0.6941073 , -0.982763 , -0.6921773 ],
[ 4.5437007 , -0.6941073 , -0.982763 , -0.6921773 ],
[ 4.5437007 , -0.6941073 , -0.982763 , -0.6921773 ],
[ 4.5437007 , -0.6941073 , -0.982763 , -0.6921773 ],
[ 4.5437007 , -0.6941073 , -0.982763 , -0.6921773 ],
[ 4.5437007 , -0.6941073 , -0.982763 , -0.6921773 ],
[ 4.5437007 , -0.6941073 , -0.982763 , -0.6921773 ],
[ 4.5437007 , -0.6941073 , -0.982763 , -0.6921773 ],
[ 4.5437007 , -0.6941073 , -0.982763 , -0.6921773 ],
[ 4.5437007 , -0.6941073 , -0.982763 , -0.6921773 ],
[ 4.5437007 , -0.6941073 , -0.982763 , -0.6921773 ]]],
dtype=float32)>
I've used a simple REINFORCE agent here:
agent = reinforce_agent.ReinforceAgent(time_step_spec=train_env.time_step_spec(),
                                       action_spec=train_env.action_spec(),
                                       actor_network=actor_net,
                                       optimizer=optimizer,
                                       train_step_counter=train_step_counter,
                                       gamma=0.95,
                                       normalize_returns=False,
                                       entropy_regularization=None)
agent.initialize()
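For completeness, here is the kind of check I'd use to confirm whether collect_policy is actually stochastic: call it several times on the same time step and also sample its distribution directly. This is a sketch rather than part of my original run, and it assumes agent is the REINFORCE agent above, train_env is a batched TF environment (e.g. a TFPyEnvironment wrapper), and the action spec is a single tensor:

# Sketch only (not from my original run). Assumes `agent` and `train_env`
# are the objects from the setup above.
time_step = train_env.reset()

# Calling the collect policy repeatedly on the *same* time step should give
# different actions if it is really sampling from the actor distribution.
for _ in range(5):
    action_step = agent.collect_policy.action(time_step)
    print(action_step.action.numpy())

# The underlying distribution can also be inspected and sampled directly
# (assuming a single-tensor action spec, so `.action` is one distribution).
dist_step = agent.collect_policy.distribution(time_step)
print(dist_step.action.sample(3).numpy())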