Add BitFlipping Environment
@@ -0,0 +1,8 @@
@testset "bit_flipping_env" begin
    env = BitFlippingEnv(; N = 7)
Use an independent rng, like in
ReinforcementLearningEnvironments.jl/test/environments/examples/tiger_problem_env.jl (Lines 3 to 5 in ed9e04c):

rng = StableRNG(123)
obs_prob = 0.85
env = TigerProblemEnv(; rng = rng, obs_prob = obs_prob)

to avoid the GLOBAL_RNG being polluted.
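A minimal sketch of what that could look like for this test, assuming BitFlippingEnv accepts an rng keyword argument (the env.rng field used later in this PR suggests it does):

using StableRNGs

rng = StableRNG(123)
env = BitFlippingEnv(; N = 7, rng = rng)  # hypothetical rng keyword; keeps GLOBAL_RNG untouched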
RLBase.DynamicStyle(::BitFlippingEnv) = SEQUENTIAL
RLBase.ActionStyle(::BitFlippingEnv) = MINIMAL_ACTION_SET
RLBase.InformationStyle(::BitFlippingEnv) = PERFECT_INFORMATION
RLBase.StateStyle(::BitFlippingEnv) = Observation{BitArray{1}}()
Since you support two state styles in this environment, you can return both of them here.

Suggested change:
- RLBase.StateStyle(::BitFlippingEnv) = Observation{BitArray{1}}()
+ RLBase.StateStyle(::BitFlippingEnv) = (Observation{BitArray{1}}(), GoalState())
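Returning both styles would let callers ask for either representation through RLBase's state(env, style) dispatch, roughly like this (a sketch; it assumes the PR defines the corresponding state(env, ::GoalState) method and that the first listed style acts as the default):

state(env)                              # default style, the observed bit vector
state(env, Observation{BitArray{1}}()) # current bits, explicitly
state(env, GoalState())                # the goal bit vector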
if env.state == env.goal_state
    1.0
else
    0.0
I think we should return -1 instead of 0 here, based on the description in the original paper:

"For every episode we sample uniformly an initial state as well as a target state and the policy gets a reward of −1 as long as it is not in the target state."
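That would make the reward roughly the following (a sketch, reusing the state and goal_state fields from this PR):

function RLBase.reward(env::BitFlippingEnv)
    if env.state == env.goal_state
        0.0   # target reached
    else
        -1.0  # per-step penalty until the goal is hit, as in the HER paper
    end
end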
RLBase.ActionStyle(::BitFlippingEnv) = MINIMAL_ACTION_SET
RLBase.InformationStyle(::BitFlippingEnv) = PERFECT_INFORMATION
RLBase.StateStyle(::BitFlippingEnv) = Observation{BitArray{1}}()
RLBase.RewardStyle(::BitFlippingEnv) = TERMINAL_REWARD
If we return a reward of -1 at each non-terminated step, then I think this environment is a STEP_REWARD env?
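In that case the trait above would presumably become:

RLBase.RewardStyle(::BitFlippingEnv) = STEP_REWARD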
struct GoalState{T} <: RLBase.AbstractStateStyle end
GoalState() = GoalState{Any}()

mutable struct BitFlippingEnv <: AbstractEnv
I think we can make this immutable
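A sketch of an immutable version, assuming the fields this PR already uses (the concrete field types are guesses); the BitArray contents can still be refreshed in place, see the reset! notes further down:

struct BitFlippingEnv{R<:AbstractRNG} <: AbstractEnv  # AbstractRNG comes from Random
    N::Int
    rng::R
    state::BitArray{1}
    goal_state::BitArray{1}
end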
RLBase.is_terminated(env::BitFlippingEnv) = env.state == env.goal_state

function RLBase.reset!(env::BitFlippingEnv)
    env.state = bitrand(env.rng,env.N)
Suggested change:
- env.state = bitrand(env.rng,env.N)
+ env.state .= bitrand(env.rng,env.N)
function RLBase.reset!(env::BitFlippingEnv)
    env.state = bitrand(env.rng,env.N)
    env.goal_state = bitrand(env.rng,env.N)
Suggested change:
- env.goal_state = bitrand(env.rng,env.N)
+ env.goal_state .= bitrand(env.rng,env.N)
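With the immutable struct sketched above, rebinding the fields with = would fail, but broadcasting into the existing arrays still works, so reset! could look like this (a sketch; bitrand comes from Random):

function RLBase.reset!(env::BitFlippingEnv)
    env.state .= bitrand(env.rng, env.N)       # refill the existing BitArray in place
    env.goal_state .= bitrand(env.rng, env.N)  # likewise for the goal
    nothing
end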
Add Bit Flipping Environment, inspired by Hindsight Experience Replay (https://arxiv.org/pdf/1707.01495.pdf).