Commit 705446c

Fix video link for PPO robot.

jachiam committed Mar 14, 2019
1 parent 41ce375
Showing 22 changed files with 20 additions and 19 deletions.
Binary file modified docs/_build/doctrees/algorithms/vpg.doctree
Binary file modified docs/_build/doctrees/environment.pickle
Binary file modified docs/_build/doctrees/spinningup/keypapers.doctree
Binary file modified docs/_build/doctrees/spinningup/rl_intro.doctree
Binary file modified docs/_build/doctrees/user/algorithms.doctree
Binary file modified docs/_build/doctrees/user/introduction.doctree
2 changes: 1 addition & 1 deletion docs/_build/html/_modules/spinup/algos/ddpg/ddpg.html
@@ -368,7 +368,7 @@ <h1>Source code for spinup.algos.ddpg.ddpg</h1><div class="highlight"><pre>
     logger.setup_tf_saver(sess, inputs={'x': x_ph, 'a': a_ph}, outputs={'pi': pi, 'q': q})
 
     def get_action(o, noise_scale):
-        a = sess.run(pi, feed_dict={x_ph: o.reshape(1,-1)})
+        a = sess.run(pi, feed_dict={x_ph: o.reshape(1,-1)})[0]
         a += noise_scale * np.random.randn(act_dim)
         return np.clip(a, -act_limit, act_limit)
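The [0] added here matters because pi is built for batched observations: sess.run on a single observation reshaped to (1, -1) returns an action batch of shape (1, act_dim), and that leading axis survives both the noise addition and np.clip. A minimal NumPy sketch of the shape difference (dummy values; act_dim=6 is only for illustration):

    import numpy as np

    act_dim = 6
    batch_out = np.random.randn(1, act_dim)  # stand-in for sess.run(pi, ...): shape (1, act_dim)

    a_old = batch_out + 0.1 * np.random.randn(act_dim)     # broadcasts: still shape (1, 6)
    a_new = batch_out[0] + 0.1 * np.random.randn(act_dim)  # leading axis removed: shape (6,)

    print(a_old.shape, a_new.shape)  # (1, 6) (6,)

Gym environments with Box action spaces expect a flat (act_dim,) action, so the 1-D form is the safe one to hand to env.step; the same one-character fix appears in the TD3 and SAC diffs below.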
2 changes: 1 addition & 1 deletion docs/_build/html/_modules/spinup/algos/sac/sac.html
@@ -404,7 +404,7 @@ <h1>Source code for spinup.algos.sac.sac</h1><div class="highlight"><pre>
 
     def get_action(o, deterministic=False):
         act_op = mu if deterministic else pi
-        return sess.run(act_op, feed_dict={x_ph: o.reshape(1,-1)})
+        return sess.run(act_op, feed_dict={x_ph: o.reshape(1,-1)})[0]
 
     def test_agent(n=10):
         global sess, mu, pi, q1, q2, q1_pi, q2_pi
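One note on the surrounding code: the deterministic flag is how SAC separates training from evaluation, sampling from pi while training but acting with the mean mu at test time. A hypothetical sketch of the evaluation rollout that test_agent performs (assuming a standard old-style Gym test_env, as used elsewhere in this file):

    # Sketch of an evaluation rollout using the deterministic mean action.
    o, d, ep_ret = test_env.reset(), False, 0
    while not d:
        o, r, d, _ = test_env.step(get_action(o, deterministic=True))
        ep_ret += r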
2 changes: 1 addition & 1 deletion docs/_build/html/_modules/spinup/algos/td3/td3.html
@@ -394,7 +394,7 @@ <h1>Source code for spinup.algos.td3.td3</h1><div class="highlight"><pre>
     logger.setup_tf_saver(sess, inputs={'x': x_ph, 'a': a_ph}, outputs={'pi': pi, 'q1': q1, 'q2': q2})
 
     def get_action(o, noise_scale):
-        a = sess.run(pi, feed_dict={x_ph: o.reshape(1,-1)})
+        a = sess.run(pi, feed_dict={x_ph: o.reshape(1,-1)})[0]
         a += noise_scale * np.random.randn(act_dim)
         return np.clip(a, -act_limit, act_limit)
2 changes: 1 addition & 1 deletion docs/_build/html/_modules/spinup/utils/run_utils.html
@@ -355,7 +355,7 @@ <h1>Source code for spinup.utils.run_utils</h1><div class="highlight"><pre>
     encoded_thunk = base64.b64encode(zlib.compress(pickled_thunk)).decode('utf-8')
 
     entrypoint = osp.join(osp.abspath(osp.dirname(__file__)),'run_entrypoint.py')
-    cmd = ['python', entrypoint, encoded_thunk]
+    cmd = [sys.executable if sys.executable else 'python', entrypoint, encoded_thunk]
     try:
         subprocess.check_call(cmd, env=os.environ)
     except CalledProcessError:
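The replacement of the bare 'python' matters because the subprocess otherwise resolves the interpreter through PATH, which can pick a different Python (with different installed packages) than the one running Spinning Up, e.g. under a virtualenv or when the parent was started as python3. sys.executable is the parent's own interpreter; it can be empty in rare embedded setups, hence the fallback. A self-contained sketch of the idiom:

    import subprocess
    import sys

    # Prefer the interpreter running this process; fall back to a PATH
    # lookup only if sys.executable is empty (rare, embedded interpreters).
    python = sys.executable if sys.executable else 'python'

    # The child prints its own interpreter path -- the same as the parent's.
    subprocess.check_call([python, '-c', 'import sys; print(sys.executable)'])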
2 changes: 1 addition & 1 deletion docs/_build/html/_sources/algorithms/vpg.rst.txt
@@ -40,7 +40,7 @@ The policy gradient algorithm works by updating policy parameters via stochastic
 
     \theta_{k+1} = \theta_k + \alpha \nabla_{\theta} J(\pi_{\theta_k})
 
-Policy gradient implementations typically compute advantage function estimates based on the infinite-horizon discounted return, despite otherwise use the finite-horizon undiscounted policy gradient formula.
+Policy gradient implementations typically compute advantage function estimates based on the infinite-horizon discounted return, despite otherwise using the finite-horizon undiscounted policy gradient formula.
 
 Exploration vs. Exploitation
 ----------------------------
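On the sentence corrected above: in practice, "advantage estimates based on the infinite-horizon discounted return" means discounted reward-to-go minus a value baseline. A minimal sketch of the standard computation (the discount_cumsum helper below mirrors the rllab/Spinning Up-style utility; treat the exact name and signature as illustrative):

    import numpy as np
    import scipy.signal

    def discount_cumsum(x, discount):
        # out[t] = x[t] + discount*x[t+1] + discount**2*x[t+2] + ...
        return scipy.signal.lfilter([1], [1, float(-discount)], x[::-1], axis=0)[::-1]

    rews = np.array([1.0, 0.0, 2.0])          # rewards along one trajectory
    vals = np.array([0.5, 0.4, 0.3])          # value-function baseline V(s_t)
    adv = discount_cumsum(rews, 0.99) - vals  # simple advantage estimate

The gradient formula itself stays finite-horizon and undiscounted; only the advantage weights use the discount, trading a little bias for a substantial variance reduction.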
2 changes: 1 addition & 1 deletion docs/_build/html/_sources/spinningup/keypapers.rst.txt
@@ -191,7 +191,7 @@ a. Model is Learned
 .. [#] `Neural Network Dynamics for Model-Based Deep Reinforcement Learning with Model-Free Fine-Tuning <https://arxiv.org/abs/1708.02596>`_, Nagabandi et al, 2017. **Algorithm: MBMF.**
-.. [#] `Model-Based Value Estimation for Efficient Model-Free Reinforcement Learning <https://arxiv.org/abs/1803.00101>`_, Feinberg et al, 2018. **Algorithm: MBVE.**
+.. [#] `Model-Based Value Expansion for Efficient Model-Free Reinforcement Learning <https://arxiv.org/abs/1803.00101>`_, Feinberg et al, 2018. **Algorithm: MVE.**
 .. [#] `Sample-Efficient Reinforcement Learning with Stochastic Ensemble Value Expansion <https://arxiv.org/abs/1807.01675>`_, Buckman et al, 2018. **Algorithm: STEVE.**
2 changes: 1 addition & 1 deletion docs/_build/html/_sources/spinningup/rl_intro.rst.txt
@@ -22,7 +22,7 @@ RL methods have recently enjoyed a wide variety of successes. For example, it's
 
 .. raw:: html
 
-    <video autoplay="" src="https://storage.googleapis.com/joschu-public/knocked-over-stand-up.mp4" loop="" controls="" style="display: block; margin-left: auto; margin-right: auto; margin-bottom:1.5em; width: 100%; max-width: 720px; max-height: 80vh;">
+    <video autoplay="" src="https://d4mucfpksywv.cloudfront.net/openai-baselines-ppo/knocked-over-stand-up.mp4" loop="" controls="" style="display: block; margin-left: auto; margin-right: auto; margin-bottom:1.5em; width: 100%; max-width: 720px; max-height: 80vh;">
     </video>
 
 ...and in the real world...
3 changes: 2 additions & 1 deletion docs/_build/html/_sources/user/algorithms.rst.txt
@@ -16,7 +16,7 @@ The following algorithms are implemented in the Spinning Up package:
 - `Twin Delayed DDPG`_ (TD3)
 - `Soft Actor-Critic`_ (SAC)
 
-They are all implemented with MLP (non-recurrent) actor-critics, making them suitable for fully-observed, non-image-based RL environments, eg the `Gym Mujoco`_ environments.
+They are all implemented with `MLP`_ (non-recurrent) actor-critics, making them suitable for fully-observed, non-image-based RL environments, eg the `Gym Mujoco`_ environments.
 
 .. _`Gym Mujoco`: https://gym.openai.com/envs/#mujoco
 .. _`Vanilla Policy Gradient`: ../algorithms/vpg.html
@@ -25,6 +25,7 @@ They are all implemented with MLP (non-recurrent) actor-critics, making them sui
 .. _`Deep Deterministic Policy Gradient`: ../algorithms/ddpg.html
 .. _`Twin Delayed DDPG`: ../algorithms/td3.html
 .. _`Soft Actor-Critic`: ../algorithms/sac.html
+.. _`MLP`: https://en.wikipedia.org/wiki/Multilayer_perceptron
 
 
 Why These Algorithms?
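For intuition about the MLP link added above: the actor-critics in these implementations are all built from a small stack-of-dense-layers helper. A sketch in the style of the repo's TF1 helper (the exact name and signature in the repo are assumptions here):

    import tensorflow as tf

    def mlp(x, hidden_sizes=(32,), activation=tf.tanh, output_activation=None):
        # Hidden layers share one activation; the final layer sizes the
        # output and typically stays linear.
        for h in hidden_sizes[:-1]:
            x = tf.layers.dense(x, units=h, activation=activation)
        return tf.layers.dense(x, units=hidden_sizes[-1], activation=output_activation)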
6 changes: 3 additions & 3 deletions docs/_build/html/_sources/user/introduction.rst.txt
@@ -36,7 +36,7 @@ However, while there are many resources to help people quickly ramp up on deep l
 
 The high-level view is hard to come by because of how new the field is. There is not yet a standard deep RL textbook, so most of the knowledge is locked up in either papers or lecture series, which can take a long time to parse and digest. And learning to implement deep RL algorithms is typically painful, because either
 
-- the paper that publishes an algorithm omits or inadvertantly obscures key design details,
+- the paper that publishes an algorithm omits or inadvertently obscures key design details,
 - or widely-public implementations of an algorithm are hard to read, hiding how the code lines up with the algorithm.
 
 While fantastic repos like rllab_, Baselines_, and rllib_ make it easier for researchers who are already in the field to make progress, they build algorithms into frameworks in ways that involve many non-obvious choices and trade-offs, which makes them hard to learn from. Consequently, the field of deep RL has a pretty high barrier to entry---for new researchers as well as practitioners and hobbyists.
@@ -66,7 +66,7 @@ The algorithm implementations in the Spinning Up repo are designed to be
 - as simple as possible while still being reasonably good,
 - and highly-consistent with each other to expose fundamental similarities between algorithms.
 
-They are almost completely self-contained, with virtually no common code shared between them (except for logging, saving, loading, and MPI utilities), so that an interested person can study each algorithm separately without having to dig through an endless chain of dependencies to see how something is done. The implementations are patterned so that they come as close to pseudocode as possible, to minimize the gap between theory and code.
+They are almost completely self-contained, with virtually no common code shared between them (except for logging, saving, loading, and `MPI <https://en.wikipedia.org/wiki/Message_Passing_Interface>`_ utilities), so that an interested person can study each algorithm separately without having to dig through an endless chain of dependencies to see how something is done. The implementations are patterned so that they come as close to pseudocode as possible, to minimize the gap between theory and code.
 
 Importantly, they're all structured similarly, so if you clearly understand one, jumping into the next is painless.
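To make "MPI utilities" concrete: the shared helpers are thin wrappers for synchronizing scalars and gradients across parallel processes. A minimal sketch of a cross-process averaging helper built on mpi4py (the name mpi_avg and its exact behavior in the repo are assumptions):

    import numpy as np
    from mpi4py import MPI

    def mpi_avg(x):
        # Sum x across all MPI processes, then divide by the world size.
        buf = np.asarray(x, dtype=np.float64)
        out = np.zeros_like(buf)
        MPI.COMM_WORLD.Allreduce(buf, out, op=MPI.SUM)
        return out / MPI.COMM_WORLD.Get_size()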

@@ -105,4 +105,4 @@ Additionally, as discussed in the blog post, we are using Spinning Up in the cur
 .. _`original TD3 code`: https://github.com/sfujim/TD3/blob/25dfc0a6562c54ae5575fad5b8f08bc9d5c4e26c/main.py#L89
 .. _`benchmarks`: ../spinningup/bench.html
 .. _Scholars : https://jobs.lever.co/openai/cf6de4ed-4afd-4ace-9273-8842c003c842
-.. _Fellows : https://jobs.lever.co/openai/c9ba3f64-2419-4ff9-b81d-0526ae059f57
+.. _Fellows : https://jobs.lever.co/openai/c9ba3f64-2419-4ff9-b81d-0526ae059f57
2 changes: 1 addition & 1 deletion docs/_build/html/algorithms/vpg.html
@@ -259,7 +259,7 @@ <h3><a class="toc-backref" href="#id4">Key Equations</a><a class="headerlink" hr
 <p>The policy gradient algorithm works by updating policy parameters via stochastic gradient ascent on policy performance:</p>
 <div class="math">
 <p><img src="../_images/math/9053f48788d67cbf5725c480779428ee8ed21f98.svg" alt="\theta_{k+1} = \theta_k + \alpha \nabla_{\theta} J(\pi_{\theta_k})"/></p>
-</div><p>Policy gradient implementations typically compute advantage function estimates based on the infinite-horizon discounted return, despite otherwise use the finite-horizon undiscounted policy gradient formula.</p>
+</div><p>Policy gradient implementations typically compute advantage function estimates based on the infinite-horizon discounted return, despite otherwise using the finite-horizon undiscounted policy gradient formula.</p>
 </div>
 <div class="section" id="exploration-vs-exploitation">
 <h3><a class="toc-backref" href="#id5">Exploration vs. Exploitation</a><a class="headerlink" href="#exploration-vs-exploitation" title="Permalink to this headline"></a></h3>
2 changes: 1 addition & 1 deletion docs/_build/html/searchindex.js

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion docs/_build/html/spinningup/keypapers.html
@@ -659,7 +659,7 @@ <h3>a. Model is Learned<a class="headerlink" href="#a-model-is-learned" title="P
 <table class="docutils footnote" frame="void" id="id61" rules="none">
 <colgroup><col class="label" /><col /></colgroup>
 <tbody valign="top">
-<tr><td class="label">[61]</td><td><a class="reference external" href="https://arxiv.org/abs/1803.00101">Model-Based Value Estimation for Efficient Model-Free Reinforcement Learning</a>, Feinberg et al, 2018. <strong>Algorithm: MBVE.</strong></td></tr>
+<tr><td class="label">[61]</td><td><a class="reference external" href="https://arxiv.org/abs/1803.00101">Model-Based Value Expansion for Efficient Model-Free Reinforcement Learning</a>, Feinberg et al, 2018. <strong>Algorithm: MVE.</strong></td></tr>
 </tbody>
 </table>
 <table class="docutils footnote" frame="void" id="id62" rules="none">
2 changes: 1 addition & 1 deletion docs/_build/html/spinningup/rl_intro.html
@@ -234,7 +234,7 @@ <h1><a class="toc-backref" href="#id2">Part 1: Key Concepts in RL</a><a class="h
 <div class="section" id="what-can-rl-do">
 <h2><a class="toc-backref" href="#id3">What Can RL Do?</a><a class="headerlink" href="#what-can-rl-do" title="Permalink to this headline"></a></h2>
 <p>RL methods have recently enjoyed a wide variety of successes. For example, it&#8217;s been used to teach computers to control robots in simulation...</p>
-<video autoplay="" src="https://storage.googleapis.com/joschu-public/knocked-over-stand-up.mp4" loop="" controls="" style="display: block; margin-left: auto; margin-right: auto; margin-bottom:1.5em; width: 100%; max-width: 720px; max-height: 80vh;">
+<video autoplay="" src="https://d4mucfpksywv.cloudfront.net/openai-baselines-ppo/knocked-over-stand-up.mp4" loop="" controls="" style="display: block; margin-left: auto; margin-right: auto; margin-bottom:1.5em; width: 100%; max-width: 720px; max-height: 80vh;">
 </video><p>...and in the real world...</p>
 <div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden; max-width: 100%; height: auto;">
 <iframe src="https://www.youtube.com/embed/jwSbzNHGflM?ecver=1" frameborder="0" allowfullscreen style="position: absolute; top: 0; left: 0; width: 100%; height: 100%;"></iframe>