Neatened up Bolts Documentation (#262)
* Documentation clean up

Looked for grammatical and markdown errors mostly

* Bolts grammatical changes
captain695 authored Oct 7, 2020

1 parent c3e11f1 commit eac5839
Showing 9 changed files with 53 additions and 53 deletions.
4 changes: 2 additions & 2 deletions docs/source/classic_ml.rst
@@ -9,7 +9,7 @@ half-precision training.
Linear Regression
-----------------
Linear regression fits a linear model between a real-valued target variable :math:`y` and one or more features :math:`X`. We
-estimate the regression coefficients that minimizes the mean squared error between the predicted and true target
+estimate the regression coefficients that minimize the mean squared error between the predicted and true target
values.
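
As a plain-PyTorch illustration of that objective, a minimal sketch of a single linear layer fitted by minimizing the mean squared error (the data and dimensions are made up for the example):

.. code-block:: python

    import torch
    from torch import nn

    # linear regression is a single linear layer: y_hat = X @ w + b
    model = nn.Linear(in_features=3, out_features=1)
    loss_fn = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    X = torch.randn(64, 3)  # 64 samples, 3 features (illustrative data)
    y = X @ torch.tensor([[1.5], [-2.0], [0.7]]) + 0.3

    for _ in range(200):
        optimizer.zero_grad()
        loss = loss_fn(model(X), y)  # mean squared error between predicted and true targets
        loss.backward()
        optimizer.step()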

We formulate the linear regression model as a single-layer neural network. By default we include only one neuron in
@@ -69,7 +69,7 @@ Add either L1 or L2 regularization, or both, by specifying the regularization st
trainer.test(test_dataloaders=dm.test_dataloader(batch_size=12))
-Any input will be flattened across all dimensions except the firs one (batch).
+Any input will be flattened across all dimensions except the first one (batch).
This means images, sound, etc... work out of the box.

.. code-block:: python
5 changes: 4 additions & 1 deletion docs/source/dataloaders.rst
@@ -3,7 +3,10 @@ AsynchronousLoader
This dataloader behaves identically to the standard pytorch dataloader, but will transfer
data asynchronously to the GPU with training. You can also use it to wrap an existing dataloader.

-Example::
+Example:
+
+.. code-block:: python
dataloader = AsynchronousLoader(DataLoader(ds, batch_size=16), device=device)
for b in dataloader:
4 changes: 2 additions & 2 deletions docs/source/datamodules.rst
@@ -7,9 +7,9 @@ DataModules (introduced in PyTorch Lightning 0.9.0) decouple the data from a mod
is simply a collection of a training dataloder, val dataloader and test dataloader. In addition,
it specifies how to:

-- Downloading/preparing data.
+- Download/prepare data.
- Train/val/test splits.
-- Transforms
+- Transform

Then you can use it like this:
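
A minimal sketch of such a DataModule, assuming PyTorch Lightning's ``LightningDataModule`` API (the dataset, splits and class name here are illustrative):

.. code-block:: python

    import pytorch_lightning as pl
    from torch.utils.data import DataLoader, random_split
    from torchvision import transforms
    from torchvision.datasets import MNIST

    class MyDataModule(pl.LightningDataModule):

        def prepare_data(self):
            # download only (runs once, on a single process)
            MNIST("./data", train=True, download=True)
            MNIST("./data", train=False, download=True)

        def setup(self, stage=None):
            # transforms and train/val/test splits
            tfms = transforms.ToTensor()
            full = MNIST("./data", train=True, transform=tfms)
            self.train_set, self.val_set = random_split(full, [55000, 5000])
            self.test_set = MNIST("./data", train=False, transform=tfms)

        def train_dataloader(self):
            return DataLoader(self.train_set, batch_size=32)

        def val_dataloader(self):
            return DataLoader(self.val_set, batch_size=32)

        def test_dataloader(self):
            return DataLoader(self.test_set, batch_size=32)

    dm = MyDataModule()
    # any LightningModule can then be trained with: pl.Trainer().fit(model, datamodule=dm)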

14 changes: 7 additions & 7 deletions docs/source/introduction_guide.rst
@@ -10,7 +10,7 @@ Bolts is a Deep learning research and production toolbox of:
- Losses.
- Datasets.

-**The Main goal of bolts is to enable trying new ideas as fast as possible!**
+**The Main goal of Bolts is to enable trying new ideas as fast as possible!**

All models are tested (daily), benchmarked, documented and work on CPUs, TPUs, GPUs and 16-bit precision.

@@ -90,11 +90,11 @@ All models are tested (daily), benchmarked, documented and work on CPUs, TPUs, G

Community Built
---------------
-Bolts are built-by the Lightning community and contributed to bolts.
+Then lightning community builds bolts and contributes them to Bolts.
The lightning team guarantees that contributions are:

-1. Rigorously Tested (CPUs, GPUs, TPUs).
-2. Rigorously Documented.
+1. Rigorously tested (CPUs, GPUs, TPUs).
+2. Rigorously documented.
3. Standardized via PyTorch Lightning.
4. Optimized for speed.
5. Checked for correctness.
@@ -351,7 +351,7 @@ In case your job or research doesn't need a "hammer", we offer implementations o
which benefit from lightning's multi-GPU and TPU support.

So, now you can run huge workloads scalably, without needing to do any engineering.
-For instance, here we can run Logistic Regression on Imagenet (each epoch takes about 3 minutes)!
+For instance, here we can run logistic Regression on Imagenet (each epoch takes about 3 minutes)!

.. code-block:: python
@@ -414,7 +414,7 @@ But more importantly, you can scale up to many GPUs, TPUs or even CPUs
Logistic Regression
^^^^^^^^^^^^^^^^^^^
-Here's an example for Logistic regression
+Here's an example for logistic regression

.. code-block:: python
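
A minimal sketch in the spirit of the surrounding example, assuming Bolts' ``LogisticRegression`` and ``SklearnDataModule`` (exact signatures may differ across versions):

.. code-block:: python

    import pytorch_lightning as pl
    from sklearn.datasets import load_iris

    from pl_bolts.datamodules import SklearnDataModule
    from pl_bolts.models.regression import LogisticRegression

    # wrap a plain numpy dataset in a datamodule with train/val/test splits
    X, y = load_iris(return_X_y=True)
    dm = SklearnDataModule(X, y)

    # 4 input features, 3 classes in the iris dataset
    model = LogisticRegression(input_dim=4, num_classes=3)

    trainer = pl.Trainer(max_epochs=2)
    trainer.fit(model, train_dataloader=dm.train_dataloader(), val_dataloaders=dm.val_dataloader())
    trainer.test(test_dataloaders=dm.test_dataloader(batch_size=12))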
@@ -436,7 +436,7 @@ Here's an example for Logistic regression
trainer.test(test_dataloaders=dm.test_dataloader(batch_size=12))
-Any input will be flattened across all dimensions except the firs one (batch).
+Any input will be flattened across all dimensions except the first one (batch).
This means images, sound, etc... work out of the box.

.. code-block:: python
4 changes: 2 additions & 2 deletions docs/source/models.rst
@@ -15,7 +15,7 @@ by adding your contribution to bolts you get these **additional** benefits!
6. We'll pretrain expensive models for you and host weights.
7. We will improve the speed of your models!
8. Eligible for invited talks to discuss your implementation.
-9. Lightning Swag + involvement in the broader contributor community :)
+9. Lightning swag + involvement in the broader contributor community :)

.. note:: You still get to keep your attribution and be recognized for your work!

@@ -98,7 +98,7 @@ We request that each contribution have:
- Your name and your team's name as the implementation authors.
- Your team's affiliation
- Any generated examples, or result plots.
-- Hyperparameters configurations for the results.
+- Hyperparameter configurations for the results.

Thank you for all your amazing contributions!

67 changes: 33 additions & 34 deletions docs/source/reinforce_learn.rst
@@ -29,8 +29,8 @@ Contributions by: `Donal Byrne <https://github.com/djbyrne>`_
DQN Models
----------

-The following models are based on DQN. DQN uses Value based learning where it is deciding what action to take based
-on the models current learned value (V), or the state action value (Q) of the current state. These Values are defined
+The following models are based on DQN. DQN uses value based learning where it is deciding what action to take based
+on the model's current learned value (V), or the state action value (Q) of the current state. These values are defined
as the discounted total reward of the agents state or state action pair.

---------------
@@ -47,12 +47,12 @@ The DQN was introduced in `Playing Atari with Deep Reinforcement Learning <https
researchers at DeepMind. This took the concept of tabular Q learning and scaled it to much larger problems by
apporximating the Q function using a deep neural network.

-The goal behind DQN was to take the simple control method of Q learning and scale it up in order to solve complicated
-tasks. As well as this, the method needed to be stable. The DQN solves these issues with the following additions.
+The goal behind DQN was to take the simple control method of Q learning and scale it up in order to solve complicated \
+tasks. As well as this, the method needed to be stable. The DQN solves these issues with the following additions.

**Approximated Q Function**

-Storing Q values in a table works well in theory, but is completely unscalable. Instead, the authors apporximate the
+Storing Q values in a table works well in theory, but is completely unscalable. Instead, the authors approximate the
Q function using a deep neural network. This allows the DQN to be used for much more complicated tasks
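
For orientation, a minimal sketch of such an approximator: a small MLP that maps an observation to one Q value per discrete action (sizes are illustrative, not the network used in Bolts):

.. code-block:: python

    import torch
    from torch import nn

    class QNetwork(nn.Module):
        """Approximates Q(s, a) for every discrete action in a single forward pass."""

        def __init__(self, obs_size: int, n_actions: int, hidden: int = 128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(obs_size, hidden),
                nn.ReLU(),
                nn.Linear(hidden, n_actions),  # one Q value per action
            )

        def forward(self, state: torch.Tensor) -> torch.Tensor:
            return self.net(state)

    q_net = QNetwork(obs_size=4, n_actions=2)
    greedy_action = q_net(torch.randn(1, 4)).argmax(dim=1)  # act greedily w.r.t. the predicted Q values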

**Replay Buffer**
@@ -173,7 +173,7 @@ value as the 'base reward' from being in state s.
The advantage tells us how much 'extra' reward we get from taking action a while in state s. The advantage bridges the
gap between Q(s, a) and V(s) as Q(s, a) = V(s) + A(s, a).

-In the paper [Dueling Network Architectures for Deep Reinforcement Learning](https://arxiv.org/abs/1511.06581) the
+In the paper `Dueling Network Architectures for Deep Reinforcement Learning <https://arxiv.org/abs/1511.06581>` the
network uses two heads, one outputs the value state and the other outputs the advantage. This leads to better
training stability, faster convergence and overall better results. The V head outputs a single scalar
(the state value), while the advantage head outputs a tensor equal to the size of the action space, containing
@@ -189,14 +189,14 @@ by subtracting the mean advantage from the Q value. This essentially pulls the m
Dueling DQN Benefits
~~~~~~~~~~~~~~~~~~~~

-- Ability to efficiently learn the state value function. In the dueling network, every Q update also updates the Value
-stream, where as in DQN only the value of the chosen action is updated. This provides a better approximation of the
-values
+- Ability to efficiently learn the state value function. In the dueling network, every Q update also updates the value
+stream, where as in DQN only the value of the chosen action is updated. This provides a better approximation of the
+values
- The differences between total Q values for a given state are quite small in relation to the magnitude of Q. The
-difference in the Q values between the best action and the second best action can be very small, while the average
-state value can be much larger. The differences in scale can introduce noise, which may lead to the greedy policy
-switching the priority of these actions. The seperate estimators for state value and advantage makes the Dueling
-DQN robust to this type of scenario
+difference in the Q values between the best action and the second best action can be very small, while the average
+state value can be much larger. The differences in scale can introduce noise, which may lead to the greedy policy
+switching the priority of these actions. The seperate estimators for state value and advantage makes the Dueling
+DQN robust to this type of scenario
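
A rough sketch of the dueling head described above, combining the value and advantage streams with the mean-advantage subtraction (layer sizes are illustrative):

.. code-block:: python

    import torch
    from torch import nn

    class DuelingHead(nn.Module):
        """Splits a feature vector into V(s) and A(s, a) and recombines them into Q(s, a)."""

        def __init__(self, feature_size: int, n_actions: int):
            super().__init__()
            self.value = nn.Linear(feature_size, 1)              # scalar state value V(s)
            self.advantage = nn.Linear(feature_size, n_actions)  # one advantage A(s, a) per action

        def forward(self, features: torch.Tensor) -> torch.Tensor:
            value = self.value(features)
            adv = self.advantage(features)
            # Q(s, a) = V(s) + A(s, a) - mean_a A(s, a): subtracting the mean advantage
            # keeps the value/advantage decomposition identifiable
            return value + adv - adv.mean(dim=1, keepdim=True)

    head = DuelingHead(feature_size=128, n_actions=6)
    q_values = head(torch.randn(8, 128))  # (batch, n_actions)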

Dueling DQN Results
~~~~~~~~~~~~~~~~~~~
@@ -255,11 +255,11 @@ Vlad Mnih, Remi Munos, Demis Hassabis, Olivier Pietquin, Charles Blundell, Shane
Original implementation by: `Donal Byrne <https://github.com/djbyrne>`_

Up until now the DQN agent uses a seperate exploration policy, generally epsilon-greedy where start and end values
-are set for its exploration. [Noisy Networks For Exploration](https://arxiv.org/abs/1706.10295) introduces
-a new exploration strategy by adding noise parameters to the weightsof the fully connect layers which get updated
+are set for its exploration. `Noisy Networks For Exploration <https://arxiv.org/abs/1706.10295>` introduces
+a new exploration strategy by adding noise parameters to the weights of the fully connect layers which get updated
during backpropagation of the network. The noise parameters drive
the exploration of the network instead of simply taking random actions more frequently at the start of training and
-less frequently towards the end.The of authors of
+less frequently towards the end. The of authors of
propose two ways of doing this.

During the optimization step a new set of noisy parameters are sampled. During training the agent acts according to
@@ -270,23 +270,23 @@ distribution.
The authors propose two methods of injecting noise to the network.

1) Independent Gaussian Noise: This injects noise per weight. For each weight a random value is taken from
-the distribution. Noise parameters are stored inside the layer and are updated during backpropagation.
-The output of the layer is calculated as normal.
+the distribution. Noise parameters are stored inside the layer and are updated during backpropagation.
+The output of the layer is calculated as normal.
2) Factorized Gaussian Noise: This injects nosier per input/ouput. In order to minimize the number of random values
-this method stores two random vectors, one with the size of the input and the other with the size of the output.
-Using these two vectors, a random matrix is generated for the layer by calculating the outer products of the vector
+this method stores two random vectors, one with the size of the input and the other with the size of the output.
+Using these two vectors, a random matrix is generated for the layer by calculating the outer products of the vector
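
A rough sketch of the first variant, an independent-Gaussian noisy linear layer whose noise scales are learned by backpropagation (initialisation and shapes are illustrative, not the Bolts layer):

.. code-block:: python

    import math

    import torch
    from torch import nn

    class NoisyLinear(nn.Module):
        """Linear layer with learnable, independent Gaussian noise on every weight and bias."""

        def __init__(self, in_features: int, out_features: int, sigma_init: float = 0.017):
            super().__init__()
            self.weight_mu = nn.Parameter(torch.empty(out_features, in_features))
            self.weight_sigma = nn.Parameter(torch.full((out_features, in_features), sigma_init))
            self.bias_mu = nn.Parameter(torch.empty(out_features))
            self.bias_sigma = nn.Parameter(torch.full((out_features,), sigma_init))
            bound = 1 / math.sqrt(in_features)
            nn.init.uniform_(self.weight_mu, -bound, bound)
            nn.init.uniform_(self.bias_mu, -bound, bound)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # fresh noise is sampled on every forward pass; the sigmas are trained by backprop
            weight = self.weight_mu + self.weight_sigma * torch.randn_like(self.weight_sigma)
            bias = self.bias_mu + self.bias_sigma * torch.randn_like(self.bias_sigma)
            return nn.functional.linear(x, weight, bias)

    layer = NoisyLinear(128, 6)
    noisy_q_values = layer(torch.randn(4, 128))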


Noisy DQN Benefits
~~~~~~~~~~~~~~~~~~

- Improved exploration function. Instead of just performing completely random actions, we add decreasing amount of noise
-and uncertainty to our policy allowing to explore while still utilising its policy
+and uncertainty to our policy allowing to explore while still utilising its policy.
- The fact that this method is automatically tuned means that we do not have to tune hyper parameters for
-epsilon-greedy!
+epsilon-greedy!

.. note::
-for now I have just implemented the Independant Gaussian as it has been reported there isn't much difference
+For now I have just implemented the Independant Gaussian as it has been reported there isn't much difference
in results for these benchmark environments.

In order to update the basic DQN to a Noisy DQN we need to do the following
@@ -349,8 +349,8 @@ pair using a single step which looks like this
Q(s_t,a_t)=r_t+{\gamma}\max_aQ(s_{t+1},a_{t+1})
-but because the Q function is recursive we can continue to roll this out into multiple steps, looking at the expected
-return for each step into the future.
+but because the Q function is recursive we can continue to roll this out into multiple steps, looking at the expected
+return for each step into the future.

.. math::
@@ -373,14 +373,14 @@ method like DQN with a large replay buffer will make this even worse, as there i
training on experiences using an old policy that was worse than our current policy.

So we need to strike a balance between looking far enough ahead to improve the convergence of our agent, but not so far
-that are updates become unstable. In general, small values of 2-4 work best.
+that are updates become unstable. In general, small values of 2-4 work best.

N-Step Benefits
~~~~~~~~~~~~~~~

- Multi-Step learning is capable of learning faster than typical 1 step learning methods.
- Note that this method introduces a new hyperparameter n. Although n=4 is generally a good starting point and provides
-good results across the board.
+good results across the board.
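
As a concrete illustration of the rolled-out target described above, a small helper that folds n observed rewards and a bootstrapped value into a single n-step return (illustrative, not the Bolts implementation):

.. code-block:: python

    def n_step_return(rewards, bootstrap_value, gamma=0.99):
        """r_0 + gamma * r_1 + ... + gamma^(n-1) * r_{n-1} + gamma^n * bootstrap_value."""
        target = bootstrap_value
        for reward in reversed(rewards):
            target = reward + gamma * target
        return target

    # a 3-step target: three observed rewards plus the discounted value of the state 3 steps ahead
    target = n_step_return([1.0, 0.0, 1.0], bootstrap_value=5.0, gamma=0.9)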

N-Step Results
~~~~~~~~~~~~~~
@@ -464,7 +464,7 @@ PER Benefits
~~~~~~~~~~~~

- The benefits of this technique are that the agent sees more samples that it struggled with and gets more
-chances to improve upon it.
+chances to improve upon it.

**Memory Buffer**

@@ -500,10 +500,10 @@ on an optimal policy faster.
**DQN vs PER DQN: Pong**

In comparison to the base DQN, the PER DQN does show improved stability and performance. As expected, the loss
-of the PER DQN is siginificantly lower. This is the main objective of PER by focusing on experiences with high loss.
+of the PER DQN is siginificantly lower. This is the main objective of PER by focusing on experiences with high loss.

It is important to note that loss is not the only metric we should be looking at. Although the agent may have very
-low loss during training, it may still perform poorly due to lack of exploration.
+low loss during training, it may still perform poorly due to lack of exploration.

.. image:: _images/rl_benchmark/pong_per_dqn_baseline_v1_results_comp.jpg
:width: 800
@@ -535,7 +535,7 @@ suggested by our policy gradient in order to find a policy that produces the hig
Policy Gradient Key Points:
- Outputs a distribution of actions instead of discrete Q values
- Optimizes the policy directly, instead of indirectly through the optimization of Q values
-- The policy distribution of actions allows the model to handle more complex action spaces, such as continuos actions
+- The policy distribution of actions allows the model to handle more complex action spaces, such as continuous actions
- The policy distribution introduces stochasticity, providing natural exploration to the model
- The policy distribution provides a more stable update as a change in weights will only change the total distribution
slightly, as opposed to changing weights based on the Q value of state S will change all Q values with similar states.
@@ -570,17 +570,16 @@ algorithm is as follows:
.. math::
L = - \sum_{k,t} Q_{k,t} \log(\pi(S_{k,t}, A_{k,t}))
5. Perform SGD on the loss and repeat.


What this loss function is saying is simply that we want to take the log probability of action A at state S given
our policy (network output). This is then scaled by the discounted reward that we calculated in the previous step.
We then take the negative of our sum. This is because the loss is minimized during SGD, but we want to
maximize our policy.
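
That description maps almost directly onto code; a minimal sketch of the loss, where the logits, the actions taken and the discounted returns are assumed to come from the collected batch of episodes:

.. code-block:: python

    import torch
    from torch.nn import functional as F

    def policy_gradient_loss(logits, actions, discounted_returns):
        """L = -sum_t Q_t * log(pi(a_t | s_t)), averaged over the batch."""
        log_probs = F.log_softmax(logits, dim=1)     # log pi(a | s) for every action
        batch_idx = torch.arange(len(actions))
        taken = log_probs[batch_idx, actions]        # log prob of the action actually taken
        return -(discounted_returns * taken).mean()  # negate: SGD minimizes, we want to maximize

    loss = policy_gradient_loss(
        logits=torch.randn(5, 2),                    # network output for 5 states, 2 actions
        actions=torch.tensor([0, 1, 1, 0, 1]),
        discounted_returns=torch.tensor([1.0, 0.9, 0.81, 0.73, 0.66]),
    )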

.. note::
-the current implementation does not actually wait for the batch episodes the complete every time as we pass in a
+The current implementation does not actually wait for the batch episodes the complete every time as we pass in a
fixed batch size. For the time being we simply use a large batch size to accomodate this. This approach still works
well for simple tasks as it still manages to get an accurate Q value by using a large batch size, but it is not
as accurate or completely correct. This will be updated in a later version.
3 changes: 1 addition & 2 deletions pl_bolts/models/rl/dqn_model.py
@@ -92,8 +92,7 @@ def __init__(
Note:
This example is based on:
-https://github.com/PacktPublishing/Deep-Reinforcement-Learning-Hands-On-Second-Edition\
-/blob/master/Chapter06/02_dqn_pong.py
+https://github.com/PacktPublishing/Deep-Reinforcement-Learning-Hands-On-Second-Edition/blob/master/Chapter06/02_dqn_pong.py
Note:
Currently only supports CPU and single GPU training with `distributed_backend=dp`
3 changes: 1 addition & 2 deletions pl_bolts/models/rl/per_dqn_model.py
@@ -54,8 +54,7 @@ class PERDQN(DQN):
.. note::
This example is based on:
-https://github.com/PacktPublishing/Deep-Reinforcement-Learning-Hands-On-Second-Edition\
-/blob/master/Chapter08/05_dqn_prio_replay.py
+https://github.com/PacktPublishing/Deep-Reinforcement-Learning-Hands-On-Second-Edition/blob/master/Chapter08/05_dqn_prio_replay.py
.. note:: Currently only supports CPU and single GPU training with `distributed_backend=dp`
2 changes: 1 addition & 1 deletion pl_bolts/models/self_supervised/byol/byol_module.py
@@ -29,7 +29,7 @@ def __init__(self,
PyTorch Lightning implementation of `Bootstrap Your Own Latent (BYOL)
<https://arxiv.org/pdf/2006.07733.pdf>`_
-Paper authors: Jean-Bastien Grill ,Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, \
+Paper authors: Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, \
Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, \
Bilal Piot, Koray Kavukcuoglu, Rémi Munos, Michal Valko.
