To solve this, first I simulated observations consistent with the problem statement:

- 45% 0's
- 38% 1's
- 17% 2's

To achieve that, I threw random numbers (n=25000) on the interval [0,1] with floating point precision. Then, to achieve the designated proportions, I set all instances on the interval [0.00, 0.45) as 0, [0.45, 0.83) as 1, and [0.83, 1.00] as 2. This generated the following distribution of data, where the histogram is normalized to 1 in order to show rate rather than raw counts.
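
In code, the simulation might look like the following (a minimal sketch in Julia; the seed and variable names are my own):

```julia
using Random
Random.seed!(42)                        # arbitrary seed, for reproducibility

u = rand(25_000)                        # uniform draws on [0, 1)
data = [x < 0.45 ? 0 : x < 0.83 ? 1 : 2 for x in u]

# Sanity check: empirical rates should be ≈ 0.45, 0.38, 0.17
[count(==(k), data) / length(data) for k in 0:2]
```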

![](plots/simulated_distribution.svg)

This histogram is consistent with the given rates, so the generated data appears valid, with 25,000 observations.

### Model

This data can be modeled using a [Poisson Binomial distribution](https://en.wikipedia.org/wiki/Poisson_binomial_distribution). This distribution is the convolution of _n_ Bernoulli distributed variables with probabilities p_1, p_2, ..., p_n. In our case, we have n=2, where each source emits a signal with probability p_i, for i=1,2, during each sampling interval.
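
For n=2 the pmf has a simple closed form: P(Y=0) = (1-p_1)(1-p_2), P(Y=1) = p_1(1-p_2) + (1-p_1)p_2, and P(Y=2) = p_1 p_2. A quick check with Distributions.jl illustrates this (the p values here are arbitrary):

```julia
using Distributions

d = PoissonBinomial([0.36, 0.36])   # two sources, each emitting 36% of the time
pdf.(d, 0:2)                        # ≈ [0.4096, 0.4608, 0.1296]
```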

Generic Poisson Binomial distributions with n=2 for various p_i values are shown in the figure below.

![](plots/poisson_binomial_examples.svg)

Given two sources emitting with some fixed probability, this is the type of distribution I would expect. The distribution is bounded between 0 and 2, corresponding to neither source, one source, or both sources emitting during a sampling period. If both p values are low, the sources aren't emitting frequently, so we expect many samplings with 0 readings. If both p values are high, we expect many cases in which we receive signals from both sources, so many 2 readings. Anywhere in the middle, it's most often exactly one source or the other that emits, so 1 is the most common reading. We observe this behavior in the generic distributions, so this model seems to fit the problem.

Formalizing the model,

y_i ~ PoissonBinomial(**p**)

p_i ~ Uniform(0,1)


Flat priors are used for p_1 and p_2. Given our data, we see 0 is the most common value, so most likely both p values will be low, below 0.5. However, since many observations were simulated (25,000), there's flexibility to have less informative priors. With so many observations, the data will be the dominant factor in shaping the posterior.

The flat prior, Uniform(0,1), does provide some information to the model in terms of bounds: on each time interval, each source has between a 0% and 100% chance of emitting. We don't permit negative or greater-than-100% probabilities, since those don't make much physical sense.
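
As a concrete sketch, the model could be written in Turing.jl roughly as follows (my own reconstruction, not necessarily the exact code used for this analysis; I use the Metropolis-Hastings sampler since the PoissonBinomial pmf may not play well with gradient-based samplers):

```julia
using Turing, Distributions

@model function two_sources(y)
    p_a ~ Uniform(0, 1)                  # flat prior, source A's emission probability
    p_b ~ Uniform(0, 1)                  # flat prior, source B's emission probability
    for i in eachindex(y)
        y[i] ~ PoissonBinomial([p_a, p_b])
    end
end

chain = sample(two_sources(data), MH(), 50_000)
```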


### Fitting the model
```
Quantiles
  parameters      2.5%     25.0%    50.0%    75.0%    97.5%

         p_b    0.3468   0.3539   0.3575   0.3613   0.3684
```

> Note: p_a and p_b are used in lieu of p_1 and p_2, a personal code-readability preference.

There's no reason the model ought to prefer one p over the other, so it makes sense that their means and standard deviations are the same. The 95% credible intervals run from 0.349 to 0.368, a bit narrower than I would have expected. This means that to see the pattern we've gotten with two sources, both must have a probability of around 0.358 of emitting per sampling period.


Inspecting the chains and posteriors to make sure convergence is good:
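
A one-line diagnostic that produces trace plots and marginal posteriors (assuming the chain object from the sketch above, with StatsPlots loaded):

```julia
using StatsPlots
plot(chain)   # trace plots and posterior densities for p_a and p_b
```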

This all looks normal; the fit appears to have worked well. The distributions appear roughly normal, centered at the mean of 0.358 found above for both parameters. Probably the most interesting check is how the probabilities covary.

This solution approach searches for parameter values that could realize the observations. That is, to produce the same observations, as one p value increases, the other p value ought to decrease: if one source is emitting signals more often, the other source must emit less often to yield the same realized data. Thus, we expect a negative correlation between the two emission probability parameters.
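
A quick way to quantify this (a sketch, reusing the chain from above):

```julia
using Statistics

p_a = vec(Array(chain[:p_a]))   # flatten posterior draws to plain vectors
p_b = vec(Array(chain[:p_b]))
cor(p_a, p_b)                   # expect a negative correlation
```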

![](plots/p_covariance.svg)

This analysis shows that, yes, this is definitely possible. If it were true, we would expect each source to emit with a probability of roughly 0.358 during any given sampling period.

With so many posterior samples, it's interesting from a sheer curiosity standpoint to look at the most extreme values that could result in this dataset. If one source is emitting with the highest possible probability to realize this data, what is that probability? And how infrequently would the other have to emit in this scenario?
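
One way to pull these extremes out of the posterior samples (a sketch, reusing the flattened vectors above; by symmetry, searching p_a alone suffices for illustration):

```julia
hi = argmax(p_a)         # draw in which source A emits most often
lo = argmin(p_a)         # draw in which source A emits least often
(p_a[hi], p_b[hi])       # highest probability and its partner value
(p_a[lo], p_b[lo])       # lowest probability and its partner value
```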

The highest emission probability consistent with our data was found to be p = 0.382, with a corresponding probability from the second source of p = 0.344 (outside the lower limit of the 95% credible interval, as expected given the negative covariance).

The lowest emission probability was found to be p = 0.335, with a corresponding probability from the second source of p = 0.378 (outside the upper limit of the 95% credible interval).

As mentioned, these are the most extreme values, so they are very, very unlikely to be realized, and we should lend effectively no belief to values below p = 0.33 or above p = 0.39.
