Elixir 1.15 sometimes fails during mix release due to missing module #12777

Closed
AndrewDryga opened this issue Jul 10, 2023 · 8 comments

@AndrewDryga
Contributor

AndrewDryga commented Jul 10, 2023

Elixir and Erlang/OTP versions

elixir 1.15.2-otp-26
erlang 26.0.2

Operating system

Ubuntu 22.04.2

Current behavior

We started having issues with mix release, which fails once in a while. Restarting the CI job usually fixes the issue, so it's most likely some sort of race condition during compilation.

For crash errors see:

https://github.com/firezone/firezone/actions/runs/5491388141/jobs/10007864417#step:4:1638
https://github.com/firezone/firezone/actions/runs/5510881208/jobs/10045788818#step:9:1009

25.18 == Compilation error in file lib/web/mailer/auth_email.ex ==
25.18 ** (UndefinedFunctionError) function Web.Mailer.module_info/1 is undefined (module Web.Mailer is not available)
25.18     Web.Mailer.module_info(:exports)
25.18     (elixir 1.15.2) src/elixir_import.erl:143: :elixir_import.get_functions/1
25.18     (elixir 1.15.2) src/elixir_import.erl:106: :elixir_import.calculate/6
25.18     (elixir 1.15.2) src/elixir_import.erl:28: :elixir_import.import/4

You can see other builds that succeeded on the same branch without any code changes, e.g.:

https://github.com/firezone/firezone/actions/runs/5491969244/jobs/10009045418#step:4:2783 (the latter failed job was triggered on the same codebase by GitHub merge queue checks).

Expected behavior

Consistently compile the module, as it did pre-1.15.

@AndrewDryga
Contributor Author

AndrewDryga commented Jul 10, 2023

Initially, we blamed the cache so we reset it and then split up the cache for Docker, but it did not help. I also hit compilation races once in a while locally (by running mix test without docker). MIX_ENV=test mix clean resolves it, but I can't find a way to reproduce it reliably so did not report it as a separate issue yet.

One thing to notice is that we use the module name Domain.Mailer.NoopAdapter in the parent umbrella app (domain) as an atom (without defining the module itself), and the actual module is defined and used only by the child (web) app. Did anything change in 1.15 around that?

@josevalim
Member

Thank you. However, without an isolated mechanism to reproduce the error, there isn't much we can do.

> One thing to notice is that we use the module name Domain.Mailer.NoopAdapter

It depends. If you are using the module at compile-time, then it can be a problem (for example, a race condition). Elixir v1.15 did not change anything related to that, explicitly, but a race condition may happen when we make parts of the system faster (making it more likely for the race to be encountered). In both cases though, we need an isolated and minimal way to reproduce it. :)

@AndrewDryga
Contributor Author

AndrewDryga commented Jul 10, 2023

I will report back if we find a way to reliably reproduce it. So far we can't even do it reliably on CI, because the issue self-resolves after a restart.

> If you are using the module at compile-time

No, it's just an atom there, so it should not be the root cause.

jamilbk added a commit to firezone/firezone that referenced this issue Jul 10, 2023
@AndrewDryga
Contributor Author

Looks like this commit fixed the issue: firezone/firezone@1ffd08f. I think we might get a hint from it since it only does two things:

  1. ensures that the adapter module (Web.Mailer.NoopAdapter) is always compiled before we try to use it;
  2. the swoosh dep now starts a little longer before the web app does, because web waits for domain to start.

We also noticed that the race only happens with the Swoosh adapter; other code is not affected. I checked the Swoosh source, and the only suspicious thing is Swoosh.Mailer, which uses an @on_load module attribute that executes the following function: https://github.com/swoosh/swoosh/blob/v1.11.3/lib/swoosh/mailer.ex#L236.

Will report more if I find the root cause.

jamilbk pushed a commit to firezone/firezone that referenced this issue Jul 12, 2023
Make all tests pass

I removed some of VPN/Wall settings (they are irrelevant once we move out gateway) along with port-based rules conditions (since we are moving to userspace wg).

Make sure that container can be built and run in PR CI step

Remove omnibus install scripts

Bring ecto.* helpers back to life

Fix priv/repo path

Add skeleton of API app

Add client, gateway, relay boilerplate code

Drop REST API boilerplate for now

Add primitive tests and more structure for API app

Control channels for Clients, Relays and Gateways (#1551)

Replace web app with a new one based on Tailwind and esbuild (#1568)

Re-enable SQL sandboxing for Phoenix apps

Bring back browser/config.xml

Remove unused import

Remove unused docker-compose file

Add minimal scaffholding for relay

Install necessary components for toolchain

Avoid concurrent jobs

Move everything to a workspace

Move gitignore and lockfile to workspace root

Move rust-toolchain to workspace root

Add caching to CI

Update .github/workflows/rust.yml

Signed-off-by: Thomas Eizinger <[email protected]>

Implement basic STUN server (#1603)

This is an alternative to #1602
that implements the server using a library I've found called
`stun_codec`.

It already has support for parsing a variety of attributes.

The following is a nice website to test some of the functionality:
https://icetest.info/

The server is still listening on:
`ec2-3-89-112-240.compute-1.amazonaws.com:3478`.

Install Rust before computing cache keys (#1606)

Enforce no warnings in docs (#1605)

relay: Parse and respond to allocation requests (#1604)

With this patch, the relay can parse and respond to allocation requests. I
ran some basics tests against https://icetest.info/ and implemented a
regression test as a result of the logged data.

In writing this, I also had to slightly change the design of `Server`
(as expected). Event handlers for incoming data now do not return a
message directly. Instead, the caller is responsible to drain `Command`s
from it.

When creating an allocation, we need to start listening on a new port.
This needs to happen outside the `Server` as I am going for a sans-IO
style. We emit a `Command` that instructs the main event loop to listen
on a new port. Any incoming data on that port will be forwarded to the
`Server`.

At the moment, this incoming data is just dropped. This is actually
standards-compliant because we cannot handle binding requests yet which
would allow this data to be forwarded to the client.

In some areas, the code is still a bit rough but I expect to iron those
things out as we go along.

relay: add basic README (#1611)

relay: refresh allocations (#1610)

relay: don't repeat magic numbers througout the code (#1612)

A small refactoring to keep magic numbers only in one place.

relay: remember allocations by port (#1613)

Instead of remembering the used ports separately, we store a reference
to each allocation by port.

ci: remove broken workflows (#1614)

These workflows are all red which is expected as far as I understand.
I'd suggest we remove them to reduce the noise when reviewing PRs.

In case we ever wanted to bring parts of it back, Git is our best
friend.

Feel free to close if you think differently.

Update workflows for cloud chaos (#1615)

Updating workflows to skip on PR and run on merges to `cloud`.

IAM context (#1577)

Things I've left for later to IAM:
1. Subject session expiration (to prevent session extension attacks);
2. UserPass adapter;
3. Token adapter and removal of APITokens in favor of `api_client` actor
with a Token provider;
4. Cleanup of Configurations schema and table
5. SCIM
6. Groups and Actor Profile (name, email) Sync
7. Email delivery once Web app is done with the templates
8. We might also want to persist sessions to database, to then show list
of active sessions to the user and allow to terminate some of them from
UI
9. SAML?
10. Rename `unprivileged` role name to `end_user`
11. Add `first_` and `last_name`, and sync/edit blocking logic around
it.
12. Rename Clients to Devices?

Fix PR-labeler config (#1623)

Fix PR labeler config 🤞

fix(relay): use correct variable (#1617)

We had a semantic conflict here that resulted in a broken build. This PR
fixes that.

Co-authored-by: Jamil <[email protected]>

1.0 views (part 1) (#1599)

- [x] Users
- [x] Groups
- [x] Devices
- [x] Gateways

relay: create channel bindings and relay data (#1618)

Here is a short demo:

[Relay](https://github.com/firezone/firezone/assets/5486389/c0199294-70ca-47b4-90ae-2c96428bdb56)

You can run this locally using the `./run_smoke_test.sh` shell-script.
It is not reliable enough yet to be used in CI but I used one of its
outputs to make a regression test.

---------

Co-authored-by: Jamil <[email protected]>

Implementing channels logic (#1619)

Fix minor bugs and tidy up existing work on new views (#1628)

Just fixing some bugs and inconsistencies I found while going through
the new views.

Fix some of TODOs left from IAM PR (#1627)

Move elixir code to a subfolder (#1631)

refactor(relay): introduce type-safe `Server` APIs (#1630)

We introduce dedicated types for each message that the `Server` can
handle. This allows us to make the functions public because the
type-system now guarantees that those are either parsed from bytes or
constructed with the correct data.

The latter will be useful to write tests against a richer API.

Deployment for the cloud version (#1638)

TODO:
- [x] Cluster formation for all API and web nodes
- [x] Injest Docker logs to Stackdriver
- [x] Fix assets building for prod

To finish later:
- [ ] Structured logging:
https://issuetracker.google.com/issues/285950891
- [ ] Better networking policy (eg. use public postmark ranges and deny
all unwanted egress)
- [ ] OpenTelemetry collector for Google Stackdriver
- [ ] LoggerJSON.Plug integration

---------

Signed-off-by: Andrew Dryga <[email protected]>
Co-authored-by: Jamil <[email protected]>

Set correct outbound email in local env

Try to fix CI step

relay: implement authentication (#1641)

Remove Elixir checks from pre-commit hook and rename CI step that runs it

Always run Elixir CI checks when code in main branch changed

Fix typos

Run pre-commit CI step on all PRs

Add newlines in the end of files

Add resource type and expose it in WS API along with name (#1649)

Additionally:
1. Fixed ipv6 formatting for stun/turn addresses
2. Fixed a test that checks for race conditions concurrently

Normalize CIDR resource addresses

Remove outdated TODO

feat(rust): bump to new stable release 1.70.0 (#1648)

Continuous delivery to staging (#1655)

Add terraform code owners

Lave a note on workflow_run feature and fix checkout feature

Experiment with condition

Workflow is not picked up by GitHub for some reason

Try a different CI setup

Add missing on_workflow call

Remove copy-pasted required inputs

Fix races for concurrency control

Inherit secrets to child workflows

Fix path to versions file

Rename pre-commit step

Bump checkout action vsn in rust workflow

Try pushing update using GH API

Fix github branch name

Do not attempt to persist tag versions back to the repo

Add missing env for terraform workflow

Try to wrap tf vars in backticks

Add double quotes to the var itself

Fix assets pipeline, add Elixir deps audit, add Android applink manifest (#1659)

feat(relay): implement nonces for authentication (#1654)

To complete the authentication scheme for the relay, we need to prompt
the client with a nonce when they send an unauthenticated request. The
semantic meaning of a nonce is opaque to the client. As a starting
point, we implement a count-based scheme. Each nonce is valid for 10
requests. After that, a request will be rejected with a 401 and the
client has to authenticate with a new nonce.

This scheme provides a basic form of replay-protection.

feat(relay): provide a commandline interface using clap (#1658)

This saves us several lines of code and allows usage of the relay via
commandline arguments in addition to env variables. Note that because of
`#[arg(env)]`, all of these can still be configured via environment
variables too.

feat(relay): add Dockerfile (#1661)

This adds a basic Dockerfile for the relay so users and devs can easily
start it.

fix(relay): treat `stamp_secret` as string (#1660)

Previously, the relay would treat the `stamp_secret` internally as bytes and share it with the outside world as hex-string. The portal however treats it as an opaque string and uses the UTF-8 bytes to create username and password.

This patch aligns the relay's functionality with the portal and stores the `stamp_secret` internally as a string.

ci: specify workspace directory for cache action correctly (#1663)

ci: install musl target via `rust-toolchain.toml` file (#1664)

Targets specified in the `rust-toolchain.toml` file are automatically installed by `rustup`. This avoid setup steps for other devs and also simplifies the CI setup.

To be able to compile native code to musl, we do need `musl-gcc` which comes with the `musl-tools` package on ubuntu.

feat(relay): connect to portal on startup (#1643)

With this PR, the relay can be configured with a WebSocket URL on startup. If given, it will attempt to connect to it and join the `relay` room with its `stamp_secret`. Once the `init` message is received, regular relay operation will begin.

jamilbk%feat/stub website in cloud (#1675)

* Remove `www/`
* Stub empty `website/` to silence Vercel. This shouldn't cause
conflicts when we merge `cloud` to `master`. Perhaps we want to start
working off `master` soon, and move the current tip of master to
`legacy`?

Use pnpm over yarn (#1678)

Did some research when picking a package manager for the website and
settled on `pnpm` for the following reasons:

- CLI-compatible with `npm`
- Typically faster than even `yarn` especially on Apple silicon
- Security: Pnpm uses a different dependency resolution algorithm and
different folder structure of node_modules that prevents illegal access
to packages by other packages.

I think I caught all the places, but I may be missing something, so if
this isn't a good idea we can revert back.

This PR also cleans up the actions workflows to remove dead code.

Use pnpm for asset setup too (#1681)

Add pnpm to runners (#1683)

Found another place where pnpm needs to be added.

Hotifx seeds and references (#1689)

connlib: moves it to the main firezone library

 This brings connlib from its own separate repo to firezone's monorepo.

 On top of bringing connlib we also add and unify the Dockerfile for all
 rust binaries and add a docker-compose that can run a headless client, a
 relay and a gateway which eventually will test the whole flow between a
 client and a resource. For this to work we also incorporated some elixir
 scripts to generate portal tokens for those components.

Do not expire encoded Gateway/Relay tokens

Fix API error rendering

Render error when public key is reused

Fix stub module name

Remove outdated env files

rust: fix dockerfile for building multiple images in parallel (#1699)

When using `docker compose build` or any other way of building docker
images in parallel the way the cache was working with the rust's
Dockerfile made the caches between images overlap and corrupt each
other. We add a `locked` which prevents multiple writers to the same
cache to fix this behaviour.

Return changeset on name suffix constraint error

docker: fix building for macos (#1700)

There are problems building the docker images in macos using musl due to
ring's problems therefore we started using slim-debian with glibc for
development.

Authentication for the live app (#1674)

Co-authored-by: Jamil <[email protected]>

portal: Policies CRUD views (#1692)

@AndrewDryga ~~Was still hitting some redirect issues so I'll wait for
those to be resolved before continuing on building more views.~~ Edit:
After some sleep and coffee, I figured it out. Nice work on the sign in
form!

I went ahead and scoped existing dashboard links with `@account` and
fixed a dark mode issue -- you may want to cherry-pick those commits.
I'll add these to authenticated routes and integrate into what you have
so far.

As I was going through last night exploring your route approach I
thought of some edge cases; can discuss next week. I think the main one
that came to mind was that we probably want to differentiate between
login flows initiated directly in the browser (this is an admin logging
into the dashboard) vs login flows initiated from a client app (these
will terminate with a final redirect to respective `dest` whitelisted
URL). Maybe it makes sense to segregate these flows?

If a regular user tries login directly from the browser maybe we want to
show them something like "Please login from your Firezone application
instead" as they should only be able to initiate logins from a client
application. Or maybe there's simply no possibility to end up at the
final Android App Link or `firezone://` URI with a login initiated
directly from the browser?

portal: Status indicator badge (#1703)

Did some research on status page providers to manage incidents.
statuspage.io seems to be easy to use and cost-effective, fairly popular
and provides a good amount of flexibility to customize emails,
notifications, etc.

Super easy to set up and use but am not married to it if anyone feels
strongly about using another incident management service.

https://firezone.statuspage.io

<img width="235" alt="Screenshot 2023-06-27 at 8 07 29 AM"
src="https://github.com/firezone/firezone/assets/167144/8ad12b9b-7345-4a5d-bf43-c8af798d85f9">

Fix compilation warnings that are not fixed in merged PRs

Do not render ipv6 relay address if it's nil

CONTRIBUTING.md updates (#1704)

**Update CONTRIBUTING.md**

Why:

* The CONTRIBUTING.md doc seems to have fallen slightly out of date with
      how Firezone now works.  This commit updates the doc to provide a
quick start guide for getting all of the various Firezone components
up and running as quick as possible. The doc then links to the more
      specific `Elixir` and `Rust` README.md files in the respective
      directories to help developers who would like to contribute.

**Update docker-compose vault health check**

 Why:

* The current Vault health check listed in the docker-compose file does
not seem to be working when using `localhost` in the `wget` command.
      Updating the URL to use `127.0.0.1` seems to have fixed it.

---------

Signed-off-by: bmanifold <[email protected]>
Co-authored-by: Jamil <[email protected]>

Fix formatting issue

My editor failed here due to a bug: elixir-lsp/vscode-elixir-ls#345

connlib: Improve FFI bridges for Apple and Android (#1691)

This makes it possible to build the Apple/Android FFI bridges and
integrate them with their respective client apps.

---------

Signed-off-by: Francesca Lovebloom <[email protected]>
Co-authored-by: Roopesh Chander <[email protected]>

Fix/docker compose up (#1705)

This PR fixes `docker compose up`. It doesn't yet have the test client ->
resource flow working, but it prevents anything from erroring at startup.

This fixes:
* tokens (use the correct token for the client user agent we are using)
* randomize `name_suffix` at start up for connlib (we will eventually
allow options to set it manually)
* remove port ranges for relay (see firezone/corp#613)

fix(relay): ensure smoke test script fails on error (#1711)

Due to a silly bash mistake (I hate bash), the error from the gateway
binary wasn't actually propagated to the script. Thus, we did not notice
that it had been broken for a while.

Attempting to fix it turned up that we were double-hexing the relay
secret and using invalid passwords for the clients.

fix(connlib): format with `cargo fmt` (#1709)

Runs `cargo fmt` on the entire `rust/` directory. This somehow doesn't
seem to be enforced, I think that is because we changed the previous CI
to now only run for the `relay` crate.

I'd like to merge this first to avoid the diff and in a 2nd PR, we can
work on unifying CI again.

fix(relay): remove smoke test CI script (#1717)

Unfortunately, this doesn't seem to be stable. I don't really understand
why. Judging from the logs, the problem is not in the relay but somehow
the final UDP packet doesn't arrive at the `gateway` binary.

To not unnecessarily block other PRs, I am removing the check for now.

Add more websocat examples for connecting to a resource

Wait for client and gateway containers for api to become ready

Add docs section to see if everything is connected to the panel

Explicitly subscribe to id channels

Looks like for some reason the id/1 callback doesn't subscribe the channel process any more (only the socket itself), so we are doing that explicitly now.

Stub out client app directories in monorepo structure (#1716)

Stubs out the client app dirs and basic CI workflow for the client apps
in preparation to move them into this repository.

After this is merged @roop @pratikvelani you should be able to add the
client repos here.

chore: unify and optimize Rust CI (#1710)

- Instead of having two, very similar jobs, we run our fmt, clippy and
tests steps across all crates and operating systems.
- We remove the dependency of the android and apple builds on the tests
and thus get faster feedback.
- We force clippy to fail on any warning. This one is super important
IMO. Warnings in Rust are very useful and ignoring them can lead to bugs
(think "unused Result" etc).

Resolves #1714.

---------

Signed-off-by: Thomas Eizinger <[email protected]>
Co-authored-by: Francesca Lovebloom <[email protected]>

connlib: Connection mock (#1721)

Resolves firezone/corp#607

Setting the env var `CONNLIB_MOCK` when building through either
`build-rust.sh` or `gradle` will activate the `mock` feature.

Attempt to enable merge queue (#1713)

https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#merge_group

Feat/connlib full flow (#1722)

With this PR the full control-plane message flow is working.

Meaning that if you do:

```
docker compose up -d
docker compose exec -it client "ping 172.20.0.2" # will fix this IP later
```

Messages start flowing to gateway. The gateway still does not correctly
forward the messages to the resource since masquerading is still not
working, although I suspect there might be an additional problem. Will
fix this in my next PR along with a README on how to test this whole
flow.

This PR also fixes how we sent the stamp secret to the gateway from the
relay, but I still see some warnings in the webrtc that I'm sure that
are due to a mismatch between how webrtc-rs and the relay handle
messages (The most important being `bind() failed: unexpected response
type`), I will take a look at that and a way to test that the flow works
when:
1. hole-punching is available
2. through relay when it's not
Since the flow right now works without hole-punching or relay since the
gateway is in the same network in the docker compose.

Bump Elixir/OTP versions (#1730)

Bump versions in Dockerfile

Fix flaky tests

docs(relay): bring README.md up to date (#1718)

Drop invalid cache restore keys

Fix ubuntu 20.04 CI (#1734)

add a prefix key with host os to rust test job to prevent caching issues

CI: add a flow that test client to resource ping (#1729)

This PR fixes a bunch of small things to allow a new flow to test
clients pinging a resource within docker compose.

Masquerade/Forwarding is enabled directly in the container for now, this
might change in the future.

Also added a README to be able to run this locally.

---------

Signed-off-by: Gabi <[email protected]>
Co-authored-by: Jamil <[email protected]>

feat(relay): default portal URL (#1719)

Instead of having portal URL and token optional, we default the portal
URL and decide based on the presence of the token, whether we should
connect to the portal on startup. This allows the relay to be
used/tested standalone and keeps the number of config options and error
cases small.

We require the user to config the full path of the websocket and thus
avoid the need for duplicating the connlib function. Given that most
users will never need to override this option, this seems like a good
trade-off.

Resolves firezone/corp#614.

Feat/connlib handle error messages (#1735)

With this PR we handle in the client an error message due to
gateway/relay although rate limiting is needed.

Waiting for #1729 to be merged.

portal: Stub out Settings views (#1702)

Adds Setting UI views based on the Balsamiq Wireframes. This should be
merged **after** #1679
<img width="1469" alt="Screenshot 2023-06-26 at 4 48 55 PM"
src="https://github.com/firezone/firezone/assets/167144/0994b12b-5d8d-48a6-bc8d-c9ba07d2403c">

<img width="1469" alt="Screenshot 2023-06-26 at 4 49 01 PM"
src="https://github.com/firezone/firezone/assets/167144/1d69a54d-2740-4ab0-819b-75a50a976285">
<img width="1616" alt="Screenshot 2023-06-29 at 12 29 26 AM"
src="https://github.com/firezone/firezone/assets/167144/94a8913f-93be-4502-b30e-c70f147dbe62">

<img width="1616" alt="Screenshot 2023-06-29 at 12 29 14 AM"
src="https://github.com/firezone/firezone/assets/167144/16dfc709-65b9-44fd-adad-c412dc1d44e6">

<img width="1616" alt="Screenshot 2023-06-29 at 2 36 43 PM"
src="https://github.com/firezone/firezone/assets/167144/3cddc4b3-7494-4710-953e-4d60108b9aa8">
<img width="1616" alt="Screenshot 2023-06-29 at 2 36 56 PM"
src="https://github.com/firezone/firezone/assets/167144/1f433239-1023-471d-916c-76c43f47835e">
<img width="1616" alt="Screenshot 2023-06-29 at 2 37 05 PM"
src="https://github.com/firezone/firezone/assets/167144/9cd4be23-02eb-4adf-902b-00c02cecd744">

Add android client to the repo (#1738)

- Add android client to the repo

---------

Signed-off-by: Pratik Velani <[email protected]>
Co-authored-by: Jamil <[email protected]>

Bring in apple client into monorepo (#1737)

This PR brings in the apple client into the monorepo.

---------

Co-authored-by: Jamil <[email protected]>

feat(relay): use structured logging (#1741)

With this patch, the relay exposes a `--json` and `JSON_LOG` env
variable that will activate logs in JSON format the way it is expected
by google cloud:
https://cloud.google.com/logging/docs/structured-logging

In addition, we make use of spans to record contextual information as
first-class variables that are available in the context of every
message. An example output here is:

```
{"time":"2023-07-06T19:54:42.643694430Z","target":"relay","logging.googleapis.com/sourceLocation":{"file":"relay/src/main.rs","line":"156"},"severity":"INFO","message":"Seeding RNG from '0'"}
{"time":"2023-07-06T19:54:42.644408014Z","target":"relay","logging.googleapis.com/sourceLocation":{"file":"relay/src/main.rs","line":"130"},"severity":"INFO","message":"Listening for incoming traffic on UDP port 3478"}
{"time":"2023-07-06T19:54:42.843247996Z","target":"relay","logging.googleapis.com/sourceLocation":{"file":"relay/src/server.rs","line":"417"},"span":{"lifetime":"600","name":"allocate"},"spans":[{"sender":"127.0.0.1:46406","transaction_id":"0531a911a24d1e5297b94cb2","name":"client"},{"lifetime":"600","name":"allocate"}],"severity":"INFO","ip4RelayAddress":"127.0.0.1:65460","message":"Created new allocation"}
{"time":"2023-07-06T19:54:42.851623041Z","target":"relay","logging.googleapis.com/sourceLocation":{"file":"relay/src/server.rs","line":"569"},"span":{"allocation":"AID-1","peer_address":"127.0.0.1:42314","requested_channel":"16384","name":"channel_bind"},"spans":[{"sender":"127.0.0.1:46406","transaction_id":"e99e07e482789cdc30bd2b50","name":"client"},{"allocation":"AID-1","peer_address":"127.0.0.1:42314","requested_channel":"16384","name":"channel_bind"}],"severity":"INFO","message":"Successfully bound channel"}
{"time":"2023-07-06T19:54:42.852889208Z","target":"relay","logging.googleapis.com/sourceLocation":{"file":"relay/src/server.rs","line":"288"},"span":{"allocation_id":"AID-1","channel":16384,"recipient":"127.0.0.1:46406","sender":"127.0.0.1:42314","name":"peer"},"spans":[{"allocation_id":"AID-1","channel":16384,"recipient":"127.0.0.1:46406","sender":"127.0.0.1:42314","name":"peer"}],"severity":"DEBUG","message":"Relaying 32 bytes"}
{"time":"2023-07-06T19:54:42.854625857Z","target":"relay","logging.googleapis.com/sourceLocation":{"file":"relay/src/server.rs","line":"619"},"span":{"channel":"16384","recipient":"127.0.0.1:42314","name":"channel_data"},"spans":[{"sender":"127.0.0.1:46406","name":"client"},{"channel":"16384","recipient":"127.0.0.1:42314","name":"channel_data"}],"severity":"DEBUG","message":"Relaying 32 bytes"}
```

For some reason, the current `span` is always duplicated but I don't
think that is a big issue. When run using the regular log formatter, it
looks like this:

```
2023-07-06T20:02:33.939273Z  INFO relay: Seeding RNG from '0'
2023-07-06T20:02:33.940153Z  INFO relay: Listening for incoming traffic on UDP port 3478
2023-07-06T20:02:34.135801Z  INFO client{sender=127.0.0.1:33919 transaction_id="7092a2363377709cd18b9d98"}:allocate{lifetime=600}: relay: Created new allocation ip4_relay_address=127.0.0.1:65460
2023-07-06T20:02:34.144833Z  INFO client{sender=127.0.0.1:33919 transaction_id="4e1a18e58953242c92a075a3"}:channel_bind{requested_channel=16384 peer_address=127.0.0.1:47859 allocation="AID-1"}: relay: Successfully bound channel
2023-07-06T20:02:34.145501Z DEBUG peer{sender=127.0.0.1:47859 allocation_id=AID-1 recipient=127.0.0.1:33919 channel=16384}: relay: Relaying 32 bytes
2023-07-06T20:02:34.146863Z DEBUG client{sender=127.0.0.1:33919}:channel_data{channel=16384 recipient=127.0.0.1:47859}: relay: Relaying 32 bytes
```

This provides lots of contextual information in a DRY and easily
parse-able way.

---------

Co-authored-by: Jamil <[email protected]>

Pass all required checks that weren't triggered in the PR (#1748)

Fixes #1747
Fixes #1746

Pass-checks workflow per subdir (#1749)

Fix cache for Docker buildx (#1750)

~~This is an attempt to fix the CI bug
[here](https://github.com/firezone/firezone/actions/runs/5491388141/jobs/10007864417#step:4:1638)
possibly introduced in
[d9eb2d1](d9eb2d18#diff-88bd94db0d5cfd5f0617b7c4ed48c0212597378ed7e28714c5d86c95999b4c7dR29)
and uncovered / exacerbated in Elixir 1.15~~

Edit: looks like this ended up being a couple cache issues with GitHub
actions:
1. The `elixir_api-container-build` cache would always overwrite the
`elixir_web-container-build` on subsequent builds of the same
`github.ref_name` (cache is scoped to branch name by default), leading
to the consistent error `Elixir.Web.Mailer.NoopAdapter does not exist`
whenever a branch was pushed to more than once.
2. The same thing happens with the `integration_test-basic-flow` job
because the `api` service gets built after the `web` service in
docker-compose.yml, overwriting its cache

For some reason it seems the `APPLICATION_NAME` ARG is not busting the
Docker cache properly on GitHub actions for elixir container builds, so
the fix here was to [use
`scope=`](https://docs.docker.com/build/cache/backends/gha/#scope) to
segregate the cache layers between builds of the same branch.

Move NoopAdapter to Domain app (#1756)

Workaround for this:

elixir-lang/elixir#12777

Feat/expire peers (#1739)

This PR takes care of expiring connections with peer from the gateway
side.

---------

Co-authored-by: Jamil <[email protected]>

fix(relay): reuse `delete_allocation` function (#1743)

Previously, we would access the state around allocations from different
places. This actually led to a minor memory leak where we wouldn't clean
up the `allocations_by_port` table. We refactor the code slightly to
avoid this.

---------

Co-authored-by: Jamil <[email protected]>

connlib: Use latest `swift-bridge` release (#1753)

A new version of `swift-bridge` released today, so we don't need it to
be a git dependency anymore.

headless & gateway: impl callbacks (#1757)

After rebasing over this #1744 CI should pass

connlib: Hook up callbacks (#1744)

Co-authored-by: Jamil <[email protected]>

Add slack notification for failed deployments

Fix flaky test

Fix health checks path
@josevalim
Member

I actually think that explains a lot. If the @on_load hook fails, the module is not loaded, which is why it says it cannot find the module. The actual root cause is logged only a couple of lines above:

Error: 23.19 23:30:27.694 [error] Elixir.Web.Mailer.NoopAdapter does not exist

This happens because Web.Mailer.NoopAdapter may or may not have been compiled before the mailer is loaded. The race condition has always been there; Elixir v1.15 just compiles faster, which may make it more likely to happen. To reproduce it consistently, you can add a Process.sleep(10000) before defmodule Web.Mailer.NoopAdapter, making the race more likely to happen.
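
To make the ordering dependence concrete, here is a minimal sketch that can be run as a standalone script. The names LoadCheck, Demo.LateAdapter and adapter_available?/1 are hypothetical and are not taken from Swoosh or Firezone; the only point is that a check based on Code.ensure_loaded answers "is the module available right now?", so its result depends on whether the adapter has already been defined when the check runs.

```elixir
defmodule LoadCheck do
  # Hypothetical helper: mimics a load-time validation that only asks
  # whether the module can be loaded at this very moment.
  def adapter_available?(adapter) when is_atom(adapter) do
    match?({:module, _}, Code.ensure_loaded(adapter))
  end
end

# The adapter has not been defined yet, so the check fails here.
before = LoadCheck.adapter_available?(Demo.LateAdapter)

# Hypothetical adapter, defined "too late" for the first check.
defmodule Demo.LateAdapter do
  def deliver(_email), do: :ok
end

# The very same check now succeeds, only because of ordering.
later = LoadCheck.adapter_available?(Demo.LateAdapter)

IO.inspect({before, later})
# prints {false, true}
```

In a parallel build the relative order of the two steps is not fixed, which would be consistent with the failure appearing only on some runs.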

There are two fixes here:

  1. Temporary: define NoopAdapter in the same file and before Web.Mailer, to make sure it is always defined beforehand.

  2. Move the Swoosh @on_load check to compile time (such as @before_compile/@after_compile) and use Code.ensure_compiled instead of Code.ensure_loaded, which will guarantee an adapter defined in the same project will be compiled beforehand. I don't see the need to check for this at runtime. The way Elixir manages your dependencies, it is safe to assume that if the module is available at compile time, it should be available at runtime.
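
A rough sketch of the second option, assuming hypothetical names (AdapterCheck, Demo.NoopAdapter, Demo.Mailer); this is not the actual Swoosh change. For brevity the check runs directly in the module body rather than through @before_compile/@after_compile, but it still executes while the mailer is being compiled, not when it is loaded. Inside a Mix project build, Code.ensure_compiled waits for a module from the same project to finish compiling; in a plain script like this one it simply reports whether the module is available.

```elixir
defmodule AdapterCheck do
  # Hypothetical helper: returns :ok when the adapter module is available,
  # raises ArgumentError otherwise.
  def validate!(adapter) when is_atom(adapter) do
    case Code.ensure_compiled(adapter) do
      {:module, ^adapter} ->
        :ok

      {:error, reason} ->
        raise ArgumentError,
              "adapter #{inspect(adapter)} is not available: #{inspect(reason)}"
    end
  end
end

# Hypothetical adapter standing in for a project-defined one.
defmodule Demo.NoopAdapter do
  def deliver(_email), do: :ok
end

defmodule Demo.Mailer do
  @adapter Demo.NoopAdapter

  # Evaluated during compilation of Demo.Mailer; a missing adapter
  # fails the build here instead of failing at load time.
  AdapterCheck.validate!(@adapter)

  def adapter, do: @adapter
end

IO.inspect(Demo.Mailer.adapter())
```

With a check of this kind, a missing adapter surfaces as a compilation error in the mailer itself rather than as a module that silently fails to load.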

I will move the issue to Swoosh. :) Thank you for the follow up!

@josevalim
Member

Done, please consider sending a PR there too: swoosh/swoosh#792 :)

@princemaple
Contributor

Give 1.11.4 a go! @AndrewDryga

@AndrewDryga
Contributor Author

Thank you @josevalim and @princemaple ❤️
