
feat: allow overriding MetadataSyncJob timeout via Qubes service flag #1552

Merged
4 commits merged from 1547-override-timeout into main on Sep 13, 2022

Conversation

@cfm (Member) commented Aug 30, 2022

Description

Closes #1547 by:

  1. teaching the securedrop-client script to check for a Qubes service flag like SDEXTENDEDTIMEOUT_N and export it as SDEXTENDEDTIMEOUT=N;
  2. having the MetadataSyncJob (noisily) override its sdclientapi.API.default_request_timeout with $SDEXTENDEDTIMEOUT if set in the environment.
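A minimal Python sketch of the override behavior in step 2 (the function name `effective_timeout` and the placeholder default of 20 seconds are illustrative only; the real default lives on `sdclientapi.API`):

```python
import os

ASSUMED_DEFAULT_TIMEOUT = 20  # illustrative placeholder, not sdclientapi's actual value


def effective_timeout(environ=None):
    """Return the request timeout in seconds, honoring $SDEXTENDEDTIMEOUT if set."""
    environ = os.environ if environ is None else environ
    override = environ.get("SDEXTENDEDTIMEOUT")
    if override is not None:
        # Environment values are always strings, so convert before use.
        return int(override)
    return ASSUMED_DEFAULT_TIMEOUT
```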

Test Plan

Point your SecureDrop Workstation at my testing instance per #1547 (comment) and #1547 (comment). Then:

To test the current failure mode (#1547)

  1. Check out 1547-override-timeout in sd-app:/home/user/securedrop-client.

  2. Apply the following patch so that we can test files/securedrop-client within the virtual environment:

    diff --git a/files/securedrop-client b/files/securedrop-client
    index 9674911..403e37f 100755
    --- a/files/securedrop-client
    +++ b/files/securedrop-client
    @@ -18,5 +18,6 @@ if [ -n "$timeout_flag_value" ]; then
            export SDEXTENDEDTIMEOUT="$timeout_flag_value"
     fi
     
    +cd ~/securedrop-client
     # Now execute the actual client, only if running in an sd-app
    -if [ "$(qubesdb-read /name)" = "sd-app" ]; then ./bin/sd-client; else echo "Not running in sd-app, client not starting."; fi
    +if [ "$(qubesdb-read /name)" = "sd-app" ]; then python -m securedrop_client; else echo "Not running in sd-app, client not starting."; fi
  3. Then:

    user@sd-app:~/securedrop-client$ make venv
    user@sd-app:~/securedrop-client$ source .venv/bin/activate
    user@sd-app:~/securedrop-client$ files/securedrop-client
  4. Log in and wait for sync. Against my testing instance (~3500 sources and submissions):

    • the initial sync succeeds (~10 min); and
    • subsequent syncs fail (~2 min).

To test overriding the timeout

  1. Set the service flag:

    [user@dom0 ~]$ qvm-service --enable sd-app SDEXTENDEDTIMEOUT_600
    [user@dom0 ~]$ qvm-shutdown sd-app && sleep 5 && qvm-start sd-app

    You can also do SDEXTENDEDTIMEOUT_10 if you're impatient. :-)

  2. Explore how that flag is exposed within sd-app:

    user@sd-app:~$ qubesdb-list /qubes-service/
    SDEXTENDEDTIMEOUT_600
    meminfo-writer
    paxctld
    user@sd-app:~$ qubesdb-list /qubes-service/SD
    EXTENDEDTIMEOUT_600
    user@sd-app:~$ qubesdb-list /qubes-service/SDEXTENDEDTIMEOUT_
    600
  3. Launch the client as above:

    user@sd-app:~/securedrop-client$ source .venv/bin/activate
    user@sd-app:~/securedrop-client$ files/securedrop-client
    • SDEXTENDEDTIMEOUT=600 appears in securedrop-client's standard output.
  4. Log in and wait for sync.

    • Sync either (a) succeeds, if the SDEXTENDEDTIMEOUT you set is long enough, or (b) fails after SDEXTENDEDTIMEOUT seconds.
    • In sd-log, you see (e.g.): WARNING: MetadataSyncJob will use default_request_timeout=600

Checklist

If these changes modify code paths involving cryptography, the opening of files in VMs or network (via the RPC service) traffic, Qubes testing in the staging environment is required. For fine tuning of the graphical user interface, testing in any environment in Qubes is required. Please check as applicable:

  • I have tested these changes in the appropriate Qubes environment
  • I do not have an appropriate Qubes OS workstation set up (the reviewer will need to test these changes)
  • These changes should not need testing in Qubes

If these changes add or remove files other than client code, the AppArmor profile may need to be updated. Please check as applicable:

  • I have updated the AppArmor profile
  • No update to the AppArmor profile is required for these changes

And see e21c163 for why!

  • I don't know and would appreciate guidance

If these changes modify the database schema, you should include a database migration. Please check as applicable:

  • I have written a migration and upgraded a test database based on main and confirmed that the migration is self-contained and applies cleanly
  • I have written a migration but have not upgraded a test database based on main and would like the reviewer to do so
  • I need help writing a database migration
  • No database schema changes are needed

@cfm cfm requested a review from eaon August 30, 2022 23:20
@cfm cfm requested a review from a team as a code owner August 30, 2022 23:20
@gonzalo-bulnes (Contributor) left a comment


Not really a review because I'm not set up right now to follow the test plan. But just a note to say that I find the description and test plan fantastic! Having only loosely followed the conversation on this issue, I feel I understand what was done and how it should behave. 🙌

Comment on lines +49 to +52
logger.warn(
f"{self.__class__.__name__} will use "
f"default_request_timeout={api_client.default_request_timeout}"
)
@gonzalo-bulnes (Contributor) commented Aug 31, 2022


A minor pet peeve of mine: since the message doesn't require action (e.g. like a deprecation warning does), I would prefer it to be at INFO level.

I say pet peeve because I find personally that most warnings shouldn't be emitted (potentially controversial opinion) because they fall into the following categories:

  • Errors that are not fatal because they're handled. (In my experience, the source of the error is unlikely ever to be fixed if, given the opportunity, the choice was made instead to print a warning. 🤷 So the warning is mostly noise, IMHO.)
  • Fatal errors, which shouldn't have happened and would better be logged at the ERROR level since they signal a bug to be fixed. (Foreseeable errors must be handled.)
  • Or messages that don't require user action and only provide context. (At which point I think competing for attention at the warning level contributes more to error-fatigue than it really helps keeping us informed. Using an INFO level seems more honest to me.)

What would I make a warning? Things that will become fatal errors unless action is taken, e.g. deprecation warnings. YMMV! Admittedly it's nitpicky and mostly a matter of personal preference. 🙂
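A toy illustration of the distinction being drawn (the logger name and message text here are made up for this sketch, not taken from the client):

```python
import logging

logger = logging.getLogger("example.sync")


def log_timeout_context(timeout):
    # Context-only message: no action required, so INFO per the reasoning above.
    logger.info("sync will use request_timeout=%s", timeout)


def log_deprecation(name):
    # Actionable message: something breaks unless the caller changes, so WARNING.
    logger.warning("%s is deprecated and will stop working in a future release", name)
```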

@cfm (Member, Author) commented Sep 1, 2022


I generally agree with your criteria, @gonzalo-bulnes. In this specific case, I'm rounding up slightly from our decision to call out this flag "explicitly as experimental, i.e. we cannot guarantee that we will support it in future" to something akin to a deprecation warning. But I'm happy to demote this message to INFO if that logic isn't persuasive.

(See also: #1166.)

@gonzalo-bulnes (Contributor) replied:

Oh, that makes sense, @cfm; I missed that intent! I'm not sure what context goes into the current wording: would there be an opportunity to call out the experimental status explicitly (in addition to using the warning level), e.g. "WARNING: ... use experimental default_..."?

cfm added 4 commits September 7, 2022 15:22
…so that we have something to compare to when we override it.
…IMEOUT=N

Since a Qubes service just sets a boolean flag, we use qubesdb-list (1)
as a glob, for any key beginning with the prefix "SDEXTENDEDTIMEOUT_";
and (2) to return the "value", aka the key without that prefix.

This approach is too naïve to work for arbitrary keys (which risk
colliding in the glob) or arbitrarily-typed values (without encoding).
But it's good enough for this experimental flag.  More sophisticated
approaches, such as scanning the contents of "/var/run/qubes-service",
would require new AppArmor grants; the grants we already have cover
qubesdb-cmd.
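The prefix-as-value encoding described in this commit message can be sketched in Python (`parse_service_flag` is an illustrative name, not code from this PR):

```python
def parse_service_flag(keys, prefix="SDEXTENDEDTIMEOUT_"):
    """Return the value encoded in a Qubes service flag's name, or None.

    A Qubes service flag is just a boolean, so the value rides along in the
    key itself: the flag SDEXTENDEDTIMEOUT_600 encodes the value "600".
    """
    for key in keys:
        if key.startswith(prefix):
            return key[len(prefix):]
    return None
```

As the commit message notes, this is naïve for arbitrary keys (two flags sharing the prefix would collide), but it suffices for a single experimental flag.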
@gonzalo-bulnes (Contributor)
(rebased without changes to include new CI check)

@eaon (Contributor) left a comment


Going with an approval without merging, as I have not tested this; I'm just weighing in on the implementation:

I think this looks great! Balances precedent (qubesdb reads in the start script) and an approximation of what we may see in the future (i.e. lightweight ways of communicating information to the client without requiring a run of sdw-admin --apply) - and is easily reversible. Love it!

I don't have time to go through the test plan, but if someone else can confirm this is working as expected, I think this PR is ready to be merged! Thanks @cfm 😄

@eloquence (Member)

(Just a note that I'm currently syncing with the SDEXTENDEDTIMEOUT_600 setting against Cory's server with 3500 sources. Will report results here and merge if it looks good, per Michael's prior conditional approval.)

remote_user = factory.RemoteUser()
api_client.get_users = mocker.MagicMock(return_value=[remote_user])

os.environ["SDEXTENDEDTIMEOUT"] = str(TIMEOUT_OVERRIDE) # environment value must be string
A Member commented:

One alternative strategy to consider for tests that modify env vars might be to mock os.environ, see https://github.com/freedomofpress/securedrop-workstation/blob/b31d0acdf35cf5248acbe2dcd9372483dfde424f/launcher/tests/test_util.py#L259-L266 for an example (not a blocking comment, just an observation).
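The suggested alternative can be sketched with `unittest.mock.patch.dict`, which restores the environment when the context exits (`read_timeout` here is a stand-in for the code under test, not the client's actual function):

```python
import os
from unittest import mock


def read_timeout():
    # Stand-in for code under test that reads the override from the environment.
    return int(os.environ.get("SDEXTENDEDTIMEOUT", "20"))


def test_timeout_override():
    # patch.dict restores os.environ on exit, so the test leaves no residue
    # behind, unlike assigning to os.environ directly.
    with mock.patch.dict(os.environ, {"SDEXTENDEDTIMEOUT": "600"}):
        assert read_timeout() == 600
```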

@eloquence (Member)

I've not tested the old behavior, since I've previously confirmed that syncs against a large number of sources do indeed time out. The new behavior works as expected.

  • Exploratory testing confirms the expected behavior of the new service
  • SDEXTENDEDTIMEOUT=600 appears in securedrop-client's standard output.
  • Sync succeeds,
  • In sd-log, you see (e.g.): WARNING: MetadataSyncJob will use default_request_timeout=600

@eloquence (Member)

(Upon consideration, I've held off on merge for now, since we've invited an external stakeholder to test this PR before merging.)

@philmcmahon (Contributor)

I've not managed to run this locally but can confirm that this appears to do exactly what we've been doing by manually patching sync.py, so a big 👍🏻 from us

@eloquence (Member)

Per chat with @creviera and previous reviews, merging :)

@eloquence eloquence merged commit 611f396 into main Sep 13, 2022
@eloquence eloquence deleted the 1547-override-timeout branch September 13, 2022 01:14
@cfm (Member, Author) commented Sep 15, 2022

Thanks to @eaon for pairing on the design here and to all who reviewed, especially @philmcmahon!

Successfully merging this pull request may close: allow experimental override of default_request_timeout (#1547).
5 participants