fix(robot-server): fetching of data files used in runs of a protocol #15908

Merged

Conversation

@sanni-t (Member) commented Aug 6, 2024

Closes AUTH-638

Overview

Fixes a bug where CSV files used in protocol runs were not being saved to the runs' CSV RTP table, and as a result were not included in the response of the /protocols/{protocolId}/dataFiles endpoint.

Test Plan and Hands on Testing

Testing with app & ODD:

  • Upload a protocol that uses a CSV parameter
  • Send the protocol to a robot and upload the CSV file to use
  • Run the protocol
  • Run the protocol again with a different CSV file, and repeat until there have been 6 runs with 6 different CSV files. Re-running the protocol 6 times forces the robot to delete its oldest analysis (the maximum number of analyses per protocol is 5), which removes the first CSV file from the analysis CSV table but not from the runs table
  • Check that when you run the protocol again on the ODD, it shows all 6 of the previously uploaded CSV files

Testing with Postman / direct HTTP requests (sketched in Python after the list):

  • Upload a few data files
  • Upload a protocol that uses a CSV parameter and specify a data file (data_file_1) for the CSV param
  • Start a new analysis for the same protocol by specifying a second data file (data_file_2) for the CSV param
  • Create a run for the protocol by specifying data_file_1 for its CSV param
  • Create another run for the protocol by specifying a third data file (data_file_3) for its CSV param
  • Check that the response to GET /protocols/{protocolId}/dataFiles contains the 3 data files used with the runs & analyses. Check that they are listed in the order that the files were uploaded to the server (via POST /dataFiles)
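
As a rough Python version of the same sequence (endpoint paths are from the steps above; the Opentrons-Version header is required by robot-server, but the payload shapes, parameter name, and IDs below are illustrative assumptions, not the exact schemas):

    import requests

    BASE = "http://ROBOT_IP:31950"  # placeholder robot address
    HEADERS = {"Opentrons-Version": "*"}

    # Upload the data files and keep the returned IDs.
    file_ids = []
    for name in ("data_file_1.csv", "data_file_2.csv", "data_file_3.csv"):
        with open(name, "rb") as f:
            resp = requests.post(f"{BASE}/dataFiles", headers=HEADERS, files={"file": f})
        file_ids.append(resp.json()["data"]["id"])

    # Create a run that uses data_file_3 for the CSV parameter.
    # ("runTimeParameterFiles", "csv_param", and the protocol ID are assumed names.)
    requests.post(
        f"{BASE}/runs",
        headers=HEADERS,
        json={
            "data": {
                "protocolId": "protocol-id",
                "runTimeParameterFiles": {"csv_param": file_ids[2]},
            }
        },
    )

    # Fetch the files used across the protocol's runs & analyses; expect all
    # three, ordered by upload time.
    data_files = requests.get(
        f"{BASE}/protocols/protocol-id/dataFiles", headers=HEADERS
    ).json()["data"]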

Changelog

  • Wired up CSV RTP table insertion during run creation.
  • Updated the run deletion code to remove the CSV RTP entry from the run_csv_rtp_table before deleting the run.
  • Updated the ../{protocolId}/dataFiles response so that it lists the files in the order they were uploaded.
  • Added tests.

Review requests

  • Usual code review stuff
  • Is there a better way to write the SQL query to get the ordered list of data files used in runs and analyses?

Risk assessment

Low. Fixes a bug.

@sanni-t requested review from a team as code owners, August 6, 2024 20:47
@SyntaxColoring (Contributor) left a comment:

Looks good to me once that failing integration test is passing. Thanks!

Comment on lines +209 to +211
    self._run_store.insert_csv_rtp(
        run_id=run_id, run_time_parameters=run_time_parameters
    )
Contributor:

This works for this PR, but ideally, I think we'd want this insert_csv_rtp(...) call to be combined with the self._run_store.insert(...) call a few lines above so it can be one atomic transaction.
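
For illustration, a minimal self-contained sketch of that shape (the table definitions are stand-ins, not the real robot-server schema):

    import sqlalchemy

    metadata = sqlalchemy.MetaData()
    run_table = sqlalchemy.Table(
        "run",
        metadata,
        sqlalchemy.Column("id", sqlalchemy.String, primary_key=True),
    )
    run_csv_rtp_table = sqlalchemy.Table(
        "run_csv_rtp_table",
        metadata,
        sqlalchemy.Column("run_id", sqlalchemy.String),
        sqlalchemy.Column("file_id", sqlalchemy.String),
    )

    def insert_run_with_csv_rtps(
        engine: sqlalchemy.engine.Engine, run_id: str, file_ids: list[str]
    ) -> None:
        # engine.begin() opens a single transaction covering both inserts:
        # if the CSV RTP insert fails, the run insert rolls back with it.
        with engine.begin() as transaction:
            transaction.execute(run_table.insert().values(id=run_id))
            for file_id in file_ids:
                transaction.execute(
                    run_csv_rtp_table.insert().values(run_id=run_id, file_id=file_id)
                )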

Comment on lines -30 to +32
    -status_code: 201
    +status_code:
    +  - 201
    +  - 200
Contributor:

Discussed in a call: Yeah, we should have a DELETE method so clients can manually delete data files and these tests can restore the robot to its original state. Certainly for a separate PR.
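
If it helps the future PR, the endpoint could take roughly this shape (a hypothetical FastAPI sketch; the route, status code, and behavior are not settled here):

    from fastapi import APIRouter, status

    router = APIRouter()

    @router.delete("/dataFiles/{dataFileId}", status_code=status.HTTP_200_OK)
    async def delete_data_file(dataFileId: str) -> None:
        # A real implementation would look the file up in the data files store,
        # refuse (or cascade) deletion while runs or analyses still reference
        # it, and then remove both the row and the file on disk.
        raise NotImplementedError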

Comment on lines 335 to 342
    ids_in_analyses = transaction.execute(analysis_csv_file_ids).scalars().all()
    ids_in_runs = transaction.execute(run_csv_file_ids).scalars().all()
    data_files_rows = transaction.execute(
        data_files_table.select()
        .where(
            data_files_table.c.id.in_(list(set(ids_in_analyses + ids_in_runs)))
        )
        .order_by(sqlite_rowid)
Contributor:

Answering your review question, I think we can do this:

        with self._sql_engine.begin() as transaction:
            data_files_rows = transaction.execute(
                data_files_table.select()
                .where(
                    data_files_table.c.id.in_(analysis_csv_file_ids)
                    | data_files_table.c.id.in_(run_csv_file_ids)
                )
                .order_by(sqlite_rowid)
            ).all()

Which keeps the set operations inside SQLite.

The documentation for | is difficult to Google for, but it's here.

Some kind of JOIN would work too. I haven't tested this but we might be reaching the point of query complexity where it actually makes a difference.

@SyntaxColoring (Contributor) commented Aug 7, 2024:

Also, completely unrelated, but for future reference, tables that need to support queries like this should probably have indexes. For example, right now, to find out which analyses use a given CSV file, SQLite needs to linearly scan on analysis_csv_rtp_table.file_id. And to find out which CSV files are used by a given analysis, SQLite needs to linearly scan on analysis_csv_rtp_table.analysis_id.

The way that I check for this kind of thing is:

  1. When you add a new query, run the unit test with --log-level=INFO -rA so you can see the compiled SQL. Equivalently, you could watch the logs in the dev server as you interact with it via Postman.
  2. Run sqlite3 on any robot-server .db file to get a REPL. The database can be empty; you just need it to have the correct schema.
  3. Run EXPLAIN QUERY PLAN in the REPL and paste the query from above. You might need to adjust the parameter placeholders to be syntactically valid. If the query explanation says SCAN, that means it's doing an O(n) full table scan and the schema deserves a second look. If it says SEARCH ... USING INDEX, that's a better O(log(n)).

Here's example output from this query:

QUERY PLAN
|--MULTI-INDEX OR
|  |--INDEX 1
|  |  |--LIST SUBQUERY 2
|  |  |  |--SCAN analysis_csv_rtp_table
|  |  |  `--LIST SUBQUERY 1
|  |  |     `--SEARCH analysis USING INDEX ix_analysis_protocol_id (protocol_id=?)
|  |  `--SEARCH data_files USING INDEX sqlite_autoindex_data_files_1 (id=?)
|  `--INDEX 2
|     |--LIST SUBQUERY 4
|     |  |--SCAN run_csv_rtp_table
|     |  `--LIST SUBQUERY 3
|     |     `--SCAN run
|     `--SEARCH data_files USING INDEX sqlite_autoindex_data_files_1 (id=?)
`--USE TEMP B-TREE FOR ORDER BY
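
To turn those SCANs into searches, the RTP tables could declare indexes on the columns these queries filter by, e.g. (a sketch against stand-in table definitions, not the real schema module):

    import sqlalchemy

    metadata = sqlalchemy.MetaData()
    analysis_csv_rtp_table = sqlalchemy.Table(
        "analysis_csv_rtp_table",
        metadata,
        sqlalchemy.Column("analysis_id", sqlalchemy.String),
        sqlalchemy.Column("file_id", sqlalchemy.String),
        # Each index turns the corresponding SCAN in the plan above into a
        # SEARCH ... USING INDEX.
        sqlalchemy.Index("ix_analysis_csv_rtp_analysis_id", "analysis_id"),
        sqlalchemy.Index("ix_analysis_csv_rtp_file_id", "file_id"),
    )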

@sanni-t (Member, Author):

Thanks for the tips!

Comment on lines 317 to 332
    select_referencing_analysis_ids = sqlalchemy.select(analysis_table.c.id).where(
        analysis_table.c.protocol_id == protocol_id
    )
    select_referencing_run_ids = sqlalchemy.select(run_table.c.id).where(
        run_table.c.protocol_id == protocol_id
    )
    # Get all entries in analysis_csv_rtp_table that match the analysis IDs above
    analysis_csv_file_ids = sqlalchemy.select(
        analysis_csv_rtp_table.c.file_id
    ).where(
        analysis_csv_rtp_table.c.analysis_id.in_(select_referencing_analysis_ids)
    )
    # Get all entries in run_csv_rtp_table that match the run IDs above
    run_csv_file_ids = sqlalchemy.select(run_csv_rtp_table.c.file_id).where(
        run_csv_rtp_table.c.run_id.in_(select_referencing_run_ids)
    )
Contributor:

Nitpick, and not introduced in this PR: Can these variable names either consistently have a select_ prefix to indicate that they're merely statements instead of results, or consistently not have that prefix?

@sanni-t merged commit 90127ff into chore_release-8.0.0 on Aug 7, 2024
7 checks passed