Skip to content

Latest commit

 

History

History
1482 lines (970 loc) · 43.4 KB

index.md

File metadata and controls

1482 lines (970 loc) · 43.4 KB
home
true

{% include init_counters.md %}

{% include h2_unnumbered.md title="what this is" %}

  • notes and working examples that instructors can use to perform a lesson
    • do not expect novices with no prior SQL experience to be able to learn from them
  • musical analogy
    • this is the chord changes and melody
    • we expect instructors to create an arrangement and/or improvise while delivering
    • see [Teaching Tech Together][t3] for background

{% include h2_unnumbered.md title="scope" %}

  • [intended audience][persona]
    • Rachel has a master's degree in cell biology and now works in a research hospital doing cell assays.
    • She learned a bit of R in an undergrad biostatistics course and has been through [the Carpentries lesson on the Unix shell][carpentries-shell].
    • Rachel is thinking about becoming a data scientist and would like to understand how data is stored and managed.
    • Her work schedule is unpredictable and highly variable, so she needs to be able to learn a bit at a time.
  • prerequisites
    • basic Unix command line: cd, ls, * wildcard
    • basic tabular data analysis: filtering rows, aggregating within groups
  • learning outcomes
    1. Explain the difference between a database and a database manager.
    2. Write SQL to select, filter, sort, group, and aggregate data.
    3. Define tables and insert, update, and delete records.
    4. Describe different types of join and write queries that use them to combine data.
    5. Use windowing functions to operate on adjacent rows.
    6. Explain what transactions are and write queries that roll back when constraints are violated.
    7. Explain what triggers are and write SQL to create them.
    8. Manipulate JSON data using SQL.
    9. Interact with a database using Python directly, from a Jupyter notebook, and via an ORM.

{% include h2_unnumbered.md title="setup" %}

  • Download the latest release
  • Unzip the file in a temporary directory to create:
    • ./db/*.db: the SQLite databases used in the examples
    • ./src/*.*: SQL queries, Python scripts, and other source code
    • ./out/*.*: expected output for examples

{% include h2_unnumbered.md title="background concepts" %}

  • A database is a collection of data that can be searched and retrieved
  • A database management system (DBMS) is a program that manages a particular kind of database
  • Each DBMS stores data in its own way
    • [SQLite][sqlite] stores each database in a single file
    • [PostgreSQL][postgresql] spreads information across many files for higher performance
  • DBMS can be a library embedded in other programs ([SQLite][sqlite]) or a server ([PostgreSQL][postgresql])
  • A relational database management system (RDBMS) stores data in tables and uses [SQL][sql] for queries
    • Unfortunately, every RDBMS has its own dialect of SQL
  • There are also NoSQL databases like [MongoDB][mongodb] that don't use tables

concept map: overview

{% include h2_unnumbered.md title="connect to database" %}

{% include single.md file="src/connect_penguins.sh" %}

  • Not actually a query
  • But we have to do it before we can do anything else

{% include h2_numbered.md title="select constant" %}

{% include double.md stem="select_1" suffix="sql out" %}

  • select is a keyword
  • Normally used to select data from table…
  • …but if all we want is a constant value, we don't need to specify one
  • Semi-colon terminator is required

{% include h2_numbered.md title="select all values from table" %}

{% include double.md stem="select_star" suffix="sql out" %}

  • An actual query
  • Use * to mean "all columns"
  • Use from tablename to specify table
  • Output format is not particularly readable

{% include h2_unnumbered.md title="administrative commands" %}

{% include double.md stem="admin_commands" suffix="sql out" %}

  • SQLite administrative commands start with . and aren't part of the SQL standard
    • PostgreSQL's special commands start with \
  • Use .help for a complete list

{% include h2_numbered.md title="specify columns" %}

{% include double.md stem="specify_columns" suffix="sql out" %}

  • Specify column names separated by commas
    • In any order
    • Duplicates allowed
  • Line breaks allowed encouraged for readability

{% include h2_numbered.md title="sort" %}

{% include double.md stem="sort" suffix="sql out" %}

  • order by must follow from (which must follow select)
  • asc is ascending, desc is descending
    • Default is ascending, but please specify

{% include h2_numbered.md title="limit output" %}

  • Full dataset has 344 rows

{% include double.md stem="limit" suffix="sql out" %}

  • Comments start with -- and run to the end of the line
  • limit N specifies maximum number of rows returned by query

{% include h2_numbered.md title="page output" %}

{% include double.md stem="page" suffix="sql out" %}

  • offset N must follow limit
  • Specifies number of rows to skip from the start of the selection
  • So this query skips the first 3 and shows the next 10

{% include h2_numbered.md title="remove duplicates" %}

{% include double.md stem="distinct" suffix="sql out" %}

  • distinct keyword must appear right after select
    • SQL was supposed to read like English
  • Shows distinct combinations
  • Blanks in sex column show missing data
    • We'll talk about this in a bit

{% include h2_numbered.md title="filter results" %}

{% include double.md stem="filter" suffix="sql out" %}

  • where condition filters the rows produced by selection
  • Condition is evaluated independently for each row
  • Only rows that pass the test appear in results
  • Use single quotes for 'text data' and double quotes for "weird column names"
    • SQLite will accept double-quoted text data

{% include h2_numbered.md title="filter with more complex conditions" %}

{% include double.md stem="filter_and" suffix="sql out" %}

  • and: both sub-conditions must be true
  • or: either or both part must be true
  • Notice that the row for Gentoo penguins on Biscoe island with unknown (empty) sex didn't pass the test
    • We'll talk about this in a bit

{% include h2_numbered.md title="do calculations" %}

{% include double.md stem="calculations" suffix="sql out" %}

  • Can do the usual kinds of arithmetic on individual values
    • Calculation done for each row independently
  • Column name shows the calculation done

{% include h2_numbered.md title="rename columns" %}

{% include double.md stem="rename_columns" suffix="sql out" %}

  • Use expression as name to rename
  • Give result of calculation a meaningful name
  • Can also rename columns without modifying

{% include h2_unnumbered.md title="check your understanding" %}

concept map: selection

{% include h2_numbered.md title="calculate with missing values" %} {% include double.md stem="show_missing_values" suffix="sql out" %}

  • SQL uses a special value null to representing missing data
    • Not 0 or empty string, but "I don't know"
  • Flipper length and body weight not known for one of the first five penguins
  • "I don't know" divided by 10 or 1000 is "I don't know"

{% include h2_numbered.md title="null equality" %}

  • Repeated from above so it doesn't count against our query limit

{% include double.md stem="filter" suffix="sql out" %}

  • If we ask for female penguins the row with the missing sex drops out

{% include double.md stem="null_equality" suffix="sql out" %}

{% include h2_numbered.md title="null inequality" %}

  • But if we ask for penguins that aren't female it drops out as well

{% include double.md stem="null_inequality" suffix="sql out" %}

{% include h2_numbered.md title="ternary logic" %}

{% include double.md stem="ternary_logic" suffix="sql out" %}

  • If we don't know the left and right values, we don't know if they're equal or not
  • So the result is null
  • Get the same answer for null != null
  • Ternary logic
equality
X Y null
X true false null
Y false true null
null null null null

{% include h2_numbered.md title="handle null safely" %}

{% include double.md stem="safe_null_equality" suffix="sql out" %}

  • Use is null and is not null to handle null safely
  • Other parts of SQL handle nulls specially

{% include h2_unnumbered.md title="check your understanding" %}

concept map: null

{% include h2_numbered.md title="aggregate" %}

{% include double.md stem="simple_sum" suffix="sql out" %}

  • Aggregation combines many values to produce one
  • sum is an aggregation function
  • Combines corresponding values from multiple rows

{% include h2_numbered.md title="common aggregation functions" %}

{% include double.md stem="common_aggregations" suffix="sql out" %}

  • This actually shouldn't work: can't calculate maximum or average if any values are null
  • SQL does the useful thing instead of the right one

{% include h2_numbered.md title="counting" %}

{% include double.md stem="count_behavior" suffix="sql out" %}

  • count(*) counts rows
  • count(column) counts non-null entries in column
  • count(distinct column) counts distinct non-null entries

{% include h2_numbered.md title="group" %}

{% include double.md stem="simple_group" suffix="sql out" %}

  • Put rows in groups based on distinct combinations of values in columns specified with group by
  • Then perform aggregation separately for each group
  • But which is which?

{% include h2_numbered.md title="behavior of unaggregated columns" %}

{% include double.md stem="unaggregated_columns" suffix="sql out" %}

  • All rows in each group have the same value for sex, so no need to aggregate

{% include h2_numbered.md title="arbitrary choice in aggregation" %}

{% include double.md stem="arbitrary_in_aggregation" suffix="sql out" %}

  • If we don't specify how to aggregate a column, SQL can choose any arbitrary value from the group
  • All penguins in each group have the same sex because we grouped by that, so we get the right answer
  • The body mass values are in the data but unpredictable
  • A common mistake

{% include h2_numbered.md title="filter aggregated values" %}

{% include double.md stem="filter_aggregation" suffix="sql out" %}

  • Using having condition instead of where condition for aggregates

{% include h2_numbered.md title="readable output" %}

{% include double.md stem="readable_aggregation" suffix="sql out" %}

  • Use round(value, decimals) to round off a number

{% include h2_numbered.md title="filter aggregate inputs" %}

{% include double.md stem="filter_aggregate_inputs" suffix="sql out" %}

  • filter (where condition) applies to inputs

{% include h2_unnumbered.md title="check your understanding" %}

concept map: aggregation

{% include h2_unnumbered.md title="create in-memory database" %}

{% include single.md file="src/in_memory_db.sh" %}

  • "Connect" to an in-memory database

{% include h2_numbered.md title="create tables" %}

{% include single.md file="src/create_work_job.sql" %}

  • create table name followed by parenthesized list of columns
  • Each column is a name, a data type, and optional extra information
    • E.g., not null prevents nulls from being added
  • .schema is not standard SQL
  • SQLite has added a few things
    • create if not exists
    • upper-case keywords (SQL is case insensitive)

{% include h2_numbered.md title="insert data" %}

{% include single.md file="src/populate_work_job.sql" %} {% include single.md file="out/insert_values.out" %}

{% include h2_numbered.md title="update rows" %}

{% include single.md file="src/update_work_job.sql" %} {% include single.md file="out/update_rows.out" %}

  • (Almost) always specify row(w) to update using where
    • Would otherwise update all rows
  • Useful to give each row a primary key that uniquely identifies it for this purpose
    • Will see other uses below

{% include h2_numbered.md title="delete rows" %}

{% include double.md stem="delete_rows" suffix="sql out" %}

  • Again, (almost) always specify row(s) to delete using where

{% include h2_numbered.md title="backing up" %}

{% include double.md stem="backing_up" suffix="sql out" %}

{% include h2_unnumbered.md title="check your understanding" %}

concept map: data definition and modification

{% include h2_numbered.md title="join tables" %}

{% include double.md stem="cross_join" suffix="sql out" %}

  • A join combines information from two tables
  • full outer join (also called cross join) constructs their cross product
    • All combinations of rows from each
  • Result isn't particularly useful: job and name don't match

{% include h2_numbered.md title="inner join" %}

{% include double.md stem="inner_join" suffix="sql out" %}

  • Use table.column notation to specify columns
    • A column can have the same name as a table
  • Use on condition to specify join condition
  • Since complain doesn't appear in job.name, none of those rows are kept

{% include h2_numbered.md title="aggregate joined data" %}

{% include double.md stem="aggregate_join" suffix="sql out" %}

  • Combines ideas we've seen before
  • But Tay is missing from the table

{% include h2_numbered.md title="left join" %}

{% include double.md stem="left_join" suffix="sql out" %}

  • A left outer join keeps all rows from the left table
  • Fills missing values from right table with null

{% include h2_numbered.md title="aggregate left joins" %}

{% include double.md stem="aggregate_left_join" suffix="sql out" %}

  • That's better, but we'd like to see 0 rather than a blank

{% include h2_unnumbered.md title="check your understanding" %}

concept map: join

{% include h2_numbered.md title="coalesce values" %}

{% include double.md stem="coalesce" suffix="sql out" %}

  • coalesce(val1, val2, …) returns first non-null value

{% include h2_numbered.md title="negate incorrectly" %}

  • Who doesn't calibrate?

{% include double.md stem="negate_incorrectly" suffix="sql out" %}

  • But Mik does calibrate
  • Problem is that there's an entry for Mik cleaning
  • And since 'clean' != 'calibrate', that row is included in the results
  • We need a different approach

{% include h2_numbered.md title="set membership" %}

{% include double.md stem="set_membership" suffix="sql out" %}

  • in values and not in values do exactly what you expect

{% include h2_numbered.md title="subqueries" %}

{% include double.md stem="subquery_set" suffix="sql out" %}

  • Use a subquery to select the people who do calibrate
  • Then select all the people who aren't in that set
  • Initially feels odd, but subqueries are useful in other ways

{% include h2_unnumbered.md title="M to N relationships" %}

  • Relationships between entities are usually characterized as:
    • 1-to-1: fields in the same record
    • 1-to-many: the many have a foreign key referring to the one's primary key
    • many-to-many: don't know how many keys to add to records ("maximum" never is)
  • Nearly-universal solution is a join table
    • Each record is a pair of foreign keys
    • I.e., each record is the fact that records A and B are related

{% include h2_numbered.md title="autoincrement and primary key" %}

{% include double.md stem="autoincrement" suffix="sql out" %}

  • Database autoincrements ident each time a new record is added
  • Use that field as the primary key
    • So that if Mik changes their name again, we only have to change one fact in the database
    • Downside: manual queries are harder to read (who is person 17?)

{% include h2_unnumbered.md title="internal tables" %}

{% include double.md stem="sequence_table" suffix="sql out" %}

  • Sequence numbers are not reset when rows are deleted

{% include h2_numbered.md title="alter tables" %}

{% include double.md stem="alter_tables" suffix="sql out" %}

  • Add a column after the fact
  • Since it can't be null, we have to provide a default value
    • Really want to make it the primary key, but SQLite doesn't allow that (easily) after the fact
  • Then use update to modify existing records
    • Can modify any number of records at once
    • So be careful about where clause
  • Data migration

{% include h2_numbered.md title="create new tables from old" %}

{% include double.md stem="insert_select" suffix="sql out" %}

  • new_work is our join table
  • Each column refers to a record in some other table

{% include h2_numbered.md title="remove tables" %}

{% include double.md stem="drop_table" suffix="sql out" %}

  • Remove the old table and rename the new one to take its place
    • Note if exists
  • Be careful…

{% include h2_numbered.md title="compare individual values to aggregates" %}

  • Go back to penguins

{% include double.md stem="compare_individual_aggregate" suffix="sql out" %}

  • Get average body mass in subquery
  • Compare each row against that
  • Requires two scans of the data, but there's no way to avoid that
  • Null values aren't included in the average or in the final results

{% include h2_numbered.md title="compare individual values to aggregates within groups" %}

{% include double.md stem="compare_within_groups" suffix="sql out" %}

  • Subquery runs first to create temporary table averaged with average mass per species
  • Join that with penguins
  • Filter to find penguins heavier than average within their species

{% include h2_numbered.md title="common table expressions" %}

{% include double.md stem="common_table_expressions" suffix="sql out" %}

  • Use common table expression (CTE) to make queries clearer
    • Nested subqueries quickly become difficult to understand
  • Database decides how to optimize

{% include h2_unnumbered.md title="explain query plan" %}

{% include double.md stem="explain_query_plan" suffix="sql out" %}

  • SQLite plans to scan every row of the table
  • It will build a temporary B-tree data structure to group rows

{% include h2_numbered.md title="enumerate rows" %}

  • Every table has a special column called rowid

{% include double.md stem="rowid" suffix="sql out" %}

  • rowid is persistent within a session
    • I.e., if we delete the first 5 rows we now have row IDs 6…N
  • Do not rely on row ID
    • In particular, do not use it as a key

{% include h2_numbered.md title="if-else function" %}

{% include double.md stem="if_else" suffix="sql out" %}

  • iif(condition, true_result, false_result)
    • Note: iif with two i's

{% include h2_numbered.md title="select a case" %}

  • What if we want small, medium, and large?
  • Can nest iif, but quickly becomes unreadable

{% include double.md stem="case_when" suffix="sql out" %}

  • Evaluate when options in order and take first
  • Result of case is null if no condition is true
  • Use else as fallback

{% include h2_numbered.md title="check range" %}

{% include double.md stem="check_range" suffix="sql out" %}

  • between can make queries easier to read
  • But be careful of the and in the middle

{% include h2_unnumbered.md title="yet another database" %}

  • Entity-relationship diagram (ER diagram) shows relationships between tables
  • Like everything to do with databases, there are lots of variations

assay database table diagram

assay ER diagram

{% include double.md stem="assay_staff" suffix="sql out" %}

{% include h2_numbered.md title="pattern matching" %}

{% include double.md stem="like_glob" suffix="sql out" %}

  • like is the original SQL pattern matcher
    • % matches zero or more characters at the start or end of a string
    • Case insensitive by default
  • glob supports Unix-style wildcards
name purpose
substr Get substring given starting point and length
trim Remove characters from beginning and end of string
ltrim Remove characters from beginning of string
rtrim Remove characters from end of string
length Length of string
replace Replace occurrences of substring with another string
upper Return upper-case version of string
lower Return lower-case version of string
instr Find location of first occurrence of substring (returns 0 if not found)

{% include h2_numbered.md title="select first and last rows" %}

{% include double.md stem="union_all" suffix="sql out" %}

  • union all combines records
    • Keeps duplicates: union on its own keeps unique records
  • Yes, it feels like the extra select * from should be unnecessary

{% include h2_numbered.md title="intersection" %}

{% include double.md stem="intersect" suffix="sql out" %}

  • Tables being intersected must have same structure
  • Intersection usually used when pulling values from different tables
    • In this case, would be clearer to use where

{% include h2_numbered.md title="exclusion" %}

{% include double.md stem="except" suffix="sql out" %}

  • Again, tables must have same structure
    • And this would be clearer with where
  • SQL operates on sets, not tables, except where it doesn't

{% include h2_numbered.md title="random numbers and why not" %}

{% include double.md stem="random_numbers" suffix="sql out" %}

  • There is no way to seed SQLite's random number generator
  • Which means there is no way to reproduce one of its "random" sequences

{% include h2_numbered.md title="creating index" %}

{% include double.md stem="create_use_index" suffix="sql out" %}

  • An index is an auxiliary data structure that enables faster access to records
    • Spend storage space to buy speed
  • Don't have to mention it explicitly in queries
    • Database manager will use it automatically

{% include h2_numbered.md title="generate sequence" %}

{% include double.md stem="generate_sequence" suffix="sql out" %}

  • A (non-standard) table-valued function

{% include h2_numbered.md title="generate sequence based on data" %}

{% include double.md stem="data_range_sequence" suffix="sql out" %}

  • Must have the parentheses around the min and max selections to keep SQLite happy

{% include h2_numbered.md title="generate sequence of dates" %}

{% include double.md stem="date_sequence" suffix="sql out" %}

  • SQLite represents dates as YYYY-MM-DD strings or as Julian days or as Unix milliseconds or…
    • Julian days is fractional number of days since November 24, 4714 BCE
  • julianday and date convert back and forth

{% include h2_numbered.md title="count experiments started per day without gaps" %}

{% include double.md stem="experiments_per_day" suffix="sql out" %}

{% include h2_numbered.md title="self join" %}

{% include double.md stem="self_join" suffix="sql out" %}

  • Join a table to itself
    • Use as to create aliases for copies of tables to distinguish them
    • Nothing special about the names left and right
  • Get all n2 pairs, including person with themself

{% include h2_numbered.md title="generate unique pairs" %}

{% include double.md stem="unique_pairs" suffix="sql out" %}

  • left.ident < right.ident ensures distinct pairs without duplicates
  • Use left.ident <= 4 and right.ident <= 4 to limit output
  • Quick check: n(n-1)/2 pairs

{% include h2_numbered.md title="filter pairs" %}

{% include double.md stem="filter_pairs" suffix="sql out" %}

{% include h2_numbered.md title="existence and correlated subqueries" %}

{% include double.md stem="correlated_subquery" suffix="sql out" %}

  • Nobody works in Endocrinology
  • select 1 could equally be select true or any other value
  • A correlated subquery depends on a value from the outer query
    • Equivalent to nested loop

{% include h2_numbered.md title="nonexistence" %}

{% include double.md stem="nonexistence" suffix="sql out" %}

{% include h2_unnumbered.md title="avoiding correlated subqueries" %}

{% include double.md stem="avoid_correlated_subqueries" suffix="sql out" %}

  • The join might or might not be faster than the correlated subquery
  • Hard to find unstaffed departments without either not exists or count and a check for 0

{% include h2_numbered.md title="lead and lag" %}

{% include double.md stem="lead_lag" suffix="sql out" %}

  • Use strftime to extract year and month
    • Clumsy, but date/time handling is not SQLite's strong point
  • Use window functions lead and lag to shift values
    • Unavailable values are null

{% include h2_numbered.md title="window functions" %}

{% include double.md stem="window_functions" suffix="sql out" %}

  • sum() over does a running total
  • cume_dist is fraction of rows seen so far

{% include h2_unnumbered.md title="explain another query plain" %}

{% include double.md stem="explain_window_function" suffix="sql out" %}

  • Becomes useful…eventually

{% include h2_numbered.md title="partitioned windows" %}

{% include double.md stem="partition_window" suffix="sql out" %}

  • partition by creates groups
  • So this counts experiments started since the beginning of each year

{% include h2_numbered.md title="blobs" %}

{% include double.md stem="blob" suffix="sql out" %}

  • A blob is a binary large object
    • Bytes in, bytes out…
  • If you think that's odd, check out [Fossil][fossil]

{% include h2_unnumbered.md title="yet another database" %}

{% include single.md file="src/lab_log_db.sh" %} {% include double.md stem="lab_log_schema" suffix="sql out" %}

{% include h2_numbered.md title="store JSON" %}

{% include double.md stem="json_in_table" suffix="sql out" %}

  • Store heterogeneous data as JSON-formatted text (with double-quoted strings)
    • Database parses it each time it is queried
  • Alternatively store as blob
    • Can't just view it
    • But more efficient

{% include h2_numbered.md title="select field from JSON" %}

{% include double.md stem="json_field" suffix="sql out" %}

  • Single arrow -> returns JSON representation result
  • Double arrow ->> returns SQL text, integer, real, or null
  • Left side is column
  • Right side is path expression
    • Start with $ (meaning "root")
    • Fields separated by .

{% include h2_numbered.md title="JSON array access" %}

{% include double.md stem="json_array" suffix="sql out" %}

  • SQLite (and other database managers) has lots of JSON manipulation functions
  • json_array_length gives number of elements in selected array
  • subscripts start with 0
  • Characters outside 7-bit ASCII represented as Unicode escapes

{% include h2_numbered.md title="unpack JSON array" %}

{% include double.md stem="json_unpack" suffix="sql out" %}

  • json_each is another table-valued function
  • Use json_each.name to get properties of unpacked array

{% include h2_numbered.md title="last element of array" %}

{% include double.md stem="json_array_last" suffix="sql out" %}

{% include h2_numbered.md title="modify JSON" %}

{% include double.md stem="json_modify" suffix="sql out" %}

  • Updates the in-memory copy of the JSON, not the database record
  • Please use json_quote rather than trying to format JSON with string operations

{% include h2_unnumbered.md title="refresh penguins" %}

{% include double.md stem="count_penguins" suffix="sql out" %}

  • We will restore full database after each example

{% include h2_numbered.md title="tombstones" %}

{% include single.md file="src/make_active.sql" %} {% include double.md stem="active_penguins" suffix="sql out" %}

  • Use a tombstone to mark (in)active records
  • Every query must now include it

{% include h2_numbered.md title="views" %}

{% include double.md stem="views" suffix="sql out" %}

  • A view is a saved query that other queries can invoke
  • View is re-run each time it's used
  • Like a CTE, but:
    • Can be shared between queries
    • Views came first
  • Some databases offer materialized views
    • Update-on-demand temporary tables

{% include h2_unnumbered.md title="hours reminder" %}

{% include double.md stem="all_jobs" suffix="sql out" %}

{% include h2_numbered.md title="add check" %}

{% include double.md stem="all_jobs_check" suffix="sql out" %}

  • check adds constraint to table
    • Must produce a Boolean result
    • Run each time values added or modified
  • But changes made before the error have taken effect

{% include h2_unnumbered.md title="ACID" %}

  • Atomic: change cannot be broken down into smaller ones (i.e., all or nothing)
  • Consistent: database goes from one consistent state to another
  • Isolated: looks like changes happened one after another
  • Durable: if change takes place, it's still there after a restart

{% include h2_numbered.md title="transactions" %}

{% include double.md stem="transaction" suffix="sql out" %}

  • Statements outside transaction execute and are committed immediately
  • Statement(s) inside transaction don't take effect until:
    • end transaction (success)
    • rollback (undo)
  • Can have any number of statements inside a transaction
  • But cannot nest transactions in SQLite
    • Other databases support this

{% include h2_numbered.md title="rollback in constraint" %}

{% include double.md stem="rollback_constraint" suffix="sql out" %}

  • All of second insert rolled back as soon as error occurred
  • But first insert took effect

{% include h2_numbered.md title="rollback in statement" %}

{% include double.md stem="rollback_statement" suffix="sql out" %}

  • Constraint is in table definition
  • Action is in statement

{% include h2_numbered.md title="upsert" %}

{% include double.md stem="upsert" suffix="sql out" %}

  • upsert stands for "update or insert"
    • Create if record doesn't exist
    • Update if it does
  • Not standard SQL but widely implemented
  • Example also shows use of SQLite .print command

{% include h2_unnumbered.md title="normalization" %}

  • First normal form (1NF): every field of every record contains one indivisible value.

  • Second normal form (2NF) and third normal form (3NF): every value in a record that isn't a key depends solely on the key, not on other values.

  • Denormalization: explicitly store values that could be calculated on the fly

    • To simplify queries and/or make processing faster

{% include h2_numbered.md title="create trigger" %}

  • A trigger automatically runs before or after a specified operation
  • Can have side effects (e.g., update some other table)
  • And/or implement checks (e.g., make sure other records exist)
  • Add processing overhead…
  • …but data is either cheap or correct, never both
  • Inside trigger, refer to old and new versions of record as old.column and new.column

{% include single.md file="src/trigger_setup.sql" %} {% include double.md stem="trigger_successful" suffix="sql out" %}

081: trigger firing

{% include double.md stem="trigger_firing" suffix="sql out" %}

{% include h2_unnumbered.md title="represent graphs" %}

{% include single.md file="src/lineage_setup.sql" %} {% include double.md stem="represent_graph" suffix="sql out" %}

lineage diagram

{% include h2_numbered.md title="recursive query" %}

{% include double.md stem="recursive_lineage" suffix="sql out" %}

  • Use a recursive CTE to create a temporary table (descendent)
  • Base case seeds this table
  • Recursive case relies on value(s) already in that table and external table(s)
  • union all to combine rows
    • Can use union but that has lower performance (must check uniqueness each time)
  • Stops when the recursive case yields an empty row set (nothing new to add)
  • Then select the desired values from the CTE

{% include h2_unnumbered.md title="contact tracing database" %}

{% include double.md stem="contact_person" suffix="sql out" %} {% include double.md stem="contact_contacts" suffix="sql out" %}

contact diagram

{% include h2_numbered.md title="bidirectional contacts" %}

{% include double.md stem="bidirectional" suffix="sql out" %}

  • Create a temporary table rather than using a long chain of CTEs
    • Only lasts as long as the session (not saved to disk)
  • Duplicate information rather than writing more complicated query

{% include h2_numbered.md title="update group identifiers" %}

{% include double.md stem="update_group_ids" suffix="sql out" %}

  • new_ident is minimum of own identifier and identifiers one step away
  • Doesn't keep people with no contacts

{% include h2_numbered.md title="recursive labeling" %}

{% include double.md stem="recursive_labeling" suffix="sql out" %}

  • Use union instead of union all to prevent infinite recursion

{% include h2_numbered.md title="query from Python" %}

{% include double.md stem="basic_python_query" suffix="py out" %}

  • sqlite3 is part of Python's standard library
  • Create a connection to a database file
  • Get a cursor by executing a query
    • More common to create cursor and use that to run queries
  • Fetch all rows at once as list of tuples

{% include h2_numbered.md title="incremental fetch" %}

{% include double.md stem="incremental_fetch" suffix="py out" %}

  • cursor.fetchone returns None when no more data
  • There is also fetchmany(N) to fetch (up to) a certain number of rows

{% include h2_numbered.md title="insert, delete, and all that" %}

{% include double.md stem="insert_delete" suffix="py out" %}

  • Each execute is its own transaction

{% include h2_numbered.md title="interpolate values" %}

{% include double.md stem="interpolate" suffix="py out" %}

  • From [XKCD][xkcd-tables]

XKCD Exploits of a Mom

{% include h2_numbered.md title="script execution" %}

{% include double.md stem="script_execution" suffix="py out" %}

  • But what if something goes wrong?

{% include h2_numbered.md title="SQLite exceptions in Python" %}

{% include double.md stem="exceptions" suffix="py out" %}

{% include h2_numbered.md title="Python in SQLite" %}

{% include double.md stem="embedded_python" suffix="py out" %}

  • SQLite calls back into Python to execute the function
  • Other databases can run Python (and other languages) in the database server process
  • Be careful

{% include h2_numbered.md title="handle dates and times" %}

{% include double.md stem="dates_times" suffix="py out" %}

  • sqlite3.PARSE_DECLTYPES tells sqlite3 library to use converts based on declared column types
  • Adapt on the way in, convert on the way out

{% include h2_numbered.md title="SQL in Jupyter notebooks" %}

{% include single.md file="src/install_jupysql.sh" %}

  • And then inside the notebook:

{% include single.md file="src/load_ext.text" %}

  • Loads extension

{% include double.md stem="jupyter_connect" suffix="text out" %}

  • Connects to database
    • sqlite:// with two slashes is the protocol
    • /data/penguins.db (one leading slash) is a local path
  • Single percent sign %sql introduces one-line command
  • Use double percent sign %%sql to indicate that the rest of the cell is SQL

{% include double.md stem="jupyter_select" suffix="text out" %}

species num
Adelie 152
Chinstrap 68
Gentoo 124

{% include h2_numbered.md title="Pandas and SQL" %}

{% include single.md file="src/install_pandas.sh" %} {% include double.md stem="select_pandas" suffix="py out" %}

  • Be careful about datatype conversion

{% include h2_numbered.md title="Polars and SQL" %}

{% include single.md file="src/install_polars.sh" %} {% include double.md stem="select_polars" suffix="py out" %}

  • The Uniform Resource Identifier (URI) specifies the database
  • The query is the query
  • Use the ADBC engine instead of the default ConnectorX

{% include h2_numbered.md title="object-relational mapper" %}

{% include double.md stem="orm" suffix="py out" %}

  • An object-relational mapper (ORM) translates table columns to object properties and vice versa
  • SQLModel relies on Python type hints

{% include h2_numbered.md title="relations with ORM" %}

{% include double.md stem="orm_relation" suffix="py out" %}

  • Make foreign keys explicit in class definitions
  • SQLModel automatically does the join
    • The two staff with no department aren't included in the result

{% include h2_unnumbered.md title="Appendices" %}

{% include h2_unnumbered.md title=" Terms" %}

{% include glossary.html %}

{% include h2_unnumbered.md title=" Acknowledgments" %}

This tutorial would not have been possible without:

  • [Andi Albrecht][albrecht-andi]'s [sqlparse][sqlparse] module
  • [Dimitri Fontaine][fontaine-dimitri]'s [The Art of PostgreSQL][art-postgresql]
  • David Rozenshtein's The Essence of SQL (now sadly out of print)

I would also like to thank the following for spotting issues, making suggestions, or submitting changes:

{% include thanks.html %}

{% include links.md links=site.data.tutorial %}