SQL for Python Developers

Gaurav Agarwal

Agenda

Data is the new oil, and SQL is the drill

Software Engineer & Product Developer

Director of Engineering & Founder @ https://codermana.com

ex-Tarka Labs, ex-BrowserStack, ex-ThoughtWorks

As an instructor

I promise to
- make this class as interactive as possible
- use as many resources as available to keep you engaged
- ensure everyone's questions are addressed

What I need from you

Be vocal
- Let me know if there are any audio/video issues ASAP
- Feel free to interrupt me and ask me questions
Be punctual
Give feedback
Work on the exercises
Be on mute unless you are speaking

Class progression

Here you are trying to learn something, while here your brain is doing you a favor by making sure the learning doesn't stick!

Some tips (1/2)

Slow down => stop & think
- listen for the questions and answer
Do the exercises
- not add-ons; not optional
There are no dumb questions!
Drink water. Lots of it!

Some tips (2/2)

Take notes
- Try: Repetitive Spaced Out Learning
Talk about it out loud
Listen to your brain
Experiment!

📚 Content `>` 🕒 Time

Show of hands

Yay's - in Chat

Python Database landscape?

DB API 2.0 Specification

The Python Database API (DB-API) is the standard interface for connecting to databases in Python

It's independent of any particular database engine, so the same Python code structure works with any engine that has a DB-API-compliant driver
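
A minimal sketch of the DB-API 2.0 shape (connect, cursor, execute, fetch, commit). It uses the standard-library `sqlite3` module only because it ships with Python and follows the same specification; the `users` table and its rows are made up for illustration. With a PostgreSQL driver, the `connect()` arguments and the placeholder style would differ.

```python
import sqlite3

# sqlite3 stands in for a PostgreSQL driver here (assumption, for portability).
conn = sqlite3.connect(":memory:")
try:
    cur = conn.cursor()  # cursors run statements and hold results
    cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    cur.executemany(
        "INSERT INTO users (name) VALUES (?)",  # sqlite3 uses the 'qmark' paramstyle
        [("alice",), ("bob",)],
    )
    conn.commit()  # changes are committed explicitly

    cur.execute("SELECT id, name FROM users ORDER BY id")
    rows = cur.fetchall()  # a list of tuples, one per row
    print(rows)
finally:
    conn.close()
```

Each DB-API module advertises its placeholder style in a module-level `paramstyle` attribute, which is one of the few things that changes when switching drivers.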

DB Drivers

Database (DB) drivers are software components that allow applications to interact with database management systems (DBMS) using specific protocols or interfaces

These drivers act as intermediaries, translating queries from an application into commands that the database understands, and then converting the database's responses back into a format usable by the application

Synchronous Drivers

psycopg

The most popular PostgreSQL adapter for Python. It's feature-rich, mature, and widely used in production

Full compliance with the Python DB-API 2.0
Support for prepared statements, server-side cursors, and COPY commands
Extensive support for PostgreSQL data types
Thread-safe for use with multiple connections

pg8000

A pure-Python PostgreSQL driver (doesn’t rely on C extensions)

Easy to deploy since it has no external dependencies
Fully DB-API 2.0 compliant

Asynchronous Drivers

Async drivers are designed for non-blocking I/O and are suitable for asynchronous frameworks like asyncio

asyncpg

A high-performance asynchronous PostgreSQL client library

asyncpg implements PostgreSQL server protocol natively and exposes its features directly, as opposed to hiding them behind a generic facade like DB-API

This enables asyncpg to have easy-to-use support for:

prepared statements
scrollable cursors
partial iteration on query results
automatic encoding and decoding of composite types, arrays, and any combination of those
straightforward support for custom data types

aiopg

Asyncio-based PostgreSQL driver built on psycopg2

Provides async functionality while maintaining compatibility with psycopg
Offers coroutine-based connection pools and cursors

Applications transitioning from psycopg to async workflows can leverage aiopg

Choosing the Right Driver

Use psycopg for general-purpose, synchronous applications
Use asyncpg for high-performance, async workloads
Use pg8000 if you need a pure Python solution without external dependencies
Use aiopg for async workflows with existing psycopg experience

Why ORM?

Object-Relational Mapping (ORM) is a programming technique that allows developers to interact with databases using the object-oriented paradigm rather than writing raw SQL queries

ORM frameworks automate the process of converting data between incompatible systems, specifically between objects in code and rows in a relational database

ORMs abstract the complexity of (simple) SQL, making database operations more intuitive by allowing developers to interact with the database as though it were a set of objects in code

Without ORM (raw SQL):
```
SELECT * FROM users WHERE id = 1;
```
With ORM (Python example using SQLAlchemy):
```
user = session.get(User, 1)  # Query.get() is legacy as of SQLAlchemy 2.0
```

ORMs use parameterized queries by default, which helps protect against SQL injection attacks.
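
To illustrate why parameterized queries matter, here is a small sketch that runs the same lookup twice: once by pasting input into the SQL text, once by binding it as a parameter. It uses the standard-library `sqlite3` module as a stand-in database; the table, data, and the malicious input string are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])

malicious = "nobody' OR '1'='1"

# Unsafe: the input becomes part of the SQL text and changes the query's logic.
unsafe_sql = f"SELECT name FROM users WHERE name = '{malicious}'"
unsafe_rows = conn.execute(unsafe_sql).fetchall()

# Safe: the value is bound as a parameter, so it is only ever compared as data.
safe_rows = conn.execute(
    "SELECT name FROM users WHERE name = ?", (malicious,)
).fetchall()

print(len(unsafe_rows), len(safe_rows))
conn.close()
```

The unsafe version matches every row because the injected `OR '1'='1'` is always true; the bound version matches none, since no user has that literal name.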

ORM libraries

SQLAlchemy

Works as an abstraction layer, leveraging drivers like psycopg or asyncpg

SQLAlchemy generates optimized SQL queries and supports lazy loading, eager loading, and batch loading for efficient data fetching.
The Core API enables precise control over query performance, making it suitable for high-performance applications.

SQLAlchemy offers more than just ORM. It includes:

Core SQL Layer: Provides fine-grained control over SQL queries for high-performance and complex use cases.
ORM Layer: Allows developers to interact with the database using Python objects, simplifying development.
Schema Management: Tools for defining, creating, and altering database schemas.

SQLAlchemy can be used in two main ways:

SQLAlchemy Core: A lower-level abstraction for constructing and executing SQL expressions programmatically.
SQLAlchemy ORM: A high-level abstraction that maps Python classes to database tables.

This flexibility makes it suitable for projects ranging from simple applications to highly complex systems.

Django ORM

Works with psycopg (version 3, supported since Django 4.2) or psycopg2 for PostgreSQL

Tortoise ORM

Async ORM that works with asyncpg

Limitations of ORMs

Performance Overhead
- ORMs can be slower than raw SQL for complex queries, as they generate queries dynamically.
Learning Curve
- While ORMs simplify database interactions, understanding how they work under the hood can take time.
Complex Queries
- Writing intricate queries might require raw SQL for better control.
  - Eg: Complex joins or database-specific optimizations.
Dependency
- Applications using an ORM are tied to its API, making migration to another ORM or raw SQL more difficult.

Schema/Migration Management

Flyway

A database migration tool that emphasizes simplicity and compatibility with multiple databases.

Uses a file-based approach where migrations are typically written in SQL or Java.

Alembic

Alembic is a database migrations tool written by the author of SQLAlchemy.

Can emit ALTER statements to a database in order to change the structure of tables and other constructs
Provides a system whereby "migration scripts" may be constructed; each script indicates a particular series of steps that can "upgrade" a target database to a new version, and optionally a series of steps that can "downgrade" similarly, doing the same steps in reverse
Allows the scripts to execute in some sequential manner
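
The idea of ordered "upgrade" and "downgrade" steps can be sketched in a few lines. This is an illustration of the concept only, not how Alembic is implemented: Alembic uses Python migration scripts with `upgrade()` and `downgrade()` functions and its own revision identifiers. The `MIGRATIONS` list, the `schema_version` table, and the use of `sqlite3` as a stand-in database are all assumptions made for the example.

```python
import sqlite3

# Hypothetical migrations: (version, upgrade SQL, downgrade SQL).
MIGRATIONS = [
    (1, "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)", "DROP TABLE users"),
    (2, "CREATE INDEX idx_users_name ON users (name)", "DROP INDEX idx_users_name"),
]


def current_version(conn):
    """Return the highest applied version, or 0 for a fresh database."""
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER NOT NULL)")
    row = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()
    return row[0] or 0


def upgrade(conn, target):
    """Apply pending upgrade steps in ascending order, up to `target`."""
    for version, up_sql, _ in MIGRATIONS:
        if current_version(conn) < version <= target:
            conn.execute(up_sql)
            conn.execute("INSERT INTO schema_version (version) VALUES (?)", (version,))
    conn.commit()


def downgrade(conn, target):
    """Undo applied steps in descending order, down to `target`."""
    for version, _, down_sql in reversed(MIGRATIONS):
        if target < version <= current_version(conn):
            conn.execute(down_sql)
            conn.execute("DELETE FROM schema_version WHERE version = ?", (version,))
    conn.commit()


conn = sqlite3.connect(":memory:")
upgrade(conn, target=2)
version_after_upgrade = current_version(conn)
downgrade(conn, target=1)
version_after_downgrade = current_version(conn)
print(version_after_upgrade, version_after_downgrade)
```

Recording the applied version inside the database itself is what lets a migration tool work out which scripts are still pending on any given environment.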

Atlas

Atlas is a language-agnostic tool for managing and migrating database schemas using modern DevOps principles.

It offers two workflows:

Declarative: Similar to Terraform, Atlas compares the current state of the database to the desired state, as defined in an HCL, SQL, or ORM schema. Based on this comparison, it generates and executes a migration plan to transition the database to its desired state.
Versioned: Unlike other tools, Atlas automatically plans schema migrations for you. Users can describe their desired database schema in HCL, SQL, or their chosen ORM, and by utilizing Atlas, they can plan, lint, and apply the necessary migrations to the database.

Setup & Getting Started

Install Postgres (15 & above)
- Make sure Postgres server is running on your system
Test using: psql template1
Clone the repo (https://github.com/AgarwalConsulting/SQLforPythonDevelopersTraining)
- pip install -r requirements.txt

`Psycopg` basics

'%s', '%b', and '%t' are the only supported placeholder formats for query parameters (automatic, binary, and text, respectively)
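
A minimal sketch of how placeholders are typically used, assuming `conn` is a psycopg 3 connection and that a hypothetical `users` table with `id`, `name`, and `age` columns exists. The function names are invented for the example; only `conn.execute()` and `fetchall()` are driver calls.

```python
def find_users(conn, name, min_age):
    """Run a parameterized query; `conn` is assumed to be a psycopg 3 connection."""
    # Values travel separately from the SQL text. Never build the query with
    # f-strings or the Python % operator.
    cur = conn.execute(
        "SELECT id, name FROM users WHERE name = %s AND age >= %s",
        (name, min_age),
    )
    return cur.fetchall()


def find_users_named(conn, name, min_age):
    """Same query with named placeholders and a mapping of parameters."""
    cur = conn.execute(
        "SELECT id, name FROM users WHERE name = %(name)s AND age >= %(min_age)s",
        {"name": name, "min_age": min_age},
    )
    return cur.fetchall()
```

With a running server this would be called as `find_users(conn, "alice", 18)` inside a `with psycopg.connect(...) as conn:` block, where the connection string depends on your local setup.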

Connection Pooling with `psycopg`

`SQLAlchemy` basics

Core: Provides a low-level interface for SQL execution.
ORM: Provides an object-oriented abstraction over the database.

Core connection example

SQLAlchemy ORM

Changes in 2.0

.content-credits[https://docs.sqlalchemy.org/en/20/changelog/migration_20.html#migration-20-query-usage]

Using built-in auto-migration support

Caveats of using built-in auto-migration support!

Basic migrations using `alembic`

To get started

alembic init alembic

To generate new migration

alembic revision [--autogenerate] -m "<migration message>"

ORM: Deep Dive

Relationships

One to One
One to Many
Many to Many
Self-Referential
Polymorphic

The dreaded (n+1)
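
The N+1 pattern can be sketched without an ORM: one query loads the parent rows, then one more query runs per parent. An ORM's lazy loading issues those per-parent queries implicitly when a relationship attribute is accessed. The example below uses `sqlite3` as a stand-in database and counts the statements actually executed; the `authors` and `books` tables are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE books (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO authors (id, name) VALUES (1, 'Ann'), (2, 'Bo'), (3, 'Cy');
    INSERT INTO books (author_id, title)
    VALUES (1, 'A1'), (1, 'A2'), (2, 'B1'), (3, 'C1');
""")

statements = []
conn.set_trace_callback(statements.append)  # record every SQL statement executed

# N+1 pattern: one query for the parents, then one more query per parent.
authors = conn.execute("SELECT id, name FROM authors").fetchall()
for author_id, _name in authors:
    conn.execute("SELECT title FROM books WHERE author_id = ?", (author_id,)).fetchall()
n_plus_one_count = len(statements)

statements.clear()

# Eager alternative: fetch parents and children together with a single JOIN.
conn.execute(
    "SELECT a.name, b.title FROM authors a JOIN books b ON b.author_id = a.id"
).fetchall()
joined_count = len(statements)

print(n_plus_one_count, joined_count)
conn.close()
```

With three authors the loop costs four round trips, and the count grows with the number of parents, while the JOIN stays at one.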

Relationship Loading Techniques

Raw Queries

Constraints, validations and events

Defining constraints on columns

Defining Validations

Events

Commonly Used Events

before_insert: Triggered before an object is inserted into the database.
after_insert: Triggered after an object is inserted.
before_update: Triggered before an update operation.
after_update: Triggered after an update operation.
before_delete: Triggered before an object is deleted.
after_delete: Triggered after an object is deleted.

Default Values

Advanced Schema Management

Postgres Basics: Refresher

Types

SERIAL

(Date, Time) vs Timestamp [with Timezone]

Loading/Dumping csv data

Using COPY (reads/writes files on the server) or \copy (psql meta-command; reads/writes files on the client)

COPY <tablename> TO '<csv_file>' WITH(FORMAT CSV, HEADER); -- Dumping

COPY <tablename> FROM '<csv_file>' WITH (FORMAT CSV, HEADER, DELIMITER ','); -- Loading

Normalization

1NF
2NF
3NF
BCNF

Let's take an employee management system as an example...

The problems with flat tables

Data Redundancy: Repeated storage of the same data increases storage needs.
Data Integrity Issues: Risk of inconsistencies when updating repeated data.
Scalability Challenges: Tables grow large quickly, impacting performance.
Lack of Normalization: Makes managing relationships between data entities difficult.
Limited Query Flexibility: Complex queries may become inefficient or convoluted.
Maintenance Overhead: Adding or modifying data structures is harder and error-prone.
Poor Data Organization: Difficult to handle hierarchical or multi-dimensional data effectively.
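
The data-integrity problem can be made concrete with a small sketch of the employee example. It uses `sqlite3` as a stand-in database; the table layouts and names are assumptions chosen to match the `employees` / `departments` naming used elsewhere in these slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Flat design: the department name is repeated on every employee row.
    CREATE TABLE employees_flat (id INTEGER PRIMARY KEY, name TEXT, dept_name TEXT);
    INSERT INTO employees_flat (name, dept_name)
    VALUES ('Asha', 'HR'), ('Ben', 'HR'), ('Chen', 'Sales');

    -- Normalized design: the department name is stored exactly once.
    CREATE TABLE departments (dept_id INTEGER PRIMARY KEY, department_name TEXT);
    CREATE TABLE employees (
        id INTEGER PRIMARY KEY,
        name TEXT,
        department_id INTEGER REFERENCES departments (dept_id)
    );
    INSERT INTO departments VALUES (1, 'HR'), (2, 'Sales');
    INSERT INTO employees (name, department_id)
    VALUES ('Asha', 1), ('Ben', 1), ('Chen', 2);
""")

# Renaming HR in the flat table, but missing a row, leaves inconsistent data.
conn.execute("UPDATE employees_flat SET dept_name = 'People' WHERE name = 'Asha'")
flat_names = conn.execute(
    "SELECT DISTINCT dept_name FROM employees_flat ORDER BY dept_name"
).fetchall()

# In the normalized schema the rename is a single-row update that cannot drift.
conn.execute("UPDATE departments SET department_name = 'People' WHERE dept_id = 1")
normalized = conn.execute("""
    SELECT e.name, d.department_name
    FROM employees e JOIN departments d ON e.department_id = d.dept_id
    ORDER BY e.id
""").fetchall()

print(flat_names)
print(normalized)
conn.close()
```

After the partial update, the flat table reports both 'HR' and 'People' for what is really one department, which is the update anomaly that normalization removes.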

When NOT to Normalize

Performance is the Key Concern
- Normalization requires more joins to fetch related data, which can slow down read-heavy operations.
You Have a Read-Heavy Application
- For applications like reporting or analytics, de-normalized data reduces the need for joins, speeding up queries.
Storage is Cheap and Plentiful
- Modern storage costs are low, so eliminating redundancy may not always be worth the effort.

Your Data is Relatively Static
- If your data rarely changes, you can afford some redundancy without significant risks of anomalies.
You’re Using a NoSQL Database
- Document-based databases (e.g., MongoDB) favor denormalized structures for faster read performance and scalability.

To Normalize or not to Normalize: How to Decide?

Application Type
- Transactional systems (OLTP): Normalize.
- Analytical systems (OLAP): Often de-normalize.
Query Complexity
- Normalize when query patterns involve fine-grained data updates or strict constraints.
- De-normalize when queries are simple but need to run fast.
Maintainability vs. Performance
- Normalize for long-term maintainability and data consistency.
- De-normalize for immediate performance gains in specific use cases.

Hybrid Approaches

Partially Normalize
- Normalize up to 3NF, then selectively de-normalize performance-critical parts.
Use Indexing and Materialized Views
- Optimize performance without sacrificing full normalization by creating indexes or materialized views for frequently used queries.

Normalize for integrity, de-normalize for speed.

Joins

Usage of `IN` vs `NOT IN` clauses

Differences Between `IN` and `NOT IN`

Logic

IN returns rows where the value in the column matches any value in the list.
NOT IN returns rows where the value in the column does not match any value in the list.

Performance

IN: Works efficiently when checking against a relatively small list of values.
NOT IN: May not perform as efficiently when dealing with large sets of values or NULL values. This is because NOT IN may involve extra checks for NULL.

Handling NULLs

IN: If the list or subquery contains NULL and no other value matches, the result is NULL (unknown) rather than false; in a WHERE clause the row is filtered out either way.
NOT IN: If any value in the list is NULL, the result can never be true: it is false when a value matches and NULL (unknown) otherwise, so no rows are returned unless NULLs are excluded from the list.
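
These NULL rules come from SQL's three-valued logic and can be checked directly. The sketch below evaluates four literal expressions using `sqlite3` as a stand-in; SQLite follows the same rules as PostgreSQL here, but reports booleans as 1/0 and NULL as Python's `None`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

in_match, in_no_match, not_in_match, not_in_no_match = conn.execute("""
    SELECT
        1 IN (1, NULL),      -- a match exists: true
        3 IN (1, NULL),      -- no match, NULL present: unknown
        1 NOT IN (1, NULL),  -- a match exists: false
        3 NOT IN (1, NULL)   -- can never be true: unknown
""").fetchone()

print(in_match, in_no_match, not_in_match, not_in_no_match)
conn.close()
```

A WHERE clause keeps a row only when the condition is true, so both "false" and "unknown" filter the row out.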

Alternative with `EXISTS` or `NOT EXISTS`

If you need to handle subqueries more efficiently (especially with NULL values), consider using EXISTS or NOT EXISTS as an alternative to IN and NOT IN.

SELECT name
FROM employees e
WHERE EXISTS (
    SELECT 1
    FROM departments d
    WHERE e.department_id = d.dept_id
    AND d.department_name = 'HR'
);
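
As a sketch of why NOT EXISTS is often the safer choice, the example below looks for departments without employees in two ways. It uses `sqlite3` as a stand-in database; the sample rows, including an employee whose `department_id` is NULL, are invented to trigger the difference.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE departments (dept_id INTEGER PRIMARY KEY, department_name TEXT);
    CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, department_id INTEGER);
    INSERT INTO departments VALUES (1, 'HR'), (2, 'Sales'), (3, 'Legal');
    -- Dev has no department yet, so the NOT IN subquery below yields a NULL.
    INSERT INTO employees (name, department_id)
    VALUES ('Asha', 1), ('Ben', 2), ('Dev', NULL);
""")

# Goal: departments without any employees (only Legal qualifies).
with_not_in = conn.execute("""
    SELECT department_name FROM departments
    WHERE dept_id NOT IN (SELECT department_id FROM employees)
""").fetchall()

with_not_exists = conn.execute("""
    SELECT d.department_name FROM departments d
    WHERE NOT EXISTS (
        SELECT 1 FROM employees e WHERE e.department_id = d.dept_id
    )
""").fetchall()

print(with_not_in, with_not_exists)
conn.close()
```

The NOT IN version silently returns nothing because of the single NULL, while NOT EXISTS only asks whether a matching row exists and is unaffected.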

Which JOINs to choose?

JOIN vs UNION

Analyzing Performance using `EXPLAIN` or `EXPLAIN ANALYZE`

Working with indices

Efficient Postgres: Tips & tricks

Indexing
- Use appropriate indexes (B-tree, GIN, GiST, etc.) for filtering and sorting.
- Avoid over-indexing to reduce storage and maintenance overhead.
Query Design
- Always specify needed columns (SELECT col1, col2), not SELECT *.
- Use WHERE clauses to filter data early.
Joins and Subqueries
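- Let the planner choose the join order (it reorders joins on its own for up to join_collapse_limit tables, 8 by default); help it with selective filters and up-to-date statistics.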
- Replace correlated subqueries with joins if possible.
Query Analysis
- Use EXPLAIN and EXPLAIN ANALYZE to inspect execution plans.
- Identify and address slow sequential scans.

Sorting and Grouping
- Index columns frequently used for sorting or grouping.
Maintenance
- Regularly run VACUUM and ANALYZE to update statistics.
- Keep autovacuum enabled (it is on by default) and tune its thresholds for large, write-heavy tables.
Partitioning and Parallelism
- Partition large tables to reduce scanned data.
- Enable parallel queries for large datasets.
Avoid Redundancy
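- Avoid applying functions to indexed columns in WHERE clauses (e.g., date(timestamp_col)), since a plain index on the column cannot be used; compare the raw column against a range, or create an expression index.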
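
The effect of wrapping an indexed column in a function can be seen in a query plan. This sketch uses SQLite's `EXPLAIN QUERY PLAN` as a stand-in for PostgreSQL's `EXPLAIN`; the output format differs between the two, but the underlying behavior is the same. The `events` table and index name are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT, created_at TEXT);
    CREATE INDEX idx_events_created_at ON events (created_at);
""")


def plan(sql):
    """Return SQLite's query-plan text for a statement, joined into one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


# Wrapping the column in a function hides it from the plain index.
wrapped = plan("SELECT * FROM events WHERE date(created_at) = '2024-01-01'")

# Comparing the raw column against a half-open range lets the index be used.
ranged = plan(
    "SELECT * FROM events "
    "WHERE created_at >= '2024-01-01' AND created_at < '2024-01-02'"
)

print(wrapped)
print(ranged)
conn.close()
```

The first plan is a full scan of the table; the second searches using `idx_events_created_at`.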

Materialized Views
- Use materialized views for expensive, repetitive queries.
Connection Pooling
- Use a pooler like PgBouncer to manage database connections efficiently.
Performance Monitoring
- Install pg_stat_statements to track query performance.

Working with JSONB

JSONB vs JSON

JSONB vs hstore

Using GIN Indexes

Storage of JSONB (& other large objects)

Typically, when a row exceeds TOAST_TUPLE_THRESHOLD (2 kB by default), PostgreSQL first tries to compress its large column values so that the row fits within 2 kB.

If that doesn’t work, the data is moved to out-of-line storage. This is what they call “TOASTing” the data.

When the data is fetched, the reverse process of “deTOASTing” must happen.

Partitioning

Partitioning divides a table into smaller pieces (partitions), which can improve performance for queries and management of large datasets.

In PostgreSQL, there are three built-in partitioning methods:

range,
list, and
hash

SQLAlchemy allows you to interact with partitioned tables just like any other table; PostgreSQL automatically routes each row to the appropriate partition based on the value of the partition key (for example, a region column).

Windowing Functions

Window functions perform calculations across a set of rows related to the current row, without collapsing those rows into one as GROUP BY does.

Typical uses: running totals, rankings, or moving averages.
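
A short sketch of a running total and a ranking computed with window functions. The SQL is standard and should read the same in PostgreSQL; it is run here through `sqlite3` (which needs SQLite 3.25 or newer for window functions) so that no server is required. The `sales` table and its rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount INTEGER);
    INSERT INTO sales (region, amount)
    VALUES ('north', 100), ('north', 300), ('south', 200), ('south', 50);
""")

rows = conn.execute("""
    SELECT
        region,
        amount,
        SUM(amount) OVER (PARTITION BY region ORDER BY id) AS running_total,
        RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rank_in_region
    FROM sales
    ORDER BY id
""").fetchall()

for row in rows:
    print(row)
conn.close()
```

Every input row appears in the output, each annotated with values computed over its own partition.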

CAP theorem

The CAP theorem describes the limitations and trade-offs in a distributed system

It suggests that distributed computer systems can only deliver two out of the following three guarantees:

Consistency: Every node sees the same data even when concurrent updates occur

Availability: Every request to a non-failing node receives a response, though that response is not guaranteed to reflect the most recent write

Partition tolerance: The system will keep operating even if there is a network partition in communication between different nodes

In the case of a network partition, the CAP theorem forces a trade-off between Consistency and Availability.

A system must either:

Maintain consistency, but sacrifice availability (not all requests are responded to).
Maintain availability, but sacrifice consistency (some responses may be outdated).

Replication vs Sharding vs Clustering

Replication

Replication is the process of copying data from one PostgreSQL server (primary) to one or more other servers (replicas).

Replication is useful for:

Load balancing for read-heavy applications (replicas handle read operations).
High availability (replicas can take over in case the primary server fails).

Setting up Replication

Sharding

Sharding is the process of dividing data across multiple databases or servers (shards), based on a partitioning key.

Handles very large datasets and enables horizontal scaling.

Sharding is useful for:

Applications with high write or data-volume demands.
Multi-tenant systems, where each tenant's data is isolated to specific shards.
Scenarios requiring geographically distributed data storage.
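
Routing by partitioning key can be sketched in plain Python. This is a simple hash-modulo scheme shown for illustration only; the shard names, tenant ids, and the `shard_for` helper are hypothetical, and real deployments would map shards to connection settings.

```python
import hashlib

# Hypothetical shard names; in practice these would map to connection strings.
SHARDS = ["shard_0", "shard_1", "shard_2", "shard_3"]


def shard_for(key, shards=SHARDS):
    """Pick a shard by hashing the partitioning key (e.g., a tenant id)."""
    # Use a stable hash: Python's built-in hash() is randomized per process
    # for strings, so it would route the same key differently after a restart.
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]


tenants = ["acme", "globex", "initech", "umbrella"]
placement = {tenant: shard_for(tenant) for tenant in tenants}
print(placement)
```

One trade-off of modulo hashing is that changing the number of shards remaps most keys; schemes such as consistent hashing or a lookup table are common ways to reduce that movement.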

Setting up Sharding

Related: Foreign Data Wrapper

Clustering

Clustering involves configuring multiple PostgreSQL servers to work together as a single system.

Clustering is useful for:

Scaling both read and write operations.
Distributed databases for geographically dispersed applications.
Applications requiring parallel query execution.

Postgres natively does not support clustering!

CitusData for clustering

Choosing the Right Approach

Use Replication for read-heavy workloads and high availability with simpler management.
Use Clustering for more complex scaling and parallel query execution requirements.
Use Sharding for extremely large datasets or applications needing distributed data storage.

Locking

Implicit Locking

In PostgreSQL, implicit locks are automatically acquired by the database system to maintain the integrity of the data and manage concurrent transactions.

These locks are generally not directly managed by the user but are handled internally by PostgreSQL as part of its transaction management system.

Implicit locks help ensure consistency and prevent conflicts in multi-user environments.

Types of Implicit Locks

Row-Level Locks
- Acquired automatically by UPDATE and DELETE on the rows they modify; SELECT ... FOR UPDATE and SELECT ... FOR SHARE request the same kind of lock explicitly.
Table-Level Locks
- Acquired by every statement that touches a table: INSERT, UPDATE, and DELETE take a ROW EXCLUSIVE lock, while most forms of ALTER TABLE take an ACCESS EXCLUSIVE lock that blocks all other access.
Transaction-Level Locks
- Used in higher isolation levels like SERIALIZABLE, where predicate locks track reads to detect anomalies.
Index Locks
- Automatically applied during index creation or modifications to ensure no conflicting operations occur.

MVCC (Multi-Version Concurrency Control)
- Implicitly handles row versions during updates or deletes to maintain consistency without blocking reads.
Foreign Key Constraints
- Implicit locks are used to ensure referential integrity when modifying or deleting rows involved in foreign key relationships.
Maintenance Operations
- Commands like VACUUM or CLUSTER acquire locks to perform maintenance tasks while avoiding conflicts.

Explicit Locking

Operations

Backup & Restore

pg_dump
pg_restore

Code https://github.com/AgarwalConsulting/SQLforPythonDevelopersTraining

Slides https://sql-for-python-developers.slides.AgarwalConsulting.com
