This repository holds notebooks used on the Data Platform. The notebooks in this repository are preloaded into notebook instances hosted by AWS SageMaker.
Please follow our playbook guide for using AWS SageMaker on the Data Platform.
If you already have make and aws-vault installed and set up with credentials for hackney-dataplatform-staging,
you can follow option 1; otherwise use option 2. Option 1 has the advantage that you don't have to update your credentials each day as they get rotated.
To install and set up aws-vault, follow the instructions in step 3 of the setup section in the project README.
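As a rough illustration only (the project README's setup section is the source of truth), aws-vault reads profile definitions from the standard `~/.aws/config` file, so the staging profile would look something like the following; the region shown is an assumption:

```
[profile hackney-dataplatform-staging]
region = eu-west-2
```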
Depending on which Glue version you wish to use, there are 3 different container setups. To use version 2, run:

```
make run-notebook-v2
```

You can replace v2 with v1 or v3 to use a different version.
- Navigate to the Hackney SSO, click on the account you want to use, then click "Command line or programmatic access".
- Copy the AWS access key, AWS secret access key and AWS session token into the file `/aws-config/credentials`.
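The file uses the standard AWS shared-credentials format. The `[default]` profile name here is an assumption; check the container setup if it expects a different profile:

```
[default]
aws_access_key_id = <your access key>
aws_secret_access_key = <your secret access key>
aws_session_token = <your session token>
```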
- Depending on which Glue version you wish to use, there are 3 different container setups. To use version 2, run:

```
docker compose up notebook-v2
```

You can replace v2 with v1 or v3 to use a different version.
- Navigate to http://localhost:8888
- Within the notebook, open `test-s3-connection`.
- Change the `s3_url` variable to an S3 bucket that exists in the AWS account you are using.
- Click "Run".
- You should get some data back from the S3 bucket and no errors. A minimal sketch of this kind of check follows this list.
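For reference, this is a minimal sketch of the kind of check the notebook performs, assuming a PySpark session and CSV objects at the path; the bucket name is a hypothetical placeholder and the notebook itself is the source of truth:

```python
# Minimal sketch of an S3 connectivity check from a PySpark notebook.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("test-s3-connection").getOrCreate()

# Hypothetical placeholder: point this at a bucket that exists in the
# AWS account you are signed in to.
s3_url = "s3://your-bucket-name/some-prefix/"

df = spark.read.csv(s3_url, header=True)  # assumes CSV data at the path
df.show(10)  # rows printing without errors confirms the connection
```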
- Follow the instructions for "Run your container" above.
- Run `make thrift-server` and wait for the command to finish; you should get a message similar to the one below.

```
starting org.apache.spark.sql.hive.thriftserver.HiveThriftServer2, logging to /home/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/logs/spark--org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-aa00aa00aa00.out
```
- Wait around 10 seconds for the server to finish starting
- Run `make spark-sql` and wait for a SQL prompt to appear:

```
0: jdbc:hive2://localhost:10000/default>
```
- Test that your console is working by copying and pasting the following SQL:

```sql
SELECT lpi_key, uprn, longitude, latitude, import_year, import_month, import_day, import_date
FROM `dataplatform-stg-raw-zone-unrestricted-address-api`.`unrestricted_address_api_dbo_hackney_address`
LIMIT 10;
```
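If you would rather query the Thrift server from Python than from the SQL console, a HiveServer2 client such as PyHive can connect to the same endpoint. This is a sketch under the assumption that PyHive is installed (`pip install 'pyhive[hive]'`); it is not something this repository ships:

```python
# Sketch: query the local Spark Thrift server over the HiveServer2
# protocol with PyHive (an assumed extra dependency).
from pyhive import hive

# Same host and port as the jdbc:hive2://localhost:10000 prompt above.
conn = hive.connect(host="localhost", port=10000)
cursor = conn.cursor()
cursor.execute("SHOW DATABASES")  # a cheap query to confirm the connection
for row in cursor.fetchall():
    print(row)
```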