This repository holds notebooks used on the Data Platform. The notebooks in this repository are preloaded into notebook instances hosted by AWS SageMaker.
Please follow our playbook guide for using AWS SageMaker on the Data Platform.
If you already have make and aws-vault installed and set up with credentials for hackney-dataplatform-staging,
you can follow option 1; otherwise use option 2. Option 1 has the advantage that you don't have to update your credentials each day as they get rotated.
To install and set up aws-vault, follow the instructions in step 3 of the setup section in the project README.
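As a rough illustration only (the project README's setup section is the source of truth), aws-vault reads profile definitions from the standard `~/.aws/config` file, so the staging profile would look something like the following; the region shown is an assumption:

```
[profile hackney-dataplatform-staging]
region = eu-west-2
```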
Depending on which Glue version you wish to use, there are 3 different container setups. To use version 2, run:

```
make run-notebook-v2
```

You can replace v2 with v1 or v3 to use a different version.
- Navigate to the Hackney SSO, click on the account you want to use, then click "Command line or programmatic access".
- Copy the AWS access key, AWS secret access key and AWS session token into the file `/aws-config/credentials`.
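The file uses the standard AWS shared-credentials format. The `[default]` profile name here is an assumption; check the container setup if it expects a different profile:

```
[default]
aws_access_key_id = <your access key>
aws_secret_access_key = <your secret access key>
aws_session_token = <your session token>
```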
- Depending on which Glue version you wish to use, there are 3 different container setups. To use version 2, run:

```
docker compose up notebook-v2
```

You can replace v2 with v1 or v3 to use a different version.
- Navigate to http://localhost:8888
- Within the notebook, open `test-s3-connection`.
- Change the `s3_url` variable to an S3 bucket that exists in the AWS account you are using.
- Click "Run".
- You should get some data back from the S3 bucket and no errors. A minimal sketch of this kind of check follows this list.
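For reference, this is a minimal sketch of the kind of check the notebook performs, assuming a PySpark session and CSV objects at the path; the bucket name is a hypothetical placeholder and the notebook itself is the source of truth:

```python
# Minimal sketch of an S3 connectivity check from a PySpark notebook.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("test-s3-connection").getOrCreate()

# Hypothetical placeholder: point this at a bucket that exists in the
# AWS account you are signed in to.
s3_url = "s3://your-bucket-name/some-prefix/"

df = spark.read.csv(s3_url, header=True)  # assumes CSV data at the path
df.show(10)  # rows printing without errors confirms the connection
```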
- Follow the instructions for "Run your container" above.
- Run `make thrift-server` and wait for the command to finish; you should get a message similar to the one below.

```
starting org.apache.spark.sql.hive.thriftserver.HiveThriftServer2, logging to /home/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/logs/spark--org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-aa00aa00aa00.out
```
- Wait around 10 seconds for the server to finish starting
- Run `make spark-sql` and wait for a SQL prompt to appear:

```
0: jdbc:hive2://localhost:10000/default>
```
- Test that your console is working by copying and pasting the following SQL:

```sql
SELECT lpi_key, uprn, longitude, latitude, import_year, import_month, import_day, import_date
FROM `dataplatform-stg-raw-zone-unrestricted-address-api`.`unrestricted_address_api_dbo_hackney_address`
LIMIT 10;
```
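If you would rather query the Thrift server from Python than from the SQL console, a HiveServer2 client such as PyHive can connect to the same endpoint. This is a sketch under the assumption that PyHive is installed (`pip install 'pyhive[hive]'`); it is not something this repository ships:

```python
# Sketch: query the local Spark Thrift server over the HiveServer2
# protocol with PyHive (an assumed extra dependency).
from pyhive import hive

# Same host and port as the jdbc:hive2://localhost:10000 prompt above.
conn = hive.connect(host="localhost", port=10000)
cursor = conn.cursor()
cursor.execute("SHOW DATABASES")  # a cheap query to confirm the connection
for row in cursor.fetchall():
    print(row)
```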