Merge pull request #29 from dfe-analytical-services/Rstudio_databricks_video_add

add link to ADA_Rstudio_connect GitHub repo
chfoster authored Dec 18, 2023
2 parents 9b8df43 + 0dc426d commit fcc3e77
Showing 1 changed file with 23 additions and 12 deletions.
35 changes: 23 additions & 12 deletions ADA/databricks_rstudio.qmd
@@ -65,29 +65,40 @@ You must have:
a. Enter a 'Data Source Name' and 'Description'. Choose a short and sensible data source name and note it down as this is what you will use to connect to Databricks through RStudio. *As you can set up more than one cluster on Databricks, use the description to make clear which cluster this connection is for. The description shown below describes that this connection is using an 8 core cluster on Databricks Runtime Environment 13.*
b. Set the remaining options to the settings below.
- Enter the 'Server Hostname' for your cluster in the 'Host(s):' field (you noted this down in step 2).
- In the Port section, remove the default number and use the Port number you noted in step 2.
- Set the Authentication Mechanism to 'User Name and Password'.
- Enter the word 'token' into the 'User Name:' field, then enter your 'Databricks access token' in the 'Password:' field.
- Change the Thrift Transport option to HTTP.
- Click the 'HTTP Options...' button and enter the 'HTTP Path' of your Databricks cluster, then click 'OK'.
- Click the 'SSL Options...' button and tick the 'Enable SSL' box, then click the 'OK' button.
- ![](../images/odbc-driver-settings.png)
c. Click the 'Test' button to verify that the connection works. You should see the following message. *If you get an error here, repeat steps 5.e.i -- 5.e.ix and ensure all the values are correct.*

![](../images/databricks-test-connection.png)

d. Click the 'OK' button to exist the 'Test Results' window, then the 'OK' button in the 'Simba Spark ODBC Driver DSN Setup' window.
d. Click the 'OK' button to exit the 'Test Results' window, then the 'OK' button in the 'Simba Spark ODBC Driver DSN Setup' window.

5. Connect through RStudio. Now that the connection between our laptop and Databricks works, we can use it to query data stored in Databricks from RStudio.
5. Connect through RStudio. Watch the video below and view the [ADA_RStudio_connect GitHub repo](https://github.com/dfe-analytical-services/ADA_RStudio_connect) for methods of connecting to Databricks and querying data from RStudio.

i) Open RStudio, install the 'RODBC' package and load the 'RODBC' library.

ii) Create a connection variable by passing the 'Data Source Name' you gave to your ODBC connection to the `odbcConnect()` function: `conn <- odbcConnect("DatabricksSQL Warehouse")`
------------------------------------------------------------------------

# Pulling data into RStudio from Databricks

------------------------------------------------------------------------

Once you have set up an ODBC connection as detailed above, you can use that connection to pull data directly from Databricks into RStudio. Charlotte recorded a video demonstrating two possible methods for doing this. The recording is embedded below:

<div align="center">
<iframe src="https://educationgovuk.sharepoint.com/sites/lvewp00086/_layouts/15/embed.aspx?UniqueId=5fad039e-a763-40c8-8ea0-403aea712f4c&embed=%7B%22ust%22%3Atrue%2C%22hv%22%3A%22CopyEmbedCode%22%7D&referrer=StreamWebApp&referrerScenario=EmbedDialog.Create" width="640" height="360" frameborder="0" scrolling="no" allowfullscreen title="ADA_Rstudio_2.mp4"></iframe>
</div>

A template of all of the code used in the above video can be found in the [ADA_RStudio_connect GitHub repo](https://github.com/dfe-analytical-services/ADA_RStudio_connect).

6. Query Data. You can now use your connection variable to send SQL queries to Databricks.
Key takeaways from the video and example code:

i) Use the `sqlQuery()` function with the connection variable to see what data catalogues are available on Databricks. *Databricks uses the American spelling of 'catalog'.* ![](../images/databricks-catalogs-rstudio.png)
ii) Use the `sqlQuery(conn, "USE CATALOG preprod_catalog_p02")` command to tell Databricks which catalogue to use. *Ensure that you have the correct permissions for the catalogue first. If you don't, you will need to contact the ADA team.*
iii) You can now send SQL queries as you would with any other database. To see information about all the tables you have access to, you can query the INFORMATION_SCHEMA using the `"SELECT * FROM INFORMATION_SCHEMA.TABLES"` query like so:
* The main change here compared to connecting to SQL databases is the connection method. The installation and setup of the ODBC driver are all done before you write any code, and the only part of the code that will need updating is your connection (usually your `con` variable).
* If your existing code was pulling in tables from SQL via the `RODBC` package or the `dbplyr` package, then this code should, in theory, run with minimal edits.
* If you were writing tables back into SQL from R, this is where your code may need the most edits.
* If your code is stored in a repo where multiple analysts contribute to and run the code, each analyst will need to install the ODBC driver individually and **give it the same name**, so that the name used in the `con` variable matches everyone's driver and the code runs for everyone. If this is the case, please **add a note about this to your repo's readme file** to help your future colleagues.
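
One way to make the shared-name requirement explicit is to keep the data source name in a single place and fail with a clear message when it is missing. The sketch below is only an illustration, not code from the video or the template repo: the name `"Databricks"` and the helper `connect_databricks()` are hypothetical, and it assumes the `RODBC` package is installed.

```r
library(RODBC)

# Hypothetical shared data source name: every analyst running this code
# would need an ODBC data source with exactly this name on their machine.
databricks_dsn <- "Databricks"

# Hypothetical helper that opens the connection and stops with a readable
# message if the data source cannot be reached.
connect_databricks <- function(dsn = databricks_dsn) {
  con <- odbcConnect(dsn)
  # odbcConnect() returns an object of class "RODBC" on success and -1
  # (with a warning) on failure.
  if (!inherits(con, "RODBC")) {
    stop(
      "Could not connect to data source '", dsn, "'. ",
      "Check that your ODBC data source uses this exact name."
    )
  }
  con
}

# Example usage (needs a configured data source, so left commented out):
# con <- connect_databricks()
```

Keeping the name in one variable means that only one line changes if the team agrees on a different data source name.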

```
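# Assumes `con` is an open RODBC connection created with odbcConnect()
tables <- sqlQuery(con, "SELECT * FROM INFORMATION_SCHEMA.TABLES")
# Open the result in the RStudio data viewer
View(tables)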
```
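
The query above can be combined with the catalogue selection into one small function. This is a minimal sketch under stated assumptions rather than the method shown in the video: `list_catalog_tables()` is a hypothetical helper, the data source name in the usage comments is a placeholder, and `SHOW CATALOGS` is assumed here as the query for listing catalogues.

```r
library(RODBC)

# Hypothetical helper: switch to a catalogue, then list the tables visible in it.
# `con` is an open RODBC connection; `catalog` is a catalogue you have access to.
list_catalog_tables <- function(con, catalog) {
  # Tell Databricks which catalogue the following queries should use
  sqlQuery(con, paste("USE CATALOG", catalog))
  tables <- sqlQuery(con, "SELECT * FROM INFORMATION_SCHEMA.TABLES")
  # With the default errors = TRUE, sqlQuery() returns a character vector of
  # error messages instead of a data frame when the query fails.
  if (!is.data.frame(tables)) {
    stop("Query failed: ", paste(tables, collapse = " "))
  }
  tables
}

# Example usage (needs a configured data source, so left commented out):
# con <- odbcConnect("your-data-source-name")   # placeholder name
# catalogs <- sqlQuery(con, "SHOW CATALOGS")    # assumed listing query
# tables <- list_catalog_tables(con, "preprod_catalog_p02")
# odbcClose(con)
```

Checking that the result is a data frame surfaces permission or connection problems immediately instead of passing an error message on to later steps.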
