diff --git a/README.qmd b/README.qmd
index f0c6f14..7de61af 100644
--- a/README.qmd
+++ b/README.qmd
@@ -65,7 +65,7 @@ devtools::load_all()
-{{< include vignettes/_include/setup-data-directory.qmd >}}
+{{< include inst/vignette-include/setup-data-directory.qmd >}}
 # Using the package
diff --git a/vignettes/_include/setup-data-directory.qmd b/inst/vignette-include/setup-data-directory.qmd
similarity index 100%
rename from vignettes/_include/setup-data-directory.qmd
rename to inst/vignette-include/setup-data-directory.qmd
diff --git a/vignettes/convert.qmd b/vignettes/convert.qmd
index 45cfeb7..8d0ae2a 100644
--- a/vignettes/convert.qmd
+++ b/vignettes/convert.qmd
@@ -19,7 +19,7 @@ execute:
 **TL;DR (too long, didn't read): For analysing more than 1 week of data, use `spod_convert()` to convert the data into `DuckDB` and `spod_connect()` to connect to it for analysis using `{dplyr}`. Skip to the [section about it](#duckdb).**
-The main focus of this vignette is to show how to get long periods of origin-destination data for analysis. First, we describe and compare the two ways to get the mobility data using origin-destination data as an example. The package functions and overall approaches are the same for working with other types of data available through the package, such as the number of trips, overnight stays and any other data. Then we show how to get a few days of origin-destination data with `spod_get()`. Finally, we show how to download and convert multiple weeks, months or even years of origin-destination data into analysis-ready formats. See description of datasets in the [Codebook and cookbook for v1 (2020-2021) Spanish mobility data](v1-2020-2021-mitma-data-codebook.qmd) and in the [Codebook and cookbook for v2 (2022 onwards) Spanish mobility data](v2-2022-onwards-mitma-data-codebook.qmd).
+The main focus of this vignette is to show how to get long periods of origin-destination data for analysis. First, we describe and compare the two ways to get the mobility data using origin-destination data as an example. The package functions and overall approaches are the same for working with other types of data available through the package, such as the number of trips, overnight stays and any other data. Then we show how to get a few days of origin-destination data with `spod_get()`. Finally, we show how to download and convert multiple weeks, months or even years of origin-destination data into analysis-ready formats. See description of datasets in the [Codebook and cookbook for v1 (2020-2021) Spanish mobility data](v1-2020-2021-mitma-data-codebook.html) and in the [Codebook and cookbook for v2 (2022 onwards) Spanish mobility data](v2-2022-onwards-mitma-data-codebook.html).
 # Two ways to get the data
@@ -36,7 +36,7 @@ There are two main ways to import the datasets:
 The mobility datasets available through `{spanishiddata}` are very large. Particularly the origin-destination data, which contains millions of rows. These data sets may not fit into the memory of your computer, especially if you plan to run the analysis over multiple days, weeks, months, or even years.
-To work with these datasets, we highly recommend using `DuckDB` and `Parquet`. These are systems for efficiently processing larger-than-memory datasets, while being user-firendly by presenting the data in a familiar `data.frame`/`tibble` object (almost). For a great intoroduction to both, we recommend materials by Danielle Navarro, Jonathan Keane, and Stephanie Hazlitt: [website](https://arrow-user2022.netlify.app/){target="_blank"}, [slides](https://arrow-user2022.netlify.app/slides){target="_blank"}, and [the video tutorial](https://www.youtube.com/watch?v=YZMuFavEgA4){target="_blank"}. You can also find examples of aggregating origin-destination data for flows analysis and visualisation in our vignettes on [static](vignettes/flowmaps-static.qmd) and interactive (TODO: add link) flows visualisation.
+To work with these datasets, we highly recommend using `DuckDB` and `Parquet`. These are systems for efficiently processing larger-than-memory datasets, while being user-firendly by presenting the data in a familiar `data.frame`/`tibble` object (almost). For a great intoroduction to both, we recommend materials by Danielle Navarro, Jonathan Keane, and Stephanie Hazlitt: [website](https://arrow-user2022.netlify.app/){target="_blank"}, [slides](https://arrow-user2022.netlify.app/slides){target="_blank"}, and [the video tutorial](https://www.youtube.com/watch?v=YZMuFavEgA4){target="_blank"}. You can also find examples of aggregating origin-destination data for flows analysis and visualisation in our vignettes on [static](flowmaps-static.html) and interactive (TODO: add link) flows visualisation.
 Learning to use `DuckDB` and `Parquet` is easy for anyone who have ever worked with `{dplyr}` functions such as `select()`, `filter()`, `mutate()`, `group_by()`, `summarise()`, etc. However, since there is some learning curve to master these new tools, we provide some helper functions for novices to get started and easily open the datasets from `DuckDB` and `Parquet`. Please read the relevant sections below, where we first show how to convert the data, and then how to use it.
@@ -92,11 +92,11 @@ Make sure you have loaded the package:
 library(spanishoddata)
 ```
-{{< include _include/setup-data-directory.qmd >}}
+{{< include ../inst/vignette-include/setup-data-directory.qmd >}}
 # Getting a single day with `spod_get()` {#spod-get}
-As you might have seen in the codebooks for [v1](v1-2020-2021-mitma-data-codebook.qmd) and [v2](v2-2022-onwards-mitma-data-codebook.qmd) data, you can get a single day's worth of data as an in-memory object with `spod_get()`:
+As you might have seen in the codebooks for [v1](v1-2020-2021-mitma-data-codebook.html) and [v2](v2-2022-onwards-mitma-data-codebook.html) data, you can get a single day's worth of data as an in-memory object with `spod_get()`:
 ```{r}
 dates <- c("2024-03-01")
@@ -282,7 +282,7 @@ Due to mobile network outages, some dates are missing, so do not assume that a s
 ## Download all data
-Here the example is for origin-destination on district level for v1 data. You can change the `type` to "number_of_trips" and the `zones` to "municipalities" for v1 data. For v2 data, just use `dates` starting with 2022-01-01 or the `dates_v2` from above. Use all other function arguments for v2 in the same way as shown for v1, but also consult the [v2 data codebook](v2-2022-onwards-mitma-data-codebook.qmd), as it has many more datasets in addition to "origin-destination" and "number_of_trips".
+Here the example is for origin-destination on district level for v1 data. You can change the `type` to "number_of_trips" and the `zones` to "municipalities" for v1 data. For v2 data, just use `dates` starting with 2022-01-01 or the `dates_v2` from above. Use all other function arguments for v2 in the same way as shown for v1, but also consult the [v2 data codebook](v2-2022-onwards-mitma-data-codebook.html), as it has many more datasets in addition to "origin-destination" and "number_of_trips".
 ```{r}
 type <- "origin-destination"
@@ -310,7 +310,7 @@ analysis_data_storage <- spod_convert_data(
 )
 ```
-This will convert all downloaded data to `DuckDB` format for lightning fast analysis. You can change the `save_format` to `parquet` if you want to save the data in `Parquet` format. For comparison overview of the two formats please see the [Converting the data to DuckDB/Parquet for faster analysis](converting-the-data-for-faster-analysis.qmd).
+This will convert all downloaded data to `DuckDB` format for lightning fast analysis. You can change the `save_format` to `parquet` if you want to save the data in `Parquet` format. For comparison overview of the two formats please see the [Converting the data to DuckDB/Parquet for faster analysis](converting-the-data-for-faster-analysis.html).
 By default, `spod_convert_data()` will save the converted data in the `SPANISH_OD_DATA_DIR` directory. You can change the `save_path` argument of `spod_convert_data()` if you want to save the data in a different location.
diff --git a/vignettes/flowmaps-static.qmd b/vignettes/flowmaps-static.qmd
index 0365ad7..1d47304 100644
--- a/vignettes/flowmaps-static.qmd
+++ b/vignettes/flowmaps-static.qmd
@@ -33,7 +33,7 @@ library(tidyverse)
 library(sf)
 ```
-{{< include _include/setup-data-directory.qmd >}}
+{{< include ../inst/vignette-include/setup-data-directory.qmd >}}
 # Simple example - plot flows data as it is {#simple-example}
@@ -66,7 +66,7 @@ head(od_20210407)
 ### Zones
-We also get the district zones polygons to match the flows. We use version 1 for the polygons, because the selected date is in 2021, which corresponds to the v1 data (see the relevant [codebook](v1-2020-2021-mitma-data-codebook.qmd)).
+We also get the district zones polygons to match the flows. We use version 1 for the polygons, because the selected date is in 2021, which corresponds to the v1 data (see the relevant [codebook](v1-2020-2021-mitma-data-codebook.html)).
 ```{r}
 districts_v1 <- spod_get_zones("dist", ver = 1)
@@ -394,7 +394,7 @@ Let us get the origin-destination flows between `districts` for a typical workin
 od <- spod_get("od", zones = "distr", dates = "2022-04-06")
 ```
-Also get the spatial data for the zones. We are using the version 2 of zones, because the data we got was for 2022 and onwards, which corresponds to the v2 data (see the relevant [codebook](v2-2022-onwards-mitma-data-codebook.qmd)).
+Also get the spatial data for the zones. We are using the version 2 of zones, because the data we got was for 2022 and onwards, which corresponds to the v2 data (see the relevant [codebook](v2-2022-onwards-mitma-data-codebook.html)).
 ```{r}
 districts <- spod_get_zones("distr", ver = 2)
diff --git a/vignettes/v1-2020-2021-mitma-data-codebook.qmd b/vignettes/v1-2020-2021-mitma-data-codebook.qmd
index 5808e48..abec12a 100644
--- a/vignettes/v1-2020-2021-mitma-data-codebook.qmd
+++ b/vignettes/v1-2020-2021-mitma-data-codebook.qmd
@@ -50,21 +50,22 @@ remotes::install_github("Robinlovelace/spanishoddata", force = TRUE, dependencies = TRUE)
 ```
+Load the package:
 Load the package:
 ```{r}
 library(spanishoddata)
 ```
-Using the instructions below, set the data folder for the package to download the files into. You may need up to 30 GB to download all data and another 30 GB if you would like to convert the downloaded data into analysis ready format (a `DuckDB` database file, or a folder of `parquet` files). You can find more info on this conversion in the [Download and convert OD datasets](convert.qmd) vignette.
+Using the instructions below, set the data folder for the package to download the files into. You may need up to 30 GB to download all data and another 30 GB if you would like to convert the downloaded data into analysis ready format (a `DuckDB` database file, or a folder of `parquet` files). You can find more info on this conversion in the [Download and convert OD datasets](convert.html) vignette.
-{{< include _include/setup-data-directory.qmd >}}
+{{< include ../inst/vignette-include/setup-data-directory.qmd >}}
 # Overall approach to working with data
 If you only want to analyse the data for a few days, you can use the `spod_get()` function. It will download the raw data in CSV format and let you analyse it in-memory. This is what we cover in the steps on this page.
-If you need longer periods (several months or years), you should use the `spod_convert()` and `spod_connect()` functions, which will convert the data into special format which is much faster for analysis, for this see the [Download and convert OD datasets](convert.qmd) vignette. `spod_get_zones()` will give you spatial data with zones that can be matched with the origin-destination flows from the functions above using zones 'id's. Please see a simple example below, and also consult the vignettes with detailed data description and instructions in the package vignettes with `spod_codebook(ver = 1)` and `spod_codebook(ver = 2)`, or simply visit the package website at [https://robinlovelace.github.io/spanishoddata/](https://robinlovelace.github.io/spanishoddata/). The @fig-overall-flow presents the overall approach to accessing the data in the `spanishoddata` package.
+If you need longer periods (several months or years), you should use the `spod_convert()` and `spod_connect()` functions, which will convert the data into special format which is much faster for analysis, for this see the [Download and convert OD datasets](convert.html) vignette. `spod_get_zones()` will give you spatial data with zones that can be matched with the origin-destination flows from the functions above using zones 'id's. Please see a simple example below, and also consult the vignettes with detailed data description and instructions in the package vignettes with `spod_codebook(ver = 1)` and `spod_codebook(ver = 2)`, or simply visit the package website at [https://robinlovelace.github.io/spanishoddata/](https://robinlovelace.github.io/spanishoddata/). The @fig-overall-flow presents the overall approach to accessing the data in the `spanishoddata` package.
 ![The overview of how to use the pacakge functions to get the data](media/package-functions-overview.svg){#fig-overall-flow width="78%"}
@@ -94,8 +95,8 @@ Data structure:
 | `census_districts` | A list of census district identifiers as classified by the Spanish Statistical Office (INE) that are spatially bound within polygons with `id` above. |
 | `municipalities_mitma` | A list of municipality identifiers as assigned by the data provider in municipality zones spatial dataset that correspond to a given district `id` . |
 | `municipalities` | A list of municipality identifiers as classified by the Spanish Statistical Office (INE) that correspond to polygons with `id` above. |
-| `district_names_in_v2` | A list of names of district polygons defined in the [v2 version of this data](v2-2022-onwards-mitma-data-codebook.qmd) that covers the year 2022 and onwards that correspond to polygons with `id` above. |
-| `district_ids_in_v2` | A list of identifiers of district polygons defined in the [v2 version of this data](v2-2022-onwards-mitma-data-codebook.qmd) that covers the year 2022 and onwards that correspond to polygons with `id` above. |
+| `district_names_in_v2` | A list of names of district polygons defined in the [v2 version of this data](v2-2022-onwards-mitma-data-codebook.html) that covers the year 2022 and onwards that correspond to polygons with `id` above. |
+| `district_ids_in_v2` | A list of identifiers of district polygons defined in the [v2 version of this data](v2-2022-onwards-mitma-data-codebook.html) that covers the year 2022 and onwards that correspond to polygons with `id` above. |
 ## 1.2 `Municipalities` {#municipalities}
@@ -117,8 +118,8 @@ Data structure:
 | `municipalities` | A list of municipality identifiers as classified by the Spanish Statistical Office (INE) that correspond to polygons with `id` above. |
 | `districts_mitma` | A list of district identifiers as assigned by the data provider in districts zones spatial dataset that correspond to a given municipality `id` . |
 | `census_districts` | A list of census district identifiers as classified by the Spanish Statistical Office (INE) that are spatially bound within polygons with `id` above. |
-| `municipality_names_in_v2` | A list of names of municipality polygons defined in the [v2 version of this data](v2-2022-onwards-mitma-data-codebook.qmd) that covers the year 2022 and onwards that correspond to polygons with `id` above. |
-| `municipality_ids_in_v2` | A list of identifiers of municipality polygons defined in the [v2 version of this data](v2-2022-onwards-mitma-data-codebook.qmd) that covers the year 2022 and onwards that correspond to polygons with `id` above. |
+| `municipality_names_in_v2` | A list of names of municipality polygons defined in the [v2 version of this data](v2-2022-onwards-mitma-data-codebook.html) that covers the year 2022 and onwards that correspond to polygons with `id` above. |
+| `municipality_ids_in_v2` | A list of identifiers of municipality polygons defined in the [v2 version of this data](v2-2022-onwards-mitma-data-codebook.html) that covers the year 2022 and onwards that correspond to polygons with `id` above. |
 The spatial data you get via `spanishoddata` package is downloaded directly from the source, the geometries of polygons are automatically fixed if there are any invalid geometries. The zone identifiers are stored in `id` column. Apart from that `id` column, the original zones files do not have any metadata. However, as seen above, using the `spanishoddata` package you get many additional columns that provide a semantic connection between official statistical zones used by the Spanish government and the zones you can get for the v2 data (for 2022 onward).
@@ -248,4 +249,4 @@ tpp_dist_tbl <- tpp_dist |> dplyr::collect()
 # Advanced use
-For more advanced use, especially for analysing longer periods (months or even years), please see [Download and convert OD datasets](convert.qmd).
+For more advanced use, especially for analysing longer periods (months or even years), please see [Download and convert mobility datasets](convert.html).
diff --git a/vignettes/v2-2022-onwards-mitma-data-codebook.qmd b/vignettes/v2-2022-onwards-mitma-data-codebook.qmd
index 15d7fac..f251d96 100644
--- a/vignettes/v2-2022-onwards-mitma-data-codebook.qmd
+++ b/vignettes/v2-2022-onwards-mitma-data-codebook.qmd
@@ -21,6 +21,6 @@ spanishoddata::spod_codebook(ver = 2)
 ```
-Work-in-progress. Coming soon, please come back in a few weeks. For now, please consult the official documentation [@mitma-mobility-2024-v6]. Some data is already available in the same way as [v1 data](v1-2020-2021-mitma-data-codebook.qmd), just use dates starting with 2022-01-01.
+Work-in-progress. Coming soon, please come back in a few weeks. For now, please consult the official documentation [@mitma-mobility-2024-v6]. Some data is already available in the same way as [v1 data](v1-2020-2021-mitma-data-codebook.html), just use dates starting with 2022-01-01.
-{{< include _include/setup-data-directory.qmd >}}
+{{< include ../inst/vignette-include/setup-data-directory.qmd >}}
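
Most of the hunks above apply one pattern: relative Markdown links to sibling vignettes change from `.qmd` to `.html`, while `{{< include ... >}}` shortcodes keep pointing at `.qmd` files. The following is a minimal base R sketch of that rewrite, useful for checking whether any link was missed. The helper name `rewrite_vignette_links()` is hypothetical and not part of `{spanishoddata}`, and the regular expression is an assumption about how the links are written (inline `[text](target.qmd)` form only).

```r
# Hypothetical helper (not part of {spanishoddata}): rewrite relative Markdown
# links that end in ".qmd" so that they end in ".html" instead.
rewrite_vignette_links <- function(text) {
  # "](" followed by a target that does not start with a URL scheme such as
  # "https://", then ".qmd" directly before ")" or a "#anchor".
  pattern <- "(\\]\\((?![a-zA-Z][a-zA-Z0-9+.-]*://)[^)#\\s]+)\\.qmd(?=[)#])"
  gsub(pattern, "\\1.html", text, perl = TRUE)
}

# Relative link: rewritten. Absolute URL: left untouched.
rewrite_vignette_links(
  "see [codebook](v1-2020-2021-mitma-data-codebook.qmd) and [site](https://example.com/page.qmd)"
)

# Quarto include shortcodes are not Markdown links, so they are left untouched.
rewrite_vignette_links("{{< include _include/setup-data-directory.qmd >}}")
```

The sketch only swaps the extension. It does not adjust path prefixes, so a change such as `vignettes/flowmaps-static.qmd` to `flowmaps-static.html` in the hunk at line 36 of `convert.qmd` would still need a manual edit.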