-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update habitatmap_stdized and habitatmap_terr #50
Conversation
update habmap_correction table add delete_dsn = TRUE as argument to st_write
add renv::restore() tot setup-chunck
remove DBI since we will use sf::st_write to write table to gpkg
provide overview table of applied code corrections use sf:st_write to write table to gpkg
remove DBI as we will write table to gpkg using sf::st_write
use n2khab::read_habitatmap() to read the original habitatmap and correct variable names (for example 'polygon_id' instead of 'TAG') use sf:st_write to write table to gpkg
add renv::restore() in setup-chunck
…om the reference small typo's
and used English ;-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Toon,
It's a clear improvement.
I encountered one issue. After updating (10_generate_habmap_stdized) I assume this result would be the same as the version on Zenodo, but the hashes didn't match.
I did not upload the result to Zenodo yet. However I did update the hashes in the data source overview table. So you should compare the hashes of the result you obtain with those in the overview table. |
OK I did. The hashes of my std-version (\n2khab_data\20_processed\habitatmap_stdized\habitatmap_stdized.gpkg = endresult of 01_) are different from the google-sheet. And more this morning's version (md5: 8bea3bfb829bf58012b1f2af6cde5278) is different from the afternoon's (md5: 289d583a574bbef7b437161bb88c6fb9). Strange |
I think that a GPKG file, regardless of its contents, will each time have a different checksum after a rewrite (not 100% sure, but seems confirmed by @w-jan). Maybe has to do with timestamping, with random numbers, ... Possibly you can find reasons in https://www.sqlite.org/fileformat.html and in https://www.ogc.org/standards/geopackage. However you could compare the resulting R objects after reading them in (layer by layer, or with the a <- "Hello there"
digest::digest(a, "md5")
#> [1] "dc7bc0712052ce0b42d178a287f0e516"
digest::digest(a, "sha256")
#> [1] "572dc06abe78ea83c682287c4dbf36df9da83fc9745161069a8a7990155750ad" Created on 2021-02-10 by the reprex package (v1.0.0) Session infosessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.0.3 (2020-10-10)
#> os Linux Mint 20
#> system x86_64, linux-gnu
#> ui X11
#> language nl_BE:nl
#> collate nl_BE.UTF-8
#> ctype nl_BE.UTF-8
#> tz Europe/Brussels
#> date 2021-02-10
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date lib source
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.2)
#> cli 2.3.0 2021-01-31 [1] CRAN (R 4.0.3)
#> digest 0.6.27 2020-10-24 [1] CRAN (R 4.0.3)
#> evaluate 0.14 2019-05-28 [1] CRAN (R 4.0.2)
#> fs 1.5.0 2020-07-31 [1] CRAN (R 4.0.2)
#> glue 1.4.2 2020-08-27 [1] CRAN (R 4.0.2)
#> highr 0.8 2019-03-20 [1] CRAN (R 4.0.2)
#> htmltools 0.5.1.1 2021-01-22 [1] CRAN (R 4.0.3)
#> knitr 1.31 2021-01-27 [1] CRAN (R 4.0.3)
#> magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.0.3)
#> reprex 1.0.0 2021-01-27 [1] CRAN (R 4.0.3)
#> rlang 0.4.10 2020-12-30 [1] CRAN (R 4.0.3)
#> rmarkdown 2.6 2020-12-14 [1] CRAN (R 4.0.3)
#> rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.0.3)
#> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.2)
#> stringi 1.5.3 2020-09-09 [1] CRAN (R 4.0.2)
#> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.0.2)
#> withr 2.4.1 2021-01-26 [1] CRAN (R 4.0.3)
#> xfun 0.20 2021-01-06 [1] CRAN (R 4.0.3)
#> yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.2)
#>
#> [1] /home/floris/lib/R/library
#> [2] /usr/local/lib/R/site-library
#> [3] /usr/lib/R/site-library
#> [4] /usr/lib/R/library In order to actually implement this for verifying integrity of a reproduced result - which is not necessarily needed to insert here IMHO (one can always do such comparisons oneself post-hoc, based on the official Zenodo files) - some pitfalls may need to be addressed. E.g. different results on 32- vs 64-bit systems; I believe this will not apply to file checksums with |
@ToonHub Can you include the rendered bookdown html files as attachment-links in your comment? (You'll need to zip them first). The html files will make it a bit easier to review (they include chunk output). |
src/generate_habitatmap_terr/20_generate_habmap_terr_interpreted.Rmd
Outdated
Show resolved
Hide resolved
See below the rendered boodwon files |
In inbo/n2khab#112 more background and ideas are given to support choosing the preferred checksum algorithm, as it will become an important feature of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for your efforts @ToonHub ! Nice work. 👏
First part of my review: habitatmap_stdized
I mainly looked at the html, did not re-run code (I believe in you). I did run a few basic checks on the data source (see further).
Only one comment which may impact the resulting contents (*):
-
Furthermore the sum of the phab values within the polygon is higher than 100. Therefore we will distribute teh phab value for uncetain types equally over the two possible types (in this case phab = 20 for type 3150 and phab = 20 for type rbbmr).
These lines are about the example copied below from your html:
In reality HAB2 may be either 3150 or rbbmr, and perhaps a mix of both. Hence deliberately partitioning the value of 40 as 20+20 favours the latter case, in an attempt to maintain a total of 100%.
- However should we not be more conservative and relax the condition of total 100% in this case? I.e. better assign 100 to 3150 and 40 to rbbmr? I presume that handling types in an individual way is the main usecase, and then there's no point in decreasing values (I guess).
- Regarding the meaning of
certain = TRUE
vsFALSE
in this case: considering 3150 we are only 'uncertain' about the phab value, but not about the type's presence (assuming ahabitatmap
without errors), and the latter is the meaning ofcertain
IIUC. (It would match the metadata, saying 'For some polygons the vegetation type is uncertain, ...'.) So I thinkcertain
must beTRUE
here, and not becomeFALSE
for 3150 (only for rbbmr).
(*) if for some reason things will stay as-is, then the below comments can still be handled in the source, but then there's no actual need to re-run and overwrite the result.
Can you update:
-
supposed to be run from within the src/process_habitatmap subfolder. You can use the process_habitatmap.Rproj RStudio project file
99_sessioninfo.Rmd
to the version undersrc/generate_watercourse_100mseg/99_sessioninfo.Rmd
(on current master). It adds the versions of geospatial libraries. [This will hold forhabitatmap_terr
as well.]- the document title to 'Generating habitatmap_stdized'. I believe the title 'processing habitatmap' comes from times where we had not yet decided on all workflows.
- the name of the Rproj file to
generate_habitatmap_stdized.Rproj
On the technical side:
-
renv::restore()
will work with justrenv.lock
, however for interactive usage & collaboration IMO it's best to add all (not git-ignored) files (to imitate your RStudio behaviour as closely as possible). Then, if anyone else updates packages in the lockfile it will occur in the same way. E.g. some behaviour can be changed permanently (with arenv
function) insettings.dcf
. For another recent data source these were the files committed, I suggest to add the ones missing:$ git log --name-status 30faabf^..30faabf commit 30faabfd9518dbed8402d95f791ee25ddad760fb Author: florisvdh <[email protected]> Date: Mon Nov 30 15:05:15 2020 +0100 Generate watercourse_100m...: use renv A src/generate_watercourse_100mseg/.Rprofile A src/generate_watercourse_100mseg/renv.lock A src/generate_watercourse_100mseg/renv/.gitignore A src/generate_watercourse_100mseg/renv/activate.R A src/generate_watercourse_100mseg/renv/settings.dcf
-
I advise to add a chapter 'checks on the datasource' after writing the datasource to disk (i.e. reading the new file back in and perform some obvious checks). You could do that in a second Rmd file if you like (bookdown will concatenate anyway). Regarding
habitatmap_stdized
you could already integrate the checks present in this file (it's on thehms_fv
branch now, as an update of the original file onmaster
) and more generally include some ideas from these code lines (cf. rendered file chapter 3).
To be clear, there's no problem with keeping both md5 and sha256 checksums in the code, for now, even while there's no real point in using md5 when also using sha256 (inbo/n2khab#112).
|
||
## Processing of the attribute table | ||
|
||
Every polygon in the habitat map can consist of maximum 5 different vegetation types. This information is stored in the columns 'HAB1', 'HAB2',..., 'HAB5' of the attribute table. The estimated fraction of each vegetation type within the polygons is stored in the columns 'PHAB1', 'PHAB2', ..., 'PHAB5'. | ||
|
||
We will convert the attribute table to a long format, so that every row contains one vegetation type. | ||
|
||
In some cases, the vegetation type is uncertain, and 2 or 3 possible vegetation types are provided, separated with a ','. We will split the different possible vegetation types and create one row for each of them. An additional variable 'certain' will be FALSE if the orginal habitatmap code consists of 2 or 3 possible vegetation types, and TRUE if only one vegetation type is provided. | ||
|
||
|
||
```{r select_polygons} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just as a note - feel free to make such changes. Gathering HAB & PHAB columns can now be done in one step using pivot_longer_spec()
. E.g. first example below https://tidyr.tidyverse.org/articles/pivot.html#by-hand-1.
src/generate_habitatmap_stdized/habmap_correction/habmap_correction.tsv
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First part of my review:
habitatmap_stdized
Forgot to check the coordinate reference system of the habitatmap_polygons
layer, and this will need a rewrite of the resulting data source. Checks have now been added in the referred notebook (I updated the '2020' link in previous comment); you best integrate those CRS checks here (checks chapter).
In summary, you should do something like the following when reading habitatmap
, i.e. override the CRS with the crs
argument:
n2khab-preprocessing/src/generate_habitatquarries/10_generate_habitatquarries.Rmd
Lines 33 to 34 in 0e4053c
habitatquarries <- read_sf(file.path(tempdir(), "8310_v2018_RAPHAB2019.shp"), | |
crs = 31370) |
This subject was inspected in more detail only recently (inbo/n2khab#85 (comment)). With the “Belge 1972 / Belgian Lambert 72” CRS defined with different (and technically non-equivalent) parameter values (the ESRI way), the CRS of habitatmap
(and derivates) is technically another CRS (ESRI:102499) than the official EPSG:31370 CRS, even though the numerical differences for applications are negligible. But adopting EPSG standards is done throughout open geospatial tooling (including R). The difference with ESRI:102499 already caused clashes here and there in the past and we can expect this problem will never vanish in near future. It is the reason why EPSG:31370 is forced by the n2khab-reading-functions for files where necessary, e.g. for raw data sources and for several 2018-files such as this one. But since this became clear, we should take care of it in future data sources right away, in the files.
It technically is a fundamental problem and I hope to pursue it further, eventually. It affects many Geopunt data sources, the metadata of which wrongly state they conform to EPSG:31370. [Apart from that there's the discussion on why still using EPSG:31370: it has a Belgian datum. Scientifically it is better to use EPSG:3812 - it applies the European ETRS89 datum hence achieves exact conversion between projected CRSs of neighbouring countries when those countries apply the ETRS89 datum as well. It appears that INSPIRE updates on this have recently be added at Geopunt.]
src/generate_habitatmap_terr/20_generate_habmap_terr_interpreted.Rmd
Outdated
Show resolved
Hide resolved
src/generate_habitatmap_terr/20_generate_habmap_terr_interpreted.Rmd
Outdated
Show resolved
Hide resolved
src/generate_habitatmap_terr/20_generate_habmap_terr_interpreted.Rmd
Outdated
Show resolved
Hide resolved
…r instead of 31xx,rbbmr)
I updated both geopackages.
In the original habitatmap there is no code that corrsponds to 31xx,rbbmr (only 31xx_rbbr). However, if this is the case in future versions, I avoided Below the updated rendered bookdown files: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent job, thank you @ToonHub!
You're right that we cannot successfully use the underscore to separate strings! Your implementation works great IMO.
@cecileherr / @w-jan did you intend to have another look, or shall I merge?
The new GeoPackage versions are the ones in the n2khab-binaire-databronnen
folder on Q-drive. Cf. their md5sums which match the ones in the bookdown output 👍.
@ToonHub: once merged, it should be safe to upload the new versions to Zenodo. Some of the ideas for documentation updates in inbo/n2khab#117 will also be relevant for the Zenodo metadata.
Some polygons in the habitat map contain codes that do not correspond with the standardized list of habitat types. | ||
We correct these codes to make processing of the habitat map more straightforward. | ||
Table \@ref(tab:codeCorrected) shows the corrected codes and the number of polygons for which the correction is applied. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@@ -151,7 +148,7 @@ Therefore `certain` will be set to `TRUE` for both types included in `code`. | |||
habmap_long <- habmap_longHAB %>% | |||
left_join(habmap_longPHAB, by = c("polygon_id", "patch_id")) %>% | |||
mutate(certain = !str_detect(code, ","), | |||
certain = ifelse(code %in% c("3130,rbbmr", "3140,rbbmr", "3150,rbbmr", "3160,rbbmr"), TRUE, certain)) %>% | |||
certain = ifelse(code_orig %in% c("3130_rbbmr", "3140_rbbmr", "3150_rbbmr", "3160_rbbmr"), TRUE, certain)) %>% |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok for me to merge! I don't have any more comments about the latest version (note there will be an extra check soon since I will try to update read_watersurfaces_hab later this week, and I will need habitatmap_stdize).
Thank you for the adaptations & follow up!
OK @cecileherr. |
OK |
I checked the geometry of the new versions of habitatmap_terr and habitatmap_stdized.
|
@cecileherr did the affected polygons get intersected in this test? In that case, we could indeed leave things as-is. Ideally we should have more insight why the operations worked in one case and not in another, in order to make the best decision. |
At least I didn't get any error while making the intersection (but I haven't checked the polygon manually - maybe I should). Actually I suspect that I might have updated the sf package (or another dependency) in the meantime and that it could be the reason why the same kind of faulty polygons do not trigger errors anymore. I will try to check and post the result here |
n2khab-preprocessing/src/generate_watersurfaces_hab/renv.lock Lines 903 to 909 in 0a816d1
|
It should have been the case BUT, unfortunately, I am a little bit messier than that... |
Follow-up of the geometry problems in habitat_map and the processed habitatmaps
I tried an intersection with Flanders: as far as I can see (I only checked manually in leaflet and QGIS for +/- 10 cases), the results of the intersection with the faulty polygons are as expected. Note that the layer resulting from the intersection doesn't contain geometry errors anymore! e.g. for habitatmap_terr:
|
Thanks for checking this. I gave up reproducing - the |
I will leave the decision to you @cecileherr. Up to now it has not been done - so if it appears not to cause trouble, you could leave it as-is. |
Conclusion after a live discussion: we decide NOT to correct the geometries |
Co-authored-by: Cécile Herr <[email protected]>
Thanks for all the work. Merging here. |
I updated habitatmap_stdized and habitatmap_terr based on version habitatmap_2020 of the original habitatmap.
I applied following changes that were suggested by Floris:
I also added an addational aggregation of types within a polygon in case you have a certain and uncertain record of the same type within a polygon. The procedure is explained in 10_generate_habmap_stdized.Rmd.
I added the resulting files to the Q-drive.