
Update habitatmap_stdized and habitatmap_terr #50

Merged
48 commits merged on May 12, 2021

Conversation

@ToonHub (Contributor) commented Feb 9, 2021

I updated habitatmap_stdized and habitatmap_terr based on version habitatmap_2020 of the original habitatmap.

I applied the following changes that were suggested by Floris:

  • make use of renv to manage different versions of packages
  • check the used data source versions by comparing the md5 and sha256 hashes to the hashes provided in data source overview table
  • provide an overview of the applied code corrections (for example 3130_rbbmr --> 3130,rbbmr)
  • use sf::st_write to write the habitatmap_types table to a GPKG

I also added an additional aggregation of types within a polygon, for the case where a polygon has both a certain and an uncertain record of the same type. The procedure is explained in 10_generate_habmap_stdized.Rmd.
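Roughly, the aggregation can be sketched as follows (a minimal illustration with made-up polygon ids, phab values, and a simplified aggregation rule; the actual column names and rules are those documented in 10_generate_habmap_stdized.Rmd):

```r
library(dplyr)
library(tibble)

# Made-up long-format records: polygon p1 has both a certain and an
# uncertain record of type 3150
types <- tribble(
  ~polygon_id, ~type,   ~certain, ~phab,
  "p1",        "3150",  TRUE,     60,
  "p1",        "3150",  FALSE,    20,
  "p1",        "rbbmr", FALSE,    20
)

# One possible aggregation: sum the phab shares per polygon and type,
# and keep certain = TRUE when at least one record was certain
types_aggregated <- types %>%
  group_by(polygon_id, type) %>%
  summarise(phab = sum(phab), certain = any(certain), .groups = "drop")
```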

I added the resulting files to the Q-drive.

update habmap_correction table
add delete_dsn = TRUE as argument to st_write
add renv::restore() to setup chunk
remove DBI since we will use sf::st_write to write table to gpkg
provide overview table of applied code corrections
use sf::st_write to write table to gpkg
remove DBI as we will write table to gpkg using sf::st_write
use n2khab::read_habitatmap() to read the original habitatmap and correct variable names (for example 'polygon_id' instead of 'TAG')
use sf::st_write to write table to gpkg
add renv::restore() in setup chunk
@w-jan (Collaborator) left a comment

Thanks Toon,
It's a clear improvement.

I encountered one issue. After updating (10_generate_habmap_stdized) I assumed the result would be the same as the version on Zenodo, but the hashes didn't match.

@ToonHub (Contributor, author) commented Feb 10, 2021

> I encountered one issue. After updating (10_generate_habmap_stdized) I assume this result would be the same as the version on Zenodo, but the hashes didn't match.

I did not upload the result to Zenodo yet.

However I did update the hashes in the data source overview table. So you should compare the hashes of the result you obtain with those in the overview table.

@w-jan (Collaborator) commented Feb 10, 2021

> I encountered one issue. After updating (10_generate_habmap_stdized) I assume this result would be the same as the version on Zenodo, but the hashes didn't match.

> I did not upload the result to Zenodo yet.
>
> However I did update the hashes in the data source overview table. So you should compare the hashes of the result you obtain with those in the overview table.

OK, I did. The hashes of my std version (\n2khab_data\20_processed\habitatmap_stdized\habitatmap_stdized.gpkg, the end result of 01_) differ from the Google sheet. What's more, this morning's version (md5: 8bea3bfb829bf58012b1f2af6cde5278) differs from this afternoon's (md5: 289d583a574bbef7b437161bb88c6fb9). Strange.

@florisvdh (Member) commented Feb 10, 2021

> So you should compare the hashes of the result you obtain with those in the overview table.

I think that a GPKG file, regardless of its contents, will have a different checksum after each rewrite (not 100% sure, but this seems confirmed by @w-jan). Maybe it has to do with timestamping, with random numbers, ... Possibly you can find reasons in https://www.sqlite.org/fileformat.html and in https://www.ogc.org/standards/geopackage.

However you could compare the resulting R objects after reading them in (layer by layer, or with the read_xxx() function of n2khab if that already works with the new objects). See:

```r
a <- "Hello there"
digest::digest(a, "md5")
#> [1] "dc7bc0712052ce0b42d178a287f0e516"
digest::digest(a, "sha256")
#> [1] "572dc06abe78ea83c682287c4dbf36df9da83fc9745161069a8a7990155750ad"
```

Created on 2021-02-10 by the reprex package (v1.0.0)

Session info
```r
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 4.0.3 (2020-10-10)
#>  os       Linux Mint 20               
#>  system   x86_64, linux-gnu           
#>  ui       X11                         
#>  language nl_BE:nl                    
#>  collate  nl_BE.UTF-8                 
#>  ctype    nl_BE.UTF-8                 
#>  tz       Europe/Brussels             
#>  date     2021-02-10                  
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date       lib source        
#>  assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.0.2)
#>  cli           2.3.0   2021-01-31 [1] CRAN (R 4.0.3)
#>  digest        0.6.27  2020-10-24 [1] CRAN (R 4.0.3)
#>  evaluate      0.14    2019-05-28 [1] CRAN (R 4.0.2)
#>  fs            1.5.0   2020-07-31 [1] CRAN (R 4.0.2)
#>  glue          1.4.2   2020-08-27 [1] CRAN (R 4.0.2)
#>  highr         0.8     2019-03-20 [1] CRAN (R 4.0.2)
#>  htmltools     0.5.1.1 2021-01-22 [1] CRAN (R 4.0.3)
#>  knitr         1.31    2021-01-27 [1] CRAN (R 4.0.3)
#>  magrittr      2.0.1   2020-11-17 [1] CRAN (R 4.0.3)
#>  reprex        1.0.0   2021-01-27 [1] CRAN (R 4.0.3)
#>  rlang         0.4.10  2020-12-30 [1] CRAN (R 4.0.3)
#>  rmarkdown     2.6     2020-12-14 [1] CRAN (R 4.0.3)
#>  rstudioapi    0.13    2020-11-12 [1] CRAN (R 4.0.3)
#>  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 4.0.2)
#>  stringi       1.5.3   2020-09-09 [1] CRAN (R 4.0.2)
#>  stringr       1.4.0   2019-02-10 [1] CRAN (R 4.0.2)
#>  withr         2.4.1   2021-01-26 [1] CRAN (R 4.0.3)
#>  xfun          0.20    2021-01-06 [1] CRAN (R 4.0.3)
#>  yaml          2.2.1   2020-02-01 [1] CRAN (R 4.0.2)
#> 
#> [1] /home/floris/lib/R/library
#> [2] /usr/local/lib/R/site-library
#> [3] /usr/lib/R/site-library
#> [4] /usr/lib/R/library
```

In order to actually implement this for verifying the integrity of a reproduced result - which is not necessarily needed here IMHO (one can always do such comparisons oneself post hoc, based on the official Zenodo files) - some pitfalls may need to be addressed, e.g. different results on 32- vs 64-bit systems; I believe this will not apply to file checksums with the openssl package. See vignette("sha1", package = "digest") (notice the vignette author 😉).
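A sketch of such a layer-by-layer comparison (the function name is made up, and it assumes both files contain the same layer names):

```r
library(sf)
library(digest)

# Compare the in-memory R objects per layer instead of the (unstable)
# file checksums; returns one TRUE/FALSE per layer, TRUE when the
# sha256 hashes of the layer objects match
compare_gpkg_layers <- function(file_a, file_b) {
  layers <- st_layers(file_a)$name
  vapply(layers, function(layer) {
    digest(read_sf(file_a, layer = layer), algo = "sha256") ==
      digest(read_sf(file_b, layer = layer), algo = "sha256")
  }, logical(1))
}
```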

@florisvdh (Member) commented
@ToonHub Can you include the rendered bookdown html files as attachment-links in your comment? (You'll need to zip them first). The html files will make it a bit easier to review (they include chunk output).

@ToonHub (Contributor, author) commented Feb 11, 2021

See below the rendered bookdown files:
generating_habitatmap_stdized.zip
generating_habitatmap_terr.zip

@florisvdh (Member) commented
In inbo/n2khab#112 more background and ideas are given to support choosing the preferred checksum algorithm, as it will become an important feature of n2khab later. Please share your general thoughts / discussion (which are not linked to this specific PR) over there.

@florisvdh (Member) left a comment

Thanks a lot for your efforts @ToonHub ! Nice work. 👏

First part of my review: habitatmap_stdized

I mainly looked at the html, did not re-run code (I believe in you). I did run a few basic checks on the data source (see further).

Only one comment which may impact the resulting contents (*):

  • Furthermore the sum of the phab values within the polygon is higher than 100. Therefore we will distribute the phab value for uncertain types equally over the two possible types (in this case phab = 20 for type 3150 and phab = 20 for type rbbmr).

    These lines are about the example copied below from your html:

    [two screenshots omitted]

    In reality HAB2 may be either 3150 or rbbmr, and perhaps a mix of both. Hence deliberately partitioning the value of 40 as 20+20 favours the latter case, in an attempt to maintain a total of 100%.

    • However should we not be more conservative and relax the condition of a 100% total in this case? I.e. better to assign 100 to 3150 and 40 to rbbmr? I presume that handling types in an individual way is the main use case, and then there's no point in decreasing values (I guess).
    • Regarding the meaning of certain = TRUE vs FALSE in this case: considering 3150 we are only 'uncertain' about the phab value, but not about the type's presence (assuming a habitatmap without errors), and the latter is the meaning of certain IIUC. (It would match the metadata, saying 'For some polygons the vegetation type is uncertain, ...'.) So I think certain must be TRUE here, and not become FALSE for 3150 (only for rbbmr).

(*) if for some reason things will stay as-is, then the below comments can still be handled in the source, but then there's no actual need to re-run and overwrite the result.

Can you update:

  • supposed to be run from within the src/process_habitatmap subfolder. You can use the process_habitatmap.Rproj RStudio project file

  • 99_sessioninfo.Rmd to the version under src/generate_watercourse_100mseg/99_sessioninfo.Rmd (on current master). It adds the versions of geospatial libraries. [This will hold for habitatmap_terr as well.]
  • the document title to 'Generating habitatmap_stdized'. I believe the title 'processing habitatmap' comes from times where we had not yet decided on all workflows.
  • the name of the Rproj file to generate_habitatmap_stdized.Rproj

On the technical side:

  • renv::restore() will work with just renv.lock, however for interactive usage & collaboration IMO it's best to add all (not git-ignored) files (to imitate your RStudio behaviour as closely as possible). Then, if anyone else updates packages in the lockfile, it will occur in the same way. E.g. some behaviour can be changed permanently (with a renv function) in settings.dcf. For another recent data source these were the files committed; I suggest adding the ones missing:

```
$ git log --name-status 30faabf^..30faabf
commit 30faabfd9518dbed8402d95f791ee25ddad760fb
Author: florisvdh <[email protected]>
Date:   Mon Nov 30 15:05:15 2020 +0100

    Generate watercourse_100m...: use renv

A       src/generate_watercourse_100mseg/.Rprofile
A       src/generate_watercourse_100mseg/renv.lock
A       src/generate_watercourse_100mseg/renv/.gitignore
A       src/generate_watercourse_100mseg/renv/activate.R
A       src/generate_watercourse_100mseg/renv/settings.dcf
```
  • I advise adding a chapter 'checks on the data source' after writing the data source to disk (i.e. reading the new file back in and performing some obvious checks). You could do that in a second Rmd file if you like (bookdown will concatenate anyway). Regarding habitatmap_stdized you could already integrate the checks present in this file (it's on the hms_fv branch now, as an update of the original file on master) and more generally include some ideas from these code lines (cf. rendered file chapter 3).

    • Regarding the former file, it is relevant to note that the number of records with phab = 0 has increased from 755 to 1055. (rendered versions: 2018 and 2020).
    • Also it appears that there are about 11700 extra polygons in habitatmap_stdized_2020_v1 compared to habitatmap_stdized_2018_v2.
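As an illustration, such a checks chunk could look like the sketch below; the path follows the folder structure mentioned earlier in this thread, but the layer names and the exact invariants to assert are assumptions to adapt to the real data source.

```r
library(sf)

# Read the freshly written GPKG back in
path <- file.path("n2khab_data", "20_processed",
                  "habitatmap_stdized", "habitatmap_stdized.gpkg")
polygons <- read_sf(path, layer = "habitatmap_polygons")
types    <- read_sf(path, layer = "habitatmap_types")

# A few obvious invariants on the result
stopifnot(
  st_crs(polygons)$epsg == 31370,                 # expected CRS
  all(types$phab >= 0 & types$phab <= 100),       # phab is a percentage
  all(types$polygon_id %in% polygons$polygon_id), # referential integrity
  !any(st_is_empty(polygons))                     # no empty geometries
)
```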

To be clear, there's no problem with keeping both md5 and sha256 checksums in the code, for now, even while there's no real point in using md5 when also using sha256 (inbo/n2khab#112).


## Processing of the attribute table

Every polygon in the habitat map can consist of maximum 5 different vegetation types. This information is stored in the columns 'HAB1', 'HAB2',..., 'HAB5' of the attribute table. The estimated fraction of each vegetation type within the polygons is stored in the columns 'PHAB1', 'PHAB2', ..., 'PHAB5'.

We will convert the attribute table to a long format, so that every row contains one vegetation type.

In some cases, the vegetation type is uncertain, and 2 or 3 possible vegetation types are provided, separated with a ','. We will split the different possible vegetation types and create one row for each of them. An additional variable 'certain' will be FALSE if the original habitatmap code consists of 2 or 3 possible vegetation types, and TRUE if only one vegetation type is provided.


```{r select_polygons}
Inline comment (Member):
Just as a note - feel free to make such changes. Gathering the HAB & PHAB columns can now be done in one step using pivot_longer_spec(). E.g. the first example below https://tidyr.tidyverse.org/articles/pivot.html#by-hand-1.
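For illustration, the same one-step gather via pivot_longer() with a names_pattern (a sketch on a made-up two-slot excerpt of the attribute table; building a spec with build_longer_spec()/pivot_longer_spec() is equivalent):

```r
library(dplyr)
library(tidyr)

# Made-up wide attribute table with two HAB/PHAB slots
habmap <- tibble(
  polygon_id = c("p1", "p2"),
  HAB1 = c("3150", "rbbmr"), HAB2 = c("3130,rbbmr", NA),
  PHAB1 = c(80, 100),        PHAB2 = c(20, NA)
)

# ".value" pairs each HABi with its PHABi by the trailing patch number
habmap_long <- habmap %>%
  pivot_longer(
    cols = -polygon_id,
    names_to = c(".value", "patch_id"),
    names_pattern = "([A-Z]+)([1-5])"
  ) %>%
  filter(!is.na(HAB))
```

This yields one row per polygon and patch, with HAB and PHAB as plain columns.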

@florisvdh (Member) left a comment

First part of my review: habitatmap_stdized

I forgot to check the coordinate reference system of the habitatmap_polygons layer, and this will need a rewrite of the resulting data source. Checks have now been added in the referred notebook (I updated the '2020' link in the previous comment); you best integrate those CRS checks here (checks chapter).

In summary, you should do something like the following when reading habitatmap, i.e. override the CRS with the crs argument:

```r
habitatquarries <- read_sf(file.path(tempdir(), "8310_v2018_RAPHAB2019.shp"),
                           crs = 31370)
```

This subject was inspected in more detail only recently (inbo/n2khab#85 (comment)). With the “Belge 1972 / Belgian Lambert 72” CRS defined with different (and technically non-equivalent) parameter values (the ESRI way), the CRS of habitatmap (and derivatives) is technically another CRS (ESRI:102499) than the official EPSG:31370 CRS, even though the numerical differences are negligible for applications. But adopting EPSG standards is done throughout open geospatial tooling (including R). The difference with ESRI:102499 has already caused clashes here and there in the past, and we can expect this problem will never vanish in the near future. It is the reason why EPSG:31370 is forced by the n2khab reading functions for files where necessary, e.g. for raw data sources and for several 2018 files such as this one. But now that this has become clear, we should take care of it in future data sources right away, in the files.
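Equivalently, when an object has already been read, the CRS can be overridden afterwards with st_crs<-, which only relabels the CRS and does not reproject the coordinates. A self-contained sketch with a toy geometry standing in for the freshly read habitatmap:

```r
library(sf)

# Toy feature carrying no CRS yet
hm <- st_sf(id = 1, geometry = st_sfc(st_point(c(170000, 170000))))

# Relabel as EPSG:31370 (no reprojection happens)
st_crs(hm) <- 31370
st_crs(hm)$epsg
#> [1] 31370
```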

It technically is a fundamental problem and I hope to pursue it further, eventually. It affects many Geopunt data sources, whose metadata wrongly state that they conform to EPSG:31370. [Apart from that, there's the discussion on why we still use EPSG:31370: it has a Belgian datum. Scientifically it is better to use EPSG:3812 - it applies the European ETRS89 datum and hence achieves exact conversion between projected CRSs of neighbouring countries when those countries apply the ETRS89 datum as well. It appears that INSPIRE updates on this have recently been added at Geopunt.]

@ToonHub (Contributor, author) commented Mar 11, 2021

I updated both geopackages.
The actual differences:

  • 7140_cl is changed to 7140_meso
  • for polygons containing both 31xx and rbbmr, the original code is shown (= 31xx_rbbmr) instead of the modified code (= 31xx,rbbmr)

In the original habitatmap there is no code that corresponds to 31xx,rbbmr (only 31xx_rbbmr). However, should this occur in future versions, I made sure certain is not set to TRUE.

Below the updated rendered bookdown files:
generating_habitatmap_terr.zip
generating_habitatmap_stdized.zip

@florisvdh (Member) left a comment

Excellent job, thank you @ToonHub!

You're right that we cannot successfully use the underscore to separate strings! Your implementation works great IMO.

@cecileherr / @w-jan did you intend to have another look, or shall I merge?

The new GeoPackage versions are the ones in the n2khab-binaire-databronnen folder on Q-drive. Cf. their md5sums which match the ones in the bookdown output 👍.

@ToonHub: once merged, it should be safe to upload the new versions to Zenodo. Some of the ideas for documentation updates in inbo/n2khab#117 will also be relevant for the Zenodo metadata.

Comment on lines +94 to +96
Some polygons in the habitat map contain codes that do not correspond with the standardized list of habitat types.
We correct these codes to make processing of the habitat map more straightforward.
Table \@ref(tab:codeCorrected) shows the corrected codes and the number of polygons for which the correction is applied.
Inline comment (Member):
👍

@@ -151,7 +148,7 @@ Therefore `certain` will be set to `TRUE` for both types included in `code`.
habmap_long <- habmap_longHAB %>%
left_join(habmap_longPHAB, by = c("polygon_id", "patch_id")) %>%
mutate(certain = !str_detect(code, ","),
certain = ifelse(code %in% c("3130,rbbmr", "3140,rbbmr", "3150,rbbmr", "3160,rbbmr"), TRUE, certain)) %>%
certain = ifelse(code_orig %in% c("3130_rbbmr", "3140_rbbmr", "3150_rbbmr", "3160_rbbmr"), TRUE, certain)) %>%
Inline comment (Member):
👍

@cecileherr (Collaborator) left a comment

Ok for me to merge! I don't have any more comments about the latest version. (Note there will be an extra check soon, since I will try to update read_watersurfaces_hab later this week, and I will need habitatmap_stdized.)
Thank you for the adaptations & follow up!

@florisvdh (Member) commented

OK @cecileherr.
@ToonHub perhaps it's simplest to keep this PR open until Cécile has reworked the related n2khab functions / created watersurfaces_hab? I don't expect changes will be needed, but if things would still pop up, changes can be added here. This PR only amends files in the specific subdirs, so there won't be conflicts with other work. In the meantime, the data sources are available on Q.

@ToonHub (Contributor, author) commented Mar 17, 2021
ToonHub commented Mar 17, 2021

OK

@cecileherr (Collaborator) commented Apr 8, 2021

I checked the geometry of the new versions of habitatmap_terr and habitatmap_stdized.
There are a few invalid polygons, but nevertheless I can carry out geometrical operations (intersection, ...) without getting errors.
@florisvdh: do you think we should correct the geometries despite the absence of errors?

```r
# HABITATMAP STDIZED

habst <- read_habitatmap_stdized()
any(na.omit(st_is_valid(habst$habitatmap_polygons)) == FALSE)
#> [1] TRUE
st_is_valid(habst$habitatmap_polygons, reason = TRUE) %>%
  .[. != "Valid Geometry"] %>%
  as_tibble() %>%
  mutate(problem = str_extract(string = value, pattern = "[^\\[]+")) %>%
  count(problem)
#>   problem                     n
#>   <chr>                   <int>
#> 1 Ring Self-intersection     83

# HABITATMAP TERR

habter <- read_habitatmap_terr()
any(na.omit(st_is_valid(habter$habitatmap_terr_polygons)) == FALSE)
#> [1] TRUE
st_is_valid(habter$habitatmap_terr_polygons, reason = TRUE) %>%
  .[. != "Valid Geometry"] %>%
  as_tibble() %>%
  mutate(problem = str_extract(string = value, pattern = "[^\\[]+")) %>%
  count(problem)
#>   problem                     n
#>   <chr>                   <int>
#> 1 Ring Self-intersection     80
```
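For reference, if correction were wanted after all, sf offers st_make_valid(), which repairs ring self-intersections like the ones counted above. A self-contained sketch on a toy bowtie polygon:

```r
library(sf)

# Toy polygon whose ring crosses itself (a "bowtie")
bowtie <- st_sfc(st_polygon(list(rbind(
  c(0, 0), c(1, 1), c(1, 0), c(0, 1), c(0, 0)
))))
st_is_valid(bowtie)
#> [1] FALSE

fixed <- st_make_valid(bowtie)
st_is_valid(fixed)
#> [1] TRUE
```

Applied to the data source, this would be st_make_valid(habst$habitatmap_polygons); whether doing so is worthwhile is the open question here.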

@florisvdh (Member) commented

> nevertheless I can carry out geometrical operations (intersection, ...) without getting errors

@cecileherr did the affected polygons get intersected in this test? In that case, we could indeed leave things as-is. Ideally we should have more insight why the operations worked in one case and not in another, in order to make the best decision.

@cecileherr (Collaborator) commented Apr 12, 2021

> @cecileherr did the affected polygons get intersected in this test?

At least I didn't get any error while making the intersection (but I haven't checked the polygon manually - maybe I should).

Actually I suspect that I might have updated the sf package (or another dependency) in the meantime, and that this could be the reason why the same kind of faulty polygons no longer trigger errors. I will try to check and post the result here.

@florisvdh (Member) commented

watersurfaces_hab, where you encountered the problems IIUC, appears to have been created with current CRAN version of sf:

```json
"sf": {
  "Package": "sf",
  "Version": "0.9-8",
  "Source": "Repository",
  "Repository": "CRAN",
  "Hash": "d82215b501772c9faadb4dcd71395c7d"
},
```

@cecileherr (Collaborator) commented

> watersurfaces_hab, where you encountered the problems IIUC, appears to have been created with current CRAN version of sf:

It should have been the case BUT, unfortunately, I am a little bit messier than that...
In the beginning I was not working with renv (yet) but only with a session info file. Then I needed an update of n2khab (to be able to use your new function for the file hashes), and I think that I also updated other packages to be sure that the final version would be a bit up to date.
I suspect I might have encountered the geometry problems BEFORE the update and before the implementation of renv. So I need to find which versions of the packages I was using before that and check the st_intersection with both versions to see if that is the cause. And I hope it is, because I do not have any other explanation :-)

@cecileherr (Collaborator) commented

Follow-up of the geometry problems in habitat_map and the processed habitatmaps

> @cecileherr did the affected polygons get intersected in this test?

> At least I didn't get any error while making the intersection (but I haven't checked the polygon manually - maybe I should).

I tried an intersection with Flanders: as far as I can see (I only checked manually in leaflet and QGIS for +/- 10 cases), the results of the intersection with the faulty polygons are as expected. Note that the layer resulting from the intersection doesn't contain geometry errors anymore!

e.g. for habitatmap_terr:

```r
sf_vl <- read_admin_areas() %>%
  st_transform(crs = 31370)

int <- st_intersection(sf_vl, habter$habitatmap_terr_polygons)
#> Warning message:
#>   attribute variables are assumed to be spatially constant throughout all geometries

library(leaflet)
check_geom <- st_is_valid(habter$habitatmap_terr_polygons %>%
                            st_transform(crs = 4326),
                          reason = TRUE) %>%
  .[. != "Valid Geometry"] %>%
  as_tibble() %>%
  mutate(lng_lat = str_extract(string = value,
                               pattern = "(?<=\\[).*(?=\\])"),
         lng = str_extract(string = lng_lat,
                           pattern = "^.*\\s") %>% as.double(),
         lat = str_extract(string = lng_lat,
                           pattern = " .*") %>% as.double(),
         problem = str_extract(string = value,
                               pattern = "[^\\[]+")) %>%
  select(-value, -lng_lat)

leaflet() %>%
  setView(lng = check_geom[1, ]$lng, lat = check_geom[1, ]$lat,
          zoom = 9) %>%
  addTiles(group = "OSM (default)") %>%
  addPolygons(data = habter$habitatmap_terr_polygons %>%
                st_transform(crs = 4326),
              group = "habmap terr (blue)") %>%
  addPolygons(data = int %>%
                st_transform(crs = 4326),
              color = "black", group = "intersect hm terr - Vlaanderen (black)") %>%
  addCircleMarkers(data = check_geom,
                   lng = ~lng, lat = ~lat,
                   color = "red") %>%
  addLayersControl(
    overlayGroups = c("habmap terr (blue)", "intersect hm terr - Vlaanderen (black)"),
    options = layersControlOptions(collapsed = FALSE))

# the intersection has no geometry errors anymore!
any(na.omit(st_is_valid(int)) == FALSE)
#> [1] FALSE
```

@florisvdh (Member) commented

Thanks for checking this. I gave up reproducing - the st_intersection() line takes an age 🙄. Note: you shouldn't have needed %>% st_transform(crs = 31370) - any reason why you did that?

@florisvdh (Member) commented

> do you think we should correct the geometries despite the absence of errors

I will leave the decision to you @cecileherr. Up to now it has not been done - so if it appears not to cause trouble, you could leave it as-is.

@cecileherr (Collaborator) commented

> do you think we should correct the geometries despite the absence of errors

Conclusion after a live discussion: we decided NOT to correct the geometries.

@florisvdh
Copy link
Member

Thanks for all the work. Merging here.
