Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Missing diet values - Data base update? #281

Open
smorsbach opened this issue Aug 1, 2024 · 4 comments
Open

Missing diet values - Data base update? #281

smorsbach opened this issue Aug 1, 2024 · 4 comments

Comments

@smorsbach
Copy link

Session Info
> sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.5

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Berlin
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] rglobi_0.3.4    rfishbase_4.1.1 lubridate_1.9.3 forcats_1.0.0   stringr_1.5.1   dplyr_1.1.4     purrr_1.0.2     readr_2.1.5     tidyr_1.3.1     tibble_3.2.1    ggplot2_3.5.1   tidyverse_2.0.0

loaded via a namespace (and not attached):
 [1] utf8_1.2.4        generics_0.1.3    bitops_1.0-8      stringi_1.8.4     hms_1.1.3         digest_0.6.36     magrittr_2.0.3    evaluate_0.24.0   grid_4.4.1        timechange_0.3.0  contentid_0.0.18  fastmap_1.2.0     jsonlite_1.8.8    progress_1.2.3   
[15] DBI_1.2.3         httr_1.4.7        fansi_1.0.6       scales_1.3.0      duckdb_1.0.0-2    cli_3.6.3         rlang_1.1.4       crayon_1.5.3      dbplyr_2.5.0      bit64_4.0.5       munsell_0.5.1     cachem_1.1.0      remotes_2.5.0     withr_3.0.1      
[29] yaml_2.3.8        parallel_4.4.1    tools_4.4.1       tzdb_0.4.0        memoise_2.0.1     colorspace_2.1-0  curl_5.2.1        vctrs_0.6.5       R6_2.5.1          lifecycle_1.0.4   bit_4.0.5         fs_1.6.4          vroom_1.6.5       pkgconfig_2.0.3  
[43] pillar_1.9.0      gtable_0.3.5      glue_1.7.0        xfun_0.45         tidyselect_1.2.1  rstudioapi_0.16.0 knitr_1.47        htmltools_0.5.8.1 rmarkdown_2.27    compiler_4.4.1    prettyunits_1.2.0 askpass_1.2.0     openssl_2.2.0     RCurl_1.98-1.16

Hello!

I am right now trying to retrieve diet information tables for dozens of species from SeaLifeBase. However, many of these don't contain data that can be found online. I am particularly interested in the PercentEmpty column, which is returned when running the diet() command. Please find an example below; this is one of the cases where the mentioned column is empty, despite containing data in the online version.

Code Example
> diet(species_list = "Larus argentatus", server = "sealifebase") %>% select(Species, DietRefNo, PercentEmpty)
Joining with `by = join_by(SpecCode)`
# A tibble: 7 × 3
  Species          DietRefNo PercentEmpty
  <chr>                <int>        <dbl>
1 Larus argentatus      4276           NA
2 Larus argentatus      4276           NA
3 Larus argentatus      4276           NA
4 Larus argentatus      6390           NA
5 Larus argentatus      6515           NA
6 Larus argentatus      6570           NA
7 Larus argentatus      6587           NA

The latest release is from May 2023, could it have something to do with that? And if yes, are there any updates planned in the near future?
Or is this command simply faulty when running it on SeaLifeBase? However, when running > fb_tbl("diet", server="sealifebase"), there are entries with values in that column.

Any help would be much appreciated.
Thanks in advance!

@cboettig
Copy link
Member

cboettig commented Aug 1, 2024

yup, working on preparing the new release. should be up soon

@cboettig
Copy link
Member

cboettig commented Aug 1, 2024

Package plumbing still in the works, but can you try:

diet = arrow::read_parquet("https://fishbase.ropensci.org/sealifebase/diet.parquet")

This directly accesses our API and should provide the current data (July 2024 release) for the sealifebase diet table.

@smorsbach
Copy link
Author

Thanks for the quick answer!

I noticed something else though and I think this issue has been raised in the past (#87), namely the combination of information from the diet() and the diet_items(). Is there a way to automatically include the more detailed diet composition data from diet_items() in the diet table? It took me a long time to figure out how to get from one to the other before realizing that you can simply join them via the DietCode column and I think it would be really helpful to just combine the two into one. No information is lost and it makes the structure of the dozens of tables a bit more tidy. For anyone encountering the same issue, I added my code below.

Table Join
# pull all diet items from SeaLifeBase
diet_items <- fb_tbl(tbl = "diet_items", server = "sealifebase")

# get diet for single or list of species; here harbour porpoise
diet_species <- diet(species_list = "Phocoena phocoena", server = "sealifebase")

# join tables and select columns
diet_species_detail <- diet_species %>% 
  left_join(diet_items, by = "DietCode") %>% 
  select(Species, SampleStage, DietCode, C_Code, Locality, January, February, March, April, May, June, July, August, September, October, November, December, E_Code, FoodI, FoodII, FoodIII, ItemName, Stage, DietPercent)

I think in issue #87 an attempt was made to do that, but right now that leads to the diet table displaying wrong or incomplete values in the PercentEmpty and OtherItems columns. I tried to include an example for Atlantic cod below and hope that this clarifies what I mean. Maybe I also misunderstood what these two columns should contain. Also, the diet() table per se does not contain any taxonomic column on the actual food item. In the online version, there is a column called "Main Food" which is missing in the R version. That emphasizes the benefit of merging diet() and diet_items() into one.

Example
# pull all diet items from FishBase
diet_items <- diet_items()

# get diet for cod
diet_species <- diet(species_list = "Gadus morhua", server = "fishbase")

# join tables
diet_species_detail <- diet_species %>% 
  left_join(diet_items, by = "DietCode") %>% 
  select(Species, SampleStage, DietCode, FoodI, FoodII, FoodIII, ItemName, Stage, DietPercent, OtherItems, PercentEmpty) %>% 
  filter(DietCode == 14)

# PercentEmpty and OtherItems contain low values which can neither be found
# in the detailed diet information nor in the current table version online

The code chunk above brings me to my last point, namely the diet_items() command. As for now, there is no option to specify whether the data should be downloaded from FishBase or SeaLifeBase, it automatically just uses FishBase. It is possible to get the diet_items() from SeaLifeBase by using fb_tbl("diet_items", "sealifebase"), but maybe add the server option to diet_items() as well?

Really appreciate you work and thanks in advance!

@cboettig
Copy link
Member

cboettig commented Aug 6, 2024

Thanks for sharing your experience and examples, I am sure this will be helpful to other users as well.

At the moment I recommend using the fb_tbl function for everything, e.g. fb_tbl(tbl = "diet_items", server = "sealifebase"). This function covers all ~200+ tables from either fishbase or sealifebase explicitly, and avoids the potential confusion of setting the 'server' (which is a deprecated syntax that dates back to an entirely different function architecture under the hood over a decade ago, when we could access far fewer tables and needed a very different mechanism to do so).

I agree that it takes a while to get used to the idea of joining separate tables. But this approach that Hadley Wickham calls "Tidy Data", also known as "relational database design", where everything is just a table and all operations can be expressed by functions like joins, selects, and filters, is in fact incredibly powerful and incredibly general once you get used to it. Knowing that everything is a just a table and can be joined to other tables with key columns is in the end much more useful than relying on helper utilities like our old diet() function that could do some joins and filters internally. As you see, the dplyr package has made these operations relatively easy and intuitive, and the internet (and now the chatbots) are all well-acquainted with advice on how to use dplyr. It didn't exist when we started rfishbase, but now I highly recommend using precisely the patterns you show above, and just using fb_tbl() as the only rfishbase function to access any table. HTH.

(rfishbase still needs to be pointed to the 2024 data, new release still coming soon!)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants