[GSoC Project Proposal]: NOAA trawl survey database #74

Open
kellijohnson-NOAA opened this issue Feb 10, 2025 · 7 comments
Labels
GSoC25 project idea Designates a proposed project idea

Comments

@kellijohnson-NOAA

kellijohnson-NOAA commented Feb 10, 2025

Project Description

The primary objective of this project is to create an accessible international database of transboundary marine survey data across the Northeast Pacific Ocean. Initial work for this project cleaned and joined haul-level data from several surveys operating along the west coast of North America, spanning two countries, into a single data frame (3.3 million observations of 55 species). Extending this work into a database, rather than a data frame, would (1) allow for joining more data, such as life-history information like the age of fish in the haul, and (2) allow for a larger number of species to be included (because of file-size constraints, we currently provide data from only 55 of more than 1,000 species). Most of the data are publicly available in independent regional databases, but no international database exists and the independent databases are not standardized, which significantly inhibits the use of the data for research and in assessments of the status of marine resources. This international database would help strengthen our understanding of climate-driven shifts in groundfish distribution in the North Pacific Ocean through data sharing between Fisheries and Oceans Canada (DFO) and NOAA Fisheries (NMFS). It also has the potential to improve the assessments and management of those species, serving as a proof of concept and foundation for a proposed North America-wide effort to join survey data.

Expected Outcomes

At a minimum, we expect that the data already compiled into the joined data frame would be enhanced by moving them to a queryable database. Beyond that, data that require the relational structure of a database, e.g., age- and length-composition data, could be added as time allows. Including such data would make the database a one-stop shop for survey data, which would drastically reduce the time needed to compile the data for both research and management-related tasks.
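To illustrate why composition data call for a relational structure, here is a minimal sketch rather than a proposed design. Every table and column name (`haul`, `catch`, `specimen`, `age_years`, and so on) is hypothetical, the rows are made-up placeholders, and SQLite (via Python's standard library) stands in for whichever database system is eventually chosen.

```python
import sqlite3

# Hypothetical schema: all table and column names are illustrative only.
# One haul has many catch rows (one per species) and many specimen rows
# (one per measured fish), which a single flat data frame cannot hold
# without duplicating haul-level information.
SCHEMA = """
CREATE TABLE haul (
    haul_id   INTEGER PRIMARY KEY,
    survey    TEXT NOT NULL,
    year      INTEGER NOT NULL,
    latitude  REAL,
    longitude REAL
);
CREATE TABLE catch (
    haul_id   INTEGER NOT NULL REFERENCES haul(haul_id),
    species   TEXT NOT NULL,
    weight_kg REAL NOT NULL,
    PRIMARY KEY (haul_id, species)
);
CREATE TABLE specimen (
    specimen_id INTEGER PRIMARY KEY,
    haul_id     INTEGER NOT NULL REFERENCES haul(haul_id),
    species     TEXT NOT NULL,
    length_cm   REAL,
    age_years   INTEGER
);
"""


def build_demo_db():
    """Create an in-memory database with a few placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO haul VALUES (?, ?, ?, ?, ?)",
        [(1, "survey_a", 2019, 46.2, -124.5),
         (2, "survey_b", 2019, 50.1, -128.3)],
    )
    conn.executemany(
        "INSERT INTO catch VALUES (?, ?, ?)",
        [(1, "sablefish", 12.5), (2, "sablefish", 8.0)],
    )
    conn.executemany(
        "INSERT INTO specimen VALUES (?, ?, ?, ?, ?)",
        [(1, 1, "sablefish", 55.0, 6),
         (2, 1, "sablefish", 61.0, 9),
         (3, 2, "sablefish", 48.0, 4)],
    )
    return conn


def mean_age_by_survey(conn, species):
    """Join specimen-level ages back to haul-level survey metadata."""
    return conn.execute(
        """
        SELECT h.survey, AVG(s.age_years)
        FROM specimen AS s
        JOIN haul AS h USING (haul_id)
        WHERE s.species = ?
        GROUP BY h.survey
        ORDER BY h.survey
        """,
        (species,),
    ).fetchall()
```

Calling `mean_age_by_survey(build_demo_db(), "sablefish")` returns one row per survey. The point is only that age and length records attach to hauls through a key instead of being repeated on every catch row.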

Skills Required

SQL, R

Additional Background/Issues

Mentor(s)

Eric Ward (@ericward-noaa), Sean Anderson (@seananderson), Kelli Johnson (@kellijohnson-NOAA), Derek Bolser (@dgbolser)

Expected Project Size

175 hours

Project Difficulty

Intermediate

@kellijohnson-NOAA kellijohnson-NOAA added GSoC25 project idea Designates a proposed project idea labels Feb 10, 2025
@kellijohnson-NOAA kellijohnson-NOAA changed the title from "[GSoC Project Proposal]:" to "[GSoC Project Proposal]: NOAA trawl survey database" Feb 10, 2025
@w-nityammm

I'm interested in working on this. Are there any preferred db systems? I'm thinking of using Postgres, since the dataset is obviously huge and going to be pretty query-heavy.

@kellijohnson-NOAA
Author

Thank you @w-nityammm for your interest. To the best of my knowledge we do not have a preferred db system, but we can ask around to see if something is preferred within NOAA and get back to you. @dgbolser might have more information when he returns to the office this week.

@dgbolser

Yes, thanks for your interest @w-nityammm! Postgres seems like a good option and I don't see any obvious incompatibilities with other databases we maintain. I am checking with the database manager here at HQ to see if there's a preferred system.

@dgbolser

@w-nityammm Oracle has been our default, but I confirmed that there won't be issues if we go with Postgres. Our folks see the advantages and support going in that direction if it is best for the project.

@w-nityammm

Alright sounds good. Will start looking into it :) . Thank you for the response! @dgbolser

@7yl4r
Contributor

7yl4r commented Feb 24, 2025

Is it possible to use OBIS as the unified database? Data would feed into OBIS and then be queried back out, similar to using SQL. ROBIS is an R library for fetching OBIS data.

@ericward-noaa

Interesting thought @7yl4r -- I'd defer to someone who knows more about OBIS, but my initial reaction is that we might be constrained by size. OBIS claims to have 136,000,000 records. I don't know exactly how many records we'll be dealing with -- but at least an order of magnitude more than the 3.3 million aggregate records we already have (more if we include more than 55 species). I can do some initial summaries of samples / species to get an idea.

In an ideal world, we'd be able to serve up data for any species, including some of the corals, sponges, and other invertebrates. There's a lot of 0s for rarer species, and we wouldn't need to store those -- but it would be useful to include data on individual samples.
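One way to skip storing the zeros, sketched under assumptions: keep only positive catches and rebuild the zeros at query time with a `LEFT JOIN` from the haul table. The `haul` and `catch` tables, their columns, and the `rare_coral` rows are hypothetical placeholders, and SQLite again stands in for the eventual database system.

```python
import sqlite3


def catch_with_zeros(conn, species):
    """Return (haul_id, weight_kg) for every haul; hauls with no stored
    catch row for the species come back with a weight of 0.0."""
    return conn.execute(
        """
        SELECT h.haul_id, COALESCE(c.weight_kg, 0.0)
        FROM haul AS h
        LEFT JOIN catch AS c
          ON c.haul_id = h.haul_id AND c.species = ?
        ORDER BY h.haul_id
        """,
        (species,),
    ).fetchall()


# Placeholder data: three hauls, but only one stored (non-zero) catch row.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE haul (haul_id INTEGER PRIMARY KEY);
    CREATE TABLE catch (
        haul_id   INTEGER REFERENCES haul(haul_id),
        species   TEXT,
        weight_kg REAL
    );
    INSERT INTO haul VALUES (1), (2), (3);
    INSERT INTO catch VALUES (1, 'rare_coral', 0.4);
    """
)
print(catch_with_zeros(conn, "rare_coral"))
```

Here the single stored row expands to three results, one per haul, so absence records cost no storage. The species filter sits in the `ON` clause, not in `WHERE`, because a `WHERE` filter would drop the unmatched hauls again.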


5 participants