Biomarkers transform for ModelAD #148

Merged
merged 41 commits into from
Oct 4, 2024
Commits
f348d68
Added biomarker files and functions to necessary locations, none are …
Sep 18, 2024
2d43820
Added biomarker transform for the Model-AD project. The transform out…
Sep 18, 2024
b19a529
Biomarkers input and output test files
Sep 19, 2024
84355f7
Added tests for biomarkers
Sep 19, 2024
1297757
Ran black formatter
Sep 20, 2024
537605c
Biomarkers test passes when it should
Sep 20, 2024
597bfca
Biomarkers transform working, need to remove custom_transform from yaml
Sep 23, 2024
b796474
Correct use of the custom_transformations parameter in yaml config file
Sep 23, 2024
d6a7d19
Added fake test data made by hand for testing biomarkers transform
Sep 24, 2024
46feee2
Added testing for duplicate data
Sep 25, 2024
4ac1f23
Formatting with black
Sep 25, 2024
bb8cb5d
Addressing PR comment about process_dataset() error message.
Sep 26, 2024
1a09560
Reformatting process.py
Sep 26, 2024
2e4c792
Addressing PR comment about TypeError for biomarkers dataset.
Sep 26, 2024
200f068
Addressing PR comment: Improved docstring and typing for the transfor…
Sep 26, 2024
dd8d422
PR comment: Reverting back to using standard typing hints to prevent …
Sep 26, 2024
3fae0ae
Removed unused import that caused pre-commit to fail.
Sep 26, 2024
f75638b
Removed unnecessary formatting from ADTDataProcessingError message.
Sep 26, 2024
8016df4
Using typing library to add more specific type hints to the transform…
Sep 26, 2024
5e3dfeb
PR comment - using preferred context managed open for converting a li…
Sep 26, 2024
8ee844d
Reverting change to see if it fixes CI: pre-commit fail
Sep 27, 2024
8ca5cc9
Maybe now the CI pre-commit will pass?
Sep 27, 2024
44cbdb2
What about now? Will the CI pre-commit pass now?
Sep 27, 2024
653bede
I think the problem was just formatting
Sep 27, 2024
f1c8e93
Added test for none/NA/nan values and using pd fillna for missing or …
Sep 30, 2024
5e1046b
Added test for missing data
Sep 30, 2024
da793a3
Added a fail test case for missing columns
Sep 30, 2024
d882457
Added passing test for datasets with extra/unknown columns
Sep 30, 2024
88d3f9d
PR comment: removed unnecessary type checking in transform_biomarkers…
Sep 30, 2024
d2b1ba0
Removed biomarkers test with real data, we have lots of test with fak…
Sep 30, 2024
e73210a
Using warning instead of raise exception to allow datasets without a …
Sep 30, 2024
31c5806
Simplifying transform_biomarkers() to make it more readable and maint…
Sep 30, 2024
e663961
Removing unnecessary warning and explicit isinstance(df,DataFrame) in…
Sep 30, 2024
ae0d9ef
Added output type hint to apply_custom_transformations()
Sep 30, 2024
97f8c54
Removing unnecessary comment
Sep 30, 2024
2c36981
Improved docstring for list_to_json()
Sep 30, 2024
50c2f5c
Outputting biomarkers transform as pd.DataFrame, removing unnecessary…
Oct 2, 2024
b464333
Removed unused import statements
Oct 2, 2024
ce25c12
PR comments - removed unused warnings import and added None as part o…
Oct 2, 2024
7848b96
Removing extra typing import, pre-commit passes
Oct 2, 2024
cc782c6
Changing ageDead to age_death in all processing and test files
Oct 3, 2024
17 changes: 17 additions & 0 deletions modelad_test_config.yaml
@@ -0,0 +1,17 @@
destination: &dest syn51498092
staging_path: ./staging
gx_folder: none
gx_table: none
datasets:
  - biomarkers:
      files:
        - name: biomarkers
          id: syn61250724.1
          format: csv
      final_format: json
      provenance:
        - syn61250724.1
      destination: *dest
      custom_transformations: 1
      column_rename:
        agedeath: age_death
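
The `column_rename` entry above implies that the source CSV carries an `agedeath` column, which becomes `age_death` before the transform runs. A minimal sketch of that step with plain pandas follows. It assumes the pipeline applies the mapping with something equivalent to `DataFrame.rename`; the actual agoradatatools helper is not shown in this diff, and the sample row is made up.

```python
import pandas as pd

# Hypothetical raw row that uses the source column name from the config.
raw = pd.DataFrame({"model": ["ModelA"], "agedeath": [1]})

# Mapping copied from the `column_rename` section of the YAML config.
column_rename = {"agedeath": "age_death"}

# Columns not named in the mapping are left untouched.
renamed = raw.rename(columns=column_rename)
print(list(renamed.columns))
```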
2 changes: 2 additions & 0 deletions src/agoradatatools/etl/transform/__init__.py
@@ -16,6 +16,7 @@
 )
 from agoradatatools.etl.transform.team_info import transform_team_info
 from agoradatatools.etl.transform.proteomics import transform_proteomics
+from agoradatatools.etl.transform.biomarkers import transform_biomarkers

 __all__ = [
     "transform_distribution_data",
@@ -28,4 +29,5 @@
     "transform_rnaseq_differential_expression",
     "transform_team_info",
     "transform_proteomics",
+    "transform_biomarkers",
 ]
46 changes: 46 additions & 0 deletions src/agoradatatools/etl/transform/biomarkers.py
@@ -0,0 +1,46 @@
"""
This module contains the transformation logic for the biomarkers dataset.
This is for the Model AD project.
"""

import pandas as pd
from typing import Dict


def transform_biomarkers(datasets: Dict[str, pd.DataFrame]) -> pd.DataFrame:
    """
    Takes a dictionary of dataset DataFrames, extracts the biomarkers
    DataFrame, and transforms it into a DataFrame grouped by
    'model', 'type', 'age_death', 'tissue', and 'units'.

    Args:
        datasets (Dict[str, pd.DataFrame]): Dictionary of dataset names mapped to their DataFrame.

    Returns:
        pd.DataFrame: A DataFrame containing biomarker data modeled after intended final structure.
    """
    biomarkers_dataset = datasets["biomarkers"]
    group_columns = ["model", "type", "age_death", "tissue", "units"]
    point_columns = ["genotype", "measurement", "sex"]

    missing_columns = [
        col
        for col in group_columns + point_columns
        if col not in biomarkers_dataset.columns
    ]
    if missing_columns:
        raise ValueError(
            f"Biomarker dataset missing columns: {', '.join(missing_columns)}"
        )

    biomarkers_dataset = biomarkers_dataset.fillna("none")
    data_rows = []

    grouped = biomarkers_dataset.groupby(group_columns)

    for group_key, group in grouped:
        entry = dict(zip(group_columns, group_key))
        entry["points"] = group[point_columns].to_dict("records")
        data_rows.append(entry)

    return pd.DataFrame(data_rows)
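
A quick way to see the output shape is to run the function on two rows that share every grouping column. The sketch below repeats the function body from this file so that it runs standalone; the sample rows are hand-made in the style of the test fixtures in this PR and are not real data.

```python
from typing import Dict

import pandas as pd


def transform_biomarkers(datasets: Dict[str, pd.DataFrame]) -> pd.DataFrame:
    """Copy of the function in this file, repeated so the sketch runs standalone."""
    biomarkers_dataset = datasets["biomarkers"]
    group_columns = ["model", "type", "age_death", "tissue", "units"]
    point_columns = ["genotype", "measurement", "sex"]

    missing_columns = [
        col
        for col in group_columns + point_columns
        if col not in biomarkers_dataset.columns
    ]
    if missing_columns:
        raise ValueError(
            f"Biomarker dataset missing columns: {', '.join(missing_columns)}"
        )

    biomarkers_dataset = biomarkers_dataset.fillna("none")
    data_rows = []
    # One output row per unique combination of the grouping columns.
    for group_key, group in biomarkers_dataset.groupby(group_columns):
        entry = dict(zip(group_columns, group_key))
        # The group's rows become a list of {genotype, measurement, sex} dicts.
        entry["points"] = group[point_columns].to_dict("records")
        data_rows.append(entry)
    return pd.DataFrame(data_rows)


# Two hand-made rows that differ only in their measurement.
sample = pd.DataFrame(
    {
        "model": ["ModelA", "ModelA"],
        "type": ["TypeA", "TypeA"],
        "measurement": [1, 2],
        "units": ["A", "A"],
        "age_death": [1, 1],
        "tissue": ["TissueA", "TissueA"],
        "sex": ["male", "male"],
        "genotype": ["genotype1", "genotype1"],
    }
)

result = transform_biomarkers({"biomarkers": sample})
print(result.to_dict("records"))
```

Both input rows collapse into a single output row whose `points` list holds one entry per measurement.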
10 changes: 7 additions & 3 deletions src/agoradatatools/process.py
@@ -1,5 +1,4 @@
 import logging
-import typing
 from typing import Union

 import synapseclient
@@ -13,11 +12,14 @@
 from agoradatatools.reporter import ADTGXReporter, DatasetReport
 from agoradatatools.constants import Platform

+
 logger = logging.getLogger(__name__)


 # TODO refactor to avoid so many if's - maybe some sort of mapping to callables
-def apply_custom_transformations(datasets: dict, dataset_name: str, dataset_obj: dict):
+def apply_custom_transformations(
+    datasets: dict, dataset_name: str, dataset_obj: dict
+) -> Union[DataFrame, dict, None]:
     if not isinstance(datasets, dict) or not isinstance(dataset_name, str):
         return None
     if dataset_name == "biodomain_info":
@@ -59,6 +61,8 @@ def apply_custom_transformations(datasets: dict, dataset_name: str, dataset_obj:
     if dataset_name in ["proteomics", "proteomics_tmt", "proteomics_srm"]:
         df = datasets[dataset_name]
         return transform.transform_proteomics(df=df)
+    if dataset_name == "biomarkers":
+        return transform.transform_biomarkers(datasets=datasets)
     else:
         return None

@@ -186,7 +190,7 @@ def process_dataset(

 def create_data_manifest(
     syn: synapseclient.Synapse, parent: synapseclient.Folder = None
-) -> typing.Union[DataFrame, None]:
+) -> Union[DataFrame, None]:
     """Creates data manifest (dataframe) that has the IDs and version numbers of child synapse folders

     Args:
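
The TODO above `apply_custom_transformations()` suggests replacing the chain of `if` statements with a mapping to callables. One possible shape for that refactor is sketched below. It is not part of this PR: the two transform functions are hypothetical stand-ins, and the adapter names and the `TRANSFORMS` table are invented for illustration.

```python
from typing import Callable, Dict, Optional

import pandas as pd

Datasets = Dict[str, pd.DataFrame]


def transform_proteomics(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for transform.transform_proteomics."""
    return df


def transform_biomarkers(datasets: Datasets) -> pd.DataFrame:
    """Hypothetical stand-in for transform.transform_biomarkers."""
    return datasets["biomarkers"]


def _proteomics(datasets: Datasets, dataset_name: str) -> pd.DataFrame:
    # Adapter: the proteomics transform takes a single DataFrame.
    return transform_proteomics(df=datasets[dataset_name])


def _biomarkers(datasets: Datasets, dataset_name: str) -> pd.DataFrame:
    # Adapter: the biomarkers transform takes the whole datasets dict.
    return transform_biomarkers(datasets=datasets)


# Dataset name -> adapter with one shared signature (invented for this sketch).
TRANSFORMS: Dict[str, Callable[[Datasets, str], pd.DataFrame]] = {
    "proteomics": _proteomics,
    "proteomics_tmt": _proteomics,
    "proteomics_srm": _proteomics,
    "biomarkers": _biomarkers,
}


def apply_custom_transformations(
    datasets: dict, dataset_name: str, dataset_obj: dict
) -> Optional[pd.DataFrame]:
    # Same guard as the existing function: bad argument types yield None.
    if not isinstance(datasets, dict) or not isinstance(dataset_name, str):
        return None
    handler = TRANSFORMS.get(dataset_name)
    # Unknown dataset names fall through to None, like the final else branch.
    return handler(datasets, dataset_name) if handler is not None else None


frame = pd.DataFrame({"model": ["ModelA"]})
print(apply_custom_transformations({"biomarkers": frame}, "biomarkers", {}))
```

With this layout, adding a dataset would mean adding one adapter and one table entry rather than another `if` branch.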
@@ -0,0 +1,5 @@
model,type,measurement,units,age_death,tissue,sex,genotype
ModelA,TypeA,1,A,1,TissueA,male,genotype1
ModelA,TypeA,1,A,1,TissueA,male,genotype1
ModelA,TypeA,1,A,1,TissueA,male,genotype2
ModelA,TypeA,1,A,1,TissueA,male,genotype2
@@ -0,0 +1,3 @@
model,type,measurement,units,age_death,tissue,sex,genotype,extra
ModelA,TypeA,1,A,1,TissueA,male,genotype1,extra1
ModelA,TypeA,2,A,1,TissueA,male,genotype1,extra1
@@ -0,0 +1,7 @@
model,type,measurement,units,age_death,tissue,sex,genotype
ModelA,TypeA,1,A,1,TissueA,male,genotype1
ModelA,TypeA,2,A,1,TissueA,male,genotype1
ModelA,TypeA,3,A,2,TissueA,male,genotype2
ModelA,TypeB,4,A,2,TissueA,male,genotype1
ModelA,TypeB,5,A,3,TissueA,male,genotype1
ModelA,TypeB,6,A,3,TissueA,male,genotype2
@@ -0,0 +1,3 @@
model,type,measurement,units,age_death,tissue,sex
ModelA,TypeA,1,A,1,TissueA,male
ModelA,TypeA,2,A,1,TissueA,male
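
This input omits the `genotype` column, so the column check in `transform_biomarkers()` should raise a `ValueError`. A standalone sketch of just that check follows; the helper name `check_columns` is made up for illustration, and the CSV text is the fixture above loaded from a string.

```python
import io

import pandas as pd

# Grouping columns followed by point columns, as listed in transform_biomarkers().
REQUIRED = [
    "model", "type", "age_death", "tissue", "units",
    "genotype", "measurement", "sex",
]


def check_columns(df: pd.DataFrame) -> None:
    """Hypothetical helper that mirrors the missing-column check in the transform."""
    missing = [col for col in REQUIRED if col not in df.columns]
    if missing:
        raise ValueError(f"Biomarker dataset missing columns: {', '.join(missing)}")


csv_text = """model,type,measurement,units,age_death,tissue,sex
ModelA,TypeA,1,A,1,TissueA,male
ModelA,TypeA,2,A,1,TissueA,male
"""
df = pd.read_csv(io.StringIO(csv_text))

try:
    check_columns(df)
    message = None
except ValueError as err:
    message = str(err)

print(message)
```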
@@ -0,0 +1,4 @@
model,type,measurement,units,age_death,tissue,sex,genotype
ModelA,TypeA,1,A,1,TissueA,male,genotype1
ModelA,TypeA,,A,1,TissueA,male,genotype1
,TypeA,1,A,1,TissueA,male,genotype1
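
The blank fields in this input explain why the matching expected output shows `1.0` rather than `1`: an empty cell makes pandas read `measurement` as a float column before `fillna("none")` runs. The sketch below assumes the file is loaded with a default `pd.read_csv` call (the loader is not part of this diff) and the pandas 2.x behaviour that was current when this PR merged.

```python
import io

import pandas as pd

csv_text = """model,type,measurement,units,age_death,tissue,sex,genotype
ModelA,TypeA,1,A,1,TissueA,male,genotype1
ModelA,TypeA,,A,1,TissueA,male,genotype1
,TypeA,1,A,1,TissueA,male,genotype1
"""
df = pd.read_csv(io.StringIO(csv_text))

# The empty measurement cell is read as NaN, which forces a float column.
measurement_dtype = str(df["measurement"].dtype)

# Same call the transform makes: every missing value becomes the string "none".
filled = df.fillna("none")

print(measurement_dtype)
print(filled["measurement"].tolist())
print(filled["model"].tolist())
```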
10 changes: 10 additions & 0 deletions tests/test_assets/biomarkers/input/biomarkers_none_input.csv
@@ -0,0 +1,10 @@
model,type,measurement,units,age_death,tissue,sex,genotype
ModelA,TypeA,1,A,1,TissueA,male,genotype1
ModelA,TypeA,none,A,1,TissueA,male,genotype1
ModelA,TypeA,NA,A,1,TissueA,male,genotype1
ModelA,TypeA,nan,A,1,TissueA,male,genotype1
ModelA,TypeA,N/A,A,1,TissueA,male,genotype1
none,TypeA,1,A,1,TissueA,male,genotype1
none,TypeA,1,A,1,TissueA,male,NA
NA,TypeA,1,A,1,TissueA,male,genotype1
ModelA,NA,1,A,1,TissueA,male,genotype1
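
This fixture mixes several spellings of "missing". With a default `pd.read_csv` call, `NA`, `nan`, and `N/A` are on pandas' built-in list of NA tokens, while lowercase `none` is not and stays an ordinary string. Because of that string, `measurement` is read as text, which is why the expected output for this file has `"1"` in quotes. A sketch under the assumption that the loader uses the pandas defaults (the loader is not shown in this diff):

```python
import io

import pandas as pd

csv_text = """model,type,measurement,units,age_death,tissue,sex,genotype
ModelA,TypeA,1,A,1,TissueA,male,genotype1
ModelA,TypeA,none,A,1,TissueA,male,genotype1
ModelA,TypeA,NA,A,1,TissueA,male,genotype1
ModelA,TypeA,nan,A,1,TissueA,male,genotype1
ModelA,TypeA,N/A,A,1,TissueA,male,genotype1
none,TypeA,1,A,1,TissueA,male,genotype1
none,TypeA,1,A,1,TissueA,male,NA
NA,TypeA,1,A,1,TissueA,male,genotype1
ModelA,NA,1,A,1,TissueA,male,genotype1
"""
df = pd.read_csv(io.StringIO(csv_text))

# Only NA, nan, and N/A are parsed as missing; the literal "none" is kept as text.
missing_measurements = int(df["measurement"].isna().sum())

# After fillna, parsed NA values and the literal "none" are indistinguishable.
filled = df.fillna("none")

print(missing_measurements)
print(filled["measurement"].tolist())
print(filled["model"].tolist())
```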
@@ -0,0 +1,31 @@
[
{
"model": "ModelA",
"type": "TypeA",
"age_death": 1,
"tissue": "TissueA",
"units": "A",
"points": [
{
"genotype": "genotype1",
"measurement": 1,
"sex": "male"
},
{
"genotype": "genotype1",
"measurement": 1,
"sex": "male"
},
{
"genotype": "genotype2",
"measurement": 1,
"sex": "male"
},
{
"genotype": "genotype2",
"measurement": 1,
"sex": "male"
}
]
}
]
@@ -0,0 +1,21 @@
[
{
"model": "ModelA",
"type": "TypeA",
"age_death": 1,
"tissue": "TissueA",
"units": "A",
"points": [
{
"genotype": "genotype1",
"measurement": 1,
"sex": "male"
},
{
"genotype": "genotype1",
"measurement": 2,
"sex": "male"
}
]
}
]
@@ -0,0 +1,68 @@
[
{
"model": "ModelA",
"type": "TypeA",
"age_death": 1,
"tissue": "TissueA",
"units": "A",
"points": [
{
"genotype": "genotype1",
"measurement": 1,
"sex": "male"
},
{
"genotype": "genotype1",
"measurement": 2,
"sex": "male"
}
]
},
{
"model": "ModelA",
"type": "TypeA",
"age_death": 2,
"tissue": "TissueA",
"units": "A",
"points": [
{
"genotype": "genotype2",
"measurement": 3,
"sex": "male"
}
]
},
{
"model": "ModelA",
"type": "TypeB",
"age_death": 2,
"tissue": "TissueA",
"units": "A",
"points": [
{
"genotype": "genotype1",
"measurement": 4,
"sex": "male"
}
]
},
{
"model": "ModelA",
"type": "TypeB",
"age_death": 3,
"tissue": "TissueA",
"units": "A",
"points": [
{
"genotype": "genotype1",
"measurement": 5,
"sex": "male"
},
{
"genotype": "genotype2",
"measurement": 6,
"sex": "male"
}
]
}
]
@@ -0,0 +1,35 @@
[
{
"model": "ModelA",
"type": "TypeA",
"age_death": 1,
"tissue": "TissueA",
"units": "A",
"points": [
{
"genotype": "genotype1",
"measurement": 1.0,
"sex": "male"
},
{
"genotype": "genotype1",
"measurement": "none",
"sex": "male"
}
]
},
{
"model": "none",
"type": "TypeA",
"age_death": 1,
"tissue": "TissueA",
"units": "A",
"points": [
{
"genotype": "genotype1",
"measurement": 1.0,
"sex": "male"
}
]
}
]
74 changes: 74 additions & 0 deletions tests/test_assets/biomarkers/output/biomarkers_none_output.json
@@ -0,0 +1,74 @@
[
{
"model": "ModelA",
"type": "TypeA",
"age_death": 1,
"tissue": "TissueA",
"units": "A",
"points": [
{
"genotype": "genotype1",
"measurement": "1",
"sex": "male"
},
{
"genotype": "genotype1",
"measurement": "none",
"sex": "male"
},
{
"genotype": "genotype1",
"measurement": "none",
"sex": "male"
},
{
"genotype": "genotype1",
"measurement": "none",
"sex": "male"
},
{
"genotype": "genotype1",
"measurement": "none",
"sex": "male"
}
]
},
{
"model": "ModelA",
"type": "none",
"age_death": 1,
"tissue": "TissueA",
"units": "A",
"points": [
{
"genotype": "genotype1",
"measurement": "1",
"sex": "male"
}
]
},
{
"model": "none",
"type": "TypeA",
"age_death": 1,
"tissue": "TissueA",
"units": "A",
"points": [
{
"genotype": "genotype1",
"measurement": "1",
"sex": "male"
},
{
"genotype": "none",
"measurement": "1",
"sex": "male"
},
{
"genotype": "genotype1",
"measurement": "1",
"sex": "male"
}
]
}
]