Omic pipeline

R scripts for proteomic and metabolic data analysis

Data cleaning: Scripts to normalize and standardize data, subset data by column name, remove samples with too many missing values, and convert NA/Inf values to zeros.
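As a rough illustration of the last two cleaning steps, here is a minimal sketch in plain Python. It is not the repository's R code: the function name `clean_matrix`, the dict-of-lists input layout, and the 50% missing-value cutoff are all assumptions made for this example.

```python
import math


def clean_matrix(rows, max_missing_fraction=0.5):
    """Drop samples with too many missing values, then zero-fill the rest.

    rows: dict mapping sample name -> list of measurements, where
    None, NaN, and +/-Inf all count as missing (hypothetical layout).
    max_missing_fraction: hypothetical cutoff; tune it for your data.
    """
    def is_bad(v):
        return v is None or math.isnan(v) or math.isinf(v)

    cleaned = {}
    for sample, values in rows.items():
        n_bad = sum(is_bad(v) for v in values)
        # Skip samples whose share of missing values exceeds the cutoff.
        if values and n_bad / len(values) > max_missing_fraction:
            continue
        # Convert the remaining NA/Inf entries to zero.
        cleaned[sample] = [0.0 if is_bad(v) else float(v) for v in values]
    return cleaned


data = {
    "sample_A": [1.0, float("nan"), 3.0, 4.0],
    "sample_B": [None, float("inf"), float("nan"), 2.0],
}
print(clean_matrix(data))
```

In this toy input, sample_A has one missing value out of four and is kept with that value set to zero, while sample_B has three out of four missing and is dropped.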

Graph_ML: Scripts that handle non-Euclidean data such as graphs.

LLM: NLP pipeline to programmatically identify relevant information from large bodies of electronic health data, using the ChatGPT API.

Machine Learning: A series of machine learning algorithms used for regression and classification.

New Tree Based Pipelines: These pipelines use Random Forest, Extra Trees, or XGBoost to both select features and build models. Each pipeline builds a model with the chosen algorithm, calculates accuracy and feature importance, plots separation via PCA, and recursively selects features.

MaxQuant Bash Scripts: Scripts for running MaxQuant on an HPC in a Linux environment.

QC: A QC pipeline that can be used to identify outliers and bad runs/samples in a dataset. It calculates and graphs the number of zeros, the median, and the mean intensity across samples, and produces histograms of each feature, a PCA plot, and a correlation matrix.
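The per-sample statistics named above can be sketched in a few lines of plain Python. This is only an illustration, not the repository's R pipeline: the function name `sample_qc_summary` and the dict-of-lists input are assumptions, and plotting is left out.

```python
from statistics import mean, median


def sample_qc_summary(samples):
    """Return zero count, median, and mean intensity for each sample.

    samples: dict mapping sample/run name -> list of feature
    intensities (hypothetical layout for this sketch).
    """
    summary = {}
    for name, intensities in samples.items():
        summary[name] = {
            "n_zeros": sum(1 for v in intensities if v == 0),
            "median": median(intensities),
            "mean": mean(intensities),
        }
    return summary


runs = {
    "run1": [0, 2.0, 4.0, 6.0],
    "run2": [0, 0, 0, 8.0],
}
for name, stats in sample_qc_summary(runs).items():
    print(name, stats)
```

A run such as run2, with many zeros and a median far below the other runs, is the kind of sample these summaries are meant to flag for closer inspection.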

WGCNA: Weighted correlation network analysis (WGCNA) is used for finding clusters (modules) of highly correlated genes/proteins. The central idea is that proteins whose abundances are correlated have some type of biological relatedness. Part of the pipeline (line 112) produces a table that is then used for GO-elite.
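To make the "weighted correlation" idea concrete, the sketch below builds an unsigned soft-thresholded adjacency, a_ij = |cor(x_i, x_j)|^beta, which is the first step of a WGCNA-style analysis. It is plain Python for illustration only and does not use or reproduce the R WGCNA package; the function names and the power beta = 6 are assumptions chosen for this example.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists.

    Assumes neither list is constant (a zero variance would divide by zero).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def soft_threshold_adjacency(profiles, beta=6):
    """Unsigned adjacency |cor|^beta for every ordered pair of proteins.

    profiles: dict mapping protein name -> abundance across samples.
    beta: soft-thresholding power (hypothetical value for this sketch).
    """
    names = list(profiles)
    return {
        (a, b): abs(pearson(profiles[a], profiles[b])) ** beta
        for a in names
        for b in names
        if a != b
    }


profiles = {
    "P1": [1.0, 2.0, 3.0, 4.0],
    "P2": [2.0, 4.0, 6.0, 8.0],
    "P3": [1.0, 3.0, 2.0, 4.0],
}
for pair, weight in soft_threshold_adjacency(profiles).items():
    print(pair, round(weight, 4))
```

Raising the correlation to a power pushes weak correlations toward zero while keeping strong ones, so the modules found downstream are driven by the strongly co-varying proteins.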

GO-elite: The objective of GO-elite is to identify a set of biological ontology terms or pathways that describe a particular set of genes/proteins. It answers the basic question: who are these proteins and what do they do? The code here is an R-based wrapper for GO-elite that is used to visualize the results.

Random Forest: The objective of this pipeline is to build a model that can predict diagnosis based on relative protein abundance. The pipeline also calculates feature importance, which can be used to find biomarker candidates.

Visualization: Methods for visualization of data and results.

Together these pipelines can move from raw data to final models, network analysis, and visualization. They inform the researcher which groups of genes/proteins are related to disease, which pathways these proteins are in, and which genes/proteins are best used as biomarkers.

Folders and files

Data_Cleaning
GO Elite
Graph_ML
LLM
Machine_Learning
MaxQuant_Bash_Scripts
Mixed_effects_models
QC
Random Forest Pipeline
Single_Cell_RNA-seq
Visualization
WGCNA
multiomic_pipelines
ukbiobank
README.md

ibishof/Omics_pipeline
