Merge pull request #4 from candiceT233/dayu_project_page

Dayu project page
grc-iit · Nov 14, 2023 · 1c49662
2 parents e71c178 + d4253fe
commit 1c49662

Showing 11 changed files with 234 additions and 0 deletions.
diff --git a/src/data/projects.ts b/src/data/projects.ts
@@ -96,6 +96,21 @@ const projects: Project[] = [
     status: "active",
     type: "funded",
   }, 
+  {
+    id: "dayu",
+    name: "DaYu",
+    title:
+      "DaYu: Optimizing Workflow Performance by Elucidating Semantic Data Flow",
+    shortDescription:
+      "Nowadays, distributed scientific workflows encounter challenges in data movement through storage systems. DaYu, by capturing the mapping of data objects to I/O operations, can uncover new insights for optimizing workflow data movement.",
+    link: "/research/projects/dayu",
+    isFeatured: false,
+    isOpenSource: true,
+    isOurs: true,
+    researchStatus: "testing",
+    status: "active",
+    type: "funded",
+  },
 ];
 
 export default projects;

diff --git a/src/pages/research/projects/dayu.mdx b/src/pages/research/projects/dayu.mdx
@@ -0,0 +1,219 @@
+---
+title: "DaYu: Optimizing Workflow Performance by Elucidating Semantic Data Flow"
+---
+
+import ProjectBadges from "@site/src/components/projects/ProjectBadges";
+
+<p>
+  <img src={require("@site/static/img/projects/dayu/dayu-logo2.png").default} width="200" />
+</p>
+
+
+# DaYu: Optimizing Workflow Performance by Elucidating Semantic Data Flow
+
+<ProjectBadges projectId="dayu" />
+
+Distributed scientific workflows nowadays face challenges in data movement through storage systems. 
+DaYu employs careful runtime measurement, maps domain semantics to low-level I/O operations, and utilizes 
+effective visualization and analysis of semantic data flows to understand how semantic datasets move through storage.
+
+## Introduction
+
+High-Performance Computing (HPC) workflows are evolving with increasing data intensity. 
+These workflows are growing in complexity, featuring multiple stages encompassing simulation, analysis, 
+and AI applications with diverse data demands. 
+Data transfer between HPC tasks currently relies on shared storage layers like Parallel File System (PFS) 
+and Burst Buffer, which can suffer from slow access and I/O contention.
+
+Enhancing data movement within workflows poses a significant challenge. 
+Strategic task scheduling, data caching, and staging have emerged as 
+effective means to boost I/O performance by reducing computation wait times. 
+With more I/O expertise, one can also utilize and fine-tune various I/O
+libraries, middleware, and file system configurations. This tuning often
+requires iterative testing and is limited to specific workloads, while other
+workloads still experience high latency with shared storage.
+
+Understanding I/O behavior is imperative when deciding on the correct strategies to enhance data access. 
+Details about data access within workflow tasks can effectively guide improvements in I/O and system configuration. 
+Existing application I/O profiling tools cannot analyze data access behavior across
+tasks in a workflow or capture the semantic
+information related to low-level I/O requests. Tools with these capabilities would
+be valuable for providing more straightforward insights into
+data access patterns across multiple tasks. By filling this gap, we
+can develop better methods for managing data movement and
+optimizing the overall workflow.
+
+**In this work, we present a new perspective on workflow optimization through**
+- careful runtime measurement of data access metrics,
+- recovering the mapping of domain semantics to low-level I/O operations, and
+- effective visualization and analysis of semantic data flows for the complete workflow.
+
+## Background
+<center>
+<p>
+  <img src={require("@site/static/img/projects/dayu/hdf5_structure.png").default} width="500" />
+</p>
+
+**HDF5 File Structure Overview**
+</center>
+
+
+
+HDF5 is a widely used storage format and I/O library in scientific applications:
+
+- Hierarchical structure of groups, datasets, and attributes
+- Allows enriched metadata that describes different data characteristics
+- Extensive API that enables tracking of both high-level data objects and low-level I/O
+
+
+
+## Approach
+The project's major challenges include mapping data semantics to I/O access, 
+tracking data flow across tasks, and visualizing coordination and time. 
+Our approach consists of three steps:
+<center>
+<p>
+  <img src={require("@site/static/img/projects/dayu/Graph-Overview.png").default} width="800" />
+</p>
+</center>
+
+
+
+## Case Study I: Storm Tracking Workflow
+**PyFLEXTRKR** is a flexible atmospheric feature tracking software package
+for weather research and forecast datasets.
+
+<center>
+<p>
+  <img src={require("@site/static/img/projects/dayu/wrf_pyflextrkr_workflow.png").default} width="800" />
+</p>
+
+**Six-Stage PyFLEXTRKR Pipeline Workflow.**
+</center>
+
+
+
+**Observations**
+- **Inter-task data reuse**: tasks 2, 4, and 6 use files produced by the first task.
+- **Time-dependent inputs**: some input files are required at different points in time.
+- **Unused data**: the file produced by task 4 is not used by any later task.
+
+**Opportunities**
+- Tasks that use common datasets can be scheduled on the same resource.
+- Inputs can be staged in at different points in the workflow.
+- Data not used by later tasks can be offloaded immediately to free up memory.
+
+
+## Case Study II: DeepDriveMD Workflow
+**DeepDriveMD** (DDMD) is a deep learning-driven molecular dynamics
+simulation workflow for protein folding.
+
+### Workflow Task-File DAG
+
+
+<center>
+<p>
+    <img src={require("@site/static/img/projects/dayu/DDMD_workflow-boxed.png").default} width="800" />
+</p>
+
+**Four-Stage Pipeline Workflow (simulation, aggregate, train, and inference).**
+</center>
+
+**Observation**: There are no data dependencies between the `Train` and `Inference` tasks.
+Both read the input file `aggregated.h5` and write different
+sets of output files, and neither task uses the other's output.
+
+**Opportunity**: The `Inference` and `Train` tasks can run in parallel
+without violating data dependencies.
+
+### Semantic DAG 1
+
+<center>
+<p>
+  <img src={require("@site/static/img/projects/dayu/DDMD_aggregate_detail.png").default} width="800" />
+</p>
+
+**Aggregate Stage Close-Up Semantic DAG with Two Datasets.**
+</center>
+
+
+**Observation**: The `Aggregate` task alters the data layout of large datasets without changing the content. 
+Over 95% of the data volume is from the `contact_map` dataset, while only a small amount is from the
+`point_cloud` dataset.
+
+**Opportunity**: Removing the `Aggregate` task does not compromise the correctness of the program.
+Having the `Train` and `Inference` tasks read their input directly from the simulation stage can reduce unnecessary
+data manipulation and movement and improve data access parallelism.
+
+### Semantic DAG 2
+
+<center>
+<p>
+  <img src={require("@site/static/img/projects/dayu/DDMD_train_dset.png").default} width="800" />
+</p>
+
+**DDMD Train Stage Read File I/O Performance Detail.**
+</center>
+
+
+**Observation**: The `Train` task does not access all the datasets in the `aggregated.h5` file.
+In fact, it does not use the largest dataset, which takes up 95% of the file space (see the previous observation).
+
+**Opportunities**
+- Removing the `Aggregate` task can minimize unnecessary data transfers.
+- Caching a subset of the `aggregated.h5` file does not violate task-data dependencies.
+
+## Conclusion
+
+HPC applications currently lack tools for understanding
+data flow between tasks in a workflow. This study introduced Semantic DAGs,
+an enriched version of traditional DAGs. Precise measurements allowed us 
+to reconstruct mappings between tasks and meaningful data objects down to 
+I/O with file addresses. With visualization of task-to-data mapping and 
+extracted performance statistics, we can gain new insight into workflow 
+optimization opportunities.
+
+Our future work will focus on enhancing the analysis method and creating 
+a generalized approach for customized data placement in workflows. 
+We aim to achieve this through I/O buffering middleware and to apply our
+analysis to tune I/O systems effectively to meet workload requirements.
+
+## Members
+
+- Meng Tang
+    - PhD Student @ GRC
+    - Illinois Institute of Technology
+    - [Contact](mailto:[email protected])
+- [Dr. Nathan R. Tallent](https://www.pnnl.gov/people/nathan-tallent)
+    - Co-Principal Investigator
+    - Pacific Northwest National Laboratory (PNNL)
+    - [Contact](mailto:[email protected])
+- [Dr. Anthony Kougkas](https://www.iit.edu/directory/people/antonios-kougkas)
+    - Co-Principal Investigator
+    - Illinois Institute of Technology
+    - [Contact](mailto:[email protected])
+- [Dr. Xian-He Sun](https://www.iit.edu/directory/people/xian-he-sun)
+    - Principal Investigator
+    - Illinois Institute of Technology
+    - [Contact](mailto:[email protected])
+
+## Sponsor
+
+<p>
+  <img src={require("@site/static/img/affiliations/doe.png").default} width="100" />
+</p>
+
+This research is supported by the U.S. Department of Energy (DOE) through 
+the Office of Advanced Scientific Computing Research's 
+"Orchestration for Distributed & Data-Intensive Scientific Exploration."
+
+
+
+<p>
+  <img src={require("@site/static/img/affiliations/nsf.png").default} width="100" />
+</p>
+
+This research is also based upon work supported by the 
+National Science Foundation under Grant no. 
+NSF CSSI-2104013.
+
diff --git a/static/img/projects/dayu/DDMD_aggregate_detail.png b/static/img/projects/dayu/DDMD_aggregate_detail.png
diff --git a/static/img/projects/dayu/DDMD_train_dset.png b/static/img/projects/dayu/DDMD_train_dset.png
diff --git a/static/img/projects/dayu/DDMD_workflow-boxed.png b/static/img/projects/dayu/DDMD_workflow-boxed.png
diff --git a/static/img/projects/dayu/Graph-Overview.png b/static/img/projects/dayu/Graph-Overview.png
diff --git a/static/img/projects/dayu/dayu-logo2.png b/static/img/projects/dayu/dayu-logo2.png
diff --git a/static/img/projects/dayu/hdf5_structure.png b/static/img/projects/dayu/hdf5_structure.png
diff --git a/static/img/projects/dayu/rich_train_dset.png b/static/img/projects/dayu/rich_train_dset.png
diff --git a/static/img/projects/dayu/tallent.png b/static/img/projects/dayu/tallent.png
diff --git a/static/img/projects/dayu/wrf_pyflextrkr_workflow.png b/static/img/projects/dayu/wrf_pyflextrkr_workflow.png