Add lifecycle discussion
maherou committed Nov 22, 2020
1 parent a04779b commit 1001e67
Showing 15 changed files with 35 additions and 24 deletions.
Binary file added E4S-Hierarchy.png
Binary file not shown.
Binary file added E4S-Lifecycle.png
Binary file not shown.
Binary file added ECP-ST-CAR.docx
Binary file not shown.
8 changes: 4 additions & 4 deletions Introduction.tex
@@ -31,13 +31,13 @@ \section{Introduction}\label{sect:intro}
\subsection{Background}
Historically, the software used on supercomputers has come from three sources: computer system vendors, DOE laboratories, and academia. Traditionally, vendors have supplied system software: operating system, compilers, runtime, and system-management software. The basic system software is typically augmented by software developed by the DOE HPC facilities to fill gaps or to improve management of the systems. An observation is that it is common for system software to break or not perform well when there is a jump in the scale of the system.

Mathematical libraries and tools for supercomputers have traditionally been developed at DOE laboratories and universities and ported to the new computer architectures when they are deployed. These math libraries and tools have been remarkably robust and have supplied some of the most impactful improvements in application performance and productivity. The challenges have been the constant adapting and tuning to rapidly changing architectures.
Mathematical libraries and tools for supercomputers have traditionally been developed at DOE laboratories and universities and ported to new computer architectures as they are deployed. Vendors also play a role in this space by optimizing the implementations of commonly used libraries and tools for their architectures, while retaining the interfaces defined by the broader community. This approach enables compile- and link-time replacement with the vendor versions to improve performance on a specific platform. Math libraries and tools have been remarkably robust and have supplied some of the most impactful improvements in application performance and productivity. The principal challenge has been the constant adaptation and tuning to rapidly changing architectures.
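As a purely illustrative, hypothetical sketch of this interface-preserving substitution (not taken from any ECP product), the example below programs against the standard CBLAS interface for a small matrix-matrix multiply. The application source is identical whichever implementation is chosen; only the link line changes, and the link lines shown in the comments are assumptions for illustration rather than a prescribed build recipe.

\begin{verbatim}
/* Hypothetical sketch: the application calls the community-defined CBLAS
 * interface; the implementation is selected at link time, for example
 *   reference BLAS:  cc gemm_demo.c -lcblas -lblas
 *   vendor BLAS:     cc gemm_demo.c -lmkl_rt   (one illustrative option)
 */
#include <cblas.h>
#include <stdio.h>

int main(void) {
  /* Row-major 2x2 matrices; compute C = 1.0*A*B + 0.0*C */
  double A[4] = {1, 2, 3, 4};
  double B[4] = {5, 6, 7, 8};
  double C[4] = {0, 0, 0, 0};
  cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
              2, 2, 2, 1.0, A, 2, B, 2, 0.0, C, 2);
  printf("C = [%g %g; %g %g]\n", C[0], C[1], C[2], C[3]);
  return 0;
}
\end{verbatim}

Because the interface is fixed by the community, the swap requires no source changes; any performance difference comes entirely from the implementation selected at link time.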

Programming paradigms and the associated programming environments that include compilers, debuggers, message passing, and associated runtimes have traditionally been developed by vendors, DOE laboratories, and universities. The same can be said for file system and storage software. An observation is that the vendor is ultimately responsible for providing a programming environment and file system with the supercomputer, but there is often a struggle to get the vendors to support software developed by others or to invest in new ideas that have few or no users yet. Another observation is that file-system software plays a key role in overall system resilience, and the difficulty of making the file-system software resilient has grown non-linearly with the scale and complexity of the supercomputers.

In addition to the lessons learned from traditional approaches, Exascale computers pose unique software challenges including the following.
\begin{itemize}
\item \textbf{Extreme parallelism:} Experience has shown that software breaks at each shift in scale. Exascale systems are predicted to have a billion-way concurrency almost exclusively from discrete accelerator devices, similar to today's GPUs. Because clock speeds have essentially stalled, the 1000-fold increase in potential performance going from Petascale to Exascale is entirely from concurrency improvements.
\item \textbf{Extreme parallelism:} Experience has shown that software breaks at each shift in scale. Exascale systems are predicted to have billion-way concurrency, almost exclusively from discrete accelerator devices similar to today's GPUs. An alternative approach using many cores with vector units is also competitive, but it still requires approximately the same amount of parallelism. Because clock speeds have essentially stalled, the 1000-fold increase in potential performance going from Petascale to Exascale comes entirely from concurrency improvements.
\item \textbf{Data movement in a deep memory hierarchy: }Data movement has been identified as a key impediment to performance and power consumption. Exascale system designs are increasing the types and layers of memory, which further challenges the software to increase data locality and reuse, while reducing data movement.
\item \textbf{Discrete memory and execution spaces:} The node architectures of Exascale systems include host CPUs and discrete device accelerators. Programming for these systems requires coordinated transfer of data and work between the host and device. While some of this transfer can be managed implicitly, for the most performance-sensitive phases the programmer typically must manage host-device coordination explicitly. Much of the software transformation effort will be focused on this issue; a brief sketch of the explicit pattern appears after this list.
\end{itemize}
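The following minimal sketch, written in CUDA purely for illustration and not drawn from any ECP code base, shows the explicit host-device pattern referenced in the last bullet above: device memory is allocated, data are copied from the host, work is launched on the device, and results are copied back. The kernel, problem size, and scaling factor are illustrative assumptions; in practice this coordination is expressed through a variety of programming models.

\begin{verbatim}
// Hypothetical sketch: explicit host-device coordination for one
// performance-sensitive phase (scaling a vector on the device).
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(double *x, double a, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per element
  if (i < n) x[i] *= a;
}

int main() {
  const int n = 1 << 20;
  double *h_x = new double[n];
  for (int i = 0; i < n; ++i) h_x[i] = 1.0;

  double *d_x = nullptr;
  cudaMalloc(&d_x, n * sizeof(double));         // allocate device memory
  cudaMemcpy(d_x, h_x, n * sizeof(double),
             cudaMemcpyHostToDevice);           // explicit host-to-device copy
  scale<<<(n + 255) / 256, 256>>>(d_x, 2.0, n); // offload the work to the device
  cudaMemcpy(h_x, d_x, n * sizeof(double),
             cudaMemcpyDeviceToHost);           // explicit device-to-host copy
  cudaFree(d_x);

  printf("x[0] = %g\n", h_x[0]);                // expect 2.0
  delete[] h_x;
  return 0;
}
\end{verbatim}

Unified or managed memory can hide the copies shown here, but, as noted above, the most performance-sensitive phases typically still manage data placement and transfer explicitly.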
@@ -59,7 +59,7 @@ \subsection{ECP ST Project WBS changes}\label{subsect:ProjectRestructuring}
\begin{figure}
\centering
\includegraphics[width=0.9\linewidth]{STFY20WBS}
\caption{\label{fig:wbs-FY20} The FY20 ECP ST WBS structure as of November 18, 2020, includes two new L4 subprojects: 2.3.5.10 ExaWorks, a workflow components project, and 2.3.3.15 PEEKS, a new solver effort that provides funding for Trilinos porting to Frontier and Aurora platforms and merges sparse solver efforts for the products Kokkos Kernels and Ginkgo from CLOVER for improved compatibility.}
\caption{\label{fig:wbs-FY20} The FY20 ECP ST WBS structure as of November 18, 2020, includes two new L4 subprojects: 2.3.5.10 ExaWorks, a workflow components project, and 2.3.3.15 Sake, a new solver effort that provides funding for Trilinos porting to Frontier and Aurora platforms.}
\end{figure}

\begin{figure}
@@ -92,7 +92,7 @@ \subsection{ECP ST Project WBS changes}\label{subsect:ProjectRestructuring}
\item Phase 3b: 35 total L4 subprojects. Add two new L4 subprojects.
\begin{itemize}
\item New L4 subproject called ExaWorks. Focuses on providing an underlying component architecture for workflow management systems, led by a team of workflow experts who would leverage the new substrate in their own workflow products.
\item Revised L4 subproject called PEEKS. PEEKS was originally an independent L4 subproject in Phase 2. In Phase 3 PEEKS was combined with SLATE, HeFFTe and Kokkos Kernels to create the CLOVER L4 subproject. Due to an increased level of funding for sparse solvers in Trilinos and the close connection between PEEKS and Kokkos Kernels with efforts in Trilinos, it made sense to again have PEEKS be an independent project that combines the previous PEEKS and Kokkos Kernels scope along with the new scope for Trilinos. The new Trilinos scope focuses on specific porting to Aurora and Frontier, which is outside the scope of what NNSA funds would support.
\item New L4 subproject called Sake. This project was created in response to a need for Trilinos funding to port to Aurora and Frontier. At the same time, Trilinos-related activities in the CLOVER project, specifically Kokkos Kernels, were merged with the new Trilinos funding to create a more holistic project, independent of CLOVER.
\item Figure~\ref{fig:wbs-FY20} shows the overall structure.
\end{itemize}
\end{itemize}