Publications

Trilinos: Enabling scientific computing across diverse hardware architectures at scale
ShyLU-node: On-node scalable solvers and preconditioners: Recent progress and current performance
Breaking the mold: Overcoming the time constraints of molecular dynamics on general-purpose hardware
Beyond Exascale: Dataflow Domain Translation on a Cerebras Cluster
Distributed Sparse Tensor Computations in MLIR
A Performance Portable Matrix Free Dense MTTKRP in GenTen
Do AI Models Perform Human-like Abstract Reasoning Across Modalities?
Materials Learning Algorithms (MALA): Scalable machine learning for electronic structure calculations in large-scale atomistic simulations
Performance Portable Gradient Computations Using Source Transformation
Cello: Co-Designing Schedule and Hybrid Implicit/Explicit Buffer for Complex Tensor Reuse
Imperfect Recognition: A Study of OCR Limitations in the Context of Scientific Documents
Jet: Multilevel graph partitioning on graphics processing units
Breaking the Molecular Dynamics Timescale Barrier Using a Wafer-Scale System
TenSQL: An SQL Database Built on GraphBLAS
Predicting electronic structures at any length scale with machine learning
An Experimental Study of Two-level Schwarz Domain-Decomposition Preconditioners on GPUs
Performance Portable Batched Sparse Linear Solvers
High-Performance GMRES Multi-Precision Benchmark: Design, Performance, and Challenges
Training-free hyperparameter optimization of neural networks for electronic structures in matter
Understanding the design-space of sparse/dense multiphase GNN dataflows on spatial accelerators
Concentric Spherical Neural Network for 3D Representation Learning
Parallel graph coloring algorithms for distributed GPU environments
FROSch Preconditioners for Land Ice Simulations of Greenland and Antarctica
Enabling Flexibility for Sparse Tensor Acceleration via Heterogeneity
Evaluating Spatial Accelerator Architectures with Tiled Matrix-Matrix Multiplication
Experimental evaluation of multiprecision strategies for GMRES on GPUs
Extending Sparse Tensor Accelerators to Support Multiple Compression Formats
Union: A unified HW-SW Co-Design ecosystem in MLIR for evaluating tensor operations on spatial accelerators
The Kokkos EcoSystem: Comprehensive Performance Portability For High Performance Computing
Sphynx: A parallel multi-GPU graph partitioner for distributed-memory systems
Performance-portable graph coarsening for efficient multilevel graph analysis
Kokkos Kernels: Performance Portable Sparse/Dense Linear Algebra and Graph Kernels
Kokkos 3: Programming model extensions for the exascale era
Experimental Evaluation of Multiprecision Strategies for GMRES on GPUs
EXAGRAPH: Graph and combinatorial methods for enabling exascale applications
Concentric Spherical GNN for 3D Representation Learning
Co-design center for exascale machine learning technologies (ExaLearn)
Accelerating finite-temperature Kohn-Sham density functional theory with deep neural networks
A survey of numerical methods utilizing mixed precision arithmetic
A Study of Mixed Precision Strategies for GMRES on GPUs
A Block-Based Triangle Counting Algorithm on Heterogeneous Environments
SPHYNX: Spectral Partitioning for HYbrid aNd aXelerator-enabled systems
Scalable, multi-constraint, complex-objective graph partitioning
Scalable asynchronous domain decomposition solvers
Preparing sparse solvers for exascale computing
Performance portable supernode-based sparse triangular solver for manycore architectures
Distributed Memory Graph Coloring Algorithms for Multiple GPUs
An algebraic sparsified nested dissection algorithm using low-rank approximations
A Performance-Portable Nonhydrostatic Atmospheric Dycore for the Energy Exascale Earth System Model Running at Cloud-Resolving Resolutions
Scalable triangle counting on distributed-memory systems
Scalable generation of graphs for benchmarking HPC community-detection algorithms
Linear algebra-based triangle counting via fine-grained tasking on heterogeneous environments: (Update on static graph challenge)
Geometric Mapping of Tasks to Processors on Parallel Computers with Mesh or Torus Networks
A robust hierarchical solver for ill-conditioned systems with applications to ice sheet modeling
A Portable SIMD Primitive Using Kokkos for Heterogeneous Architectures
A Parallel Graph Algorithm for Detecting Mesh Singularities in Distributed Memory Ice Sheet Simulations
Tacho: memory-scalable task parallel sparse Cholesky factorization
Multithreaded sparse matrix-matrix multiplication for many-core and GPU architectures
Geometric partitioning and ordering strategies for task mapping on parallel computers
FROSch: a fast and robust overlapping Schwarz domain decomposition preconditioner based on Xpetra in Trilinos
Fast triangle counting using Cilk
Ensemble grouping strategies for embedded stochastic collocation methods applied to anisotropic diffusion problems
Asynchronous one-level and two-level domain decomposition solvers
A distributed-memory hierarchical solver for general sparse linear systems
Performance-portable sparse matrix-matrix multiplication for many-core architectures
Partitioning trillion-edge graphs in minutes
Fast linear algebra-based triangle counting with KokkosKernels
Distributed graph layout for scalable small-world network analysis
Designing vector-friendly compact BLAS and LAPACK kernels
Basker: Parallel sparse LU factorization utilizing hierarchical parallelism and data layouts
Parallel graph coloring for manycore architectures
Complex network partitioning using label propagation
Basker: a threaded sparse LU factorization utilizing hierarchical parallelism and data layouts
A survey of direct methods for sparse linear systems
Multi-jagged: A scalable parallel spatial partitioning algorithm
High-performance graph analytics on manycore processors
Building blocks for graph based network analysis
Towards extreme-scale simulations with next-generation Trilinos: a low Mach fluid application case study
Towards extreme-scale simulations for low Mach fluids with second-generation Trilinos
PuLP: Scalable multi-objective multi-constraint partitioning for small-world networks
Exploiting geometric partitioning in task mapping for parallel computers
Domain decomposition preconditioners for communication-avoiding Krylov methods on a hybrid CPU/GPU cluster
Scalable matrix computations on large scale-free graphs using 2D graph partitioning
Electrical modeling and simulation for stockpile stewardship
ShyLU: A Hybrid-Hybrid Solver for Multicore Platforms
Multithreaded Algorithms for Maximum Matching in Bipartite Graphs
Amesos2 and Belos: Direct and iterative solvers for large sparse linear systems
Enabling next-generation parallel circuit simulation with Trilinos
A study of combinatorial issues in a sparse hybrid solver
Algorithm 887: CHOLMOD, supernodal sparse Cholesky factorization and update/downdate
System and method for dynamically disabling partially streamed content
System and method for cluster-sensitive sticky load balancing