An Experimental Study of Two-level Schwarz Domain-Decomposition Preconditioners on GPUs
High-Performance GMRES Multi-Precision Benchmark: Design, Performance, and Challenges
Understanding the design-space of sparse/dense multiphase GNN dataflows on spatial accelerators
Concentric Spherical Neural Network for 3D Representation Learning
Experimental evaluation of multiprecision strategies for GMRES on GPUs
Extending Sparse Tensor Accelerators to Support Multiple Compression Formats
Performance-portable graph coarsening for efficient multilevel graph analysis
Union: A unified HW-SW Co-Design ecosystem in MLIR for evaluating tensor operations on spatial accelerators
A Performance-Portable Nonhydrostatic Atmospheric Dycore for the Energy Exascale Earth System Model Running at Cloud-Resolving Resolutions
Distributed Memory Graph Coloring Algorithms for Multiple GPUs
Performance portable supernode-based sparse triangular solver for manycore architectures
SPHYNX: Spectral Partitioning for HYbrid aNd aXelerator-enabled systems
A Parallel Graph Algorithm for Detecting Mesh Singularities in Distributed Memory Ice Sheet Simulations
A Portable SIMD Primitive Using Kokkos for Heterogeneous Architectures
Linear algebra-based triangle counting via fine-grained tasking on heterogeneous environments (Update on static graph challenge)
Scalable generation of graphs for benchmarking HPC community-detection algorithms
Scalable triangle counting on distributed-memory systems
Asynchronous one-level and two-level domain decomposition solvers
Fast triangle counting using Cilk
FROSch: a fast and robust overlapping Schwarz domain decomposition preconditioner based on Xpetra in Trilinos
Tacho: memory-scalable task parallel sparse Cholesky factorization
Designing vector-friendly compact BLAS and LAPACK kernels
Fast linear algebra-based triangle counting with KokkosKernels
Partitioning trillion-edge graphs in minutes
Performance-portable sparse matrix-matrix multiplication for many-core architectures
Basker: a threaded sparse LU factorization utilizing hierarchical parallelism and data layouts
Parallel graph coloring for manycore architectures
High-performance graph analytics on manycore processors
Building blocks for graph based network analysis
Domain decomposition preconditioners for communication-avoiding Krylov methods on a hybrid CPU/GPU cluster
Exploiting geometric partitioning in task mapping for parallel computers
PuLP: Scalable multi-objective multi-constraint partitioning for small-world networks
Towards extreme-scale simulations with next-generation Trilinos: a low Mach fluid application case study
Scalable matrix computations on large scale-free graphs using 2D graph partitioning
Multithreaded Algorithms for Maximum Matching in Bipartite Graphs
ShyLU: A Hybrid-Hybrid Solver for Multicore Platforms
A study of combinatorial issues in a sparse hybrid solver
Enabling next-generation parallel circuit simulation with Trilinos