Somesh Singh

Research

My research centers around high-performance parallel computing with an emphasis on sparse irregular computations. The focus of my Ph.D. research was on accelerating parallel graph processing using approximate computing. Details about my Ph.D. dissertation can be found here. During postdoc, my focus was on designing and developing methods and techniques for high-performance sparse tensor computations.

I left academia in September 2024. The publications from 2025 are based on work completed before then.

Publications

* : The authors are listed in alphabetical order by lastname.

Efficient Parallel Sparse Tensor Contraction*

Somesh Singh, Bora Uçar,

in IEEE Transactions on Parallel and Distributed Systems (TPDS), 2025.

(abstract) (pdf) (preprint) (doi)

We investigate the performance of algorithms for sparse tensor-sparse tensor multiplication (SpGETT). This operation, also called sparse tensor contraction, is a higher order analogue of the sparse matrix-sparse matrix multiplication (SpGEMM) operation. Therefore, SpGETT can be performed by first converting the input tensors into matrices, then invoking high performance variants of SpGEMM, and finally reconverting the resultant matrix into a tensor. Alternatively, one can carry out the scalar operations underlying SpGETT in the realm of tensors without matrix formulation. We discuss the building blocks in both approaches and formulate a hashing-based method to avoid costly search or redirection operations. We present performance results with the current state-of-the-art SpGEMM-based approaches, existing SpGETT approaches, and a carefully implemented SpGETT approach with a new fine-tuned hashing method, proposed in this paper. We evaluate the methods on real world tensors by contracting a tensor with itself along varying dimensions. Our proposed hashing-based method for SpGETT consistently outperforms the state-of-the-art method, achieving a 25% reduction in sequential execution time on average and a 21% reduction in parallel execution time on average across a variety of input instances.

BANG: Billion-Scale Approximate Nearest Neighbour Search using a Single GPU

Karthik V, Saim Khan, Somesh Singh, Harsha Vardhan Simhadri, Jyothi Vedurada,

in IEEE Transactions on Big Data (TBD), 2025.

(abstract) (preprint) (doi) (code)

Approximate Nearest Neighbour Search (ANNS) is a subroutine in algorithms routinely employed in information retrieval, pattern recognition, data mining, image processing, and beyond. Recent works have established that graph-based ANNS algorithms are practically more efficient than the other methods proposed in the literature. The growing volume and dimensionality of data necessitates designing scalable techniques for ANNS. To this end, the prior art has explored parallelizing graph-based ANNS on GPU, leveraging its massive parallelism. The current state-of-the-art GPU-based ANNS algorithms either (i) require both the dataset and the generated graph index to reside entirely in the GPU memory, or (ii) partition the dataset into small independent shards, each of which can fit in GPU memory, and perform the search on these shards on the GPU. While the first approach fails to handle large datasets due to the limited memory available on the GPU, the latter delivers poor performance on large datasets due to high data traffic over the low-bandwidth PCIe interconnect. We introduce BANG, a first-of-its-kind technique for graph-based ANNS on GPU for billion-scale datasets, that cannot entirely fit in the GPU memory. BANG stands out by harnessing a compressed form of the dataset on a single GPU to perform distance computations while efficiently accessing the graph index kept on the host memory, enabling efficient ANNS on large graphs within the limited GPU memory. BANG incorporates highly optimised GPU kernels and proceeds in phases that run concurrently on the GPU and CPU, taking advantage of their architectural specificities. Furthermore, it enables overlapping communication with computation that results in efficient data transfer between the CPU and GPU. Using a single NVIDIA Ampere A100 GPU, BANG achieves throughputs 50×-400× higher than the competing methods for a recall of 0.9 on three popular billion-scale datasets.

Engineering Fast Algorithms for the Bottleneck Matching Problem*

Ioannis Panagiotas, Grégoire Pichon, Somesh Singh, Bora Uçar,

in European Symposium on Algorithms (ESA), 2023.

(abstract) (pdf) (preprint) (doi) (code)

We investigate the maximum bottleneck matching problem in bipartite graphs. Given a bipartite graph with nonnegative edge weights, the problem is to determine a maximum cardinality matching in which the minimum weight of an edge is the maximum. To the best of our knowledge, there are two widely used solvers for this problem based on two different approaches. There exists a third known approach in the literature, which seems inferior to those two which is presumably why there is no implementation of it. We take this third approach, make theoretical observations to improve its behavior, and implement the improved method. Experiments with the existing two solvers show that their run time can be too high to be useful in many interesting cases. Furthermore, their performance is not predictable, and slight perturbations of the input graph lead to considerable changes in the run time. On the other hand, the proposed solver’s performance is much more stable; it is almost always faster than or comparable to the two existing solvers, and its run time always remains low.

Effective Parallelization of the Vehicle Routing Problem

Rajesh Pandian M, Somesh Singh, Rupesh Nasre, N.S. Narayanaswamy,

in Genetic and Evolutionary Computation Conference (GECCO), 2023.

(abstract) (pdf) (doi) (code)

Capacitated Vehicle Routing Problem (CVRP) is an important combinatorial optimization problem, which is also NP-hard. A wide array of heuristics have been proposed in the literature to obtain an approximate solution to CVRP. To improve the execution time, parallel methods have been developed for accelerating metaheuristics-based algorithms, genetic algorithms, and evolutionary algorithms for CVRP. Despite these advances, our experiments with the state-of-the-art parallel solutions indicate that their run times are too high to be practically useful. The combinatorial explosion is so high that the execution time is prohibitively large even on mid-sized CVRP instances having a few hundred customers. In this work, we propose a novel technique which combines local search and randomization for solving CVRP faster with reasonable accuracy, even on large problem instances. Our usage of randomization enables searching a large space of candidate solutions. We experimentally compare our proposed method with the state-of-the-art GPU implementations on diverse input instances and demonstrate the efficacy of our approach. Our sequential and shared-memory parallel implementations are on an average 36×-1189× faster than the state-of-the-art GPU-parallel genetic algorithms while also achieving a superior solution quality. Furthermore, our reported solutions are close to the current best-known solutions from CVRPLIB.

Algorithms and Data Structures for Hyperedge Queries*

Jules Bertrand, Fanny Dufossé, Somesh Singh, Bora Uçar,

in ACM Journal of Experimental Algorithmics (JEA), 2022.

(abstract) (pdf) (preprint) (doi) (code) [ ACM Results Reproduced Badge ]

We consider the problem of querying the existence of hyperedges in hypergraphs. More formally, given a hypergraph, we need to answer queries of the form: "Does the following set of vertices form a hyperedge in the given hypergraph?" Our aim is to set up data structures based on hashing to answer these queries as fast as possible. We propose an adaptation of a well-known perfect hashing approach for the problem at hand. We analyze the space and runtime complexity of the proposed approach and experimentally compare it with the state-of-the-art hashing-based solutions. Experiments demonstrate the efficiency of the proposed approach with respect to the state-of-the-art.

An Efficient Parallel Implementation of a Perfect Hashing Method for Hypergraphs*

Somesh Singh, Bora Uçar,

in GrAPL: Workshop on Graphs, Architectures, Programming, and Learning, IPDPSW 2022.

(abstract) (pdf) (preprint) (doi)

Querying the existence of an edge in a given graph or hypergraph is a building block in several algorithms. Hashing-based methods can be used for this purpose, where the given edges are stored in a hash table in a preprocessing step, and then the queries are answered using the lookup operations. While the general hashing methods have fast lookup times in the average case, the worst case run time is much higher. Perfect hashing methods take advantage of the fact that the items to be stored are all available and construct a collision free hash function for the given input, resulting in an optimal lookup time even in the worst case. We investigate an efficient shared-memory parallel implementation of a recently proposed perfect hashing method for hypergraphs. We experimentally compare the resulting parallel algorithms with the state-of-the-art and demonstrate better run time and scalability on a set of hypergraphs corresponding to real-life sparse tensors.

ParTBC: Faster Estimation of Top-k Betweenness Centrality Vertices on GPU

Somesh Singh, Tejas Shah, Rupesh Nasre,

in ACM Transactions on Design Automation of Electronic Systems (TODAES), 2021.

(abstract) (pdf) (doi)

Betweenness centrality (BC) is a popular centrality measure, based on shortest paths, used to quantify the importance of vertices in networks. It is used in a wide array of applications including social network analysis, community detection, clustering, biological network analysis, and several others. The state-of-the-art Brandes' algorithm for computing BC has time complexities of O( |V|·|E| ) and O( |V|·|E| + |V|²·log|V| ) for unweighted and weighted graphs, respectively. Brandes' algorithm has been successfully parallelized on multicore and manycore platforms. However, the computation of vertex BC continues to be time-consuming for large real-world graphs. Often, in practical applications, it suffices to identify the most important vertices in a network; that is, those having the highest BC values. Such applications demand only the top vertices in the network as per their BC values but do not demand their actual BC values. In such scenarios, not only is computing the BC of all the vertices unnecessary but also exact BC values need not be computed. In this work, we attempt to marry controlled approximations with parallelization to estimate the k-highest BC vertices faster, without having to compute the exact BC scores of the vertices. We present a host of techniques to determine the top-k vertices faster, with a small inaccuracy, by computing approximate BC scores of the vertices. Aiding our techniques is a novel vertex-renumbering scheme to make the graph layout more structured, which results in faster execution of parallel Brandes' algorithm on GPU. Our experimental results, on a suite of real-world and synthetic graphs, show that our best performing technique computes the top-k vertices with an average speedup of 2.5× compared to the exact parallel Brandes' algorithm on GPU, with an error of less than 6%. Our techniques also exhibit high precision and recall, both in excess of 94%.

Graffix: Efficient Graph Processing with a Tinge of GPU-Specific Approximations

Somesh Singh, Rupesh Nasre,

in International Conference on Parallel Processing (ICPP), 2020.

(abstract) (pdf) (doi) (talk)

Parallelizing graph algorithms on GPUs is challenging due to the irregular memory accesses involved in graph traversals. In particular, three important GPU-specific aspects affect performance: memory coalescing, memory latency, and thread divergence. In this work, we attempt to tame these challenges using approximate computing. We target graph applications on GPUs that can tolerate some degradation in the quality of the output, for obtaining the result in short order. We propose three techniques for boosting the performance of graph processing on the GPU by injecting approximations in a controlled manner. The first one creates a graph isomorph that brings relevant nodes nearby in memory and adds controlled replica of nodes to improve coalescing. The second reduces memory latency by facilitating the processing of subgraphs inside shared memory by adding edges among specific nodes and processing well-connected subgraphs iteratively inside shared memory. The third technique normalizes degrees across nodes assigned to a warp to reduce thread divergence. Each of the techniques offers notable performance benefits, and provides a knob to control the amount of inaccuracy added to an execution. We demonstrate the effectiveness of the proposed techniques using a suite of five large graphs with varied characteristics and five popular graph algorithms.

SixTrack V and runtime environment

R. De Maria, J. Andersson, V.K. Berglyd Olsen, L. Field, M. Giovannozzi, P.D. Hermes, N. Høimyr, S. Kostoglou, G. Iadarola, E. Mcintosh, A. Mereghetti, J. Molson, D. Pellegrini, T. Persson, M. Schwinzerl, E.H. Maclean, K.N. Sjobak, I. Zacharov, S. Singh,

in International Journal of Modern Physics A (IJMPA), 2019. [ Invited Paper ]

(abstract) (pdf) (doi) (code)

SixTrack is a single-particle tracking code for high-energy circular accelerators routinely used at CERN for the Large Hadron Collider (LHC), its luminosity upgrade (HL-LHC), the Future Circular Collider (FCC) and the Super Proton Synchrotron (SPS) simulations. The code is based on a 6D symplectic tracking engine, which is optimized for long-term tracking simulations and delivers fully reproducible results on several platforms. It also includes multiple scattering engines for beam–matter interaction studies, as well as facilities to run the integrated simulations with external particle matter interaction codes. These features differentiate SixTrack from general-purpose, optics-design software. The code recently underwent a major restructuring to merge the advanced features into a single branch, such as multiple ion species, interface with external codes and high-performance input/output. This restructuring also removed a large number of compilation flags, instead enabling/disabling the functionality with runtime options. In the process, the code was moved from Fortran 77 to Fortran 2018 standard, also allowing and achieving a better modularization. Physics models (beam–beam effects, Radio-Frequency (RF) multipoles, current carrying wires, solenoid and electron lenses) and methods (symplecticity check) have also been reviewed and refined to offer more accurate results. The SixDesk runtime environment allows the user to manage the large batches of simulations required for accurate predictions of the dynamic aperture. SixDesk supports several cluster environments available at CERN as well as submitting jobs to the LHC@Home volunteering computing project, which enables volunteers contributing with their hardware to CERN simulation. SixTrackLib is a new library aimed at providing a portable and flexible tracking engine for single- and multi-particle problems using the models and formalism of SixTrack. The library is able to run in CPUs as well as graphical processing units (GPUs). This contribution presents the status of the code, summarizes the main existing features and provides details about the main development lines SixTrack, SixDesk and SixTrackLib.

Optimizing Graph Processing on GPUs using Approximate Computing: Poster

Somesh Singh, Rupesh Nasre,

in ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), 2019.

(abstract) (pdf) (doi)

Parallelizing graph algorithms on GPUs is challenging due to the irregular memory accesses and control-flow involved in graph traversals. In this work, we tame these challenges by injecting approximations. In particular, we improve memory coalescing by renumbering and replicating nodes, memory latency by adding edges among specific nodes brought into shared memory, and thread-divergence by normalizing degrees across nodes assigned to a warp. Using a suite of graphs with varied characteristics and five popular algorithms, we demonstrate the effectiveness of our proposed techniques. Our approximations for coalescing, memory latency and thread-divergence lead to mean speedups of 1.3×, 1.41× and 1.06× achieving accuracies of 83%, 78% and 84%, respectively.

SixTrack Project: Status, Runtime Environment, and New Developments

R. De Maria, J. Andersson, V.K. Berglyd Olsen, L. Field, M. Giovannozzi, P.D. Hermes, N. Høimyr, S. Kostoglou, G. Iadarola, E. Mcintosh, A. Mereghetti, J. Molson, D. Pellegrini, T. Persson, M. Schwinzerl, E.H. Maclean, K.N. Sjobak, I. Zacharov, S. Singh,

in International Computational Accelerator Physics Conference (ICAP), 2018.

(abstract) (pdf) (doi)

SixTrack is a single-particle tracking code for high-energy circular accelerators routinely used at CERN for the Large Hadron Collider (LHC), its luminosity upgrade (HL-LHC), the Future Circular Collider (FCC), and the Super Proton Synchrotron (SPS) simulations. The code is based on a 6D symplectic tracking engine, which is optimised for long-term tracking simulations and delivers fully reproducible results on several platforms. It also includes multiple scattering engines for beam-matter interaction studies, as well as facilities to run integrated simulations with FLUKA and GEANT4. These features differentiate SixTrack from general-purpose, optics-design software like MAD-X. The code recently underwent a major restructuring to merge advanced features into a single branch, such as multiple ion species, interface with external codes, and high-performance input/output (XRootD, HDF5). This restructuring also removed a large number of build flags, instead enabling/disabling the functionality at run-time. In the process, the code was moved from Fortran 77 to Fortran 2018 standard, also allowing and achieving a better modularization. Physics models (beam-beam effects, RF-multipoles, current carrying wires, solenoid, and electron lenses) and methods (symplecticity check) have also been reviewed and refined to offer more accurate results. The SixDesk runtime environment allows the user to manage the large batches of simulations required for accurate predictions of the dynamic aperture. SixDesk supports CERN LSF and HTCondor batch systems, as well as the BOINC infrastructure in the framework of the LHC@Home volunteering computing project. SixTrackLib is a new library aimed at providing a portable and flexible tracking engine for single- and multi-particle problems using the models and formalism of SixTrack. The tracking routines are implemented in a parametrized C code that is specialised to run vectorized in CPUs and GPUs, by using SIMD intrinsics, OpenCL 1.2, and CUDA technologies. This contribution presents the status of the code and an outlook on future developments of SixTrack, SixDesk, and SixTrackLib.

Scalable and Performant Graph Processing on GPUs using Approximate Computing

Somesh Singh, Rupesh Nasre,

in IEEE Transactions on Multi-Scale Computing Systems (TMSCS), 2018.

(abstract) (pdf) (doi)

Graph algorithms are widely used in several application domains. It has been established that parallelizing graph algorithms is challenging. The parallelization issues get exacerbated when graphics processing units (GPUs) are used to execute graph algorithms. While the prior art has shown effective parallelization of several graph algorithms on GPUs, a few algorithms are still expensive. In this work, we address the scalability issues in graph parallelization. In particular, we aim to improve the execution time by tolerating a little approximation in the computation. We study the effects of four heuristic approximations on six graph algorithms with five graphs and show that if an application allows for small inaccuracy, this can be leveraged to achieve considerable performance benefits. We also study the effects of the approximations on GPU-based processing and provide interesting takeaways.

Patents

Method and System for Performing Approximate Nearest Neighbour Search on Billion-Scale Vectors

Karthik V, Somesh Singh, Harsha Vardhan Simhadri, Jyothi Vedurada,

Indian Patent, filed 2024. (Patent Pending)

Other Projects

Awarded $1000 in Google Cloud Platform credits for research on parallel subgraph isomorphism 2020
Google Summer of Code 2018 : Optimize and Integrate Standalone Tracking Library (SixTrackLib) SixTrack, CERN-HSF
Developed a standalone parallel implementation of (a part of) SixTrackLib. The contributed code is available here.
Google Summer of Code 2017 : Smart Data Structures in CUDA CMS, CERN-HSF
Developed SALLOC -- An arena based memory allocator for SIMT architectures.