A Coprocessor Sharing-Aware Scheduler for Xeon Phi-Based Compute Clusters

In Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium (IPDPS ’14)

We propose a cluster scheduling technique for compute clusters with Xeon Phi coprocessors. Even though the Xeon Phi runs Linux, which allows multiprocessing, cluster schedulers generally do not allow jobs to share coprocessors because sharing can cause oversubscription of coprocessor memory and thread resources. It has been shown that memory or thread oversubscription on a manycore processor like the Phi results in job crashes or drastic performance loss. We first show that such an exclusive device allocation policy causes severe coprocessor underutilization: for typical workloads, on average only 38% of the Xeon Phi cores are busy across the cluster. Then, to improve coprocessor utilization, we propose a scheduling technique that enables safe coprocessor sharing without resource oversubscription. Jobs specify their maximum memory and thread requirements, and our scheduler packs as many jobs as possible on each coprocessor in the cluster, subject to resource limits. We solve this problem using a greedy approach at the cluster level combined with a knapsack-based algorithm for each node. Every coprocessor is modeled as a knapsack and jobs are packed into each knapsack with the goal of maximizing job concurrency, i.e., as many jobs as possible executing on each coprocessor. Given a set of jobs, we show that this strategy of packing for high concurrency is a good proxy for (i) reducing makespan, without the need for users to specify job execution times, and (ii) reducing coprocessor footprint, or the number of coprocessors required to finish the jobs without increasing makespan. We implement the entire system as a seamless add-on to Condor, a popular distributed job scheduler, and show makespan and footprint reductions of more than 50% across a wide range of workloads.
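The per-node packing step can be illustrated with a minimal sketch (our own simplification, not the paper's implementation): each coprocessor is a two-resource knapsack, every job has value 1, and a greedy heuristic admits jobs in order of smallest combined resource demand so that as many as possible run concurrently without oversubscribing memory or threads. Job names, demands, and capacities below are hypothetical.

```python
def pack_jobs(jobs, mem_capacity, thread_capacity):
    """Greedy two-resource packing: admit jobs smallest-demand-first,
    maximizing the number of concurrent jobs without oversubscription."""
    # Order by combined demand, normalized against each capacity.
    ordered = sorted(
        jobs,
        key=lambda j: j["mem"] / mem_capacity + j["threads"] / thread_capacity,
    )
    packed, mem_free, thr_free = [], mem_capacity, thread_capacity
    for job in ordered:
        # Admit a job only if BOTH its memory and thread demands still fit.
        if job["mem"] <= mem_free and job["threads"] <= thr_free:
            packed.append(job["name"])
            mem_free -= job["mem"]
            thr_free -= job["threads"]
    return packed

# Hypothetical jobs with (memory GB, thread) requirements for one coprocessor.
jobs = [
    {"name": "A", "mem": 2, "threads": 60},
    {"name": "B", "mem": 4, "threads": 120},
    {"name": "C", "mem": 1, "threads": 30},
    {"name": "D", "mem": 6, "threads": 100},
]
print(pack_jobs(jobs, mem_capacity=8, thread_capacity=240))  # ['C', 'A', 'B']
```

Job D is left for a later scheduling round: admitting it would oversubscribe memory, which is exactly the failure mode the exclusive-allocation policy was guarding against. An exact solver could replace the greedy ordering, but the safety invariant (never exceed either capacity) is the essential part.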

Snapify: capturing snapshots of offload applications on Xeon Phi manycore processors

In Proceedings of the 23rd international symposium on High-performance parallel and distributed computing (HPDC ’14). ACM, New York, NY, USA, 1-12.

Intel Xeon Phi coprocessors provide excellent performance acceleration for highly parallel applications and have been deployed in several top-ranking supercomputers. One popular approach of programming the Xeon Phi is the offload model, where parallel code is executed on the Xeon Phi, while the host system executes the sequential code. However, Xeon Phi’s Many Integrated Core Platform Software Stack (MPSS) lacks fault-tolerance support for offload applications. This paper introduces Snapify, a set of extensions to MPSS that provides three novel features for Xeon Phi offload applications: checkpoint and restart, process swapping, and process migration. The core technique of Snapify is to take consistent process snapshots of the communicating offload processes and their host processes. To reduce the PCI latency of storing and retrieving process snapshots, Snapify uses a novel data transfer mechanism based on remote direct memory access (RDMA). Snapify can be used transparently by single-node and MPI applications, or be triggered directly by job schedulers through Snapify’s API. Experimental results on OpenMP and MPI offload applications show that Snapify adds a runtime overhead of at most 5%, and this overhead is low enough for most use cases in practice.
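Why the snapshots must be *consistent* can be shown with a toy sketch (this is our illustration, not Snapify's actual protocol): if the host and offload processes are captured independently while a message is in flight on the channel between them, the restored pair would lose or duplicate that message. The sketch below drains the channel into the receiver's state before capturing both endpoints, so the snapshot contains no in-flight messages. All names and states are hypothetical.

```python
import queue

def take_consistent_snapshot(host_state, offload_state, channel):
    """Capture a coordinated snapshot of two communicating processes.
    Assumes both sides have already paused sending (coordination step)."""
    # Drain in-flight messages into the receiver, so the snapshot's
    # channel is empty and no message is lost or replayed on restart.
    while not channel.empty():
        offload_state["inbox"].append(channel.get_nowait())
    return {
        "host": dict(host_state),
        "offload": dict(offload_state),
        "in_flight": [],  # empty by construction
    }

chan = queue.Queue()
chan.put("offload-request-1")          # a message still in flight
host = {"pc": 42}
offload = {"pc": 7, "inbox": []}

snap = take_consistent_snapshot(host, offload, chan)
print(snap["offload"]["inbox"])        # the drained message is in the snapshot
```

In the real system the "channel" is the PCIe/SCIF link between host and coprocessor, and the snapshots themselves are moved over RDMA; the drain-then-capture ordering is the part this toy preserves.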


COSMIC: middleware for high performance and reliable multiprocessing on Xeon Phi coprocessors

In Proceedings of the 22nd international symposium on High-performance parallel and distributed computing (HPDC ’13). ACM, New York, NY, USA, 215-226.

It is remarkably easy to offload processing to Intel’s newest manycore coprocessor, the Xeon Phi: it supports a popular ISA (x86-based), a popular OS (Linux) and a popular programming model (OpenMP). Easy portability is attracting programmer efforts to achieve high performance for many applications. But Linux makes it easy for different users to share the Xeon Phi coprocessor, and multiprocessing inefficiencies can easily offset gains made by individual programmers. Our experiments on a production, high-performance Xeon server with multiple Xeon Phi coprocessors show that coprocessor multiprocessing not only slows down the processes but also introduces unreliability (some processes crash unexpectedly).

We propose a new, user-level middleware called COSMIC that improves performance and reliability of multiprocessing on coprocessors like the Xeon Phi. COSMIC seamlessly fits in the existing Xeon Phi software stack and is transparent to programmers. It manages Xeon Phi processes that execute parallel regions offloaded to the coprocessors. Offloads typically have programmer-driven performance directives like thread and affinity requirements. COSMIC does fair scheduling of both processes and offloads, and takes into account conflicting requirements of offloads belonging to different processes. By doing so, it has two benefits. First, it improves multiprocessing performance by preventing thread and memory oversubscription, by avoiding inter-offload interference and by reducing load imbalance on coprocessors and cores. Second, it increases multiprocessing reliability by exploiting programmer-specified per-process coprocessor memory requirements to completely avoid memory oversubscription and crashes. Our experiments on several representative Xeon Phi workloads show that, in a multiprocessing environment, COSMIC improves average core utilization by up to 3 times, reduces makespan by up to 52%, reduces average process latency (turnaround time) by 70%, and completely eliminates process crashes.
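The reliability mechanism can be sketched in a few lines (a hypothetical simplification of COSMIC's admission control, not its actual code): each process declares its coprocessor memory requirement up front, and the middleware dispatches a queued process only when that much memory is actually free on the card, so memory oversubscription, and the crashes it causes, cannot occur. Names and sizes below are invented for illustration.

```python
from collections import deque

class Coprocessor:
    """Tracks free memory on one coprocessor card (hypothetical model)."""
    def __init__(self, total_mem_gb):
        self.free_mem = total_mem_gb
        self.running = []

def schedule(pending, card):
    """FIFO admission control: dispatch the head of the queue only while
    its declared memory requirement fits; otherwise it waits its turn."""
    while pending and pending[0]["mem"] <= card.free_mem:
        proc = pending.popleft()
        card.free_mem -= proc["mem"]       # reserve before dispatch
        card.running.append(proc["name"])

card = Coprocessor(total_mem_gb=8)
queue = deque([
    {"name": "p1", "mem": 5},
    {"name": "p2", "mem": 4},   # would oversubscribe: 5 + 4 > 8
    {"name": "p3", "mem": 2},
])
schedule(queue, card)
print(card.running)   # ['p1']  -- p2 and p3 wait until memory is released
```

Strict FIFO here makes p3 wait behind p2 even though p3 would fit; the real scheduler balances exactly this kind of fairness-versus-utilization trade-off across offloads from different processes.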


Cigani! Juris! Boom, boom, boom, boom, boom. Kutz, kutz ehy ja.

I don’t usually (re)post random videos, but this time it is to introduce the new quote in my about box. It is by Mikhail Kalashnikov, the inventor of the infamous AK-47 (also known simply as “the Kalashnikov”) and of other small arms.

“I would prefer to have invented a machine that people could use and that would help farmers with their work, for example a lawnmower.”

I found this quote a few days ago, while I was looking for something else, of course. I thought the quote also suited my “About” box: I often get a little frustrated because I can’t explain to my friends, my relatives, or random people I meet in a bar what I do, what the object of my work is, what I invent (or what I try to invent).

VMShm, a mechanism for accessing POSIX shared memory from QEMU/kvm guests

VMShm is a mechanism that enables QEMU virtual machines to access POSIX shared memory objects created on the host OS.

VMShm makes it possible for a user-space application running in a virtual machine to map up to 1 MB of a POSIX shared memory object from the host OS.

It can be used as a basis for building a high-performance communication channel between the host and guest OSes.
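The host side of such a channel is just an ordinary POSIX shared memory object, which a guest could then map through VMShm. A minimal host-side sketch (the object name and the guest-side mapping machinery are assumptions; the real guest mapping goes through VMShm's virtual PCI device, not shown here):

```python
from multiprocessing import shared_memory

SHM_NAME = "vmshm_demo"   # shows up as /dev/shm/vmshm_demo on Linux
SHM_SIZE = 1 << 20        # 1 MiB, the maximum window VMShm maps

# Create the POSIX shared memory object on the host.
shm = shared_memory.SharedMemory(name=SHM_NAME, create=True, size=SHM_SIZE)
try:
    shm.buf[:5] = b"hello"        # host writes into the shared region
    data = bytes(shm.buf[:5])     # a guest mapping the object via VMShm
    print(data)                   # would see these same bytes
finally:
    shm.close()
    shm.unlink()                  # remove /dev/shm/vmshm_demo
```

Because both sides see the same physical pages, host-to-guest transfers need no copies through the emulated network stack, which is what makes this attractive as a communication channel.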
