Article:
--------
C. Riesinger, T. Neckel & F. Rupp (2016): Non-Standard Pseudo Random Number Generators Revisited for GPUs, Future Generation Computer Systems, 82, pp. 482-492, first published online: 24 DEC 2016
https://doi.org/10.1016/j.future.2016.12.018

Highlights
----------
* Three methods to generate normally distributed pseudo random numbers whose properties make them attractive for an implementation on the GPU.
* Established GPU random number libraries are outperformed by a factor of up to 4.53, CPU libraries by a factor of up to 2.61.
* One of the three methods has never been considered for GPUs before but delivers the best performance on many of the benchmarked GPU architectures.

Abstract:
---------
Pseudo random number generators are intensively used in many computational applications, e.g., the treatment of uncertainty quantification problems. For this reason, the right selection of such generators and their optimization for various hardware architectures is of great interest. In this paper, we analyze three different pseudo random number generators for normally distributed random numbers: the Ziggurat method, rational polynomials approximating the inverse cumulative distribution function of the normal distribution, and the Wallace method. These uncommon generators are typically not the first choice when it comes to the generation of normally distributed random numbers. We investigate the properties of these three generators and show how they can be exploited for an efficient high-performance implementation on GPUs, making these generators a good alternative on this type of hardware architecture. Various benchmark results show that our implementations outperform well-established normal pseudo random number generators on GPUs by factors of up to 4.5, depending on the utilized GPU architecture. We achieve generation rates of up to 4.4 billion normally distributed random numbers per second per GPU. In addition, we show that our GPU implementations are competitive against state-of-the-art normal pseudo random number generators on CPUs, being up to 2.6 times faster than an OpenMP-parallelized and vectorized code.
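
One of the three generators mentioned above inverts the cumulative distribution function of the normal distribution, which the paper approximates with rational polynomials. The following minimal C++ sketch is not the paper's code: it only illustrates the underlying inverse-transform idea, substituting a Newton iteration on the standard-library error function for the rational polynomial approximation. The seed, sample count, and helper names are illustrative assumptions.

    // Minimal sketch of inverse-transform sampling of N(0,1) variates:
    // draw u ~ U(0,1) and return x with Phi(x) = u. A production generator
    // (as in the paper) would replace the Newton loop by a branch-free
    // rational polynomial approximation of Phi^{-1}.
    #include <cmath>
    #include <cstdio>
    #include <random>

    static const double PI = 3.14159265358979323846;

    // Standard normal CDF and PDF.
    static double norm_cdf(double x) { return 0.5 * std::erfc(-x / std::sqrt(2.0)); }
    static double norm_pdf(double x) { return std::exp(-0.5 * x * x) / std::sqrt(2.0 * PI); }

    // Invert Phi by Newton's method; converges since Phi is smooth and monotone.
    static double inv_norm_cdf(double u) {
        double x = 0.0;
        for (int i = 0; i < 64; ++i) {
            double step = (norm_cdf(x) - u) / norm_pdf(x);
            x -= step;
            if (std::fabs(step) < 1e-12) break;
        }
        return x;
    }

    int main() {
        std::mt19937_64 rng(42);  // uniform source; seed chosen arbitrarily
        std::uniform_real_distribution<double> unif(0.0, 1.0);  // production code should guard u == 0
        const int n = 100000;
        double sum = 0.0, sumsq = 0.0;
        for (int i = 0; i < n; ++i) {
            double z = inv_norm_cdf(unif(rng));  // one normal variate per uniform draw
            sum += z;
            sumsq += z * z;
        }
        double mean = sum / n;
        std::printf("sample mean %.4f, variance %.4f\n", mean, sumsq / n - mean * mean);
        return 0;
    }

The inverse-transform approach maps exactly one uniform input to one normal output with no rejection step, which is one reason it parallelizes well on GPUs.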