Abstract
This paper presents an OpenCL implementation of a volumetric JPEG 2000 codec that runs on GPUs, multi-core processors, or a combination of both. Because the performance-critical part consists of a fine-grained algorithm (the discrete wavelet transform) and a coarse-grained algorithm (Tier-1), the best performance is obtained with a hybrid execution in which the discrete wavelet transform runs on a GPU and Tier-1 runs on a multi-core processor. Combining an Intel i7 multi-core processor with a modest NVIDIA Quadro K620 GPU yields speedups greater than 10 over the original sequential code. The paper discusses the performance bottlenecks that arise on GPUs when parallelizing inherently coarse-grained algorithms, together with the optimizations that are possible. A performance analysis reveals the inefficiencies and explains the deviations from GPU peak performance.
