
dc.contributor.advisor	Barlas, Gerassimos
dc.contributor.author	Elhiny, Lamees
dc.date.accessioned	2017-01-17T08:10:45Z
dc.date.available	2017-01-17T08:10:45Z
dc.date.issued	2016-11
dc.identifier.other	35.232-2016.43
dc.identifier.uri	http://hdl.handle.net/11073/8695
dc.description	A Master of Science thesis in Computer Engineering by Lamees Elhiny entitled, "Load Partitioning for Matrix-Matrix Multiplication on a Cluster of CPU/GPU Nodes Using the Divisible Load Paradigm," submitted in November 2016. Thesis advisor is Dr. Gerassimos Barlas. Soft and hard copy available.	en_US
dc.description.abstract	Matrix-matrix multiplication is a component of many numerical algorithms; however, it is a time-consuming operation. When the matrix size is very large, performing the multiplication on a single processor is not sufficiently fast. An efficient approach to matrix-matrix multiplication can therefore improve the performance of the many applications that depend on it. The aim of this study is to improve the efficiency of matrix-matrix multiplication on a distributed network composed of heterogeneous nodes. Since load balancing between heterogeneous nodes is the biggest challenge, the performance model is derived using Divisible Load Theory (DLT). The proposed solution improves performance through: (a) reduced communication overhead, as DLT-derived load partitioning does not require synchronization between nodes during processing, and (b) high resource utilization, as both the Central Processing Unit (CPU) and the Graphics Processing Unit (GPU) take part in the computation. Experiments are conducted on a single node as well as on a cluster of nodes. The results show that the use of the DLT equations balances the load between CPUs and GPUs. On a single node, the suggested hybrid approach has superior performance compared to the C Basic Linear Algebra Subprograms (CBLAS) and OpenBLAS approaches. On the other hand, the performance difference between the hybrid and GPU-only (cuBLAS) approaches is mild, as the majority of the load in the hybrid approach is allocated to the GPU. On a cluster of nodes, the computation time is reduced to almost half of the GPU-only processing time; however, the overall improvement is impeded by communication overhead. It is expected that faster communication media could reduce the overall time and further improve the speedup.	en_US
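The divisible-load partitioning idea summarized in the abstract can be sketched as follows. This is a minimal illustration under the simplifying assumption of negligible communication cost; the function name and the per-unit timings are hypothetical and not taken from the thesis itself.

```python
def dlt_partition(unit_times):
    """Split a divisible load among workers so that all finish together.

    unit_times[i] is the (assumed) time worker i needs to process one
    unit of load. With communication cost ignored, equal finish times
    imply each worker's fraction is inversely proportional to its
    per-unit time, i.e. proportional to its speed.
    """
    speeds = [1.0 / t for t in unit_times]
    total = sum(speeds)
    return [s / total for s in speeds]

# Illustrative timings: a CPU taking 4 time units per block of rows
# and a GPU taking 1 time unit per block.
fractions = dlt_partition([4.0, 1.0])
print(fractions)  # the GPU receives 80% of the load, the CPU 20%
```

In the thesis's setting the same principle extends to a cluster: each node's CPU/GPU pair receives a load fraction derived from the DLT performance model, so no synchronization between nodes is needed during processing.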
dc.description.sponsorship	College of Engineering	en_US
dc.description.sponsorship	Department of Computer Science and Engineering	en_US
dc.language.iso	en_US	en_US
dc.relation.ispartofseries	Master of Science in Computer Engineering (MSCoE)	en_US
dc.subject	hybrid processing	en_US
dc.subject	parallel processing	en_US
dc.subject	load partitioning	en_US
dc.subject	matrix-matrix multiplication	en_US
dc.subject	divisible load theory	en_US
dc.subject.lcsh	Matrices	en_US
dc.subject.lcsh	Data processing	en_US
dc.subject.lcsh	Multiplication	en_US
dc.subject.lcsh	Computer engineering	en_US
dc.title	Load Partitioning for Matrix-Matrix Multiplication on a Cluster of CPU/GPU Nodes Using the Divisible Load Paradigm	en_US
dc.type	Thesis	en_US

