Exploiting DMA to enable non-blocking execution in Decoupled Threaded Architecture


Bibliographic Details
Main Authors: Giorgi, R., Popovic, Z., Puzovic, N.
Format: Conference Proceeding
Language: English
Description
Summary: DTA (Decoupled Threaded Architecture) is designed to exploit fine- and medium-grained Thread Level Parallelism (TLP) by using a distributed hardware scheduling unit and relying on existing simple cores (in-order pipelines, no branch predictors, no ROBs). In DTA, local variables and synchronization data are communicated via a fast frame memory. If the compiler cannot remove global data accesses, the threads become excessively fragmented. In this paper, we therefore present an implementation of a prefetching mechanism (for global data) that complements the original DTA pre-load mechanism (for consumer-producer data patterns), with the aim of improving non-blocking execution of the threads. Our implementation is based on an enhanced DMA mechanism to prefetch global data. In an initial implementation, we estimated the benefit of the proposed approach and identified the support it requires. When memory access latency is long, our approach can greatly reduce execution time (e.g., 11 times for the zoom benchmark on 8 processors) compared with the case of no prefetching.
ISSN: 1530-2075
DOI: 10.1109/IPDPS.2009.5161111