<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>SMARTech Collection: CERCS Technical Reports</title>
    <link>http://smartech.gatech.edu/handle/1853/79</link>
    <description>Reports published by researchers associated with the Center</description>
    <textInput>
      <title>The Collection's search engine</title>
      <description>Search the Channel</description>
      <name>search</name>
      <link>http://smartech.gatech.edu/simple-search</link>
    </textInput>
    <item>
      <title>A Petri Net Approach to Analysis and Composition of Web Services</title>
      <link>http://smartech.gatech.edu/handle/1853/27247</link>
      <description>Title: A Petri Net Approach to Analysis and Composition of Web Services
&lt;br/&gt;
&lt;br/&gt;Authors: Xiong, PengCheng; Fan, YuShun; Zhou, MengChu
&lt;br/&gt;
&lt;br/&gt;Abstract: Business Process Execution Language for Web
Services (BPEL) is becoming the industrial standard for modeling
web service-based business processes. Behavioral compatibility for
web service composition is one of the most important topics. The
commonly used reachability exploration method focuses on
verifying deadlock-freeness. When this property is violated, the
states and traces in the reachability graph only give clues to
re-design the composition. The process must then repeat itself until
no deadlock is found. In this paper, multiple web services
interaction is modeled with a Petri net called a Composition net
(C-net for short). The problem of behavioral compatibility among
web services is hence transformed into the deadlock structure
problem of a C-net. If services are incompatible, a policy based on
appending additional information channels is proposed. This policy is
proved to offer a good solution that can be mapped back
into the BPEL models automatically.</description>
      <pubDate>Wed, 29 Oct 2008 22:58:59 GMT</pubDate>
    </item>
    <item>
      <title>Translating GPU Binaries to Tiered SIMD Architectures with Ocelot</title>
      <link>http://smartech.gatech.edu/handle/1853/27246</link>
      <description>Title: Translating GPU Binaries to Tiered SIMD Architectures with Ocelot
&lt;br/&gt;
&lt;br/&gt;Authors: Diamos, Gregory; Kerr, Andrew; Kesavan, Mukil
&lt;br/&gt;
&lt;br/&gt;Abstract: Parallel Thread Execution ISA (PTX) is a virtual
instruction set used by NVIDIA GPUs that explicitly expresses
hierarchical MIMD- and SIMD-style parallelism in an application.
In such a programming model, the programmer and compiler
are left with the nontrivial, but not impossible, task of composing
applications from parallel algorithms and data structures. Once
this has been accomplished, even simple architectures with low
hardware complexity can easily exploit the parallelism in an
application.
With these applications in mind, this paper presents Ocelot,
a binary translation framework designed to allow architectures
other than NVIDIA GPUs to leverage the parallelism in PTX
programs. Specifically, we show how (i) the PTX thread hierarchy
can be mapped to many-core architectures, (ii) translation
techniques can be used to hide memory latency, and (iii) GPU
data structures can be efficiently emulated or mapped to native
equivalents. We describe the low-level implementation of our
translator, ending with a case study detailing the complete
translation process from PTX to SPU assembly used by the IBM
Cell Processor.</description>
      <pubDate>Wed, 29 Oct 2008 22:58:59 GMT</pubDate>
    </item>
    <item>
      <title>Chameleon: Virtualizing Idle Acceleration Cores of A Heterogeneous Multi-Core Processor for Caching and Prefetching</title>
      <link>http://smartech.gatech.edu/handle/1853/27232</link>
      <description>Title: Chameleon: Virtualizing Idle Acceleration Cores of A Heterogeneous Multi-Core Processor for Caching and Prefetching
&lt;br/&gt;
&lt;br/&gt;Authors: Woo, Dong Hyuk; Fryman, Joshua B.; Knies, Allan D.; Lee, Hsien-Hsin Sean
&lt;br/&gt;
&lt;br/&gt;Abstract: Heterogeneous multi-core processors have emerged as an energy- and area-efficient architectural solution to
improving performance for domain-specific applications such as those with a plethora of data-level parallelism.
These processors typically contain a large number of small, compute-centric cores for acceleration while keeping one
or two high-performance ILP cores on the die to guarantee single-thread performance. Although a major portion of
the transistors are occupied by the acceleration cores, these resources will sit idle when running unparallelized legacy
codes or the sequential parts of an application. To address this under-utilization issue, in this paper, we introduce
Chameleon, a flexible heterogeneous multi-core architecture to virtualize these resources for enhancing memory
performance when running sequential programs. The Chameleon architecture can dynamically virtualize the idle
acceleration cores into a last-level cache, a data prefetcher, or a hybrid between these two techniques. In addition,
Chameleon can operate in an adaptive mode which dynamically configures the acceleration cores between the hybrid
mode and the prefetch-only mode by monitoring the effectiveness of the Chameleon caching scheme. In our evaluation
using the SPEC2006 benchmark suite, different levels of performance improvements were achieved in different modes
for different applications. In the case of the adaptive mode, Chameleon improves the performance of SPECint06
and SPECfp06 by 33% and 22% on average. When considering only memory-intensive applications, Chameleon
improves the system performance by 53% and 33%.</description>
      <pubDate>Mon, 29 Oct 2007 22:58:59 GMT</pubDate>
    </item>
    <item>
      <title>A HyperTransport-Enabled Global Memory Model For Improved Memory Efficiency</title>
      <link>http://smartech.gatech.edu/handle/1853/27231</link>
      <description>Title: A HyperTransport-Enabled Global Memory Model For Improved Memory Efficiency
&lt;br/&gt;
&lt;br/&gt;Authors: Young, Jeffrey; Yalamanchili, Sudhakar; Silla, Federico; Duato, José
&lt;br/&gt;
&lt;br/&gt;Abstract: Modern and emerging data centers are presenting
unprecedented demands in terms of cost and energy consumption,
far outpacing architectural advances related to economies
of scale. Consequently, blade designs exhibit significant cost and
power inefficiencies, particularly in the memory system. For example,
we observe that modern blades are often overprovisioned
to accommodate peak memory demand, which rarely occurs
concurrently across blades. With memory often accounting for
20% to 40% of the total system power [1], this approach is
not sustainable. Concurrently, HyperTransport in concert with
new high-bandwidth commodity interconnects can provide low-latency
sharing of memory across blades. This paper provides a
HyperTransport-enabled solution for seamless, efficient sharing
of memory across blades in a data center, leading to significant
power and cost savings.
Specifically, we propose a new global address space model
called the Dynamic Partitioned Global Address Space (DPGAS)
model that extends previous concepts for Non-Uniform
Memory Access (NUMA) and partitioned global address spaces
(PGAS). The DPGAS model relies on HyperTransport’s low-latency
characteristics to enable new techniques for efficient
sharing of memory across data center blades. This paper presents
the DPGAS model, describes HyperTransport-based hardware
support for the model, and assesses this model’s power and cost
impact on memory-intensive applications. Overall, we find that
cost savings can range from 4% to 26% with power reductions
ranging from 2% to 25% across a variety of fixed application
configurations using server consolidation and memory throttling.
The HyperTransport implementation enables these savings with
an additional node latency cost of 1,690 ns per remote
64-byte cache line access across the blade-to-blade interconnect.</description>
      <pubDate>Mon, 29 Oct 2007 22:58:59 GMT</pubDate>
    </item>
  </channel>
</rss>

