Scaling address translation in multi-core architectures using low-latency interconnects
Bharadwaj, Vedula Venkata
Modern systems employ structures known as Translation Lookaside Buffers (TLBs) to accelerate address translation. As workloads use ever-increasing memory footprints, TLBs are becoming critical to overall system performance. Modern designs use private multi-level TLB hierarchies to balance latency and effective capacity. Unfortunately, private TLB hierarchies have drawbacks, a major one being the replication of translations across multiple cores, which yields lower hit rates than shared alternatives. However, designing scalable shared TLBs remains a challenge, since the benefit of higher capacity is often outweighed by the latency overhead of accessing a large monolithic structure. To counter the access latencies of large TLBs, physically distributed TLBs akin to NUCA caches can be explored. While a physically distributed last-level TLB reduces bank access latency, the on-chip network latency of reaching a remote bank and returning continues to hamper performance and energy efficiency. Such problems hinder the practical adoption of large shared TLBs on modern many-core systems, where higher core counts exacerbate latency and energy problems. By utilizing a lightweight single-cycle interconnect based on a recently demonstrated technique called SMART, this thesis demonstrates NUTRA, a Non-Uniform TRanslation Access architecture that tackles the scaling challenges of shared distributed last-level TLBs. NUTRA achieves latencies close to those of private L2 TLBs, with the hit rates of the shared last-level (SLL) TLBs proposed in previous work. The combination of tight latencies and high hit rates means that NUTRA outperforms not only monolithic SLL implementations, but also distributed implementations. Further, this thesis shows that a distributed organization coupled with low-latency interconnects delivers a scalable solution for last-level TLBs in multi-core architectures.
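The latency argument above can be illustrated with a toy model. The Python sketch below is not the thesis's simulator or methodology. It assumes a k x k mesh with one TLB bank per tile, a hypothetical home-bank mapping based on the low-order bits of the virtual page number, and an idealized interconnect in which any remote bank is reached in one cycle each way. All function names and cycle counts are illustrative assumptions.

```python
from itertools import product


def bank_for_page(vpn: int, num_banks: int) -> int:
    """Home bank of a translation (assumed: interleave on low-order VPN bits)."""
    return vpn % num_banks


def hops(src: int, dst: int, k: int) -> int:
    """Manhattan distance between two tiles of a k x k mesh."""
    sx, sy = src % k, src // k
    dx, dy = dst % k, dst // k
    return abs(sx - dx) + abs(sy - dy)


def avg_lookup_cycles_hop_by_hop(k: int, bank_cycles: int, cycles_per_hop: int) -> float:
    """Mean lookup latency over all (core, bank) pairs when every hop
    of the request and the response costs cycles_per_hop cycles."""
    n = k * k
    total = sum(
        2 * hops(core, bank, k) * cycles_per_hop + bank_cycles
        for core, bank in product(range(n), repeat=2)
    )
    return total / (n * n)


def avg_lookup_cycles_single_cycle(k: int, bank_cycles: int) -> float:
    """Mean lookup latency when a remote bank is reached in one cycle each
    way regardless of distance (an idealization, not a model of SMART)."""
    n = k * k
    total = sum(
        (2 if core != bank else 0) + bank_cycles
        for core, bank in product(range(n), repeat=2)
    )
    return total / (n * n)
```

Under these assumptions, the hop-by-hop average grows with the mesh dimension k, while the idealized single-cycle average stays below bank_cycles + 2. This is the intuition behind pairing a distributed last-level TLB with a low-latency interconnect.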