|dc.description.abstract||The memory capacity of computers and edge devices continues to grow: DRAM capacity for low-end computers is in the tens or hundreds of GBs, and modern high performance computing (HPC) platforms can support terabytes of RAM for big data driven HPC and machine learning (ML) workloads. Although system virtualization improves resource consolidation, it does not tackle the increasing cost of address translation or the growing size of the page tables that the OS kernel maintains. All virtual machines and processors use page tables for address translation. Meanwhile, big data and latency-sensitive applications are typically deployed in virtualized clouds using application deployment models composed of virtual machines (VMs), containers, and/or executors/JVMs. These applications enjoy high throughput and low latency if they are served entirely from memory. However, accurately estimating and allocating memory is difficult. When these applications cannot fit their working sets in the real memory of their VMs/containers/executors, they suffer large performance losses due to excessive page faults and thrashing. Even when unused host memory or unused remote memory is present in other VMs, containers, or executors, these applications are unable to share it. Existing proposals focus on estimating working set size for accurate allocation and on increasing the effective capacity of executors, but lack the desired transparency and efficiency.
This dissertation research takes a holistic approach to tackling the above problems along three dimensions. First, we present the design of FastSwap, a highly efficient shared memory paging facility. FastSwap's dynamic shared memory management scheme can effectively utilize the shared memory across VMs through host coordination, with three original contributions. (1) FastSwap provides efficient support for multi-granularity compression of swap pages in both shared memory and disk swap devices. (2) FastSwap provides an adaptive scheme to flush the least recently swapped-out pages to the disk swap partition when the shared memory swap partition reaches a pre-specified threshold and is close to full. (3) FastSwap provides batch swap-in optimizations. Our extensive experiments using big data analytics applications and benchmarks demonstrate that FastSwap offers up to two orders of magnitude performance improvement over existing memory swapping methods. Second, we develop XMemPod for non-intrusive host/remote memory sharing and for improving the performance of memory-intensive applications. It leverages the memory capacity of host machines and remote machines in the same cluster to provide on-demand, transparent, and non-intrusive sharing of unused memory, effectively removing the performance degradation of big data and ML workloads caused by transient or imbalanced memory pressure on a host or in a cluster. We demonstrate the benefits of the XMemPod design and of memory sharing via three optimizations. First, we provide elasticity, multi-granular compressibility, and failure isolation on shared memory pages. Second, we implement hybrid swap-out for better utilization of host and remote shared memory. Third, we support proactive swap-in from remote to host, from disk to host, and from host to guest, which significantly and opportunistically improves paging-in operations and shortens the performance recovery time of applications under memory pressure.
XMemPod is deployed on a virtualized RDMA cluster without any modifications to user applications or the OSes. Evaluated with multiple workloads on unmodified Spark, Apache Hadoop, Memcached, Redis, and VoltDB, XMemPod improves the throughput of these applications by 11x to 612x over the conventional OS disk swap facility, and by 1.7x to 14x over the existing representative remote memory paging system. Third, we propose an efficient and elastic huge page management facility for memory-intensive big data and machine learning applications. Both computer hardware and operating systems provide support for huge pages. Modern computer hardware supports huge pages to reduce hardware address translation overheads by providing thousands of TLB entries for huge pages. Operating systems and hypervisors provide a certain level of support for huge pages, using best-effort algorithms to address the access and management cost of memory page tables as DRAM capacity increases and the memory footprint of big data workloads grows. However, existing kernel solutions for huge pages are limited to spot fixes and suffer from some inherent fairness problems. We propose a methodical and principled approach to providing efficient, highly elastic, and yet transparent support for huge pages, aiming to improve the utilization and access efficiency of memory pages.||