## Characterizing the redundancy of universal source coding for finite-length sequences

##### Abstract

In this thesis, we first study the average redundancy resulting from the universal compression of a single finite-length sequence from an unknown source. For the universal compression of a source with d unknown parameters, Rissanen demonstrated that the expected redundancy for regular codes is asymptotically (d/2) log n + o(log n) for almost all sources, where n is the sequence length. Clarke and Barron also derived the asymptotic average minimax redundancy for memoryless sources. The average minimax redundancy concerns the redundancy of the best code at the worst-case parameter vector, and therefore provides little information about the effect of different source parameter values. Our treatment in this thesis is probabilistic. In particular, we derive a lower bound on the probability measure of the event that a sequence of length n from an FSMX source chosen using Jeffreys' prior is compressed with a redundancy larger than a certain fraction of (d/2) log n. Further, our results show that the average minimax redundancy provides a good estimate of the average redundancy of most sources for large enough n and d. On the other hand, when the number of source parameters d is small, the average minimax redundancy overestimates the average redundancy for small to moderate length sequences. Additionally, we precisely characterize the average minimax redundancy of universal coding when the coding scheme is restricted to the family of two-stage codes, where we show that the two-stage assumption incurs a negligible redundancy for small and moderate length n unless the number of source parameters is small.
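As a rough numerical illustration of the leading term (d/2) log n quoted above, the following sketch evaluates it for a few sequence lengths and reports the corresponding overhead per symbol. The function names, the choice of a base-2 logarithm (bits), and the sample values of d and n are assumptions made for illustration only; the o(log n) term is ignored, so these numbers are not the bounds derived in the thesis.

```python
import math


def leading_redundancy_bits(d: int, n: int) -> float:
    """Leading term (d/2) * log2(n) of the average redundancy, in bits.

    Assumption: logarithm taken base 2; the o(log n) term is dropped.
    """
    if d < 0 or n < 1:
        raise ValueError("d must be non-negative and n must be at least 1")
    return 0.5 * d * math.log2(n)


def per_symbol_overhead(d: int, n: int) -> float:
    """Leading-term redundancy divided by the sequence length n."""
    return leading_redundancy_bits(d, n) / n


if __name__ == "__main__":
    d = 4  # hypothetical number of unknown source parameters
    for n in (2**9, 2**13, 2**20):  # hypothetical sequence lengths
        total = leading_redundancy_bits(d, n)
        rate = per_symbol_overhead(d, n)
        print(f"n={n:>8}  leading term={total:6.1f} bits  per symbol={rate:.6f}")
```

Under these assumptions the per-symbol overhead shrinks as n grows, which is one way to see why the redundancy matters most for small to moderate length sequences.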