Memory Swapping in Linux
Abstract 
Memory swapping is a common method to relieve the memory pressure. Hypervisors such as Kernel-based Virtual Machine (KVM) can explicitly notify the host system whether memory swapping is allowed to balance the running efficiency and memory usage. This article describes the memory swapping mechanism in KVM by summarizing memory swapping principles, thereby enhancing the understanding of memory management.
1. Memory Swapping 
When the system memory usage is high, the kernel swaps out some memory pages of processes that cannot run temporarily (for example, blocked) to the external memory. The system re-allocates the released memory to new processes or swaps in some process pages from the external memory. When a program needs to access the swapped-out memory, the kernel reloads the swapped-out memory pages to ensure that no access error occurs. The swap-in and swap-out of memory pages take a long time. Actually, memory swapping is a concept of trading time for space.
Swapped out memory pages are classified into file-backed pages and anonymous pages. During the swap, a file-backed page is directly read and written through the file. An anonymous page, however, contains heaps, stacks, and data segments and does not exist as a file. Therefore, anonymous pages cannot be directly swapped with drive files. In this case, a swap partition needs to be created on the drive for reading and writing anonymous pages.
2. Notification Mechanism in KVM 
To implement memory swapping in VMs, the host system (Linux) needs to provide a notifier to notify KVM of the memory block that is swapped in or out.
In hardware-assisted virtualization, the hypervisor directly maps the physical address of the VM to the physical address of the host using Extended Page Tables (EPTs), without using the virtual address of the host. Take QEMU as an example. QEMU has its host virtual address space, and a VM that QEMU runs has its physical address space. When memory swapping is performed in Linux, the page in the host physical memory is swapped. Therefore, after the QEMU page is swapped out, if the memory page is accessed, Linux knows that the page is swapped out to the external memory through the page table entries (PTEs) and performs the swap-in operation. When the page at the host physical address corresponding to the VM physical address is swapped out, if the page is accessed, KVM directly queries the host physical address through the EPT without involving Linux. Even if the memory block is swapped out, KVM directly accesses the original address space stored in the EPT.
In 2008, Linux introduced the MMU notifier as an extension. The MMU obtains notifications from the Linux system using callback functions. When certain events occur in Linux, these callback functions are invoked. Take the KVM module code in Linux 4.19 as an example. These callback functions are defined in the following structure. release is the last callback to KVM when the related mm_struct disappears.
static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
	.flags			= MMU_INVALIDATE_DOES_NOT_BLOCK,
	.invalidate_range_start	= kvm_mmu_notifier_invalidate_range_start,
	.invalidate_range_end	= kvm_mmu_notifier_invalidate_range_end,
	.clear_flush_young	= kvm_mmu_notifier_clear_flush_young,
	.clear_young		= kvm_mmu_notifier_clear_young,
	.test_young		= kvm_mmu_notifier_test_young,
	.change_pte		= kvm_mmu_notifier_change_pte,
	.release		= kvm_mmu_notifier_release,
};KVM uses the following code to register the notifier with the Linux system:
static int kvm_init_mmu_notifier(struct kvm *kvm)
{
	kvm->mmu_notifier.ops = &kvm_mmu_notifier_ops;
	return mmu_notifier_register(&kvm->mmu_notifier, current->mm);
}The registration function is as follows. The mm parameter is the mm_struct structure associated with the given address space. Linux notifies KVM only when the address space changes. Address spaces irrelevant to KVM are not ruled, thereby improving the efficiency of the notifier.
int mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
{
	return do_mmu_notifier_register(mn, mm, 1);
}3. Implementation of Memory Swapping 
3.1 Time for Memory Swap-out 
Memory can be swapped out in either of the following occasions:
- The kernel wakes up the kswapd kernel thread to perform slow reclamation and watermark control.
There are three watermarks in the kernel according to the remaining memory size.
(1) high: The remaining memory is redundant and the memory pressure is low.
(2) low: The current memory pressure is high. The kernel wakes up the kswapd kernel thread to reclaim memory until the remaining memory reaches the high watermark.
(3) min: minimum watermark, indicating that the memory is insufficient and faces great pressure. The kernel blocks the current process and reclaims the memory. Memory marked as min is used only when necessary.
- drop_cache is used for reclamation. Slow reclamation starts only when the memory is insufficient, whereas drop_cache can initiate reclamation.
3.2 Swap Cache 
A swap cache exists between the memory and a drive, which is different from the cache between the CPU and memory. The swap cache associates the swap process with the file system so that the swap can be completed through abstract interfaces of the file system. Besides, it is a shared resource for swap-ins and swap-outs. During the swap-out, processes access the page frame in the swap cache. The lock mechanism can be used for synchronization, preventing the case where a page is present for one PTE and swapped out for another.
3.3 Memory Swap-out 
The swap-out of a file-backed page is similar to that of an anonymous page. The swap-out of an anonymous page is used as an example.
(1) Check whether the anonymous page is in the swap cache. If not, add the page to the swap cache.
(2) The kernel finds all PTEs corresponding to the anonymous page through reverse mapping, removes the mapping, and re-establishes a mapping with the new page in the swap partition.
(3) After the mapping is complete, if the dirty page has been flushed back, the kernel deletes the anonymous page from the swap cache. The number of references to the page is reduced to 0 and the page can be reclaimed by the kernel. In this way, the page swap from the memory to the drive is complete.
/*
 * shrink_page_list() returns the number of reclaimed pages
 */
static unsigned long shrink_page_list(struct list_head *page_list,
				      struct pglist_data *pgdat,
				      struct scan_control *sc,
				      enum ttu_flags ttu_flags,
				      struct reclaim_stat *stat,
				      bool force_reclaim)
{
	...
	while (!list_empty(page_list)) {
		...
		/*
		 * If it is an anonymous page but is not in the swap cache, add the page to the swap cache.
		 */
		if (PageAnon(page) && PageSwapBacked(page)) {
			if (!PageSwapCache(page)) {
				...
				// add_to_swap() allocates swap space to the anonymous page and adds the page to the swap cache.
				if (!add_to_swap(page)) {
					...
				}
				may_enter_fs = 1;
				/* Adding to swap updated mapping */
				mapping = page_mapping(page);
			}
		} else if (unlikely(PageTransHuge(page))) {
			...
		}
		/*
		 * Try to remove all mappings. try_to_unmap searches for all PTEs of the page through reverse mapping and removes the mappings.
		 */
		if (page_mapped(page)) {
			enum ttu_flags flags = ttu_flags | TTU_BATCH_FLUSH;
			if (unlikely(PageTransHuge(page)))
				flags |= TTU_SPLIT_HUGE_PMD;
			if (!try_to_unmap(page, flags)) {
				nr_unmap_fail++;
				goto activate_locked;
			}
		}
		if (PageDirty(page)) {
			...
			// The pageout function writes the page back to the swap partition.
			switch (pageout(page, mapping, sc)) {
			...
			case PAGE_SUCCESS:
				...
				mapping = page_mapping(page);
			case PAGE_CLEAN:
				; /* try to free the page below */
			}
		}
		...
		if (PageAnon(page) && !PageSwapBacked(page)) {
			...
		} else if (!mapping || !__remove_mapping(mapping, page, true))  // Invokes __remove_mapping to set page->_count to 0 and release the page by adding it to free_page.
			goto keep_locked;
		...
	return nr_reclaimed;
}3.4 Memory Swap-in 
When the swapped-out page is accessed, page fault handling is triggered. The kernel swaps in the swapped-out memory page based on the page table content written during the swap-out. In Linux, the do_swap_page function is invoked to complete the page swap-in operation.
The swap-in process is as follows:
(1) Check whether the queried page exists in the swap cache. If yes, remap and update the page table according to the memory page referenced by the swap cache. If no, a new memory page is allocated and added to the reference of the swap cache. After the memory page content is updated, the page table is updated.
(2) After the swap-in operation is complete, the page reference of the corresponding swap area decreases by 1. When the value decreases to 0, no process references the page and the page can be reclaimed.
linux-4.19/mm/memory.c
/*
 * We enter with non-exclusive mmap_sem (to exclude vma changes,
 * but allow concurrent faults), and pte mapped but not yet locked.
 * We return with pte unmapped and unlocked.
 *
 * We return with the mmap_sem locked or unlocked in the same cases
 * as does filemap_fault().
 */
vm_fault_t do_swap_page(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	struct page *page = NULL, *swapcache;
	struct mem_cgroup *memcg;
	swp_entry_t entry;
	pte_t pte;
	int locked;
	int exclusive = 0;
	vm_fault_t ret = 0;
	if (!pte_unmap_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte))
		goto out;
	entry = pte_to_swp_entry(vmf->orig_pte);  // Returns the entry of the swap space based on the PTE.
	...
	page = lookup_swap_cache(entry, vma, vmf->address);  // Searches for the page corresponding to the entry in the swap cache.
	swapcache = page;
	if (!page) {
		struct swap_info_struct *si = swp_swap_info(entry);
		if (si->flags & SWP_SYNCHRONOUS_IO &&
			...
			}
		} else {
			/*
			 * If no page is found in the cache, the system searches for the page in the swap area, allocates a new memory page, and reads the page from the swap area.
			 */
			page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
						vmf);
			swapcache = page;
		}
		...
		ret = VM_FAULT_MAJOR;
		count_vm_event(PGMAJFAULT);
		count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
	} else if (PageHWPoison(page)) {
		...
	}
	locked = lock_page_or_retry(page, vma->vm_mm, vmf->flags);  // Locks the page.
	...
	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
			&vmf->ptl);  // Obtains a PTE and re-creates the mapping.
	...
	pte = mk_pte(page, vma->vm_page_prot);
	...
	/* ksm created a completely new copy */
	if (unlikely(page != swapcache && swapcache)) {
		page_add_new_anon_rmap(page, vma, vmf->address, false);
		mem_cgroup_commit_charge(page, memcg, false, false);
		lru_cache_add_active_or_unevictable(page, vma);
	} else {
		do_page_add_anon_rmap(page, vma, vmf->address, exclusive);
		mem_cgroup_commit_charge(page, memcg, true, false);
		activate_page(page);
	}
	...
	swap_free(entry);	// Reduces the reference count on the entry of the swap area.
	...
	return ret;
}