
GPU Render Engine: A Detailed Introduction

2023-05-30 10:22:19 Source: ByteDance SYS Tech

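Preface

A GPU (Graphics Processing Unit) is a microprocessor dedicated to graphics-related computation on personal computers, workstations, game consoles, and some mobile devices (tablets, smartphones, and so on). With the rise of AI, GPUs, which are well suited to parallel computation, are also widely used for training and inference, and many servers now carry GPUs for compute workloads. A modern GPU contains several hardware modules, including render, compute, video encode/decode, display, and DMA (Direct Memory Access), and each hardware module corresponds to one or more engines. This article focuses on the render engine. It covers the hardware units involved in GPU rendering and the path by which user-space data such as vertices and commands is submitted to the GPU hardware for execution, to help readers understand the render engine's workflow. (Note: this article uses Intel GPUs as its reference.)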

Glossary

3D Pipeline: a set of fixed-function units arranged as a pipeline. It processes 3D-related commands using fixed-function units and EU threads.

FF (fixed function): hardware with a fixed function.

FF unit: a fixed-function unit in the GPU 3D pipeline.

FFID: Unique identifier for a fixed function unit.

CS (Command Streamer): a fixed-function unit that parses the commands the driver writes into the ring buffer and sends them to the next stage of the 3D pipeline.

VF (Vertex Fetcher): the first FF unit in the 3D pipeline. It reads vertex data from memory, processes it, and passes it to the next 3D pipeline stage, VS.

EU (Execution Unit): a multi-threaded execution unit. Each EU is a processor.

TD (Thread Dispatcher): a functional unit that arbitrates thread launch requests from the fixed-function units and instantiates the threads on the EUs.

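Introduction to the Render Engine

The Intel render engine has two operating modes: 3D rendering and media (encode/decode related). Compute (GPGPU) mode shares the media mode. The driver selects 3D or media mode with the PIPELINE_SELECT command:

// In the Mesa code, on the compute path
emit_pipeline_select(batch, GPGPU);
// On the render path
emit_pipeline_select(batch, _3D);

In either mode, the user-space and kernel-space drivers write commands into a ring buffer and submit it to the hardware. The hardware uses the PIPELINE_SELECT command to choose the pipeline, and the user's commands and buffers are then sent to the hardware. The figure below shows how the hardware executes commands: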

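Figure: 3D pipeline flow

In the figure above, the yellow stages are implemented by EUs (programmable compute units); the roughly comparable hardware on NVIDIA GPUs is the CUDA core. The blue stages are implemented by fixed hardware, so they are called FF (fixed-function) units.

Figure: GPU render engine structure

A picture is worth a thousand words. As the figure above shows, rendering and media work is carried out jointly by the CS and the FF units together with the EUs, samplers, and slice-common hardware inside a slice. The figure shows a single slice; a real GPU contains a varying number of slices.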

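Hardware Overview and Analysis

Command Streamer

For the GPU to run pipeline operations, the CPU must issue commands. The CPU writes commands into a buffer, generally called a batch buffer. The driver then hands this buffer to the hardware as a ring buffer or a batch buffer.

GPU render commands fall roughly into the following types:

memory interface: memory interface commands, which operate on memory.

3D state: commands that set the 3D pipeline state machine, for example submitting vertex surface state to the GPU.

pipe control: commands that set up synchronization or parallel execution.

3D primitive: commands related to primitive assembly.

Once commands have been issued, the hardware has to parse them. As the 3D engine structure figure shows, the Command Streamer hardware is the first to read the commands from the ring buffer. The Command Streamer is the main interface to each engine, and every engine has its own Command Streamer.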
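The producer/consumer relationship between the driver and the Command Streamer described above can be sketched in C. This is a minimal illustration under stated assumptions, not Intel's actual register interface: the `struct ring` layout, the `RING_DWORDS` size, and the `ring_emit` / `ring_consume` names are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RING_DWORDS 8u  /* hypothetical size; must be a power of two */

struct ring {
    uint32_t cmds[RING_DWORDS];
    uint32_t head;  /* next dword the (simulated) command streamer reads */
    uint32_t tail;  /* next dword the driver writes */
};

/* Driver side: append one command dword; fails when the ring is full. */
static bool ring_emit(struct ring *r, uint32_t dword)
{
    assert(r != NULL);
    if (r->tail - r->head == RING_DWORDS)
        return false;
    r->cmds[r->tail % RING_DWORDS] = dword;
    r->tail++;
    return true;
}

/* Hardware side: fetch the next dword; fails once head catches up with tail. */
static bool ring_consume(struct ring *r, uint32_t *out)
{
    assert(r != NULL && out != NULL);
    if (r->head == r->tail)
        return false;
    *out = r->cmds[r->head % RING_DWORDS];
    r->head++;
    return true;
}
```

The driver only ever advances `tail` and the consumer only ever advances `head`, which is why the two sides can run independently as long as the ring is neither full nor empty.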

Fixed-Function Units

Figure: Generic 3D FF Unit Block Diagram

A main role of an FF unit is to manage the EU threads that perform most of the processing on vertex/pixel data. In general, the key functions include:

Bypass Mode

URB Entry Management

Thread Initiation Management

Thread Request Data Generation
• Thread Control Information Generation
• Thread Payload Header Generation
• Thread Payload Data Generation

Thread Output Handling

URB Entry Readback

Statistics Gathering

An important job of the FF units is thread dispatching: launching different EU threads according to the needs of stages such as VS and GS. Most of the commands commonly used by FF units relate to thread launch and initialization.

The first of these commands are needed for thread initialization; the later ones cover the constants placed in the URB, which the thread reads while it runs.

EU (Execution Unit)

The official description of an EU is:

An EU is a multi-threaded processor within the multi-processor system. Each EU is a fully capable processor containing instruction fetch and decode, register files, source operand swizzle and SIMD ALU, etc. An EU is also referred to as a core

As the official description shows, an EU is a multi-threaded execution unit. In recent Gen11/Gen12 hardware, each EU is a simultaneous multi-threading (SMT) processor with seven hardware threads. The main compute units are SIMD-capable ALUs/FPUs. Each hardware thread has 128 general-purpose registers (GRF), each 32 bytes (256 bits) wide. OpenGL shaders and OpenCL kernels are both executed by the EUs.

Common Basic Concepts

Thread:An instance of a kernel program executed on an EU. The life cycle for a thread starts from the executing the first instruction after being dispatched from Thread Dispatcher to an EU to the execution of the last instruction – a send instruction with EOT that signals the thread termination. Threads in GEN system may be independent from each other or communicate with each other through Message Gateway share function

Thread Dispatcher : Functional unit that arbitrates thread initiation requests from Fixed Functions units and instantiates the threads on EUs.

Thread Payload：Prior to a thread starting execution, some amount of data will be pre-loaded in to the thread’s GRF (starting at r0). This data is typically a combination of control information provided by the spawning entity (FF Unit) and data read from the URB

Thread Identifier: The field within a thread state register (SR0) that identifies which thread slots on an EU a thread occupies. A thread can be uniquely identified by the EUID and TID.

Thread Spawner: The second and the last fixed function stage of the media pipeline that initiates new threads on behalf of generic/media processing.

Unified Return Buffer（URB）：The on-chip memory managed/shared by GEN Fixed Functions in order for a thread to return data that will be consumed either by a Fixed Function or other threads.

URB Entry: A logical entity stored in the URB (such as a vertex), referenced via a URB Handle.

URB Entry Allocation Size: Number of URB entries allocated to a Fixed Function unit.

URB Fence：Virtual, movable boundaries between the URB regions owned by each FF unit.

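EU Structure

GRF (General Register File): the general-purpose register file. ARF (Architecture Register File): the architecture register file.

Commonly used ARF registers:

This is similar to the Arm architecture, which has special registers such as sp, pc, and PSTATE in addition to the general-purpose registers x0 to x30.

The FPU is what is usually called the ALU (arithmetic logic unit). It is implemented with SIMD instructions and includes the following instructions:

EU Register Data Layout

EU registers can hold data in two layouts, AOS (array of structures) and SOA (structure of arrays):

According to the GPU documentation:

Vertex shaders and geometry shaders store data in AOS form and operate in SIMD4x2 and SIMD4 modes.

Pixel shaders store data in SOA form and operate in SIMD8/SIMD16 modes.

Media workloads mainly use the SOA layout and occasionally AOS.

In SIMDmxn, m is the size of the vector being operated on and n is the number of such vectors operated on at the same time. SIMD4 means the x, y, z, and w components in the figure above are operated on together.

An EU instruction example:

add dst.xyz src0.yxzw src1.zwxy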
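To make the swizzle and write-mask semantics of the instruction above concrete, here is a small C model of a SIMD4 (AOS) add. It is only a sketch: `struct vec4`, `enum chan`, and `add_swizzled` are hypothetical names, and the model assumes each destination channel i receives `src0[swz0[i]] + src1[swz1[i]]` when that channel is enabled in the write mask.

```c
#include <assert.h>
#include <stddef.h>

enum chan { X = 0, Y = 1, Z = 2, W = 3 };

struct vec4 { float c[4]; };  /* one AOS element: x, y, z, w */

/* dst.<write_mask> = src0.<swz0> + src1.<swz1>, computed per channel. */
static void add_swizzled(struct vec4 *dst, unsigned write_mask,
                         const struct vec4 *src0, const enum chan swz0[4],
                         const struct vec4 *src1, const enum chan swz1[4])
{
    assert(dst != NULL && src0 != NULL && src1 != NULL);

    /* Compute all four lanes first so dst may alias a source. */
    float result[4];
    for (int i = 0; i < 4; i++)
        result[i] = src0->c[swz0[i]] + src1->c[swz1[i]];

    /* Only channels enabled in the write mask are updated. */
    for (int i = 0; i < 4; i++)
        if (write_mask & (1u << i))
            dst->c[i] = result[i];
}
```

With swizzles `{Y, X, Z, W}` and `{Z, W, X, Y}` and a write mask of `0x7` (x, y, z), this computes dst.x = src0.y + src1.z, dst.y = src0.x + src1.w, dst.z = src0.z + src1.x, and leaves dst.w untouched.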

Using the EU from User Space

An OpenGL program can write shaders (GLSL), which the GPU compiler translates into GPU instructions for the GPU to execute. An OpenCL program can write kernels (in the kernel language defined by OpenCL, an extension of C99), which are compiled and executed on the GPU. Both shaders and kernels are compiled at run time, into a binary specific to the GPU in use. In Mesa the overall process is as follows:

In Mesa, the OpenCL and OpenGL GPU compilers both live under mesa/src/. After compilation, upload_blorp_shader or iris_blorp_upload_shader is called, depending on whether Vulkan or OpenGL is in use, to copy the binary into a resource BO (batch buffer). When the context is scheduled, the batch buffer data is read and the GPU loads the program onto the EUs for execution.

Messages in the EU

Communication between EUs and shared functions, between EUs and the fixed-function pipelines (which are not considered part of the "subsystem"), and between one EU and another is carried out through message packets.

Message transmission is requested with the send instruction. A message consists of two parts:

Message Payload: the contents written into the MRF (Message Register File) registers, which carry the actual data being sent.

Associated ("sideband") information: the other information carried with the instruction, specifically:

• Message Descriptor. Specified with the send instruction. Included in the message descriptor is control and routing information such as the target function ID, message payload length, response length, etc.
• Additional information provided by the send instruction, e.g., the starting destination register number, the execution mask (EMASK), etc.
• A small subset of Thread State, such as the Thread ID, EUID, etc.

Note: the MRF (Message Register File) is a set of write-only registers used to assemble a message before it is sent and used as the operand of the send instruction. Each thread has its own MRF made up of 16 registers, each 256 bits wide. Every register has a dedicated status bit that marks whether it is part of a message currently being sent, that is, whether it is in use. For most messages the first register is the message header and the remaining registers hold data.

A message's life cycle has four phases: (1) creation, (2) delivery, (3) processing, (4) writeback.

What Is a Shared Function?

A shared function is a hardware unit that provides specialized supplementary functionality to the EUs. It extends the basic capabilities of each EU with independent hardware acceleration. Shared functions run as separate entities outside the EUs and are shared among them.

A shared function is invoked through a communication mechanism called a message. A message is defined by a series of MRF (Message Register File) registers, which hold the message operands, the target shared function ID, the encoding of the function-specific operation, and the destination GRF (General Register File) register to which any writeback response is directed. Messages are sent to shared functions under software control through the send instruction. The most common shared functions are:

• Extended Math function
• Sampling Engine function
• DataPort function
• Message Gateway function
• Unified Return Buffer (URB)
• Thread Spawner (TS)
• Null function

The figure above shows send delivering messages to shared functions. send is the general-purpose instruction used in the delivery phase of a message. Message types include:

• Mem stores/reads are messages
• Mem scatter/gathers are messages
• Texture Sampling is a message with u,v,w coordinates per SIMD lane
• Messages used for synchronization, atomic operations, fences etc.

Unified Return Buffer （URB）

The on-chip memory managed/shared by GEN Fixed Functions in order for a thread to return data that will be consumed either by a Fixed Function or other threads.

The Unified Return Buffer (URB) is a general-purpose buffer used to pass data between threads and, in some cases, between threads and fixed-function units (in either direction). A thread accesses the URB by sending messages. The URB is normally read and written at row granularity.

URB space layout:

URB entry

A URB entry is a logical entity in the URB. It is referenced by a URB handle and consists of a number of contiguous rows.

Constant URB Entries (CURBEs)

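While computing, a thread on an EU may need constants from the program. The EU can read constant values from memory through the DataPort, but this hurts performance. To speed up constant reads, the GPU can copy the data from memory into the URB, so the thread reads its constants from the URB instead. The mechanism that writes constants into the thread payload is called CURBE. A CURBE is a special URB entry controlled by the command streamer. When the CS parses a CONSTANT_BUFFER command, it reads the constant data from memory and writes it into the CURBE.

URB Allocation

The engine command streamer manages the URB in the GPU by parsing commands.

Every stage of the 3D pipeline needs the URB, and stages such as VF and VS each need their own URB allocation. FF units write vertices, vertex patches, and constant data from memory into URB entries; the URB_FENCE command is used to set up the allocation. Each stage has a fence pointer that points to the end of the URB region it uses. The space between the previous stage's fence pointer and a stage's own fence pointer is that stage's URB region.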
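The fence rule above (a stage owns the space between the previous stage's fence and its own) can be written down in a few lines of C. This is a sketch under assumptions: fences are stored in pipeline order in an array, the first stage starts at offset 0, and offsets are counted in URB rows. `struct urb_region` and `urb_region_for_stage` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct urb_region {
    uint32_t start;  /* first row owned by the stage */
    uint32_t size;   /* number of rows owned by the stage */
};

/* fences[i] is the exclusive end of stage i's URB region, in rows. */
static struct urb_region urb_region_for_stage(const uint32_t *fences,
                                              size_t stage)
{
    assert(fences != NULL);

    /* A stage's region begins where the previous stage's fence ends. */
    uint32_t start = (stage == 0) ? 0u : fences[stage - 1];
    assert(fences[stage] >= start);  /* fences must not move backwards */

    struct urb_region region = { start, fences[stage] - start };
    return region;
}
```

A stage whose fence equals the previous stage's fence simply owns zero rows, which models a disabled stage.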

Figure: URB Allocation - 3D Pipeline

URB Reads and Writes

URB writes:

The CS writes constant data into Constant URB Entries in the URB.

The media pipeline's Video Front End (VFE) fixed-function unit writes thread preload data into URB entries.

The 3D pipeline's Vertex Fetch (VF) fixed-function unit writes vertex data into URB entries.

Threads write data into URB entries.

URB reads:

The Thread Dispatcher is the main source of URB reads. As part of spawning a thread, the pipeline fixed functions give the Thread Dispatcher a number of URB handles, read offsets, and lengths. The Thread Dispatcher fetches the specified data from the URB and preloads it into GRF registers.

The Geometry Shader (GS) and Clipper (CLIP) fixed-function units of the 3D pipeline can read selected parts of a URB entry to extract the vertex data the pipeline needs.

The Windower (WM) FF unit reads back depth coefficients from URB entries written by the Strip/Fan (SF) unit.

URB reads and writes are implemented by the URB shared function, through the URB_WRITE and URB_READ message types.

URB Entry

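URB entry allocation starts at the address given by URBStartAddress.

The size of the allocation is determined by NumberOfURBEntries and URBEntryAllocationSize.

An entry that stores a vertex is called a Vertex URB Entry. In general, vertex data is stored in Vertex URB Entries (VUEs) in the URB, processed by CLIP threads, and referenced only indirectly through VUE handles. As a result, most of the content and format of the vertex data is not exposed to the 3D pipeline hardware: the FF units usually know only the handle and size of the data.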

Launching EU Threads

1. Thread Dispatching

When the 3D and media pipelines send thread launch requests to the subsystem, the Thread Dispatcher receives them. The dispatcher arbitrates between concurrent requests, assigns requested threads to hardware threads on an EU, divides each EU's register space among its threads, and initializes a thread's registers with data from the FF unit.

An FF unit sends its request to the Thread Dispatcher after gathering the information needed to launch the thread and writing it to the URB. When the thread starts, this data is loaded from the URB into the GRF (General Register File) registers.

2. Thread Spawner

The media pipeline has two fixed-function units: the Video Front End (VFE) unit and the Thread Spawner (TS) unit.

The TS launches a generic thread for media work. The VFE parses the command stream and writes the thread's preload data into the URB, and the TS then launches a new thread. The TS is the only unit in the media pipeline that talks to the Thread Dispatcher.

3. Information Needed to Launch a Thread

Once an FF unit determines that a thread can be requested, it must gather all the information needed to submit the thread launch to the Thread Dispatcher. This information falls into several categories:

• Thread Control Information: the FF unit sends the Thread Dispatcher the control information for running the thread, including Kernel Start Pointer, Binding Table Entry Count, and so on. This information is not placed directly in the thread payload.

• Thread Payload Header: the GRF is the EU's general-purpose register file. Before a thread runs, some data must be preloaded into the GRF (starting at register r0). This data comes from the FF unit.

It has two parts, fixed header data and R+ header data:
Fixed header data: information the FF unit obtains from the FF pipe and that all threads use, including the surface state and sampler state pointers that the sampler uses later.
R+ header data: covers R0, R1, and so on, and holds the parameters the FF unit passes to the thread. Registers such as R1 contain URB handles; a thread that needs to read or write vertex or patch data can load from or write to the URB through the given handle.

• Thread Payload Input URB Data: the data and constants the thread loads from the URB.

For each URB entry, the FF unit supplies a handle, a read offset, and a read length. The thread dispatch subsystem reads the appropriate 256-bit locations of the URB and writes the results into sequential GRF registers.

Layout of the thread preload (payload) data

Slice /Sub Slice

As the figure shows, a slice contains multiple subslices, an L3 cache, a block of shared memory, a hardware memory barrier, and fixed-function units.

Pipeline Stage

A abstracted element of the 3D pipeline, providing functions performed by a combination of the corresponding hardware FF unit and the threads spawned by that FF unit.

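A 3D pipeline has multiple stages:

Each stage performs its function through a fixed-function unit. The blue boxes are fixed-function units, whose behavior is fixed in hardware and cannot be changed.

The yellow boxes are non-fixed functions. Their behavior is programmed in software, compiled by the compiler, and executed on the programmable units (EUs).

Fixed function is now more of a concept; it does not necessarily require separate hardware to implement.

A subslice contains multiple EUs, one dispatcher, one sampler, and one Data Port.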

sampler

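The 3D sampling engine provides advanced sampling and filtering of surfaces. It is responsible for returning filtered texture values to the EUs in response to the messages it receives. The sampling engine uses SAMPLER_STATE to control the filter mode, address control mode, and other sampling features. Every message carries a pointer to a sampler state. In addition, the sampling engine uses SURFACE_STATE to define the attributes of the surface being sampled, including its location, size, format, and other properties.

Sub-functions of the sampling engine:

1. Main sampler flow: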

surface：A rendering operand or destination, including textures, buffers, and render targets

Surface state：State associated with a render surface including

Sampling requires surface state objects (RENDER_SURFACE_STATE), surface data, and sampler state. The sampler uses the surface state objects to obtain the address of the surface in system memory and the surface format. When the sampler fetches a surface in a supported compressed format, it decompresses it automatically. The supported filter modes include point, bilinear, trilinear, anisotropic, cube, and others; the contents of the SAMPLER_STATE state object determine which mode and parameters are used.

2. Main Commands

The GPU maintains binding tables, one for each type of shader. When a send instruction issues a message to the sampler, the message data includes the binding table base address and the entry index within the table. The surface state to use is found through the binding table.

According to the manual, each kind of shader has its own BTP (binding table pointer), that is, its own binding table. The BTP is specified with the 3DSTATE_BINDING_TABLE_POINTERS_XXX command.

The BTPB is set with the 3DSTATE_BINDING_TABLE_POOL_ALLOC command.

The Surface State Base Address is specified with the STATE_BASE_ADDRESS command. STATE_BASE_ADDRESS is a general command for setting base addresses and can set several of them:

General State Base Address

Surface State Base Address

Dynamic State Base Address

Indirect Object Base Address

Instruction Base Address

Bindless Surface State Base Address

Bindless Sampler State Base Address

The sampler state is located differently: the Sampler State Pointer carried in the message gives an offset, which is combined with the Dynamic State Base Address to find the sampler state in system memory. The Sampler State Pointer is set with 3DSTATE_SAMPLER_STATE_POINTERS_xx.
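The two lookups described above reduce to base-plus-offset arithmetic, sketched here in C. The sketch assumes that each binding table entry is a 32-bit offset relative to the Surface State Base Address (the article only says the surface state is found through the table) and that the binding table pool is not in use. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Resolve the surface state for one binding table entry.
 * Assumption: entries are 32-bit offsets from Surface State Base Address. */
static uint64_t surface_state_addr(uint64_t surface_state_base,
                                   const uint32_t *binding_table,
                                   size_t entry_count, size_t index)
{
    assert(binding_table != NULL);
    assert(index < entry_count);
    return surface_state_base + binding_table[index];
}

/* Resolve the sampler state: Dynamic State Base Address plus the
 * Sampler State Pointer carried in the message. */
static uint64_t sampler_state_addr(uint64_t dynamic_state_base,
                                   uint32_t sampler_state_pointer)
{
    return dynamic_state_base + sampler_state_pointer;
}
```

Keeping state as small offsets from a few base addresses is what lets the commands stay compact while the objects themselves live anywhere inside the corresponding region.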

3. Implementation in Mesa

In Mesa, the surface base address is set by calling update_surface_base_address.

In blorp_emit_surface_states, calling blorp_alloc_binding_table to allocate a new binding table triggers update_surface_base_address to update the base address.

blorp_emit_surface_states in Mesa emits 3DSTATE_BINDING_TABLE_POINTERS_xxx to set the BTP for each kind of shader. The BTP is tracked in struct iris_binding_table.

DataPort

The sampler can only read. The DataPort can both read and write, and it provides the access path to all memory.

context

Execlists

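Execlists (execution lists) are the software interface provided by the hardware. Software writes the commands to execute into the ring buffer of a context and then submits the context to the engine's Execlist Submit Port (ELSP). DG1 has two ELSPs. Using execlists requires setting the corresponding enable bit in the engine's GFX_MODE register.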

Context state

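Each context programs the engine state according to the needs of its workload. The state that the hardware must modify during execution is called the context state, and every context has its own. The context state is modified while the commands in the current context's ring buffer are executed. All contexts assigned to run on an engine share the same context format. A context points to the address where its context state is stored through the Logical Context Address.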

Logical Context Address

The logical context address is a global virtual address (a GPU virtual address) where the hardware information (context state) is saved. On a GPU context switch, the context state is saved and restored through the logical context address. When the scheduler submits a list of workloads through an execlist, the command streamer hardware switches to one context at a time. The engine loads the logical context through the logical context address (LRCA).

Contents of the Context State

The context state consists of the following sections:
• Per-Process HW Status Page (4K)
• Ring Context (Ring Buffer Control Registers, Page Directory Pointers, etc.)
• Engine Context (PipelineState, Non-pipelineState, Statistics, MMIO)

Context Life Cycle

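Analysis of the Mesa 3D Code

Figure: stages of the rendering pipeline

A yellow stage has no dedicated hardware that performs its function. Instead, a fixed-function unit launches an EU thread to do the work, so all yellow stages are programmable. Blue stages are performed by fixed hardware and are not programmable.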

Vertex Fetch (VF) stage

This is the vertex fetch stage. After a 3DPRIMITIVE command is issued, VF reads vertex data from memory, reformats it, and writes the result into vertex URB entries. The VF unit generates InstanceID, VertexID, and PrimitiveID, and a combination of these values is written into the VUE.

VF Data Processing

The 3D pipeline state commands must be issued before 3D_Primitive. When the CS parses 3D_Primitive, it makes the VF hardware start fetching and processing data, using the state set up by the 3D pipeline commands that precede 3D_Primitive in the user's batch buffer. The STATE_BASE_ADDRESS command sets the base addresses; the binding table and surface state addresses used by later commands are all relative to them. In Mesa, src/gallium/drivers/iris/iris_bufmgr.h describes the related memory layout:

/**
 * Memory zones.  When allocating a buffer, you can request that it is
 * placed into a specific region of the virtual address space (PPGTT).
 *
 * Most buffers can go anywhere (IRIS_MEMZONE_OTHER).  Some buffers are
 * accessed via an offset from a base address.  STATE_BASE_ADDRESS has
 * a maximum 4GB size for each region, so we need to restrict those
 * buffers to be within 4GB of the base.  Each memory zone corresponds
 * to a particular base address.
 *
 * We lay out the virtual address space as follows:
 *
 * - [0,   4K): Nothing            (empty page for null address)
 * - [4K,  4G): Shaders            (Instruction Base Address)
 * - [4G,  8G): Surfaces & Binders (Surface State Base Address, Bindless ...)
 * - [8G, 12G): Dynamic            (Dynamic State Base Address)
 * - [12G, *):  Other              (everything else in the full 48-bit VMA)
 */

Correspondingly, Mesa divides the virtual addresses used by the PPGTT into several zones:

enum iris_memory_zone {
   IRIS_MEMZONE_SHADER,
   IRIS_MEMZONE_BINDER,
   IRIS_MEMZONE_SCRATCH,
   IRIS_MEMZONE_SURFACE,
   IRIS_MEMZONE_DYNAMIC,
   IRIS_MEMZONE_OTHER,
   IRIS_MEMZONE_BORDER_COLOR_POOL,
};
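The address-space layout in the comment above can be expressed as a simple classifier. The sketch below follows only the five ranges listed in that comment; `enum va_region` and `va_region_for` are hypothetical names and are not Mesa's `iris_memory_zone` API, which has finer-grained zones.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical region names mirroring the ranges in the Mesa comment. */
enum va_region {
    VA_NULL_PAGE,  /* [0,   4K)   empty page for the null address */
    VA_SHADERS,    /* [4K,  4G)   Instruction Base Address        */
    VA_SURFACES,   /* [4G,  8G)   Surface State Base Address      */
    VA_DYNAMIC,    /* [8G, 12G)   Dynamic State Base Address      */
    VA_OTHER       /* [12G, 2^48) everything else                 */
};

#define KIB(n) ((uint64_t)(n) << 10)
#define GIB(n) ((uint64_t)(n) << 30)

static enum va_region va_region_for(uint64_t addr)
{
    assert(addr < ((uint64_t)1 << 48));  /* 48-bit virtual address space */
    if (addr < KIB(4))  return VA_NULL_PAGE;
    if (addr < GIB(4))  return VA_SHADERS;
    if (addr < GIB(8))  return VA_SURFACES;
    if (addr < GIB(12)) return VA_DYNAMIC;
    return VA_OTHER;
}
```

Because each base-relative region is at most 4 GB wide, any object inside it can be addressed with a 32-bit offset from the matching base address.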

3D Pipeline State Commands

The VF stage receives the 3DPRIMITIVE command from the CS unit, executes it, stores the generated vertex data in the URB, and passes the corresponding vertex packets down to the other pipeline stages. Before a 3D primitive command is issued, the following commands set the various pipeline states:

• 3DSTATE_PIPELINED_POINTERS (removed after Gen4.5)
• 3DSTATE_BINDING_TABLE_POINTERS
• 3DSTATE_VERTEX_BUFFERS
• 3DSTATE_VERTEX_ELEMENTS
• 3DSTATE_INDEX_BUFFERS
• 3DSTATE_VF_STATISTICS
• 3DSTATE_DRAWING_RECTANGLE
• 3DSTATE_CONSTANT_COLOR
• 3DSTATE_DEPTH_BUFFER
• 3DSTATE_POLY_STIPPLE_OFFSET
• 3DSTATE_POLY_STIPPLE_PATTERN
• 3DSTATE_LINE_STIPPLE
• 3DSTATE_GLOBAL_DEPTH_OFFSET

The commands related to VF vertex input are:

• 3DSTATE_VERTEX_BUFFERS (vertex buffer, VBO)
• 3DSTATE_VERTEX_ELEMENTS (vertex attribute configuration)
• 3DSTATE_INDEX_BUFFERS (index buffer, EBO)
• 3DPRIMITIVE (primitive assembly)
• 3DSTATE_VF_STATISTICS (vertex statistics)

vertex buffer

The 3DSTATE_VERTEX_BUFFERS and 3DSTATE_VERTEX_STATE commands define the buffers that hold vertices (the Mesa code names the latter structure VERTEX_BUFFER_STATE). Most of the vertex data used by 3DPRIMITIVE comes from vertex buffers. In OpenGL the corresponding object is the VBO (Vertex Buffer Object).

float vertices[] = {
     0.5f,  0.5f, 0.0f,  // top right
     0.5f, -0.5f, 0.0f,  // bottom right
    -0.5f, -0.5f, 0.0f,  // bottom left
    -0.5f,  0.5f, 0.0f   // top left
};
// Create the VBO
glGenBuffers(1, &vboId);
glBindBuffer(GL_ARRAY_BUFFER, vboId);
// Allocate the buffer store first; glBufferSubData cannot allocate it
glBufferData(GL_ARRAY_BUFFER, sizeof(vertices), NULL, GL_STATIC_DRAW);
// Copy the vertex attribute data into the VBO
glBufferSubData(GL_ARRAY_BUFFER, 0, sizeof(vertices), vertices);

A VB (vertex buffer) is a one-dimensional array of structures, where the size of each structure is defined by the buffer pitch of the VB. 3DPRIMITIVE can access a vertex buffer sequentially or randomly. When OpenGL does not use an index buffer object (element buffer object), the buffer is read sequentially. With an index buffer object, the buffer is read randomly at the indices the index buffer specifies.

3DSTATE_VERTEX_STATE is a structure that describes a vertex data buffer or an instance data buffer. It is 4 Dwords in size. One 3DSTATE_VERTEX_BUFFERS command can carry several 3DSTATE_VERTEX_STATE structures (at least one), and at most 33 can be bound.

In Mesa, src/gallium/drivers/iris/iris_state.c fills the batch buffer with the commands related to 3D pipeline state. The function iris_upload_dirty_render_state writes 3DSTATE_VERTEX_BUFFERS into the batch buffer.

const unsigned vb_dwords = GENX(VERTEX_BUFFER_STATE_length);
uint32_t *map =
   iris_get_command_space(batch, 4 * (1 + vb_dwords * count));
// Write the command header
_iris_pack_command(batch, GENX(3DSTATE_VERTEX_BUFFERS), map, vb) {
   vb.DWordLength = (vb_dwords * count + 1) - 2;
}
map += 1;

bound = dynamic_bound;
while (bound) {
   const int i = u_bit_scan64(&bound);
   // Copy the 3DSTATE_VERTEX_STATE data into the batch buffer
   memcpy(map, genx->vertex_buffers[i].state,
          sizeof(uint32_t) * vb_dwords);
   map += vb_dwords;
}
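The size arithmetic in the snippet above is easy to get wrong, so here it is in isolation. This sketch assumes, as the article states, that each vertex buffer state is 4 Dwords and that at most 33 can be bound; the "total minus 2" bias for the DWordLength field is taken directly from the Mesa expression. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define VB_STATE_DWORDS   4u   /* size of one vertex buffer state, per the text */
#define MAX_VERTEX_BUFFERS 33u /* upper bound stated in the text */

/* Total command length: one header dword plus 4 dwords per buffer. */
static uint32_t vb_cmd_total_dwords(uint32_t count)
{
    assert(count >= 1u && count <= MAX_VERTEX_BUFFERS);
    return 1u + VB_STATE_DWORDS * count;
}

/* Value of the DWordLength header field: total length minus 2. */
static uint32_t vb_cmd_dword_length_field(uint32_t count)
{
    return vb_cmd_total_dwords(count) - 2u;
}

/* Bytes of batch space to reserve for the whole command. */
static uint32_t vb_cmd_bytes(uint32_t count)
{
    return 4u * vb_cmd_total_dwords(count);
}
```

For a single vertex buffer the command is 5 dwords (20 bytes) and the DWordLength field holds 3.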

After OpenGL creates a VBO with glGenBuffers and writes data into it, the Mesa back end calls pipe->set_vertex_buffers, which resolves to iris_set_vertex_buffers. That function writes the buffer's address, size, and related data into the vertex buffer state structure.

When the 3DSTATE_VERTEX_BUFFERS command is written, the stored vertex buffer state data is written into the batch buffer along with it.

static void
iris_set_vertex_buffers(struct pipe_context *ctx,
                        unsigned start_slot, unsigned count,
                        unsigned unbind_num_trailing_slots,
                        bool take_ownership,
                        const struct pipe_vertex_buffer *buffers)
{
   /* ... omitted ... */
      iris_pack_state(GENX(VERTEX_BUFFER_STATE), state->state, vb) {
         vb.VertexBufferIndex = start_slot + i;
         vb.AddressModifyEnable = true;
         vb.BufferPitch = buffer->stride;
         if (res) {
            vb.BufferSize = res->base.b.width0 - (int) buffer->buffer_offset;
            vb.BufferStartingAddress =
               ro_bo(NULL, res->bo->address + (int) buffer->buffer_offset);
            vb.MOCS = iris_mocs(res->bo, &screen->isl_dev,
                                ISL_SURF_USAGE_VERTEX_BUFFER_BIT);
#if GFX_VER >= 12
            vb.L3BypassDisable       = true;
#endif
         } else {
            vb.NullVertexBuffer = true;
            vb.MOCS = iris_mocs(NULL, &screen->isl_dev,
                                ISL_SURF_USAGE_VERTEX_BUFFER_BIT);
         }
      }
   }
   /* ... omitted ... */
}

The iris_pack_state macro relies on definitions generated from the XML files in Mesa. The generated file does not exist unless the source is built; it is placed at build/src/intel/genxml/gen12_pack.h.

index buffer

The 3DSTATE_INDEX_BUFFERS command configures the index buffer state (address, size, and so on). In OpenGL the index buffer is also called the element buffer, that is, the EBO (Element Buffer Object).

float vertices[] = {
     0.5f,  0.5f, 0.0f,  // top right
     0.5f, -0.5f, 0.0f,  // bottom right
    -0.5f, -0.5f, 0.0f,  // bottom left
    -0.5f,  0.5f, 0.0f   // top left
};
unsigned int indices[] = {
    0, 1, 3,  // first triangle
    1, 2, 3   // second triangle
};
glGenBuffers(1, &EBO);
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, EBO);
glBufferData(GL_ELEMENT_ARRAY_BUFFER, sizeof(indices), indices, GL_STATIC_DRAW);

The EBO solves the problem of one vertex being used many times. It reduces wasted memory and improves efficiency: when a vertex is reused, it is referenced by its position index instead of by repeating the vertex data.

The EBO stores the vertex position indices. Like a VBO, it is a buffer allocated in video memory, except that it holds vertex indices. On the GPU the index buffer layout follows the 3DSTATE_INDEX_BUFFER_BODY format in the manual. In Mesa, the function iris_upload_render_state writes 3DSTATE_INDEX_BUFFERS into the batch buffer.

uint32_t ib_packet[GENX(3DSTATE_INDEX_BUFFER_length)];
iris_pack_command(GENX(3DSTATE_INDEX_BUFFER), ib_packet, ib) {
   ib.IndexFormat = draw->index_size >> 1;
   ib.MOCS = iris_mocs(bo, &batch->screen->isl_dev,
                       ISL_SURF_USAGE_INDEX_BUFFER_BIT);
   ib.BufferSize = bo->size - offset;
   ib.BufferStartingAddress = ro_bo(NULL, bo->address + offset);
#if GFX_VER >= 12
   ib.L3BypassDisable       = true;
#endif
}

The index format corresponds to the type of the index values in OpenGL: GL_UNSIGNED_BYTE, GL_UNSIGNED_SHORT, or GL_UNSIGNED_INT.
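The `draw->index_size >> 1` expression in the Mesa snippet maps the index size in bytes to the hardware field value. A minimal sketch of that mapping follows; the enumerator names are hypothetical, and only the numeric values 0, 1, and 2 follow from the shift.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for the three encodings produced by the shift. */
enum index_format {
    INDEX_FORMAT_BYTE  = 0,  /* 1-byte indices, GL_UNSIGNED_BYTE  */
    INDEX_FORMAT_WORD  = 1,  /* 2-byte indices, GL_UNSIGNED_SHORT */
    INDEX_FORMAT_DWORD = 2   /* 4-byte indices, GL_UNSIGNED_INT   */
};

/* Same computation as `draw->index_size >> 1`, with the input validated. */
static enum index_format index_format_for_size(uint32_t index_size)
{
    assert(index_size == 1u || index_size == 2u || index_size == 4u);
    return (enum index_format)(index_size >> 1);
}
```

The shift works only because the three legal sizes are 1, 2, and 4; any other size would need an explicit lookup.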

Both 3DSTATE_INDEX_BUFFERS and 3DSTATE_VERTEX_BUFFERS use bo->address. This BO address is not the virtual address mapped on the CPU side. It is a virtual address for the PPGTT, as described earlier, allocated from one of the zones defined by the Mesa driver.

enum iris_memory_zone {
   IRIS_MEMZONE_SHADER,
   IRIS_MEMZONE_BINDER,
   IRIS_MEMZONE_SCRATCH,
   IRIS_MEMZONE_SURFACE,
   IRIS_MEMZONE_DYNAMIC,
   IRIS_MEMZONE_OTHER,
   IRIS_MEMZONE_BORDER_COLOR_POOL,
};

When the Mesa user-space driver allocates a BO, it must specify the memory zone to use.

iris_bo_alloc(struct iris_bufmgr *bufmgr,
              const char *name,
              uint64_t size,
              uint32_t alignment,
              enum iris_memory_zone memzone,
              unsigned flags)

After a virtual address is allocated from the chosen zone, it is written into the batch, relative to the base address set earlier. The GPU driver is then required to use this address for the PPGTT (Per-Process Graphics Translation Table) virtual address mapping.

Vertex Attributes

With the VBO and EBO in place, the vertex data and the order in which vertices are used are known. A VBO can store vertex data in different formats, so the GPU must be told how to interpret it. In OpenGL, glVertexAttribPointer and glVertexAttribFormat set how the vertex data in a buffer is interpreted.

// 0. Copy the vertex array into a buffer for OpenGL to use
glBindBuffer(GL_ARRAY_BUFFER, VBO);
glBufferData(GL_ARRAY_BUFFER, sizeof(vertices), vertices, GL_STATIC_DRAW);
// 1. Set the vertex attribute pointer
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 3 * sizeof(float), (void*)0);
glEnableVertexAttribArray(0);

Source: https://learnopengl-cn.github.io/01%20Getting%20started/04%20Hello%20Triangle/

`glVertexAttribPointer` takes quite a few parameters, so they are described one by one:

-   The first parameter specifies which vertex attribute to configure. The vertex shader declared the location of the `position` vertex attribute with `layout(location = 0)`, which sets the attribute's location to `0`. Since the data should go to this attribute, `0` is passed here.
-   The second parameter specifies the size of the vertex attribute. The attribute is a `vec3`, made up of 3 values, so the size is 3.
-   The third parameter specifies the data type, here `GL_FLOAT` (a `vec*` in GLSL consists of floating-point values).
-   The next parameter specifies whether the data should be normalized. If it is set to `GL_TRUE`, all data is mapped to the range 0 to 1 (or -1 to 1 for signed data). It is set to `GL_FALSE` here.
-   The fifth parameter is the stride, the spacing between consecutive vertex attribute groups. Since the next position data starts 3 `float`s later, the stride is `3 * sizeof(float)`. Because this array is tightly packed (there are no gaps between two vertex attributes), the stride can also be set to 0 to let OpenGL work it out (this only works when the values are tightly packed). With more vertex attributes, the spacing between them has to be defined more carefully. (Translator's note: put simply, this parameter is the number of bytes between position 0 of the array and the place where this attribute appears for the second time.)
-   The last parameter has type `void*`, hence the odd cast. It is the offset of the position data from the start of the buffer. Since the position data is at the beginning of the array, it is 0 here.

For how to configure the stride and offset, see: https://stackoverflow.com/questions/16380005/opengl-3-4-glvertexattribpointer-stride-and-offset-miscalculation
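The stride and offset rules quoted above boil down to one address computation, sketched here in C. The sketch follows the OpenGL convention that a stride of 0 means "tightly packed"; `attrib_byte_offset` and its parameter names are hypothetical and not part of OpenGL or Mesa.

```c
#include <assert.h>
#include <stddef.h>

/* Byte offset, from the start of the buffer, of one attribute of one vertex.
 *   vertex         - index of the vertex
 *   stride         - bytes between consecutive vertices (0 = tightly packed)
 *   offset         - bytes from the buffer start to the first attribute value
 *   components     - values per attribute (3 for a vec3)
 *   component_size - bytes per value (sizeof(float) for GL_FLOAT)
 */
static size_t attrib_byte_offset(size_t vertex, size_t stride, size_t offset,
                                 size_t components, size_t component_size)
{
    assert(components >= 1 && components <= 4);
    assert(component_size > 0);

    /* A stride of 0 means the attribute values are packed back to back. */
    size_t effective_stride = stride ? stride : components * component_size;
    return offset + vertex * effective_stride;
}
```

For the position-only array in the example, vertex 2 starts at byte `2 * 3 * sizeof(float)`. In an interleaved layout with position followed by color, the color attribute of vertex 1 would use a stride of `6 * sizeof(float)` and an offset of `3 * sizeof(float)`.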

On the GPU, VERTEX_ELEMENT_STATE sets the attributes of a vertex, and 3DSTATE_VERTEX_ELEMENTS specifies how many VERTEX_ELEMENT_STATE structures there are. The Mesa code tracks vertex attributes in struct pipe_vertex_element:

struct pipe_vertex_element
{
   /** Offset of this attribute, in bytes, from the start of the vertex */
   uint16_t src_offset;

   /** Which vertex_buffer (as given to pipe->set_vertex_buffer()) does
    * this attribute live in?
    */
   uint8_t vertex_buffer_index:7;

   /**
    * Whether this element refers to a dual-slot vertex shader input.
    * The purpose of this field is to do dual-slot lowering when the CSO is
    * created instead of during every state change.
    *
    * It's lowered by util_lower_uint64_vertex_elements.
    */
   bool dual_slot:1;

   /**
    * This has only 8 bits because all vertex formats should be <= 255.
    */
   uint8_t src_format; /* low 8 bits of enum pipe_format. */

   /** Instance data rate divisor. 0 means this is per-vertex data,
    *  n means per-instance data used for n consecutive instances (n > 0).
    */
   unsigned instance_divisor;
};

Vertex attributes are set through pipe->create_vertex_elements_state. On Intel, iris_create_vertex_elements writes the VERTEX_ELEMENT_STATE:

iris_pack_state(GENX(VERTEX_ELEMENT_STATE), ve_pack_dest, ve) {
   ve.EdgeFlagEnable = false;
   ve.VertexBufferIndex = state[i].vertex_buffer_index;
   ve.Valid = true;
   ve.SourceElementOffset = state[i].src_offset;
   ve.SourceElementFormat = fmt.fmt;
   ve.Component0Control = comp[0];
   ve.Component1Control = comp[1];
   ve.Component2Control = comp[2];
   ve.Component3Control = comp[3];
}

Vertex Data Processing

After the 3DPRIMITIVE command is issued, the GPU processes the vertex data according to the vertex buffer address, format, index, and other information configured in the batch buffer. It generates values such as VertexID and VertexIndex, writes the processed vertex data into the URB, and produces a unique URB handle. The VS then uses the URB handle together with the shader to process the data. The 3DPRIMITIVE command specifies the following fields:

iris_emit_cmd(batch, GENX(3DPRIMITIVE), prim) {
   // Random access when an index buffer is present, sequential otherwise
   prim.VertexAccessType = draw->index_size > 0 ? RANDOM : SEQUENTIAL;
   prim.PredicateEnable = use_predicate;
   if (indirect) {
      prim.IndirectParameterEnable = true;
   } else {
      prim.StartInstanceLocation = draw->start_instance;
      prim.InstanceCount = draw->instance_count;
      prim.VertexCountPerInstance = sc->count;
      // Corresponds to the start argument of glDrawArrays in OpenGL
      prim.StartVertexLocation = sc->start;
      if (draw->index_size) {
         // Fixed vertex offset, like the BaseVertex argument of the
         // glDraw[Range]Elements{,BaseVertex} API
         prim.BaseVertexLocation += sc->index_bias;
      }
   }
}

Overall Flow of Reading Vertex Data and Writing It to the URB

Pseudocode for 3D primitive vertex processing (from the Gen4 manual)

The vertex loop computes the current VertexID. Pseudocode for generating VertexID with sequential access:

VertexIndex = StartVertexLocation + VertexNumber
VertexID = VertexNumber

Pseudocode for random access:

IBIndex = StartVertexLocation + VertexNumber
VertexID = IB[IBIndex]
if (VertexID == 'all ones')
  CutFlag = 1
else
  VertexIndex = VertexID + BaseVertexLocation
  CutFlag = 0
endif

Compared with sequential access, random access finds the vertex position through the index buffer, so it has an extra index buffer lookup step.
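The two pseudocode fragments above translate almost line for line into C. The sketch below assumes 32-bit indices, so 'all ones' is `UINT32_MAX`, and treats BaseVertexLocation as a signed value; `struct vertex_ref`, `vertex_sequential`, and `vertex_random` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct vertex_ref {
    uint32_t vertex_id;
    uint32_t vertex_index;  /* meaningful only when cut is false */
    bool     cut;           /* CutFlag from the pseudocode */
};

/* Sequential access: the index is derived from the vertex number alone. */
static struct vertex_ref vertex_sequential(uint32_t start_vertex_location,
                                           uint32_t vertex_number)
{
    struct vertex_ref v = { vertex_number,
                            start_vertex_location + vertex_number,
                            false };
    return v;
}

/* Random access: one extra lookup through the index buffer (ib). */
static struct vertex_ref vertex_random(const uint32_t *ib, size_t ib_len,
                                       uint32_t start_vertex_location,
                                       uint32_t vertex_number,
                                       int32_t base_vertex_location)
{
    size_t ib_index = (size_t)start_vertex_location + vertex_number;
    assert(ib != NULL && ib_index < ib_len);

    struct vertex_ref v = { ib[ib_index], 0u, false };
    if (v.vertex_id == UINT32_MAX)   /* 'all ones' marks a cut */
        v.cut = true;
    else
        v.vertex_index =
            (uint32_t)((int64_t)v.vertex_id + base_vertex_location);
    return v;
}
```

Note that in the random path VertexID is the raw index value read from the buffer, while VertexIndex additionally includes the base vertex offset.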

PrimitiveID is the ID of each primitive within a primitive type; the figure below illustrates it.

After VertexID and PrimitiveID generation and vertex format conversion, a VUE handle is produced for the VS stage to use.

Vertex Shader (VS) Stage

Once VF has finished processing the vertex data, it is passed to the next pipeline stage, VS. VS reads the data the shader needs through the URB handle and launches threads on the EUs to run the shader. Threads are launched through the Thread Dispatcher. Based on the shader's contents, the shader compiler produces information such as the constant buffers, surfaces, and samplers in use. This state is submitted with 3DSTATE_XX commands; VS uses 3DSTATE_VS.

This information is sent to the GPU with the 3DSTATE_VS command. In Mesa, src/gallium/drivers/iris/iris_state.c emits all state-related information.

#define INIT_THREAD_DISPATCH_FIELDS(pkt, prefix, stage)                   \
   pkt.KernelStartPointer = KSP(shader);                                  \
   pkt.BindingTableEntryCount = shader->bt.size_bytes / 4;                \
   pkt.FloatingPointMode = prog_data->use_alt_mode;                       \
                                                                          \
   pkt.DispatchGRFStartRegisterForURBData =                               \
      prog_data->dispatch_grf_start_reg;                                  \
   pkt.prefix##URBEntryReadLength = vue_prog_data->urb_read_length;       \
   pkt.prefix##URBEntryReadOffset = 0;                                    \
                                                                          \
   pkt.StatisticsEnable = true;                                           \
   pkt.Enable           = true;                                           \
                                                                          \
   if (prog_data->total_scratch) {                                        \
      struct iris_bo *bo =                                                \
         iris_get_scratch_space(ice, prog_data->total_scratch, stage);    \
      uint32_t scratch_addr = bo->gtt_offset;                             \
      pkt.PerThreadScratchSpace = ffs(prog_data->total_scratch) - 11;     \
      pkt.ScratchSpaceBasePointer = rw_bo(NULL, scratch_addr,             \
                                          IRIS_DOMAIN_NONE);              \
   }

/**
 * Encode most of 3DSTATE_VS based on the compiled shader.
 */
static void
iris_store_vs_state(struct iris_context *ice,
                    const struct gen_device_info *devinfo,
                    struct iris_compiled_shader *shader)
{
   struct brw_stage_prog_data *prog_data = shader->prog_data;
   struct brw_vue_prog_data *vue_prog_data = (void *) prog_data;

   iris_pack_command(GENX(3DSTATE_VS), shader->derived_data, vs) {
      INIT_THREAD_DISPATCH_FIELDS(vs, Vertex, MESA_SHADER_VERTEX);
      vs.MaximumNumberofThreads = devinfo->max_vs_threads - 1;
      vs.SIMD8DispatchEnable = true;
      vs.UserClipDistanceCullTestEnableBitmask =
         vue_prog_data->cull_distance_mask;
   }
}
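The two derived fields in the macro above can be checked with a small standalone sketch. It assumes, as the `ffs(...) - 11` expression suggests, that the per-thread scratch size is a power of two of at least 1 KB, and that one binding table entry is 4 bytes. The helper names are hypothetical and are not part of Mesa; `lowest_set_bit` is a portable stand-in for POSIX `ffs()`.

```c
#include <assert.h>
#include <stdint.h>

/* Portable stand-in for POSIX ffs(): 1-based index of the lowest set bit,
 * or 0 when no bit is set. */
static int lowest_set_bit(uint32_t v)
{
   int index = 1;
   if (v == 0)
      return 0;
   while ((v & 1u) == 0) {
      v >>= 1;
      index++;
   }
   return index;
}

/* Hypothetical helper: encode a per-thread scratch size the way the macro
 * does. 1 KB (1 << 10) has its lowest set bit at index 11, so it is
 * encoded as 0; 2 KB is encoded as 1, and so on. */
static uint32_t encode_per_thread_scratch(uint32_t total_scratch)
{
   /* Assumption: the size is a power of two and at least 1 KB. */
   assert(total_scratch >= 1024 &&
          (total_scratch & (total_scratch - 1)) == 0);
   return (uint32_t)lowest_set_bit(total_scratch) - 11;
}

/* Hypothetical helper: number of 4-byte entries in a binding table. */
static uint32_t binding_table_entry_count(uint32_t size_bytes)
{
   return size_bytes / 4;
}
```

Under these assumptions a 1 KB scratch allocation encodes as 0, and a 64-byte binding table holds 16 entries.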

Creating the sampler state/table

On the GPU, each shader stage has its own sampler state table, which is programmed into the hardware with 3DSTATE_SAMPLER_STATE_POINTERS_XX.

Mesa Gallium is split into a front end and a back end. The OpenGL back-end implementation calls into the hardware driver through the pipe structure.

ctx->pipe->create_sampler_state   // create a sampler state structure in the format the hardware supports
ctx->pipe->bind_sampler_states    // bind the created sampler state structures to stages such as VS/FS

Using texture samplers in OpenGL

void glGenSamplers (GLsizei count, GLuint *samplers);
void glSamplerParameteri (GLuint sampler, GLenum pname, GLint param);
void glSamplerParameterf (GLuint sampler, GLenum pname, GLfloat param);
void glBindSampler(GLuint unit, GLuint sampler);

Sampler structures maintained in Mesa

gl_sampler_object is converted to pipe_sampler_state, and the call chain then runs update_shader_samplers -> cso_set_samplers -> cso_single_sampler -> pipe->create_sampler_state -> iris_create_sampler_state. At the end, the Intel driver's iris_create_sampler_state writes the data into a buffer in the format the hardware supports. Mesa binds the sampler state to a 3D pipeline stage through the cso_single_sampler_done function:

void
cso_single_sampler_done(struct cso_context *ctx,
                        enum pipe_shader_type shader_stage)
{
   struct sampler_info *info = &ctx->samplers[shader_stage];

   if (ctx->max_sampler_seen == -1)
      return;

   ctx->pipe->bind_sampler_states(ctx->pipe, shader_stage, 0,
                                  ctx->max_sampler_seen + 1,
                                  info->samplers);
   ctx->max_sampler_seen = -1;
}

Finally, iris_upload_sampler_states allocates a BO for the sampler_table, copies the sampler state data into that BO, and sends it to the GPU with the 3DSTATE_SAMPLER_STATE_POINTERS_VS command.

Sampler State TABLE

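The hardware locates the sampler state by taking the offset from the Sampler State Pointer carried in the message and combining it with the DynamicStateBaseAddress base address; the result is the address of the sampler state in system memory. The Sampler State Pointer is set with 3DSTATE_SAMPLER_STATE_POINTERS_xx.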
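The address calculation described above can be sketched as follows. This is a minimal illustration, not driver code: the helper names are hypothetical, and the size of one sampler state entry is left as a parameter because the text does not state it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: GPU virtual address of a stage's sampler state
 * table, formed from the Dynamic State Base Address plus the offset set
 * by 3DSTATE_SAMPLER_STATE_POINTERS_xx. */
static uint64_t sampler_table_address(uint64_t dynamic_state_base,
                                      uint32_t sampler_state_pointer)
{
   return dynamic_state_base + sampler_state_pointer;
}

/* Hypothetical helper: address of entry `index` in that table.
 * entry_size is supplied by the caller; this sketch does not assume
 * a particular SAMPLER_STATE size. */
static uint64_t sampler_entry_address(uint64_t dynamic_state_base,
                                      uint32_t sampler_state_pointer,
                                      uint32_t index, uint32_t entry_size)
{
   assert(entry_size > 0);
   return sampler_table_address(dynamic_state_base, sampler_state_pointer) +
          (uint64_t)index * entry_size;
}
```

For example, with a base of 0x1000, a pointer offset of 0x40, and an assumed 16-byte entry, the third sampler (index 2) sits at 0x1060.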

for (int stage = 0; stage <= MESA_SHADER_FRAGMENT; stage++) {
   if (!(stage_dirty & (IRIS_STAGE_DIRTY_SAMPLER_STATES_VS << stage)) ||
       !ice->shaders.prog[stage])
      continue;

   iris_upload_sampler_states(ice, stage); /* upload the sampler state into sampler_table */

   struct iris_shader_state *shs = &ice->state.shaders[stage];
   struct pipe_resource *res = shs->sampler_table.res;
   if (res)
      iris_use_pinned_bo(batch, iris_resource_bo(res), false,
                         IRIS_DOMAIN_NONE);

   iris_emit_cmd(batch, GENX(3DSTATE_SAMPLER_STATE_POINTERS_VS), ptr) {
      ptr._3DCommandSubOpcode = 43 + stage;
      ptr.PointertoVSSamplerState = shs->sampler_table.offset;
   }
}
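The loop above relies on two per-stage conventions: the dirty bit for a stage is the VS bit shifted left by the stage index, and the command sub-opcode is the VS sub-opcode plus the stage index. A minimal sketch of that arithmetic follows. The enum and macro names are hypothetical, the dirty-bit position is an arbitrary choice for illustration, and the stage order (VS, HS, DS, GS, PS) is an assumption; only the value 43 is taken from the loop.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed stage order; the names are hypothetical. */
enum sketch_stage {
   STAGE_VS, STAGE_HS, STAGE_DS, STAGE_GS, STAGE_PS, STAGE_COUNT
};

/* Hypothetical dirty bit for the first stage; later stages use the
 * following bits. */
#define SKETCH_DIRTY_SAMPLER_STATES_VS (1ull << 0)

/* Sub-opcode used for the VS variant in the loop above. */
#define SAMPLER_STATE_POINTERS_VS_SUBOPCODE 43u

/* True when the sampler state of `stage` needs to be re-emitted. */
static int stage_is_dirty(uint64_t stage_dirty, enum sketch_stage stage)
{
   return (stage_dirty & (SKETCH_DIRTY_SAMPLER_STATES_VS << stage)) != 0;
}

/* Sub-opcode of the per-stage command, derived from the VS one. */
static uint32_t sampler_pointers_subopcode(enum sketch_stage stage)
{
   assert(stage < STAGE_COUNT);
   return SAMPLER_STATE_POINTERS_VS_SUBOPCODE + (uint32_t)stage;
}
```

This is why a single `iris_emit_cmd` on the VS command template can serve every stage: only the sub-opcode changes.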

Mesa Gallium is split into a front end and a back end. The OpenGL back-end implementation calls into the hardware's user-mode driver through the pipe structure.

ctx->pipe->create_sampler_state // create a sampler state structure in the format the hardware supports

ctx->pipe->bind_sampler_states // bind the sampler state structures to stages such as VS and FS

APIs such as glGenSamplers (create a sampler), glBindSampler (bind a sampler), and glSamplerParameter (set sampler parameters) create, inside Mesa, the sampler object structure defined by the Khronos OpenGL ARB_sampler_objects extension and set the related parameters. At draw time update_shader_samplers is called, and during that update the st_convert_sampler function converts gl_sampler_object into pipe_sampler_state. The call chain then runs update_shader_samplers -> cso_set_samplers -> cso_single_sampler -> pipe->create_sampler_state -> iris_create_sampler_state, and at the end the driver's iris_create_sampler_state writes the data into a buffer in the hardware's SAMPLER_STATE format.

static void *
iris_create_sampler_state(struct pipe_context *ctx,
                          const struct pipe_sampler_state *state)
{
   struct iris_sampler_state *cso = CALLOC_STRUCT(iris_sampler_state);
   ......
   float min_lod = state->min_lod;
   unsigned mag_img_filter = state->mag_img_filter;

   // XXX: explain this code ported from ilo...I don't get it at all...
   if (state->min_mip_filter == PIPE_TEX_MIPFILTER_NONE &&
       state->min_lod > 0.0f) {
      min_lod = 0.0f;
      mag_img_filter = state->min_img_filter;
   }

   iris_pack_state(GENX(SAMPLER_STATE), cso->sampler_state, samp) {
      samp.TCXAddressControlMode = wrap_s;
      samp.TCYAddressControlMode = wrap_t;
      samp.TCZAddressControlMode = wrap_r;
      samp.CubeSurfaceControlMode = state->seamless_cube_map;
      samp.NonnormalizedCoordinateEnable = !state->normalized_coords;
      samp.MinModeFilter = state->min_img_filter;
      samp.MagModeFilter = mag_img_filter;
      samp.MipModeFilter = translate_mip_filter(state->min_mip_filter);
      samp.MaximumAnisotropy = RATIO21;

      if (state->max_anisotropy >= 2) {
         if (state->min_img_filter == PIPE_TEX_FILTER_LINEAR) {
            samp.MinModeFilter = MAPFILTER_ANISOTROPIC;
            samp.AnisotropicAlgorithm = EWAApproximation;
         }

         if (state->mag_img_filter == PIPE_TEX_FILTER_LINEAR)
            samp.MagModeFilter = MAPFILTER_ANISOTROPIC;

         samp.MaximumAnisotropy =
            MIN2((state->max_anisotropy - 2) / 2, RATIO161);
      }
   }
   ......
}
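The `MaximumAnisotropy` expression in the excerpt above maps an API-level anisotropy value onto a small hardware enum. The sketch below reproduces that arithmetic in isolation. It assumes that RATIO21 (2:1) is encoded as 0 and RATIO161 (16:1) as 7, with one step per increment of 2; the enum and function names are hypothetical.

```c
#include <assert.h>

/* Assumed field encodings: 2:1 is 0, 4:1 is 1, ... 16:1 is 7. */
enum { SKETCH_RATIO21 = 0, SKETCH_RATIO161 = 7 };

/* Mirror of the clamp in the excerpt: values below 2 leave the field at
 * its default, larger values are halved and clamped to 16:1. */
static unsigned encode_max_anisotropy(unsigned max_anisotropy)
{
   unsigned encoded;

   assert(SKETCH_RATIO21 < SKETCH_RATIO161);
   if (max_anisotropy < 2)
      return SKETCH_RATIO21;

   encoded = (max_anisotropy - 2) / 2;
   return encoded < SKETCH_RATIO161 ? encoded : SKETCH_RATIO161;
}
```

Under these assumptions a request for 4x anisotropy encodes as 1, and any request of 16x or more saturates at 7.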

Within cso_set_samplers, the pointer to the allocated sampler structure is stored in the samplers array of ctx:

void
cso_single_sampler(struct cso_context *ctx, enum pipe_shader_type shader_stage,
                   unsigned idx, const struct pipe_sampler_state *templ)
{
   if (templ) {
      unsigned key_size = sizeof(struct pipe_sampler_state);
      unsigned hash_key = cso_construct_key((void*)templ, key_size);
      struct cso_sampler *cso;
      struct cso_hash_iter iter =
         cso_find_state_template(ctx->cache,
                                 hash_key, CSO_SAMPLER,
                                 (void *) templ, key_size);

      if (cso_hash_iter_is_null(iter)) {
         cso = MALLOC(sizeof(struct cso_sampler));
         if (!cso)
            return;

         memcpy(&cso->state, templ, sizeof(*templ));
         cso->data = ctx->pipe->create_sampler_state(ctx->pipe, &cso->state);
         cso->delete_state =
            (cso_state_callback) ctx->pipe->delete_sampler_state;
         cso->context = ctx->pipe;
         cso->hash_key = hash_key;

         iter = cso_insert_state(ctx->cache, hash_key, CSO_SAMPLER, cso);
         if (cso_hash_iter_is_null(iter)) {
            FREE(cso);
            return;
         }
      }
      else {
         cso = cso_hash_iter_data(iter);
      }

      ctx->samplers[shader_stage].cso_samplers[idx] = cso;
      ctx->samplers[shader_stage].samplers[idx] = cso->data;
      ctx->max_sampler_seen = MAX2(ctx->max_sampler_seen, (int)idx);
   }
}
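The function above follows a look-up-or-create pattern: a driver object is created only when no cached object matches the template. A much-simplified sketch of that pattern follows. All types and names here are hypothetical, a linear search stands in for the hash table, and an incrementing integer stands in for the object the driver hook would return.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, reduced stand-in for a sampler template (the cache key). */
struct sketch_sampler_templ { int min_filter; int mag_filter; int wrap; };

struct sketch_cso {
   struct sketch_sampler_templ state; /* copy of the template        */
   int driver_handle;                 /* stand-in for driver object  */
};

#define SKETCH_CACHE_SIZE 8

struct sketch_cache {
   struct sketch_cso entries[SKETCH_CACHE_SIZE];
   size_t count;
   int create_calls;                  /* how often the "driver" ran  */
};

/* Return the cached object for `templ`, creating it on a miss.
 * Returns NULL when this fixed-size sketch cache is full. */
static struct sketch_cso *
sketch_get_sampler(struct sketch_cache *cache,
                   const struct sketch_sampler_templ *templ)
{
   size_t i;

   assert(cache != NULL && templ != NULL);
   for (i = 0; i < cache->count; i++) {
      if (memcmp(&cache->entries[i].state, templ, sizeof(*templ)) == 0)
         return &cache->entries[i];   /* hit: reuse, no driver call */
   }
   if (cache->count == SKETCH_CACHE_SIZE)
      return NULL;

   cache->entries[cache->count].state = *templ;
   cache->entries[cache->count].driver_handle = ++cache->create_calls;
   return &cache->entries[cache->count++];
}
```

Requesting the same template twice returns the same object and triggers only one creation, which is the property the cache exists to provide.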

The state created by create_sampler_state is then sent to the driver and bound to the stage:

/**
 * Send staged sampler state to the driver.
 */
void
cso_single_sampler_done(struct cso_context *ctx,
                        enum pipe_shader_type shader_stage)
{
   struct sampler_info *info = &ctx->samplers[shader_stage];

   if (ctx->max_sampler_seen == -1)
      return;

   ctx->pipe->bind_sampler_states(ctx->pipe, shader_stage, 0,
                                  ctx->max_sampler_seen + 1,
                                  info->samplers);
   ctx->max_sampler_seen = -1;
}

/**
 * The pipe->bind_sampler_states() driver hook.
 */
static void
iris_bind_sampler_states(struct pipe_context *ctx,
                         enum pipe_shader_type p_stage,
                         unsigned start, unsigned count,
                         void **states)
{
   struct iris_context *ice = (struct iris_context *) ctx;
   gl_shader_stage stage = stage_from_pipe(p_stage);
   struct iris_shader_state *shs = &ice->state.shaders[stage];

   assert(start + count <= IRIS_MAX_TEXTURE_SAMPLERS);

   bool dirty = false;

   for (int i = 0; i < count; i++) {
      if (shs->samplers[start + i] != states[i]) {
         shs->samplers[start + i] = states[i];
         dirty = true;
      }
   }

   if (dirty)
      ice->state.stage_dirty |= IRIS_STAGE_DIRTY_SAMPLER_STATES_VS << stage;
}
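The bind hook above marks a stage dirty only when a slot actually changes, so rebinding the same samplers does not cause the sampler table to be uploaded again. The following sketch isolates that behavior. The structure and function names are hypothetical, and the function returns the dirty flag instead of setting a bit in a context.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_MAX_SAMPLERS 16

/* Hypothetical per-stage state: just the bound sampler pointers. */
struct sketch_stage_state {
   void *samplers[SKETCH_MAX_SAMPLERS];
};

/* Store `count` sampler pointers starting at slot `start`.
 * Returns true when at least one slot changed, i.e. when the stage's
 * sampler table would have to be re-uploaded. */
static bool sketch_bind_samplers(struct sketch_stage_state *shs,
                                 unsigned start, unsigned count,
                                 void **states)
{
   bool dirty = false;
   unsigned i;

   assert(start + count <= SKETCH_MAX_SAMPLERS);
   for (i = 0; i < count; i++) {
      if (shs->samplers[start + i] != states[i]) {
         shs->samplers[start + i] = states[i];
         dirty = true;
      }
   }
   return dirty;
}
```

Binding the same array twice in a row reports a change only the first time.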

STATE BIND TABLE

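Textures, images, and similar resources used by OpenGL are all represented as surfaces in the driver. The surface state stores information such as the address and attributes of the surface. Each entry in the state binding table refers to a surface state. Every shader stage has its own state binding table, which is configured on the GPU with the 3DSTATE_BINDING_TABLE_POINTERS_XX command.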

for (int stage = 0; stage <= MESA_SHADER_FRAGMENT; stage++) {
   /* Gen9 requires 3DSTATE_BINDING_TABLE_POINTERS_XS to be re-emitted
    * in order to commit constants.  TODO: Investigate "Disable Gather
    * at Set Shader" to go back to legacy mode...
    */
   if (stage_dirty & ((IRIS_STAGE_DIRTY_BINDINGS_VS |
                       (GEN_GEN == 9 ? IRIS_STAGE_DIRTY_CONSTANTS_VS : 0))
                         << stage)) {
      iris_emit_cmd(batch, GENX(3DSTATE_BINDING_TABLE_POINTERS_VS), ptr) {
         ptr._3DCommandSubOpcode = 38 + stage;
         ptr.PointertoVSBindingTable = binder->bt_offset[stage];
      }
   }
}

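binder->bt_offset[stage] identifies the state table of each stage; bt_offset holds its address. In Mesa, the virtual addresses used by the PPGTT are all called offsets. Mesa allocates this virtual address from a heap and passes it to the kernel driver through exec2; the driver does not modify the virtual address and is forced to set up the mapping at that address.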

State Bind Table initialization

Creating an OpenGL context calls iris_create_context. During context initialization, iris_init_binder is called to allocate a BO that serves as the buffer for the state bind table.

static void
binder_realloc(struct iris_context *ice)
{
   struct iris_screen *screen = (void *) ice->ctx.screen;
   struct iris_bufmgr *bufmgr = screen->bufmgr;
   struct iris_binder *binder = &ice->state.binder;

   uint64_t next_address = IRIS_MEMZONE_BINDER_START;

   if (binder->bo) {
      /* Place the new binder just after the old binder, unless we've hit the
       * end of the memory zone...then wrap around to the start again.
       */
      next_address = binder->bo->gtt_offset + IRIS_BINDER_SIZE;
      if (next_address >= IRIS_MEMZONE_SURFACE_START)
         next_address = IRIS_MEMZONE_BINDER_START;

      iris_bo_unreference(binder->bo);
   }

   binder->bo =
      iris_bo_alloc(bufmgr, "binder",
                    IRIS_BINDER_SIZE, IRIS_MEMZONE_BINDER, 0);
   binder->bo->gtt_offset = next_address;
   binder->map = iris_bo_map(NULL, binder->bo, MAP_WRITE);
   binder->insert_point = INIT_INSERT_POINT;
}
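The address selection in binder_realloc places each new binder directly after the previous one and wraps back to the start of the binder zone once the next zone is reached. The sketch below shows only that wrap-around logic. The zone boundaries and binder size are made-up values chosen so that the zone holds four binders; they are not Mesa's real constants, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical memory-zone layout, for illustration only. */
#define SKETCH_BINDER_ZONE_START   0x10000000ull
#define SKETCH_SURFACE_ZONE_START  0x10040000ull /* first address past the binder zone */
#define SKETCH_BINDER_SIZE         0x10000ull    /* 64 KB per binder */

/* Address for the next binder. `current` is the address of the existing
 * binder and is ignored when have_binder is 0. */
static uint64_t next_binder_address(uint64_t current, int have_binder)
{
   uint64_t next = SKETCH_BINDER_ZONE_START;

   assert(!have_binder || current >= SKETCH_BINDER_ZONE_START);
   if (have_binder) {
      next = current + SKETCH_BINDER_SIZE;
      if (next >= SKETCH_SURFACE_ZONE_START)
         next = SKETCH_BINDER_ZONE_START;   /* wrap around */
   }
   return next;
}
```

With these values the fourth binder at 0x10030000 is followed by a wrap back to 0x10000000.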

Populating the state bind table

The state bind table holds the state of the various surfaces that OpenGL uses. In Mesa, iris_setup_binding_table works out, when the shader is compiled, how many surfaces the shader uses, and then computes the size of each kind of surface state. The iris_populate_binding_table function then writes the data into the bind buffer allocated earlier.
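A minimal sketch of what populating a binding table amounts to: each 4-byte entry is filled with the offset of one surface state. The sketch assumes that surface states are packed back to back and that each occupies 64 bytes. Both are assumptions made for illustration, and the function names are hypothetical rather than Mesa's.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_SURFACE_STATE_SIZE 64u  /* assumed size of one surface state */

/* Fill `count` binding table entries; entry i holds the offset of
 * surface state i, counted from first_offset. */
static void sketch_populate_binding_table(uint32_t *bt_map, uint32_t count,
                                          uint32_t first_offset)
{
   uint32_t i;

   assert(bt_map != NULL);
   for (i = 0; i < count; i++)
      bt_map[i] = first_offset + i * SKETCH_SURFACE_STATE_SIZE;
}

/* Size in bytes of a table with `count` 4-byte entries. */
static uint32_t sketch_binding_table_size(uint32_t count)
{
   return count * (uint32_t)sizeof(uint32_t);
}
```

A table with three entries therefore occupies 12 bytes, which matches the `size_bytes / 4` entry count used when 3DSTATE_VS is packed.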

The overall use of the kernel pointer, sampler state, and bind table is shown in the figure below:

Reviewing editor: 湯梓紅
