author | Oscar Mateo <oscar.mateo@intel.com> | 2014-07-24 12:04:48 -0400 |
---|---|---|
committer | Daniel Vetter <daniel.vetter@ffwll.ch> | 2014-08-20 11:17:51 -0400 |
commit | 73e4d07f8ae9cff8c869d73df4e299a3a6f5ad98 (patch) | |
tree | 405ff3959ffd8161fd362376db9c427bd33506e8 /drivers/gpu/drm | |
parent | c0ab1ae9028f14bcb7bfb655bd2120c60681c479 (diff) |
drm/i915/bdw: Document Logical Rings, LR contexts and Execlists
Add theory of operation notes to intel_lrc.c and comments to externally
visible functions.
v2: Add notes on logical ring context creation.
v3: Use kerneldoc.
v4: Integrate it in the DocBook template.
Signed-off-by: Thomas Daniel <thomas.daniel@intel.com> (v1)
Signed-off-by: Oscar Mateo <oscar.mateo@intel.com> (v2, v3)
Reviewed-by: Damien Lespiau <damien.lespiau@intel.com>
[danvet: Drop hunk about render ring init function since that's not
yet merged.]
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Diffstat (limited to 'drivers/gpu/drm')
-rw-r--r-- | drivers/gpu/drm/i915/intel_lrc.c | 203 | ||||
-rw-r--r-- | drivers/gpu/drm/i915/intel_lrc.h | 30 |
2 files changed, 232 insertions, 1 deletions
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c index cc923a96fa4c..c096b9b7f22a 100644 --- a/drivers/gpu/drm/i915/intel_lrc.c +++ b/drivers/gpu/drm/i915/intel_lrc.c | |||
@@ -28,13 +28,108 @@ | |||
28 | * | 28 | * |
29 | */ | 29 | */ |
30 | 30 | ||
31 | /* | 31 | /** |
32 | * DOC: Logical Rings, Logical Ring Contexts and Execlists | ||
33 | * | ||
34 | * Motivation: | ||
32 | * GEN8 brings an expansion of the HW contexts: "Logical Ring Contexts". | 35 | * GEN8 brings an expansion of the HW contexts: "Logical Ring Contexts". |
33 | * These expanded contexts enable a number of new abilities, especially | 36 | * These expanded contexts enable a number of new abilities, especially |
34 | * "Execlists" (also implemented in this file). | 37 | * "Execlists" (also implemented in this file). |
35 | * | 38 | * |
39 | * One of the main differences with the legacy HW contexts is that logical | ||
40 | * ring contexts incorporate many more things into the context's state, like | ||
41 | * PDPs or ringbuffer control registers: | ||
42 | * | ||
43 | * The reason why PDPs are included in the context is straightforward: as | ||
44 | * PPGTTs (per-process GTTs) are actually per-context, having the PDPs | ||
45 | * contained there means you don't need to do a ppgtt->switch_mm yourself; | ||
46 | * instead, the GPU will do it for you on the context switch. | ||
47 | * | ||
48 | * But what about the ringbuffer control registers (head, tail, etc.)? | ||
49 | * Shouldn't we just need a set of those per engine command streamer? This is | ||
50 | * where the name "Logical Rings" starts to make sense: by virtualizing the | ||
51 | * rings, the engine cs shifts to a new "ring buffer" with every context | ||
52 | * switch. When you want to submit a workload to the GPU you: A) choose your | ||
53 | * context, B) find its appropriate virtualized ring, C) write commands to it | ||
54 | * and then, finally, D) tell the GPU to switch to that context. | ||
55 | * | ||
56 | * Instead of the legacy MI_SET_CONTEXT, the way you tell the GPU to switch | ||
57 | * to a context is via a context execution list, ergo "Execlists". | ||
58 | * | ||
59 | * LRC implementation: | ||
60 | * Regarding the creation of contexts, we have: | ||
61 | * | ||
62 | * - One global default context. | ||
63 | * - One local default context for each opened fd. | ||
64 | * - One local extra context for each context create ioctl call. | ||
65 | * | ||
66 | * Now that ringbuffers belong per-context (and not per-engine, like before) | ||
67 | * and that contexts are uniquely tied to a given engine (and not reusable, | ||
68 | * like before) we need: | ||
69 | * | ||
70 | * - One ringbuffer per-engine inside each context. | ||
71 | * - One backing object per-engine inside each context. | ||
72 | * | ||
73 | * The global default context starts its life with these new objects fully | ||
74 | * allocated and populated. The local default context for each opened fd is | ||
75 | * more complex, because we don't know at creation time which engine is going | ||
76 | * to use them. To handle this, we have implemented a deferred creation of LR | ||
77 | * contexts: | ||
78 | * | ||
79 | * The local context starts its life as a hollow or blank holder, that only | ||
80 | * gets populated for a given engine once we receive an execbuffer. If later | ||
81 | * on we receive another execbuffer ioctl for the same context but a different | ||
82 | * engine, we allocate/populate a new ringbuffer and context backing object and | ||
83 | * so on. | ||
84 | * | ||
85 | * Finally, regarding local contexts created using the ioctl call: as they are | ||
86 | * only allowed with the render ring, we can allocate & populate them right | ||
87 | * away (no need to defer anything, at least for now). | ||
88 | * | ||
89 | * Execlists implementation: | ||
36 | * Execlists are the new method by which, on gen8+ hardware, workloads are | 90 | * Execlists are the new method by which, on gen8+ hardware, workloads are |
37 | * submitted for execution (as opposed to the legacy, ringbuffer-based, method). | 91 | * submitted for execution (as opposed to the legacy, ringbuffer-based, method). |
92 | * This method works as follows: | ||
93 | * | ||
94 | * When a request is committed, its commands (the BB start and any leading or | ||
95 | * trailing commands, like the seqno breadcrumbs) are placed in the ringbuffer | ||
96 | * for the appropriate context. The tail pointer in the hardware context is not | ||
97 | * updated at this time, but instead, kept by the driver in the ringbuffer | ||
98 | * structure. A structure representing this request is added to a request queue | ||
99 | * for the appropriate engine: this structure contains a copy of the context's | ||
100 | * tail after the request was written to the ring buffer and a pointer to the | ||
101 | * context itself. | ||
102 | * | ||
103 | * If the engine's request queue was empty before the request was added, the | ||
104 | * queue is processed immediately. Otherwise the queue will be processed during | ||
105 | * a context switch interrupt. In any case, elements on the queue will get sent | ||
106 | * (in pairs) to the GPU's ExecLists Submit Port (ELSP, for short) with a | ||
107 | * globally unique 20-bit submission ID. | ||
108 | * | ||
109 | * When execution of a request completes, the GPU updates the context status | ||
110 | * buffer with a context complete event and generates a context switch interrupt. | ||
111 | * During the interrupt handling, the driver examines the events in the buffer: | ||
112 | * for each context complete event, if the announced ID matches that on the head | ||
113 | * of the request queue, then that request is retired and removed from the queue. | ||
114 | * | ||
115 | * After processing, if any requests were retired and the queue is not empty | ||
116 | * then a new execution list can be submitted. The two requests at the front of | ||
117 | * the queue are next to be submitted but since a context may not occur twice in | ||
118 | * an execution list, if subsequent requests have the same ID as the first then | ||
119 | * the two requests must be combined. This is done simply by discarding requests | ||
120 | * at the head of the queue until either only one request is left (in which case | ||
121 | * we use a NULL second context) or the first two requests have unique IDs. | ||
122 | * | ||
123 | * By always executing the first two requests in the queue the driver ensures | ||
124 | * that the GPU is kept as busy as possible. In the case where a single context | ||
125 | * completes but a second context is still executing, the request for this second | ||
126 | * context will be at the head of the queue when we remove the first one. This | ||
127 | * request will then be resubmitted along with a new request for a different context, | ||
128 | * which will cause the hardware to continue executing the second request and queue | ||
129 | * the new request (the GPU detects the condition of a context getting preempted | ||
130 | * with the same context and optimizes the context switch flow by not doing | ||
131 | * preemption, but just sampling the new tail pointer). | ||
132 | * | ||
38 | */ | 133 | */ |
39 | 134 | ||
40 | #include <drm/drmP.h> | 135 | #include <drm/drmP.h> |
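The coalescing rule in the "Execlists implementation" notes above (drop requests at the head of the queue until one request is left or the first two belong to different contexts) can be sketched in standalone C. This is only an illustration under stated assumptions: struct sketch_request and sketch_pick_pair() are hypothetical names, the queue is a plain array rather than the driver's linked list, and no locking or hardware access is modelled.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for a queued request (not the driver's struct). */
struct sketch_request {
	unsigned int ctx_id;	/* submission ID of the owning context */
	unsigned int tail;	/* ringbuffer tail after this request was written */
};

/*
 * Pick up to two requests to submit from queue[0..*len).
 * A head request is discarded while its successor belongs to the same
 * context, because the successor's tail already covers its commands.
 * Returns the number of ports filled (0, 1 or 2); a result of 1
 * corresponds to submitting a NULL second context.
 */
static size_t sketch_pick_pair(struct sketch_request *queue, size_t *len,
			       struct sketch_request *port0,
			       struct sketch_request *port1)
{
	size_t head = 0;
	size_t i;

	if (*len == 0)
		return 0;

	/* Discard head entries while the next entry has the same context ID. */
	while (*len - head > 1 && queue[head].ctx_id == queue[head + 1].ctx_id)
		head++;

	/* Compact the array so the surviving requests start at index 0. */
	for (i = head; i < *len; i++)
		queue[i - head] = queue[i];
	*len -= head;

	*port0 = queue[0];
	if (*len == 1)
		return 1;

	*port1 = queue[1];
	return 2;
}
```

With a queue holding three requests for context 7 followed by one for context 9, the sketch keeps only the last context-7 request (the one with the newest tail) and pairs it with the context-9 request.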
@@ -109,6 +204,17 @@ enum { | |||
109 | }; | 204 | }; |
110 | #define GEN8_CTX_ID_SHIFT 32 | 205 | #define GEN8_CTX_ID_SHIFT 32 |
111 | 206 | ||
207 | /** | ||
208 | * intel_sanitize_enable_execlists() - sanitize i915.enable_execlists | ||
209 | * @dev: DRM device. | ||
210 | * @enable_execlists: value of i915.enable_execlists module parameter. | ||
211 | * | ||
212 | * Only certain platforms support Execlists (the prerequisites being | ||
213 | * support for Logical Ring Contexts and Aliasing PPGTT or better), | ||
214 | * and only when enabled via module parameter. | ||
215 | * | ||
216 | * Return: 1 if Execlists is supported and has to be enabled. | ||
217 | */ | ||
112 | int intel_sanitize_enable_execlists(struct drm_device *dev, int enable_execlists) | 218 | int intel_sanitize_enable_execlists(struct drm_device *dev, int enable_execlists) |
113 | { | 219 | { |
114 | WARN_ON(i915.enable_ppgtt == -1); | 220 | WARN_ON(i915.enable_ppgtt == -1); |
@@ -123,6 +229,18 @@ int intel_sanitize_enable_execlists(struct drm_device *dev, int enable_execlists | |||
123 | return 0; | 229 | return 0; |
124 | } | 230 | } |
125 | 231 | ||
232 | /** | ||
233 | * intel_execlists_ctx_id() - get the Execlists Context ID | ||
234 | * @ctx_obj: Logical Ring Context backing object. | ||
235 | * | ||
236 | * Do not confuse with ctx->id! Unfortunately we have a name overload | ||
237 | * here: the old context ID we pass to userspace as a handle so that | ||
238 | * they can refer to a context, and the new context ID we pass to the | ||
239 | * ELSP so that the GPU can inform us of the context status via | ||
240 | * interrupts. | ||
241 | * | ||
242 | * Return: globally unique 20-bit context ID. | ||
243 | */ | ||
126 | u32 intel_execlists_ctx_id(struct drm_i915_gem_object *ctx_obj) | 244 | u32 intel_execlists_ctx_id(struct drm_i915_gem_object *ctx_obj) |
127 | { | 245 | { |
128 | u32 lrca = i915_gem_obj_ggtt_offset(ctx_obj); | 246 | u32 lrca = i915_gem_obj_ggtt_offset(ctx_obj); |
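The hunk above ends before the function returns, so the following is only a guess at how a 20-bit ID could be derived from the GGTT offset read into lrca. It assumes the context image is 4 KiB aligned at a 32-bit offset, which leaves exactly 20 significant upper bits. The helper name sketch_ctx_id_from_lrca() is hypothetical and not part of the patch.

```c
#include <stdint.h>

/*
 * Assumption for this sketch: the context image sits at a 4 KiB-aligned
 * 32-bit GGTT offset, so the low 12 bits are always zero and the upper
 * 20 bits distinguish one context from another.
 */
static uint32_t sketch_ctx_id_from_lrca(uint32_t lrca)
{
	return lrca >> 12;
}
```

Under that assumption, two contexts can never share an ID, because two backing objects cannot occupy the same 4 KiB page of the global GTT at the same time.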
@@ -313,6 +431,13 @@ static bool execlists_check_remove_request(struct intel_engine_cs *ring, | |||
313 | return false; | 431 | return false; |
314 | } | 432 | } |
315 | 433 | ||
434 | /** | ||
435 | * intel_execlists_handle_ctx_events() - handle Context Switch interrupts | ||
436 | * @ring: Engine Command Streamer to handle. | ||
437 | * | ||
438 | * Check the unread Context Status Buffers and manage the submission of new | ||
439 | * contexts to the ELSP accordingly. | ||
440 | */ | ||
316 | void intel_execlists_handle_ctx_events(struct intel_engine_cs *ring) | 441 | void intel_execlists_handle_ctx_events(struct intel_engine_cs *ring) |
317 | { | 442 | { |
318 | struct drm_i915_private *dev_priv = ring->dev->dev_private; | 443 | struct drm_i915_private *dev_priv = ring->dev->dev_private; |
@@ -481,6 +606,23 @@ static int execlists_move_to_gpu(struct intel_ringbuffer *ringbuf, | |||
481 | return logical_ring_invalidate_all_caches(ringbuf); | 606 | return logical_ring_invalidate_all_caches(ringbuf); |
482 | } | 607 | } |
483 | 608 | ||
609 | /** | ||
610 | * intel_execlists_submission() - submit a batchbuffer for execution, Execlists style | ||
611 | * @dev: DRM device. | ||
612 | * @file: DRM file. | ||
613 | * @ring: Engine Command Streamer to submit to. | ||
614 | * @ctx: Context to employ for this submission. | ||
615 | * @args: execbuffer call arguments. | ||
616 | * @vmas: list of vmas. | ||
617 | * @batch_obj: the batchbuffer to submit. | ||
618 | * @exec_start: batchbuffer start virtual address pointer. | ||
619 | * @flags: translated execbuffer call flags. | ||
620 | * | ||
621 | * This is the evil twin version of i915_gem_ringbuffer_submission. It abstracts | ||
622 | * away the submission details of the execbuffer ioctl call. | ||
623 | * | ||
624 | * Return: non-zero if the submission fails. | ||
625 | */ | ||
484 | int intel_execlists_submission(struct drm_device *dev, struct drm_file *file, | 626 | int intel_execlists_submission(struct drm_device *dev, struct drm_file *file, |
485 | struct intel_engine_cs *ring, | 627 | struct intel_engine_cs *ring, |
486 | struct intel_context *ctx, | 628 | struct intel_context *ctx, |
@@ -608,6 +750,15 @@ int logical_ring_flush_all_caches(struct intel_ringbuffer *ringbuf) | |||
608 | return 0; | 750 | return 0; |
609 | } | 751 | } |
610 | 752 | ||
753 | /** | ||
754 | * intel_logical_ring_advance_and_submit() - advance the tail and submit the workload | ||
755 | * @ringbuf: Logical Ringbuffer to advance. | ||
756 | * | ||
757 | * The tail is updated in our logical ringbuffer struct, not in the actual context. What | ||
758 | * really happens during submission is that the context and current tail will be placed | ||
759 | * on a queue waiting for the ELSP to be ready to accept a new context submission. At that | ||
760 | * point, the tail *inside* the context is updated and the ELSP written to. | ||
761 | */ | ||
611 | void intel_logical_ring_advance_and_submit(struct intel_ringbuffer *ringbuf) | 762 | void intel_logical_ring_advance_and_submit(struct intel_ringbuffer *ringbuf) |
612 | { | 763 | { |
613 | struct intel_engine_cs *ring = ringbuf->ring; | 764 | struct intel_engine_cs *ring = ringbuf->ring; |
@@ -781,6 +932,19 @@ static int logical_ring_prepare(struct intel_ringbuffer *ringbuf, int bytes) | |||
781 | return 0; | 932 | return 0; |
782 | } | 933 | } |
783 | 934 | ||
935 | /** | ||
936 | * intel_logical_ring_begin() - prepare the logical ringbuffer to accept some commands | ||
937 | * | ||
938 | * @ringbuf: Logical ringbuffer. | ||
939 | * @num_dwords: number of DWORDs that we plan to write to the ringbuffer. | ||
940 | * | ||
941 | * The ringbuffer might not be ready to accept the commands right away (maybe it needs to | ||
942 | * be wrapped, or wait a bit for the tail to be updated). This function takes care of that | ||
943 | * and also preallocates a request (every workload submission is still mediated through | ||
944 | * requests, same as it did with legacy ringbuffer submission). | ||
945 | * | ||
946 | * Return: non-zero if the ringbuffer is not ready to be written to. | ||
947 | */ | ||
784 | int intel_logical_ring_begin(struct intel_ringbuffer *ringbuf, int num_dwords) | 948 | int intel_logical_ring_begin(struct intel_ringbuffer *ringbuf, int num_dwords) |
785 | { | 949 | { |
786 | struct intel_engine_cs *ring = ringbuf->ring; | 950 | struct intel_engine_cs *ring = ringbuf->ring; |
@@ -1021,6 +1185,12 @@ static int gen8_emit_request(struct intel_ringbuffer *ringbuf) | |||
1021 | return 0; | 1185 | return 0; |
1022 | } | 1186 | } |
1023 | 1187 | ||
1188 | /** | ||
1189 | * intel_logical_ring_cleanup() - deallocate the Engine Command Streamer | ||
1190 | * | ||
1191 | * @ring: Engine Command Streamer. | ||
1192 | * | ||
1193 | */ | ||
1024 | void intel_logical_ring_cleanup(struct intel_engine_cs *ring) | 1194 | void intel_logical_ring_cleanup(struct intel_engine_cs *ring) |
1025 | { | 1195 | { |
1026 | struct drm_i915_private *dev_priv = ring->dev->dev_private; | 1196 | struct drm_i915_private *dev_priv = ring->dev->dev_private; |
@@ -1215,6 +1385,16 @@ static int logical_vebox_ring_init(struct drm_device *dev) | |||
1215 | return logical_ring_init(dev, ring); | 1385 | return logical_ring_init(dev, ring); |
1216 | } | 1386 | } |
1217 | 1387 | ||
1388 | /** | ||
1389 | * intel_logical_rings_init() - allocate, populate and init the Engine Command Streamers | ||
1390 | * @dev: DRM device. | ||
1391 | * | ||
1392 | * This function inits the engines for an Execlists submission style (the equivalent in the | ||
1393 | * legacy ringbuffer submission world would be i915_gem_init_rings). It does it only for | ||
1394 | * those engines that are present in the hardware. | ||
1395 | * | ||
1396 | * Return: non-zero if the initialization failed. | ||
1397 | */ | ||
1218 | int intel_logical_rings_init(struct drm_device *dev) | 1398 | int intel_logical_rings_init(struct drm_device *dev) |
1219 | { | 1399 | { |
1220 | struct drm_i915_private *dev_priv = dev->dev_private; | 1400 | struct drm_i915_private *dev_priv = dev->dev_private; |
@@ -1377,6 +1557,14 @@ populate_lr_context(struct intel_context *ctx, struct drm_i915_gem_object *ctx_o | |||
1377 | return 0; | 1557 | return 0; |
1378 | } | 1558 | } |
1379 | 1559 | ||
1560 | /** | ||
1561 | * intel_lr_context_free() - free the LRC specific bits of a context | ||
1562 | * @ctx: the LR context to free. | ||
1563 | * | ||
1564 | * The real context freeing is done in i915_gem_context_free: this only | ||
1565 | * takes care of the bits that are LRC related: the per-engine backing | ||
1566 | * objects and the logical ringbuffer. | ||
1567 | */ | ||
1380 | void intel_lr_context_free(struct intel_context *ctx) | 1568 | void intel_lr_context_free(struct intel_context *ctx) |
1381 | { | 1569 | { |
1382 | int i; | 1570 | int i; |
@@ -1415,6 +1603,19 @@ static uint32_t get_lr_context_size(struct intel_engine_cs *ring) | |||
1415 | return ret; | 1603 | return ret; |
1416 | } | 1604 | } |
1417 | 1605 | ||
1606 | /** | ||
1607 | * intel_lr_context_deferred_create() - create the LRC specific bits of a context | ||
1608 | * @ctx: LR context to create. | ||
1609 | * @ring: engine to be used with the context. | ||
1610 | * | ||
1611 | * This function can be called more than once, with different engines, if we plan | ||
1612 | * to use the context with them. The context backing objects and the ringbuffers | ||
1613 | * (especially the ringbuffer backing objects) consume a lot of memory, and that's why | ||
1614 | * the creation is a deferred call: it's better to make sure first that we need to use | ||
1615 | * a given ring with the context. | ||
1616 | * | ||
1617 | * Return: non-zero on error. | ||
1618 | */ | ||
1418 | int intel_lr_context_deferred_create(struct intel_context *ctx, | 1619 | int intel_lr_context_deferred_create(struct intel_context *ctx, |
1419 | struct intel_engine_cs *ring) | 1620 | struct intel_engine_cs *ring) |
1420 | { | 1621 | { |
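The deferred-creation idea documented above (populate per-engine state only when a context is first used with that engine, and allow repeated calls) can be illustrated with a small standalone sketch. Everything here is hypothetical: the sketch_ names, the fixed engine count, and the use of calloc() in place of GEM object allocation are assumptions made for illustration, not the driver's implementation.

```c
#include <stdlib.h>

#define SKETCH_NUM_ENGINES 4	/* assumption: engine count for the sketch only */

/* Hypothetical per-engine slot inside a context. */
struct sketch_engine_state {
	void *backing;	/* stand-in for the context backing object */
	void *ringbuf;	/* stand-in for the logical ringbuffer */
};

struct sketch_context {
	struct sketch_engine_state engine[SKETCH_NUM_ENGINES];
};

/* Returns 0 on success, non-zero on error; safe to call again for the same engine. */
static int sketch_deferred_create(struct sketch_context *ctx, unsigned int id)
{
	struct sketch_engine_state *st;

	if (id >= SKETCH_NUM_ENGINES)
		return -1;

	st = &ctx->engine[id];
	if (st->backing)	/* already populated by an earlier call */
		return 0;

	st->backing = calloc(1, 4096);
	if (!st->backing)
		return -1;

	st->ringbuf = calloc(1, 4096);
	if (!st->ringbuf) {
		/* Undo the partial allocation so the slot stays blank. */
		free(st->backing);
		st->backing = NULL;
		return -1;
	}
	return 0;
}
```

A zero-initialized struct sketch_context plays the role of the "hollow" context: slots for engines that never see a submission stay NULL and cost no memory.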
diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h index 117d1a4eb3b9..991d4499fb03 100644 --- a/drivers/gpu/drm/i915/intel_lrc.h +++ b/drivers/gpu/drm/i915/intel_lrc.h | |||
@@ -38,10 +38,21 @@ int intel_logical_rings_init(struct drm_device *dev); | |||
38 | 38 | ||
39 | int logical_ring_flush_all_caches(struct intel_ringbuffer *ringbuf); | 39 | int logical_ring_flush_all_caches(struct intel_ringbuffer *ringbuf); |
40 | void intel_logical_ring_advance_and_submit(struct intel_ringbuffer *ringbuf); | 40 | void intel_logical_ring_advance_and_submit(struct intel_ringbuffer *ringbuf); |
41 | /** | ||
42 | * intel_logical_ring_advance() - advance the ringbuffer tail | ||
43 | * @ringbuf: Ringbuffer to advance. | ||
44 | * | ||
45 | * The tail is only updated in our logical ringbuffer struct. | ||
46 | */ | ||
41 | static inline void intel_logical_ring_advance(struct intel_ringbuffer *ringbuf) | 47 | static inline void intel_logical_ring_advance(struct intel_ringbuffer *ringbuf) |
42 | { | 48 | { |
43 | ringbuf->tail &= ringbuf->size - 1; | 49 | ringbuf->tail &= ringbuf->size - 1; |
44 | } | 50 | } |
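The masking in intel_logical_ring_advance() only wraps correctly when the ringbuffer size is a power of two. The sketch below shows the effect in isolation; struct sketch_ringbuf and sketch_emit_and_advance() are hypothetical, and the 4-byte increment per DWORD is an assumption about how the tail grows, since the body of the emit helper is not visible in this hunk.

```c
#include <stdint.h>

/* Hypothetical minimal ringbuffer state; only the fields the sketch needs. */
struct sketch_ringbuf {
	uint32_t tail;	/* byte offset of the next write */
	uint32_t size;	/* total size in bytes; must be a power of two */
};

/* Account for one DWORD, then wrap the tail with the same mask as above. */
static void sketch_emit_and_advance(struct sketch_ringbuf *rb)
{
	rb->tail += 4;			/* assumption: one DWORD is 4 bytes of ring space */
	rb->tail &= rb->size - 1;	/* equals tail % size when size is a power of two */
}
```

For a 4096-byte ring, a tail of 4092 wraps to 0 after one more DWORD, while a tail of 8 simply moves to 12.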
51 | /** | ||
52 | * intel_logical_ring_emit() - write a DWORD to the ringbuffer. | ||
53 | * @ringbuf: Ringbuffer to write to. | ||
54 | * @data: DWORD to write. | ||
55 | */ | ||
45 | static inline void intel_logical_ring_emit(struct intel_ringbuffer *ringbuf, | 56 | static inline void intel_logical_ring_emit(struct intel_ringbuffer *ringbuf, |
46 | u32 data) | 57 | u32 data) |
47 | { | 58 | { |
@@ -66,6 +77,25 @@ int intel_execlists_submission(struct drm_device *dev, struct drm_file *file, | |||
66 | u64 exec_start, u32 flags); | 77 | u64 exec_start, u32 flags); |
67 | u32 intel_execlists_ctx_id(struct drm_i915_gem_object *ctx_obj); | 78 | u32 intel_execlists_ctx_id(struct drm_i915_gem_object *ctx_obj); |
68 | 79 | ||
80 | /** | ||
81 | * struct intel_ctx_submit_request - queued context submission request | ||
82 | * @ctx: Context to submit to the ELSP. | ||
83 | * @ring: Engine to submit it to. | ||
84 | * @tail: how far in the context's ringbuffer this request goes to. | ||
85 | * @execlist_link: link in the submission queue. | ||
86 | * @work: workqueue for processing this request in a bottom half. | ||
87 | * @elsp_submitted: no. of times this request has been sent to the ELSP. | ||
88 | * | ||
89 | * The ELSP only accepts two elements at a time, so we queue context/tail | ||
90 | * pairs on a given queue (ring->execlist_queue) until the hardware is | ||
91 | * available. The queue serves a double purpose: we also use it to keep track | ||
92 | * of the up to 2 contexts currently in the hardware (usually one in execution | ||
93 | * and the other queued up by the GPU): We only remove elements from the head | ||
94 | * of the queue when the hardware informs us that an element has been | ||
95 | * completed. | ||
96 | * | ||
97 | * All accesses to the queue are mediated by a spinlock (ring->execlist_lock). | ||
98 | */ | ||
69 | struct intel_ctx_submit_request { | 99 | struct intel_ctx_submit_request { |
70 | struct intel_context *ctx; | 100 | struct intel_context *ctx; |
71 | struct intel_engine_cs *ring; | 101 | struct intel_engine_cs *ring; |