path: root/drivers/gpu/drm/i915/i915_gem.c
author    Chris Wilson <chris@chris-wilson.co.uk>    2016-07-01 12:23:15 -0400
committer Chris Wilson <chris@chris-wilson.co.uk>    2016-07-01 15:58:43 -0400
commit    688e6c7258164de86d626e8e983ca8d28015c263 (patch)
tree      ea2040fd06199f2335f7431ca2506cd4e7757fe5 /drivers/gpu/drm/i915/i915_gem.c
parent    1f15b76f1ec973d1eb5d21b6d98b21aebb9025f1 (diff)
drm/i915: Slaughter the thundering i915_wait_request herd
One particularly stressful scenario consists of many independent tasks all competing for GPU time and waiting upon the results (e.g. realtime transcoding of many, many streams). One bottleneck in particular is that each client waits on its own results, but every client is woken up after every batchbuffer - hence the thunder of hooves as then every client must do its heavyweight dance to read a coherent seqno to see if it is the lucky one.

Ideally, we only want one client to wake up after the interrupt and check its request for completion. Since the requests must retire in order, we can select the first client on the oldest request to be woken. Once that client has completed its wait, we can then wake up the next client and so on. However, all clients then incur latency as every process in the chain may be delayed for scheduling - this may also then cause some priority inversion. To reduce the latency, when a client is added or removed from the list, we scan the tree for completed seqno and wake up all the completed waiters in parallel.

Using igt/benchmarks/gem_latency, we can demonstrate this effect. The benchmark measures the number of GPU cycles between completion of a batch and the client waking up from a call to wait-ioctl. With many concurrent waiters, each on a different request, we observe that the wakeup latency before the patch scales nearly linearly with the number of waiters (before external factors kick in and make the scaling much worse). After applying the patch, we can see that only the single waiter for the request is being woken up, providing a constant wakeup latency for every operation. However, the situation is not quite as rosy for many waiters on the same request, though to the best of my knowledge this is much less likely in practice. Here, we can observe that the concurrent waiters incur extra latency from being woken up by the solitary bottom-half rather than directly by the interrupt. This appears to be scheduler-induced (having discounted adverse effects from having an rbtree walk/erase in the wakeup path): each additional wake_up_process() costs approximately 1us on big core. Another effect of performing the secondary wakeups from the first bottom-half is the delay this imposes on high-priority threads, rather than immediately returning to userspace and leaving the interrupt handler to wake the others.

To offset the delay incurred with additional waiters on a request, we could use a hybrid scheme that did a quick read in the interrupt handler and dequeued all the completed waiters (incurring the overhead in the interrupt handler, not the best plan either as we then incur GPU submission latency), but we would still have to wake up the bottom-half every time to do the heavyweight slow read. Or we could only kick the waiters on the seqno with the same priority as the current task (i.e. in the realtime-waiter scenario, only it is woken up immediately by the interrupt and simply queues the next waiter before returning to userspace, minimising its delay at the expense of the chain, and also reducing contention on its scheduler runqueue). This is effective at avoiding long pauses in the interrupt handler and at avoiding the extra latency in realtime/high-priority waiters.

v2: Convert from a kworker per engine into a dedicated kthread for the bottom-half.
v3: Rename request members and tweak comments.
v4: Use a per-engine spinlock in the breadcrumbs bottom-half.
v5: Fix race in locklessly checking waiter status and kicking the task on adding a new waiter.
v6: Fix deciding when to force the timer to hide missing interrupts.
v7: Move the bottom-half from the kthread to the first client process.
v8: Reword a few comments.
v9: Break the busy loop when the interrupt is unmasked or has fired.
v10: Comments, unnecessary churn, better debugging from Tvrtko.
v11: Wake all completed waiters on removing the current bottom-half to reduce the latency of waking up a herd of clients all waiting on the same request.
v12: Rearrange missed-interrupt fault injection so that it works with igt/drv_missed_irq_hang.
v13: Rename intel_breadcrumb and friends to intel_wait in preparation for signal handling.
v14: RCU commentary, assert_spin_locked.
v15: Hide BUG_ON behind the compiler; report on gem_latency findings.
v16: Sort seqno-groups by priority so that the first waiter has the highest task priority (and so avoid priority inversion).
v17: Add waiters to post-mortem GPU hang state.
v18: Return early for a completed wait after acquiring the spinlock. Avoids adding ourselves to the tree if the wait is already complete, and skips the awkward question of why we don't do completion wakeups for waits earlier than or equal to ourselves.
v19: Prepare for init_breadcrumbs to fail. Later patches may want to allocate during init, so be prepared to propagate back the error code.

Testcase: igt/gem_concurrent_blit
Testcase: igt/benchmarks/gem_latency
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin@intel.com>
Cc: "Gong, Zhipeng" <zhipeng.gong@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Dave Gordon <david.s.gordon@intel.com>
Cc: "Goel, Akash" <akash.goel@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> #v18
Link: http://patchwork.freedesktop.org/patch/msgid/1467390209-3576-6-git-send-email-chris@chris-wilson.co.uk
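To make the wakeup chain described above concrete, here is a minimal userspace sketch using plain POSIX threads. It is not the driver code: every name in it (struct waiter, add_waiter, irq_handler, wait_for_seqno) is illustrative, standing in for the helpers the patch actually adds (intel_wait_init(), intel_engine_add_wait(), intel_engine_remove_wait(), intel_kick_waiters()). The idea it demonstrates is only the core one: waiters queue in seqno order, the interrupt wakes only the oldest waiter, and that waiter hands the wakeup on to the next one when its request has completed.

/*
 * Minimal pthreads sketch of the single-wakeup chain -- not the kernel
 * code, all names illustrative.  The "interrupt" signals only the oldest
 * waiter; each waiter passes the wakeup down the chain when it retires.
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

struct waiter {
        unsigned int seqno;             /* request this thread waits for */
        pthread_cond_t cond;            /* signalled when it is our turn */
        struct waiter *next;            /* singly-linked list, sorted by seqno */
};

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static struct waiter *first_waiter;     /* oldest waiter, i.e. the bottom-half */
static unsigned int hw_seqno;           /* last seqno the "GPU" completed */

/* Insert a waiter so the list stays sorted by seqno. */
static void add_waiter(struct waiter *w)
{
        struct waiter **p = &first_waiter;

        while (*p && (*p)->seqno <= w->seqno)
                p = &(*p)->next;
        w->next = *p;
        *p = w;
}

static void remove_waiter(struct waiter *w)
{
        struct waiter **p = &first_waiter;

        while (*p != w)
                p = &(*p)->next;
        *p = w->next;
}

/* The (simulated) interrupt handler: wake one waiter, not the whole herd. */
static void irq_handler(unsigned int completed)
{
        pthread_mutex_lock(&lock);
        hw_seqno = completed;
        if (first_waiter)
                pthread_cond_signal(&first_waiter->cond);
        pthread_mutex_unlock(&lock);
}

static void wait_for_seqno(unsigned int seqno)
{
        struct waiter self = { .seqno = seqno };

        pthread_cond_init(&self.cond, NULL);
        pthread_mutex_lock(&lock);
        add_waiter(&self);
        /*
         * Sleep until the wakeup chain reaches us.  The interrupt only
         * signals the oldest waiter, and since requests retire in order
         * the chain is guaranteed to get here once our seqno has passed.
         */
        while (hw_seqno < seqno)
                pthread_cond_wait(&self.cond, &lock);
        remove_waiter(&self);
        if (first_waiter)       /* hand the wakeup down the chain */
                pthread_cond_signal(&first_waiter->cond);
        pthread_mutex_unlock(&lock);
        pthread_cond_destroy(&self.cond);
}

static void *client(void *arg)
{
        unsigned int seqno = (unsigned int)(unsigned long)arg;

        wait_for_seqno(seqno);
        printf("client for seqno %u woken\n", seqno);
        return NULL;
}

int main(void)
{
        pthread_t threads[4];
        unsigned long i;

        for (i = 0; i < 4; i++)
                pthread_create(&threads[i], NULL, client, (void *)(i + 1));

        for (i = 1; i <= 4; i++) {      /* the "GPU" retires one request at a time */
                usleep(10000);
                irq_handler(i);
        }

        for (i = 0; i < 4; i++)
                pthread_join(threads[i], NULL);
        return 0;
}

The driver itself keeps its waiters in a per-engine rbtree under a spinlock, with the first waiter acting as the bottom-half that performs the coherent seqno read on behalf of the others; the diff below shows only the i915_gem.c side of that change.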
Diffstat (limited to 'drivers/gpu/drm/i915/i915_gem.c')
-rw-r--r--  drivers/gpu/drm/i915/i915_gem.c  143
1 file changed, 53 insertions, 90 deletions
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index b5278d117ea0..c9814572e346 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1343,17 +1343,6 @@ i915_gem_check_wedge(unsigned reset_counter, bool interruptible)
         return 0;
 }
 
-static void fake_irq(unsigned long data)
-{
-        wake_up_process((struct task_struct *)data);
-}
-
-static bool missed_irq(struct drm_i915_private *dev_priv,
-                       struct intel_engine_cs *engine)
-{
-        return test_bit(engine->id, &dev_priv->gpu_error.missed_irq_rings);
-}
-
 static unsigned long local_clock_us(unsigned *cpu)
 {
         unsigned long t;
@@ -1386,7 +1375,7 @@ static bool busywait_stop(unsigned long timeout, unsigned cpu)
         return this_cpu != cpu;
 }
 
-static int __i915_spin_request(struct drm_i915_gem_request *req, int state)
+static bool __i915_spin_request(struct drm_i915_gem_request *req, int state)
 {
         unsigned long timeout;
         unsigned cpu;
@@ -1401,17 +1390,14 @@ static int __i915_spin_request(struct drm_i915_gem_request *req, int state)
          * takes to sleep on a request, on the order of a microsecond.
          */
 
-        if (req->engine->irq_refcount)
-                return -EBUSY;
-
         /* Only spin if we know the GPU is processing this request */
         if (!i915_gem_request_started(req, true))
-                return -EAGAIN;
+                return false;
 
         timeout = local_clock_us(&cpu) + 5;
-        while (!need_resched()) {
+        do {
                 if (i915_gem_request_completed(req, true))
-                        return 0;
+                        return true;
 
                 if (signal_pending_state(state, current))
                         break;
@@ -1420,12 +1406,9 @@ static int __i915_spin_request(struct drm_i915_gem_request *req, int state)
                         break;
 
                 cpu_relax_lowlatency();
-        }
-
-        if (i915_gem_request_completed(req, false))
-                return 0;
+        } while (!need_resched());
 
-        return -EAGAIN;
+        return false;
 }
 
 /**
@@ -1450,18 +1433,14 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
                         s64 *timeout,
                         struct intel_rps_client *rps)
 {
-        struct intel_engine_cs *engine = i915_gem_request_get_engine(req);
-        struct drm_i915_private *dev_priv = req->i915;
-        const bool irq_test_in_progress =
-                ACCESS_ONCE(dev_priv->gpu_error.test_irq_rings) & intel_engine_flag(engine);
         int state = interruptible ? TASK_INTERRUPTIBLE : TASK_UNINTERRUPTIBLE;
         DEFINE_WAIT(reset);
-        DEFINE_WAIT(wait);
-        unsigned long timeout_expire;
+        struct intel_wait wait;
+        unsigned long timeout_remain;
         s64 before = 0; /* Only to silence a compiler warning. */
-        int ret;
+        int ret = 0;
 
-        WARN(!intel_irqs_enabled(dev_priv), "IRQs disabled");
+        might_sleep();
 
         if (list_empty(&req->list))
                 return 0;
@@ -1469,7 +1448,7 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
         if (i915_gem_request_completed(req, true))
                 return 0;
 
-        timeout_expire = 0;
+        timeout_remain = MAX_SCHEDULE_TIMEOUT;
         if (timeout) {
                 if (WARN_ON(*timeout < 0))
                         return -EINVAL;
@@ -1477,7 +1456,7 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
                 if (*timeout == 0)
                         return -ETIME;
 
-                timeout_expire = jiffies + nsecs_to_jiffies_timeout(*timeout);
+                timeout_remain = nsecs_to_jiffies_timeout(*timeout);
 
                 /*
                  * Record current time in case interrupted by signal, or wedged.
@@ -1485,55 +1464,32 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
                 before = ktime_get_raw_ns();
         }
 
-        if (INTEL_INFO(dev_priv)->gen >= 6)
-                gen6_rps_boost(dev_priv, rps, req->emitted_jiffies);
-
         trace_i915_gem_request_wait_begin(req);
 
-        /* Optimistic spin for the next jiffie before touching IRQs */
-        ret = __i915_spin_request(req, state);
-        if (ret == 0)
-                goto out;
+        if (INTEL_INFO(req->i915)->gen >= 6)
+                gen6_rps_boost(req->i915, rps, req->emitted_jiffies);
 
-        if (!irq_test_in_progress && WARN_ON(!engine->irq_get(engine))) {
-                ret = -ENODEV;
-                goto out;
-        }
+        /* Optimistic spin for the next ~jiffie before touching IRQs */
+        if (__i915_spin_request(req, state))
+                goto complete;
 
-        add_wait_queue(&dev_priv->gpu_error.wait_queue, &reset);
-        for (;;) {
-                struct timer_list timer;
+        set_current_state(state);
+        add_wait_queue(&req->i915->gpu_error.wait_queue, &reset);
 
-                prepare_to_wait(&engine->irq_queue, &wait, state);
-
-                /* We need to check whether any gpu reset happened in between
-                 * the request being submitted and now. If a reset has occurred,
-                 * the seqno will have been advance past ours and our request
-                 * is complete. If we are in the process of handling a reset,
-                 * the request is effectively complete as the rendering will
-                 * be discarded, but we need to return in order to drop the
-                 * struct_mutex.
+        intel_wait_init(&wait, req->seqno);
+        if (intel_engine_add_wait(req->engine, &wait))
+                /* In order to check that we haven't missed the interrupt
+                 * as we enabled it, we need to kick ourselves to do a
+                 * coherent check on the seqno before we sleep.
                  */
-                if (i915_reset_in_progress(&dev_priv->gpu_error)) {
-                        ret = 0;
-                        break;
-                }
-
-                if (i915_gem_request_completed(req, false)) {
-                        ret = 0;
-                        break;
-                }
+                goto wakeup;
 
+        for (;;) {
                 if (signal_pending_state(state, current)) {
                         ret = -ERESTARTSYS;
                         break;
                 }
 
-                if (timeout && time_after_eq(jiffies, timeout_expire)) {
-                        ret = -ETIME;
-                        break;
-                }
-
                 /* Ensure that even if the GPU hangs, we get woken up.
                  *
                  * However, note that if no one is waiting, we never notice
@@ -1541,32 +1497,33 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
                  * held by the GPU and so trigger a hangcheck. In the most
                  * pathological case, this will be upon memory starvation!
                  */
-                i915_queue_hangcheck(dev_priv);
-
-                timer.function = NULL;
-                if (timeout || missed_irq(dev_priv, engine)) {
-                        unsigned long expire;
+                i915_queue_hangcheck(req->i915);
 
-                        setup_timer_on_stack(&timer, fake_irq, (unsigned long)current);
-                        expire = missed_irq(dev_priv, engine) ? jiffies + 1 : timeout_expire;
-                        mod_timer(&timer, expire);
+                timeout_remain = io_schedule_timeout(timeout_remain);
+                if (timeout_remain == 0) {
+                        ret = -ETIME;
+                        break;
                 }
 
-                io_schedule();
-
-                if (timer.function) {
-                        del_singleshot_timer_sync(&timer);
-                        destroy_timer_on_stack(&timer);
-                }
-        }
-        remove_wait_queue(&dev_priv->gpu_error.wait_queue, &reset);
+                if (intel_wait_complete(&wait))
+                        break;
 
-        if (!irq_test_in_progress)
-                engine->irq_put(engine);
+                set_current_state(state);
 
-        finish_wait(&engine->irq_queue, &wait);
+wakeup:
+                /* Carefully check if the request is complete, giving time
+                 * for the seqno to be visible following the interrupt.
+                 * We also have to check in case we are kicked by the GPU
+                 * reset in order to drop the struct_mutex.
+                 */
+                if (__i915_request_irq_complete(req))
+                        break;
+        }
+        remove_wait_queue(&req->i915->gpu_error.wait_queue, &reset);
 
-out:
+        intel_engine_remove_wait(req->engine, &wait);
+        __set_current_state(TASK_RUNNING);
+complete:
         trace_i915_gem_request_wait_end(req);
 
         if (timeout) {
@@ -2796,6 +2753,12 @@ i915_gem_init_seqno(struct drm_i915_private *dev_priv, u32 seqno)
         }
         i915_gem_retire_requests(dev_priv);
 
+        /* If the seqno wraps around, we need to clear the breadcrumb rbtree */
+        if (!i915_seqno_passed(seqno, dev_priv->next_seqno)) {
+                while (intel_kick_waiters(dev_priv))
+                        yield();
+        }
+
         /* Finally reset hw state */
         for_each_engine(engine, dev_priv)
                 intel_ring_init_seqno(engine, seqno);