From 889271dc04a1912d25f1f1ff35c3e4cb67be415e Mon Sep 17 00:00:00 2001
From: Seema Khowala <seemaj@nvidia.com>
Date: Fri, 1 Mar 2019 11:48:32 -0800
Subject: gpu: nvgpu: change err to info print if failing eng id is -1

In handle_sched_error, print at info level instead of err level when
the failing eng id is returned as -1, i.e. FIFO_INVAL_ENGINE_ID,
meaning no engine was found busy doing ctxsw. The ctxsw may already
have finished for the context that triggered the ctxsw timeout intr.

Possible Causes:
a)
On engine reset, h/w drops ctxsw_status to INVALID in the
fifo_engine_status register. While the engine is held in reset, h/w
also passes busy/idle straight through. The fifo_engine_status
registers are correct in that no context switch is outstanding,
because the CTXSW is aborted when reset is asserted.
This is a side effect of how ctxsw_timeout behaves on gv100 and
earlier.
With gv11b and later, h/w snaps the context at the point of error
so that s/w can see the tsg_id which caused the HW timeout.
b)
If the engines are not busy and the ctxsw state is valid, the intr
occurred in the past. If the ctxsw state has moved on from LOAD or
SAVE to VALID, whatever timed out eventually finished anyway. The
problem is that s/w cannot conclude which context caused the timeout,
because more switches may have occurred before the intr was handled.

Bug 2092051
Bug 2429295
Bug 2484211
Bug 1890287

Change-Id: Ia79bee6e860fb179ee39024c963671d4f8245227
Signed-off-by: Seema Khowala <seemaj@nvidia.com>
Reviewed-on: https://git-master.nvidia.com/r/2030866
Signed-off-by: Debarshi Dutta <ddutta@nvidia.com>
(cherry-picked from d27f875d2c7839d3b1ec7db80d83594509ff2ea8
in dev-kernel)
Reviewed-on: https://git-master.nvidia.com/r/2076126
Reviewed-by: mobile promotions <svcmobile_promotions@nvidia.com>
Tested-by: mobile promotions <svcmobile_promotions@nvidia.com>
---
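Note (not part of the commit): the lookup that yields
FIFO_INVAL_ENGINE_ID can be pictured with the minimal C sketch below.
It is an illustration under stated assumptions, not the nvgpu code:
every sketch_* name, the enum values, and the struct layout are
hypothetical, and the real driver reads fifo_engine_status registers
rather than an array. The sketch treats an engine as failing only if it
is busy and its ctxsw status is LOAD or SAVE, so both causes described
in the commit message (status INVALID during reset, or status already
back at VALID) make it return the invalid id.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for FIFO_INVAL_ENGINE_ID; not the nvgpu macro. */
#define SKETCH_INVAL_ENGINE_ID UINT32_MAX

/* Hypothetical model of the ctxsw status field of one engine. */
enum sketch_ctxsw_status {
	SKETCH_CTXSW_INVALID,	/* e.g. engine held in reset (cause a) */
	SKETCH_CTXSW_VALID,	/* no switch in flight (cause b) */
	SKETCH_CTXSW_LOAD,
	SKETCH_CTXSW_SAVE
};

/* Hypothetical snapshot of one engine's status. */
struct sketch_engine_status {
	bool busy;
	enum sketch_ctxsw_status ctxsw;
};

/*
 * Return the index of the first engine that is busy with a context
 * load or save in flight, or SKETCH_INVAL_ENGINE_ID if there is none.
 */
uint32_t sketch_find_ctxsw_engine(const struct sketch_engine_status *st,
				  size_t count)
{
	size_t i;

	assert(st != NULL || count == 0);

	for (i = 0; i < count; i++) {
		bool in_ctxsw = st[i].ctxsw == SKETCH_CTXSW_LOAD ||
				st[i].ctxsw == SKETCH_CTXSW_SAVE;

		if (st[i].busy && in_ctxsw)
			return (uint32_t)i;
	}

	/* No engine is mid-switch: nothing to blame for the timeout. */
	return SKETCH_INVAL_ENGINE_ID;
}
```

With one engine reporting INVALID and another idle at VALID,
sketch_find_ctxsw_engine() returns SKETCH_INVAL_ENGINE_ID, which
corresponds to the case this patch downgrades to an info print.
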
 drivers/gpu/nvgpu/gk20a/fifo_gk20a.c | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/drivers/gpu/nvgpu/gk20a/fifo_gk20a.c b/drivers/gpu/nvgpu/gk20a/fifo_gk20a.c
index dbed9880..78f777ae 100644
--- a/drivers/gpu/nvgpu/gk20a/fifo_gk20a.c
+++ b/drivers/gpu/nvgpu/gk20a/fifo_gk20a.c
@@ -2410,6 +2410,34 @@ bool gk20a_fifo_handle_sched_error(struct gk20a *g)
 	sched_error = gk20a_readl(g, fifo_intr_sched_error_r());
 
 	engine_id = gk20a_fifo_get_failing_engine_data(g, &id, &is_tsg);
+	/*
+	 * Could not find the engine
+	 * Possible Causes:
+	 * a)
+	 * On engine reset, h/w drops ctxsw_status to INVALID in the
+	 * fifo_engine_status register. While the engine is held in reset,
+	 * h/w also passes busy/idle straight through. The fifo_engine_status
+	 * registers are correct in that no context switch is outstanding,
+	 * because the CTXSW is aborted when reset is asserted.
+	 * This is a side effect of how ctxsw_timeout behaves on gv100 and
+	 * earlier.
+	 * With gv11b and later, h/w snaps the context at the point of error
+	 * so that s/w can see the tsg_id which caused the HW timeout.
+	 * b)
+	 * If the engines are not busy and the ctxsw state is valid, the intr
+	 * occurred in the past. If the ctxsw state has moved on from LOAD
+	 * or SAVE to VALID, whatever timed out eventually finished
+	 * anyway. The problem is that s/w cannot conclude which context
+	 * caused the timeout, as more switches may have occurred before
+	 * the intr was handled.
+	 */
+	if (engine_id == FIFO_INVAL_ENGINE_ID) {
+		nvgpu_info(g, "fifo sched error: 0x%08x, failed to find engine "
+				"that is busy doing ctxsw. "
+				"ctxsw may have already finished", sched_error);
+		ret = false;
+		goto err;
+	}
 
 	/* could not find the engine - should never happen */
 	if (!gk20a_fifo_is_valid_engine_id(g, engine_id)) {
-- 
cgit v1.2.2