From 889271dc04a1912d25f1f1ff35c3e4cb67be415e Mon Sep 17 00:00:00 2001
From: Seema Khowala
Date: Fri, 1 Mar 2019 11:48:32 -0800
Subject: gpu: nvgpu: change err to info print if failing eng id is -1

For handle_sched_error, change the err print to an info print when the
failing eng id is returned as -1, i.e. FIFO_INVAL_ENGINE_ID, because no
engine was found busy doing ctxsw. The ctxsw may already have finished
for the context for which the ctxsw timeout intr was triggered.

Possible Causes:
a)
On hitting engine reset, h/w drops the ctxsw_status to INVALID in the
fifo_engine_status register. Also, while the engine is held in reset,
h/w passes busy/idle straight through. The fifo_engine_status registers
are correct in that there is no context switch outstanding, as the
CTXSW is aborted when reset is asserted.
This is just a side effect of how gv100 and earlier versions of
ctxsw_timeout behave.
With gv11b and later, h/w snaps the context at the point of error
so that s/w can see the tsg_id which caused the HW timeout.
b)
If engines are not busy and the ctxsw state is valid, then the intr
occurred in the past. If the ctxsw state has moved on to VALID from
LOAD or SAVE, whatever timed out eventually finished anyway. The
problem is that s/w cannot conclude which context caused the timeout,
as more switches may have occurred before the intr is handled.

Bug 2092051
Bug 2429295
Bug 2484211
Bug 1890287

Change-Id: Ia79bee6e860fb179ee39024c963671d4f8245227
Signed-off-by: Seema Khowala
Reviewed-on: https://git-master.nvidia.com/r/2030866
Signed-off-by: Debarshi Dutta
(cherry-picked from d27f875d2c7839d3b1ec7db80d83594509ff2ea8 in dev-kernel)
Reviewed-on: https://git-master.nvidia.com/r/2076126
Reviewed-by: mobile promotions
Tested-by: mobile promotions
---
 drivers/gpu/nvgpu/gk20a/fifo_gk20a.c | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/drivers/gpu/nvgpu/gk20a/fifo_gk20a.c b/drivers/gpu/nvgpu/gk20a/fifo_gk20a.c
index dbed9880..78f777ae 100644
--- a/drivers/gpu/nvgpu/gk20a/fifo_gk20a.c
+++ b/drivers/gpu/nvgpu/gk20a/fifo_gk20a.c
@@ -2410,6 +2410,34 @@ bool gk20a_fifo_handle_sched_error(struct gk20a *g)
 	sched_error = gk20a_readl(g, fifo_intr_sched_error_r());
 	engine_id = gk20a_fifo_get_failing_engine_data(g, &id, &is_tsg);
+	/*
+	 * Could not find the engine
+	 * Possible Causes:
+	 * a)
+	 * On hitting engine reset, h/w drops the ctxsw_status to INVALID in
+	 * fifo_engine_status register. Also while the engine is held in reset
+	 * h/w passes busy/idle straight through. fifo_engine_status registers
+	 * are correct in that there is no context switch outstanding
+	 * as the CTXSW is aborted when reset is asserted.
+	 * This is just a side effect of how gv100 and earlier versions of
+	 * ctxsw_timeout behave.
+	 * With gv11b and later, h/w snaps the context at the point of error
+	 * so that s/w can see the tsg_id which caused the HW timeout.
+	 * b)
+	 * If engines are not busy and ctxsw state is valid then intr occurred
+	 * in the past and if the ctxsw state has moved on to VALID from LOAD
+	 * or SAVE, it means that whatever timed out eventually finished
+	 * anyways. The problem with this is that s/w cannot conclude which
+	 * context caused the problem as maybe more switches occurred before
+	 * intr is handled.
+	 */
+	if (engine_id == FIFO_INVAL_ENGINE_ID) {
+		nvgpu_info(g, "fifo sched error: 0x%08x, failed to find engine "
+			"that is busy doing ctxsw. "
+			"May be ctxsw already happened", sched_error);
+		ret = false;
+		goto err;
+	}
 	/* could not find the engine - should never happen */
 	if (!gk20a_fifo_is_valid_engine_id(g, engine_id)) {
-- 
cgit v1.2.2
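
The two causes described in the commit message can be illustrated with a small standalone model. This is only a sketch under stated assumptions: every name below (the `sketch_` types, functions, and `SKETCH_INVAL_ENGINE_ID`) is hypothetical and not part of nvgpu, and the lookup rule "busy and in LOAD or SAVE" is an assumed simplification of what the driver reads from the fifo_engine_status registers. It shows why an engine held in reset (ctxsw state INVALID) or one whose switch already completed (state VALID) produces no failing engine, which the patch then reports at info level rather than err.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration; none of these names exist in nvgpu. */
#define SKETCH_INVAL_ENGINE_ID UINT32_MAX

enum sketch_ctxsw_state {
	SKETCH_CTXSW_INVALID,	/* cause a): engine in reset, switch aborted */
	SKETCH_CTXSW_VALID,	/* cause b): the switch finished before the intr was handled */
	SKETCH_CTXSW_LOAD,
	SKETCH_CTXSW_SAVE,
};

struct sketch_engine_status {
	bool busy;
	enum sketch_ctxsw_state ctxsw;
};

enum sketch_log_level { SKETCH_LOG_INFO, SKETCH_LOG_ERR };

/*
 * Assumed rule: an engine counts as failing only if it is busy and still
 * in the middle of a context switch. Returns the index of the first such
 * engine, or SKETCH_INVAL_ENGINE_ID if there is none.
 */
static uint32_t sketch_find_failing_engine(
		const struct sketch_engine_status *engines, size_t count)
{
	assert(engines != NULL || count == 0);

	for (size_t i = 0; i < count; i++) {
		bool switching = engines[i].ctxsw == SKETCH_CTXSW_LOAD ||
				 engines[i].ctxsw == SKETCH_CTXSW_SAVE;

		if (engines[i].busy && switching)
			return (uint32_t)i;
	}
	return SKETCH_INVAL_ENGINE_ID;
}

/* Mirrors the intent of the patch: no failing engine is informational only. */
static enum sketch_log_level sketch_sched_error_log_level(uint32_t engine_id)
{
	return engine_id == SKETCH_INVAL_ENGINE_ID ?
		SKETCH_LOG_INFO : SKETCH_LOG_ERR;
}
```

In this model an engine that is busy but INVALID (held in reset) and an idle engine that is VALID both fall through the loop, so the handler would log at info level and return without attempting recovery.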