litmus-rt.git - The LITMUS^RT kernel.

	Commit message (Collapse)	Author	Age
*	mm: numa: Limit NUMA scanning to migrate-on-fault VMAs	Mel Gorman	2013-10-09
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	There is a 90% regression observed with a large Oracle performance test on a 4 node system. Profiles indicated that the overhead was due to contention on sp_lock when looking up shared memory policies. These policies do not have the appropriate flags to allow them to be automatically balanced so trapping faults on them is pointless. This patch skips VMAs that do not have MPOL_F_MOF set. [riel@redhat.com: Initial patch] Signed-off-by: Mel Gorman <mgorman@suse.de> Reported-and-tested-by: Joe Mario <jmario@redhat.com> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-32-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
*	sched/numa: Do not migrate memory immediately after switching node	Rik van Riel	2013-10-09
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The load balancer can move tasks between nodes and does not take NUMA locality into account. With automatic NUMA balancing this may result in the tasks working set being migrated to the new node. However, as the fault buffer will still store faults from the old node the schduler may decide to reset the preferred node and migrate the task back resulting in more migrations. The ideal would be that the scheduler did not migrate tasks with a heavy memory footprint but this may result nodes being overloaded. We could also discard the fault information on task migration but this would still cause all the tasks working set to be migrated. This patch simply avoids migrating the memory for a short time after a task is migrated. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-31-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
*	sched/numa: Set preferred NUMA node based on number of private faults	Mel Gorman	2013-10-09
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Ideally it would be possible to distinguish between NUMA hinting faults that are private to a task and those that are shared. If treated identically there is a risk that shared pages bounce between nodes depending on the order they are referenced by tasks. Ultimately what is desirable is that task private pages remain local to the task while shared pages are interleaved between sharing tasks running on different nodes to give good average performance. This is further complicated by THP as even applications that partition their data may not be partitioning on a huge page boundary. To start with, this patch assumes that multi-threaded or multi-process applications partition their data and that in general the private accesses are more important for cpu->memory locality in the general case. Also, no new infrastructure is required to treat private pages properly but interleaving for shared pages requires additional infrastructure. To detect private accesses the pid of the last accessing task is required but the storage requirements are a high. This patch borrows heavily from Ingo Molnar's patch "numa, mm, sched: Implement last-CPU+PID hash tracking" to encode some bits from the last accessing task in the page flags as well as the node information. Collisions will occur but it is better than just depending on the node information. Node information is then used to determine if a page needs to migrate. The PID information is used to detect private/shared accesses. The preferred NUMA node is selected based on where the maximum number of approximately private faults were measured. Shared faults are not taken into consideration for a few reasons. First, if there are many tasks sharing the page then they'll all move towards the same node. The node will be compute overloaded and then scheduled away later only to bounce back again. Alternatively the shared tasks would just bounce around nodes because the fault information is effectively noise. Either way accounting for shared faults the same as private faults can result in lower performance overall. The second reason is based on a hypothetical workload that has a small number of very important, heavily accessed private pages but a large shared array. The shared array would dominate the number of faults and be selected as a preferred node even though it's the wrong decision. The third reason is that multiple threads in a process will race each other to fault the shared page making the fault information unreliable. Signed-off-by: Mel Gorman <mgorman@suse.de> [ Fix complication error when !NUMA_BALANCING. ] Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-30-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
*	sched/numa: Remove check that skips small VMAs	Mel Gorman	2013-10-09
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	task_numa_work skips small VMAs. At the time the logic was to reduce the scanning overhead which was considerable. It is a dubious hack at best. It would make much more sense to cache where faults have been observed and only rescan those regions during subsequent PTE scans. Remove this hack as motivation to do it properly in the future. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-29-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
*	mm: numa: Scan pages with elevated page_mapcount	Mel Gorman	2013-10-09
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently automatic NUMA balancing is unable to distinguish between false shared versus private pages except by ignoring pages with an elevated page_mapcount entirely. This avoids shared pages bouncing between the nodes whose task is using them but that is ignored quite a lot of data. This patch kicks away the training wheels in preparation for adding support for identifying shared/private pages is now in place. The ordering is so that the impact of the shared/private detection can be easily measured. Note that the patch does not migrate shared, file-backed within vmas marked VM_EXEC as these are generally shared library pages. Migrating such pages is not beneficial as there is an expectation they are read-shared between caches and iTLB and iCache pressure is generally low. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-28-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
*	sched/numa: Check current->mm before allocating NUMA faults	Mel Gorman	2013-10-09
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	task_numa_placement checks current->mm but after buffers for faults have already been uselessly allocated. Move the check earlier. [peterz@infradead.org: Identified the problem] Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-27-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
*	sched/numa: Add infrastructure for split shared/private accounting of NUMA ↵	Mel Gorman	2013-10-09
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	hinting faults Ideally it would be possible to distinguish between NUMA hinting faults that are private to a task and those that are shared. This patch prepares infrastructure for separately accounting shared and private faults by allocating the necessary buffers and passing in relevant information. For now, all faults are treated as private and detection will be introduced later. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-26-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
*	sched/numa: Reschedule task on preferred NUMA node once selected	Mel Gorman	2013-10-09
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	A preferred node is selected based on the node the most NUMA hinting faults was incurred on. There is no guarantee that the task is running on that node at the time so this patch rescheules the task to run on the most idle CPU of the selected node when selected. This avoids waiting for the balancer to make a decision. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-25-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
*	sched/numa: Resist moving tasks towards nodes with fewer hinting faults	Mel Gorman	2013-10-09
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Just as "sched: Favour moving tasks towards the preferred node" favours moving tasks towards nodes with a higher number of recorded NUMA hinting faults, this patch resists moving tasks towards nodes with lower faults. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-24-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
*	sched/numa: Favour moving tasks towards the preferred node	Mel Gorman	2013-10-09
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch favours moving tasks towards NUMA node that recorded a higher number of NUMA faults during active load balancing. Ideally this is self-reinforcing as the longer the task runs on that node, the more faults it should incur causing task_numa_placement to keep the task running on that node. In reality a big weakness is that the nodes CPUs can be overloaded and it would be more efficient to queue tasks on an idle node and migrate to the new node. This would require additional smarts in the balancer so for now the balancer will simply prefer to place the task on the preferred node for a PTE scans which is controlled by the numa_balancing_settle_count sysctl. Once the settle_count number of scans has complete the schedule is free to place the task on an alternative node if the load is imbalanced. [srikar@linux.vnet.ibm.com: Fixed statistics] Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> [ Tunable and use higher faults instead of preferred. ] Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-23-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
*	sched/numa: Update NUMA hinting faults once per scan	Mel Gorman	2013-10-09
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	NUMA hinting fault counts and placement decisions are both recorded in the same array which distorts the samples in an unpredictable fashion. The values linearly accumulate during the scan and then decay creating a sawtooth-like pattern in the per-node counts. It also means that placement decisions are time sensitive. At best it means that it is very difficult to state that the buffer holds a decaying average of past faulting behaviour. At worst, it can confuse the load balancer if it sees one node with an artifically high count due to very recent faulting activity and may create a bouncing effect. This patch adds a second array. numa_faults stores the historical data which is used for placement decisions. numa_faults_buffer holds the fault activity during the current scan window. When the scan completes, numa_faults decays and the values from numa_faults_buffer are copied across. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-22-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
*	sched/numa: Select a preferred node with the most numa hinting faults	Mel Gorman	2013-10-09
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch selects a preferred node for a task to run on based on the NUMA hinting faults. This information is later used to migrate tasks towards the node during balancing. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-21-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
*	sched/numa: Track NUMA hinting faults on per-node basis	Mel Gorman	2013-10-09
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch tracks what nodes numa hinting faults were incurred on. This information is later used to schedule a task on the node storing the pages most frequently faulted by the task. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-20-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
*	sched/numa: Slow scan rate if no NUMA hinting faults are being recorded	Mel Gorman	2013-10-09
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	NUMA PTE scanning slows if a NUMA hinting fault was trapped and no page was migrated. For long-lived but idle processes there may be no faults but the scan rate will be high and just waste CPU. This patch will slow the scan rate for processes that are not trapping faults. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-19-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
*	sched/numa: Set the scan rate proportional to the memory usage of the task ↵	Mel Gorman	2013-10-09
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	being scanned The NUMA PTE scan rate is controlled with a combination of the numa_balancing_scan_period_min, numa_balancing_scan_period_max and numa_balancing_scan_size. This scan rate is independent of the size of the task and as an aside it is further complicated by the fact that numa_balancing_scan_size controls how many pages are marked pte_numa and not how much virtual memory is scanned. In combination, it is almost impossible to meaningfully tune the min and max scan periods and reasoning about performance is complex when the time to complete a full scan is is partially a function of the tasks memory size. This patch alters the semantic of the min and max tunables to be about tuning the length time it takes to complete a scan of a tasks occupied virtual address space. Conceptually this is a lot easier to understand. There is a "sanity" check to ensure the scan rate is never extremely fast based on the amount of virtual memory that should be scanned in a second. The default of 2.5G seems arbitrary but it is to have the maximum scan rate after the patch roughly match the maximum scan rate before the patch was applied. On a similar note, numa_scan_period is in milliseconds and not jiffies. Properly placed pages slow the scanning rate but adding 10 jiffies to numa_scan_period means that the rate scanning slows depends on HZ which is confusing. Get rid of the jiffies_to_msec conversion and treat it as ms. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-18-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
*	sched/numa: Initialise numa_next_scan properly	Mel Gorman	2013-10-09
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Scan delay logic and resets are currently initialised to start scanning immediately instead of delaying properly. Initialise them properly at fork time and catch when a new mm has been allocated. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-17-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
*	Revert "mm: sched: numa: Delay PTE scanning until a task is scheduled on a ↵	Mel Gorman	2013-10-09
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	new node" PTE scanning and NUMA hinting fault handling is expensive so commit 5bca2303 ("mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node") deferred the PTE scan until a task had been scheduled on another node. The problem is that in the purely shared memory case that this may never happen and no NUMA hinting fault information will be captured. We are not ruling out the possibility that something better can be done here but for now, this patch needs to be reverted and depend entirely on the scan_delay to avoid punishing short-lived processes. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-16-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
*	sched/numa: Continue PTE scanning even if migrate rate limited	Peter Zijlstra	2013-10-09
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Avoiding marking PTEs pte_numa because a particular NUMA node is migrate rate limited sees like a bad idea. Even if this node can't migrate anymore other nodes might and we want up-to-date information to do balance decisions. We already rate limit the actual migrations, this should leave enough bandwidth to allow the non-migrating scanning. I think its important we keep up-to-date information if we're going to do placement based on it. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/1381141781-10992-15-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
*	sched/numa: Mitigate chance that same task always updates PTEs	Peter Zijlstra	2013-10-09
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	With a trace_printk("working\n"); right after the cmpxchg in task_numa_work() we can see that of a 4 thread process, its always the same task winning the race and doing the protection change. This is a problem since the task doing the protection change has a penalty for taking faults -- it is busy when marking the PTEs. If its always the same task the ->numa_faults[] get severely skewed. Avoid this by delaying the task doing the protection change such that it is unlikely to win the privilege again. Before: root@interlagos:~# grep "thread 0/.working" /debug/tracing/trace \| tail -15 thread 0/0-3232 [022] .... 212.787402: task_numa_work: working thread 0/0-3232 [022] .... 212.888473: task_numa_work: working thread 0/0-3232 [022] .... 212.989538: task_numa_work: working thread 0/0-3232 [022] .... 213.090602: task_numa_work: working thread 0/0-3232 [022] .... 213.191667: task_numa_work: working thread 0/0-3232 [022] .... 213.292734: task_numa_work: working thread 0/0-3232 [022] .... 213.393804: task_numa_work: working thread 0/0-3232 [022] .... 213.494869: task_numa_work: working thread 0/0-3232 [022] .... 213.596937: task_numa_work: working thread 0/0-3232 [022] .... 213.699000: task_numa_work: working thread 0/0-3232 [022] .... 213.801067: task_numa_work: working thread 0/0-3232 [022] .... 213.903155: task_numa_work: working thread 0/0-3232 [022] .... 214.005201: task_numa_work: working thread 0/0-3232 [022] .... 214.107266: task_numa_work: working thread 0/0-3232 [022] .... 214.209342: task_numa_work: working After: root@interlagos:~# grep "thread 0/.working" /debug/tracing/trace \| tail -15 thread 0/0-3253 [005] .... 136.865051: task_numa_work: working thread 0/2-3255 [026] .... 136.965134: task_numa_work: working thread 0/3-3256 [024] .... 137.065217: task_numa_work: working thread 0/3-3256 [024] .... 137.165302: task_numa_work: working thread 0/3-3256 [024] .... 137.265382: task_numa_work: working thread 0/0-3253 [004] .... 137.366465: task_numa_work: working thread 0/2-3255 [026] .... 137.466549: task_numa_work: working thread 0/0-3253 [004] .... 137.566629: task_numa_work: working thread 0/0-3253 [004] .... 137.666711: task_numa_work: working thread 0/1-3254 [028] .... 137.766799: task_numa_work: working thread 0/0-3253 [004] .... 137.866876: task_numa_work: working thread 0/2-3255 [026] .... 137.966960: task_numa_work: working thread 0/1-3254 [028] .... 138.067041: task_numa_work: working thread 0/2-3255 [026] .... 138.167123: task_numa_work: working thread 0/3-3256 [024] .... 138.267207: task_numa_work: working Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/1381141781-10992-14-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
*	mm: numa: Do not migrate or account for hinting faults on the zero page	Mel Gorman	2013-10-09
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The zero page is not replicated between nodes and is often shared between processes. The data is read-only and likely to be cached in local CPUs if heavily accessed meaning that the remote memory access cost is less of a concern. This patch prevents trapping faults on the zero pages. For tasks using the zero page this will reduce the number of PTE updates, TLB flushes and hinting faults. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> [ Correct use of is_huge_zero_page] Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-13-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
*	mm: Only flush TLBs if a transhuge PMD is modified for NUMA pte scanning	Mel Gorman	2013-10-09
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	NUMA PTE scanning is expensive both in terms of the scanning itself and the TLB flush if there are any updates. The TLB flush is avoided if no PTEs are updated but there is a bug where transhuge PMDs are considered to be updated even if they were already pmd_numa. This patch addresses the problem and TLB flushes should be reduced. Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-12-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
*	mm: Do not flush TLB during protection change if !pte_present && ↵	Mel Gorman	2013-10-09
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	!migration_entry NUMA PTE scanning is expensive both in terms of the scanning itself and the TLB flush if there are any updates. Currently non-present PTEs are accounted for as an update and incurring a TLB flush where it is only necessary for anonymous migration entries. This patch addresses the problem and should reduce TLB flushes. Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-11-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
*	mm: Account for a THP NUMA hinting update as one PTE update	Mel Gorman	2013-10-09
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	A THP PMD update is accounted for as 512 pages updated in vmstat. This is large difference when estimating the cost of automatic NUMA balancing and can be misleading when comparing results that had collapsed versus split THP. This patch addresses the accounting issue. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-10-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
*	mm: Close races between THP migration and PMD numa clearing	Mel Gorman	2013-10-09
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	THP migration uses the page lock to guard against parallel allocations but there are cases like this still open Task A Task B --------------------- --------------------- do_huge_pmd_numa_page do_huge_pmd_numa_page lock_page mpol_misplaced == -1 unlock_page goto clear_pmdnuma lock_page mpol_misplaced == 2 migrate_misplaced_transhuge pmd = pmd_mknonnuma set_pmd_at During hours of testing, one crashed with weird errors and while I have no direct evidence, I suspect something like the race above happened. This patch extends the page lock to being held until the pmd_numa is cleared to prevent migration starting in parallel while the pmd_numa is being cleared. It also flushes the old pmd entry and orders pagetable insertion before rmap insertion. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-9-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
*	mm: numa: Sanitize task_numa_fault() callsites	Mel Gorman	2013-10-09
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	There are three callers of task_numa_fault(): - do_huge_pmd_numa_page(): Accounts against the current node, not the node where the page resides, unless we migrated, in which case it accounts against the node we migrated to. - do_numa_page(): Accounts against the current node, not the node where the page resides, unless we migrated, in which case it accounts against the node we migrated to. - do_pmd_numa_page(): Accounts not at all when the page isn't migrated, otherwise accounts against the node we migrated towards. This seems wrong to me; all three sites should have the same sementaics, furthermore we should accounts against where the page really is, we already know where the task is. So modify all three sites to always account; we did after all receive the fault; and always account to where the page is after migration, regardless of success. They all still differ on when they clear the PTE/PMD; ideally that would get sorted too. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-8-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
*	mm: Prevent parallel splits during THP migration	Mel Gorman	2013-10-09
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	THP migrations are serialised by the page lock but on its own that does not prevent THP splits. If the page is split during THP migration then the pmd_same checks will prevent page table corruption but the unlock page and other fix-ups potentially will cause corruption. This patch takes the anon_vma lock to prevent parallel splits during migration. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-7-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
*	mm: Wait for THP migrations to complete during NUMA hinting faults	Mel Gorman	2013-10-09
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The locking for migrating THP is unusual. While normal page migration prevents parallel accesses using a migration PTE, THP migration relies on a combination of the page_table_lock, the page lock and the existance of the NUMA hinting PTE to guarantee safety but there is a bug in the scheme. If a THP page is currently being migrated and another thread traps a fault on the same page it checks if the page is misplaced. If it is not, then pmd_numa is cleared. The problem is that it checks if the page is misplaced without holding the page lock meaning that the racing thread can be migrating the THP when the second thread clears the NUMA bit and faults a stale page. This patch checks if the page is potentially being migrated and stalls using the lock_page if it is potentially being migrated before checking if the page is misplaced or not. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-6-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
*	mm: numa: Do not account for a hinting fault if we raced	Mel Gorman	2013-10-09
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	If another task handled a hinting fault in parallel then do not double account for it. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-5-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
*	sched/numa: Fix comments	Peter Zijlstra	2013-10-09
\| \| \| \| \| \| \| \| \| \| \| \| \|	Fix a 80 column violation and a PTE vs PMD reference. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/1381141781-10992-4-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
*	mm: numa: Document automatic NUMA balancing sysctls	Mel Gorman	2013-10-09
\| \| \| \| \| \| \| \| \| \| \|	Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-3-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
*	Merge tag 'v3.12-rc4' into sched/core	Ingo Molnar	2013-10-09
\|\ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Merge Linux v3.12-rc4 to fix a conflict and also to refresh the tree before applying more scheduler patches. Conflicts: arch/avr32/include/asm/Kbuild Signed-off-by: Ingo Molnar <mingo@kernel.org>
\| *	Linux 3.12-rc4	Linus Torvalds	2013-10-06
\| \|
\| *	net: Update the sysctl permissions handler to test effective uid/gid	Eric W. Biederman	2013-10-06
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Modify the code to use current_euid(), and in_egroup_p, as in done in fs/proc/proc_sysctl.c:test_perm() Cc: stable@vger.kernel.org Reviewed-by: Eric Sandeen <sandeen@redhat.com> Reported-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
\| *	Merge git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending	Linus Torvalds	2013-10-06
\| \|\ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Pull SCSI target fixes from Nicholas Bellinger: "Here are the outstanding target fixes queued up for v3.12-rc4 code. The highlights include: - Make vhost/scsi tag percpu_ida_alloc() use GFP_ATOMIC - Allow sess_cmd_map allocation failure fallback to use vzalloc - Fix COMPARE_AND_WRITE se_cmd->data_length bug with FILEIO backends - Fixes for COMPARE_AND_WRITE callback recursive failure OOPs + non zero scsi_status bug - Make iscsi-target do acknowledgement tag release from RX context - Setup iscsi-target with extra (cmdsn_depth / 2) percpu_ida tags Also included is a iscsi-target patch CC'ed for v3.10+ that avoids legacy wait_for_task=true release during fast-past StatSN acknowledgement, and two other SRP target related patches that address long-standing issues that are CC'ed for v3.3+. Extra thanks to Thomas Glanzmann for his testing feedback with COMPARE_AND_WRITE + EXTENDED_COPY VAAI logic" * git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending: iscsi-target; Allow an extra tag_num / 2 number of percpu_ida tags iscsi-target: Perform release of acknowledged tags from RX context iscsi-target: Only perform wait_for_tasks when performing shutdown target: Fail on non zero scsi_status in compare_and_write_callback target: Fix recursive COMPARE_AND_WRITE callback failure target: Reset data_length for COMPARE_AND_WRITE to NoLB * block_size ib_srpt: always set response for task management target: Fall back to vzalloc upon ->sess_cmd_map kzalloc failure vhost/scsi: Use GFP_ATOMIC with percpu_ida_alloc for obtaining tag ib_srpt: Destroy cm_id before destroying QP. target: Fix xop->dbl assignment in target_xcopy_parse_segdesc_02
\| \| *	iscsi-target; Allow an extra tag_num / 2 number of percpu_ida tags	Nicholas Bellinger	2013-10-03
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch bumps the default number of tags allocated per session by iscsi-target via transport_alloc_session_tags() -> percpu_ida_init() by another (tag_num / 2). This is done to take into account the tags waiting to be acknowledged and released in iscsit_ack_from_expstatsn(), but who's number are not directly limited by the CmdSN Window queue_depth being enforced by the target. Using a larger value here is also useful to prevent percpu_ida_alloc() from having to steal tags from other CPUs when no tags are available on the local CPU, while waiting for unacknowledged tags to be released. Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
\| \| *	iscsi-target: Perform release of acknowledged tags from RX context	Nicholas Bellinger	2013-10-03
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch converts iscsit_ack_from_expstatsn() to populate a local ack_list of commands, and call iscsit_free_cmd() directly from RX thread context, instead of using iscsit_add_cmd_to_immediate_queue() to queue the acknowledged commands to be released from TX thread context. It is helpful to release the acknowledge commands as quickly as possible, along with the associated percpu_ida tags, in order to prevent percpu_ida_alloc() from having to steal tags from other CPUs while waiting for iscsit_free_cmd() to happen from TX thread context. Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
\| \| *	iscsi-target: Only perform wait_for_tasks when performing shutdown	Nicholas Bellinger	2013-10-03
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch changes transport_generic_free_cmd() to only wait_for_tasks when shutdown=true is passed to iscsit_free_cmd(). With the advent of >= v3.10 iscsi-target code using se_cmd->cmd_kref, the extra wait_for_tasks with shutdown=false is unnecessary, and may end up causing an extra context switch when releasing WRITEs. Cc: stable@vger.kernel.org # 3.10+ Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
\| \| *	target: Fail on non zero scsi_status in compare_and_write_callback	Nicholas Bellinger	2013-10-03
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch addresses a bug for backends such as IBLOCK that perform asynchronous completion via transport_complete_cmd(), that will call target_complete_failure_work() -> transport_generic_request_failure(), upon exception status and invoke cmd->transport_complete_callback() -> compare_and_write_callback() incorrectly during the failure case. It adds a check for a non zero se_cmd->scsi_status within the first invocation of compare_and_write_callback(), and will jump to out plus up se_device->caw_sem before exiting the callback. Reported-by: Thomas Glanzmann <thomas@glanzmann.de> Tested-by: Thomas Glanzmann <thomas@glanzmann.de> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
\| \| *	target: Fix recursive COMPARE_AND_WRITE callback failure	Nicholas Bellinger	2013-10-03
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch addresses a bug when compare_and_write_callback() invoked from target_complete_ok_work() hits an failure from __target_execute_cmd() -> cmd->execute_cmd(), that ends up calling transport_generic_request_failure() -> compare_and_write_post(), thus causing SCF_COMPARE_AND_WRITE_POST to incorrectly be set. The result of this bug is that target_complete_ok_work() no longer hits the if (!rc && !(cmd->se_cmd_flags & SCF_COMPARE_AND_WRITE_POST) check that forces an immediate return, and instead double completes the se_cmd in question, triggering an OOPs in the process. This patch changes compare_and_write_post() to only set this bit when a failure has not already occured to ensure the immediate return from within target_complete_ok_work(), and thus allow transport_generic_request_failure() to handle the sending of the CHECK_CONDITION exception status. Reported-by: Thomas Glanzmann <thomas@glanzmann.de> Tested-by: Thomas Glanzmann <thomas@glanzmann.de> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
\| \| *	target: Reset data_length for COMPARE_AND_WRITE to NoLB * block_size	Nicholas Bellinger	2013-10-03
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch resets se_cmd->data_length for COMPARE_AND_WRITE emulation within sbc_compare_and_write() to NoLB * block_size in order to address a bug with FILEIO backends where a I/O failure will occur when data_length does not match the I/O size being actually dispatched for the individual per block READs + WRITEs. This is done late enough in sbc_compare_and_write() after the memory allocations have occured in transport_generic_new_cmd() to not cause any unwanted side-effects. Reported-by: Thomas Glanzmann <thomas@glanzmann.de> Tested-by: Thomas Glanzmann <thomas@glanzmann.de> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
\| \| *	ib_srpt: always set response for task management	Jack Wang	2013-10-03
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The SRP specification requires: "Response data shall be provided in any SRP_RSP response that is sent in response to an SRP_TSK_MGMT request (see 6.7). The information in the RSP_CODE field (see table 24) shall indicate the completion status of the task management function." So fix this to avoid the SRP initiator interprets task management functions that succeeded as failed. Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com> Cc: stable@vger.kernel.org # 3.3+ Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
\| \| *	target: Fall back to vzalloc upon ->sess_cmd_map kzalloc failure	Nicholas Bellinger	2013-10-02
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch changes transport_alloc_session_tags() to fall back to use vzalloc when kzalloc fails for big tag_num that end up generating larger order allocations. Also use is_vmalloc_addr() in transport_alloc_session_tags() failure path, and normal transport_free_session() path to determine when vfree() needs to be called instead of kfree(). v2 changes: - Use __GFP_NOWARN \| __GFP_REPEAT for sess_cmd_map kzalloc (mst) Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Asias He <asias@redhat.com> Cc: Kent Overstreet <kmo@daterainc.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
\| \| *	vhost/scsi: Use GFP_ATOMIC with percpu_ida_alloc for obtaining tag	Nicholas Bellinger	2013-10-02
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Fix GFP_KERNEL -> GFP_ATOMIC usage of percpu_ida_alloc() within vhost_scsi_get_tag(), as this code is expected to be called directly from interrupt context. v2 changes: - Handle possible tag < 0 failure with GFP_ATOMIC Acked-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Asias He <asias@redhat.com> Cc: Kent Overstreet <kmo@daterainc.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
\| \| *	ib_srpt: Destroy cm_id before destroying QP.	Nicholas Bellinger	2013-10-02
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch fixes a bug where ib_destroy_cm_id() was incorrectly being called after srpt_destroy_ch_ib() had destroyed the active QP. This would result in the following failed SRP_LOGIN_REQ messages: Received SRP_LOGIN_REQ with i_port_id 0x0:0x2590ffff1762bd, t_port_id 0x2c903009f8f40:0x2c903009f8f40 and it_iu_len 260 on port 1 (guid=0xfe80000000000000:0x2c903009f8f41) Received SRP_LOGIN_REQ with i_port_id 0x0:0x2590ffff1758f9, t_port_id 0x2c903009f8f40:0x2c903009f8f40 and it_iu_len 260 on port 2 (guid=0xfe80000000000000:0x2c903009f8f42) Received SRP_LOGIN_REQ with i_port_id 0x0:0x2590ffff175941, t_port_id 0x2c903009f8f40:0x2c903009f8f40 and it_iu_len 260 on port 2 (guid=0xfe80000000000000:0x2c90300a3cfb2) Received SRP_LOGIN_REQ with i_port_id 0x0:0x2590ffff176299, t_port_id 0x2c903009f8f40:0x2c903009f8f40 and it_iu_len 260 on port 1 (guid=0xfe80000000000000:0x2c90300a3cfb1) mlx4_core 0000:84:00.0: command 0x19 failed: fw status = 0x9 rejected SRP_LOGIN_REQ because creating a new RDMA channel failed. Received SRP_LOGIN_REQ with i_port_id 0x0:0x2590ffff176299, t_port_id 0x2c903009f8f40:0x2c903009f8f40 and it_iu_len 260 on port 1 (guid=0xfe80000000000000:0x2c90300a3cfb1) mlx4_core 0000:84:00.0: command 0x19 failed: fw status = 0x9 rejected SRP_LOGIN_REQ because creating a new RDMA channel failed. Received SRP_LOGIN_REQ with i_port_id 0x0:0x2590ffff176299, t_port_id 0x2c903009f8f40:0x2c903009f8f40 and it_iu_len 260 on port 1 (guid=0xfe80000000000000:0x2c90300a3cfb1) Reported-by: Navin Ahuja <navin.ahuja@saratoga-speed.com> Cc: stable@vger.kernel.org # 3.3+ Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
\| \| *	target: Fix xop->dbl assignment in target_xcopy_parse_segdesc_02	Nicholas Bellinger	2013-10-02
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch fixes up an incorrect assignment for xop->dbl within target_xcopy_parse_segdesc_02() code, as reported by Coverity here: http://marc.info/?l=linux-kernel&m=137936416618490&w=2 Reported-by: Dave Jones <davej@redhat.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
\| * \|	Merge branch 'fixes' of git://git.infradead.org/users/vkoul/slave-dma	Linus Torvalds	2013-10-06
\| \|\ \ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Pull slave-dmaengine fixes from Vinod Koul: "Here is the slave dmanegine fixes. We have the fix for deadlock issue on imx-dma by Michael and Josh's edma config fix along with author change" * 'fixes' of git://git.infradead.org/users/vkoul/slave-dma: dmaengine: imx-dma: fix callback path in tasklet dmaengine: imx-dma: fix lockdep issue between irqhandler and tasklet dmaengine: imx-dma: fix slow path issue in prep_dma_cyclic dma/Kconfig: Make TI_EDMA select TI_PRIV_EDMA edma: Update author email address
\| \| * \|	dmaengine: imx-dma: fix callback path in tasklet	Michael Grzeschik	2013-10-04
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We need to free the ld_active list head before jumping into the callback routine. Otherwise the callback could run into issue_pending and change our ld_active list head we just going to free. This will run the channel list into an currupted and undefined state. Signed-off-by: Michael Grzeschik <m.grzeschik@pengutronix.de> Signed-off-by: Vinod Koul <vinod.koul@intel.com>
\| \| * \|	dmaengine: imx-dma: fix lockdep issue between irqhandler and tasklet	Michael Grzeschik	2013-10-04
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The tasklet and irqhandler are using spin_lock while other routines are using spin_lock_irqsave/restore. This leads to lockdep issues as described bellow. This patch is changing the code to use spinlock_irq_save/restore in both code pathes. As imxdma_xfer_desc always gets called with spin_lock_irqsave lock held, this patch also removes the spare call inside the routine to avoid double locking. [ 403.358162] ================================= [ 403.362549] [ INFO: inconsistent lock state ] [ 403.366945] 3.10.0-20130823+ #904 Not tainted [ 403.371331] --------------------------------- [ 403.375721] inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage. [ 403.381769] swapper/0 [HC0[0]:SC1[1]:HE1:SE0] takes: [ 403.386762] (&(&imxdma->lock)->rlock){?.-...}, at: [<c019d77c>] imxdma_tasklet+0x20/0x134 [ 403.395201] {IN-HARDIRQ-W} state was registered at: [ 403.400108] [<c004b264>] mark_lock+0x2a0/0x6b4 [ 403.404798] [<c004d7c8>] __lock_acquire+0x650/0x1a64 [ 403.410004] [<c004f15c>] lock_acquire+0x94/0xa8 [ 403.414773] [<c02f74e4>] _raw_spin_lock+0x54/0x8c [ 403.419720] [<c019d094>] dma_irq_handler+0x78/0x254 [ 403.424845] [<c0061124>] handle_irq_event_percpu+0x38/0x1b4 [ 403.430670] [<c00612e4>] handle_irq_event+0x44/0x64 [ 403.435789] [<c0063a70>] handle_level_irq+0xd8/0xf0 [ 403.440903] [<c0060a20>] generic_handle_irq+0x28/0x38 [ 403.446194] [<c0009cc4>] handle_IRQ+0x68/0x8c [ 403.450789] [<c0008714>] avic_handle_irq+0x3c/0x48 [ 403.455811] [<c0008f84>] __irq_svc+0x44/0x74 [ 403.460314] [<c0040b04>] cpu_startup_entry+0x88/0xf4 [ 403.465525] [<c02f00d0>] rest_init+0xb8/0xe0 [ 403.470045] [<c03e07dc>] start_kernel+0x28c/0x2d4 [ 403.474986] [<a0008040>] 0xa0008040 [ 403.478709] irq event stamp: 50854 [ 403.482140] hardirqs last enabled at (50854): [<c001c6b8>] tasklet_action+0x38/0xdc [ 403.489954] hardirqs last disabled at (50853): [<c001c6a0>] tasklet_action+0x20/0xdc [ 403.497761] softirqs last enabled at (50850): [<c001bc64>] _local_bh_enable+0x14/0x18 [ 403.505741] softirqs last disabled at (50851): [<c001c268>] irq_exit+0x88/0xdc [ 403.513026] [ 403.513026] other info that might help us debug this: [ 403.519593] Possible unsafe locking scenario: [ 403.519593] [ 403.525548] CPU0 [ 403.528020] ---- [ 403.530491] lock(&(&imxdma->lock)->rlock); [ 403.534828] <Interrupt> [ 403.537474] lock(&(&imxdma->lock)->rlock); [ 403.541983] [ 403.541983] * DEADLOCK * [ 403.541983] [ 403.547951] no locks held by swapper/0. [ 403.551813] [ 403.551813] stack backtrace: [ 403.556222] CPU: 0 PID: 0 Comm: swapper Not tainted 3.10.0-20130823+ #904 [ 403.563039] Backtrace: [ 403.565581] [<c000b98c>] (dump_backtrace+0x0/0x10c) from [<c000bb28>] (show_stack+0x18/0x1c) [ 403.574054] r6:00000000 r5:c05c51d8 r4:c040bd58 r3:00200000 [ 403.579872] [<c000bb10>] (show_stack+0x0/0x1c) from [<c02f398c>] (dump_stack+0x20/0x28) [ 403.587955] [<c02f396c>] (dump_stack+0x0/0x28) from [<c02f29c8>] (print_usage_bug.part.28+0x224/0x28c) [ 403.597340] [<c02f27a4>] (print_usage_bug.part.28+0x0/0x28c) from [<c004b404>] (mark_lock+0x440/0x6b4) [ 403.606682] r8:c004a41c r7:00000000 r6:c040bd58 r5:c040c040 r4:00000002 [ 403.613566] [<c004afc4>] (mark_lock+0x0/0x6b4) from [<c004d844>] (__lock_acquire+0x6cc/0x1a64) [ 403.622244] [<c004d178>] (__lock_acquire+0x0/0x1a64) from [<c004f15c>] (lock_acquire+0x94/0xa8) [ 403.631010] [<c004f0c8>] (lock_acquire+0x0/0xa8) from [<c02f74e4>] (_raw_spin_lock+0x54/0x8c) [ 403.639614] [<c02f7490>] (_raw_spin_lock+0x0/0x8c) from [<c019d77c>] (imxdma_tasklet+0x20/0x134) [ 403.648434] r6:c3847010 r5:c040e890 r4:c38470d4 [ 403.653194] [<c019d75c>] (imxdma_tasklet+0x0/0x134) from [<c001c70c>] (tasklet_action+0x8c/0xdc) [ 403.662013] r8:c0599160 r7:00000000 r6:00000000 r5:c040e890 r4:c3847114 r3:c019d75c [ 403.670042] [<c001c680>] (tasklet_action+0x0/0xdc) from [<c001bd4c>] (__do_softirq+0xe4/0x1f0) [ 403.678687] r7:00000101 r6:c0402000 r5:c059919c r4:00000001 [ 403.684498] [<c001bc68>] (__do_softirq+0x0/0x1f0) from [<c001c268>] (irq_exit+0x88/0xdc) [ 403.692652] [<c001c1e0>] (irq_exit+0x0/0xdc) from [<c0009cc8>] (handle_IRQ+0x6c/0x8c) [ 403.700514] r4:00000030 r3:00000110 [ 403.704192] [<c0009c5c>] (handle_IRQ+0x0/0x8c) from [<c0008714>] (avic_handle_irq+0x3c/0x48) [ 403.712664] r5:c0403f28 r4:c0593ebc [ 403.716343] [<c00086d8>] (avic_handle_irq+0x0/0x48) from [<c0008f84>] (__irq_svc+0x44/0x74) [ 403.724733] Exception stack(0xc0403f28 to 0xc0403f70) [ 403.729841] 3f20: 00000001 00000004 00000000 20000013 c0402000 c04104a8 [ 403.738078] 3f40: 00000002 c0b69620 a0004000 41069264 a03fb5f4 c0403f7c c0403f40 c0403f70 [ 403.746301] 3f60: c004b92c c0009e74 20000013 ffffffff [ 403.751383] r6:ffffffff r5:20000013 r4:c0009e74 r3:c004b92c [ 403.757210] [<c0009e30>] (arch_cpu_idle+0x0/0x4c) from [<c0040b04>] (cpu_startup_entry+0x88/0xf4) [ 403.766161] [<c0040a7c>] (cpu_startup_entry+0x0/0xf4) from [<c02f00d0>] (rest_init+0xb8/0xe0) [ 403.774753] [<c02f0018>] (rest_init+0x0/0xe0) from [<c03e07dc>] (start_kernel+0x28c/0x2d4) [ 403.783051] r6:c03fc484 r5:ffffffff r4:c040a0e0 [ 403.787797] [<c03e0550>] (start_kernel+0x0/0x2d4) from [<a0008040>] (0xa0008040) Signed-off-by: Michael Grzeschik <m.grzeschik@pengutronix.de> Signed-off-by: Vinod Koul <vinod.koul@intel.com>
\| \| * \|	dmaengine: imx-dma: fix slow path issue in prep_dma_cyclic	Michael Grzeschik	2013-10-04
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When perparing cyclic_dma buffers by the sound layer, it will dump the following lockdep trace. The leading snd_pcm_action_single get called with read_lock_irq called. To fix this, we change the kcalloc call from GFP_KERNEL to GFP_ATOMIC. WARNING: at kernel/lockdep.c:2740 lockdep_trace_alloc+0xcc/0x114() DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags)) Modules linked in: CPU: 0 PID: 832 Comm: aplay Not tainted 3.11.0-20130823+ #903 Backtrace: [<c000b98c>] (dump_backtrace+0x0/0x10c) from [<c000bb28>] (show_stack+0x18/0x1c) r6:c004c090 r5:00000009 r4:c2e0bd18 r3:00404000 [<c000bb10>] (show_stack+0x0/0x1c) from [<c02f397c>] (dump_stack+0x20/0x28) [<c02f395c>] (dump_stack+0x0/0x28) from [<c001531c>] (warn_slowpath_common+0x54/0x70) [<c00152c8>] (warn_slowpath_common+0x0/0x70) from [<c00153dc>] (warn_slowpath_fmt+0x38/0x40) r8:00004000 r7:a3b90000 r6:000080d0 r5:60000093 r4:c2e0a000 r3:00000009 [<c00153a4>] (warn_slowpath_fmt+0x0/0x40) from [<c004c090>] (lockdep_trace_alloc+0xcc/0x114) r3:c03955d8 r2:c03907db [<c004bfc4>] (lockdep_trace_alloc+0x0/0x114) from [<c008f16c>] (__kmalloc+0x34/0x118) r6:000080d0 r5:c3800120 r4:000080d0 r3:c040a0f8 [<c008f138>] (__kmalloc+0x0/0x118) from [<c019c95c>] (imxdma_prep_dma_cyclic+0x64/0x168) r7:a3b90000 r6:00000004 r5:c39d8420 r4:c3847150 [<c019c8f8>] (imxdma_prep_dma_cyclic+0x0/0x168) from [<c024618c>] (snd_dmaengine_pcm_trigger+0xa8/0x160) [<c02460e4>] (snd_dmaengine_pcm_trigger+0x0/0x160) from [<c0241fa8>] (soc_pcm_trigger+0x90/0xb4) r8:c058c7b0 r7:c3b8140c r6:c39da560 r5:00000001 r4:c3b81000 [<c0241f18>] (soc_pcm_trigger+0x0/0xb4) from [<c022ece4>] (snd_pcm_do_start+0x2c/0x38) r7:00000000 r6:00000003 r5:c058c7b0 r4:c3b81000 [<c022ecb8>] (snd_pcm_do_start+0x0/0x38) from [<c022e958>] (snd_pcm_action_single+0x40/0x6c) [<c022e918>] (snd_pcm_action_single+0x0/0x6c) from [<c022ea64>] (snd_pcm_action_lock_irq+0x7c/0x9c) r7:00000003 r6:c3b810f0 r5:c3b810f0 r4:c3b81000 [<c022e9e8>] (snd_pcm_action_lock_irq+0x0/0x9c) from [<c023009c>] (snd_pcm_common_ioctl1+0x7f8/0xfd0) r8:c3b7f888 r7:005407b8 r6:c2c991c0 r5:c3b81000 r4:c3b81000 r3:00004142 [<c022f8a4>] (snd_pcm_common_ioctl1+0x0/0xfd0) from [<c023117c>] (snd_pcm_playback_ioctl1+0x464/0x488) [<c0230d18>] (snd_pcm_playback_ioctl1+0x0/0x488) from [<c02311d4>] (snd_pcm_playback_ioctl+0x34/0x40) r8:c3b7f888 r7:00004142 r6:00000004 r5:c2c991c0 r4:005407b8 [<c02311a0>] (snd_pcm_playback_ioctl+0x0/0x40) from [<c00a14a4>] (vfs_ioctl+0x30/0x44) [<c00a1474>] (vfs_ioctl+0x0/0x44) from [<c00a1fe8>] (do_vfs_ioctl+0x55c/0x5c0) [<c00a1a8c>] (do_vfs_ioctl+0x0/0x5c0) from [<c00a208c>] (SyS_ioctl+0x40/0x68) [<c00a204c>] (SyS_ioctl+0x0/0x68) from [<c0009380>] (ret_fast_syscall+0x0/0x44) r8:c0009544 r7:00000036 r6:bedeaa58 r5:00000000 r4:000000c0 Signed-off-by: Michael Grzeschik <m.grzeschik@pengutronix.de> Signed-off-by: Vinod Koul <vinod.koul@intel.com>
\| \| * \|	dma/Kconfig: Make TI_EDMA select TI_PRIV_EDMA	Josh Boyer	2013-09-17
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The TI_EDMA driver unconditionally calls functions provided by the TI_PRIV_EDMA code, but it doesn't force that to be built-in. If that isn't otherwise enabled somewhere, you can get build errors like: linux-3.12.0-0.rc0.git1.1.fc21.armv7hl/drivers/dma/edma.c:593: undefined reference to `edma_free_slot' drivers/built-in.o: In function `edma_terminate_all': linux-3.12.0-0.rc0.git1.1.fc21.armv7hl/drivers/dma/edma.c:169: undefined reference to `edma_stop' drivers/built-in.o: In function `edma_execute': linux-3.12.0-0.rc0.git1.1.fc21.armv7hl/drivers/dma/edma.c:122: undefined reference to `edma_write_slot' linux-3.12.0-0.rc0.git1.1.fc21.armv7hl/drivers/dma/edma.c:149: undefined reference to `edma_link' linux-3.12.0-0.rc0.git1.1.fc21.armv7hl/drivers/dma/edma.c:152: undefined reference to `edma_start' Make TI_EDMA select TI_PRIV_EDMA to avoid this. Signed-off-by: Josh Boyer <jwboyer@fedoraproject.org> Acked-by: Matt Porter <matt.porter@linaro.org> Signed-off-by: Vinod Koul <vinod.koul@intel.com>