author | Eric Paris <eparis@redhat.com> | 2014-03-07 11:41:32 -0500 |
---|---|---|
committer | Eric Paris <eparis@redhat.com> | 2014-03-07 11:41:32 -0500 |
commit | b7d3622a39fde7658170b7f3cf6c6889bb8db30d | |
tree | 64f4e781ecb2a85d675e234072b988560bcd25f1 /Documentation/sysctl | |
parent | f3411cb2b2e396a41ed3a439863f028db7140a34 | |
parent | d8ec26d7f8287f5788a494f56e8814210f0e64be | |
Merge tag 'v3.13' into for-3.15
Linux 3.13
Conflicts:
include/net/xfrm.h
Simple merge where v3.13 removed 'extern' from definitions and the audit
tree did s/u32/unsigned int/ to the same definitions.
Diffstat (limited to 'Documentation/sysctl')
-rw-r--r-- | Documentation/sysctl/kernel.txt | 101 |
-rw-r--r-- | Documentation/sysctl/vm.txt | 15 |
2 files changed, 104 insertions, 12 deletions
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 9d4c1d18ad44..26b7ee491df8 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -290,13 +290,24 @@ Default value is "/sbin/hotplug".
 kptr_restrict:
 
 This toggle indicates whether restrictions are placed on
-exposing kernel addresses via /proc and other interfaces. When
-kptr_restrict is set to (0), there are no restrictions. When
-kptr_restrict is set to (1), the default, kernel pointers
-printed using the %pK format specifier will be replaced with 0's
-unless the user has CAP_SYSLOG. When kptr_restrict is set to
-(2), kernel pointers printed using %pK will be replaced with 0's
-regardless of privileges.
+exposing kernel addresses via /proc and other interfaces.
+
+When kptr_restrict is set to (0), the default, there are no restrictions.
+
+When kptr_restrict is set to (1), kernel pointers printed using the %pK
+format specifier will be replaced with 0's unless the user has CAP_SYSLOG
+and effective user and group ids are equal to the real ids. This is
+because %pK checks are done at read() time rather than open() time, so
+if permissions are elevated between the open() and the read() (e.g via
+a setuid binary) then %pK will not leak kernel pointers to unprivileged
+users. Note, this is a temporary solution only. The correct long-term
+solution is to do the permission checks at open() time. Consider removing
+world read permissions from files that use %pK, and using dmesg_restrict
+to protect against uses of %pK in dmesg(8) if leaking kernel pointer
+values to unprivileged users is a concern.
+
+When kptr_restrict is set to (2), kernel pointers printed using
+%pK will be replaced with 0's regardless of privileges.
 
 ==============================================================
 
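The three kptr_restrict levels in the new text reduce to a small decision rule. The Python sketch below models that rule for illustration only. The `Reader` type and the `pk_pointer_visible` function are hypothetical names. The real check is performed by the kernel's %pK formatting code at read() time, and nothing in the kernel looks like this.

```python
from dataclasses import dataclass


@dataclass
class Reader:
    """Hypothetical description of the process reading a %pK-formatted value."""
    has_cap_syslog: bool
    uid: int   # real user id
    euid: int  # effective user id
    gid: int   # real group id
    egid: int  # effective group id


def pk_pointer_visible(kptr_restrict: int, reader: Reader) -> bool:
    """Return True if a %pK pointer would be shown as-is, False if zeroed.

    Models only the rules stated in the patched kernel.txt text; it is
    not the kernel's implementation.
    """
    if kptr_restrict == 0:
        # (0): documented default in the new text, no restrictions.
        return True
    if kptr_restrict == 1:
        # (1): needs CAP_SYSLOG *and* effective ids equal to the real ids,
        # so a setuid binary (euid != uid) does not see real pointers.
        ids_match = reader.euid == reader.uid and reader.egid == reader.gid
        return reader.has_cap_syslog and ids_match
    if kptr_restrict == 2:
        # (2): always replaced with 0's, regardless of privileges.
        return False
    raise ValueError(f"unsupported kptr_restrict value: {kptr_restrict}")


if __name__ == "__main__":
    root = Reader(has_cap_syslog=True, uid=0, euid=0, gid=0, egid=0)
    setuid = Reader(has_cap_syslog=True, uid=1000, euid=0, gid=1000, egid=1000)
    for level in (0, 1, 2):
        print(level, pk_pointer_visible(level, root), pk_pointer_visible(level, setuid))
```

Level 1 is the case this patch changes. The `setuid` reader holds CAP_SYSLOG, but its effective uid differs from its real uid. Under the new rule, its pointers are therefore zeroed.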
@@ -355,6 +366,82 @@ utilize.
 
 ==============================================================
 
+numa_balancing
+
+Enables/disables automatic page fault based NUMA memory
+balancing. Memory is moved automatically to nodes
+that access it often.
+
+Enables/disables automatic NUMA memory balancing. On NUMA machines, there
+is a performance penalty if remote memory is accessed by a CPU. When this
+feature is enabled the kernel samples what task thread is accessing memory
+by periodically unmapping pages and later trapping a page fault. At the
+time of the page fault, it is determined if the data being accessed should
+be migrated to a local memory node.
+
+The unmapping of pages and trapping faults incur additional overhead that
+ideally is offset by improved memory locality but there is no universal
+guarantee. If the target workload is already bound to NUMA nodes then this
+feature should be disabled. Otherwise, if the system overhead from the
+feature is too high then the rate the kernel samples for NUMA hinting
+faults may be controlled by the numa_balancing_scan_period_min_ms,
+numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms,
+numa_balancing_scan_size_mb, numa_balancing_settle_count sysctls and
+numa_balancing_migrate_deferred.
+
+==============================================================
+
+numa_balancing_scan_period_min_ms, numa_balancing_scan_delay_ms,
+numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb
+
+Automatic NUMA balancing scans tasks address space and unmaps pages to
+detect if pages are properly placed or if the data should be migrated to a
+memory node local to where the task is running. Every "scan delay" the task
+scans the next "scan size" number of pages in its address space. When the
+end of the address space is reached the scanner restarts from the beginning.
+
+In combination, the "scan delay" and "scan size" determine the scan rate.
+When "scan delay" decreases, the scan rate increases. The scan delay and
+hence the scan rate of every task is adaptive and depends on historical
+behaviour. If pages are properly placed then the scan delay increases,
+otherwise the scan delay decreases. The "scan size" is not adaptive but
+the higher the "scan size", the higher the scan rate.
+
+Higher scan rates incur higher system overhead as page faults must be
+trapped and potentially data must be migrated. However, the higher the scan
+rate, the more quickly a tasks memory is migrated to a local node if the
+workload pattern changes and minimises performance impact due to remote
+memory accesses. These sysctls control the thresholds for scan delays and
+the number of pages scanned.
+
+numa_balancing_scan_period_min_ms is the minimum time in milliseconds to
+scan a tasks virtual memory. It effectively controls the maximum scanning
+rate for each task.
+
+numa_balancing_scan_delay_ms is the starting "scan delay" used for a task
+when it initially forks.
+
+numa_balancing_scan_period_max_ms is the maximum time in milliseconds to
+scan a tasks virtual memory. It effectively controls the minimum scanning
+rate for each task.
+
+numa_balancing_scan_size_mb is how many megabytes worth of pages are
+scanned for a given scan.
+
+numa_balancing_settle_count is how many scan periods must complete before
+the schedule balancer stops pushing the task towards a preferred node. This
+gives the scheduler a chance to place the task on an alternative node if the
+preferred node is overloaded.
+
+numa_balancing_migrate_deferred is how many page migrations get skipped
+unconditionally, after a page migration is skipped because a page is shared
+with other tasks. This reduces page migration overhead, and determines
+how much stronger the "move task near its memory" policy scheduler becomes,
+versus the "move memory near its task" memory management policy, for workloads
+with shared memory.
+
+==============================================================
+
 osrelease, ostype & version:
 
 # cat osrelease
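The numa_balancing scan-rate paragraphs can be read as simple arithmetic. Scanning "scan size" megabytes once per "scan delay" milliseconds gives a rate of size * 1000 / delay MB/s, and the delay adapts between two bounds.

The Python sketch below is a simplified model built on assumptions:

* It treats numa_balancing_scan_period_min_ms and numa_balancing_scan_period_max_ms as bounds on the per-scan delay. This is one reading of "thresholds for scan delays"; the kernel's actual bookkeeping may differ.
* The factor-of-two adaptive step is invented for illustration.
* The function names and sample values are hypothetical and are not kernel defaults.

```python
def scan_rate_mb_per_s(scan_size_mb: float, scan_delay_ms: float) -> float:
    """MB of address space scanned per second when `scan_size_mb` is
    scanned once every `scan_delay_ms` milliseconds."""
    return scan_size_mb * 1000.0 / scan_delay_ms


def next_scan_delay(delay_ms: float, well_placed: bool,
                    period_min_ms: float, period_max_ms: float,
                    factor: float = 2.0) -> float:
    """One adaptive step: back off when pages are already well placed,
    speed up otherwise, then clamp to [period_min_ms, period_max_ms].

    ASSUMPTION: the multiplicative `factor` is illustrative only; the
    documentation says just that the delay increases or decreases.
    """
    new_delay = delay_ms * factor if well_placed else delay_ms / factor
    return max(period_min_ms, min(new_delay, period_max_ms))


if __name__ == "__main__":
    # Hypothetical settings, not kernel defaults.
    period_min_ms, period_max_ms = 1000, 60000
    size_mb = 256
    delay = 1000  # starting delay, standing in for numa_balancing_scan_delay_ms
    for placed in (True, True, False):
        delay = next_scan_delay(delay, placed, period_min_ms, period_max_ms)
        print(delay, scan_rate_mb_per_s(size_mb, delay))
```

With these made-up settings, two well-placed rounds move the delay from 1000 to 4000 ms. Over the same rounds, the rate drops from 256 to 64 MB/s. A poorly placed round then halves the delay again. Clamping at period_max_ms puts a floor under the scan rate, in line with the text's description of that sysctl.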
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 79a797eb3e87..1fbd4eb7b64a 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -119,8 +119,11 @@ other appears as 0 when read.
 
 dirty_background_ratio
 
-Contains, as a percentage of total system memory, the number of pages at which
-the background kernel flusher threads will start writing out dirty data.
+Contains, as a percentage of total available memory that contains free pages
+and reclaimable pages, the number of pages at which the background kernel
+flusher threads will start writing out dirty data.
+
+The total avaiable memory is not equal to total system memory.
 
 ==============================================================
 
@@ -151,9 +154,11 @@ interval will be written out next time a flusher thread wakes up.
 
 dirty_ratio
 
-Contains, as a percentage of total system memory, the number of pages at which
-a process which is generating disk writes will itself start writing out dirty
-data.
+Contains, as a percentage of total available memory that contains free pages
+and reclaimable pages, the number of pages at which a process which is
+generating disk writes will itself start writing out dirty data.
+
+The total avaiable memory is not equal to total system memory.
 
 ==============================================================
 
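The practical effect of the vm.txt change is that dirty_background_ratio and dirty_ratio apply to free plus reclaimable memory rather than to all of RAM. Below is a minimal Python sketch of that arithmetic.

* The page counts and the 10/20 ratios are made-up inputs.
* `dirty_thresholds` is a hypothetical helper.
* It assumes "available" memory is exactly free + reclaimable pages, as the new text states. The kernel's own accounting may subtract reserves.

```python
from typing import Tuple


def dirty_thresholds(free_pages: int, reclaimable_pages: int,
                     background_ratio: int, dirty_ratio: int) -> Tuple[int, int]:
    """Return (background, foreground) dirty-page thresholds in pages.

    Simplification: "available" memory = free + reclaimable pages, as the
    patched vm.txt describes; real kernel thresholds may be lower.
    """
    available = free_pages + reclaimable_pages
    background = available * background_ratio // 100  # flusher threads start writing here
    foreground = available * dirty_ratio // 100       # writing processes start flushing themselves
    return background, foreground


if __name__ == "__main__":
    total_pages = 1_000_000  # all of RAM; NOT the base the ratios apply to
    free, reclaimable = 200_000, 300_000
    print(dirty_thresholds(free, reclaimable, 10, 20))
    # For comparison: the same ratios taken against total system memory.
    print(total_pages * 10 // 100, total_pages * 20 // 100)
```

With these inputs, the thresholds come out at 50,000 and 100,000 pages. That is half of the 100,000 and 200,000 pages a total-memory base would give, which is the distinction the added note about available versus total memory draws.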