Diffstat (limited to 'Documentation')
-rw-r--r-- | Documentation/cgroups/cgroups.txt (renamed from Documentation/cgroups.txt) | 0 | ||||
-rw-r--r-- | Documentation/cgroups/freezer-subsystem.txt | 99 | ||||
-rw-r--r-- | Documentation/controllers/memory.txt | 24 | ||||
-rw-r--r-- | Documentation/cpusets.txt | 2 | ||||
-rw-r--r-- | Documentation/filesystems/ext3.txt | 5 | ||||
-rw-r--r-- | Documentation/filesystems/proc.txt | 28 | ||||
-rw-r--r-- | Documentation/filesystems/ubifs.txt | 9 | ||||
-rw-r--r-- | Documentation/kernel-parameters.txt | 14 | ||||
-rw-r--r-- | Documentation/mtd/nand_ecc.txt | 714 | ||||
-rw-r--r-- | Documentation/sysrq.txt | 5 | ||||
-rw-r--r-- | Documentation/vm/unevictable-lru.txt | 615 |
11 files changed, 1492 insertions, 23 deletions
diff --git a/Documentation/cgroups.txt b/Documentation/cgroups/cgroups.txt
index d9014aa0eb68..d9014aa0eb68 100644
--- a/Documentation/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
diff --git a/Documentation/cgroups/freezer-subsystem.txt b/Documentation/cgroups/freezer-subsystem.txt
new file mode 100644
index 000000000000..c50ab58b72eb
--- /dev/null
+++ b/Documentation/cgroups/freezer-subsystem.txt
@@ -0,0 +1,99 @@ | |||
1 | The cgroup freezer is useful to batch job management systems which start | ||
2 | and stop sets of tasks in order to schedule the resources of a machine | ||
3 | according to the desires of a system administrator. This sort of program | ||
4 | is often used on HPC clusters to schedule access to the cluster as a | ||
5 | whole. The cgroup freezer uses cgroups to describe the set of tasks to | ||
6 | be started/stopped by the batch job management system. It also provides | ||
7 | a means to start and stop the tasks composing the job. | ||
8 | |||
9 | The cgroup freezer will also be useful for checkpointing running groups | ||
10 | of tasks. The freezer allows the checkpoint code to obtain a consistent | ||
11 | image of the tasks by attempting to force the tasks in a cgroup into a | ||
12 | quiescent state. Once the tasks are quiescent another task can | ||
13 | walk /proc or invoke a kernel interface to gather information about the | ||
14 | quiesced tasks. Checkpointed tasks can be restarted later should a | ||
15 | recoverable error occur. This also allows the checkpointed tasks to be | ||
16 | migrated between nodes in a cluster by copying the gathered information | ||
17 | to another node and restarting the tasks there. | ||
18 | |||
19 | Sequences of SIGSTOP and SIGCONT are not always sufficient for stopping | ||
20 | and resuming tasks in userspace. Both of these signals are observable | ||
21 | from within the tasks we wish to freeze. While SIGSTOP cannot be caught, | ||
22 | blocked, or ignored it can be seen by waiting or ptracing parent tasks. | ||
23 | SIGCONT is especially unsuitable since it can be caught by the task. Any | ||
24 | programs designed to watch for SIGSTOP and SIGCONT could be broken by | ||
25 | attempting to use SIGSTOP and SIGCONT to stop and resume tasks. We can | ||
26 | demonstrate this problem using nested bash shells: | ||
27 | |||
28 | $ echo $$ | ||
29 | 16644 | ||
30 | $ bash | ||
31 | $ echo $$ | ||
32 | 16690 | ||
33 | |||
34 | From a second, unrelated bash shell: | ||
35 | $ kill -SIGSTOP 16690 | ||
36 | $ kill -SIGCONT 16690 | ||
37 | |||
38 | <at this point 16690 exits and causes 16644 to exit too> | ||
39 | |||
40 | This happens because bash can observe both signals and choose how it | ||
41 | responds to them. | ||
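The visibility of SIGCONT can be checked from inside a task. The following is a minimal sketch, not part of the patch: the function name sigcont_is_observable() and the handler are hypothetical, and only standard POSIX calls (sigaction, raise) are used. It installs a handler for SIGCONT, sends the signal to itself, and reports whether the handler ran.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Set by the handler; sig_atomic_t is safe to write from a handler. */
static volatile sig_atomic_t sigcont_seen;

static void on_sigcont(int signo)
{
	(void)signo;
	sigcont_seen = 1;
}

/*
 * Hypothetical helper: returns 1 if a user-installed handler observed
 * SIGCONT, 0 otherwise.
 */
int sigcont_is_observable(void)
{
	struct sigaction sa, old;
	int rc;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_sigcont;
	sigemptyset(&sa.sa_mask);

	rc = sigaction(SIGCONT, &sa, &old);
	assert(rc == 0);
	(void)rc;

	sigcont_seen = 0;
	raise(SIGCONT);		/* the handler runs even if we were never stopped */

	sigaction(SIGCONT, &old, NULL);	/* restore the previous disposition */
	return sigcont_seen ? 1 : 0;
}
```

Any program that installs such a handler can react to being resumed, which is exactly the side effect the freezer is meant to avoid.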
42 | |||
43 | Another example of a program which catches and responds to these | ||
44 | signals is gdb. In fact any program designed to use ptrace is likely to | ||
45 | have a problem with this method of stopping and resuming tasks. | ||
46 | |||
47 | In contrast, the cgroup freezer uses the kernel freezer code to | ||
48 | prevent the freeze/unfreeze cycle from becoming visible to the tasks | ||
49 | being frozen. This allows the bash example above and gdb to run as | ||
50 | expected. | ||
51 | |||
52 | The freezer subsystem in the container filesystem defines a file named | ||
53 | freezer.state. Writing "FROZEN" to the state file will freeze all tasks in the | ||
54 | cgroup. Subsequently writing "THAWED" will unfreeze the tasks in the cgroup. | ||
55 | Reading will return the current state. | ||
56 | |||
57 | * Examples of usage : | ||
58 | |||
59 | # mkdir /containers | ||
60 | # mount -t cgroup -ofreezer freezer /containers | ||
61 | # mkdir /containers/0 | ||
62 | # echo $some_pid > /containers/0/tasks | ||
63 | |||
64 | to get status of the freezer subsystem : | ||
65 | |||
66 | # cat /containers/0/freezer.state | ||
67 | THAWED | ||
68 | |||
69 | to freeze all tasks in the container : | ||
70 | |||
71 | # echo FROZEN > /containers/0/freezer.state | ||
72 | # cat /containers/0/freezer.state | ||
73 | FREEZING | ||
74 | # cat /containers/0/freezer.state | ||
75 | FROZEN | ||
76 | |||
77 | to unfreeze all tasks in the container : | ||
78 | |||
79 | # echo THAWED > /containers/0/freezer.state | ||
80 | # cat /containers/0/freezer.state | ||
81 | THAWED | ||
82 | |||
83 | This is the basic mechanism which should do the right thing for user space tasks | ||
84 | in a simple scenario. | ||
85 | |||
86 | It's important to note that freezing can be incomplete. In that case we return | ||
87 | EBUSY. This means that some tasks in the cgroup are busy doing something that | ||
88 | prevents us from completely freezing the cgroup at this time. After EBUSY, | ||
89 | the cgroup will remain partially frozen -- reflected by freezer.state reporting | ||
90 | "FREEZING" when read. The state will remain "FREEZING" until one of these | ||
91 | things happens: | ||
92 | |||
93 | 1) Userspace cancels the freezing operation by writing "THAWED" to | ||
94 | the freezer.state file | ||
95 | 2) Userspace retries the freezing operation by writing "FROZEN" to | ||
96 | the freezer.state file (writing "FREEZING" is not legal | ||
97 | and returns EIO) | ||
98 | 3) The tasks that blocked the cgroup from entering the "FROZEN" | ||
99 | state disappear from the cgroup's set of tasks. | ||
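The retry and cancel rules above could be driven from a small C helper. This is a sketch under stated assumptions, not code from the patch: the function names, the retry count and the path passed in are hypothetical, and a real caller would probably sleep between attempts. It relies only on the behaviour described in this file: writing "FROZEN" may fail with EBUSY, reading freezer.state returns the current state, and writing "THAWED" cancels a partial freeze.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Write a state such as "FROZEN" or "THAWED". Returns 0 or -errno. */
int freezer_write_state(const char *path, const char *state)
{
	FILE *f;
	int err = 0;

	assert(path != NULL && state != NULL);
	f = fopen(path, "w");
	if (!f)
		return -errno;
	if (fputs(state, f) == EOF)
		err = -errno;
	/* stdio buffers the write, so EBUSY may only show up at fclose() */
	if (fclose(f) == EOF && err == 0)
		err = -errno;
	return err;
}

/* Read the current state into buf, without the trailing newline. */
int freezer_read_state(const char *path, char *buf, size_t len)
{
	FILE *f;

	assert(path != NULL && buf != NULL && len > 0);
	f = fopen(path, "r");
	if (!f)
		return -errno;
	if (!fgets(buf, (int)len, f)) {
		fclose(f);
		return -EIO;
	}
	fclose(f);
	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}

/*
 * Try to freeze; retry while the state is not yet "FROZEN". If the cgroup
 * is still not frozen after max_tries attempts, cancel with "THAWED".
 */
int freezer_freeze(const char *path, int max_tries)
{
	char state[32];
	int i, err;

	for (i = 0; i < max_tries; i++) {
		err = freezer_write_state(path, "FROZEN");
		if (err && err != -EBUSY)
			return err;
		err = freezer_read_state(path, state, sizeof(state));
		if (err)
			return err;
		if (strcmp(state, "FROZEN") == 0)
			return 0;
	}
	freezer_write_state(path, "THAWED");	/* cancel the partial freeze */
	return -EBUSY;
}
```

With the mount point used in the examples above, a caller would pass something like "/containers/0/freezer.state" as the path.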
diff --git a/Documentation/controllers/memory.txt b/Documentation/controllers/memory.txt
index 9b53d5827361..1c07547d3f81 100644
--- a/Documentation/controllers/memory.txt
+++ b/Documentation/controllers/memory.txt
@@ -112,14 +112,22 @@ the per cgroup LRU. | |||
112 | 112 | ||
113 | 2.2.1 Accounting details | 113 | 2.2.1 Accounting details |
114 | 114 | ||
115 | All mapped pages (RSS) and unmapped user pages (Page Cache) are accounted. | 115 | All mapped anon pages (RSS) and cache pages (Page Cache) are accounted. |
116 | RSS pages are accounted at the time of page_add_*_rmap() unless they've already | 116 | (some pages which are never reclaimable and will not be on the global LRU |
117 | been accounted for earlier. A file page will be accounted for as Page Cache; | 117 | are not accounted. We just account pages under usual vm management.) |
118 | it's mapped into the page tables of a process, duplicate accounting is carefully | 118 | |
119 | avoided. Page Cache pages are accounted at the time of add_to_page_cache(). | 119 | RSS pages are accounted at page_fault unless they've already been accounted |
120 | The corresponding routines that remove a page from the page tables or removes | 120 | for earlier. A file page will be accounted for as Page Cache when it's |
121 | a page from Page Cache is used to decrement the accounting counters of the | 121 | inserted into inode (radix-tree). While it's mapped into the page tables of |
122 | cgroup. | 122 | processes, duplicate accounting is carefully avoided. |
123 | |||
124 | A RSS page is unaccounted when it's fully unmapped. A PageCache page is | ||
125 | unaccounted when it's removed from radix-tree. | ||
126 | |||
127 | At page migration, accounting information is kept. | ||
128 | |||
129 | Note: we just account pages-on-lru because our purpose is to control amount | ||
130 | of used pages. not-on-lru pages tend to be out-of-control from vm view. | ||
123 | 131 | ||
124 | 2.3 Shared Page Accounting | 132 | 2.3 Shared Page Accounting |
125 | 133 | ||
diff --git a/Documentation/cpusets.txt b/Documentation/cpusets.txt
index 47e568a9370a..5c86c258c791 100644
--- a/Documentation/cpusets.txt
+++ b/Documentation/cpusets.txt
@@ -48,7 +48,7 @@ hooks, beyond what is already present, required to manage dynamic | |||
48 | job placement on large systems. | 48 | job placement on large systems. |
49 | 49 | ||
50 | Cpusets use the generic cgroup subsystem described in | 50 | Cpusets use the generic cgroup subsystem described in |
51 | Documentation/cgroup.txt. | 51 | Documentation/cgroups/cgroups.txt. |
52 | 52 | ||
53 | Requests by a task, using the sched_setaffinity(2) system call to | 53 | Requests by a task, using the sched_setaffinity(2) system call to |
54 | include CPUs in its CPU affinity mask, and using the mbind(2) and | 54 | include CPUs in its CPU affinity mask, and using the mbind(2) and |
diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt
index 295f26cd895a..9dd2a3bb2acc 100644
--- a/Documentation/filesystems/ext3.txt
+++ b/Documentation/filesystems/ext3.txt
@@ -96,6 +96,11 @@ errors=remount-ro(*) Remount the filesystem read-only on an error. | |||
96 | errors=continue Keep going on a filesystem error. | 96 | errors=continue Keep going on a filesystem error. |
97 | errors=panic Panic and halt the machine if an error occurs. | 97 | errors=panic Panic and halt the machine if an error occurs. |
98 | 98 | ||
99 | data_err=ignore(*) Just print an error message if an error occurs | ||
100 | in a file data buffer in ordered mode. | ||
101 | data_err=abort Abort the journal if an error occurs in a file | ||
102 | data buffer in ordered mode. | ||
103 | |||
99 | grpid Give objects the same group ID as their creator. | 104 | grpid Give objects the same group ID as their creator. |
100 | bsdgroups | 105 | bsdgroups |
101 | 106 | ||
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index c032bf39e8b9..bcceb99b81dd 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -1384,15 +1384,18 @@ causes the kernel to prefer to reclaim dentries and inodes. | |||
1384 | dirty_background_ratio | 1384 | dirty_background_ratio |
1385 | ---------------------- | 1385 | ---------------------- |
1386 | 1386 | ||
1387 | Contains, as a percentage of total system memory, the number of pages at which | 1387 | Contains, as a percentage of the dirtyable system memory (free pages + mapped |
1388 | the pdflush background writeback daemon will start writing out dirty data. | 1388 | pages + file cache, not including locked pages and HugePages), the number of |
1389 | pages at which the pdflush background writeback daemon will start writing out | ||
1390 | dirty data. | ||
1389 | 1391 | ||
1390 | dirty_ratio | 1392 | dirty_ratio |
1391 | ----------------- | 1393 | ----------------- |
1392 | 1394 | ||
1393 | Contains, as a percentage of total system memory, the number of pages at which | 1395 | Contains, as a percentage of the dirtyable system memory (free pages + mapped |
1394 | a process which is generating disk writes will itself start writing out dirty | 1396 | pages + file cache, not including locked pages and HugePages), the number of |
1395 | data. | 1397 | pages at which a process which is generating disk writes will itself start |
1398 | writing out dirty data. | ||
1396 | 1399 | ||
1397 | dirty_writeback_centisecs | 1400 | dirty_writeback_centisecs |
1398 | ------------------------- | 1401 | ------------------------- |
@@ -2412,24 +2415,29 @@ will be dumped when the <pid> process is dumped. coredump_filter is a bitmask | |||
2412 | of memory types. If a bit of the bitmask is set, memory segments of the | 2415 | of memory types. If a bit of the bitmask is set, memory segments of the |
2413 | corresponding memory type are dumped, otherwise they are not dumped. | 2416 | corresponding memory type are dumped, otherwise they are not dumped. |
2414 | 2417 | ||
2415 | The following 4 memory types are supported: | 2418 | The following 7 memory types are supported: |
2416 | - (bit 0) anonymous private memory | 2419 | - (bit 0) anonymous private memory |
2417 | - (bit 1) anonymous shared memory | 2420 | - (bit 1) anonymous shared memory |
2418 | - (bit 2) file-backed private memory | 2421 | - (bit 2) file-backed private memory |
2419 | - (bit 3) file-backed shared memory | 2422 | - (bit 3) file-backed shared memory |
2420 | - (bit 4) ELF header pages in file-backed private memory areas (it is | 2423 | - (bit 4) ELF header pages in file-backed private memory areas (it is |
2421 | effective only if the bit 2 is cleared) | 2424 | effective only if the bit 2 is cleared) |
2425 | - (bit 5) hugetlb private memory | ||
2426 | - (bit 6) hugetlb shared memory | ||
2422 | 2427 | ||
2423 | Note that MMIO pages such as frame buffer are never dumped and vDSO pages | 2428 | Note that MMIO pages such as frame buffer are never dumped and vDSO pages |
2424 | are always dumped regardless of the bitmask status. | 2429 | are always dumped regardless of the bitmask status. |
2425 | 2430 | ||
2426 | Default value of coredump_filter is 0x3; this means all anonymous memory | 2431 | Note bits 0-4 don't affect any hugetlb memory. hugetlb memory is only |
2427 | segments are dumped. | 2432 | affected by bits 5-6. |
2433 | |||
2434 | Default value of coredump_filter is 0x23; this means all anonymous memory | ||
2435 | segments and hugetlb private memory are dumped. | ||
2428 | 2436 | ||
2429 | If you don't want to dump all shared memory segments attached to pid 1234, | 2437 | If you don't want to dump all shared memory segments attached to pid 1234, |
2430 | write 1 to the process's proc file. | 2438 | write 0x21 to the process's proc file. |
2431 | 2439 | ||
2432 | $ echo 0x1 > /proc/1234/coredump_filter | 2440 | $ echo 0x21 > /proc/1234/coredump_filter |
2433 | 2441 | ||
2434 | When a new process is created, the process inherits the bitmask status from its | 2442 | When a new process is created, the process inherits the bitmask status from its |
2435 | parent. It is useful to set up coredump_filter before the program runs. | 2443 | parent. It is useful to set up coredump_filter before the program runs. |
diff --git a/Documentation/filesystems/ubifs.txt b/Documentation/filesystems/ubifs.txt
index 6a0d70a22f05..dd84ea3c10da 100644
--- a/Documentation/filesystems/ubifs.txt
+++ b/Documentation/filesystems/ubifs.txt
@@ -86,6 +86,15 @@ norm_unmount (*) commit on unmount; the journal is committed | |||
86 | fast_unmount do not commit on unmount; this option makes | 86 | fast_unmount do not commit on unmount; this option makes |
87 | unmount faster, but the next mount slower | 87 | unmount faster, but the next mount slower |
88 | because of the need to replay the journal. | 88 | because of the need to replay the journal. |
89 | bulk_read read more in one go to take advantage of flash | ||
90 | media that read faster sequentially | ||
91 | no_bulk_read (*) do not bulk-read | ||
92 | no_chk_data_crc skip checking of CRCs on data nodes in order to | ||
93 | improve read performance. Use this option only | ||
94 | if the flash media is highly reliable. The effect | ||
95 | of this option is that corruption of the contents | ||
96 | of a file can go unnoticed. | ||
97 | chk_data_crc (*) do not skip checking CRCs on data nodes | ||
89 | 98 | ||
90 | 99 | ||
91 | Quick usage instructions | 100 | Quick usage instructions |
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index d4f4875fc7c6..0f1544f67400 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -690,7 +690,7 @@ and is between 256 and 4096 characters. It is defined in the file | |||
690 | See Documentation/block/as-iosched.txt and | 690 | See Documentation/block/as-iosched.txt and |
691 | Documentation/block/deadline-iosched.txt for details. | 691 | Documentation/block/deadline-iosched.txt for details. |
692 | 692 | ||
693 | elfcorehdr= [X86-32, X86_64] | 693 | elfcorehdr= [IA64,PPC,SH,X86-32,X86_64] |
694 | Specifies physical address of start of kernel core | 694 | Specifies physical address of start of kernel core |
695 | image elf header. Generally kexec loader will | 695 | image elf header. Generally kexec loader will |
696 | pass this option to capture kernel. | 696 | pass this option to capture kernel. |
@@ -796,6 +796,8 @@ and is between 256 and 4096 characters. It is defined in the file | |||
796 | Defaults to the default architecture's huge page size | 796 | Defaults to the default architecture's huge page size |
797 | if not specified. | 797 | if not specified. |
798 | 798 | ||
799 | hlt [BUGS=ARM,SH] | ||
800 | |||
799 | i8042.debug [HW] Toggle i8042 debug mode | 801 | i8042.debug [HW] Toggle i8042 debug mode |
800 | i8042.direct [HW] Put keyboard port into non-translated mode | 802 | i8042.direct [HW] Put keyboard port into non-translated mode |
801 | i8042.dumbkbd [HW] Pretend that controller can only read data from | 803 | i8042.dumbkbd [HW] Pretend that controller can only read data from |
@@ -1211,6 +1213,10 @@ and is between 256 and 4096 characters. It is defined in the file | |||
1211 | mem=nopentium [BUGS=X86-32] Disable usage of 4MB pages for kernel | 1213 | mem=nopentium [BUGS=X86-32] Disable usage of 4MB pages for kernel |
1212 | memory. | 1214 | memory. |
1213 | 1215 | ||
1216 | memchunk=nn[KMG] | ||
1217 | [KNL,SH] Allow user to override the default size for | ||
1218 | per-device physically contiguous DMA buffers. | ||
1219 | |||
1214 | memmap=exactmap [KNL,X86-32,X86_64] Enable setting of an exact | 1220 | memmap=exactmap [KNL,X86-32,X86_64] Enable setting of an exact |
1215 | E820 memory map, as specified by the user. | 1221 | E820 memory map, as specified by the user. |
1216 | Such memmap=exactmap lines can be constructed based on | 1222 | Such memmap=exactmap lines can be constructed based on |
@@ -1393,6 +1399,8 @@ and is between 256 and 4096 characters. It is defined in the file | |||
1393 | 1399 | ||
1394 | nodisconnect [HW,SCSI,M68K] Disables SCSI disconnects. | 1400 | nodisconnect [HW,SCSI,M68K] Disables SCSI disconnects. |
1395 | 1401 | ||
1402 | nodsp [SH] Disable hardware DSP at boot time. | ||
1403 | |||
1396 | noefi [X86-32,X86-64] Disable EFI runtime services support. | 1404 | noefi [X86-32,X86-64] Disable EFI runtime services support. |
1397 | 1405 | ||
1398 | noexec [IA-64] | 1406 | noexec [IA-64] |
@@ -1409,13 +1417,15 @@ and is between 256 and 4096 characters. It is defined in the file | |||
1409 | noexec32=off: disable non-executable mappings | 1417 | noexec32=off: disable non-executable mappings |
1410 | read implies executable mappings | 1418 | read implies executable mappings |
1411 | 1419 | ||
1420 | nofpu [SH] Disable hardware FPU at boot time. | ||
1421 | |||
1412 | nofxsr [BUGS=X86-32] Disables x86 floating point extended | 1422 | nofxsr [BUGS=X86-32] Disables x86 floating point extended |
1413 | register save and restore. The kernel will only save | 1423 | register save and restore. The kernel will only save |
1414 | legacy floating-point registers on task switch. | 1424 | legacy floating-point registers on task switch. |
1415 | 1425 | ||
1416 | noclflush [BUGS=X86] Don't use the CLFLUSH instruction | 1426 | noclflush [BUGS=X86] Don't use the CLFLUSH instruction |
1417 | 1427 | ||
1418 | nohlt [BUGS=ARM] | 1428 | nohlt [BUGS=ARM,SH] |
1419 | 1429 | ||
1420 | no-hlt [BUGS=X86-32] Tells the kernel that the hlt | 1430 | no-hlt [BUGS=X86-32] Tells the kernel that the hlt |
1421 | instruction doesn't work correctly and not to | 1431 | instruction doesn't work correctly and not to |
diff --git a/Documentation/mtd/nand_ecc.txt b/Documentation/mtd/nand_ecc.txt
new file mode 100644
index 000000000000..bdf93b7f0f24
--- /dev/null
+++ b/Documentation/mtd/nand_ecc.txt
@@ -0,0 +1,714 @@ | |||
1 | Introduction | ||
2 | ============ | ||
3 | |||
4 | Having looked at the linux mtd/nand driver and more specifically at nand_ecc.c | ||
5 | I felt there was room for optimisation. I bashed the code for a few hours | ||
6 | performing tricks like table lookup, removing superfluous code etc. | ||
7 | After that the speed was increased by 35-40%. | ||
8 | Still I was not too happy as I felt there was additional room for improvement. | ||
9 | |||
10 | Bad! I was hooked. | ||
11 | I decided to annotate my steps in this file. Perhaps it is useful to someone | ||
12 | or someone learns something from it. | ||
13 | |||
14 | |||
15 | The problem | ||
16 | =========== | ||
17 | |||
18 | NAND flash (at least SLC one) typically has sectors of 256 bytes. | ||
19 | However NAND flash is not extremely reliable so some error detection | ||
20 | (and sometimes correction) is needed. | ||
21 | |||
22 | This is done by means of a Hamming code. I'll try to explain it in | ||
23 | layman's terms (and apologies to all the pros in the field in case I do | ||
24 | not use the right terminology, my coding theory class was almost 30 | ||
25 | years ago, and I must admit it was not one of my favourites). | ||
26 | |||
27 | As I said before the ecc calculation is performed on sectors of 256 | ||
28 | bytes. This is done by calculating several parity bits over the rows and | ||
29 | columns. The parity used is even parity which means that the parity bit = 1 | ||
30 | if the data over which the parity is calculated has an odd number of 1 bits | ||
31 | and the parity bit = 0 if it has an even number of 1 bits. So the total | ||
32 | number of 1 bits in the data over which the parity is calculated + the | ||
33 | parity bit is even. (see wikipedia if you can't follow this). | ||
34 | Parity is often calculated by means of an exclusive or operation, | ||
35 | sometimes also referred to as xor. In C the operator for xor is ^ | ||
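As an illustration of parity by xor, here is a small sketch (the names parity8 and parity8_selfcheck are hypothetical, not taken from nand_ecc.c). It folds the eight bits of a byte together with ^ and checks the result against a plain bit count.

```c
#include <assert.h>

/* Even-parity bit of one byte: 1 if the byte has an odd number of 1 bits. */
unsigned char parity8(unsigned char b)
{
	b ^= b >> 4;	/* fold the high nibble onto the low nibble */
	b ^= b >> 2;
	b ^= b >> 1;
	return b & 1;
}

/* Compare the xor fold against counting the 1 bits, for every byte value. */
void parity8_selfcheck(void)
{
	int v, bit, ones;

	for (v = 0; v < 256; v++) {
		ones = 0;
		for (bit = 0; bit < 8; bit++)
			ones += (v >> bit) & 1;
		assert(parity8((unsigned char)v) == (ones & 1));
	}
}
```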
36 | |||
37 | Back to ecc. | ||
38 | Let's give a small figure: | ||
39 | |||
40 | byte 0: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp0 rp2 rp4 ... rp14 | ||
41 | byte 1: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp1 rp2 rp4 ... rp14 | ||
42 | byte 2: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp0 rp3 rp4 ... rp14 | ||
43 | byte 3: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp1 rp3 rp4 ... rp14 | ||
44 | byte 4: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp0 rp2 rp5 ... rp14 | ||
45 | .... | ||
46 | byte 254: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp0 rp3 rp5 ... rp15 | ||
47 | byte 255: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp1 rp3 rp5 ... rp15 | ||
48 | cp1 cp0 cp1 cp0 cp1 cp0 cp1 cp0 | ||
49 | cp3 cp3 cp2 cp2 cp3 cp3 cp2 cp2 | ||
50 | cp5 cp5 cp5 cp5 cp4 cp4 cp4 cp4 | ||
51 | |||
52 | This figure represents a sector of 256 bytes. | ||
53 | cp is my abbreviation for column parity, rp for row parity. | ||
54 | |||
55 | Let's start to explain column parity. | ||
56 | cp0 is the parity that belongs to all bit0, bit2, bit4, bit6. | ||
57 | so the sum of all bit0, bit2, bit4 and bit6 values + cp0 itself is even. | ||
58 | Similarly cp1 is the parity over all bit1, bit3, bit5 and bit7. | ||
59 | cp2 is the parity over bit0, bit1, bit4 and bit5 | ||
60 | cp3 is the parity over bit2, bit3, bit6 and bit7. | ||
61 | cp4 is the parity over bit0, bit1, bit2 and bit3. | ||
62 | cp5 is the parity over bit4, bit5, bit6 and bit7. | ||
63 | Note that each of cp0 .. cp5 is exactly one bit. | ||
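The column parity definitions can be written down literally. The sketch below is an illustration only (the names cp_mask and column_parity are hypothetical); it walks every bit of every byte and is deliberately unoptimised. Each mask has a 1 in the bit positions that the corresponding cp covers.

```c
#include <assert.h>
#include <stddef.h>

/* Bit positions covered by cp0 .. cp5, following the list above. */
static const unsigned char cp_mask[6] = {
	0x55,	/* cp0: bit0, bit2, bit4, bit6 */
	0xaa,	/* cp1: bit1, bit3, bit5, bit7 */
	0x33,	/* cp2: bit0, bit1, bit4, bit5 */
	0xcc,	/* cp3: bit2, bit3, bit6, bit7 */
	0x0f,	/* cp4: bit0 .. bit3 */
	0xf0	/* cp5: bit4 .. bit7 */
};

/* Column parity cp<n> over len bytes, computed one bit at a time. */
unsigned char column_parity(const unsigned char *buf, size_t len, int n)
{
	unsigned char p = 0;
	size_t i;
	int bit;

	assert(buf != NULL && n >= 0 && n < 6);
	for (i = 0; i < len; i++)
		for (bit = 0; bit < 8; bit++)
			if (cp_mask[n] & (1u << bit))
				p ^= (buf[i] >> bit) & 1;
	return p;
}
```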
64 | |||
65 | Row parity actually works almost the same. | ||
66 | rp0 is the parity of all even bytes (0, 2, 4, 6, ... 252, 254) | ||
67 | rp1 is the parity of all odd bytes (1, 3, 5, 7, ..., 253, 255) | ||
68 | rp2 is the parity of all bytes 0, 1, 4, 5, 8, 9, ... | ||
69 | (so handle two bytes, then skip 2 bytes). | ||
70 | rp3 covers the half rp2 does not cover (bytes 2, 3, 6, 7, 10, 11, ...) | ||
71 | for rp4 the rule is cover 4 bytes, skip 4 bytes, cover 4 bytes, skip 4 etc. | ||
72 | so rp4 calculates parity over bytes 0, 1, 2, 3, 8, 9, 10, 11, 16, ... | ||
73 | and rp5 covers the other half, so bytes 4, 5, 6, 7, 12, 13, 14, 15, 20, .. | ||
74 | The story now becomes quite boring. I guess you get the idea. | ||
75 | rp6 covers 8 bytes then skips 8 etc | ||
76 | rp7 skips 8 bytes then covers 8 etc | ||
77 | rp8 covers 16 bytes then skips 16 etc | ||
78 | rp9 skips 16 bytes then covers 16 etc | ||
79 | rp10 covers 32 bytes then skips 32 etc | ||
80 | rp11 skips 32 bytes then covers 32 etc | ||
81 | rp12 covers 64 bytes then skips 64 etc | ||
82 | rp13 skips 64 bytes then covers 64 etc | ||
83 | rp14 covers 128 bytes then skips 128 | ||
84 | rp15 skips 128 bytes then covers 128 | ||
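The cover/skip pattern is the same as testing one bit of the byte index: rp(2k) covers the bytes whose index has bit k clear, rp(2k+1) the bytes whose index has bit k set. A minimal sketch of that rule, with a hypothetical function name:

```c
#include <assert.h>
#include <stddef.h>

/* Row parity rp<n> (n = 0 .. 15) over len bytes. */
unsigned char row_parity(const unsigned char *buf, size_t len, int n)
{
	unsigned int addr_bit, want;
	unsigned char acc = 0;
	size_t i;

	assert(buf != NULL && n >= 0 && n < 16);
	addr_bit = (unsigned int)n / 2;	/* which bit of the byte index to test */
	want = (unsigned int)n & 1u;	/* odd rp: bit set, even rp: bit clear */

	for (i = 0; i < len; i++)
		if (((i >> addr_bit) & 1u) == want)
			acc ^= buf[i];

	/* reduce the xor of the covered bytes to a single parity bit */
	acc ^= acc >> 4;
	acc ^= acc >> 2;
	acc ^= acc >> 1;
	return acc & 1;
}
```

For example, a single 1 bit in byte 5 (binary 00000101) shows up in rp1, rp2, rp5, rp6, rp8, rp10, rp12 and rp14.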
85 | |||
86 | In the end the parity bits are grouped together in three bytes as | ||
87 | follows: | ||
88 | ECC Bit 7 Bit 6 Bit 5 Bit 4 Bit 3 Bit 2 Bit 1 Bit 0 | ||
89 | ECC 0 rp07 rp06 rp05 rp04 rp03 rp02 rp01 rp00 | ||
90 | ECC 1 rp15 rp14 rp13 rp12 rp11 rp10 rp09 rp08 | ||
91 | ECC 2 cp5 cp4 cp3 cp2 cp1 cp0 1 1 | ||
92 | |||
93 | I detected after writing this that ST application note AN1823 | ||
94 | (http://www.st.com/stonline/books/pdf/docs/10123.pdf) gives a much | ||
95 | nicer picture (but they use line parity as term where I use row parity). | ||
96 | Oh well, I'm graphically challenged, so suffer with me for a moment :-) | ||
97 | And I could not reuse the ST picture anyway for copyright reasons. | ||
98 | |||
99 | |||
100 | Attempt 0 | ||
101 | ========= | ||
102 | |||
103 | Implementing the parity calculation is pretty simple. | ||
104 | In C pseudocode: | ||
105 | for (i = 0; i < 256; i++) | ||
106 | { | ||
107 | if (i & 0x01) | ||
108 | rp1 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp1; | ||
109 | else | ||
110 | rp0 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp0; | ||
111 | if (i & 0x02) | ||
112 | rp3 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp3; | ||
113 | else | ||
114 | rp2 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp2; | ||
115 | if (i & 0x04) | ||
116 | rp5 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp5; | ||
117 | else | ||
118 | rp4 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp4; | ||
119 | if (i & 0x08) | ||
120 | rp7 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp7; | ||
121 | else | ||
122 | rp6 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp6; | ||
123 | if (i & 0x10) | ||
124 | rp9 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp9; | ||
125 | else | ||
126 | rp8 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp8; | ||
127 | if (i & 0x20) | ||
128 | rp11 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp11; | ||
129 | else | ||
130 | rp10 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp10; | ||
131 | if (i & 0x40) | ||
132 | rp13 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp13; | ||
133 | else | ||
134 | rp12 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp12; | ||
135 | if (i & 0x80) | ||
136 | rp15 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp15; | ||
137 | else | ||
138 | rp14 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp14; | ||
139 | cp0 = bit6 ^ bit4 ^ bit2 ^ bit0 ^ cp0; | ||
140 | cp1 = bit7 ^ bit5 ^ bit3 ^ bit1 ^ cp1; | ||
141 | cp2 = bit5 ^ bit4 ^ bit1 ^ bit0 ^ cp2; | ||
142 | cp3 = bit7 ^ bit6 ^ bit3 ^ bit2 ^ cp3; | ||
143 | cp4 = bit3 ^ bit2 ^ bit1 ^ bit0 ^ cp4; | ||
144 | cp5 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ cp5; | ||
145 | } | ||
146 | |||
147 | |||
148 | Analysis 0 | ||
149 | ========== | ||
150 | |||
151 | C does have bitwise operators but not really operators to do the above | ||
152 | efficiently (and most hardware has no such instructions either). | ||
153 | Therefore without implementing this it was clear that the code above was | ||
154 | not going to bring me a Nobel prize :-) | ||
155 | |||
156 | Fortunately the exclusive or operation is commutative and associative, so we can combine | ||
157 | the values in any order. So instead of calculating all the bits | ||
158 | individually, let us try to rearrange things. | ||
159 | For the column parity this is easy. We can just xor the bytes and in the | ||
160 | end filter out the relevant bits. This is pretty nice as it will bring | ||
161 | all cp calculation out of the loop. | ||
162 | |||
163 | Similarly we can first xor the bytes for the various rows. | ||
164 | This leads to: | ||
165 | |||
166 | |||
167 | Attempt 1 | ||
168 | ========= | ||
169 | |||
170 | const char parity[256] = { | ||
171 | 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, | ||
172 | 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, | ||
173 | 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, | ||
174 | 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, | ||
175 | 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, | ||
176 | 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, | ||
177 | 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, | ||
178 | 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, | ||
179 | 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, | ||
180 | 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, | ||
181 | 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, | ||
182 | 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, | ||
183 | 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, | ||
184 | 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, | ||
185 | 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, | ||
186 | 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0 | ||
187 | }; | ||
188 | |||
189 | void ecc1(const unsigned char *buf, unsigned char *code) | ||
190 | { | ||
191 | int i; | ||
192 | const unsigned char *bp = buf; | ||
193 | unsigned char cur; | ||
194 | unsigned char rp0, rp1, rp2, rp3, rp4, rp5, rp6, rp7; | ||
195 | unsigned char rp8, rp9, rp10, rp11, rp12, rp13, rp14, rp15; | ||
196 | unsigned char par; | ||
197 | |||
198 | par = 0; | ||
199 | rp0 = 0; rp1 = 0; rp2 = 0; rp3 = 0; | ||
200 | rp4 = 0; rp5 = 0; rp6 = 0; rp7 = 0; | ||
201 | rp8 = 0; rp9 = 0; rp10 = 0; rp11 = 0; | ||
202 | rp12 = 0; rp13 = 0; rp14 = 0; rp15 = 0; | ||
203 | |||
204 | for (i = 0; i < 256; i++) | ||
205 | { | ||
206 | cur = *bp++; | ||
207 | par ^= cur; | ||
208 | if (i & 0x01) rp1 ^= cur; else rp0 ^= cur; | ||
209 | if (i & 0x02) rp3 ^= cur; else rp2 ^= cur; | ||
210 | if (i & 0x04) rp5 ^= cur; else rp4 ^= cur; | ||
211 | if (i & 0x08) rp7 ^= cur; else rp6 ^= cur; | ||
212 | if (i & 0x10) rp9 ^= cur; else rp8 ^= cur; | ||
213 | if (i & 0x20) rp11 ^= cur; else rp10 ^= cur; | ||
214 | if (i & 0x40) rp13 ^= cur; else rp12 ^= cur; | ||
215 | if (i & 0x80) rp15 ^= cur; else rp14 ^= cur; | ||
216 | } | ||
217 | code[0] = | ||
218 | (parity[rp7] << 7) | | ||
219 | (parity[rp6] << 6) | | ||
220 | (parity[rp5] << 5) | | ||
221 | (parity[rp4] << 4) | | ||
222 | (parity[rp3] << 3) | | ||
223 | (parity[rp2] << 2) | | ||
224 | (parity[rp1] << 1) | | ||
225 | (parity[rp0]); | ||
226 | code[1] = | ||
227 | (parity[rp15] << 7) | | ||
228 | (parity[rp14] << 6) | | ||
229 | (parity[rp13] << 5) | | ||
230 | (parity[rp12] << 4) | | ||
231 | (parity[rp11] << 3) | | ||
232 | (parity[rp10] << 2) | | ||
233 | (parity[rp9] << 1) | | ||
234 | (parity[rp8]); | ||
235 | code[2] = | ||
236 | (parity[par & 0xf0] << 7) | | ||
237 | (parity[par & 0x0f] << 6) | | ||
238 | (parity[par & 0xcc] << 5) | | ||
239 | (parity[par & 0x33] << 4) | | ||
240 | (parity[par & 0xaa] << 3) | | ||
241 | (parity[par & 0x55] << 2); | ||
242 | code[0] = ~code[0]; | ||
243 | code[1] = ~code[1]; | ||
244 | code[2] = ~code[2]; | ||
245 | } | ||
246 | |||
247 | Still pretty straightforward. The last three invert statements are there to | ||
248 | give a checksum of 0xff 0xff 0xff for an empty flash. In an empty flash | ||
249 | all data is 0xff, so the checksum then matches. | ||
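The empty-flash claim can be checked with a compact restatement of the byte-wise calculation. This is a sketch written for illustration, not the code from the patch: it uses arrays and loops instead of sixteen separate variables and a lookup table, and the names ecc_parity8 and ecc_bytewise are hypothetical. The bit layout follows the ECC table given earlier.

```c
#include <assert.h>
#include <stddef.h>

/* Parity of one byte by xor folding (replaces the 256-entry table). */
static unsigned char ecc_parity8(unsigned char b)
{
	b ^= b >> 4;
	b ^= b >> 2;
	b ^= b >> 1;
	return b & 1;
}

/* Byte-wise ECC over a 256 byte sector, result in code[0..2]. */
void ecc_bytewise(const unsigned char *buf, unsigned char code[3])
{
	static const unsigned char mask[6] = {
		0x55, 0xaa, 0x33, 0xcc, 0x0f, 0xf0	/* cp0 .. cp5 */
	};
	unsigned char rp[16] = { 0 };
	unsigned char par = 0;
	int i, k;

	assert(buf != NULL && code != NULL);
	for (i = 0; i < 256; i++) {
		par ^= buf[i];
		/* bit k of the index selects rp(2k+1), otherwise rp(2k) */
		for (k = 0; k < 8; k++)
			rp[2 * k + ((i >> k) & 1)] ^= buf[i];
	}

	code[0] = code[1] = code[2] = 0;
	for (k = 0; k < 8; k++) {
		code[0] |= (unsigned char)(ecc_parity8(rp[k]) << k);
		code[1] |= (unsigned char)(ecc_parity8(rp[8 + k]) << k);
	}
	for (k = 0; k < 6; k++)
		code[2] |= (unsigned char)(ecc_parity8(par & mask[k]) << (k + 2));

	/* invert so that an erased (all 0xff) sector yields 0xff 0xff 0xff */
	code[0] = (unsigned char)~code[0];
	code[1] = (unsigned char)~code[1];
	code[2] = (unsigned char)~code[2];
}
```

For a sector filled with 0xff every accumulated xor is zero (an even number of identical bytes), so all parity bits are 0 and the inversion produces 0xff 0xff 0xff.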
250 | |||
251 | I also introduced the parity lookup. I expected this to be the fastest | ||
252 | way to calculate the parity, but I will investigate alternatives later | ||
253 | on. | ||
254 | |||
255 | |||
256 | Analysis 1 | ||
257 | ========== | ||
258 | |||
259 | The code works, but is not terribly efficient. On my system it took | ||
260 | almost 4 times as much time as the linux driver code. But hey, if it was | ||
261 | *that* easy this would have been done long before. | ||
262 | No pain, no gain. | ||
263 | |||
264 | Fortunately there is plenty of room for improvement. | ||
265 | |||
266 | In step 1 we moved from bit-wise calculation to byte-wise calculation. | ||
267 | However in C we can also use the unsigned long data type and virtually | ||
268 | every modern microprocessor supports 32 bit operations, so why not try | ||
269 | to write our code in such a way that we process data in 32 bit chunks? | ||
270 | |||
271 | Of course this means some modification as the row parity is byte by | ||
272 | byte. A quick analysis: | ||
273 | For the column parity we use the par variable. When extending to 32 bits | ||
274 | we can in the end easily calculate rp0 and rp1 from it | ||
275 | (because par now consists of 4 bytes, contributing to rp1, rp0, rp1, rp0 | ||
276 | respectively). | ||
277 | Also rp2 and rp3 can be easily retrieved from par, as rp3 covers the | ||
278 | first two bytes and rp2 the last two bytes. | ||
279 | |||
280 | Note that of course now the loop is executed only 64 times (256/4). | ||
281 | And note that care must be taken wrt byte ordering. The way bytes are | ||
282 | ordered in a long is machine dependent, and might affect us. | ||
283 | Anyway, if there is an issue: this code was developed on x86 (to be | ||
284 | precise: a DELL PC with a D920 Intel CPU). | ||
285 | |||
286 | And of course the performance might depend on alignment, but I expect | ||
287 | that the I/O buffers in the nand driver are aligned properly (and | ||
288 | otherwise that should be fixed to get maximum performance). | ||
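
The claim that rp0 to rp3 and the column parity can be recovered from a 32 bit par can be checked with a small sketch. Assumptions made here, not taken from the text: uint32_t is used instead of unsigned long (which is not 32 bits wide on every platform), each word is assembled explicitly with byte 0 in the low bits to mimic x86 byte order, and the names colrow, bytewise and wordwise are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct colrow { unsigned char par, rp0, rp1, rp2, rp3; };

/* Reference: one byte at a time, as in attempt 1. */
static struct colrow bytewise(const unsigned char *buf, size_t len)
{
	struct colrow r = { 0, 0, 0, 0, 0 };
	size_t i;

	for (i = 0; i < len; i++) {
		r.par ^= buf[i];
		if (i & 0x01) r.rp1 ^= buf[i]; else r.rp0 ^= buf[i];
		if (i & 0x02) r.rp3 ^= buf[i]; else r.rp2 ^= buf[i];
	}
	return r;
}

/* 32 bits at a time; len must be a multiple of 4. */
static struct colrow wordwise(const unsigned char *buf, size_t len)
{
	struct colrow r;
	uint32_t par = 0, t;
	size_t i;

	assert(len % 4 == 0);
	for (i = 0; i < len; i += 4) {
		/* byte 0 ends up in bits 0..7, byte 3 in bits 24..31 */
		par ^= (uint32_t)buf[i] |
		       ((uint32_t)buf[i + 1] << 8) |
		       ((uint32_t)buf[i + 2] << 16) |
		       ((uint32_t)buf[i + 3] << 24);
	}
	t = par >> 16;    t ^= t >> 8; r.rp3 = t & 0xff; /* bytes 2 and 3 */
	t = par & 0xffff; t ^= t >> 8; r.rp2 = t & 0xff; /* bytes 0 and 1 */
	par ^= par >> 16;
	r.rp1 = (par >> 8) & 0xff;                       /* bytes 1 and 3 */
	r.rp0 = par & 0xff;                              /* bytes 0 and 2 */
	par ^= par >> 8;
	r.par = par & 0xff;                              /* all four bytes */
	return r;
}
```

Both functions should return identical fields for any buffer whose length is a multiple of 4. If the words were loaded with the opposite byte order, the roles of rp0/rp1 and rp2/rp3 would swap, which is the byte ordering concern raised above.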
289 | |||
290 | Let's give it a try... | ||
291 | |||
292 | |||
293 | Attempt 2 | ||
294 | ========= | ||
295 | |||
296 | extern const char parity[256]; | ||
297 | |||
298 | void ecc2(const unsigned char *buf, unsigned char *code) | ||
299 | { | ||
300 | int i; | ||
301 | const unsigned long *bp = (unsigned long *)buf; | ||
302 | unsigned long cur; | ||
303 | unsigned long rp0, rp1, rp2, rp3, rp4, rp5, rp6, rp7; | ||
304 | unsigned long rp8, rp9, rp10, rp11, rp12, rp13, rp14, rp15; | ||
305 | unsigned long par; | ||
306 | |||
307 | par = 0; | ||
308 | rp0 = 0; rp1 = 0; rp2 = 0; rp3 = 0; | ||
309 | rp4 = 0; rp5 = 0; rp6 = 0; rp7 = 0; | ||
310 | rp8 = 0; rp9 = 0; rp10 = 0; rp11 = 0; | ||
311 | rp12 = 0; rp13 = 0; rp14 = 0; rp15 = 0; | ||
312 | |||
313 | for (i = 0; i < 64; i++) | ||
314 | { | ||
315 | cur = *bp++; | ||
316 | par ^= cur; | ||
317 | if (i & 0x01) rp5 ^= cur; else rp4 ^= cur; | ||
318 | if (i & 0x02) rp7 ^= cur; else rp6 ^= cur; | ||
319 | if (i & 0x04) rp9 ^= cur; else rp8 ^= cur; | ||
320 | if (i & 0x08) rp11 ^= cur; else rp10 ^= cur; | ||
321 | if (i & 0x10) rp13 ^= cur; else rp12 ^= cur; | ||
322 | if (i & 0x20) rp15 ^= cur; else rp14 ^= cur; | ||
323 | } | ||
324 | /* | ||
325 | we need to adapt the code generation for the fact that rp vars are now | ||
326 | long; also the column parity calculation needs to be changed. | ||
327 | we'll bring rp4 to 15 back to single byte entities by shifting and | ||
328 | xoring | ||
329 | */ | ||
330 | rp4 ^= (rp4 >> 16); rp4 ^= (rp4 >> 8); rp4 &= 0xff; | ||
331 | rp5 ^= (rp5 >> 16); rp5 ^= (rp5 >> 8); rp5 &= 0xff; | ||
332 | rp6 ^= (rp6 >> 16); rp6 ^= (rp6 >> 8); rp6 &= 0xff; | ||
333 | rp7 ^= (rp7 >> 16); rp7 ^= (rp7 >> 8); rp7 &= 0xff; | ||
334 | rp8 ^= (rp8 >> 16); rp8 ^= (rp8 >> 8); rp8 &= 0xff; | ||
335 | rp9 ^= (rp9 >> 16); rp9 ^= (rp9 >> 8); rp9 &= 0xff; | ||
336 | rp10 ^= (rp10 >> 16); rp10 ^= (rp10 >> 8); rp10 &= 0xff; | ||
337 | rp11 ^= (rp11 >> 16); rp11 ^= (rp11 >> 8); rp11 &= 0xff; | ||
338 | rp12 ^= (rp12 >> 16); rp12 ^= (rp12 >> 8); rp12 &= 0xff; | ||
339 | rp13 ^= (rp13 >> 16); rp13 ^= (rp13 >> 8); rp13 &= 0xff; | ||
340 | rp14 ^= (rp14 >> 16); rp14 ^= (rp14 >> 8); rp14 &= 0xff; | ||
341 | rp15 ^= (rp15 >> 16); rp15 ^= (rp15 >> 8); rp15 &= 0xff; | ||
342 | rp3 = (par >> 16); rp3 ^= (rp3 >> 8); rp3 &= 0xff; | ||
343 | rp2 = par & 0xffff; rp2 ^= (rp2 >> 8); rp2 &= 0xff; | ||
344 | par ^= (par >> 16); | ||
345 | rp1 = (par >> 8); rp1 &= 0xff; | ||
346 | rp0 = (par & 0xff); | ||
347 | par ^= (par >> 8); par &= 0xff; | ||
348 | |||
349 | code[0] = | ||
350 | (parity[rp7] << 7) | | ||
351 | (parity[rp6] << 6) | | ||
352 | (parity[rp5] << 5) | | ||
353 | (parity[rp4] << 4) | | ||
354 | (parity[rp3] << 3) | | ||
355 | (parity[rp2] << 2) | | ||
356 | (parity[rp1] << 1) | | ||
357 | (parity[rp0]); | ||
358 | code[1] = | ||
359 | (parity[rp15] << 7) | | ||
360 | (parity[rp14] << 6) | | ||
361 | (parity[rp13] << 5) | | ||
362 | (parity[rp12] << 4) | | ||
363 | (parity[rp11] << 3) | | ||
364 | (parity[rp10] << 2) | | ||
365 | (parity[rp9] << 1) | | ||
366 | (parity[rp8]); | ||
367 | code[2] = | ||
368 | (parity[par & 0xf0] << 7) | | ||
369 | (parity[par & 0x0f] << 6) | | ||
370 | (parity[par & 0xcc] << 5) | | ||
371 | (parity[par & 0x33] << 4) | | ||
372 | (parity[par & 0xaa] << 3) | | ||
373 | (parity[par & 0x55] << 2); | ||
374 | code[0] = ~code[0]; | ||
375 | code[1] = ~code[1]; | ||
376 | code[2] = ~code[2]; | ||
377 | } | ||
378 | |||
379 | The parity array is not shown any more. Note also that for these | ||
380 | examples I kinda deviated from my regular programming style by allowing | ||
381 | multiple statements on a line, not using { } in then and else blocks | ||
382 | with only a single statement and by using operators like ^=. | ||
383 | |||
384 | |||
385 | Analysis 2 | ||
386 | ========== | ||
387 | |||
388 | The code (of course) works, and hurray: we are a little bit faster than | ||
389 | the linux driver code (about 15%). But wait, don't cheer too quickly. | ||
390 | There is more to be gained. | ||
391 | If we look at e.g. rp14 and rp15 we see that we either xor our data with | ||
392 | rp14 or with rp15. However we also have par which goes over all data. | ||
393 | This means there is no need to calculate rp14 as it can be calculated from | ||
394 | rp15 through rp14 = par ^ rp15; | ||
395 | (or if desired we can avoid calculating rp15 and calculate it from | ||
396 | rp14). That is why some places refer to inverse parity. | ||
397 | Of course the same thing holds for rp4/5, rp6/7, rp8/9, rp10/11 and rp12/13. | ||
398 | Effectively this means we can eliminate the else clause from the if | ||
399 | statements. Also we can optimise the calculation in the end a little bit | ||
400 | by going from long to byte first. Actually we can even avoid the table | ||
401 | lookups. | ||
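
The relation between each even/odd pair and par can be verified with a short sketch. The function name derived_matches_direct and the array layout are assumptions for this illustration; uint32_t stands in for the 32 bit unsigned long of the listing.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Computes rp4..rp15 directly (direct[0] is rp4, direct[11] is rp15)
 * and also accumulates only the odd-numbered ones.  Returns 1 if every
 * even-numbered value equals par xor its odd partner, else 0.
 * bp must point at 64 words (256 bytes).
 */
static int derived_matches_direct(const uint32_t *bp)
{
	uint32_t direct[12] = { 0 };
	uint32_t odd[6] = { 0 };
	uint32_t par = 0;
	int i, b;

	assert(bp != 0);
	for (i = 0; i < 64; i++) {
		uint32_t cur = bp[i];

		par ^= cur;
		for (b = 0; b < 6; b++) {
			int bit = (i >> b) & 1;

			direct[2 * b + bit] ^= cur;
			if (bit)
				odd[b] ^= cur;	/* no else clause needed */
		}
	}
	for (b = 0; b < 6; b++) {
		if (direct[2 * b + 1] != odd[b])
			return 0;
		if (direct[2 * b] != (par ^ odd[b]))
			return 0;
	}
	return 1;
}
```

Each word lands in exactly one accumulator of a pair, so the pair always xors to par, and either half can be derived from the other.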
402 | |||
403 | Attempt 3 | ||
404 | ========= | ||
405 | |||
406 | In the loop I replaced: | ||
407 | if (i & 0x01) rp5 ^= cur; else rp4 ^= cur; | ||
408 | if (i & 0x02) rp7 ^= cur; else rp6 ^= cur; | ||
409 | if (i & 0x04) rp9 ^= cur; else rp8 ^= cur; | ||
410 | if (i & 0x08) rp11 ^= cur; else rp10 ^= cur; | ||
411 | if (i & 0x10) rp13 ^= cur; else rp12 ^= cur; | ||
412 | if (i & 0x20) rp15 ^= cur; else rp14 ^= cur; | ||
413 | with | ||
414 | if (i & 0x01) rp5 ^= cur; | ||
415 | if (i & 0x02) rp7 ^= cur; | ||
416 | if (i & 0x04) rp9 ^= cur; | ||
417 | if (i & 0x08) rp11 ^= cur; | ||
418 | if (i & 0x10) rp13 ^= cur; | ||
419 | if (i & 0x20) rp15 ^= cur; | ||
420 | |||
421 | and outside the loop added: | ||
422 | rp4 = par ^ rp5; | ||
423 | rp6 = par ^ rp7; | ||
424 | rp8 = par ^ rp9; | ||
425 | rp10 = par ^ rp11; | ||
426 | rp12 = par ^ rp13; | ||
427 | rp14 = par ^ rp15; | ||
428 | |||
429 | And after that the code takes about 30% more time, although the number of | ||
430 | statements is reduced. This is also reflected in the assembly code. | ||
431 | |||
432 | |||
433 | Analysis 3 | ||
434 | ========== | ||
435 | |||
436 | Very weird. Guess it has to do with caching or instruction parallelism | ||
437 | or so. I also tried on an eeePC (Celeron, clocked at 900 MHz). Interesting | ||
438 | observation was that this one is only 30% slower (according to time) | ||
439 | executing the code than my 3 GHz D920 processor. | ||
440 | |||
441 | Well, it was expected not to be easy so maybe instead move to a | ||
442 | different track: let's move back to the code from attempt 2 and do some | ||
443 | loop unrolling. This will eliminate a few if statements. I'll try | ||
444 | different amounts of unrolling to see what works best. | ||
445 | |||
446 | |||
447 | Attempt 4 | ||
448 | ========= | ||
449 | |||
450 | Unrolled the loop 1, 2, 3 and 4 times. | ||
451 | For 4 the code starts with: | ||
452 | |||
453 | for (i = 0; i < 4; i++) | ||
454 | { | ||
455 | cur = *bp++; | ||
456 | par ^= cur; | ||
457 | rp4 ^= cur; | ||
458 | rp6 ^= cur; | ||
459 | rp8 ^= cur; | ||
460 | rp10 ^= cur; | ||
461 | if (i & 0x1) rp13 ^= cur; else rp12 ^= cur; | ||
462 | if (i & 0x2) rp15 ^= cur; else rp14 ^= cur; | ||
463 | cur = *bp++; | ||
464 | par ^= cur; | ||
465 | rp5 ^= cur; | ||
466 | rp6 ^= cur; | ||
467 | ... | ||
468 | |||
469 | |||
470 | Analysis 4 | ||
471 | ========== | ||
472 | |||
473 | Unrolling once gains about 15% | ||
474 | Unrolling twice keeps the gain at about 15% | ||
475 | Unrolling three times gives a gain of 30% compared to attempt 2. | ||
476 | Unrolling four times gives a marginal improvement compared to unrolling | ||
477 | three times. | ||
478 | |||
479 | I decided to proceed with a four times unrolled loop anyway. It was my gut | ||
480 | feeling that in the next steps I would obtain additional gain from it. | ||
481 | |||
482 | The next step was triggered by the fact that par contains the xor of all | ||
483 | bytes and rp4 and rp5 each contain the xor of half of the bytes. | ||
484 | So in effect par = rp4 ^ rp5. But as xor is its own inverse we can also say | ||
485 | that rp5 = par ^ rp4. So no need to keep both rp4 and rp5 around. We can | ||
486 | eliminate rp5 (or rp4, but I already foresaw another optimisation). | ||
487 | The same holds for rp6/7, rp8/9, rp10/11, rp12/13 and rp14/15. | ||
488 | |||
489 | |||
490 | Attempt 5 | ||
491 | ========= | ||
492 | |||
493 | Effectively, all odd-numbered rp assignments in the loop were removed. | ||
494 | This included the else clause of the if statements. | ||
495 | Of course after the loop we need to correct things by adding code like: | ||
496 | rp5 = par ^ rp4; | ||
497 | Also the initial assignments (rp5 = 0; etc) could be removed. | ||
498 | Along the way I also removed the initialisation of rp0/1/2/3. | ||
499 | |||
500 | |||
501 | Analysis 5 | ||
502 | ========== | ||
503 | |||
504 | Measurements showed this was a good move. The run-time roughly halved | ||
505 | compared with attempt 4 with 4 times unrolled, and we only require 1/3rd | ||
506 | of the processor time compared to the current code in the linux kernel. | ||
507 | |||
508 | However, still I thought there was more. I didn't like all the if | ||
509 | statements. Why not keep a running parity and only keep the last if | ||
510 | statement. Time for yet another version! | ||
511 | |||
512 | |||
513 | Attempt 6 | ||
514 | ========= | ||
515 | |||
516 | The code within the for loop was changed to: | ||
517 | |||
518 | for (i = 0; i < 4; i++) | ||
519 | { | ||
520 | cur = *bp++; tmppar = cur; rp4 ^= cur; | ||
521 | cur = *bp++; tmppar ^= cur; rp6 ^= tmppar; | ||
522 | cur = *bp++; tmppar ^= cur; rp4 ^= cur; | ||
523 | cur = *bp++; tmppar ^= cur; rp8 ^= tmppar; | ||
524 | |||
525 | cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur; | ||
526 | cur = *bp++; tmppar ^= cur; rp6 ^= cur; | ||
527 | cur = *bp++; tmppar ^= cur; rp4 ^= cur; | ||
528 | cur = *bp++; tmppar ^= cur; rp10 ^= tmppar; | ||
529 | |||
530 | cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur; rp8 ^= cur; | ||
531 | cur = *bp++; tmppar ^= cur; rp6 ^= cur; rp8 ^= cur; | ||
532 | cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp8 ^= cur; | ||
533 | cur = *bp++; tmppar ^= cur; rp8 ^= cur; | ||
534 | |||
535 | cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur; | ||
536 | cur = *bp++; tmppar ^= cur; rp6 ^= cur; | ||
537 | cur = *bp++; tmppar ^= cur; rp4 ^= cur; | ||
538 | cur = *bp++; tmppar ^= cur; | ||
539 | |||
540 | par ^= tmppar; | ||
541 | if ((i & 0x1) == 0) rp12 ^= tmppar; | ||
542 | if ((i & 0x2) == 0) rp14 ^= tmppar; | ||
543 | } | ||
544 | |||
545 | As you can see tmppar is used to accumulate the parity within a for | ||
546 | iteration. In the last 3 statements it is added to par and, if needed, | ||
547 | to rp12 and rp14. | ||
548 | |||
549 | While making the changes I also found that I could exploit that tmppar | ||
550 | contains the running parity for this iteration. So instead of having: | ||
551 | rp4 ^= cur; rp6 ^= cur; | ||
552 | I removed the rp6 ^= cur; statement and did rp6 ^= tmppar; in the next | ||
553 | statement. A similar change was done for rp8 and rp10. | ||
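
As a cross-check, the unrolled loop can be compared against a plain 64 iteration loop that keeps only the even-numbered accumulators. This is a sketch under assumptions: uint32_t replaces unsigned long, and the struct rows and the function names plain, unrolled and rows_equal are hypothetical. The body of unrolled follows the listing above.

```c
#include <assert.h>
#include <stdint.h>

struct rows { uint32_t par, rp4, rp6, rp8, rp10, rp12, rp14; };

/* Plain loop over 64 words, even-numbered accumulators only. */
static struct rows plain(const uint32_t *bp)
{
	struct rows r = { 0, 0, 0, 0, 0, 0, 0 };
	int i;

	for (i = 0; i < 64; i++) {
		uint32_t cur = bp[i];

		r.par ^= cur;
		if (!(i & 0x01)) r.rp4 ^= cur;
		if (!(i & 0x02)) r.rp6 ^= cur;
		if (!(i & 0x04)) r.rp8 ^= cur;
		if (!(i & 0x08)) r.rp10 ^= cur;
		if (!(i & 0x10)) r.rp12 ^= cur;
		if (!(i & 0x20)) r.rp14 ^= cur;
	}
	return r;
}

/* Sixteen words per iteration with a running parity, as in attempt 6. */
static struct rows unrolled(const uint32_t *bp)
{
	struct rows r = { 0, 0, 0, 0, 0, 0, 0 };
	uint32_t cur, tmppar;
	int i;

	for (i = 0; i < 4; i++) {
		cur = *bp++; tmppar = cur; r.rp4 ^= cur;
		cur = *bp++; tmppar ^= cur; r.rp6 ^= tmppar;
		cur = *bp++; tmppar ^= cur; r.rp4 ^= cur;
		cur = *bp++; tmppar ^= cur; r.rp8 ^= tmppar;

		cur = *bp++; tmppar ^= cur; r.rp4 ^= cur; r.rp6 ^= cur;
		cur = *bp++; tmppar ^= cur; r.rp6 ^= cur;
		cur = *bp++; tmppar ^= cur; r.rp4 ^= cur;
		cur = *bp++; tmppar ^= cur; r.rp10 ^= tmppar;

		cur = *bp++; tmppar ^= cur; r.rp4 ^= cur; r.rp6 ^= cur; r.rp8 ^= cur;
		cur = *bp++; tmppar ^= cur; r.rp6 ^= cur; r.rp8 ^= cur;
		cur = *bp++; tmppar ^= cur; r.rp4 ^= cur; r.rp8 ^= cur;
		cur = *bp++; tmppar ^= cur; r.rp8 ^= cur;

		cur = *bp++; tmppar ^= cur; r.rp4 ^= cur; r.rp6 ^= cur;
		cur = *bp++; tmppar ^= cur; r.rp6 ^= cur;
		cur = *bp++; tmppar ^= cur; r.rp4 ^= cur;
		cur = *bp++; tmppar ^= cur;

		r.par ^= tmppar;
		if ((i & 0x1) == 0) r.rp12 ^= tmppar;
		if ((i & 0x2) == 0) r.rp14 ^= tmppar;
	}
	return r;
}

/* Returns 1 if all accumulators agree. */
static int rows_equal(struct rows a, struct rows b)
{
	return a.par == b.par && a.rp4 == b.rp4 && a.rp6 == b.rp6 &&
	       a.rp8 == b.rp8 && a.rp10 == b.rp10 &&
	       a.rp12 == b.rp12 && a.rp14 == b.rp14;
}
```

Calling rows_equal(plain(buf), unrolled(buf)) on any 64 word buffer should return 1, since both loops xor exactly the same words into each accumulator.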
554 | |||
555 | |||
556 | Analysis 6 | ||
557 | ========== | ||
558 | |||
559 | Measuring this code again showed a big gain. When executing the original | ||
560 | linux code 1 million times, this took about 1 second on my system | ||
561 | (using time to measure the performance). After this iteration I was back | ||
562 | to 0.075 sec. Actually I had to decide to start measuring over 10 | ||
563 | million iterations in order not to lose too much accuracy. This one | ||
564 | definitely seemed to be the jackpot! | ||
565 | |||
566 | There is a little bit more room for improvement though. There are three | ||
567 | places with statements: | ||
568 | rp4 ^= cur; rp6 ^= cur; | ||
569 | It seems more efficient to also maintain a variable rp4_6 in the for | ||
570 | loop. This eliminates 3 statements per loop. Of course after the loop we | ||
571 | need to correct by adding: | ||
572 | rp4 ^= rp4_6; | ||
573 | rp6 ^= rp4_6; | ||
574 | Furthermore there are 4 sequential assignments to rp8. This can be | ||
575 | encoded slightly more efficiently by saving tmppar before those 4 lines | ||
576 | and later doing rp8 = rp8 ^ tmppar ^ notrp8; | ||
577 | (where notrp8 is the value of tmppar before those 4 lines). | ||
578 | Again a use of the fact that xor is its own inverse. | ||
579 | Time for a new test! | ||
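
The saved running parity trick for rp8 can be illustrated in isolation. A minimal sketch, assuming a block of 16 words in which words 8 to 11 are the ones that belong to rp8; the function name trick_matches is made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * w points at 16 words.  Returns 1 if the xor of w[8..11] obtained from
 * the running parity before and after those words equals the xor
 * computed directly, else 0.
 */
static int trick_matches(const uint32_t *w)
{
	uint32_t tmppar = 0, notrp8, direct = 0, rp8 = 0;
	int k;

	assert(w != 0);
	for (k = 0; k < 8; k++)
		tmppar ^= w[k];
	notrp8 = tmppar;		/* running parity before the 4 words */
	for (k = 8; k < 12; k++) {
		tmppar ^= w[k];
		direct ^= w[k];		/* what four rp8 ^= cur would give */
	}
	rp8 = rp8 ^ tmppar ^ notrp8;	/* words 0..7 cancel out */
	return rp8 == direct;
}
```

The words before the saved point appear in both tmppar and notrp8, so they cancel and only the four words in between remain.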
580 | |||
581 | |||
582 | Attempt 7 | ||
583 | ========= | ||
584 | |||
585 | The new code now looks like: | ||
586 | |||
587 | for (i = 0; i < 4; i++) | ||
588 | { | ||
589 | cur = *bp++; tmppar = cur; rp4 ^= cur; | ||
590 | cur = *bp++; tmppar ^= cur; rp6 ^= tmppar; | ||
591 | cur = *bp++; tmppar ^= cur; rp4 ^= cur; | ||
592 | cur = *bp++; tmppar ^= cur; rp8 ^= tmppar; | ||
593 | |||
594 | cur = *bp++; tmppar ^= cur; rp4_6 ^= cur; | ||
595 | cur = *bp++; tmppar ^= cur; rp6 ^= cur; | ||
596 | cur = *bp++; tmppar ^= cur; rp4 ^= cur; | ||
597 | cur = *bp++; tmppar ^= cur; rp10 ^= tmppar; | ||
598 | |||
599 | notrp8 = tmppar; | ||
600 | cur = *bp++; tmppar ^= cur; rp4_6 ^= cur; | ||
601 | cur = *bp++; tmppar ^= cur; rp6 ^= cur; | ||
602 | cur = *bp++; tmppar ^= cur; rp4 ^= cur; | ||
603 | cur = *bp++; tmppar ^= cur; | ||
604 | rp8 = rp8 ^ tmppar ^ notrp8; | ||
605 | |||
606 | cur = *bp++; tmppar ^= cur; rp4_6 ^= cur; | ||
607 | cur = *bp++; tmppar ^= cur; rp6 ^= cur; | ||
608 | cur = *bp++; tmppar ^= cur; rp4 ^= cur; | ||
609 | cur = *bp++; tmppar ^= cur; | ||
610 | |||
611 | par ^= tmppar; | ||
612 | if ((i & 0x1) == 0) rp12 ^= tmppar; | ||
613 | if ((i & 0x2) == 0) rp14 ^= tmppar; | ||
614 | } | ||
615 | rp4 ^= rp4_6; | ||
616 | rp6 ^= rp4_6; | ||
617 | |||
618 | |||
619 | Not a big change, but every penny counts :-) | ||
620 | |||
621 | |||
622 | Analysis 7 | ||
623 | ========== | ||
624 | |||
625 | Actually this made things worse. Not very much, but I don't want to move | ||
626 | in the wrong direction. Maybe something to investigate later. Could | ||
627 | have to do with caching again. | ||
628 | |||
629 | Guess that is what there is to win within the loop. Maybe unrolling one | ||
630 | more time will help. I'll keep the optimisations from 7 for now. | ||
631 | |||
632 | |||
633 | Attempt 8 | ||
634 | ========= | ||
635 | |||
636 | Unrolled the loop one more time. | ||
637 | |||
638 | |||
639 | Analysis 8 | ||
640 | ========== | ||
641 | |||
642 | This makes things worse. Let's stick with attempt 6 and continue from there. | ||
643 | Although it seems that the code within the loop cannot be optimised | ||
644 | further there is still room to optimize the generation of the ecc codes. | ||
645 | We can simply calculate the total parity. If this is 0 then rp4 = rp5 | ||
646 | etc. If the parity is 1, then rp4 = !rp5; | ||
647 | But if rp4 = rp5 we do not need rp5 etc. We can just write the even bits | ||
648 | in the result byte and then do something like | ||
649 | code[0] |= (code[0] << 1); | ||
650 | Let's test this. | ||
651 | |||
652 | |||
653 | Attempt 9 | ||
654 | ========= | ||
655 | |||
656 | Changed the code but again this slightly degrades performance. Tried all | ||
657 | kinds of other things, like having dedicated parity arrays to avoid the | ||
658 | shift after parity[rp7] << 7; No gain. | ||
659 | Changed the lookup using the parity array to shift operators, e.g. | ||
660 | replaced parity[rp7] << 7 with: | ||
661 | rp7 ^= (rp7 << 4); | ||
662 | rp7 ^= (rp7 << 2); | ||
663 | rp7 ^= (rp7 << 1); | ||
664 | rp7 &= 0x80; | ||
665 | No gain. | ||
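
That the shift sequence leaves the parity of the byte in bit 7 can be confirmed by brute force. A small sketch; the helper names parity_to_bit7 and parity_by_counting are hypothetical.

```c
#include <assert.h>

/* Shift/xor sequence from the text: parity of v ends up in bit 7. */
static unsigned char parity_to_bit7(unsigned char v)
{
	unsigned int rp7 = v;

	rp7 ^= (rp7 << 4);
	rp7 ^= (rp7 << 2);
	rp7 ^= (rp7 << 1);
	rp7 &= 0x80;
	return (unsigned char)rp7;
}

/* Reference: count the set bits and place the parity in bit 7. */
static unsigned char parity_by_counting(unsigned char v)
{
	int n = 0;

	while (v) {
		n += v & 1;
		v >>= 1;
	}
	return (n & 1) ? 0x80 : 0x00;
}

/* Returns 1 if both methods agree for all 256 byte values. */
static int shift_parity_ok(void)
{
	int v;

	for (v = 0; v < 256; v++)
		if (parity_to_bit7((unsigned char)v) !=
		    parity_by_counting((unsigned char)v))
			return 0;
	assert(v == 256);
	return 1;
}
```

Only left shifts are used, so bits above bit 7 never flow back down and the final mask keeps just the accumulated parity.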
666 | |||
667 | The only marginal change was inverting the parity bits, so we can remove | ||
668 | the last three invert statements. | ||
669 | |||
670 | Ah well, pity this does not deliver more. Then again 10 million | ||
671 | iterations using the linux driver code takes between 13 and 13.5 | ||
672 | seconds, whereas my code now takes about 0.73 seconds for those 10 | ||
673 | million iterations. So basically I've improved the performance by a | ||
674 | factor of 18 on my system. Not that bad. Of course on different hardware | ||
675 | you will get different results. No warranties! | ||
676 | |||
677 | But of course there is no such thing as a free lunch. The code size almost | ||
678 | tripled (from 562 bytes to 1434 bytes). Then again, it is not that much. | ||
679 | |||
680 | |||
681 | Correcting errors | ||
682 | ================= | ||
683 | |||
684 | For correcting errors I again used the ST application note as a starter, | ||
685 | but I also peeked at the existing code. | ||
686 | The algorithm itself is pretty straightforward. Just xor the given and | ||
687 | the calculated ecc. If all bytes are 0 there is no problem. If 11 bits | ||
688 | are 1 we have one correctable bit error. If exactly 1 bit is 1, we have an | ||
689 | error in the given ecc code. | ||
690 | It proved to be fastest to do some table lookups. Performance gain | ||
691 | introduced by this is about a factor 2 on my system when a repair had to | ||
692 | be done, and 1% or so if no repair had to be done. | ||
693 | Code size increased from 330 bytes to 686 bytes for this function. | ||
694 | (gcc 4.2, -O3) | ||
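
The classification step described above can be sketched as follows. This only counts differing bits as the text describes; it does not locate or repair the failing bit, it uses a simple loop rather than the table lookups mentioned, and a production implementation may well apply stricter checks. The enum and function names are assumptions for this illustration.

```c
#include <assert.h>

enum ecc_status {
	ECC_OK,			/* stored and calculated ecc are equal */
	ECC_DATA_BIT_ERROR,	/* 11 bits differ: one correctable bit error */
	ECC_CODE_BIT_ERROR,	/* 1 bit differs: error in the stored ecc */
	ECC_UNCORRECTABLE	/* anything else */
};

/* stored and calc each point at a 3 byte ecc. */
static enum ecc_status ecc_classify(const unsigned char *stored,
				    const unsigned char *calc)
{
	int i, bits = 0;

	assert(stored != 0 && calc != 0);
	for (i = 0; i < 3; i++) {
		unsigned char diff = (unsigned char)(stored[i] ^ calc[i]);

		while (diff) {		/* count the bits that differ */
			bits += diff & 1;
			diff >>= 1;
		}
	}
	if (bits == 0)
		return ECC_OK;
	if (bits == 11)
		return ECC_DATA_BIT_ERROR;
	if (bits == 1)
		return ECC_CODE_BIT_ERROR;
	return ECC_UNCORRECTABLE;
}
```

The count of 11 follows from the layout: a single flipped data bit changes one parity out of each of the 8 row pairs and one out of each of the 3 column pairs.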
695 | |||
696 | |||
697 | Conclusion | ||
698 | ========== | ||
699 | |||
700 | The gain when calculating the ecc is tremendous. On my development hardware | ||
701 | a speedup of a factor of 18 for ecc calculation was achieved. On a test on an | ||
702 | embedded system with a MIPS core a factor of 7 was obtained. | ||
703 | On a test with a Linksys NSLU2 (ARMv5TE processor) the speedup was a factor | ||
704 | of 5 (big endian mode, gcc 4.1.2, -O3). | ||
705 | For correction not much gain could be obtained (as bitflips are rare). Then | ||
706 | again there are also far fewer cycles spent there. | ||
707 | |||
708 | It seems there is not much more gain possible in this, at least when | ||
709 | programmed in C. Of course it might be possible to squeeze something more | ||
710 | out of it with an assembler program, but due to pipeline behaviour etc | ||
711 | this is very tricky (at least for Intel hardware). | ||
712 | |||
713 | Author: Frans Meulenbroeks | ||
714 | Copyright (C) 2008 Koninklijke Philips Electronics NV. | ||
diff --git a/Documentation/sysrq.txt b/Documentation/sysrq.txt index 7b3b069c376e..10a0263ebb3f 100644 --- a/Documentation/sysrq.txt +++ b/Documentation/sysrq.txt | |||
@@ -95,8 +95,9 @@ On all - write a character to /proc/sysrq-trigger. e.g.: | |||
95 | 95 | ||
96 | 'p' - Will dump the current registers and flags to your console. | 96 | 'p' - Will dump the current registers and flags to your console. |
97 | 97 | ||
98 | 'q' - Will dump per CPU lists of all armed hrtimers (not timer_list timers) | 98 | 'q' - Will dump per CPU lists of all armed hrtimers (but NOT regular |
99 | and detailed information about all clockevent devices. | 99 | timer_list timers) and detailed information about all |
100 | clockevent devices. | ||
100 | 101 | ||
101 | 'r' - Turns off keyboard raw mode and sets it to XLATE. | 102 | 'r' - Turns off keyboard raw mode and sets it to XLATE. |
102 | 103 | ||
diff --git a/Documentation/vm/unevictable-lru.txt b/Documentation/vm/unevictable-lru.txt new file mode 100644 index 000000000000..125eed560e5a --- /dev/null +++ b/Documentation/vm/unevictable-lru.txt | |||
@@ -0,0 +1,615 @@ | |||
1 | |||
2 | This document describes the Linux memory management "Unevictable LRU" | ||
3 | infrastructure and the use of this infrastructure to manage several types | ||
4 | of "unevictable" pages. The document attempts to provide the overall | ||
5 | rationale behind this mechanism and the rationale for some of the design | ||
6 | decisions that drove the implementation. The latter design rationale is | ||
7 | discussed in the context of an implementation description. Admittedly, one | ||
8 | can obtain the implementation details--the "what does it do?"--by reading the | ||
9 | code. One hopes that the descriptions below add value by providing the answer | ||
10 | to "why does it do that?". | ||
11 | |||
12 | Unevictable LRU Infrastructure: | ||
13 | |||
14 | The Unevictable LRU adds an additional LRU list to track unevictable pages | ||
15 | and to hide these pages from vmscan. This mechanism is based on a patch by | ||
16 | Larry Woodman of Red Hat to address several scalability problems with page | ||
17 | reclaim in Linux. The problems have been observed at customer sites on large | ||
18 | memory x86_64 systems. For example, a non-NUMA x86_64 platform with 128GB | ||
19 | of main memory will have over 32 million 4k pages in a single zone. When a | ||
20 | large fraction of these pages are not evictable for any reason [see below], | ||
21 | vmscan will spend a lot of time scanning the LRU lists looking for the small | ||
22 | fraction of pages that are evictable. This can result in a situation where | ||
23 | all cpus are spending 100% of their time in vmscan for hours or days on end, | ||
24 | with the system completely unresponsive. | ||
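
The page count quoted above is easy to reproduce. A tiny sketch, assuming 128GB means 128 * 2^30 bytes and a 4k page means 4096 bytes; the function name pages_in is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Number of whole pages of page_bytes that fit in mem_bytes. */
static uint64_t pages_in(uint64_t mem_bytes, uint64_t page_bytes)
{
	assert(page_bytes != 0);
	return mem_bytes / page_bytes;
}
```

With those assumptions pages_in(128ULL << 30, 4096) evaluates to 33554432, i.e. 2^25 pages, which matches "over 32 million".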
25 | |||
26 | The Unevictable LRU infrastructure addresses the following classes of | ||
27 | unevictable pages: | ||
28 | |||
29 | + pages owned by ramfs | ||
30 | + pages mapped into SHM_LOCKed shared memory regions | ||
31 | + pages mapped into VM_LOCKED [mlock()ed] vmas | ||
32 | |||
33 | The infrastructure might be able to handle other conditions that make pages | ||
34 | unevictable, either by definition or by circumstance, in the future. | ||
35 | |||
36 | |||
37 | The Unevictable LRU List | ||
38 | |||
39 | The Unevictable LRU infrastructure consists of an additional, per-zone, LRU list | ||
40 | called the "unevictable" list and an associated page flag, PG_unevictable, to | ||
41 | indicate that the page is being managed on the unevictable list. The | ||
42 | PG_unevictable flag is analogous to, and mutually exclusive with, the PG_active | ||
43 | flag in that it indicates on which LRU list a page resides when PG_lru is set. | ||
44 | The unevictable LRU list is source configurable based on the UNEVICTABLE_LRU | ||
45 | Kconfig option. | ||
46 | |||
47 | The Unevictable LRU infrastructure maintains unevictable pages on an additional | ||
48 | LRU list for a few reasons: | ||
49 | |||
50 | 1) We get to "treat unevictable pages just like we treat other pages in the | ||
51 | system, which means we get to use the same code to manipulate them, the | ||
52 | same code to isolate them (for migrate, etc.), the same code to keep track | ||
53 | of the statistics, etc..." [Rik van Riel] | ||
54 | |||
55 | 2) We want to be able to migrate unevictable pages between nodes--for memory | ||
56 | defragmentation, workload management and memory hotplug. The linux kernel | ||
57 | can only migrate pages that it can successfully isolate from the lru lists. | ||
58 | If we were to maintain pages elsewhere than on an lru-like list, where they | ||
59 | can be found by isolate_lru_page(), we would prevent their migration, unless | ||
60 | we reworked migration code to find the unevictable pages. | ||
61 | |||
62 | |||
63 | The unevictable LRU list does not differentiate between file backed and swap | ||
64 | backed [anon] pages. This differentiation is only important while the pages | ||
65 | are, in fact, evictable. | ||
66 | |||
67 | The unevictable LRU list benefits from the "arrayification" of the per-zone | ||
68 | LRU lists and statistics originally proposed and posted by Christoph Lameter. | ||
69 | |||
70 | The unevictable list does not use the lru pagevec mechanism. Rather, | ||
71 | unevictable pages are placed directly on the page's zone's unevictable | ||
72 | list under the zone lru_lock. The reason for this is to prevent stranding | ||
73 | of pages on the unevictable list when one task has the page isolated from the | ||
74 | lru and other tasks are changing the "evictability" state of the page. | ||
75 | |||
76 | |||
77 | Unevictable LRU and Memory Controller Interaction | ||
78 | |||
79 | The memory controller data structure automatically gets a per zone unevictable | ||
80 | lru list as a result of the "arrayification" of the per-zone LRU lists. The | ||
81 | memory controller tracks the movement of pages to and from the unevictable list. | ||
82 | When a memory control group comes under memory pressure, the controller will | ||
83 | not attempt to reclaim pages on the unevictable list. This has a couple of | ||
84 | effects. Because the pages are "hidden" from reclaim on the unevictable list, | ||
85 | the reclaim process can be more efficient, dealing only with pages that have | ||
86 | a chance of being reclaimed. On the other hand, if too many of the pages | ||
87 | charged to the control group are unevictable, the evictable portion of the | ||
88 | working set of the tasks in the control group may not fit into the available | ||
89 | memory. This can cause the control group to thrash or to oom-kill tasks. | ||
90 | |||
91 | |||
92 | Unevictable LRU: Detecting Unevictable Pages | ||
93 | |||
94 | The function page_evictable(page, vma) in vmscan.c determines whether a | ||
95 | page is evictable or not. For ramfs pages and pages in SHM_LOCKed regions, | ||
96 | page_evictable() tests a new address space flag, AS_UNEVICTABLE, in the page's | ||
97 | address space using a wrapper function. Wrapper functions are used to set, | ||
98 | clear and test the flag to reduce the requirement for #ifdef's throughout the | ||
99 | source code. AS_UNEVICTABLE is set on ramfs inode/mapping when it is created. | ||
100 | This flag remains for the life of the inode. | ||
101 | |||
102 | For shared memory regions, AS_UNEVICTABLE is set when an application | ||
103 | successfully SHM_LOCKs the region and is removed when the region is | ||
104 | SHM_UNLOCKed. Note that shmctl(SHM_LOCK, ...) does not populate the page | ||
105 | tables for the region as does, for example, mlock(). So, we make no special | ||
106 | effort to push any pages in the SHM_LOCKed region to the unevictable list. | ||
107 | Vmscan will do this when/if it encounters the pages during reclaim. On | ||
108 | SHM_UNLOCK, shmctl() scans the pages in the region and "rescues" them from the | ||
109 | unevictable list if no other condition keeps them unevictable. If a SHM_LOCKed | ||
110 | region is destroyed, the pages are also "rescued" from the unevictable list in | ||
111 | the process of freeing them. | ||
112 | |||
113 | page_evictable() detects mlock()ed pages by testing an additional page flag, | ||
114 | PG_mlocked via the PageMlocked() wrapper. If the page is NOT mlocked, and a | ||
115 | non-NULL vma is supplied, page_evictable() will check whether the vma is | ||
116 | VM_LOCKED via is_mlocked_vma(). is_mlocked_vma() will SetPageMlocked() and | ||
117 | update the appropriate statistics if the vma is VM_LOCKED. This method allows | ||
118 | efficient "culling" of pages in the fault path that are being faulted in to | ||
119 | VM_LOCKED vmas. | ||
120 | |||
121 | |||
122 | Unevictable Pages and Vmscan [shrink_*_list()] | ||
123 | |||
124 | If unevictable pages are culled in the fault path, or moved to the unevictable | ||
125 | list at mlock() or mmap() time, vmscan will never encounter the pages until | ||
126 | they have become evictable again, for example, via munlock() and have been | ||
127 | "rescued" from the unevictable list. However, there may be situations where we | ||
128 | decide, for the sake of expediency, to leave an unevictable page on one of the | ||
129 | regular active/inactive LRU lists for vmscan to deal with. Vmscan checks for | ||
130 | such pages in all of the shrink_{active|inactive|page}_list() functions and | ||
131 | will "cull" such pages that it encounters--that is, it diverts those pages to | ||
132 | the unevictable list for the zone being scanned. | ||
133 | |||
134 | There may be situations where a page is mapped into a VM_LOCKED vma, but the | ||
135 | page is not marked as PageMlocked. Such pages will make it all the way to | ||
136 | shrink_page_list() where they will be detected when vmscan walks the reverse | ||
137 | map in try_to_unmap(). If try_to_unmap() returns SWAP_MLOCK, shrink_page_list() | ||
138 | will cull the page at that point. | ||
139 | |||
140 | Note that for anonymous pages, shrink_page_list() attempts to add the page to | ||
141 | the swap cache before it tries to unmap the page. To avoid this unnecessary | ||
142 | consumption of swap space, shrink_page_list() calls try_to_munlock() to check | ||
143 | whether any VM_LOCKED vmas map the page without attempting to unmap the page. | ||
144 | If try_to_munlock() returns SWAP_MLOCK, shrink_page_list() will cull the page | ||
145 | without consuming swap space. try_to_munlock() will be described below. | ||
146 | |||
147 | To "cull" an unevictable page, vmscan simply puts the page back on the lru | ||
148 | list using putback_lru_page()--the inverse operation to isolate_lru_page()-- | ||
149 | after dropping the page lock. Because the condition which makes the page | ||
150 | unevictable may change once the page is unlocked, putback_lru_page() will | ||
151 | recheck the unevictable state of a page that it places on the unevictable lru | ||
152 | list. If the page has become evictable, putback_lru_page() removes it from | ||
153 | the list and retries, including the page_evictable() test. Because such a | ||
154 | race is a rare event and movement of pages onto the unevictable list should be | ||
155 | rare, these extra evictability checks should not occur in the majority of calls | ||
156 | to putback_lru_page(). | ||
157 | |||
158 | |||
159 | Mlocked Pages: Prior Work | ||
160 | |||
161 | The "Unevictable Mlocked Pages" infrastructure is based on work originally | ||
162 | posted by Nick Piggin in an RFC patch entitled "mm: mlocked pages off LRU". | ||
163 | Nick posted his patch as an alternative to a patch posted by Christoph | ||
164 | Lameter to achieve the same objective--hiding mlocked pages from vmscan. | ||
165 | In Nick's patch, he used one of the struct page lru list link fields as a count | ||
166 | of VM_LOCKED vmas that map the page. This use of the link field for a count | ||
167 | prevented the management of the pages on an LRU list. Thus, mlocked pages were | ||
168 | not migratable as isolate_lru_page() could not find them and the lru list link | ||
169 | field was not available to the migration subsystem. Nick resolved this by | ||
170 | putting mlocked pages back on the lru list before attempting to isolate them, | ||
171 | thus abandoning the count of VM_LOCKED vmas. When Nick's patch was integrated | ||
172 | with the Unevictable LRU work, the count was replaced by walking the reverse | ||
173 | map to determine whether any VM_LOCKED vmas mapped the page. More on this | ||
174 | below. | ||
175 | |||
176 | |||
177 | Mlocked Pages: Basic Management | ||
178 | |||
179 | Mlocked pages--pages mapped into a VM_LOCKED vma--represent one class of | ||
180 | unevictable pages. When such a page has been "noticed" by the memory | ||
181 | management subsystem, the page is marked with the PG_mlocked [PageMlocked()] | ||
182 | flag. A PageMlocked() page will be placed on the unevictable LRU list when | ||
183 | it is added to the LRU. Pages can be "noticed" by memory management in | ||
184 | several places: | ||
185 | |||
186 | 1) in the mlock()/mlockall() system call handlers. | ||
187 | 2) in the mmap() system call handler when mmap()ing a region with the | ||
188 | MAP_LOCKED flag, or mmap()ing a region in a task that has called | ||
189 | mlockall() with the MCL_FUTURE flag. Both of these conditions result | ||
190 | in the VM_LOCKED flag being set for the vma. | ||
191 | 3) in the fault path, if mlocked pages are "culled" in the fault path, | ||
192 | and when a VM_LOCKED stack segment is expanded. | ||
193 | 4) as mentioned above, in vmscan:shrink_page_list() when attempting to | ||
194 | reclaim a page in a VM_LOCKED vma--via try_to_unmap() or try_to_munlock(). | ||
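
From user space, the first two cases in the list above correspond to the
calls sketched here. This is a minimal illustration, not part of the kernel
change: the demo_* helper names are hypothetical, and whether the calls
succeed depends on the caller's locked-memory limit (RLIMIT_MEMLOCK) and
privileges, so the helpers report a refusal rather than treating it as fatal.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE	/* exposes MAP_ANONYMOUS and MAP_LOCKED on glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper. Returns 1 if the region was locked, 0 if the kernel
 * refused the mlock() (for example because of RLIMIT_MEMLOCK), and -1 if the
 * plain anonymous mapping could not be created at all.
 */
int demo_mlock_region(size_t len)
{
	void *buf;
	int locked;

	assert(len > 0);
	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1;
	/* Case 1: the mlock() system call handler sets VM_LOCKED on the vma. */
	locked = (mlock(buf, len) == 0);
	if (locked)
		munlock(buf, len);	/* pages become evictable again */
	munmap(buf, len);
	return locked;
}

/*
 * Hypothetical helper. Case 2: mmap() with MAP_LOCKED. Returns 1 if the
 * locked mapping was created and 0 if the kernel refused it.
 */
int demo_map_locked(size_t len)
{
	void *buf;

	assert(len > 0);
	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED, -1, 0);
	if (buf == MAP_FAILED)
		return 0;
	munmap(buf, len);	/* unmapping the last VM_LOCKED vma munlocks */
	return 1;
}
```

A task that calls mlockall() with MCL_FUTURE gets the same treatment for
every later mapping without passing MAP_LOCKED explicitly.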
195 | |||
196 | Mlocked pages become unlocked and rescued from the unevictable list when: | ||
197 | |||
198 | 1) mapped in a range unlocked via the munlock()/munlockall() system calls. | ||
199 | 2) munmap()ed out of the last VM_LOCKED vma that maps the page, including | ||
200 | unmapping at task exit. | ||
201 | 3) when the page is truncated from the last VM_LOCKED vma of an mmap()ed file. | ||
202 | 4) before a page is COWed in a VM_LOCKED vma. | ||
203 | |||
204 | |||
205 | Mlocked Pages: mlock()/mlockall() System Call Handling | ||
206 | |||
207 | Both [do_]mlock() and [do_]mlockall() system call handlers call mlock_fixup() | ||
208 | for each vma in the range specified by the call. In the case of mlockall(), | ||
209 | this is the entire active address space of the task. Note that mlock_fixup() | ||
210 | is used for both mlock()ing and munlock()ing a range of memory. A call to | ||
211 | mlock() an already VM_LOCKED vma, or to munlock() a vma that is not VM_LOCKED | ||
212 | is treated as a no-op--mlock_fixup() simply returns. | ||
213 | |||
214 | If the vma passes some filtering described in "Mlocked Pages: Filtering Special Vmas" | ||
215 | below, mlock_fixup() will attempt to merge the vma with its neighbors or split | ||
216 | off a subset of the vma if the range does not cover the entire vma. Once the | ||
217 | vma has been merged or split or neither, mlock_fixup() will call | ||
218 | __mlock_vma_pages_range() to fault in the pages via get_user_pages() and | ||
219 | to mark the pages as mlocked via mlock_vma_page(). | ||
220 | |||
221 | Note that the vma being mlocked might be mapped with PROT_NONE. In this case, | ||
222 | get_user_pages() will be unable to fault in the pages. That's OK. If pages | ||
223 | do end up getting faulted into this VM_LOCKED vma, we'll handle them in the | ||
224 | fault path or in vmscan. | ||
225 | |||
226 | Also note that a page returned by get_user_pages() could be truncated or | ||
227 | migrated out from under us, while we're trying to mlock it. To detect | ||
228 | this, __mlock_vma_pages_range() tests the page_mapping after acquiring | ||
229 | the page lock. If the page is still associated with its mapping, we'll | ||
230 | go ahead and call mlock_vma_page(). If the mapping is gone, we just | ||
231 | unlock the page and move on. Worst case, this results in a page mapped | ||
232 | in a VM_LOCKED vma remaining on a normal LRU list without being | ||
233 | PageMlocked(). Again, vmscan will detect and cull such pages. | ||
234 | |||
235 | mlock_vma_page(), called with the page locked [N.B., not "mlocked"], will | ||
236 | TestSetPageMlocked() for each page returned by get_user_pages(). We use | ||
237 | TestSetPageMlocked() because the page might already be mlocked by another | ||
238 | task/vma and we don't want to do extra work. We especially do not want to | ||
239 | count an mlocked page more than once in the statistics. If the page was | ||
240 | already mlocked, mlock_vma_page() is done. | ||
241 | |||
242 | If the page was NOT already mlocked, mlock_vma_page() attempts to isolate the | ||
243 | page from the LRU, as it is likely on the appropriate active or inactive list | ||
244 | at that time. If the isolate_lru_page() succeeds, mlock_vma_page() will | ||
245 | putback the page--putback_lru_page()--which will notice that the page is now | ||
246 | mlocked and divert the page to the zone's unevictable LRU list. If | ||
247 | mlock_vma_page() is unable to isolate the page from the LRU, vmscan will handle | ||
248 | it later if/when it attempts to reclaim the page. | ||
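
The reason for using a test-and-set operation can be shown with a small
user-space model built on C11 atomics. This is a sketch under assumptions:
the counted_* names and the single flag bit are hypothetical stand-ins for
struct page, PG_mlocked and the zone statistics, and the isolate/putback step
is reduced to a comment.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define COUNTED_PG_MLOCKED 0x1u	/* hypothetical stand-in for PG_mlocked */

struct counted_page { atomic_uint flags; };
struct counted_zone { atomic_long nr_mlock; };	/* mlocked page statistic */

/* Atomically set the flag; report whether it was already set. */
static bool counted_test_set_mlocked(struct counted_page *page)
{
	unsigned int old = atomic_fetch_or(&page->flags, COUNTED_PG_MLOCKED);

	return (old & COUNTED_PG_MLOCKED) != 0;
}

/* Returns true only for the caller that actually mlocked the page. */
bool counted_mlock_vma_page(struct counted_zone *zone,
			    struct counted_page *page)
{
	assert(zone && page);
	if (counted_test_set_mlocked(page))
		return false;	/* already mlocked by another task/vma */
	atomic_fetch_add(&zone->nr_mlock, 1);	/* count the page exactly once */
	/* The kernel would now try to isolate the page and put it back. */
	return true;
}
```

However many vmas mlock the same page, only the first caller sees the flag
clear, so the statistic is incremented once.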
249 | |||
250 | |||
251 | Mlocked Pages: Filtering Special Vmas | ||
252 | |||
253 | mlock_fixup() filters several classes of "special" vmas: | ||
254 | |||
255 | 1) vmas with VM_IO|VM_PFNMAP set are skipped entirely. The pages behind | ||
256 | these mappings are inherently pinned, so we don't need to mark them as | ||
257 | mlocked. In any case, most of the pages have no struct page in which to | ||
258 | so mark the page. Because of this, get_user_pages() will fail for these | ||
259 | vmas, so there is no sense in attempting to visit them. | ||
260 | |||
261 | 2) vmas mapping hugetlbfs pages are already effectively pinned into memory. | ||
262 | We neither need nor want to mlock() these pages. However, to preserve the | ||
263 | prior behavior of mlock()--before the unevictable/mlock changes--mlock_fixup() | ||
264 | will call make_pages_present() in the hugetlbfs vma range to allocate the | ||
265 | huge pages and populate the ptes. | ||
266 | |||
267 | 3) vmas with VM_DONTEXPAND|VM_RESERVED are generally user space mappings of | ||
268 | kernel pages, such as the vdso page, relay channel pages, etc. These pages | ||
269 | are inherently unevictable and are not managed on the LRU lists. | ||
270 | mlock_fixup() treats these vmas the same as hugetlbfs vmas. It calls | ||
271 | make_pages_present() to populate the ptes. | ||
272 | |||
273 | Note that for all of these special vmas, mlock_fixup() does not set the | ||
274 | VM_LOCKED flag. Therefore, we won't have to deal with them later during | ||
275 | munlock() or munmap()--for example, at task exit. Neither does mlock_fixup() | ||
276 | account these vmas against the task's "locked_vm". | ||
277 | |||
278 | Mlocked Pages: Downgrading the Mmap Semaphore. | ||
279 | |||
280 | mlock_fixup() must be called with the mmap semaphore held for write, because | ||
281 | it may have to merge or split vmas. However, mlocking a large region of | ||
282 | memory can take a long time--especially if vmscan must reclaim pages to | ||
283 | satisfy the region's requirements. Faulting in a large region with the mmap | ||
284 | semaphore held for write can hold off other faults on the address space, in | ||
285 | the case of a multi-threaded task. It can also hold off scans of the task's | ||
286 | address space via /proc. While testing under heavy load, it was observed that | ||
287 | the ps(1) command could be held off for many minutes while a large segment was | ||
288 | mlock()ed down. | ||
289 | |||
290 | To address this issue, and to make the system more responsive during mlock()ing | ||
291 | of large segments, mlock_fixup() downgrades the mmap semaphore to read mode | ||
292 | during the call to __mlock_vma_pages_range(). This works fine. However, the | ||
293 | callers of mlock_fixup() expect the semaphore to be returned in write mode. | ||
294 | So, mlock_fixup() "upgrades" the semaphore to write mode. Linux does not | ||
295 | support an atomic upgrade_sem() call, so mlock_fixup() must drop the semaphore | ||
296 | and reacquire it in write mode. In a multi-threaded task, it is possible for | ||
297 | the task memory map to change while the semaphore is dropped. Therefore, | ||
298 | mlock_fixup() looks up the vma at the range start address after reacquiring | ||
299 | the semaphore in write mode and verifies that it still covers the original | ||
300 | range. If not, mlock_fixup() returns an error [-EAGAIN]. All callers of | ||
301 | mlock_fixup() have been changed to deal with this new error condition. | ||
302 | |||
303 | Note: when munlocking a region, all of the pages should already be resident-- | ||
304 | unless we have racing threads mlocking() and munlocking() regions. So, | ||
305 | unlocking should not have to wait for page allocations nor faults of any kind. | ||
306 | Therefore mlock_fixup() does not downgrade the semaphore for munlock(). | ||
307 | |||
308 | |||
309 | Mlocked Pages: munlock()/munlockall() System Call Handling | ||
310 | |||
311 | The munlock() and munlockall() system calls are handled by the same functions-- | ||
312 | do_mlock[all]()--as the mlock() and mlockall() system calls with the unlock | ||
313 | vs lock operation indicated by an argument. So, these system calls are also | ||
314 | handled by mlock_fixup(). Again, if called for an already munlock()ed vma, | ||
315 | mlock_fixup() simply returns. Because of the vma filtering discussed above, | ||
316 | VM_LOCKED will not be set in any "special" vmas. So, these vmas will be | ||
317 | ignored for munlock. | ||
318 | |||
319 | If the vma is VM_LOCKED, mlock_fixup() again attempts to merge or split off | ||
320 | the specified range. The range is then munlocked via the function | ||
321 | __mlock_vma_pages_range()--the same function used to mlock a vma range-- | ||
322 | passing a flag to indicate that munlock() is being performed. | ||
323 | |||
324 | Because the vma access protections could have been changed to PROT_NONE after | ||
325 | faulting in and mlocking some pages, get_user_pages() was unreliable for visiting | ||
326 | these pages for munlocking. Because we don't want to leave pages mlocked(), | ||
327 | get_user_pages() was enhanced to accept a flag to ignore the permissions when | ||
328 | fetching the pages--all of which should be resident as a result of previous | ||
329 | mlock()ing. | ||
330 | |||
331 | For munlock(), __mlock_vma_pages_range() unlocks individual pages by calling | ||
332 | munlock_vma_page(). munlock_vma_page() unconditionally clears the PG_mlocked | ||
333 | flag using TestClearPageMlocked(). As with mlock_vma_page(), munlock_vma_page() | ||
334 | uses the Test*PageMlocked() function to handle the case where the page might | ||
335 | have already been unlocked by another task. If the page was mlocked, | ||
336 | munlock_vma_page() updates the zone statistics for the number of mlocked | ||
337 | pages. Note, however, that at this point we haven't checked whether the page | ||
338 | is mapped by other VM_LOCKED vmas. | ||
339 | |||
340 | We can't call try_to_munlock(), the function that walks the reverse map to check | ||
341 | for other VM_LOCKED vmas, without first isolating the page from the LRU. | ||
342 | try_to_munlock() is a variant of try_to_unmap() and thus requires that the page | ||
343 | not be on an lru list. [More on these below.] However, the call to | ||
344 | isolate_lru_page() could fail, in which case we couldn't try_to_munlock(). | ||
345 | So, we go ahead and clear PG_mlocked up front, as this might be the only chance | ||
346 | we have. If we can successfully isolate the page, we go ahead and | ||
347 | try_to_munlock(), which will restore the PG_mlocked flag and update the zone | ||
348 | page statistics if it finds another vma holding the page mlocked. If we fail | ||
349 | to isolate the page, we'll have left a potentially mlocked page on the LRU. | ||
350 | This is fine, because we'll catch it later when/if vmscan tries to reclaim the | ||
351 | page. This should be relatively rare. | ||
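
The ordering described above (clear the flag first, then isolate, then
recheck) can be modelled as follows. This is a simplified sketch: the
munlock_* types and the "isolatable" and "other_locked_vmas" fields are
hypothetical test knobs standing in for isolate_lru_page() and the reverse
map walk in try_to_munlock(); no real locking is modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the state munlock_vma_page() works with. */
struct munlock_page {
	bool mlocked;		/* stands in for PG_mlocked */
	bool on_lru;		/* true while the page sits on an lru list */
	bool isolatable;	/* knob: would isolate_lru_page() succeed? */
	int other_locked_vmas;	/* knob: VM_LOCKED vmas still mapping it */
};

struct munlock_stats { long nr_mlock; };

void model_munlock_vma_page(struct munlock_stats *st, struct munlock_page *pg)
{
	assert(st && pg);
	if (!pg->mlocked)
		return;		/* another task already munlocked it */
	/* Clear up front: this might be the only chance we get. */
	pg->mlocked = false;
	st->nr_mlock--;
	if (!pg->isolatable)
		return;		/* leave it on the lru for vmscan to sort out */
	pg->on_lru = false;	/* isolated from the lru */
	if (pg->other_locked_vmas > 0) {
		/* The reverse map walk found another VM_LOCKED vma. */
		pg->mlocked = true;
		st->nr_mlock++;
	}
	pg->on_lru = true;	/* put back; placement follows the flag */
}
```

The third outcome, a page left unflagged on a normal list even though a
VM_LOCKED vma still maps it, is the case vmscan is expected to catch later.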
352 | |||
353 | Mlocked Pages: Migrating Them... | ||
354 | |||
355 | A page that is being migrated has been isolated from the lru lists and is | ||
356 | held locked across unmapping of the page, updating the page's mapping | ||
357 | [address_space] entry and copying the contents and state, until the | ||
358 | page table entry has been replaced with an entry that refers to the new | ||
359 | page. Linux supports migration of mlocked pages and other unevictable | ||
360 | pages. This involves simply moving the PageMlocked and PageUnevictable states | ||
361 | from the old page to the new page. | ||
362 | |||
363 | Note that page migration can race with mlocking or munlocking of the same | ||
364 | page. This has been discussed from the mlock/munlock perspective in the | ||
365 | respective sections above. Both processes [migration, m[un]locking], hold | ||
366 | the page locked. This provides the first level of synchronization. Page | ||
367 | migration zeros out the page_mapping of the old page before unlocking it, | ||
368 | so m[un]lock can skip these pages by testing the page mapping under page | ||
369 | lock. | ||
370 | |||
371 | When completing page migration, we place the new and old pages back onto the | ||
372 | lru after dropping the page lock. The "unneeded" page--old page on success, | ||
373 | new page on failure--will be freed when the reference count held by the | ||
374 | migration process is released. To ensure that we don't strand pages on the | ||
375 | unevictable list because of a race between munlock and migration, page | ||
376 | migration uses the putback_lru_page() function to add migrated pages back to | ||
377 | the lru. | ||
378 | |||
379 | |||
380 | Mlocked Pages: mmap(MAP_LOCKED) System Call Handling | ||
381 | |||
382 | In addition to the mlock()/mlockall() system calls, an application can request | ||
383 | that a region of memory be mlocked using the MAP_LOCKED flag with the mmap() | ||
384 | call. Furthermore, any mmap() call or brk() call that expands the heap by a | ||
385 | task that has previously called mlockall() with the MCL_FUTURE flag will result | ||
386 | in the newly mapped memory being mlocked. Before the unevictable/mlock changes, | ||
387 | the kernel simply called make_pages_present() to allocate pages and populate | ||
388 | the page table. | ||
389 | |||
390 | To mlock a range of memory under the unevictable/mlock infrastructure, the | ||
391 | mmap() handler and task address space expansion functions call | ||
392 | mlock_vma_pages_range() specifying the vma and the address range to mlock. | ||
393 | mlock_vma_pages_range() filters vmas like mlock_fixup(), as described above in | ||
394 | "Mlocked Pages: Filtering Special Vmas". It will clear the VM_LOCKED flag, which will | ||
395 | have already been set by the caller, in filtered vmas. Thus these vmas need | ||
396 | not be visited for munlock when the region is unmapped. | ||
397 | |||
398 | For "normal" vmas, mlock_vma_pages_range() calls __mlock_vma_pages_range() to | ||
399 | fault/allocate the pages and mlock them. Again, like mlock_fixup(), | ||
400 | mlock_vma_pages_range() downgrades the mmap semaphore to read mode before | ||
401 | attempting to fault/allocate and mlock the pages; and "upgrades" the semaphore | ||
402 | back to write mode before returning. | ||
403 | |||
404 | The callers of mlock_vma_pages_range() will have already added the memory | ||
405 | range to be mlocked to the task's "locked_vm". To account for filtered vmas, | ||
406 | mlock_vma_pages_range() returns the number of pages NOT mlocked. All of the | ||
407 | callers then subtract a non-negative return value from the task's locked_vm. | ||
408 | A negative return value represents an error--for example, from get_user_pages() | ||
409 | attempting to fault in a vma with PROT_NONE access. In this case, we leave | ||
410 | the memory range accounted as locked_vm, as the protections could be changed | ||
411 | later and pages allocated into that region. | ||
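
The caller-side accounting rule described above reduces to a few lines. The
function below is a hypothetical illustration of that rule only; it is not
code from the mmap() or brk() paths, and "ret" simply stands for the value
returned by mlock_vma_pages_range().

```c
#include <assert.h>

/*
 * Hypothetical helper: adjust a task's locked_vm after the range has been
 * processed. A non-negative ret is the number of pages NOT mlocked (for
 * example, pages in filtered vmas); a negative ret is an error code.
 */
long model_adjust_locked_vm(long locked_vm, long ret)
{
	assert(locked_vm >= 0);
	if (ret >= 0)
		locked_vm -= ret;	/* give back the pages that were skipped */
	/* On error, leave the range accounted: it may be populated later. */
	return locked_vm;
}
```

For example, if 16 pages were charged up front and the whole vma was
filtered, the call returns 16 and locked_vm drops back by 16; if
get_user_pages() fails, the charge stays in place.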
412 | |||
413 | |||
414 | Mlocked Pages: munmap()/exit()/exec() System Call Handling | ||
415 | |||
416 | When unmapping an mlocked region of memory, whether by an explicit call to | ||
417 | munmap() or via an internal unmap from exit() or exec() processing, we must | ||
418 | munlock the pages if we're removing the last VM_LOCKED vma that maps the pages. | ||
419 | Before the unevictable/mlock changes, mlocking did not mark the pages in any way, | ||
420 | so unmapping them required no processing. | ||
421 | |||
422 | To munlock a range of memory under the unevictable/mlock infrastructure, the | ||
423 | munmap() handler and task address space tear down functions call | ||
424 | munlock_vma_pages_all(). The name reflects the observation that one always | ||
425 | specifies the entire vma range when munlock()ing during unmap of a region. | ||
426 | Because of the vma filtering when mlocking() regions, only "normal" vmas that | ||
427 | actually contain mlocked pages will be passed to munlock_vma_pages_all(). | ||
428 | |||
429 | munlock_vma_pages_all() clears the VM_LOCKED vma flag and, like mlock_fixup() | ||
430 | for the munlock case, calls __munlock_vma_pages_range() to walk the page table | ||
431 | for the vma's memory range and munlock_vma_page() each resident page mapped by | ||
432 | the vma. This effectively munlocks the page, only if this is the last | ||
433 | VM_LOCKED vma that maps the page. | ||
434 | |||
435 | |||
436 | Mlocked Page: try_to_unmap() | ||
437 | |||
438 | [Note: the code changes represented by this section are really quite small | ||
439 | compared to the text to describe what is happening and why, and to discuss the | ||
440 | implications.] | ||
441 | |||
442 | Pages can, of course, be mapped into multiple vmas. Some of these vmas may | ||
443 | have the VM_LOCKED flag set. It is possible for a page mapped into one or more | ||
444 | VM_LOCKED vmas not to have the PG_mlocked flag set and therefore reside on one | ||
445 | of the active or inactive LRU lists. This could happen if, for example, a | ||
446 | task in the process of munlock()ing the page could not isolate the page from | ||
447 | the LRU. As a result, vmscan/shrink_page_list() might encounter such a page | ||
448 | as described in "Unevictable Pages and Vmscan [shrink_*_list()]". To | ||
449 | handle this situation, try_to_unmap() has been enhanced to check for VM_LOCKED | ||
450 | vmas while it is walking a page's reverse map. | ||
451 | |||
452 | try_to_unmap() is always called, by either vmscan for reclaim or for page | ||
453 | migration, with the argument page locked and isolated from the LRU. BUG_ON() | ||
454 | assertions enforce this requirement. Separate functions handle anonymous and | ||
455 | mapped file pages, as these types of pages have different reverse map | ||
456 | mechanisms. | ||
457 | |||
458 | try_to_unmap_anon() | ||
459 | |||
460 | To unmap anonymous pages, each vma in the list anchored in the anon_vma must be | ||
461 | visited--at least until a VM_LOCKED vma is encountered. If the page is being | ||
462 | unmapped for migration, VM_LOCKED vmas do not stop the process because mlocked | ||
463 | pages are migratable. However, for reclaim, if the page is mapped into a | ||
464 | VM_LOCKED vma, the scan stops. try_to_unmap() attempts to acquire the mmap | ||
465 | semaphore of the mm_struct to which the vma belongs in read mode. If this is | ||
466 | successful, try_to_unmap() will mlock the page via mlock_vma_page()--we | ||
467 | wouldn't have gotten to try_to_unmap() if the page were already mlocked--and | ||
468 | will return SWAP_MLOCK, indicating that the page is unevictable. If the | ||
469 | mmap semaphore cannot be acquired, we are not sure whether the page is really | ||
470 | unevictable or not. In this case, try_to_unmap() will return SWAP_AGAIN. | ||
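
The decision made at each vma during the anonymous reverse map walk can be
sketched as a user-space model. Everything here is an assumption made for
illustration: the model_* names are hypothetical, the "mmap_sem_free" field
stands in for whether a non-blocking read acquisition of the mmap semaphore
would succeed, and the real function has further reasons to return
SWAP_AGAIN that are not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mirror of the return codes discussed in the text. */
enum model_swap { MODEL_SWAP_SUCCESS, MODEL_SWAP_AGAIN, MODEL_SWAP_MLOCK };

struct anon_vma_model {
	bool vm_locked;		/* stands in for VM_LOCKED */
	bool mmap_sem_free;	/* knob: could we take mmap_sem for read? */
	bool mapped;		/* page still mapped by this vma */
};

enum model_swap model_try_to_unmap_anon(struct anon_vma_model *vmas, size_t n,
					bool migration, bool *page_mlocked)
{
	assert(vmas || n == 0);
	assert(page_mlocked);
	for (size_t i = 0; i < n; i++) {
		struct anon_vma_model *vma = &vmas[i];

		/* For reclaim, a VM_LOCKED vma stops the scan. */
		if (vma->vm_locked && !migration) {
			if (!vma->mmap_sem_free)
				return MODEL_SWAP_AGAIN;  /* cannot be sure */
			*page_mlocked = true;	/* mlock_vma_page() */
			return MODEL_SWAP_MLOCK;
		}
		vma->mapped = false;	/* unmap the page from this vma */
	}
	return MODEL_SWAP_SUCCESS;
}
```

Note that the scan for migration never stops at a VM_LOCKED vma, which
matches the statement above that mlocked pages are migratable.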
471 | |||
472 | try_to_unmap_file() -- linear mappings | ||
473 | |||
474 | Unmapping of a mapped file page works the same, except that the scan visits | ||
475 | all vmas that map the page's index/page offset in the page's mapping's | ||
476 | reverse map priority search tree. It must also visit each vma in the page's | ||
477 | mapping's non-linear list, if the list is non-empty. As for anonymous pages, | ||
478 | on encountering a VM_LOCKED vma for a mapped file page, try_to_unmap() will | ||
479 | attempt to acquire the associated mm_struct's mmap semaphore to mlock the page, | ||
480 | returning SWAP_MLOCK if this is successful, and SWAP_AGAIN, if not. | ||
481 | |||
482 | try_to_unmap_file() -- non-linear mappings | ||
483 | |||
484 | If a page's mapping contains a non-empty non-linear mapping vma list, then | ||
485 | try_to_un{map|lock}() must also visit each vma in that list to determine | ||
486 | whether the page is mapped in a VM_LOCKED vma. Again, the scan must visit | ||
487 | all vmas in the non-linear list to ensure that the page is not/should not be | ||
488 | mlocked. If a VM_LOCKED vma is found in the list, the scan could terminate. | ||
489 | However, there is no easy way to determine whether the page is actually mapped | ||
490 | in a given vma--either for unmapping or testing whether the VM_LOCKED vma | ||
491 | actually pins the page. | ||
492 | |||
493 | So, try_to_unmap_file() handles non-linear mappings by scanning a certain | ||
494 | number of pages--a "cluster"--in each non-linear vma associated with the page's | ||
495 | mapping, for each file mapped page that vmscan tries to unmap. If this happens | ||
496 | to unmap the page we're trying to unmap, try_to_unmap() will notice this on | ||
497 | return--(page_mapcount(page) == 0)--and return SWAP_SUCCESS. Otherwise, it | ||
498 | will return SWAP_AGAIN, causing vmscan to recirculate this page. We take | ||
499 | advantage of the cluster scan in try_to_unmap_cluster() as follows: | ||
500 | |||
501 | For each non-linear vma, try_to_unmap_cluster() attempts to acquire the mmap | ||
502 | semaphore of the associated mm_struct for read without blocking. If this | ||
503 | attempt is successful and the vma is VM_LOCKED, try_to_unmap_cluster() will | ||
504 | retain the mmap semaphore for the scan; otherwise it drops it here. Then, | ||
505 | for each page in the cluster, if we're holding the mmap semaphore for a locked | ||
506 | vma, try_to_unmap_cluster() calls mlock_vma_page() to mlock the page. This | ||
507 | call is a no-op if the page is already locked, but will mlock any pages in | ||
508 | the non-linear mapping that happen to be unlocked. If one of the pages so | ||
509 | mlocked is the page passed in to try_to_unmap(), try_to_unmap_cluster() will | ||
510 | return SWAP_MLOCK, rather than the default SWAP_AGAIN. This will allow vmscan | ||
511 | to cull the page, rather than recirculating it on the inactive list. Again, | ||
512 | if try_to_unmap_cluster() cannot acquire the vma's mmap sem, it returns | ||
513 | SWAP_AGAIN, indicating that the page is mapped by a VM_LOCKED vma, but | ||
514 | couldn't be mlocked. | ||
515 | |||
516 | |||
517 | Mlocked pages: try_to_munlock() Reverse Map Scan | ||
518 | |||
519 | TODO/FIXME: a better name might be page_mlocked()--analogous to the | ||
520 | page_referenced() reverse map walker--especially if we continue to call this | ||
521 | from shrink_page_list(). See related TODO/FIXME below. | ||
522 | |||
523 | When munlock_vma_page()--see "Mlocked Pages: munlock()/munlockall() System | ||
524 | Call Handling" above--tries to munlock a page, or when shrink_page_list() | ||
525 | encounters an anonymous page that is not yet in the swap cache, they need to | ||
526 | determine whether or not the page is mapped by any VM_LOCKED vma, without | ||
527 | actually attempting to unmap all ptes from the page. For this purpose, the | ||
528 | unevictable/mlock infrastructure introduced a variant of try_to_unmap() called | ||
529 | try_to_munlock(). | ||
530 | |||
531 | try_to_munlock() calls the same functions as try_to_unmap() for anonymous and | ||
532 | mapped file pages with an additional argument specifying unlock versus unmap | ||
533 | processing. Again, these functions walk the respective reverse maps looking | ||
534 | for VM_LOCKED vmas. When such a vma is found for anonymous pages and file | ||
535 | pages mapped in linear VMAs, as in the try_to_unmap() case, the functions | ||
536 | attempt to acquire the associated mmap semaphore, mlock the page via | ||
537 | mlock_vma_page() and return SWAP_MLOCK. This effectively undoes the | ||
538 | pre-clearing of the page's PG_mlocked done by munlock_vma_page() and informs | ||
539 | shrink_page_list() that the anonymous page should be culled rather than added | ||
540 | to the swap cache in preparation for a try_to_unmap() that will almost | ||
541 | certainly fail. | ||
542 | |||
543 | If try_to_unmap() is unable to acquire a VM_LOCKED vma's associated mmap | ||
544 | semaphore, it will return SWAP_AGAIN. This will allow shrink_page_list() | ||
545 | to recycle the page on the inactive list and hope that it has better luck | ||
546 | with the page next time. | ||
547 | |||
548 | For file pages mapped into non-linear vmas, the try_to_munlock() logic works | ||
549 | slightly differently. On encountering a VM_LOCKED non-linear vma that might | ||
550 | map the page, try_to_munlock() returns SWAP_AGAIN without actually mlocking | ||
551 | the page. munlock_vma_page() will just leave the page unlocked and let | ||
552 | vmscan deal with it--the usual fallback position. | ||
553 | |||
554 | Note that try_to_munlock()'s reverse map walk must visit every vma in a page's | ||
555 | reverse map to determine that a page is NOT mapped into any VM_LOCKED vma. | ||
556 | However, the scan can terminate when it encounters a VM_LOCKED vma and can | ||
557 | successfully acquire the vma's mmap semaphore for read and mlock the page. | ||
558 | Although try_to_munlock() can be called many [very many!] times when | ||
559 | munlock()ing a large region or tearing down a large address space that has been | ||
560 | mlocked via mlockall(), overall this is a fairly rare event. In addition, | ||
561 | although shrink_page_list() calls try_to_munlock() for every anonymous page that | ||
562 | it handles that is not yet in the swap cache, on average anonymous pages will | ||
563 | have very short reverse map lists. | ||
564 | |||
565 | Mlocked Page: Page Reclaim in shrink_*_list() | ||
566 | |||
567 | shrink_active_list() culls any obviously unevictable pages--i.e., | ||
568 | !page_evictable(page, NULL)--diverting these to the unevictable lru | ||
569 | list. However, shrink_active_list() only sees unevictable pages that | ||
570 | made it onto the active/inactive lru lists. Note that these pages do not | ||
571 | have PageUnevictable set--otherwise, they would be on the unevictable list and | ||
572 | shrink_active_list() would never see them. | ||
573 | |||
574 | Some examples of these unevictable pages on the LRU lists are: | ||
575 | |||
576 | 1) ramfs pages that have been placed on the lru lists when first allocated. | ||
577 | |||
578 | 2) SHM_LOCKed shared memory pages. shmctl(SHM_LOCK) does not attempt to | ||
579 | allocate or fault in the pages in the shared memory region. This happens | ||
580 | when an application accesses the page the first time after SHM_LOCKing | ||
581 | the segment. | ||
582 | |||
583 | 3) Mlocked pages that could not be isolated from the lru and moved to the | ||
584 | unevictable list in mlock_vma_page(). | ||
585 | |||
586 | 4) Pages mapped into multiple VM_LOCKED vmas, but try_to_munlock() couldn't | ||
587 | acquire the vma's mmap semaphore to test the flags and set PageMlocked. | ||
588 | munlock_vma_page() was forced to let the page back on to the normal | ||
589 | LRU list for vmscan to handle. | ||
590 | |||
591 | shrink_inactive_list() also culls any unevictable pages that it finds | ||
592 | on the inactive lists, again diverting them to the appropriate zone's unevictable | ||
593 | lru list. shrink_inactive_list() should only see SHM_LOCKed pages that became | ||
594 | SHM_LOCKed after shrink_active_list() had moved them to the inactive list, or | ||
595 | pages mapped into VM_LOCKED vmas that munlock_vma_page() couldn't isolate from | ||
596 | the lru to recheck via try_to_munlock(). shrink_inactive_list() won't notice | ||
597 | the latter, but will pass on to shrink_page_list(). | ||
598 | |||
599 | shrink_page_list() again culls obviously unevictable pages that it could | ||
600 | encounter for reasons similar to shrink_inactive_list(). As already discussed, | ||
601 | shrink_page_list() proactively looks for anonymous pages that should have | ||
602 | PG_mlocked set but don't--these would not be detected by page_evictable()--to | ||
603 | avoid adding them to the swap cache unnecessarily. File pages mapped into | ||
604 | VM_LOCKED vmas but without PG_mlocked set will make it all the way to | ||
605 | try_to_unmap(). shrink_page_list() will divert them to the unevictable list when | ||
606 | try_to_unmap() returns SWAP_MLOCK, as discussed above. | ||
607 | |||
608 | TODO/FIXME: If we can enhance the swap cache to reliably remove entries | ||
609 | with page_count(page) > 2, as long as all ptes are mapped to the page and | ||
610 | not the swap entry, we can probably remove the call to try_to_munlock() in | ||
611 | shrink_page_list() and just remove the page from the swap cache when | ||
612 | try_to_unmap() returns SWAP_MLOCK. Currently, remove_exclusive_swap_page() | ||
613 | doesn't seem to allow that. | ||
614 | |||
615 | |||