author Linus Torvalds <torvalds@linux-foundation.org> 2015-04-15 19:39:15 -0400
committer Linus Torvalds <torvalds@linux-foundation.org> 2015-04-15 19:39:15 -0400
commit eea3a00264cf243a28e4331566ce67b86059339d
tree 487f16389e0dfa32e9caa7604d1274a7dcda8f04 /Documentation
parent e7c82412433a8039616c7314533a0a1c025d99bf
parent e693d73c20ffdb06840c9378f367bad849ac0d5d
Merge branch 'akpm' (patches from Andrew)
Merge second patchbomb from Andrew Morton:

 - the rest of MM
 - various misc bits
 - add ability to run /sbin/reboot at reboot time
 - printk/vsprintf changes
 - fiddle with seq_printf() return value

* akpm: (114 commits)
  parisc: remove use of seq_printf return value
  lru_cache: remove use of seq_printf return value
  tracing: remove use of seq_printf return value
  cgroup: remove use of seq_printf return value
  proc: remove use of seq_printf return value
  s390: remove use of seq_printf return value
  cris fasttimer: remove use of seq_printf return value
  cris: remove use of seq_printf return value
  openrisc: remove use of seq_printf return value
  ARM: plat-pxa: remove use of seq_printf return value
  nios2: cpuinfo: remove use of seq_printf return value
  microblaze: mb: remove use of seq_printf return value
  ipc: remove use of seq_printf return value
  rtc: remove use of seq_printf return value
  power: wakeup: remove use of seq_printf return value
  x86: mtrr: if: remove use of seq_printf return value
  linux/bitmap.h: improve BITMAP_{LAST,FIRST}_WORD_MASK
  MAINTAINERS: CREDITS: remove Stefano Brivio from B43
  .mailmap: add Ricardo Ribalda
  CREDITS: add Ricardo Ribalda Delgado
  ...
Diffstat (limited to 'Documentation')
-rw-r--r-- Documentation/ABI/obsolete/sysfs-block-zram | 119
-rw-r--r-- Documentation/ABI/testing/sysfs-block-zram  |  25
-rw-r--r-- Documentation/blockdev/zram.txt             |  87
-rw-r--r-- Documentation/filesystems/Locking           |   8
-rw-r--r-- Documentation/printk-formats.txt            |  49
-rw-r--r-- Documentation/sysctl/vm.txt                 |  11
-rw-r--r-- Documentation/vm/hugetlbpage.txt            |  55
-rw-r--r-- Documentation/vm/unevictable-lru.txt        |  12
-rw-r--r-- Documentation/vm/zsmalloc.txt               |  70
9 files changed, 393 insertions, 43 deletions
diff --git a/Documentation/ABI/obsolete/sysfs-block-zram b/Documentation/ABI/obsolete/sysfs-block-zram
new file mode 100644
index 000000000000..720ea92cfb2e
--- /dev/null
+++ b/Documentation/ABI/obsolete/sysfs-block-zram
@@ -0,0 +1,119 @@
+What:		/sys/block/zram<id>/num_reads
+Date:		August 2015
+Contact:	Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
+Description:
+		The num_reads file is read-only and specifies the number of
+		reads (failed or successful) done on this device.
+		Now accessible via zram<id>/stat node.
+
+What:		/sys/block/zram<id>/num_writes
+Date:		August 2015
+Contact:	Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
+Description:
+		The num_writes file is read-only and specifies the number of
+		writes (failed or successful) done on this device.
+		Now accessible via zram<id>/stat node.
+
+What:		/sys/block/zram<id>/invalid_io
+Date:		August 2015
+Contact:	Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
+Description:
+		The invalid_io file is read-only and specifies the number of
+		non-page-size-aligned I/O requests issued to this device.
+		Now accessible via zram<id>/io_stat node.
+
+What:		/sys/block/zram<id>/failed_reads
+Date:		August 2015
+Contact:	Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
+Description:
+		The failed_reads file is read-only and specifies the number of
+		failed reads that happened on this device.
+		Now accessible via zram<id>/io_stat node.
+
+What:		/sys/block/zram<id>/failed_writes
+Date:		August 2015
+Contact:	Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
+Description:
+		The failed_writes file is read-only and specifies the number of
+		failed writes that happened on this device.
+		Now accessible via zram<id>/io_stat node.
+
+What:		/sys/block/zram<id>/notify_free
+Date:		August 2015
+Contact:	Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
+Description:
+		The notify_free file is read-only. Depending on device usage
+		scenario it may account for a) the number of pages freed because
+		of swap slot free notifications or b) the number of pages freed
+		because of REQ_DISCARD requests sent by bio. The former ones
+		are sent to a swap block device when a swap slot is freed, which
+		implies that this disk is being used as a swap disk. The latter
+		ones are sent by a filesystem mounted with the discard option,
+		whenever some data blocks are getting discarded.
+		Now accessible via zram<id>/io_stat node.
+
+What:		/sys/block/zram<id>/zero_pages
+Date:		August 2015
+Contact:	Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
+Description:
+		The zero_pages file is read-only and specifies the number of
+		zero-filled pages written to this disk. No memory is allocated
+		for such pages.
+		Now accessible via zram<id>/mm_stat node.
+
+What:		/sys/block/zram<id>/orig_data_size
+Date:		August 2015
+Contact:	Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
+Description:
+		The orig_data_size file is read-only and specifies uncompressed
+		size of data stored in this disk. This excludes zero-filled
+		pages (zero_pages) since no memory is allocated for them.
+		Unit: bytes
+		Now accessible via zram<id>/mm_stat node.
+
+What:		/sys/block/zram<id>/compr_data_size
+Date:		August 2015
+Contact:	Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
+Description:
+		The compr_data_size file is read-only and specifies compressed
+		size of data stored in this disk. So, compression ratio can be
+		calculated using orig_data_size and this statistic.
+		Unit: bytes
+		Now accessible via zram<id>/mm_stat node.
+
+What:		/sys/block/zram<id>/mem_used_total
+Date:		August 2015
+Contact:	Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
+Description:
+		The mem_used_total file is read-only and specifies the amount
+		of memory, including allocator fragmentation and metadata
+		overhead, allocated for this disk. So, allocator space
+		efficiency can be calculated using compr_data_size and this
+		statistic.
+		Unit: bytes
+		Now accessible via zram<id>/mm_stat node.
+
+What:		/sys/block/zram<id>/mem_used_max
+Date:		August 2015
+Contact:	Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
+Description:
+		The mem_used_max file is read/write and specifies the maximum
+		amount of memory zram has consumed to store compressed data.
+		For resetting the value, you should write "0". Otherwise,
+		you could see -EINVAL.
+		Unit: bytes
+		Downgraded to write-only node: so it's possible to set new
+		value only; its current value is stored in zram<id>/mm_stat
+		node.
+
+What:		/sys/block/zram<id>/mem_limit
+Date:		August 2015
+Contact:	Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
+Description:
+		The mem_limit file is read/write and specifies the maximum
+		amount of memory ZRAM can use to store the compressed data.
+		The limit could be changed in run time and "0" means disable
+		the limit. No limit is the initial state. Unit: bytes
+		Downgraded to write-only node: so it's possible to set new
+		value only; its current value is stored in zram<id>/mm_stat
+		node.
diff --git a/Documentation/ABI/testing/sysfs-block-zram b/Documentation/ABI/testing/sysfs-block-zram
index a6148eaf91e5..2e69e83bf510 100644
--- a/Documentation/ABI/testing/sysfs-block-zram
+++ b/Documentation/ABI/testing/sysfs-block-zram
@@ -141,3 +141,28 @@ Description:
 		amount of memory ZRAM can use to store the compressed data. The
 		limit could be changed in run time and "0" means disable the
 		limit. No limit is the initial state. Unit: bytes
+
+What:		/sys/block/zram<id>/compact
+Date:		August 2015
+Contact:	Minchan Kim <minchan@kernel.org>
+Description:
+		The compact file is write-only and triggers compaction for
+		the allocator zram uses. The allocator moves some objects so
+		that it could free fragmented space.
+
+What:		/sys/block/zram<id>/io_stat
+Date:		August 2015
+Contact:	Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
+Description:
+		The io_stat file is read-only and accumulates device's I/O
+		statistics not accounted by block layer. For example,
+		failed_reads, failed_writes, etc. File format is similar to
+		block layer statistics file format.
+
+What:		/sys/block/zram<id>/mm_stat
+Date:		August 2015
+Contact:	Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
+Description:
+		The mm_stat file is read-only and represents device's mm
+		statistics (orig_data_size, compr_data_size, etc.) in a format
+		similar to block layer statistics file format.
diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt
index 7fcf9c6592ec..48a183e29988 100644
--- a/Documentation/blockdev/zram.txt
+++ b/Documentation/blockdev/zram.txt
@@ -98,20 +98,79 @@ size of the disk when not in use so a huge zram is wasteful.
 	mount /dev/zram1 /tmp
 
 7) Stats:
-	Per-device statistics are exported as various nodes under
-	/sys/block/zram<id>/
-		disksize
-		num_reads
-		num_writes
-		failed_reads
-		failed_writes
-		invalid_io
-		notify_free
-		zero_pages
-		orig_data_size
-		compr_data_size
-		mem_used_total
-		mem_used_max
+Per-device statistics are exported as various nodes under /sys/block/zram<id>/
+
+A brief description of exported device attributes. For more details please
+read Documentation/ABI/testing/sysfs-block-zram.
+
+Name              access  description
+----              ------  -----------
+disksize          RW      show and set the device's disk size
+initstate         RO      shows the initialization state of the device
+reset             WO      trigger device reset
+num_reads         RO      the number of reads
+failed_reads      RO      the number of failed reads
+num_writes        RO      the number of writes
+failed_writes     RO      the number of failed writes
+invalid_io        RO      the number of non-page-size-aligned I/O requests
+max_comp_streams  RW      the number of possible concurrent compress operations
+comp_algorithm    RW      show and change the compression algorithm
+notify_free       RO      the number of notifications to free pages (either
+                          slot free notifications or REQ_DISCARD requests)
+zero_pages        RO      the number of zero filled pages written to this disk
+orig_data_size    RO      uncompressed size of data stored in this disk
+compr_data_size   RO      compressed size of data stored in this disk
+mem_used_total    RO      the amount of memory allocated for this disk
+mem_used_max      RW      the maximum amount of memory zram has consumed to
+                          store compressed data
+mem_limit         RW      the maximum amount of memory ZRAM can use to store
+                          the compressed data
+num_migrated      RO      the number of objects migrated by compaction
+
+
+WARNING
+=======
+per-stat sysfs attributes are considered to be deprecated.
+The basic strategy is:
+-- the existing RW nodes will be downgraded to WO nodes (in linux 4.11)
+-- deprecated RO sysfs nodes will eventually be removed (in linux 4.11)
+
+The list of deprecated attributes can be found here:
+Documentation/ABI/obsolete/sysfs-block-zram
+
+Basically, every attribute that has its own read accessible sysfs node
+(e.g. num_reads) *AND* is accessible via one of the stat files (zram<id>/stat
+or zram<id>/io_stat or zram<id>/mm_stat) is considered to be deprecated.
+
+User space is advised to use the following files to read the device statistics.
+
+File /sys/block/zram<id>/stat
+
+Represents block layer statistics. Read Documentation/block/stat.txt for
+details.
+
+File /sys/block/zram<id>/io_stat
+
+The io_stat file represents device's I/O statistics not accounted by block
+layer and, thus, not available in zram<id>/stat file. It consists of a
+single line of text and contains the following stats separated by
+whitespace:
+	failed_reads
+	failed_writes
+	invalid_io
+	notify_free
+
+File /sys/block/zram<id>/mm_stat
+
+The mm_stat file represents device's mm statistics. It consists of a single
+line of text and contains the following stats separated by whitespace:
+	orig_data_size
+	compr_data_size
+	mem_used_total
+	mem_limit
+	mem_used_max
+	zero_pages
+	num_migrated
 
 8) Deactivate:
 	swapoff /dev/zram0
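
For user space that moves from the deprecated per-stat nodes to mm_stat, the single-line format described in the hunk above can be parsed by position. The following is a minimal sketch in C, assuming that all seven fields are non-negative decimal integers in exactly the order listed in zram.txt. The names zram_mm_stat, parse_mm_stat and compression_ratio are hypothetical and are not part of any kernel or library interface.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical container for the seven mm_stat fields, in documented order. */
struct zram_mm_stat {
	uint64_t orig_data_size;
	uint64_t compr_data_size;
	uint64_t mem_used_total;
	uint64_t mem_limit;
	uint64_t mem_used_max;
	uint64_t zero_pages;
	uint64_t num_migrated;
};

/*
 * Parse one whitespace-separated mm_stat line (assumed field order).
 * Returns 0 on success, -1 if fewer than seven fields could be read.
 */
static int parse_mm_stat(const char *line, struct zram_mm_stat *st)
{
	int n = sscanf(line,
		       "%" SCNu64 " %" SCNu64 " %" SCNu64 " %" SCNu64
		       " %" SCNu64 " %" SCNu64 " %" SCNu64,
		       &st->orig_data_size, &st->compr_data_size,
		       &st->mem_used_total, &st->mem_limit,
		       &st->mem_used_max, &st->zero_pages,
		       &st->num_migrated);

	return n == 7 ? 0 : -1;
}

/* orig_data_size / compr_data_size; 0.0 when nothing is stored yet. */
static double compression_ratio(const struct zram_mm_stat *st)
{
	if (st->compr_data_size == 0)
		return 0.0;
	return (double)st->orig_data_size / (double)st->compr_data_size;
}
```

A caller would read the single line from /sys/block/zram<id>/mm_stat with fgets() and pass it to parse_mm_stat(). Parsing by position keeps the reader small, but it also means the sketch has to be revisited if fields are ever appended or reordered.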
diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index c3cd6279e92e..7c3f187d48bf 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -523,6 +523,7 @@ prototypes:
 	void (*close)(struct vm_area_struct*);
 	int (*fault)(struct vm_area_struct*, struct vm_fault *);
 	int (*page_mkwrite)(struct vm_area_struct *, struct vm_fault *);
+	int (*pfn_mkwrite)(struct vm_area_struct *, struct vm_fault *);
 	int (*access)(struct vm_area_struct *, unsigned long, void*, int, int);
 
 locking rules:
@@ -532,6 +533,7 @@ close: yes
 fault:		yes		can return with page locked
 map_pages:	yes
 page_mkwrite:	yes		can return with page locked
+pfn_mkwrite:	yes
 access:		yes
 
 	->fault() is called when a previously not present pte is about
@@ -558,6 +560,12 @@ the page has been truncated, the filesystem should not look up a new page
 like the ->fault() handler, but simply return with VM_FAULT_NOPAGE, which
 will cause the VM to retry the fault.
 
+	->pfn_mkwrite() is the same as page_mkwrite but when the pte is
+VM_PFNMAP or VM_MIXEDMAP with a page-less entry. Expected return is
+VM_FAULT_NOPAGE or one of the VM_FAULT_ERROR types. The default behavior
+after this call is to make the pte read-write, unless pfn_mkwrite returns
+an error.
+
 	->access() is called when get_user_pages() fails in
 access_process_vm(), typically used to debug a process through
 /proc/pid/mem or ptrace. This function is needed only for
diff --git a/Documentation/printk-formats.txt b/Documentation/printk-formats.txt
index 5a615c14f75d..cb6a596072bb 100644
--- a/Documentation/printk-formats.txt
+++ b/Documentation/printk-formats.txt
@@ -8,6 +8,21 @@ If variable is of Type, use printk format specifier:
 		unsigned long long	%llu or %llx
 		size_t			%zu or %zx
 		ssize_t			%zd or %zx
+		s32			%d or %x
+		u32			%u or %x
+		s64			%lld or %llx
+		u64			%llu or %llx
+
+If <type> is dependent on a config option for its size (e.g., sector_t,
+blkcnt_t) or is architecture-dependent for its size (e.g., tcflag_t), use a
+format specifier of its largest possible type and explicitly cast to it.
+Example:
+
+	printk("test: sector number/total blocks: %llu/%llu\n",
+		(unsigned long long)sector, (unsigned long long)blockcount);
+
+Reminder: sizeof() result is of type size_t.
+
 
 Raw pointer value SHOULD be printed with %p. The kernel supports
 the following extended format specifiers for pointer types:
@@ -54,6 +69,7 @@ Struct Resources:
 
 	For printing struct resources. The 'R' and 'r' specifiers result in a
 	printed resource with ('R') or without ('r') a decoded flags member.
+	Passed by reference.
 
 Physical addresses types phys_addr_t:
 
@@ -132,6 +148,8 @@ MAC/FDDI addresses:
 	specifier to use reversed byte order suitable for visual interpretation
 	of Bluetooth addresses which are in the little endian order.
 
+	Passed by reference.
+
 IPv4 addresses:
 
 	%pI4	1.2.3.4
@@ -146,6 +164,8 @@ IPv4 addresses:
 	host, network, big or little endian order addresses respectively. Where
 	no specifier is provided the default network/big endian order is used.
 
+	Passed by reference.
+
 IPv6 addresses:
 
 	%pI6	0001:0002:0003:0004:0005:0006:0007:0008
@@ -160,6 +180,8 @@ IPv6 addresses:
 	print a compressed IPv6 address as described by
 	http://tools.ietf.org/html/rfc5952
 
+	Passed by reference.
+
 IPv4/IPv6 addresses (generic, with port, flowinfo, scope):
 
 	%pIS	1.2.3.4		or 0001:0002:0003:0004:0005:0006:0007:0008
@@ -186,6 +208,8 @@ IPv4/IPv6 addresses (generic, with port, flowinfo, scope):
 	specifiers can be used as well and are ignored in case of an IPv6
 	address.
 
+	Passed by reference.
+
 	Further examples:
 
 	%pISfc		1.2.3.4		or [1:2:3:4:5:6:7:8]/123456789
@@ -207,6 +231,8 @@ UUID/GUID addresses:
 	Where no additional specifiers are used the default little endian
 	order with lower case hex characters will be printed.
 
+	Passed by reference.
+
 dentry names:
 	%pd{,2,3,4}
 	%pD{,2,3,4}
@@ -216,6 +242,8 @@ dentry names:
 	equivalent of %s dentry->d_name.name we used to use, %pd<n> prints
 	n last components. %pD does the same thing for struct file.
 
+	Passed by reference.
+
 struct va_format:
 
 	%pV
@@ -231,23 +259,20 @@ struct va_format:
 	Do not use this feature without some mechanism to verify the
 	correctness of the format string and va_list arguments.
 
-u64 SHOULD be printed with %llu/%llx:
-
-	printk("%llu", u64_var);
+	Passed by reference.
 
-s64 SHOULD be printed with %lld/%llx:
+struct clk:
 
-	printk("%lld", s64_var);
+	%pC	pll1
+	%pCn	pll1
+	%pCr	1560000000
 
-If <type> is dependent on a config option for its size (e.g., sector_t,
-blkcnt_t) or is architecture-dependent for its size (e.g., tcflag_t), use a
-format specifier of its largest possible type and explicitly cast to it.
-Example:
+	For printing struct clk structures. '%pC' and '%pCn' print the name
+	(Common Clock Framework) or address (legacy clock framework) of the
+	structure; '%pCr' prints the current clock rate.
 
-	printk("test: sector number/total blocks: %llu/%llu\n",
-		(unsigned long long)sector, (unsigned long long)blockcount);
+	Passed by reference.
 
-Reminder: sizeof() result is of type size_t.
 
 Thank you for your cooperation and attention.
 
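
The "cast to the largest possible type" rule from printk-formats.txt can be tried outside the kernel, since snprintf() follows the same length-modifier conventions for %llu. The sketch below is a user-space illustration only: the typedefs example_sector_t and example_blkcnt_t and the helper format_sector_info are hypothetical stand-ins for types whose width depends on configuration, and snprintf() stands in for printk().

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical stand-ins: in the kernel, the width of sector_t and blkcnt_t
 * depends on the configuration. The widths here are fixed for illustration.
 */
typedef uint32_t example_sector_t;
typedef uint64_t example_blkcnt_t;

/*
 * Format both values with %llu after an explicit cast to unsigned long long,
 * so the format string stays correct whichever width the typedefs have.
 * Returns the snprintf() result.
 */
static int format_sector_info(char *buf, size_t len,
			      example_sector_t sector,
			      example_blkcnt_t blockcount)
{
	return snprintf(buf, len,
			"test: sector number/total blocks: %llu/%llu\n",
			(unsigned long long)sector,
			(unsigned long long)blockcount);
}
```

Changing either typedef to a different unsigned width leaves the format string untouched, which is the point of the rule: the cast, not the declaration, decides what the specifier sees.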
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 902b4574acfb..9832ec52f859 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -21,6 +21,7 @@ Currently, these files are in /proc/sys/vm:
 - admin_reserve_kbytes
 - block_dump
 - compact_memory
+- compact_unevictable_allowed
 - dirty_background_bytes
 - dirty_background_ratio
 - dirty_bytes
@@ -106,6 +107,16 @@ huge pages although processes will also directly compact memory as required.
 
 ==============================================================
 
+compact_unevictable_allowed
+
+Available only when CONFIG_COMPACTION is set. When set to 1, compaction is
+allowed to examine the unevictable lru (mlocked pages) for pages to compact.
+This should be used on systems where stalls for minor page faults are an
+acceptable trade for large contiguous free memory. Set to 0 to prevent
+compaction from moving pages that are unevictable. Default value is 1.
+
+==============================================================
+
 dirty_background_bytes
 
 Contains the amount of dirty memory at which the background kernel
diff --git a/Documentation/vm/hugetlbpage.txt b/Documentation/vm/hugetlbpage.txt
index f2d3a100fe38..030977fb8d2d 100644
--- a/Documentation/vm/hugetlbpage.txt
+++ b/Documentation/vm/hugetlbpage.txt
@@ -267,21 +267,34 @@ call, then it is required that system administrator mount a file system of
 type hugetlbfs:
 
   mount -t hugetlbfs \
-	-o uid=<value>,gid=<value>,mode=<value>,size=<value>,nr_inodes=<value> \
-	none /mnt/huge
+	-o uid=<value>,gid=<value>,mode=<value>,pagesize=<value>,size=<value>,\
+	min_size=<value>,nr_inodes=<value> none /mnt/huge
 
 This command mounts a (pseudo) filesystem of type hugetlbfs on the directory
 /mnt/huge. Any files created on /mnt/huge uses huge pages. The uid and gid
 options sets the owner and group of the root of the file system. By default
 the uid and gid of the current process are taken. The mode option sets the
 mode of root of file system to value & 01777. This value is given in octal.
-By default the value 0755 is picked. The size option sets the maximum value of
-memory (huge pages) allowed for that filesystem (/mnt/huge). The size is
-rounded down to HPAGE_SIZE. The option nr_inodes sets the maximum number of
-inodes that /mnt/huge can use. If the size or nr_inodes option is not
-provided on command line then no limits are set. For size and nr_inodes
-options, you can use [G|g]/[M|m]/[K|k] to represent giga/mega/kilo. For
-example, size=2K has the same meaning as size=2048.
+By default the value 0755 is picked. If the platform supports multiple huge
+page sizes, the pagesize option can be used to specify the huge page size and
+associated pool. pagesize is specified in bytes. If pagesize is not specified
+the platform's default huge page size and associated pool will be used. The
+size option sets the maximum value of memory (huge pages) allowed for that
+filesystem (/mnt/huge). The size option can be specified in bytes, or as a
+percentage of the specified huge page pool (nr_hugepages). The size is
+rounded down to HPAGE_SIZE boundary. The min_size option sets the minimum
+value of memory (huge pages) allowed for the filesystem. min_size can be
+specified in the same way as size, either bytes or a percentage of the
+huge page pool. At mount time, the number of huge pages specified by
+min_size are reserved for use by the filesystem. If there are not enough
+free huge pages available, the mount will fail. As huge pages are allocated
+to the filesystem and freed, the reserve count is adjusted so that the sum
+of allocated and reserved huge pages is always at least min_size. The option
+nr_inodes sets the maximum number of inodes that /mnt/huge can use. If the
+size, min_size or nr_inodes option is not provided on command line then
+no limits are set. For pagesize, size, min_size and nr_inodes options, you
+can use [G|g]/[M|m]/[K|k] to represent giga/mega/kilo. For example, size=2K
+has the same meaning as size=2048.
 
 While read system calls are supported on files that reside on hugetlb
 file systems, write system calls are not.
@@ -289,15 +302,23 @@ file systems, write system calls are not.
 Regular chown, chgrp, and chmod commands (with right permissions) could be
 used to change the file attributes on hugetlbfs.
 
-Also, it is important to note that no such mount command is required if the
+Also, it is important to note that no such mount command is required if
 applications are going to use only shmat/shmget system calls or mmap with
-MAP_HUGETLB. Users who wish to use hugetlb page via shared memory segment
-should be a member of a supplementary group and system admin needs to
-configure that gid into /proc/sys/vm/hugetlb_shm_group. It is possible for
-same or different applications to use any combination of mmaps and shm*
-calls, though the mount of filesystem will be required for using mmap calls
-without MAP_HUGETLB. For an example of how to use mmap with MAP_HUGETLB see
-map_hugetlb.c.
+MAP_HUGETLB. For an example of how to use mmap with MAP_HUGETLB see map_hugetlb
+below.
+
+Users who wish to use hugetlb memory via shared memory segment should be a
+member of a supplementary group and system admin needs to configure that gid
+into /proc/sys/vm/hugetlb_shm_group. It is possible for same or different
+applications to use any combination of mmaps and shm* calls, though the mount of
+filesystem will be required for using mmap calls without MAP_HUGETLB.
+
+Syscalls that operate on memory backed by hugetlb pages only have their lengths
+aligned to the native page size of the processor; they will normally fail with
+errno set to EINVAL or exclude hugetlb pages that extend beyond the length if
+not hugepage aligned. For example, munmap(2) will fail if memory is backed by
+a hugetlb page and the length is smaller than the hugepage size.
+
 
 Examples
 ========
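
The suffix and rounding rules in the hugetlbpage.txt hunk (size=2K means 2048, and size is rounded down to an HPAGE_SIZE boundary) can be illustrated with a small user-space sketch. This is not the kernel's mount option parser: parse_size and round_down_hpage are hypothetical helpers, the percentage form of size and min_size is deliberately not handled, and the huge page size is passed in by the caller and assumed to be a power of two.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse a byte count with an optional K/M/G suffix (either case), e.g. "2K".
 * Returns 0 and stores the value in *out, or -1 on malformed input.
 * Overflow caused by the suffix shift is not checked in this sketch.
 */
static int parse_size(const char *s, uint64_t *out)
{
	char *end;
	unsigned long long v;

	/* Reject signs and leading whitespace, which strtoull() would accept. */
	if (*s < '0' || *s > '9')
		return -1;

	errno = 0;
	v = strtoull(s, &end, 10);
	if (errno != 0)
		return -1;

	switch (*end) {
	case 'G': case 'g':
		v <<= 30; end++; break;
	case 'M': case 'm':
		v <<= 20; end++; break;
	case 'K': case 'k':
		v <<= 10; end++; break;
	default:
		break;
	}
	if (*end != '\0')
		return -1;

	*out = v;
	return 0;
}

/* Round a byte count down to a multiple of hpage_size (a power of two). */
static uint64_t round_down_hpage(uint64_t bytes, uint64_t hpage_size)
{
	return bytes & ~(hpage_size - 1);
}
```

With a 2 MiB huge page size, a requested size of 5M would round down to 4 MiB, so the effective limit can be smaller than the value given on the mount command line.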
diff --git a/Documentation/vm/unevictable-lru.txt b/Documentation/vm/unevictable-lru.txt
index 86cb4624fc5a..3be0bfc4738d 100644
--- a/Documentation/vm/unevictable-lru.txt
+++ b/Documentation/vm/unevictable-lru.txt
@@ -22,6 +22,7 @@ CONTENTS
      - Filtering special vmas.
      - munlock()/munlockall() system call handling.
      - Migrating mlocked pages.
+     - Compacting mlocked pages.
      - mmap(MAP_LOCKED) system call handling.
      - munmap()/exit()/exec() system call handling.
      - try_to_unmap().
@@ -450,6 +451,17 @@ list because of a race between munlock and migration, page migration uses the
 putback_lru_page() function to add migrated pages back to the LRU.
 
 
+COMPACTING MLOCKED PAGES
+------------------------
+
+The unevictable LRU can be scanned for compactable regions and the default
+behavior is to do so. /proc/sys/vm/compact_unevictable_allowed controls
+this behavior (see Documentation/sysctl/vm.txt). Once scanning of the
+unevictable LRU is enabled, the work of compaction is mostly handled by
+the page migration code and the same work flow as described in MIGRATING
+MLOCKED PAGES will apply.
+
+
 mmap(MAP_LOCKED) SYSTEM CALL HANDLING
 -------------------------------------
 
diff --git a/Documentation/vm/zsmalloc.txt b/Documentation/vm/zsmalloc.txt
new file mode 100644
index 000000000000..64ed63c4f69d
--- /dev/null
+++ b/Documentation/vm/zsmalloc.txt
@@ -0,0 +1,70 @@
+zsmalloc
+--------
+
+This allocator is designed for use with zram. Thus, the allocator is
+supposed to work well under low memory conditions. In particular, it
+never attempts higher order page allocation which is very likely to
+fail under memory pressure. On the other hand, if we just use single
+(0-order) pages, it would suffer from very high fragmentation --
+any object of size PAGE_SIZE/2 or larger would occupy an entire page.
+This was one of the major issues with its predecessor (xvmalloc).
+
+To overcome these issues, zsmalloc allocates a bunch of 0-order pages
+and links them together using various 'struct page' fields. These linked
+pages act as a single higher-order page i.e. an object can span 0-order
+page boundaries. The code refers to these linked pages as a single entity
+called zspage.
+
+For simplicity, zsmalloc can only allocate objects of size up to PAGE_SIZE
+since this satisfies the requirements of all its current users (in the
+worst case, page is incompressible and is thus stored "as-is" i.e. in
+uncompressed form). For allocation requests larger than this size, failure
+is returned (see zs_malloc).
+
+Additionally, zs_malloc() does not return a dereferenceable pointer.
+Instead, it returns an opaque handle (unsigned long) which encodes actual
+location of the allocated object. The reason for this indirection is that
+zsmalloc does not keep zspages permanently mapped since that would cause
+issues on 32-bit systems where the VA region for kernel space mappings
+is very small. So, before using the allocated memory, the object has to
+be mapped using zs_map_object() to get a usable pointer and subsequently
+unmapped using zs_unmap_object().
+
+stat
+----
+
+With CONFIG_ZSMALLOC_STAT, we can see zsmalloc internal information via
+/sys/kernel/debug/zsmalloc/<user name>. Here is a sample of stat output:
+
+# cat /sys/kernel/debug/zsmalloc/zram0/classes
+
+ class  size almost_full almost_empty obj_allocated   obj_used pages_used pages_per_zspage
+    ..
+    ..
+     9   176           0            1           186        129          8                4
+    10   192           1            0          2880       2872        135                3
+    11   208           0            1           819        795         42                2
+    12   224           0            1           219        159         12                4
+    ..
+    ..
+
+
+class: index
+size: object size zspage stores
+almost_empty: the number of ZS_ALMOST_EMPTY zspages (see below)
+almost_full: the number of ZS_ALMOST_FULL zspages (see below)
+obj_allocated: the number of objects allocated
+obj_used: the number of objects allocated to the user
+pages_used: the number of pages allocated for the class
+pages_per_zspage: the number of 0-order pages to make a zspage
+
+We assign a zspage to ZS_ALMOST_EMPTY fullness group when:
+      n <= N / f, where
+n = number of allocated objects
+N = total number of objects zspage can store
+f = fullness_threshold_frac (i.e., 4 at the moment)
+
+Similarly, we assign zspage to:
+      ZS_ALMOST_FULL  when n > N / f
+      ZS_EMPTY        when n == 0
+      ZS_FULL         when n == N
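
The fullness rules at the end of zsmalloc.txt can be written down as a small classification function. The sketch below is a stand-alone illustration and not the allocator's code: the enum is a local copy of the group names from the text, classify_zspage is a hypothetical helper, N / f is assumed to be integer division, and the n == 0 and n == N cases are assumed to take precedence because n == 0 would otherwise also satisfy n <= N / f.

```c
/* Local, illustrative copies of the group names used in the text. */
enum fullness_group {
	ZS_EMPTY,
	ZS_ALMOST_EMPTY,
	ZS_ALMOST_FULL,
	ZS_FULL,
};

/*
 * n: number of allocated objects in the zspage
 * N: total number of objects the zspage can store
 * f: fullness threshold fraction (4 in the text)
 *
 * Assumption: the empty and full cases are checked before the threshold.
 */
static enum fullness_group classify_zspage(unsigned int n, unsigned int N,
					   unsigned int f)
{
	if (n == 0)
		return ZS_EMPTY;
	if (n == N)
		return ZS_FULL;
	if (n <= N / f)
		return ZS_ALMOST_EMPTY;
	return ZS_ALMOST_FULL;
}
```

For a zspage that holds N = 16 objects with f = 4, the threshold is 4: a zspage with 4 objects in use lands in ZS_ALMOST_EMPTY and one with 5 lands in ZS_ALMOST_FULL.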