Diffstat (limited to 'Documentation/vm/balance')
-rw-r--r-- | Documentation/vm/balance | 93 |
1 files changed, 93 insertions, 0 deletions
diff --git a/Documentation/vm/balance b/Documentation/vm/balance
new file mode 100644
index 000000000000..bd3d31bc4915
--- /dev/null
+++ b/Documentation/vm/balance
@@ -0,0 +1,93 @@
1 | Started Jan 2000 by Kanoj Sarcar <kanoj@sgi.com> | ||
2 | |||
3 | Memory balancing is needed for non __GFP_WAIT as well as for non | ||
4 | __GFP_IO allocations. | ||
5 | |||
6 | There are two reasons to be requesting non __GFP_WAIT allocations: | ||
7 | the caller can not sleep (typically intr context), or does not want | ||
8 | to incur cost overheads of page stealing and possible swap io for | ||
9 | whatever reasons. | ||
10 | |||
11 | Non __GFP_IO allocation requests are made to prevent file system deadlocks. | ||
12 | |||
13 | In the absence of non sleepable allocation requests, it seems detrimental | ||
14 | to be doing balancing. Page reclamation can be kicked off lazily, that | ||
15 | is, only when needed (i.e. when zone free memory is 0), instead of making it | ||
16 | a proactive process. | ||
17 | |||
18 | That being said, the kernel should try to fulfill requests for direct | ||
19 | mapped pages from the direct mapped pool, instead of falling back on | ||
20 | the dma pool, so as to keep the dma pool filled for dma requests (atomic | ||
21 | or not). A similar argument applies to highmem and direct mapped pages. | ||
22 | OTOH, if there are a lot of free dma pages, it is preferable to satisfy | ||
23 | regular memory requests by allocating one from the dma pool, instead | ||
24 | of incurring the overhead of regular zone balancing. | ||
25 | |||
26 | In 2.2, memory balancing/page reclamation would kick off only when the | ||
27 | _total_ number of free pages fell below 1/64th of total memory. With the | ||
28 | right ratio of dma and regular memory, it is quite possible that balancing | ||
29 | would not be done even when the dma zone was completely empty. 2.2 has | ||
30 | been running production machines of varying memory sizes, and seems to be | ||
31 | doing fine even with the presence of this problem. In 2.3, due to | ||
32 | HIGHMEM, this problem is aggravated. | ||
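
The 2.2 problem described above can be illustrated with a small C sketch.
The names used here (struct zone22, global_needs_balance) are hypothetical
and are not taken from the kernel; only the global 1/64 threshold comes
from the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical zone descriptor for illustration; not the kernel's struct. */
struct zone22 {
	unsigned long size;        /* total pages in the zone */
	unsigned long free_pages;  /* pages currently free    */
};

/*
 * 2.2-style check as described in the text: balancing starts only when
 * the total number of free pages, summed over all zones, falls below
 * 1/64th of total memory. Individual zones are never examined.
 */
static bool global_needs_balance(const struct zone22 *zones, size_t nzones)
{
	unsigned long total = 0, free_total = 0;
	size_t i;

	assert(zones != NULL && nzones > 0);
	for (i = 0; i < nzones; i++) {
		total += zones[i].size;
		free_total += zones[i].free_pages;
	}
	return free_total < total / 64;
}
```

With a 4096-page dma zone that is completely empty and a 28672-page
regular zone holding 1000 free pages, the threshold is 32768 / 64 = 512
pages, so the check reports that no balancing is needed even though the
dma zone has nothing left.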
33 | |||
34 | In 2.3, zone balancing can be done in one of two ways: depending on the | ||
35 | zone size (and possibly on the size of lower class zones), we can decide | ||
36 | at init time how many free pages we should aim for while balancing any | ||
37 | zone. The good part is that, while balancing, we do not need to look at sizes | ||
38 | of lower class zones; the bad part is that we might balance too frequently | ||
39 | because we ignore possibly lower usage in the lower class zones. Also, | ||
40 | with a slight change in the allocation routine, it is possible to reduce | ||
41 | the memclass() macro to be a simple equality. | ||
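
A rough C sketch of this first approach follows. The text does not say
how the per-zone target would be computed, so the 1/64 ratio below is an
assumption borrowed from the 2.2 figure, and struct zone_target and both
function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical zone descriptor for illustration; not the kernel's struct. */
struct zone_target {
	unsigned long size;         /* total pages in the zone          */
	unsigned long free_pages;   /* pages currently free             */
	unsigned long free_target;  /* free pages to aim for, set once  */
};

/*
 * Decide the target once, at init time.
 * ASSUMPTION: size / 64 is used only because 1/64 is the ratio the text
 * quotes for 2.2; the text does not fix a formula for this scheme.
 */
static void zone_init_target(struct zone_target *z)
{
	assert(z != NULL);
	z->free_target = z->size / 64;
}

/* At balancing time only this zone is consulted, never the lower zones. */
static bool zone_below_target(const struct zone_target *z)
{
	return z->free_pages < z->free_target;
}
```

The check at balancing time touches a single zone, which is the "good
part" mentioned above; the cost is that free pages sitting in lower
class zones are never taken into account.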
42 | |||
43 | Another possible solution is that we balance only when the free memory | ||
44 | of a zone _and_ all its lower class zones falls below 1/64th of the | ||
45 | total memory in the zone and its lower class zones. This fixes the 2.2 | ||
46 | balancing problem, and stays as close to 2.2 behavior as possible. Also, | ||
47 | the balancing algorithm works the same way on the various architectures, | ||
48 | which have different numbers and types of zones. If we wanted to get | ||
49 | fancy, we could assign different weights to free pages in different | ||
50 | zones in the future. | ||
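
The second approach can be sketched in C as a cumulative check. This is
only an illustration of the rule as stated in the text, not the patch
itself: struct zone_cum and zone_needs_balance are hypothetical names,
and the zones are assumed to be stored in an array ordered from the
lowest class (dma) upwards.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical zone descriptor for illustration; not the kernel's struct. */
struct zone_cum {
	unsigned long size;        /* total pages in the zone */
	unsigned long free_pages;  /* pages currently free    */
};

/*
 * Balance zone idx when the free pages of that zone plus all lower
 * class zones fall below 1/64th of the total memory in those zones.
 * ASSUMPTION: zones[0] is the lowest class zone (dma).
 */
static bool zone_needs_balance(const struct zone_cum *zones, size_t nzones,
			       size_t idx)
{
	unsigned long total = 0, free_total = 0;
	size_t i;

	assert(zones != NULL && idx < nzones);
	for (i = 0; i <= idx; i++) {
		total += zones[i].size;
		free_total += zones[i].free_pages;
	}
	return free_total < total / 64;
}
```

For an empty 4096-page dma zone followed by a 28672-page regular zone
with 1000 free pages, the check for the dma zone compares 0 against
4096 / 64 = 64 and asks for balancing, while the check for the regular
zone compares 1000 against 32768 / 64 = 512 and does not. The empty dma
zone is therefore no longer overlooked.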
51 | |||
52 | Note that if the size of the regular zone is huge compared to dma zone, | ||
53 | it becomes less significant to consider the free dma pages while | ||
54 | deciding whether to balance the regular zone. The first solution | ||
55 | becomes more attractive then. | ||
56 | |||
57 | The appended patch implements the second solution. It also "fixes" two | ||
58 | problems: first, kswapd is woken up as in 2.2 on low memory conditions | ||
59 | for non-sleepable allocations. Second, the HIGHMEM zone is also balanced, | ||
60 | so as to give a fighting chance for replace_with_highmem() to get a | ||
61 | HIGHMEM page, as well as to ensure that HIGHMEM allocations do not | ||
62 | fall back into regular zone. This also makes sure that HIGHMEM pages | ||
63 | are not leaked (for example, in situations where a HIGHMEM page is in | ||
64 | the swapcache but is not being used by anyone). | ||
65 | |||
66 | kswapd also needs to know about the zones it should balance. kswapd is | ||
67 | primarily needed in a situation where balancing can not be done, | ||
68 | probably because all allocation requests are coming from intr context | ||
69 | and all process contexts are sleeping. For 2.3, kswapd does not really | ||
70 | need to balance the highmem zone, since intr context does not request | ||
71 | highmem pages. kswapd looks at the zone_wake_kswapd field in the zone | ||
72 | structure to decide whether a zone needs balancing. | ||
73 | |||
74 | Page stealing from process memory and shm is done if stealing the page would | ||
75 | alleviate memory pressure on any zone in the page's node that has fallen below | ||
76 | its watermark. | ||
77 | |||
78 | pages_min/pages_low/pages_high/low_on_memory/zone_wake_kswapd: These are | ||
79 | per-zone fields, used to determine when a zone needs to be balanced. When | ||
80 | the number of free pages falls below pages_min, the hysteretic field low_on_memory | ||
81 | gets set. This stays set till the number of free pages becomes pages_high. | ||
82 | When low_on_memory is set, page allocation requests will try to free some | ||
83 | pages in the zone (provided __GFP_WAIT is set in the request). Orthogonal | ||
84 | to this is the decision to poke kswapd to free some zone pages. That | ||
85 | decision is not hysteresis based, and is made when the number of free | ||
86 | pages is below pages_low, in which case zone_wake_kswapd is also set. | ||
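
The interplay of these fields can be sketched in C as follows. The field
names match the ones listed above, but struct zone_wm and
update_zone_flags are hypothetical. The text does not say when
zone_wake_kswapd is cleared, so the sketch assumes it simply mirrors the
pages_low comparison.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-zone state for illustration; not the kernel's struct. */
struct zone_wm {
	unsigned long free_pages;
	unsigned long pages_min, pages_low, pages_high;
	bool low_on_memory;     /* hysteretic: set below min, held until high */
	bool zone_wake_kswapd;  /* set while free pages are below pages_low   */
};

/* Recompute both flags after free_pages has changed. */
static void update_zone_flags(struct zone_wm *z)
{
	assert(z != NULL);
	assert(z->pages_min <= z->pages_low && z->pages_low <= z->pages_high);

	/* Hysteresis: between pages_min and pages_high the flag is unchanged. */
	if (z->free_pages < z->pages_min)
		z->low_on_memory = true;
	else if (z->free_pages >= z->pages_high)
		z->low_on_memory = false;

	/*
	 * No hysteresis for the kswapd decision.
	 * ASSUMPTION: the flag is cleared as soon as free pages reach
	 * pages_low again; the text only states when it is set.
	 */
	z->zone_wake_kswapd = z->free_pages < z->pages_low;
}
```

With pages_min = 10, pages_low = 20 and pages_high = 30, a zone that
drops to 5 free pages sets both flags. Once it recovers to 25 free
pages, zone_wake_kswapd is clear again in this sketch but low_on_memory
stays set until the count reaches 30.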
87 | |||
88 | |||
89 | (Good) Ideas that I have heard: | ||
90 | 1. Dynamic experience should influence balancing: number of failed requests | ||
91 | for a zone can be tracked and fed into the balancing scheme (jalvo@mbay.net) | ||
92 | 2. Implement a replace_with_highmem()-like replace_with_regular() to preserve | ||
93 | dma pages. (lkd@tantalophile.demon.co.uk) | ||