6 files changed, 854 insertions, 19 deletions
diff --git a/Documentation/trace/events-kmem.txt b/Documentation/trace/events-kmem.txt
new file mode 100644
index 000000000000..6ef2a8652e17
--- /dev/null
+++ b/Documentation/trace/events-kmem.txt
@@ -0,0 +1,107 @@
+                        Subsystem Trace Points: kmem
+The tracing system kmem captures events related to object and page allocation
+within the kernel. Broadly speaking there are four major subheadings.
+  o Slab allocation of small objects of unknown type (kmalloc)
+  o Slab allocation of small objects of known type
+  o Page allocation
+  o Per-CPU Allocator Activity
+  o External Fragmentation
+This document will describe what each of the tracepoints are and why they
+might be useful.
+1. Slab allocation of small objects of unknown type
+===================================================
+kmalloc         call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s
+kmalloc_node    call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s node=%d
+kfree           call_site=%lx ptr=%p
+Heavy activity for these events may indicate that a specific cache is
+justified, particularly if kmalloc slab pages are getting significantly
+internal fragmented as a result of the allocation pattern. By correlating
+kmalloc with kfree, it may be possible to identify memory leaks and where
+the allocation sites were.
+2. Slab allocation of small objects of known type
+=================================================
+kmem_cache_alloc        call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s
+kmem_cache_alloc_node   call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s node=%d
+kmem_cache_free         call_site=%lx ptr=%p
+These events are similar in usage to the kmalloc-related events except that
+it is likely easier to pin the event down to a specific cache. At the time
+of writing, no information is available on what slab is being allocated from,
+but the call_site can usually be used to extrapolate that information
+3. Page allocation
+==================
+mm_page_alloc             page=%p pfn=%lu order=%d migratetype=%d gfp_flags=%s
+mm_page_alloc_zone_locked page=%p pfn=%lu order=%u migratetype=%d cpu=%d percpu_refill=%d
+mm_page_free_direct       page=%p pfn=%lu order=%d
+mm_pagevec_free           page=%p pfn=%lu order=%d cold=%d
+These four events deal with page allocation and freeing. mm_page_alloc is
+a simple indicator of page allocator activity. Pages may be allocated from
+the per-CPU allocator (high performance) or the buddy allocator.
+If pages are allocated directly from the buddy allocator, the
+mm_page_alloc_zone_locked event is triggered. This event is important as high
+amounts of activity imply high activity on the zone->lock. Taking this lock
+impairs performance by disabling interrupts, dirtying cache lines between
+CPUs and serialising many CPUs.
+When a page is freed directly by the caller, the mm_page_free_direct event
+is triggered. Significant amounts of activity here could indicate that the
+callers should be batching their activities.
+When pages are freed using a pagevec, the mm_pagevec_free is
+triggered. Broadly speaking, pages are taken off the LRU lock in bulk and
+freed in batch with a pagevec. Significant amounts of activity here could
+indicate that the system is under memory pressure and can also indicate
+contention on the zone->lru_lock.
+4. Per-CPU Allocator Activity
+=============================
+mm_page_alloc_zone_locked       page=%p pfn=%lu order=%u migratetype=%d cpu=%d percpu_refill=%d
+mm_page_pcpu_drain              page=%p pfn=%lu order=%d cpu=%d migratetype=%d
+In front of the page allocator is a per-cpu page allocator. It exists only
+for order-0 pages, reduces contention on the zone->lock and reduces the
+amount of writing on struct page.
+When a per-CPU list is empty or pages of the wrong type are allocated,
+the zone->lock will be taken once and the per-CPU list refilled. The event
+triggered is mm_page_alloc_zone_locked for each page allocated with the
+event indicating whether it is for a percpu_refill or not.
+When the per-CPU list is too full, a number of pages are freed, each one
+which triggers a mm_page_pcpu_drain event.
+The individual nature of the events are so that pages can be tracked
+between allocation and freeing. A number of drain or refill pages that occur
+consecutively imply the zone->lock being taken once. Large amounts of PCP
+refills and drains could imply an imbalance between CPUs where too much work
+is being concentrated in one place. It could also indicate that the per-CPU
+lists should be a larger size. Finally, large amounts of refills on one CPU
+and drains on another could be a factor in causing large amounts of cache
+line bounces due to writes between CPUs and worth investigating if pages
+can be allocated and freed on the same CPU through some algorithm change.
+5. External Fragmentation
+=========================
+mm_page_alloc_extfrag           page=%p pfn=%lu alloc_order=%d fallback_order=%d pageblock_order=%d alloc_migratetype=%d fallback_migratetype=%d fragmenting=%d change_ownership=%d
+External fragmentation affects whether a high-order allocation will be
+successful or not. For some types of hardware, this is important although
+it is avoided where possible. If the system is using huge pages and needs
+to be able to resize the pool over the lifetime of the system, this value
+is important.
+Large numbers of this event implies that memory is fragmenting and
+high-order allocations will start failing at some time in the future. One
+means of reducing the occurange of this event is to increase the size of
+min_free_kbytes in increments of 3*pageblock_size*nr_online_nodes where
+pageblock_size is usually the size of the default hugepage size.
diff --git a/Documentation/trace/events.txt b/Documentation/trace/events.txt
index 78c45a87be57..02ac6ed38b2d 100644
--- a/Documentation/trace/events.txt
+++ b/Documentation/trace/events.txt
@@ -72,7 +72,7 @@ To enable all events in sched subsystem:
        # echo 1 > /sys/kernel/debug/tracing/events/sched/enable
-To eanble all events:
+To enable all events:
        # echo 1 > /sys/kernel/debug/tracing/events/enable
diff --git a/Documentation/trace/ftrace.txt b/Documentation/trace/ftrace.txt
index 1b6292bbdd6d..957b22fde2df 100644
--- a/Documentation/trace/ftrace.txt
+++ b/Documentation/trace/ftrace.txt
@@ -133,7 +133,7 @@ of ftrace. Here is a list of some of the key files:
        than requested, the rest of the page will be used,
        making the actual allocation bigger than requested.
        ( Note, the size may not be a multiple of the page size
-          due to buffer managment overhead. )
+          due to buffer management overhead. )
        This can only be updated when the current_tracer
        is set to "nop".
diff --git a/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl b/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl
new file mode 100644
index 000000000000..7df50e8cf4d9
--- /dev/null
+++ b/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl
@@ -0,0 +1,418 @@
+#!/usr/bin/perl
+# This is a POC (proof of concept or piece of crap, take your pick) for reading the
+# text representation of trace output related to page allocation. It makes an attempt
+# to extract some high-level information on what is going on. The accuracy of the parser
+# may vary considerably
+#
+# Example usage: trace-pagealloc-postprocess.pl < /sys/kernel/debug/tracing/trace_pipe
+# other options
+#   --prepend-parent    Report on the parent proc and PID
+#   --read-procstat     If the trace lacks process info, get it from /proc
+#   --ignore-pid        Aggregate processes of the same name together
+#
+# Copyright (c) IBM Corporation 2009
+# Author: Mel Gorman <mel@csn.ul.ie>
+use strict;
+use Getopt::Long;
+# Tracepoint events
+use constant MM_PAGE_ALLOC              => 1;
+use constant MM_PAGE_FREE_DIRECT        => 2;
+use constant MM_PAGEVEC_FREE            => 3;
+use constant MM_PAGE_PCPU_DRAIN         => 4;
+use constant MM_PAGE_ALLOC_ZONE_LOCKED  => 5;
+use constant MM_PAGE_ALLOC_EXTFRAG      => 6;
+use constant EVENT_UNKNOWN              => 7;
+# Constants used to track state
+use constant STATE_PCPU_PAGES_DRAINED   => 8;
+use constant STATE_PCPU_PAGES_REFILLED  => 9;
+# High-level events extrapolated from tracepoints
+use constant HIGH_PCPU_DRAINS           => 10;
+use constant HIGH_PCPU_REFILLS          => 11;
+use constant HIGH_EXT_FRAGMENT          => 12;
+use constant HIGH_EXT_FRAGMENT_SEVERE   => 13;
+use constant HIGH_EXT_FRAGMENT_MODERATE => 14;
+use constant HIGH_EXT_FRAGMENT_CHANGED  => 15;
+my %perprocesspid;
+my %perprocess;
+my $opt_ignorepid;
+my $opt_read_procstat;
+my $opt_prepend_parent;
+# Catch sigint and exit on request
+my $sigint_report = 0;
+my $sigint_exit = 0;
+my $sigint_pending = 0;
+my $sigint_received = 0;
+sub sigint_handler {
+        my $current_time = time;
+        if ($current_time - 2 > $sigint_received) {
+                print "SIGINT received, report pending. Hit ctrl-c again to exit\n";
+                $sigint_report = 1;
+        } else {
+                if (!$sigint_exit) {
+                        print "Second SIGINT received quickly, exiting\n";
+                }
+                $sigint_exit++;
+        }
+        if ($sigint_exit > 3) {
+                print "Many SIGINTs received, exiting now without report\n";
+                exit;
+        }
+        $sigint_received = $current_time;
+        $sigint_pending = 1;
+}
+$SIG{INT} = "sigint_handler";
+# Parse command line options
+GetOptions(
+        'ignore-pid'     =>     \$opt_ignorepid,
+        'read-procstat'  =>     \$opt_read_procstat,
+        'prepend-parent' =>     \$opt_prepend_parent,
+);
+# Defaults for dynamically discovered regex's
+my $regex_fragdetails_default = 'page=([0-9a-f]*) pfn=([0-9]*) alloc_order=([-0-9]*) fallback_order=([-0-9]*) pageblock_order=([-0-9]*) alloc_migratetype=([-0-9]*) fallback_migratetype=([-0-9]*) fragmenting=([-0-9]) change_ownership=([-0-9])';
+# Dyanically discovered regex
+my $regex_fragdetails;
+# Static regex used. Specified like this for readability and for use with /o
+#                      (process_pid)     (cpus      )   ( time  )   (tpoint    ) (details)
+my $regex_traceevent = '\s*([a-zA-Z0-9-]*)\s*(\[[0-9]*\])\s*([0-9.]*):\s*([a-zA-Z_]*):\s*(.*)';
+my $regex_statname = '[-0-9]*\s\((.*)\).*';
+my $regex_statppid = '[-0-9]*\s\(.*\)\s[A-Za-z]\s([0-9]*).*';
+sub generate_traceevent_regex {
+        my $event = shift;
+        my $default = shift;
+        my $regex;
+        # Read the event format or use the default
+        if (!open (FORMAT, "/sys/kernel/debug/tracing/events/$event/format")) {
+                $regex = $default;
+        } else {
+                my $line;
+                while (!eof(FORMAT)) {
+                        $line = <FORMAT>;
+                        if ($line =~ /^print fmt:\s"(.*)",.*/) {
+                                $regex = $1;
+                                $regex =~ s/%p/\([0-9a-f]*\)/g;
+                                $regex =~ s/%d/\([-0-9]*\)/g;
+                                $regex =~ s/%lu/\([0-9]*\)/g;
+                        }
+                }
+        }
+        # Verify fields are in the right order
+        my $tuple;
+        foreach $tuple (split /\s/, $regex) {
+                my ($key, $value) = split(/=/, $tuple);
+                my $expected = shift;
+                if ($key ne $expected) {
+                        print("WARNING: Format not as expected '$key' != '$expected'");
+                        $regex =~ s/$key=\((.*)\)/$key=$1/;
+                }
+        }
+        if (defined shift) {
+                die("Fewer fields than expected in format");
+        }
+        return $regex;
+}
+$regex_fragdetails = generate_traceevent_regex("kmem/mm_page_alloc_extfrag",
+                        $regex_fragdetails_default,
+                        "page", "pfn",
+                        "alloc_order", "fallback_order", "pageblock_order",
+                        "alloc_migratetype", "fallback_migratetype",
+                        "fragmenting", "change_ownership");
+sub read_statline($) {
+        my $pid = $_[0];
+        my $statline;
+        if (open(STAT, "/proc/$pid/stat")) {
+                $statline = <STAT>;
+                close(STAT);
+        }
+        if ($statline eq '') {
+                $statline = "-1 (UNKNOWN_PROCESS_NAME) R 0";
+        }
+        return $statline;
+}
+sub guess_process_pid($$) {
+        my $pid = $_[0];
+        my $statline = $_[1];
+        if ($pid == 0) {
+                return "swapper-0";
+        }
+        if ($statline !~ /$regex_statname/o) {
+                die("Failed to math stat line for process name :: $statline");
+        }
+        return "$1-$pid";
+}
+sub parent_info($$) {
+        my $pid = $_[0];
+        my $statline = $_[1];
+        my $ppid;
+        if ($pid == 0) {
+                return "NOPARENT-0";
+        }
+        if ($statline !~ /$regex_statppid/o) {
+                die("Failed to match stat line process ppid:: $statline");
+        }
+        # Read the ppid stat line
+        $ppid = $1;
+        return guess_process_pid($ppid, read_statline($ppid));
+}
+sub process_events {
+        my $traceevent;
+        my $process_pid;
+        my $cpus;
+        my $timestamp;
+        my $tracepoint;
+        my $details;
+        my $statline;
+        # Read each line of the event log
+EVENT_PROCESS:
+        while ($traceevent = <STDIN>) {
+                if ($traceevent =~ /$regex_traceevent/o) {
+                        $process_pid = $1;
+                        $tracepoint = $4;
+                        if ($opt_read_procstat || $opt_prepend_parent) {
+                                $process_pid =~ /(.*)-([0-9]*)$/;
+                                my $process = $1;
+                                my $pid = $2;
+                                $statline = read_statline($pid);
+                                if ($opt_read_procstat && $process eq '') {
+                                        $process_pid = guess_process_pid($pid, $statline);
+                                }
+                                if ($opt_prepend_parent) {
+                                        $process_pid = parent_info($pid, $statline) . " :: $process_pid";
+                                }
+                        }
+                        # Unnecessary in this script. Uncomment if required
+                        # $cpus = $2;
+                        # $timestamp = $3;
+                } else {
+                        next;
+                }
+                # Perl Switch() sucks majorly
+                if ($tracepoint eq "mm_page_alloc") {
+                        $perprocesspid{$process_pid}->{MM_PAGE_ALLOC}++;
+                } elsif ($tracepoint eq "mm_page_free_direct") {
+                        $perprocesspid{$process_pid}->{MM_PAGE_FREE_DIRECT}++;
+                } elsif ($tracepoint eq "mm_pagevec_free") {
+                        $perprocesspid{$process_pid}->{MM_PAGEVEC_FREE}++;
+                } elsif ($tracepoint eq "mm_page_pcpu_drain") {
+                        $perprocesspid{$process_pid}->{MM_PAGE_PCPU_DRAIN}++;
+                        $perprocesspid{$process_pid}->{STATE_PCPU_PAGES_DRAINED}++;
+                } elsif ($tracepoint eq "mm_page_alloc_zone_locked") {
+                        $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_ZONE_LOCKED}++;
+                        $perprocesspid{$process_pid}->{STATE_PCPU_PAGES_REFILLED}++;
+                } elsif ($tracepoint eq "mm_page_alloc_extfrag") {
+                        # Extract the details of the event now
+                        $details = $5;
+                        my ($page, $pfn);
+                        my ($alloc_order, $fallback_order, $pageblock_order);
+                        my ($alloc_migratetype, $fallback_migratetype);
+                        my ($fragmenting, $change_ownership);
+                        if ($details !~ /$regex_fragdetails/o) {
+                                print "WARNING: Failed to parse mm_page_alloc_extfrag as expected\n";
+                                next;
+                        }
+                        $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_EXTFRAG}++;
+                        $page = $1;
+                        $pfn = $2;
+                        $alloc_order = $3;
+                        $fallback_order = $4;
+                        $pageblock_order = $5;
+                        $alloc_migratetype = $6;
+                        $fallback_migratetype = $7;
+                        $fragmenting = $8;
+                        $change_ownership = $9;
+                        if ($fragmenting) {
+                                $perprocesspid{$process_pid}->{HIGH_EXT_FRAG}++;
+                                if ($fallback_order <= 3) {
+                                        $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_SEVERE}++;
+                                } else {
+                                        $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_MODERATE}++;
+                                }
+                        }
+                        if ($change_ownership) {
+                                $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_CHANGED}++;
+                        }
+                } else {
+                        $perprocesspid{$process_pid}->{EVENT_UNKNOWN}++;
+                }
+                # Catch a full pcpu drain event
+                if ($perprocesspid{$process_pid}->{STATE_PCPU_PAGES_DRAINED} &&
+                                $tracepoint ne "mm_page_pcpu_drain") {
+                        $perprocesspid{$process_pid}->{HIGH_PCPU_DRAINS}++;
+                        $perprocesspid{$process_pid}->{STATE_PCPU_PAGES_DRAINED} = 0;
+                }
+                # Catch a full pcpu refill event
+                if ($perprocesspid{$process_pid}->{STATE_PCPU_PAGES_REFILLED} &&
+                                $tracepoint ne "mm_page_alloc_zone_locked") {
+                        $perprocesspid{$process_pid}->{HIGH_PCPU_REFILLS}++;
+                        $perprocesspid{$process_pid}->{STATE_PCPU_PAGES_REFILLED} = 0;
+                }
+                if ($sigint_pending) {
+                        last EVENT_PROCESS;
+                }
+        }
+}
+sub dump_stats {
+        my $hashref = shift;
+        my %stats = %$hashref;
+        # Dump per-process stats
+        my $process_pid;
+        my $max_strlen = 0;
+        # Get the maximum process name
+        foreach $process_pid (keys %perprocesspid) {
+                my $len = length($process_pid);
+                if ($len > $max_strlen) {
+                        $max_strlen = $len;
+                }
+        }
+        $max_strlen += 2;
+        printf("\n");
+        printf("%-" . $max_strlen . "s %8s %10s   %8s %8s   %8s %8s %8s   %8s %8s %8s %8s %8s %8s\n",
+                "Process", "Pages",  "Pages",      "Pages", "Pages", "PCPU",  "PCPU",   "PCPU",    "Fragment",  "Fragment", "MigType", "Fragment", "Fragment", "Unknown");
+        printf("%-" . $max_strlen . "s %8s %10s   %8s %8s   %8s %8s %8s   %8s %8s %8s %8s %8s %8s\n",
+                "details", "allocd", "allocd",     "freed", "freed", "pages", "drains", "refills", "Fallback", "Causing",   "Changed", "Severe", "Moderate", "");
+        printf("%-" . $max_strlen . "s %8s %10s   %8s %8s   %8s %8s %8s   %8s %8s %8s %8s %8s %8s\n",
+                "",        "",       "under lock", "direct", "pagevec", "drain", "", "", "", "", "", "", "", "");
+        foreach $process_pid (keys %stats) {
+                # Dump final aggregates
+                if ($stats{$process_pid}->{STATE_PCPU_PAGES_DRAINED}) {
+                        $stats{$process_pid}->{HIGH_PCPU_DRAINS}++;
+                        $stats{$process_pid}->{STATE_PCPU_PAGES_DRAINED} = 0;
+                }
+                if ($stats{$process_pid}->{STATE_PCPU_PAGES_REFILLED}) {
+                        $stats{$process_pid}->{HIGH_PCPU_REFILLS}++;
+                        $stats{$process_pid}->{STATE_PCPU_PAGES_REFILLED} = 0;
+                }
+                printf("%-" . $max_strlen . "s %8d %10d   %8d %8d   %8d %8d %8d   %8d %8d %8d %8d %8d %8d\n",
+                        $process_pid,
+                        $stats{$process_pid}->{MM_PAGE_ALLOC},
+                        $stats{$process_pid}->{MM_PAGE_ALLOC_ZONE_LOCKED},
+                        $stats{$process_pid}->{MM_PAGE_FREE_DIRECT},
+                        $stats{$process_pid}->{MM_PAGEVEC_FREE},
+                        $stats{$process_pid}->{MM_PAGE_PCPU_DRAIN},
+                        $stats{$process_pid}->{HIGH_PCPU_DRAINS},
+                        $stats{$process_pid}->{HIGH_PCPU_REFILLS},
+                        $stats{$process_pid}->{MM_PAGE_ALLOC_EXTFRAG},
+                        $stats{$process_pid}->{HIGH_EXT_FRAG},
+                        $stats{$process_pid}->{HIGH_EXT_FRAGMENT_CHANGED},
+                        $stats{$process_pid}->{HIGH_EXT_FRAGMENT_SEVERE},
+                        $stats{$process_pid}->{HIGH_EXT_FRAGMENT_MODERATE},
+                        $stats{$process_pid}->{EVENT_UNKNOWN});
+        }
+}
+sub aggregate_perprocesspid() {
+        my $process_pid;
+        my $process;
+        undef %perprocess;
+        foreach $process_pid (keys %perprocesspid) {
+                $process = $process_pid;
+                $process =~ s/-([0-9])*$//;
+                if ($process eq '') {
+                        $process = "NO_PROCESS_NAME";
+                }
+                $perprocess{$process}->{MM_PAGE_ALLOC} += $perprocesspid{$process_pid}->{MM_PAGE_ALLOC};
+                $perprocess{$process}->{MM_PAGE_ALLOC_ZONE_LOCKED} += $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_ZONE_LOCKED};
+                $perprocess{$process}->{MM_PAGE_FREE_DIRECT} += $perprocesspid{$process_pid}->{MM_PAGE_FREE_DIRECT};
+                $perprocess{$process}->{MM_PAGEVEC_FREE} += $perprocesspid{$process_pid}->{MM_PAGEVEC_FREE};
+                $perprocess{$process}->{MM_PAGE_PCPU_DRAIN} += $perprocesspid{$process_pid}->{MM_PAGE_PCPU_DRAIN};
+                $perprocess{$process}->{HIGH_PCPU_DRAINS} += $perprocesspid{$process_pid}->{HIGH_PCPU_DRAINS};
+                $perprocess{$process}->{HIGH_PCPU_REFILLS} += $perprocesspid{$process_pid}->{HIGH_PCPU_REFILLS};
+                $perprocess{$process}->{MM_PAGE_ALLOC_EXTFRAG} += $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_EXTFRAG};
+                $perprocess{$process}->{HIGH_EXT_FRAG} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAG};
+                $perprocess{$process}->{HIGH_EXT_FRAGMENT_CHANGED} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_CHANGED};
+                $perprocess{$process}->{HIGH_EXT_FRAGMENT_SEVERE} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_SEVERE};
+                $perprocess{$process}->{HIGH_EXT_FRAGMENT_MODERATE} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_MODERATE};
+                $perprocess{$process}->{EVENT_UNKNOWN} += $perprocesspid{$process_pid}->{EVENT_UNKNOWN};
+        }
+}
+sub report() {
+        if (!$opt_ignorepid) {
+                dump_stats(\%perprocesspid);
+        } else {
+                aggregate_perprocesspid();
+                dump_stats(\%perprocess);
+        }
+}
+# Process events or signals until neither is available
+sub signal_loop() {
+        my $sigint_processed;
+        do {
+                $sigint_processed = 0;
+                process_events();
+                # Handle pending signals if any
+                if ($sigint_pending) {
+                        my $current_time = time;
+                        if ($sigint_exit) {
+                                print "Received exit signal\n";
+                                $sigint_pending = 0;
+                        }
+                        if ($sigint_report) {
+                                if ($current_time >= $sigint_received + 2) {
+                                        report();
+                                        $sigint_report = 0;
+                                        $sigint_pending = 0;
+                                        $sigint_processed = 1;
+                                }
+                        }
+                }
+        } while ($sigint_pending || $sigint_processed);
+}
+signal_loop();
+report();
diff --git a/Documentation/trace/power.txt b/Documentation/trace/power.txt
deleted file mode 100644
index cd805e16dc27..000000000000
--- a/Documentation/trace/power.txt
+++ /dev/null
@@ -1,17 +0,0 @@
-The power tracer collects detailed information about C-state and P-state
-transitions, instead of just looking at the high-level "average"
-information.
-There is a helper script found in scrips/tracing/power.pl in the kernel
-sources which can be used to parse this information and create a
-Scalable Vector Graphics (SVG) picture from the trace data.
-To use this tracer:
-        echo 0 > /sys/kernel/debug/tracing/tracing_enabled
-        echo power > /sys/kernel/debug/tracing/current_tracer
-        echo 1 > /sys/kernel/debug/tracing/tracing_enabled
-        sleep 1
-        echo 0 > /sys/kernel/debug/tracing/tracing_enabled
-        cat /sys/kernel/debug/tracing/trace | \
-                perl scripts/tracing/power.pl > out.sv
diff --git a/Documentation/trace/tracepoint-analysis.txt b/Documentation/trace/tracepoint-analysis.txt
new file mode 100644
index 000000000000..5eb4e487e667
--- /dev/null
+++ b/Documentation/trace/tracepoint-analysis.txt
@@ -0,0 +1,327 @@
+                Notes on Analysing Behaviour Using Events and Tracepoints
+                        Documentation written by Mel Gorman
+                PCL information heavily based on email from Ingo Molnar
+1. Introduction
+===============
+Tracepoints (see Documentation/trace/tracepoints.txt) can be used without
+creating custom kernel modules to register probe functions using the event
+tracing infrastructure.
+Simplistically, tracepoints will represent an important event that when can
+be taken in conjunction with other tracepoints to build a "Big Picture" of
+what is going on within the system. There are a large number of methods for
+gathering and interpreting these events. Lacking any current Best Practises,
+this document describes some of the methods that can be used.
+This document assumes that debugfs is mounted on /sys/kernel/debug and that
+the appropriate tracing options have been configured into the kernel. It is
+assumed that the PCL tool tools/perf has been installed and is in your path.
+2. Listing Available Events
+===========================
+2.1 Standard Utilities
+----------------------
+All possible events are visible from /sys/kernel/debug/tracing/events. Simply
+calling
+  $ find /sys/kernel/debug/tracing/events -type d
+will give a fair indication of the number of events available.
+2.2 PCL
+-------
+Discovery and enumeration of all counters and events, including tracepoints
+are available with the perf tool. Getting a list of available events is a
+simple case of
+  $ perf list 2>&1 | grep Tracepoint
+  ext4:ext4_free_inode                     [Tracepoint event]
+  ext4:ext4_request_inode                  [Tracepoint event]
+  ext4:ext4_allocate_inode                 [Tracepoint event]
+  ext4:ext4_write_begin                    [Tracepoint event]
+  ext4:ext4_ordered_write_end              [Tracepoint event]
+  [ .... remaining output snipped .... ]
+2. Enabling Events
+==================
+2.1 System-Wide Event Enabling
+------------------------------
+See Documentation/trace/events.txt for a proper description on how events
+can be enabled system-wide. A short example of enabling all events related
+to page allocation would look something like
+  $ for i in `find /sys/kernel/debug/tracing/events -name "enable" | grep mm_`; do echo 1 > $i; done
+2.2 System-Wide Event Enabling with SystemTap
+---------------------------------------------
+In SystemTap, tracepoints are accessible using the kernel.trace() function
+call. The following is an example that reports every 5 seconds what processes
+were allocating the pages.
+  global page_allocs
+  probe kernel.trace("mm_page_alloc") {
+        page_allocs[execname()]++
+  }
+  function print_count() {
+        printf ("%-25s %-s\n", "#Pages Allocated", "Process Name")
+        foreach (proc in page_allocs-)
+                printf("%-25d %s\n", page_allocs[proc], proc)
+        printf ("\n")
+        delete page_allocs
+  }
+  probe timer.s(5) {
+          print_count()
+  }
+2.3 System-Wide Event Enabling with PCL
+---------------------------------------
+By specifying the -a switch and analysing sleep, the system-wide events
+for a duration of time can be examined.
+ $ perf stat -a \
+        -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
+        -e kmem:mm_pagevec_free \
+        sleep 10
+ Performance counter stats for 'sleep 10':
+           9630  kmem:mm_page_alloc
+           2143  kmem:mm_page_free_direct
+           7424  kmem:mm_pagevec_free
+   10.002577764  seconds time elapsed
+Similarly, one could execute a shell and exit it as desired to get a report
+at that point.
+2.4 Local Event Enabling
+------------------------
+Documentation/trace/ftrace.txt describes how to enable events on a per-thread
+basis using set_ftrace_pid.
+2.5 Local Event Enablement with PCL
+-----------------------------------
+Events can be activate and tracked for the duration of a process on a local
+basis using PCL such as follows.
+  $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
+                 -e kmem:mm_pagevec_free ./hackbench 10
+  Time: 0.909
+    Performance counter stats for './hackbench 10':
+          17803  kmem:mm_page_alloc
+          12398  kmem:mm_page_free_direct
+           4827  kmem:mm_pagevec_free
+    0.973913387  seconds time elapsed
+3. Event Filtering
+==================
+Documentation/trace/ftrace.txt covers in-depth how to filter events in
+ftrace.  Obviously using grep and awk of trace_pipe is an option as well
+as any script reading trace_pipe.
+4. Analysing Event Variances with PCL
+=====================================
+Any workload can exhibit variances between runs and it can be important
+to know what the standard deviation in. By and large, this is left to the
+performance analyst to do it by hand. In the event that the discrete event
+occurrences are useful to the performance analyst, then perf can be used.
+  $ perf stat --repeat 5 -e kmem:mm_page_alloc -e kmem:mm_page_free_direct
+                        -e kmem:mm_pagevec_free ./hackbench 10
+  Time: 0.890
+  Time: 0.895
+  Time: 0.915
+  Time: 1.001
+  Time: 0.899
+   Performance counter stats for './hackbench 10' (5 runs):
+          16630  kmem:mm_page_alloc         ( +-   3.542% )
+          11486  kmem:mm_page_free_direct   ( +-   4.771% )
+           4730  kmem:mm_pagevec_free       ( +-   2.325% )
+    0.982653002  seconds time elapsed   ( +-   1.448% )
+In the event that some higher-level event is required that depends on some
+aggregation of discrete events, then a script would need to be developed.
+Using --repeat, it is also possible to view how events are fluctuating over
+time on a system wide basis using -a and sleep.
+  $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
+                -e kmem:mm_pagevec_free \
+                -a --repeat 10 \
+                sleep 1
+  Performance counter stats for 'sleep 1' (10 runs):
+           1066  kmem:mm_page_alloc         ( +-  26.148% )
+            182  kmem:mm_page_free_direct   ( +-   5.464% )
+            890  kmem:mm_pagevec_free       ( +-  30.079% )
+    1.002251757  seconds time elapsed   ( +-   0.005% )
+5. Higher-Level Analysis with Helper Scripts
+============================================
+When events are enabled the events that are triggering can be read from
+/sys/kernel/debug/tracing/trace_pipe in human-readable format although binary
+options exist as well. By post-processing the output, further information can
+be gathered on-line as appropriate. Examples of post-processing might include
+  o Reading information from /proc for the PID that triggered the event
+  o Deriving a higher-level event from a series of lower-level events.
+  o Calculate latencies between two events
+Documentation/trace/postprocess/trace-pagealloc-postprocess.pl is an example
+script that can read trace_pipe from STDIN or a copy of a trace. When used
+on-line, it can be interrupted once to generate a report without existing
+and twice to exit.
+Simplistically, the script just reads STDIN and counts up events but it
+also can do more such as
+  o Derive high-level events from many low-level events. If a number of pages
+    are freed to the main allocator from the per-CPU lists, it recognises
+    that as one per-CPU drain even though there is no specific tracepoint
+    for that event
+  o It can aggregate based on PID or individual process number
+  o In the event memory is getting externally fragmented, it reports
+    on whether the fragmentation event was severe or moderate.
+  o When receiving an event about a PID, it can record who the parent was so
+    that if large numbers of events are coming from very short-lived
+    processes, the parent process responsible for creating all the helpers
+    can be identified
+6. Lower-Level Analysis with PCL
+================================
+There may also be a requirement to identify what functions with a program
+were generating events within the kernel. To begin this sort of analysis, the
+data must be recorded. At the time of writing, this required root
+  $ perf record -c 1 \
+        -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
+        -e kmem:mm_pagevec_free \
+        ./hackbench 10
+  Time: 0.894
+  [ perf record: Captured and wrote 0.733 MB perf.data (~32010 samples) ]
+Note the use of '-c 1' to set the event period to sample. The default sample
+period is quite high to minimise overhead but the information collected can be
+very coarse as a result.
+This record outputted a file called perf.data which can be analysed using
+perf report.
+  $ perf report
+  # Samples: 30922
+  #
+  # Overhead    Command                     Shared Object
+  # ........  .........  ................................
+  #
+      87.27%  hackbench  [vdso]
+       6.85%  hackbench  /lib/i686/cmov/libc-2.9.so
+       2.62%  hackbench  /lib/ld-2.9.so
+       1.52%       perf  [vdso]
+       1.22%  hackbench  ./hackbench
+       0.48%  hackbench  [kernel]
+       0.02%       perf  /lib/i686/cmov/libc-2.9.so
+       0.01%       perf  /usr/bin/perf
+       0.01%       perf  /lib/ld-2.9.so
+       0.00%  hackbench  /lib/i686/cmov/libpthread-2.9.so
+  #
+  # (For more details, try: perf report --sort comm,dso,symbol)
+  #
+According to this, the vast majority of events occured triggered on events
+within the VDSO. With simple binaries, this will often be the case so lets
+take a slightly different example. In the course of writing this, it was
+noticed that X was generating an insane amount of page allocations so lets look
+at it
+  $ perf record -c 1 -f \
+                -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
+                -e kmem:mm_pagevec_free \
+                -p `pidof X`
+This was interrupted after a few seconds and
+  $ perf report
+  # Samples: 27666
+  #
+  # Overhead  Command                            Shared Object
+  # ........  .......  .......................................
+  #
+      51.95%     Xorg  [vdso]
+      47.95%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1
+       0.09%     Xorg  /lib/i686/cmov/libc-2.9.so
+       0.01%     Xorg  [kernel]
+  #
+  # (For more details, try: perf report --sort comm,dso,symbol)
+  #
+So, almost half of the events are occuring in a library. To get an idea which
+symbol.
+  $ perf report --sort comm,dso,symbol
+  # Samples: 27666
+  #
+  # Overhead  Command                            Shared Object  Symbol
+  # ........  .......  .......................................  ......
+  #
+      51.95%     Xorg  [vdso]                                   [.] 0x000000ffffe424
+      47.93%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1  [.] pixmanFillsse2
+       0.09%     Xorg  /lib/i686/cmov/libc-2.9.so               [.] _int_malloc
+       0.01%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1  [.] pixman_region32_copy_f
+       0.01%     Xorg  [kernel]                                 [k] read_hpet
+       0.01%     Xorg  /opt/gfx-test/lib/libpixman-1.so.0.13.1  [.] get_fast_path
+       0.00%     Xorg  [kernel]                                 [k] ftrace_trace_userstack
+To see where within the function pixmanFillsse2 things are going wrong
+  $ perf annotate pixmanFillsse2
+  [ ... ]
+    0.00 :         34eeb:       0f 18 08                prefetcht0 (%eax)
+         :      }
+         :
+         :      extern __inline void __attribute__((__gnu_inline__, __always_inline__, _
+         :      _mm_store_si128 (__m128i *__P, __m128i __B) :      {
+         :        *__P = __B;
+   12.40 :         34eee:       66 0f 7f 80 40 ff ff    movdqa %xmm0,-0xc0(%eax)
+    0.00 :         34ef5:       ff
+   12.40 :         34ef6:       66 0f 7f 80 50 ff ff    movdqa %xmm0,-0xb0(%eax)
+    0.00 :         34efd:       ff
+   12.39 :         34efe:       66 0f 7f 80 60 ff ff    movdqa %xmm0,-0xa0(%eax)
+    0.00 :         34f05:       ff
+   12.67 :         34f06:       66 0f 7f 80 70 ff ff    movdqa %xmm0,-0x90(%eax)
+    0.00 :         34f0d:       ff
+   12.58 :         34f0e:       66 0f 7f 40 80          movdqa %xmm0,-0x80(%eax)
+   12.31 :         34f13:       66 0f 7f 40 90          movdqa %xmm0,-0x70(%eax)
+   12.40 :         34f18:       66 0f 7f 40 a0          movdqa %xmm0,-0x60(%eax)
+   12.31 :         34f1d:       66 0f 7f 40 b0          movdqa %xmm0,-0x50(%eax)
+At a glance, it looks like the time is being spent copying pixmaps to
+the card.  Further investigation would be needed to determine why pixmaps
+are being copied around so much but a starting point would be to take an
+ancient build of libpixmap out of the library path where it was totally
+forgotten about from months ago!