aboutsummaryrefslogtreecommitdiffstats
path: root/Documentation/prio_tree.txt
blob: 3aa68f9a117b19cac846aabd769d53d46f657e82 (plain) (blame)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
The prio_tree.c code indexes vmas using 3 different indexes:
	* heap_index  = vm_pgoff + vm_size_in_pages : end_vm_pgoff
	* radix_index = vm_pgoff : start_vm_pgoff
	* size_index = vm_size_in_pages

A regular radix-priority-search-tree indexes vmas using only heap_index and
radix_index. The conditions for indexing are:
	* ->heap_index >= ->left->heap_index &&
		->heap_index >= ->right->heap_index
	* if (->heap_index == ->left->heap_index)
		then ->radix_index < ->left->radix_index;
	* if (->heap_index == ->right->heap_index)
		then ->radix_index < ->right->radix_index;
	* nodes are hashed to left or right subtree using radix_index
	  similar to a pure binary radix tree.

A regular radix-priority-search-tree helps to store and query
intervals (vmas). However, a regular radix-priority-search-tree is only
suitable for storing vmas with different radix indices (vm_pgoff).

Therefore, the prio_tree.c extends the regular radix-priority-search-tree
to handle many vmas with the same vm_pgoff. Such vmas are handled in
2 different ways: 1) All vmas with the same radix _and_ heap indices are
linked using vm_set.list, 2) if there are many vmas with the same radix
index, but different heap indices and if the regular radix-priority-search
tree cannot index them all, we build an overflow-sub-tree that indexes such
vmas using heap and size indices instead of heap and radix indices. For
example, in the figure below some vmas with vm_pgoff = 0 (zero) are
indexed by regular radix-priority-search-tree whereas others are pushed
into an overflow-subtree. Note that all vmas in an overflow-sub-tree have
the same vm_pgoff (radix_index) and if necessary we build different
overflow-sub-trees to handle each possible radix_index. For example,
in figure we have 3 overflow-sub-trees corresponding to radix indices
0, 2, and 4.

In the final tree the first few (prio_tree_root->index_bits) levels
are indexed using heap and radix indices whereas the overflow-sub-trees below
those levels (i.e. levels prio_tree_root->index_bits + 1 and higher) are
indexed using heap and size indices. In overflow-sub-trees the size_index
is used for hashing the nodes to appropriate places.

Now, an example prio_tree:

  vmas are represented [radix_index, size_index, heap_index]
                 i.e., [start_vm_pgoff, vm_size_in_pages, end_vm_pgoff]

level  prio_tree_root->index_bits = 3
-----
												_
  0			 				[0,7,7]					 |
  							/     \					 |
				      ------------------       ------------			 |     Regular
  				     /					   \			 |  radix priority
  1		 		[1,6,7]					  [4,3,7]		 |   search tree
  				/     \					  /     \		 |
			 -------       -----			    ------       -----		 |  heap-and-radix
			/		    \			   /		      \		 |      indexed
  2		    [0,6,6]	 	   [2,5,7]		[5,2,7]		    [6,1,7]	 |
		    /     \		   /     \		/     \		    /     \	 |
  3		[0,5,5]	[1,5,6]		[2,4,6]	[3,4,7]	    [4,2,6] [5,1,6]	[6,0,6]	[7,0,7]	 |
		   /			   /		       /		   		_
                  /		          /		      /					_
  4	      [0,4,4]		      [2,3,5]		   [4,1,5]				 |
  		 /			 /		      /					 |
  5	     [0,3,3]		     [2,2,4]		  [4,0,4]				 |  Overflow-sub-trees
  		/			/							 |
  6	    [0,2,2]		    [2,1,3]							 |    heap-and-size
  	       /		       /							 |       indexed
  7	   [0,1,1]		   [2,0,2]							 |
  	      /											 |
  8	  [0,0,0]										 |
  												_

Note that we use prio_tree_root->index_bits to optimize the height
of the heap-and-radix indexed tree. Since prio_tree_root->index_bits is
set according to the maximum end_vm_pgoff mapped, we are sure that all
bits (in vm_pgoff) above prio_tree_root->index_bits are 0 (zero). Therefore,
we only use the first prio_tree_root->index_bits as radix_index.
Whenever index_bits is increased in prio_tree_expand, we shuffle the tree
to make sure that the first prio_tree_root->index_bits levels of the tree
is indexed properly using heap and radix indices.

We do not optimize the height of overflow-sub-trees using index_bits.
The reason is: there can be many such overflow-sub-trees and all of
them have to be suffled whenever the index_bits increases. This may involve
walking the whole prio_tree in prio_tree_insert->prio_tree_expand code
path which is not desirable. Hence, we do not optimize the height of the
heap-and-size indexed overflow-sub-trees using prio_tree->index_bits.
Instead the overflow sub-trees are indexed using full BITS_PER_LONG bits
of size_index. This may lead to skewed sub-trees because most of the
higher significant bits of the size_index are likely to be 0 (zero). In
the example above, all 3 overflow-sub-trees are skewed. This may marginally
affect the performance. However, processes rarely map many vmas with the
same start_vm_pgoff but different end_vm_pgoffs. Therefore, we normally
do not require overflow-sub-trees to index all vmas.

From the above discussion it is clear that the maximum height of
a prio_tree can be prio_tree_root->index_bits + BITS_PER_LONG.
However, in most of the common cases we do not need overflow-sub-trees,
so the tree height in the common cases will be prio_tree_root->index_bits.

It is fair to mention here that the prio_tree_root->index_bits
is increased on demand, however, the index_bits is not decreased when
vmas are removed from the prio_tree. That's tricky to do. Hence, it's
left as a home work problem.


href='#n271'>271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346

sysfs - _The_ filesystem for exporting kernel objects. 

Patrick Mochel	<mochel@osdl.org>

10 January 2003


What it is:
~~~~~~~~~~~

sysfs is a ram-based filesystem initially based on ramfs. It provides
a means to export kernel data structures, their attributes, and the 
linkages between them to userspace. 

sysfs is tied inherently to the kobject infrastructure. Please read
Documentation/kobject.txt for more information concerning the kobject
interface. 


Using sysfs
~~~~~~~~~~~

sysfs is always compiled in. You can access it by doing:

    mount -t sysfs sysfs /sys 


Directory Creation
~~~~~~~~~~~~~~~~~~

For every kobject that is registered with the system, a directory is
created for it in sysfs. That directory is created as a subdirectory
of the kobject's parent, expressing internal object hierarchies to
userspace. Top-level directories in sysfs represent the common
ancestors of object hierarchies; i.e. the subsystems the objects
belong to. 

Sysfs internally stores the kobject that owns the directory in the
->d_fsdata pointer of the directory's dentry. This allows sysfs to do
reference counting directly on the kobject when the file is opened and
closed. 


Attributes
~~~~~~~~~~

Attributes can be exported for kobjects in the form of regular files in
the filesystem. Sysfs forwards file I/O operations to methods defined
for the attributes, providing a means to read and write kernel
attributes.

Attributes should be ASCII text files, preferably with only one value
per file. It is noted that it may not be efficient to contain only
value per file, so it is socially acceptable to express an array of
values of the same type. 

Mixing types, expressing multiple lines of data, and doing fancy
formatting of data is heavily frowned upon. Doing these things may get
you publically humiliated and your code rewritten without notice. 


An attribute definition is simply:

struct attribute {
        char                    * name;
        mode_t                  mode;
};


int sysfs_create_file(struct kobject * kobj, struct attribute * attr);
void sysfs_remove_file(struct kobject * kobj, struct attribute * attr);


A bare attribute contains no means to read or write the value of the
attribute. Subsystems are encouraged to define their own attribute
structure and wrapper functions for adding and removing attributes for
a specific object type. 

For example, the driver model defines struct device_attribute like:

struct device_attribute {
        struct attribute        attr;
        ssize_t (*show)(struct device * dev, char * buf);
        ssize_t (*store)(struct device * dev, const char * buf);
};

int device_create_file(struct device *, struct device_attribute *);
void device_remove_file(struct device *, struct device_attribute *);

It also defines this helper for defining device attributes: 

#define DEVICE_ATTR(_name, _mode, _show, _store)      \
struct device_attribute dev_attr_##_name = {            \
        .attr = {.name  = __stringify(_name) , .mode   = _mode },      \
        .show   = _show,                                \
        .store  = _store,                               \
};

For example, declaring

static DEVICE_ATTR(foo, S_IWUSR | S_IRUGO, show_foo, store_foo);

is equivalent to doing:

static struct device_attribute dev_attr_foo = {
       .attr	= {
		.name = "foo",
		.mode = S_IWUSR | S_IRUGO,
	},
	.show = show_foo,
	.store = store_foo,
};


Subsystem-Specific Callbacks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When a subsystem defines a new attribute type, it must implement a
set of sysfs operations for forwarding read and write calls to the
show and store methods of the attribute owners. 

struct sysfs_ops {
        ssize_t (*show)(struct kobject *, struct attribute *, char *);
        ssize_t (*store)(struct kobject *, struct attribute *, const char *);
};

[ Subsystems should have already defined a struct kobj_type as a
descriptor for this type, which is where the sysfs_ops pointer is
stored. See the kobject documentation for more information. ]

When a file is read or written, sysfs calls the appropriate method
for the type. The method then translates the generic struct kobject
and struct attribute pointers to the appropriate pointer types, and
calls the associated methods. 


To illustrate:

#define to_dev_attr(_attr) container_of(_attr, struct device_attribute, attr)
#define to_dev(d) container_of(d, struct device, kobj)

static ssize_t
dev_attr_show(struct kobject * kobj, struct attribute * attr, char * buf)
{
        struct device_attribute * dev_attr = to_dev_attr(attr);
        struct device * dev = to_dev(kobj);
        ssize_t ret = 0;

        if (dev_attr->show)
                ret = dev_attr->show(dev, buf);
        return ret;
}



Reading/Writing Attribute Data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To read or write attributes, show() or store() methods must be
specified when declaring the attribute. The method types should be as
simple as those defined for device attributes:

        ssize_t (*show)(struct device * dev, char * buf);
        ssize_t (*store)(struct device * dev, const char * buf);

IOW, they should take only an object and a buffer as parameters. 


sysfs allocates a buffer of size (PAGE_SIZE) and passes it to the
method. Sysfs will call the method exactly once for each read or
write. This forces the following behavior on the method
implementations: 

- On read(2), the show() method should fill the entire buffer. 
  Recall that an attribute should only be exporting one value, or an
  array of similar values, so this shouldn't be that expensive. 

  This allows userspace to do partial reads and seeks arbitrarily over
  the entire file at will. 

- On write(2), sysfs expects the entire buffer to be passed during the
  first write. Sysfs then passes the entire buffer to the store()
  method. 
  
  When writing sysfs files, userspace processes should first read the
  entire file, modify the values it wishes to change, then write the
  entire buffer back. 

  Attribute method implementations should operate on an identical
  buffer when reading and writing values. 

Other notes:

- The buffer will always be PAGE_SIZE bytes in length. On i386, this
  is 4096. 

- show() methods should return the number of bytes printed into the
  buffer. This is the return value of snprintf().

- show() should always use snprintf(). 

- store() should return the number of bytes used from the buffer. This
  can be done using strlen().

- show() or store() can always return errors. If a bad value comes
  through, be sure to return an error.

- The object passed to the methods will be pinned in memory via sysfs
  referencing counting its embedded object. However, the physical 
  entity (e.g. device) the object represents may not be present. Be 
  sure to have a way to check this, if necessary. 


A very simple (and naive) implementation of a device attribute is:

static ssize_t show_name(struct device *dev, struct device_attribute *attr, char *buf)
{
	return snprintf(buf, PAGE_SIZE, "%s\n", dev->name);
}

static ssize_t store_name(struct device * dev, const char * buf)
{
	sscanf(buf, "%20s", dev->name);
	return strnlen(buf, PAGE_SIZE);
}

static DEVICE_ATTR(name, S_IRUGO, show_name, store_name);


(Note that the real implementation doesn't allow userspace to set the 
name for a device.)


Top Level Directory Layout
~~~~~~~~~~~~~~~~~~~~~~~~~~

The sysfs directory arrangement exposes the relationship of kernel
data structures. 

The top level sysfs directory looks like:

block/
bus/
class/
devices/
firmware/
net/
fs/

devices/ contains a filesystem representation of the device tree. It maps
directly to the internal kernel device tree, which is a hierarchy of
struct device. 

bus/ contains flat directory layout of the various bus types in the
kernel. Each bus's directory contains two subdirectories:

	devices/
	drivers/

devices/ contains symlinks for each device discovered in the system
that point to the device's directory under root/.

drivers/ contains a directory for each device driver that is loaded
for devices on that particular bus (this assumes that drivers do not
span multiple bus types).

fs/ contains a directory for some filesystems.  Currently each
filesystem wanting to export attributes must create its own hierarchy
below fs/ (see ./fuse.txt for an example).


More information can driver-model specific features can be found in
Documentation/driver-model/. 


TODO: Finish this section.


Current Interfaces
~~~~~~~~~~~~~~~~~~

The following interface layers currently exist in sysfs:


- devices (include/linux/device.h)
----------------------------------
Structure:

struct device_attribute {
        struct attribute        attr;
        ssize_t (*show)(struct device * dev, char * buf);
        ssize_t (*store)(struct device * dev, const char * buf);
};

Declaring:

DEVICE_ATTR(_name, _str, _mode, _show, _store);

Creation/Removal:

int device_create_file(struct device *device, struct device_attribute * attr);
void device_remove_file(struct device * dev, struct device_attribute * attr);


- bus drivers (include/linux/device.h)
--------------------------------------
Structure:

struct bus_attribute {
        struct attribute        attr;
        ssize_t (*show)(struct bus_type *, char * buf);
        ssize_t (*store)(struct bus_type *, const char * buf);
};

Declaring:

BUS_ATTR(_name, _mode, _show, _store)

Creation/Removal:

int bus_create_file(struct bus_type *, struct bus_attribute *);
void bus_remove_file(struct bus_type *, struct bus_attribute *);


- device drivers (include/linux/device.h)
-----------------------------------------

Structure:

struct driver_attribute {
        struct attribute        attr;
        ssize_t (*show)(struct device_driver *, char * buf);
        ssize_t (*store)(struct device_driver *, const char * buf);
};

Declaring:

DRIVER_ATTR(_name, _mode, _show, _store)

Creation/Removal:

int driver_create_file(struct device_driver *, struct driver_attribute *);
void driver_remove_file(struct device_driver *, struct driver_attribute *);