xfs: Document error handlers behavior

Document the implementation of error handlers into sysfs.

[dchinner: Added lots more detail.]

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
author: Carlos Maiolino <cmaiolino@redhat.com> 2016-09-18 19:38:25 -0400
committer: Dave Chinner <david@fromorbit.com> 2016-09-18 19:38:25 -0400
commit: 5694fe9aadbb26874d2791de1db6ac08aa1b4c14 (patch)
tree: 89a355ce8e81699ad0012a0beecf01c570bab0ca
parent: 77169812739dd800bc3620d781a77c50c75165cc (diff)
1 files changed, 123 insertions, 0 deletions
diff --git a/Documentation/filesystems/xfs.txt b/Documentation/filesystems/xfs.txt
index 8146e9fd5ffc..c2d44e6e117b 100644
--- a/Documentation/filesystems/xfs.txt
+++ b/Documentation/filesystems/xfs.txt
@@ -348,3 +348,126 @@ Removed Sysctls
  ----                          -------
  fs.xfs.xfsbufd_centisec       v4.0
  fs.xfs.age_buffer_centisecs   v4.0
+Error handling
+==============
+XFS can act differently according to the type of error found during its
+operation. The implementation introduces the following concepts to the error
+handler:
+ -failure speed:
+        Defines how fast XFS should propagate an error upwards when a specific
+        error is found during the filesystem operation. It can propagate
+        immediately, after a defined number of retries, after a set time period,
+        or simply retry forever.
+ -error classes:
+        Specifies the subsystem the error configuration will apply to, such as
+        metadata IO or memory allocation. Different subsystems will have
+        different error handlers for which behaviour can be configured.
+ -error handlers:
+        Defines the behavior for a specific error.
+The filesystem behavior during an error can be set via sysfs files. Each
+error handler works independently - the first condition met by an error handler
+for a specific class will cause the error to be propagated rather than reset and
+retried.
+The action taken by the filesystem when the error is propagated is context
+dependent - it may cause a shut down in the case of an unrecoverable error,
+it may be reported back to userspace, or it may even be ignored because
+there is nothing useful we can do with the error and no one to report it to
+(e.g. during unmount).
+The configuration files are organized into the following hierarchy for each
+mounted filesystem:
+  /sys/fs/xfs/<dev>/error/<class>/<error>/
+Where:
+  <dev>
+        The short device name of the mounted filesystem. This is the same device
+        name that shows up in XFS kernel error messages as "XFS(<dev>): ..."
+  <class>
+        The subsystem the error configuration belongs to. As of 4.9, the defined
+        classes are:
+                - "metadata": applies to metadata buffer write IO
+  <error>
+        The individual error handler configurations.
+Each filesystem has "global" error configuration options defined in its top
+level directory:
+  /sys/fs/xfs/<dev>/error/
+  fail_at_unmount               (Min:  0  Default:  1  Max: 1)
+        Defines the filesystem error behavior at unmount time.
+        If set to a value of 1, XFS will override all other error configurations
+        during unmount and replace them with "immediate fail" characteristics,
+        i.e. no retries and no retry timeout. This will always allow unmount to
+        succeed when there are persistent errors present.
+        If set to 0, the configured retry behaviour will continue until all
+        retries and/or timeouts have been exhausted. This will delay unmount
+        completion when there are persistent errors, and it may prevent the
+        filesystem from ever unmounting fully in the case of "retry forever"
+        handler configurations.
+        Note: there is no guarantee that fail_at_unmount can be set whilst an
+        unmount is in progress. It is possible that the sysfs entries are
+        removed by the unmounting filesystem before a "retry forever" error
+        handler configuration causes unmount to hang, and hence the filesystem
+        must be configured appropriately before unmount begins to prevent
+        unmount hangs.
+Each filesystem has specific error class handlers that define the error
+propagation behaviour for specific errors. There is also a "default" error
+handler defined, which defines the behaviour for all errors that don't have
+specific handlers defined. Where multiple retry constraints are configured for
+a single error, the first retry configuration that expires will cause the error
+to be propagated. The handler configurations are found in the directory:
+  /sys/fs/xfs/<dev>/error/<class>/<error>/
+  max_retries                   (Min: -1  Default: Varies  Max: INTMAX)
+        Defines the allowed number of retries of a specific error before
+        the filesystem will propagate the error. The retry count for a given
+        error context (e.g. a specific metadata buffer) is reset every time
+        there is a successful completion of the operation.
+        Setting the value to "-1" will cause XFS to retry forever for this
+        specific error.
+        Setting the value to "0" will cause XFS to fail immediately when the
+        specific error is reported.
+        Setting the value to "N" (where 0 < N < Max) will make XFS retry the
+        operation "N" times before propagating the error.
+  retry_timeout_seconds         (Min:  -1  Default:  Varies  Max: 1 day)
+        Defines the amount of time (in seconds) that the filesystem is
+        allowed to retry its operations when the specific error is
+        found.
+        Setting the value to "-1" will allow XFS to retry forever for this
+        specific error.
+        Setting the value to "0" will cause XFS to fail immediately when the
+        specific error is reported.
+        Setting the value to "N" (where 0 < N < Max) will allow XFS to retry the
+        operation for up to "N" seconds before propagating the error.
+Note: The default behaviour for a specific error handler is dependent on both
+the class and error context. For example, the default values for
+"metadata/ENODEV" are "0" rather than "-1" so that this error handler defaults
+to "fail immediately" behaviour. This is done because ENODEV is a fatal,
+unrecoverable error no matter how many times the metadata IO is retried.
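
The rules documented in the patch above can be summarised as a small decision function. What follows is only an illustrative model of the documented semantics, not the kernel implementation: the function names `handler_path` and `should_propagate`, their parameters, and the device name `sda1` are all hypothetical and chosen for this sketch. It assumes that "0" means fail immediately and "-1" means retry forever for both tunables, as the text states, and that whichever limit expires first wins.

```python
from pathlib import PurePosixPath

RETRY_FOREVER = -1  # documented meaning of "-1" for both tunables


def handler_path(dev, err_class, error, root="/sys/fs/xfs"):
    """Build <root>/<dev>/error/<class>/<error> as laid out in the text."""
    return PurePosixPath(root) / dev / "error" / err_class / error


def should_propagate(max_retries, retry_timeout_seconds,
                     retries_done, elapsed_seconds,
                     unmounting=False, fail_at_unmount=1):
    """Return True when the error is propagated instead of retried.

    Hypothetical model of the documented rules only, not kernel code.
    """
    # fail_at_unmount=1 overrides every handler while unmounting.
    if unmounting and fail_at_unmount == 1:
        return True
    # Retry budget: 0 fails at once, N allows N retries, -1 never expires.
    if max_retries != RETRY_FOREVER and retries_done >= max_retries:
        return True
    # Time budget: 0 fails at once, N allows N seconds, -1 never expires.
    if (retry_timeout_seconds != RETRY_FOREVER
            and elapsed_seconds >= retry_timeout_seconds):
        return True
    # Neither limit has expired, so the operation is retried.
    return False


if __name__ == "__main__":
    # "sda1" is a made-up device name used purely for illustration.
    print(handler_path("sda1", "metadata", "ENODEV"))
    # Three retries allowed and no time limit: the fourth failure propagates.
    print(should_propagate(3, RETRY_FOREVER, retries_done=3, elapsed_seconds=5))
```

With both tunables set to -1 the model never propagates on its own, which mirrors the unmount hang the patch warns about when fail_at_unmount is 0.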