aboutsummaryrefslogtreecommitdiffstats
path: root/drivers/block
Commit message (Collapse)AuthorAge
* Merge branch 'for-3.11/drivers' of git://git.kernel.dk/linux-blockLinus Torvalds2013-07-22
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block IO driver bits from Jens Axboe: "As I mentioned in the core block pull request, due to real life circumstances the driver pull request would be late. Now it looks like -rc2 late... On the plus side, apart form the rsxx update, these are all things that I could argue could go in later in the cycle as they are fixes and not features. So even though things are late, it's not ALL bad. The pull request contains: - Updates to bcache, all bug fixes, from Kent. - A pile of drbd bug fixes (no big features this time!). - xen blk front/back fixes. - rsxx driver updates, some of them deferred form 3.10. So should be well cooked by now" * 'for-3.11/drivers' of git://git.kernel.dk/linux-block: (63 commits) bcache: Allocation kthread fixes bcache: Fix GC_SECTORS_USED() calculation bcache: Journal replay fix bcache: Shutdown fix bcache: Fix a sysfs splat on shutdown bcache: Advertise that flushes are supported bcache: check for allocation failures bcache: Fix a dumb race bcache: Use standard utility code bcache: Update email address bcache: Delete fuzz tester bcache: Document shrinker reserve better bcache: FUA fixes drbd: Allow online change of al-stripes and al-stripe-size drbd: Constants should be UPPERCASE drbd: Ignore the exit code of a fence-peer handler if it returns too late drbd: Fix rcu_read_lock balance on error path drbd: fix error return code in drbd_init() drbd: Do not sleep inside rcu bcache: Refresh usage docs ...
| * Merge tag 'v3.10-rc7' into for-3.11/driversJens Axboe2013-07-02
| |\ | | | | | | | | | | | | | | | | | | | | | Linux 3.10-rc7 Pull this in early to avoid doing it with the bcache merge, since there are a number of changes to bcache between my old base (3.10-rc1) and the new pull request.
| * | drbd: Allow online change of al-stripes and al-stripe-sizePhilipp Reisner2013-06-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Allow to change the AL layout with an resize operation. For that the reisze command gets two new fields: al_stripes and al_stripe_size. In order to make the operation crash save: 1) Lock out all IO and MD-IO 2) Write the super block with MDF_PRIMARY_IND clear 3) write the bitmap to the new location (all zeros, since we allow only while connected) 4) Initialize the new AL-area 5) Write the super block with the restored MDF_PRIMARY_IND. 6) Unfreeze all IO Since the AL-layout has no influence on the protocol, this operation needs to be beforemed on both sides of a resource (if intended). Signed-off-by: Andreas Gruenbacher <agruen@linbit.com> Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | drbd: Constants should be UPPERCASEPhilipp Reisner2013-06-28
| | | | | | | | | | | | | | | | | | Signed-off-by: Andreas Gruenbacher <agruen@linbit.com> Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | drbd: Ignore the exit code of a fence-peer handler if it returns too latePhilipp Reisner2013-06-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In case the connection was established and lost again before the a fence-peer handler returns, ignore the exit code of this instance. (And use the exit code of the later started instance) Signed-off-by: Andreas Gruenbacher <agruen@linbit.com> Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | drbd: Fix rcu_read_lock balance on error pathAndreas Gruenbacher2013-06-28
| | | | | | | | | | | | | | | | | | Signed-off-by: Andreas Gruenbacher <agruen@linbit.com> Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | drbd: fix error return code in drbd_init()Wei Yongjun2013-06-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix to return a negative error code from the error handling case instead of 0, as returned elsewhere in this function. Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Andreas Gruenbacher <agruen@linbit.com> Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | drbd: Do not sleep inside rcuAndreas Gruenbacher2013-06-28
| | | | | | | | | | | | | | | Signed-off-by: Andreas Gruenbacher <agruen@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | Merge branch 'stable/for-jens-3.10' of ↵Jens Axboe2013-06-28
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen into for-3.11/drivers Konrad writes: It has the 'feature-max-indirect-segments' implemented in both backend and frontend. The current problem with the backend and frontend is that the segment size is limited to 11 pages. It means we can at most squeeze in 44kB per request. The ring can hold 32 (next power of two below 36) requests, meaning we can do 1.4M of outstanding requests. Nowadays that is not enough. The problem in the past was addressed in two ways - but neither one went upstream. The first solution to this proposed by Justin from Spectralogic was to negotiate the segment size. This means that the ‘struct blkif_sring_entry’ is now a variable size. It can expand from 112 bytes (cover 11 pages of data - 44kB) to 1580 bytes (256 pages of data - so 1MB). It is a simple extension by just making the array in the request expand from 11 to a variable size negotiated. But it had limits: this extension still limits the number of segments per request to 255 (as the total number must be specified in the request, which only has an 8-bit field for that purpose). The other solution (from Intel - Ronghui) was to create one extra ring that only has the ‘struct blkif_request_segment’ in them. The ‘struct blkif_request’ would be changed to have an index in said ‘segment ring’. There is only one segment ring. This means that the size of the initial ring is still the same. The requests would point to the segment and enumerate out how many of the indexes it wants to use. The limit is of course the size of the segment. If one assumes a one-page segment this means we can in one request cover ~4MB. Those patches were posted as RFC and the author never followed up on the ideas on changing it to be a bit more flexible. There is yet another mechanism that could be employed  (which these patches implement) - and it borrows from VirtIO protocol. And that is the ‘indirect descriptors’. This very similar to what Intel suggests, but with a twist. The twist is to negotiate how many of these 'segment' pages (aka indirect descriptor pages) we want to support (in reality we negotiate how many entries in the segment we want to cover, and we module the number if it is bigger than the segment size). This means that with the existing 36 slots in the ring (single page) we can cover: 32 slots * each blkif_request_indirect covers: 512 * 4096 ~= 64M. Since we ample space in the blkif_request_indirect to span more than one indirect page, that number (64M) can be also multiplied by eight = 512MB. Roger Pau Monne took the idea and implemented them in these patches. They work great and the corner cases (migration between backends with and without this extension) work nicely. The backend has a limit right now off how many indirect entries it can handle: one indirect page, and at maximum 256 entries (out of 512 - so 50% of the page is used). That comes out to 32 slots * 256 entries in a indirect page * 1 indirect page per request * 4096 = 32MB. This is a conservative number that can change in the future. Right now it strikes a good balance between giving excellent performance, memory usage in the backend, and balancing the needs of many guests. In the patchset there is also the split of the blkback structure to be per-VBD. This means that the spinlock contention we had with many guests trying to do I/O and all the blkback threads hitting the same lock has been eliminated. Also there are bug-fixes to deal with oddly sized sectors, insane amounts on th ring, and also a security fix (posted earlier).
| | * | xen-blkback: check the number of iovecs before allocating a biosRoger Pau Monne2013-06-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With the introduction of indirect segments we can receive requests with a number of segments bigger than the maximum number of allowed iovecs in a bios, so make sure that blkback doesn't try to allocate a bios with more iovecs than BIO_MAX_PAGES Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| | * | xen-blkfront: set blk_queue_max_hw_sectors correctlyRoger Pau Monne2013-06-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that indirect segments are enabled blk_queue_max_hw_sectors must be set to match the maximum number of sectors we can handle in a request. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reported-by: Felipe Franciosi <felipe.franciosi@citrix.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| | * | xen-blkback: workaround compiler bug in gcc 4.1Roger Pau Monne2013-06-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The code generat with gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-54) creates an unbound loop for the second foreach_grant_safe loop in purge_persistent_gnt. The workaround is to avoid having this second loop and instead perform all the work inside the first loop by adding a new variable, clean_used, that will be set when all the desired persistent grants have been removed and we need to iterate over the remaining ones to remove the WAS_ACTIVE flag. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reported-by: Tom O'Neill <toneill@vmem.com> Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| | * | xen/blkback: Check for insane amounts of request on the ring (v6).Konrad Rzeszutek Wilk2013-06-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Check that the ring does not have an insane amount of requests (more than there could fit on the ring). If we detect this case we will stop processing the requests and wait until the XenBus disconnects the ring. The existing check RING_REQUEST_CONS_OVERFLOW which checks for how many responses we have created in the past (rsp_prod_pvt) vs requests consumed (req_cons) and whether said difference is greater or equal to the size of the ring, does not catch this case. Wha the condition does check if there is a need to process more as we still have a backlog of responses to finish. Note that both of those values (rsp_prod_pvt and req_cons) are not exposed on the shared ring. To understand this problem a mini crash course in ring protocol response/request updates is in place. There are four entries: req_prod and rsp_prod; req_event and rsp_event to track the ring entries. We are only concerned about the first two - which set the tone of this bug. The req_prod is a value incremented by frontend for each request put on the ring. Conversely the rsp_prod is a value incremented by the backend for each response put on the ring (rsp_prod gets set by rsp_prod_pvt when pushing the responses on the ring). Both values can wrap and are modulo the size of the ring (in block case that is 32). Please see RING_GET_REQUEST and RING_GET_RESPONSE for the more details. The culprit here is that if the difference between the req_prod and req_cons is greater than the ring size we have a problem. Fortunately for us, the '__do_block_io_op' loop: rc = blk_rings->common.req_cons; rp = blk_rings->common.sring->req_prod; while (rc != rp) { .. blk_rings->common.req_cons = ++rc; /* before make_response() */ } will loop up to the point when rc == rp. The macros inside of the loop (RING_GET_REQUEST) is smart and is indexing based on the modulo of the ring size. If the frontend has provided a bogus req_prod value we will loop until the 'rc == rp' - which means we could be processing already processed requests (or responses) often. The reason the RING_REQUEST_CONS_OVERFLOW is not helping here is b/c it only tracks how many responses we have internally produced and whether we would should process more. The astute reader will notice that the macro RING_REQUEST_CONS_OVERFLOW provides two arguments - more on this later. For example, if we were to enter this function with these values: blk_rings->common.sring->req_prod = X+31415 (X is the value from the last time __do_block_io_op was called). blk_rings->common.req_cons = X blk_rings->common.rsp_prod_pvt = X The RING_REQUEST_CONS_OVERFLOW(&blk_rings->common, blk_rings->common.req_cons) is doing: req_cons - rsp_prod_pvt >= 32 Which is, X - X >= 32 or 0 >= 32 And that is false, so we continue on looping (this bug). If we re-use said macro RING_REQUEST_CONS_OVERFLOW and pass in the rp instead (sring->req_prod) of rc, the this macro can do the check: req_prod - rsp_prov_pvt >= 32 Which is, X + 31415 - X >= 32 , or 31415 >= 32 which is true, so we can error out and break out of the function. Unfortunatly the difference between rsp_prov_pvt and req_prod can be at 32 (which would error out in the macro). This condition exists when the backend is lagging behind with the responses and still has not finished responding to all of them (so make_response has not been called), and the rsp_prov_pvt + 32 == req_cons. This ends up with us not being able to use said macro. Hence introducing a new macro called RING_REQUEST_PROD_OVERFLOW which does a simple check of: req_prod - rsp_prod_pvt > RING_SIZE And with the X values from above: X + 31415 - X > 32 Returns true. Also not that if the ring is full (which is where the RING_REQUEST_CONS_OVERFLOW triggered), we would not hit the same condition: X + 32 - X > 32 Which is false. Lets use that macro. Note that in v5 of this patchset the macro was different - we used an earlier version. Cc: stable@vger.kernel.org [v1: Move the check outside the loop] [v2: Add a pr_warn as suggested by David] [v3: Use RING_REQUEST_CONS_OVERFLOW as suggested by Jan] [v4: Move wake_up after kthread_stop as suggested by Jan] [v5: Use RING_REQUEST_PROD_OVERFLOW instead] [v6: Use RING_REQUEST_PROD_OVERFLOW - Jan's version] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> gadsa
| | * | xen/blkback: Check device permissions before allowing OP_DISCARDKonrad Rzeszutek Wilk2013-06-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We need to make sure that the device is not RO or that the request is not past the number of sectors we want to issue the DISCARD operation for. This fixes CVE-2013-2140. Cc: stable@vger.kernel.org Acked-by: Jan Beulich <JBeulich@suse.com> Acked-by: Ian Campbell <Ian.Campbell@citrix.com> [v1: Made it pr_warn instead of pr_debug] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| | * | xen/blkback: Use physical sector size for setupStefan Bader2013-06-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently xen-blkback passes the logical sector size over xenbus and xen-blkfront sets up the paravirt disk with that logical block size. But newer drives usually have the logical sector size set to 512 for compatibility reasons and would show the actual sector size only in physical sector size. This results in the device being partitioned and accessed in dom0 with the correct sector size, but the guest thinks 512 bytes is the correct block size. And that results in poor performance. To fix this, blkback gets modified to pass also physical-sector-size over xenbus and blkfront to use both values to set up the paravirt disk. I did not just change the passed in sector-size because I am not sure having a bigger logical sector size than the physical one is valid (and that would happen if a newer dom0 kernel hits an older domU kernel). Also this way a domU set up before should still be accessible (just some tools might detect the unaligned setup). [v2: Make xenbus write failure non-fatal] [v3: Use xenbus_scanf instead of xenbus_gather] [v4: Rebased against segment changes] Signed-off-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| | * | xen-blkfront: Introduce a 'max' module parameter to alter the amount of ↵Konrad Rzeszutek Wilk2013-06-04
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | indirect segments. The max module parameter (by default 32) is the maximum number of segments that the frontend will negotiate with the backend for indirect descriptors. Higher value means more potential throughput but more memory usage. The backend picks the minimum of the frontend and its default backend value. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| | * | xen-blkfront: use a different scatterlist for each requestRoger Pau Monne2013-05-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In blkif_queue_request blkfront iterates over the scatterlist in order to set the segments of the request, and in blkif_completion blkfront iterates over the raw request, which makes it hard to know the exact position of the source and destination memory positions. This can be solved by allocating a scatterlist for each request, that will be keep until the request is finished, allowing us to copy the data back to the original memory without having to iterate over the raw request. Oracle-Bug: 16660413 - LARGE ASYNCHRONOUS READS APPEAR BROKEN ON 2.6.39-400 CC: stable@vger.kernel.org Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reported-and-Tested-by: Anne Milicia <anne.milicia@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| | * | xen-blkback: allocate list of pending reqs in small chunksRoger Pau Monne2013-05-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Allocate pending requests in smaller chunks instead of allocating them all at the same time. This change also removes the global array of pending_reqs, it is no longer necessay. Variables related to the grant mapping have been grouped into a struct called "grant_page", this allows to allocate them in smaller chunks, and also improves memory locality. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reported-by: Sander Eikelenboom <linux@eikelenboom.it> Tested-by: Sander Eikelenboom <linux@eikelenboom.it> Reviewed-by: David Vrabel <david.vrabel@citrix.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| | * | xen-block: implement indirect descriptorsRoger Pau Monne2013-04-18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Indirect descriptors introduce a new block operation (BLKIF_OP_INDIRECT) that passes grant references instead of segments in the request. This grant references are filled with arrays of blkif_request_segment_aligned, this way we can send more segments in a request. The proposed implementation sets the maximum number of indirect grefs (frames filled with blkif_request_segment_aligned) to 256 in the backend and 32 in the frontend. The value in the frontend has been chosen experimentally, and the backend value has been set to a sane value that allows expanding the maximum number of indirect descriptors in the frontend if needed. The migration code has changed from the previous implementation, in which we simply remapped the segments on the shared ring. Now the maximum number of segments allowed in a request can change depending on the backend, so we have to requeue all the requests in the ring and in the queue and split the bios in them if they are bigger than the new maximum number of segments. [v2: Fixed minor comments by Konrad. [v1: Added padding to make the indirect request 64bit aligned. Added some BUGs, comments; fixed number of indirect pages in blkif_get_x86_{32/64}_req. Added description about the indirect operation in blkif.h] Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> [v3: Fixed spaces and tabs mix ups] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| | * | xen-blkback: expand map/unmap functionsRoger Pau Monne2013-04-18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Preparatory change for implementing indirect descriptors. Change xen_blkbk_{map/unmap} in order to be able to map/unmap a random amount of grants (previously it was limited to BLKIF_MAX_SEGMENTS_PER_REQUEST). Also, remove the usage of pending_req in the map/unmap functions, so we can map/unmap grants without needing to pass a pending_req. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: xen-devel@lists.xen.org Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| | * | xen-blkback: make the queue of free requests per backendRoger Pau Monne2013-04-18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Remove the last dependency from blkbk by moving the list of free requests to blkif. This change reduces the contention on the list of available requests. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: xen-devel@lists.xen.org Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| | * | xen-blkback: move pending handles list from blkbk to pending_reqRoger Pau Monne2013-04-18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Moving grant ref handles from blkbk to pending_req will allow us to get rid of the shared blkbk structure. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: xen-devel@lists.xen.org Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| | * | xen-blkback: implement LRU mechanism for persistent grantsRoger Pau Monne2013-04-18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This mechanism allows blkback to change the number of grants persistently mapped at run time. The algorithm uses a simple LRU mechanism that removes (if needed) the persistent grants that have not been used since the last LRU run, or if all grants have been used it removes the first grants in the list (that are not in use). The algorithm allows the user to change the maximum number of persistent grants, by changing max_persistent_grants in sysfs. Since we are storing the persistent grants used inside the request struct (to be able to mark them as "unused" when unmapping), we no longer need the bitmap (unmap_seg). Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: xen-devel@lists.xen.org Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| | * | xen-blkback: use balloon pages for all mappingsRoger Pau Monne2013-04-18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Using balloon pages for all granted pages allows us to simplify the logic in blkback, especially in the xen_blkbk_map function, since now we can decide if we want to map a grant persistently or not after we have actually mapped it. This could not be done before because persistent grants used ballooned pages, whereas non-persistent grants used pages from the kernel. This patch also introduces several changes, the first one is that the list of free pages is no longer global, now each blkback instance has it's own list of free pages that can be used to map grants. Also, a run time parameter (max_buffer_pages) has been added in order to tune the maximum number of free pages each blkback instance will keep in it's buffer. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Cc: xen-devel@lists.xen.org Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| | * | xen-blkback: print stats about persistent grantsRoger Pau Monne2013-04-18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Cc: xen-devel@lists.xen.org Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| * | | rsxx: Adding in debugfs entries.Philip J Kelleher2013-06-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Adding debugfs entries to help with debugging and testing and testing code. pci_regs: This entry will spit out all of the data stored on the BAR. stats: This entry will display all of the driver stats for each DMA channel. cram: This will allow read/write ability to the CRAM address space on our adapter's CPU. Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | rsxx: Fixes incorrect stats calculation.Philip J Kelleher2013-06-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Fixing incorrect stats calculation during read retries. Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | rsxx: Adding EEH check inside cregs timeout.Philip J Kelleher2013-06-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Unfortunaly, our CPU register path does not do any kind of EEH error checking. So to fix this issue, an ioread32 was added to the CPU register timeout code. This way, the driver can check to see if the timeout was caused by an EEH error or not. This is a dummy read. Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | rsxx: Adapter address space sanity check.Philip J Kelleher2013-06-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Adding a sanity check to guarentee that DMAs outside of the device's address space will be errored out right away. Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | rsxx: Fixes DLPAR add kernel panic if partition still mounted.Philip J Kelleher2013-06-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A kernel panic would occur on a DLPAR add if there was a partition still mounted during the DLPAR remove. This bug fix will allow the user to unmount the partition and bring the driver back into a good state after the DLPAR add. Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | rsxx: Changing the adapter name to the official name.Philip J Kelleher2013-06-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Changing the adapter name from FlashSystem-80 to the official name: Flash Adapter 900GB Full Height. Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | rsxx: Adding in sync_start module paramenter.Philip J Kelleher2013-06-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Before, the partition table would have to be reread because our card was attached before it transistioned out of it's 'starting' state. This change will cause the driver to wait to attach the device until the adapter is ready. Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | rsxx: Allow block size to be determined by configuration.Philip J Kelleher2013-06-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously, the block size was determined by whether or not our Hardware could handle 512 byte accesses. Now, all of our Hardware can handle 512 and 4096 block sizes. This fix allows it to be user configurable. Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | rsxx: Fixes soft-lockup issues during DMAs.Philip J Kelleher2013-06-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The workqueue mechanism has been reworked to prevent soft lockup issues from occuring by adding in mutex sychronization. Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | rsxx: Restructured DMA cancel scheme.Philip J Kelleher2013-06-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Before, DMAs would never be cancelled if there was a data stall or an EEH Permenant failure which would cause an unrecoverable I/O hang. The DMA cancellation mechanism has been modified to fix these issues and allows DMAs to be cancelled during the above mentioned events. Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | rsxx: Individual workqueues for interruptible events.Philip J Kelleher2013-06-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Giving all interrupt based events their own workqueue to complete tasks on. This fixes a bug that would cause creg commands to timeout if too many are issued at once. Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | | | Merge branch 'next' of git://git.monstr.eu/linux-2.6-microblazeLinus Torvalds2013-07-10
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull microblaze update from Michal Simek: "This Microblaze merge window is quite minimal. I have also added to my branch one xilinx systemace sparse fix because haven't got any reply from block maintainer." * 'next' of git://git.monstr.eu/linux-2.6-microblaze: xilinx systemace: Fix sparse warnings microblaze: Move __NR_syscalls from uapi microblaze: Enable KGDB in defconfig microblaze: Don't mark arch_kgdb_ops as const.
| * | | | xilinx systemace: Fix sparse warningsMichal Simek2013-07-10
| | |_|/ | |/| | | | | | | | | | | | | | | | | | Fix sysace sparse warnings. Signed-off-by: Michal Simek <michal.simek@xilinx.com>
* | | | Merge branch 'for-linus' of ↵Linus Torvalds2013-07-09
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph updates from Sage Weil: "There is some follow-on RBD cleanup after the last window's code drop, a series from Yan fixing multi-mds behavior in cephfs, and then a sprinkling of bug fixes all around. Some warnings, sleeping while atomic, a null dereference, and cleanups" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (36 commits) libceph: fix invalid unsigned->signed conversion for timespec encoding libceph: call r_unsafe_callback when unsafe reply is received ceph: fix race between cap issue and revoke ceph: fix cap revoke race ceph: fix pending vmtruncate race ceph: avoid accessing invalid memory libceph: Fix NULL pointer dereference in auth client code ceph: Reconstruct the func ceph_reserve_caps. ceph: Free mdsc if alloc mdsc->mdsmap failed. ceph: remove sb_start/end_write in ceph_aio_write. ceph: avoid meaningless calling ceph_caps_revoking if sync_mode == WB_SYNC_ALL. ceph: fix sleeping function called from invalid context. ceph: move inode to proper flushing list when auth MDS changes rbd: fix a couple warnings ceph: clear migrate seq when MDS restarts ceph: check migrate seq before changing auth cap ceph: fix race between page writeback and truncate ceph: reset iov_len when discarding cap release messages ceph: fix cap release race libceph: fix truncate size calculation ...
| * | | | rbd: fix a couple warningsSage Weil2013-07-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | gcc isn't quite smart enough and generates these warnings: drivers/block/rbd.c: In function 'rbd_img_request_fill': drivers/block/rbd.c:1266:22: warning: 'bio_list' may be used uninitialized in this function [-Wmaybe-uninitialized] drivers/block/rbd.c:2186:14: note: 'bio_list' was declared here drivers/block/rbd.c:2247:10: warning: 'pages' may be used uninitialized in this function [-Wmaybe-uninitialized] even though they are initialized for their respective code paths. Signed-off-by: Sage Weil <sage@inktank.com>
| * | | | rbd: take a little creditAlex Elder2013-07-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a name to the list of authors. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
| * | | | rbd: use rwsem to protect header updatesAlex Elder2013-07-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Updating an image header needs to be protected to ensure it's done consistently. However distinct headers can be updated concurrently without a problem. Instead of using the global control lock to serialize headder updates, just rely on the header semaphore. (It's already used, this just moves it out to cover a broader section of the code.) That leaves the control mutex protecting only the creation of rbd clients, so rename it. This resolves: http://tracker.ceph.com/issues/5222 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
| * | | | rbd: don't hold ctl_mutex to get/put deviceAlex Elder2013-07-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When an rbd device is first getting mapped, its device registration is protected the control mutex. There is no need to do that though, because the device has already been assigned an id that's guaranteed to be unique. An unmap of an rbd device won't proceed if the device has a non-zero open count or is already being unmapped. So there's no need to hold the control mutex in that case either. Finally, an rbd device can't be opened if it is being removed, and it won't go away if there is a non-zero open count. So here too there's no need to hold the control mutex while getting or putting a reference to an rbd device's Linux device structure. Drop the mutex calls in these cases. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
| * | | | rbd: protect against concurrent unmapsAlex Elder2013-07-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make sure two concurrent unmap operations on the same rbd device won't collide, by only proceeding with the removal and cleanup of a device if is not already underway. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
| * | | | rbd: set removing flag while holding list lockAlex Elder2013-07-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When unmapping a device, its id is supplied, and that is used to look up which rbd device should be unmapped. Looking up the device involves searching the rbd device list while holding a spinlock that protects access to that list. Currently all of this is done under protection of the control lock, but that protection is going away soon. To ensure the rbd_dev is still valid (still on the list) while setting its REMOVING flag, do so while still holding the list lock. To do so, get rid of __rbd_get_dev(), and open code what it did in the one place it was used. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
| * | | | rbd: protect against duplicate client creationAlex Elder2013-07-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If more than one rbd image has the same ceph cluster configuration (same options, same set of monitors, same keys) they normally share a single rbd client. When an image is getting mapped, rbd looks to see if an existing client can be used, and creates a new one if not. The lookup and creation are not done under a common lock though, so mapping two images concurrently could lead to duplicate clients getting set up needlessly. This isn't a major problem, but it's wasteful and different from what's intended. This patch fixes that by using the control mutex to protect both the lookup and (if needed) creation of the client. It was previously used just when creating. This resolves: http://tracker.ceph.com/issues/3094 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
| * | | | rbd: clean up a few things in the refresh pathAlex Elder2013-07-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This includes a few relatively small fixes I found while examining the code that refreshes image information. This resolves: http://tracker.ceph.com/issues/5040 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
| * | | | rbd: flush dcache after zeroing page dataAlex Elder2013-07-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Neither zero_bio_chain() nor zero_pages() contains a call to flush caches after zeroing a portion of a page. This can cause problems on architectures that have caches that allow virtual address aliasing. This resolves: http://tracker.ceph.com/issues/4777 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
| * | | | rbd: drop original request earlier for existence checkAlex Elder2013-07-01
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The reference to the original request dropped at the end of rbd_img_obj_exists_callback() corresponds to the reference taken in rbd_img_obj_exists_submit() to account for the stat request referring to it. Move the put of that reference up right after clearing that pointer to make its purpose more obvious. Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
| * | | | rbd: Use min_t() to fix comparison of distinct pointer types warningGeert Uytterhoeven2013-07-01
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | drivers/block/rbd.c: In function ‘zero_pages’: drivers/block/rbd.c:1102: warning: comparison of distinct pointer types lacks a cast Remove the hackish casts and use min_t() to fix this. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Reviewed-by: Alex Elder <elder@inktank.com>