forked from illumos/illumos-gate
OS-8642 #508
Open
tsoome wants to merge 20 commits into master from OS-8642
Conversation
Remove unneeded global, practically constant, state pointer variables (arc_anon, arc_mru, etc.), replacing them with macros that expand to the addresses of the real state variables (&ARC_anon, &ARC_mru, etc.). Change ARC_EVICT_ALL from -1ULL to UINT64_MAX, so it no longer requires special handling in the inner loop of ARC reclamation. Correspondingly, change the bytes argument of arc_evict_state() from int64_t to uint64_t.
Reviewed-by: Matthew Ahrens <[email protected]>
Reviewed-by: Mark Maybee <[email protected]>
Reviewed-by: Ryan Moeller <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Closes #12348
Change-Id: I110acf32a6d1fbbfc7cc2c71c7159779d5193829
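(A minimal sketch of the pointer-to-macro pattern this commit describes; `arc_state_t` is simplified to a stand-in, the real type lives in the ARC headers.)

````
/* Before: global pointers, set once at init and never changed:
 *	arc_state_t *arc_anon;  arc_state_t *arc_mru;  ... */

/* After: the state objects themselves are the globals, and the old
 * names become macros expanding to their addresses. */
typedef struct arc_state {
	unsigned long arcs_size;	/* simplified stand-in field */
} arc_state_t;

static arc_state_t ARC_anon;
static arc_state_t ARC_mru;

#define	arc_anon	(&ARC_anon)
#define	arc_mru		(&ARC_mru)
````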
Change-Id: I1c82af10fadfe24ccb5f9fb18d121aa8bdbb7c40
Change-Id: I8d8f1cf2e218cb60e8d86acfefc6ad4bc2c2c1a3
…[email protected]> Change-Id: I972befb767b418fa47ba6fe7f6d74ad553a0905f
Remove all instances of list handling where the API is not used and list data members are instead accessed directly. Doing this sort of thing is bad for portability. Additionally, ensure that list_link_init() is called on newly created list nodes. This ensures the node is properly initialized and does not rely on the assumption that zeroing the list_node_t via kmem_zalloc() is the same as proper initialization.
Signed-off-by: Brian Behlendorf <[email protected]>
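(As an illustration — a sketch with simplified stand-ins for the list(9F) types, not the actual `<sys/list.h>` definitions:)

````
#include <stddef.h>

typedef struct list_node {
	struct list_node *list_next;
	struct list_node *list_prev;
} list_node_t;

/* Proper initialization: explicitly mark the node as not linked on
 * any list, rather than trusting kmem_zalloc()'s zero fill to happen
 * to match the implementation's "unlinked" sentinel. */
static void
list_link_init(list_node_t *ln)
{
	ln->list_next = NULL;
	ln->list_prev = NULL;
}

/* Bad (the pattern being removed): peeking at data members directly,
 * e.g.  if (node->link.list_next != NULL) ...
 * Good: go through the list API and initialize links explicitly. */
````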
The function bpobj_iterate_impl overflows the stack when bpobjs are deeply nested. Rewrite the function to eliminate the recursion.
Reviewed-by: Serapheim Dimitropoulos <[email protected]>
Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Paul Zuchowski <[email protected]>
Closes #7674
Closes #7675
Closes #7908
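(The general shape of the fix is to replace recursion with an explicit, heap-allocated work stack; a hedged sketch follows — the types and names are illustrative, not the actual bpobj code, and allocation-failure handling is omitted.)

````
#include <stdlib.h>

struct node {
	struct node **children;
	int nchildren;
};

struct frame {
	struct node *n;
	struct frame *next;
};

/*
 * Visit every node without recursing, so stack depth stays constant
 * no matter how deeply the objects are nested.
 */
static void
iterate(struct node *root, void (*visit)(struct node *))
{
	struct frame *top = malloc(sizeof (*top));

	top->n = root;
	top->next = NULL;
	while (top != NULL) {
		struct frame *f = top;
		int i;

		top = f->next;
		visit(f->n);
		for (i = 0; i < f->n->nchildren; i++) {
			struct frame *c = malloc(sizeof (*c));

			c->n = f->n->children[i];
			c->next = top;
			top = c;
		}
		free(f);
	}
}
````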
This patch introduces an assertion that can catch pitfalls in development where there is a mismatch between the size of reads and writes between a *_phys structure and its respective in-core structure when bonus buffers are used. This debugging aid should be complementary to the verification done by ztest in ztest_verify_dnode_bt(). A side effect of this patch is that we now clear out any extra bytes past a bonus buffer's new size when the buffer is shrinking.
Reviewed-by: Matt Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tom Caputi <[email protected]>
Signed-off-by: Serapheim Dimitropoulos <[email protected]>
Closes #8348
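(A hedged sketch of both halves of the change, with illustrative names — the real assertion lives in the dbuf/dnode code:)

````
#include <assert.h>
#include <string.h>

/*
 * (1) Assert the new size fits the buffer, catching read/write size
 *     mismatches between the on-disk *_phys layout and the in-core
 *     structure when bonus buffers are used.
 * (2) When a bonus buffer shrinks, zero the tail so stale bytes from
 *     the old, larger contents cannot leak.
 */
static void
bonus_resize(char *bonus, size_t oldsize, size_t newsize, size_t maxsize)
{
	assert(newsize <= maxsize);
	if (newsize < oldsize)
		memset(bonus + newsize, 0, oldsize - newsize);
}
````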
Even though the bug's writeup (Github issue #9136) is very detailed,
we still don't know exactly how we got to that state, thus I wasn't
able to reproduce the bug. That said, we can make an educated guess
combining the information in the filed issue with the code.
From the fact that `dp_dirty_total` was 0 (which is less than
`zfs_dirty_data_max`) we know that there was one thread that set it to
0 and then signaled one of the waiters of `dp_spaceavail_cv` [see
`dsl_pool_dirty_delta()` which is also the only place that
`dp_dirty_total` is changed]. Thus, the only logical explanation
for the bug being hit is that the waiter that was just awakened
didn't go through `dsl_pool_dirty_data()`. Given that this function
is only called by `dsl_pool_dirty_space()` or `dsl_pool_undirty_space()`
I can only think of two possible ways of the above scenario happening:
[1] The waiter didn't call into any of the two functions - which I
find highly unlikely (i.e. why wait on `dp_spaceavail_cv` to begin
with?).
[2] The waiter did call one of the above functions but passed 0 as
the space/delta to be dirtied (or undirtied) and then the callee
returned immediately (e.g. both `dsl_pool_dirty_space()` and
`dsl_pool_undirty_space()` return immediately when space is 0).
In any case and no matter how we got there, the easy fix would be to
just broadcast to all waiters whenever `dp_dirty_total` hits 0. That
said and given that we've never hit this before, it would make sense
to think more about why the above situation occurred.
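(For illustration, a userland sketch of the signal-vs-broadcast choice discussed above; pthreads stand in for the kernel condvar primitives and the logic is simplified relative to the real `dsl_pool_dirty_delta()`.)

````
#include <pthread.h>

static pthread_mutex_t dp_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t dp_spaceavail_cv = PTHREAD_COND_INITIALIZER;
static long dp_dirty_total;

static void
dirty_delta(long delta)
{
	pthread_mutex_lock(&dp_lock);
	dp_dirty_total += delta;
	if (dp_dirty_total == 0) {
		/* The "easy fix": wake every waiter, not just one. */
		pthread_cond_broadcast(&dp_spaceavail_cv);
	} else if (delta < 0) {
		/* Current behavior: wake a single waiter. */
		pthread_cond_signal(&dp_spaceavail_cv);
	}
	pthread_mutex_unlock(&dp_lock);
}
````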
Attempting to mimic what Prakash was doing in the issue filed, I
created a dataset with `sync=always` and started doing contiguous
writes in a file within that dataset. I observed with DTrace that even
though we update the pool's dirty data accounting when we would dirty
stuff, the accounting wouldn't be decremented incrementally as we were
done with the ZIOs of those writes (the reason being that
`dbuf_write_physdone()` isn't called as we go through the override
code paths, and thus `dsl_pool_undirty_space()` is never called). As a
result we'd have to wait until we get to `dsl_pool_sync()` where we
zero out all dirty data accounting for the pool and the current TXG's
metadata.
In addition, as Matt noted and I later verified, the same issue would
arise when using dedup.
In both cases (sync & dedup) we shouldn't have to wait until
`dsl_pool_sync()` zeros out the accounting data. According to the
comment in that part of the code, the reasons why we do the zeroing
have nothing to do with what we observe:
````
/*
* We have written all of the accounted dirty data, so our
* dp_space_towrite should now be zero. However, some seldom-used
* code paths do not adhere to this (e.g. dbuf_undirty(), also
* rounding error in dbuf_write_physdone).
* Shore up the accounting of any dirtied space now.
*/
dsl_pool_undirty_space(dp, dp->dp_dirty_pertxg[txg & TXG_MASK], txg);
````
Ideally what we want to do is to undirty in the accounting exactly what
we dirty (I use the word ideally as we can still have rounding errors).
This would make the behavior of the system more clear and predictable.
Another interesting issue that I observed with DTrace was that we
wouldn't update any of the pool's dirty data accounting whenever we
would dirty and/or undirty MOS data. In addition, every time we would
change the size of a dbuf through `dbuf_new_size()` we wouldn't update
the accounted space dirtied in the appropriate dirty record, so when
ZIOs are done we would undirty less than we dirtied from the pool's
accounting point of view.
For the first two issues observed (sync & dedup) this patch ensures
that we still update the pool's accounting when we undirty data,
regardless of the write being physical or not.
For changes in the MOS, we first ensure to zero out the pool's dirty
data accounting in `dsl_pool_sync()` after we synced the MOS. Then we
can go ahead and enable the update of the pool's dirty data accounting
whenever we change MOS data.
Another fix is that we now update the accounting explicitly for
counting errors in `dbuf_write_done()`.
Finally, `dbuf_new_size()` updates the accounted space of the
appropriate dirty record correctly now.
The problem is that we still don't know how the bug came up in the
issue filed. That said, the issues fixed seem to be very relevant, so
instead of going with the broadcasting solution right away,
I decided to leave this patch as is.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Prakash Surya <[email protected]>
Signed-off-by: Serapheim Dimitropoulos <[email protected]>
External-issue: DLPX-47285
Closes #9137
Additionally pull in state machine comments about upcoming async cow work.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Matt Ahrens <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #9902
Factor the portion of dbuf_sync_leaf() responsible for handling bonus buffers out into its own dbuf_sync_bonus() helper function.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Matt Ahrens <[email protected]>
Signed-off-by: Matt Macy <[email protected]>
Closes #9909
For each WRITE record in the stream, `zfs receive` creates a DMU transaction (`dmu_tx_create()`) and writes this block's data into the object. If per-block overheads (as opposed to per-byte overheads) dominate performance (as is often the case with small recordsize), the per-dmu-transaction overheads can be significant. For example, in some workloads the `receive_writer` thread is 100% on CPU, and more than half of its CPU time is in these per-tx routines (e.g. dmu_tx_hold_write, dmu_tx_assign, dmu_tx_commit). To improve performance of `zfs receive`, this commit batches WRITE records which are to nearby offsets of the same object, and uses one DMU transaction to write them all. By default the batch size is 1MB, which for recordsize=8K reduces the number of DMU transactions by 128x for full send streams (incrementals will depend on how "clumpy" the changed blocks are). This commit improves the performance of `dd if=stream | zfs recv` from 78,800 blocks/sec to 98,100 blocks/sec (25% improvement).
Reviewed-by: Paul Dagnelie <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
Closes #10099
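(A rough sketch of the batching idea, with illustrative types and a stub in place of the real transaction code; the commit text names the actual routines — dmu_tx_create(), dmu_tx_hold_write(), dmu_tx_assign(), dmu_tx_commit(). Records are assumed to arrive in offset order, as in a send stream.)

````
#include <stdint.h>
#include <stddef.h>

#define	BATCH_BYTES	(1024 * 1024)	/* 1MB default per the commit */

struct write_rec {
	uint64_t object;
	uint64_t offset;
	uint64_t length;
	struct write_rec *next;
};

/* Stub: issue one DMU transaction covering records [first, end). */
static void
flush_batch(struct write_rec *first, struct write_rec *end)
{
	(void) first;
	(void) end;
	/* one tx hold spanning the batch, then write each record */
}

/* Group WRITE records for nearby offsets of the same object. */
static void
receive_writes(struct write_rec *recs)
{
	struct write_rec *first = recs, *w;

	for (w = recs; w != NULL; w = w->next) {
		if (w->object != first->object ||
		    w->offset + w->length - first->offset > BATCH_BYTES) {
			flush_batch(first, w);
			first = w;
		}
	}
	if (first != NULL)
		flush_batch(first, NULL);
}
````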
DMU sync code calls taskq_dispatch() for each sublist of os_dirty_dnodes and os_synced_dnodes. Since the number of sublists by default is equal to the number of CPUs, it will dispatch an equal, potentially large, number of tasks, waking up many CPUs to handle them, even if only one or a few of the sublists actually have any work to do. This change adds a check for empty sublists to avoid this.
Reviewed by: Sean Eric Fagan <[email protected]>
Reviewed by: Matt Ahrens <[email protected]>
Reviewed by: Brian Behlendorf <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Closes #8909
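(In outline the change is a one-line guard per sublist; a sketch with stand-in types — `sublist_is_empty()` and the `dispatch` callback are illustrative, not the multilist/taskq API:)

````
#include <stdbool.h>

struct sublist { int nitems; };

static bool
sublist_is_empty(const struct sublist *sl)
{
	return (sl->nitems == 0);
}

/* dispatch() stands in for taskq_dispatch(); the fix is the guard. */
static void
sync_dispatch(struct sublist *sls, int n, void (*dispatch)(int))
{
	for (int i = 0; i < n; i++) {
		if (sublist_is_empty(&sls[i]))
			continue;	/* don't wake a CPU for nothing */
		dispatch(i);
	}
}
````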
The process of evicting data from the ARC is referred to as `arc_adjust`. This commit changes the term to `arc_evict`, which is more specific.
Reviewed-by: George Wilson <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
Closes #10592
Change-Id: Ia9aa16c82c9b78a2864c53ef9f4e7f7c21216184
The performance of `zfs receive` can be bottlenecked on the CPU consumed by the `receive_writer` thread, especially when receiving streams with small compressed block sizes. Much of the CPU is spent creating and destroying dbuf's and arc buf's, one for each `WRITE` record in the send stream.
This commit introduces the concept of "lightweight writes", which allows `zfs receive` to write to the DMU by providing an ABD, and instantiating only a new type of `dbuf_dirty_record_t`. The dbuf and arc buf for this "dirty leaf block" are not instantiated. Because there is no dbuf with the dirty data, this mechanism doesn't support reading from "lightweight-dirty" blocks (they would see the on-disk state rather than the dirty data). Since the dedup-receive code has been removed, `zfs receive` is write-only, so this works fine. Because there are no arc bufs for the received data, the received data is no longer cached in the ARC.
Testing a receive of a stream with an average compressed block size of 4KB, this commit improves performance by 50%, while also reducing CPU usage by 50% of a CPU. On a per-block basis, CPU consumed by receive_writer() and dbuf_evict() is now 1/7th (14%) of what it was.
Baseline: 450MB/s, CPU in receive_writer() 40% + dbuf_evict() 35%
New: 670MB/s, CPU in receive_writer() 17% + dbuf_evict() 0%
The code is also restructured in a few ways:
Added a `dr_dnode` field to the dbuf_dirty_record_t. This simplifies some existing code that no longer needs `DB_DNODE_ENTER()` and related routines. The new field is needed by the lightweight-type dirty record. To ensure that the `dr_dnode` field remains valid until the dirty record is freed, we have to ensure that `dnode_move()` doesn't relocate the dnode_t. To do this we keep a hold on the dnode until its zios have completed. This is already done by the user-accounting code (`userquota_updates_task()`); this commit extends that so that it always keeps the dnode hold until zio completion (see `dnode_rele_task()`).
`dn_dirty_txg` was previously zeroed when the dnode was synced. This was not necessary, since its meaning can be "when was this dnode last dirtied". This change simplifies the new `dnode_rele_task()` code.
Removed some dead code related to `DRR_WRITE_BYREF` (dedup receive).
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Paul Dagnelie <[email protected]>
Reviewed-by: George Wilson <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
Closes #11105
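(Conceptually — a sketch with illustrative field names, not the actual `dbuf_dirty_record_t` layout:)

````
#include <stdint.h>

/*
 * A "lightweight" dirty record carries the received payload directly;
 * no dbuf or arc buf is instantiated behind it.  Because nothing
 * holds the dirty data for readers, a read of this block would see
 * on-disk state -- acceptable since zfs receive is write-only.
 */
struct lightweight_dirty_record {
	uint64_t	lr_object;	/* target object */
	uint64_t	lr_offset;	/* target offset */
	void		*lr_abd;	/* payload (an ABD in real code) */
};
````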
This commit partially reverts changes to multilists in PR 7968 (multi-threaded spa-sync()) and adds some cache line alignments to separate read-only multilists and heavily modified refcounts to different cache lines.
Reviewed-by: Matthew Ahrens <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Sponsored-by: iXsystems, Inc.
Closes #12158
Change-Id: I4556072d1e2af101d356e3e19ea5d135ef9a08f9
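(The alignment trick itself is plain C; a sketch of the idea with illustrative field names, assuming a 64-byte cache line and a GCC-style attribute:)

````
#include <stdint.h>

#define	CACHE_LINE	64	/* typical x86 line size; an assumption */

struct multilist_shared {
	/* Read-mostly fields share one cache line. */
	uint64_t	ml_num_sublists;
	uint64_t	ml_offset;

	/* The heavily modified counter gets its own cache line, so
	 * writers do not keep invalidating the readers' copy above. */
	uint64_t	ml_refcount __attribute__((aligned(CACHE_LINE)));
};
````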
It is very expensive and not informative to call multilist_is_empty() for each arc_change_state() on debug builds just to check for impossible conditions. Instead, implement a special index function for the arc_l2c_only->arcs_list multilists, panicking on any attempt to use it.
Reviewed-by: Mark Maybee <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Sponsored-By: iXsystems, Inc.
Closes #12421
Change-Id: I69fbf0e3ca385e17a898e2e3ef24b9a6c67166ab
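(A userland sketch of the idea, with abort() standing in for panic() and an illustrative function name:)

````
#include <stdio.h>
#include <stdlib.h>

/*
 * Index callback installed on the arc_l2c_only lists: l2c-only
 * headers are never inserted into these multilists, so any call here
 * is a bug and fails loudly -- much cheaper than scanning the whole
 * multilist for emptiness on every arc_change_state().
 */
static unsigned int
l2c_only_multilist_index_func(void *ml, void *obj)
{
	(void) ml;
	(void) obj;
	fprintf(stderr, "arc_l2c_only list must never be used\n");
	abort();
}
````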
When the eviction thread goes to shrink an ARC state, it allocates a set of marker buffers used to hold its place in the state's sublists. This can be problematic in low memory conditions, since 1) the allocation can be substantial, as we allocate NCPU markers; and 2) on at least FreeBSD, page reclamation can block in arc_wait_for_eviction(). In particular, in stress tests it's possible to hit a deadlock on FreeBSD when the number of free pages is very low, wherein the system is waiting for the page daemon to reclaim memory, the page daemon is waiting for the ARC eviction thread to finish, and the ARC eviction thread is blocked waiting for more memory. Try to reduce the likelihood of such deadlocks by pre-allocating markers for the eviction thread at ARC initialization time. When evicting buffers from an ARC state, check to see if the current thread is the ARC eviction thread, and use the pre-allocated markers for that purpose rather than dynamically allocating them.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: George Amanakis <[email protected]>
Signed-off-by: Mark Johnston <[email protected]>
Closes #12985
Change-Id: I3a91a7bb150881806bfccd8576c85ec70a2f0c28
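(The shape of the fix, sketched with illustrative names and a compile-time NCPU; the real code sizes and identifies things via kernel facilities:)

````
#include <stdlib.h>

#define	NCPU	64	/* illustrative; sized at runtime in reality */

struct marker { struct marker *next; };	/* stand-in for marker bufs */

static struct marker *arc_evict_markers[NCPU];
static void *arc_evict_thread_id;

/* At ARC init: allocate markers once, before memory gets tight. */
static void
arc_evict_markers_init(void)
{
	for (int i = 0; i < NCPU; i++)
		arc_evict_markers[i] = calloc(1, sizeof (struct marker));
}

/* In the evict path: the eviction thread reuses the pre-allocated
 * set instead of allocating under low-memory conditions. */
static struct marker *
arc_marker_get(int cpu, void *curthread)
{
	if (curthread == arc_evict_thread_id)
		return (arc_evict_markers[cpu]);
	return (calloc(1, sizeof (struct marker)));
}
````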
Change-Id: I081d4a9a19a634b214777ca315ed53f6d7bc296e
On systems with enormous amounts of memory, the single arc_evict thread can become a bottleneck if reads and writes are stuck behind it, waiting for old data to be evicted before new data can take its place. This commit adds support for evicting from multiple ARC lists in parallel, by farming the evict work out to some number of threads and then accumulating their results. A new tuneable, zfs_arc_evict_threads, sets the number of threads. By default, it will scale based on the number of CPUs.
Sponsored-by: Expensify, Inc.
Sponsored-by: Klara, Inc.
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Youzhong Yang <[email protected]>
Signed-off-by: Allan Jude <[email protected]>
Signed-off-by: Mateusz Piotrowski <[email protected]>
Signed-off-by: Alexander Stetsenko <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Co-authored-by: Rob Norris <[email protected]>
Co-authored-by: Mateusz Piotrowski <[email protected]>
Co-authored-by: Alexander Stetsenko <[email protected]>
Closes #16486
Change-Id: I68951f2f9577f8bd3c898a457038e8f4069ed0d2
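(A userland sketch of the farm-out-and-accumulate structure, with pthreads standing in for the kernel taskq; zfs_arc_evict_threads is the tunable named in the commit, everything else is illustrative.)

````
#include <pthread.h>
#include <stdint.h>

static int zfs_arc_evict_threads = 4;	/* default scales with NCPUs */

struct evict_arg {
	uint64_t ea_target;	/* bytes this worker should evict */
	uint64_t ea_evicted;	/* bytes it actually evicted */
};

static void *
evict_worker(void *vp)
{
	struct evict_arg *ea = vp;

	/* ...evict up to ea_target bytes from this worker's lists... */
	ea->ea_evicted = ea->ea_target;	/* placeholder result */
	return (NULL);
}

/* Farm the work out, then accumulate the per-thread results. */
static uint64_t
evict_parallel(uint64_t bytes)
{
	int n = zfs_arc_evict_threads;
	pthread_t tids[n];
	struct evict_arg args[n];
	uint64_t total = 0;

	for (int i = 0; i < n; i++) {
		args[i].ea_target = bytes / n;
		args[i].ea_evicted = 0;
		pthread_create(&tids[i], NULL, evict_worker, &args[i]);
	}
	for (int i = 0; i < n; i++) {
		pthread_join(tids[i], NULL);
		total += args[i].ea_evicted;
	}
	return (total);
}
````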
Rather than panic debug builds when we fail to parse a whole ZIL, let's instead improve the logging of errors and continue like in a release build.
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Ryan Moeller <[email protected]>
Closes #14116
There are diffs between the zfstest run on SmartOS and the zfstest run on omnios; note that omnios had disks with TRIM, and both builds used gcc 14. I stripped away the duration (s/[..:..] //) and made sure the test-runner users are the same. I still need to perform repeated runs to catch the flipping tests, and I still need to analyze the failed/killed tests.
This is the chain of commits leading up to the parallel ARC eviction update.
I'm going to build the same work on omnios and get zfstests running (at the moment I only have the test setup on omnios and openindiana).
The list of commits was constructed from the changes discovered while working through the patches from most recent to oldest. Some of them are rather critical (fixing the dirty-list accounting, for example) and some are there to make it possible to track the changes from OpenZFS (the terminology change, for example). Please also note that between these patches there are gaps in the timeline; those gaps contain branches of patches implementing other fixes and features. If we want to pick those, it is rather important to keep the commits as they are, so that it remains possible to understand what is really going on with the flow of changes.