dmu_tx_hold_zap() does dnode_hold() 7x on same object #4641
Labels: Type: Performance (Performance improvement or performance problem)

Comments
OpenZFS-illumos pull request: openzfs/openzfs#109
nedbass pushed a commit to nedbass/zfs that referenced this issue on May 28, 2016:
Using a benchmark which has 32 threads creating 2 million files in the same directory, on a machine with 16 CPU cores, I observed poor performance. I noticed that `dmu_tx_hold_zap()` was using about 30% of all CPU, and doing `dnode_hold()` 7 times on the same object (the ZAP object that is being held).

`dmu_tx_hold_zap()` keeps a hold on the `dnode_t` the entire time it is running, in `dmu_tx_hold_t:txh_dnode`, so it would be nice to use the `dnode_t` that we already have in hand, rather than repeatedly calling `dnode_hold()`. To do this, we need to pass the `dnode_t` down through all the intermediate calls that `dmu_tx_hold_zap()` makes, making these routines take a `dnode_t*` rather than an `objset_t*` and a `uint64_t` object number. In particular, the following routines will need analogous `*_by_dnode()` variants created:

- `dmu_buf_hold_noread()`
- `dmu_buf_hold()`
- `zap_lookup()`
- `zap_lookup_norm()`
- `zap_count_write()`
- `zap_lockdir()`

This can improve performance on the benchmark described above by 100%, from 30,000 file creations per second to 60,000. (This improvement is on top of that provided by working around the object allocation issue. Peak performance of ~90,000 creations per second was observed with 8 CPUs; adding CPUs past that decreased performance due to lock contention.) The CPU used by `dmu_tx_hold_zap()` was reduced by 88%, from 340 CPU-seconds to 40 CPU-seconds.

Sponsored by: Intel Corp.
Upstream bugs: DLPX-44797
Ported by: Ned Bass <[email protected]>
OpenZFS-issue: https://www.illumos.org/issues/7004
ZFSonLinux-issue: openzfs#4641
OpenZFS-commit: unmerged

Porting notes:
- Changed ASSERT0(err) to VERIFY0(err) in zap_lockdir() to avoid an unused variable error. Code may be refactored in a future upstream revision to clean this up.
- Changed EXPORT_SYMBOL(zap_count_write) to EXPORT_SYMBOL(zap_count_write_by_dnode) in zap_micro.c
nedbass pushed a commit to nedbass/zfs that referenced this issue on May 28, 2016 (same commit message as above).
nedbass pushed a commit to nedbass/zfs that referenced this issue on Jun 24, 2016 (same commit message as above).
This was referenced Jun 26, 2016
behlendorf pushed a commit to LLNL/zfs that referenced this issue on Jun 27, 2016 (same commit message as above).
tuomari pushed a commit to tuomari/zfs that referenced this issue on Jun 30, 2016 (same commit message as above).
behlendorf added the Type: Performance label on Jul 15, 2016.
tuomari pushed a commit to tuomari/zfs that referenced this issue on Jul 17, 2016 (same commit message as above).
ahrens added a commit to ahrens/zfs that referenced this issue on Aug 15, 2016, with the same commit message as above but ending "Closes openzfs#4641" in place of the porting notes.
Using a benchmark which has 32 threads creating 2 million files in the same directory, on a machine with 16 CPU cores, and with a workaround for #4636, I observed poor performance (~30,000 file creations per second). I noticed that `dmu_tx_hold_zap()` was using about 30% of all CPU, and doing `dnode_hold()` 7 times on the same object (the ZAP object that is being held).

`dmu_tx_hold_zap()` keeps a hold on the `dnode_t` the entire time it is running, in `dmu_tx_hold_t:txh_dnode`, so it would be nice to use the `dnode_t` that we already have in hand, rather than repeatedly calling `dnode_hold()`. To do this, we need to pass the `dnode_t` down through all the intermediate calls that `dmu_tx_hold_zap()` makes, making these routines take a `dnode_t*` rather than an `objset_t*` and a `uint64_t` object number. In particular, the following routines will need analogous `*_by_dnode()` variants created:

- `dmu_buf_hold_noread()`
- `dmu_buf_hold()`
- `zap_lookup()`
- `zap_lookup_norm()`
- `zap_count_write()`
- `zap_lockdir()`

A prototype implementation has shown that this can improve performance on the benchmark described above by 100%, from 30,000 file creations per second to 60,000. (This improvement is on top of that provided by working around #4636. Peak performance of ~90,000 creations per second was observed with 8 CPUs; adding CPUs past that decreased performance due to lock contention.) The CPU used by `dmu_tx_hold_zap()` was reduced by 88%, from 340 CPU-seconds to 40 CPU-seconds.

Once #4636 is fixed, the reward for fixing this issue is high, and the cost is low. Although the code changes are spread among many functions, they are quite straightforward to make and to understand. There is no risk of hurting performance in other use cases. We may find that this "… by dnode" technique can be applied to other use cases as well.