-
Notifications
You must be signed in to change notification settings - Fork 2.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix metric reporting error #18689
Open
suixinpr
wants to merge
1,157
commits into
Alluxio:main
Choose a base branch
from
suixinpr:metrics-bug
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Fix metric reporting error #18689
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
### What changes are proposed in this pull request? Remove hostname from metrics key ### Why are the changes needed? For easy aggregation on prometheus and grafana side ### Does this PR introduce any user facing changes? Add a flag to disable this for compatibility pr-link: Alluxio#18121 change-id: cid-ba6c2f9fae625747192044168fce7dc026c66b9c
besides the User-CLI.md doc, update other doc files that refer to `bin/alluxio` commands - remove docs on path config - remove starting/stopping job master/worker from contributor docs pr-link: Alluxio#18128 change-id: cid-fc71dd493b16ef3aeeb0b1b190941c43b9af9cab
What changes are proposed in this pull request? I have created a test and create a liststatus test for its function. Why are the changes needed? Please clarify why the changes are needed. For instance, add a unit test for DoraWorkerClientServiceHandler. Does this PR introduce any user facing changes? No. pr-link: Alluxio#18059 change-id: cid-b82706a4419700f017584f3e5579d2ef3410aeb3
Make fuse max reader concurrency configurable. The default value was 64 and it was unchangeable. pr-link: Alluxio#18129 change-id: cid-9c55821622329bd1e608da2e7445e8ab591df38a
Fix typo from alluxio.max.fuse.reader.concurrency to alluxio.fuse.max.reader.concurrency pr-link: Alluxio#18134 change-id: cid-434086cf6ba9e9f8d173e3417fc8518963dfa102
update usages of bin/alluxio, bin/alluxio-start.sh and bin/alluxio-stop.sh to their new counterparts simplify section of CephFS.md and remove sections related to mounting. the ufs must be configured as the root mount via alluxio-site.properties. pr-link: Alluxio#18136 change-id: cid-fa7d0eec00c8fb136680ef6d5a2c7ee78571d123
### What changes are proposed in this pull request? Support accessing OSS through proxy by configuring alluxio properties or system properties. ### Why are the changes needed? When accessing OSS through a proxy, the OSS client cannot recognize the proxy configuration in system property and environment variables. So it has to proactively set proxy-related configurations in the configuration. ### Does this PR introduce any user facing changes? Please list the user-facing changes introduced by your change, including 1. change in user-facing APIs 2. addition or removal of property keys 3. webui pr-link: Alluxio#18139 change-id: cid-5e30dfd90747d4a1aafe9b2ff985331f05fefec6
### What changes are proposed in this pull request? If don't set oss.proxy.host, the default value should be NULL ### Why are the changes needed? If don't set oss.proxy.host, the default value should be NULL ### Does this PR introduce any user facing changes? Please list the user-facing changes introduced by your change, including 1. change in user-facing APIs 2. addition or removal of property keys 3. webui pr-link: Alluxio#18142 change-id: cid-bce2790e583445c4ba6720d2f0a64551fb19de20
### What changes are proposed in this pull request? In this change the path conversion logic is extracted to static utility methods for code reuse (because other classes may use the same path resolution logic). The method names are slightly improved, to distinguish the member methods in `DoraCacheFileSystem` (may be inherited) from the static utility methods. pr-link: Alluxio#18140 change-id: cid-557fa148f6daa41f0b296132e5a2ecae6c5d6c22
pr-link: Alluxio#18138 change-id: cid-e9f862385bdfe6cc5e6938eb49907055449deb4a
### What changes are proposed in this pull request? 1. Move the path resolution logic from `DoraCacheFileSystem` to `PathUtils` where it makes more sense 2. Fix the alluxioPathToUfsPath resolution by handling the case where the ufs path may have no matching alluxio path, making the util method more generic pr-link: Alluxio#18146 change-id: cid-ebace1efcf58e385bbf71b599e4b5a15a2199f7e
### What changes are proposed in this pull request? Please outline the changes and how this PR fixes the issue. ### Why are the changes needed? Better error handling when a stream is closed. Log more and don't swallow errors ### Does this PR introduce any user facing changes? N/A pr-link: Alluxio#18145 change-id: cid-b1146421ea5bc51e9a5eea5bf11ce1a5d8466912
Use jackson JSON as a standard format for reports. Output example for `bin/alluxio info report summary`: ``` { "mSafeMode":false, "mZookeeper":false, "mRaftJournal":true, "version":"304-SNAPSHOT", "uptime":"0 day(s), 0 hour(s), 3 minute(s), and 7 second(s)", "rpcPort":19998, "webPort":19999, "masterAddress":"Ec2Cluster-masters-0:19998", "masterVersions":[ { "version":"304-SNAPSHOT", "state":"PRIMARY", "host":"Ec2Cluster-masters-0", "port":19998 } ], "started":"08-24-2023 06:43:32:856", "zookeeperAddress":[ ], "raftJournalAddress":[ "Ec2Cluster-masters-0:19200" ], "liveWorkers":2, "lostWorkers":0, "freeCapacity":"2048.00MB", "totalCapacityOnTiers":{ "MEM":"2048.00MB" }, "usedCapacityOnTiers":{ "MEM":"0B" } } ``` pr-link: Alluxio#18047 change-id: cid-bf6d54f47390a4d2bd84e4baac2ea2863d4638e1
### What changes are proposed in this pull request? refactor the cli code ### Why are the changes needed? make the code more flexible and easy for adding more functions in a cleaner way ### Does this PR introduce any user facing changes? nope pr-link: Alluxio#18152 change-id: cid-d8f937075174d913daf32387d781161096f03345
1. Use `addSuppressed` instead of creating a new exception to throw the original exception. 2. Go ahead to write data to UFS when it encountered exception during the time writing data to Alluxio Worker. pr-link: Alluxio#18017 change-id: cid-9337252b71e40fced28fb1598ea88eed56c69229
### What changes are proposed in this pull request? Improve distributed load 1. Configurable job failure criteria 2. Configuration to determine if the load job should be restored from journal or not 3. Add an option to skip existing fully loaded file 4. Add retry count for failed files 5. Bug fixing ### Why are the changes needed? To enhance the distributed load tool ### Does this PR introduce any user facing changes? Yes. The skip-if-exists option is added to the distributed load cli. pr-link: Alluxio#18153 change-id: cid-5644da1c09bd6ee48f628552f51cb570de581b93
### What changes are proposed in this pull request? Recover the ufs uri support ### Why are the changes needed? Ufs uri should be the first class citizen in dora ### Does this PR introduce any user facing changes? No pr-link: Alluxio#18135 change-id: cid-a35fe39bdf69879ef113fc97737a35ebf6d8b29a
Add docker doc back pr-link: Alluxio#18130 change-id: cid-6038dce4ae1821f0e7ccf0e2e874bed5d312057d
### What changes are proposed in this pull request? Add a test so we can monitor the vnode distribution is not too uneven. This test calculates the standard deviation over mean on the collection of virtual nodes assigned to physical nodes. It arbitrarily bounds it at 0.25, but ideally this number should get smaller over time as we improve hashing algorithm and use better ways to assign virtual nodes to physical nodes. ### Why are the changes needed? We may change hashing algorithm and virtual node assignment in the future, this will provide guidance and catch errors. ### Does this PR introduce any user facing changes? No. pr-link: Alluxio#18147 change-id: cid-152d8edc9b65ef59967d5985849feeb471a6650d
Add the following CLI tools for debugging and analyzing caching issues: 1. checkCaching. Checks if files under a path have been cached in alluxio. 2. location. Displays the list of hosts storing the specified file. 3. consistentHash. This command is for checking whether the consistent hash ring is changed or not. pr-link: Alluxio#18151 change-id: cid-c89b98da70a5270070d873bdcfce1aa9b23cf083
Update the Golang side commands to be able to use this output: 1. Return either the yaml or json (default) output to the console. 2. Users can define the format they want with `--format` flag, like `bin/alluxio info report --format yaml` 3. In JSON format, print properties in a fixed, easy-to-read order 4. In YAML format, print properties alphabetically (since YAML specification regards property order non-significant) Before: ``` {"safeMode":false,"masterVersions":[{"version":"304-SNAPSHOT","host":"localhost","port":19998,"state":"PRIMARY"}],"masterAddress":"localhost:19998","zookeeperAddress":[],"useZookeeper":false,"raftJournalAddress":["localhost:19200"],"useRaftJournal":true,"liveWorkers":1,"lostWorkers":0,"freeCapacity":"1024.00MB","totalCapacityOnTiers":{"MEM":"1024.00MB"},"usedCapacityOnTiers":{"MEM":"0B"},"version":"304-SNAPSHOT","webPort":19999,"started":"09-15-2023 15:54:56:635","uptime":"0 day(s), 0 hour(s), 26 minute(s), and 37 second(s)","rpcPort":19998} ``` After (in JSON): ``` { "rpcPort": 19998, "started": "09-15-2023 15:54:56:635", "uptime": "0 day(s), 0 hour(s), 55 minute(s), and 31 second(s)", "safeMode": false, "version": "304-SNAPSHOT", "webPort": 19999, "masterVersions": [ { "version": "304-SNAPSHOT", "host": "localhost", "port": 19998, "state": "PRIMARY" } ], "masterAddress": "localhost:19998", "zookeeperAddress": [], "useZookeeper": false, "raftJournalAddress": [ "localhost:19200" ], "useRaftJournal": true, "liveWorkers": 1, "lostWorkers": 0, "freeCapacity": "1024.00MB", "totalCapacityOnTiers": { "MEM": "1024.00MB" }, "usedCapacityOnTiers": { "MEM": "0B" } } ``` After (in YAML): ``` freeCapacity: 1024.00MB liveWorkers: 1 lostWorkers: 0 masterAddress: localhost:19998 masterVersions: - host: localhost port: 19998 state: PRIMARY version: 304-SNAPSHOT raftJournalAddress: - localhost:19200 rpcPort: 19998 safeMode: false started: 09-15-2023 15:54:56:635 totalCapacityOnTiers: MEM: 1024.00MB uptime: 0 day(s), 1 hour(s), 1 minute(s), and 36 second(s) useRaftJournal: true useZookeeper: false usedCapacityOnTiers: MEM: 0B version: 304-SNAPSHOT webPort: 19999 zookeeperAddress: [] ``` pr-link: Alluxio#18159 change-id: cid-deb6e74552de9afcf45391c6c230a9fe00785e37
### What changes are proposed in this pull request? Add datePredicate, i.e.: lastModifiedDate(2000/01/01, 2023/09/01) ### Why are the changes needed? Customer requirement. ### Does this PR introduce any user facing changes? na pr-link: Alluxio#18155 change-id: cid-7e1a7b7d208747807b87502c9da854ddf0b8c7fc
For random reads, bytes read per file is not a constant any more. In spite of existing duration distribution, need a throughput distribution for better understanding of reading performance. Also, when duration too long, grpc will receive huge size of output data. Should aggregate data points to transfer more datapoints with limited output size. Group datapoints by threads and time slices: Example: ``` nodeResults: { worker-0: { dataPoints: [ data: [ { // worker 0, thread 0, slice 0 count: 1, iobytes: 33554432, }, { // worker 0, thread 0, other slices … } ], [ // worker 0, other workers … ] ] throughputPercentiles: […] }, worker-1: { // other workers … } } ``` Slice time with `--slice-size` flag, e.g. `--slice-size 1s`. pr-link: Alluxio#18149 change-id: cid-ec8ed5a4f9eeaa86b1d86b6b449db4647d584823
### What changes are proposed in this pull request? Add Unit Test for OSS, OBS and GCS ### Why are the changes needed? Unit test is important for improving functions of Alluxio. ### Does this PR introduce any user facing changes? No pr-link: Alluxio#17985 change-id: cid-c757b8249c62e2ccf0483cb99436e33d351358a1
### What changes are proposed in this pull request? Fix datePredicate so it would respect the interval specified in the policy. Polish tests ### Why are the changes needed? bug fix ### Does this PR introduce any user facing changes? na pr-link: Alluxio#18167 change-id: cid-b035b5cc31aa70a2f83de2c3f84ba49ed75f9fb5
- Add a new CLI command that iterates through the command tree and generates a markdown file based on each command's definition and description - Migrate all the content in the previous User-CLI.md into the corresponding commands in golang code, mainly updating their `Long` and `Examples` fields - Run `bin/alluxio generate user-cli` to write the generated content directly into `docs/en/operation/User-CLI.md` pr-link: Alluxio#18144 change-id: cid-9b29f273efef9693e1b0b303c62cc19602d77acc
### What changes are proposed in this pull request? Fix the broken PagedDoraWorkerTest. ### Why are the changes needed? The old test is.broken, I just fix it. ### Does this PR introduce any user facing changes? no. pr-link: Alluxio#18150 change-id: cid-a142297c4f08780189e4321abc8e99ff512091ec
### What changes are proposed in this pull request? Fix runtime stats when cache hit ### Why are the changes needed? We already count the runtime stats in local cache manager, but we didn't pass in a proper cache context ### Does this PR introduce any user facing changes? Please list the user-facing changes introduced by your change, including 1. change in user-facing APIs 2. addition or removal of property keys 3. webui pr-link: Alluxio#18503 change-id: cid-233916f6b410432f8300f10ac8a2385afe969f2c
Fix Unrecognized error ### What changes are proposed in this pull request? Please outline the changes and how this PR fixes the issue. ### Why are the changes needed? Please clarify why the changes are needed. For instance, 1. If you propose a new API, clarify the use case for a new API. 2. If you fix a bug, describe the bug. ### Does this PR introduce any user facing changes? Please list the user-facing changes introduced by your change, including 1. change in user-facing APIs 2. addition or removal of property keys 3. webui pr-link: Alluxio#18527 change-id: cid-9753dc51317f4ebf4ba50cd5463155e821191ac3
### What changes are proposed in this pull request? Add decommission capability for worker and removenode cli cmd to remove node from etcd registration ### Why are the changes needed? to add decommission capability for worker ### Does this PR introduce any user facing changes? New cli added for remove worker node pr-link: Alluxio#18530 change-id: cid-0d41896ef29c976d773a4dfa0b07bd5efd836115
### What changes are proposed in this pull request? Port changes from [@LetianYuan](https://github.com/LetianYuan) in PR : Alluxio#17256 in urgency for resolving injection vulnerability. ### Why are the changes needed? Resolve issue: https://nvd.nist.gov/vuln/detail/CVE-2023-38889 ### Does this PR introduce any user facing changes? No. pr-link: Alluxio#18536 change-id: cid-a8122e3ff3324f1a9d881f55cc425ab30a020c35
### What changes are proposed in this pull request? rebuild libjnifuse on Centos7 (glibc 2.17, libfuse 2.9.2) the build should work on Ubuntu with glibc 2.3x ### Why are the changes needed? libjnifuse.so fails to link to libfuse2, issue created here : Alluxio#18513 Introduced by a jnifuse change to rebuild the libjnifuse.so ### Does this PR introduce any user facing changes? fuse2 should be able to use now pr-link: Alluxio#18537 change-id: cid-d5bd6421ccd76f59e6e1216ac9940c419c3c6a45
### What changes are proposed in this pull request? Alluxio doesn't work with early-access (EA) JDKs which makes it hard to do some forward-testing before new JDK is released. ### Why are the changes needed? Please clarify why the changes are needed. For instance, 1. If you propose a new API, clarify the use case for a new API. 2. If you fix a bug, describe the bug. ### Does this PR introduce any user facing changes? Please list the user-facing changes introduced by your change, including 1. change in user-facing APIs 2. addition or removal of property keys 3. webui pr-link: Alluxio#18553 change-id: cid-15c97cfab92f5ceafbc9051550cf596399ba02b9
### What changes are proposed in this pull request? 1. catch all exception thrown by fetch worker cluster view from etcd …when refresh the workers view instead of propagate throwing all the way to the caller 2. make worker failure detection timeout (the timeout to determine if a worker is in FAILED state) configurable thru setting the service discovery entity's lease ttl 3. make newLeaseInternal always overwrite the key with the newly created lease ### Why are the changes needed? To resolve: 1) currently worker is considered down only 2sec after its disconnection with etcd, its too small, make the failure detection timeout configurable for registered service discovery services. 2) etcd unavaible runtime exception will be propagated to caller which is non-ideal, currently capture at FileSystemContext.getCachedWorkers layer to prevent propagating to IO layer, causing an unnecessary ufs fallback such as cold read. ### Does this PR introduce any user facing changes? No pr-link: Alluxio#18554 change-id: cid-3e87779f7cce10861e5b8a871030a89f9b828ab1
### What changes are proposed in this pull request? When a old worker is down for couple days and up again with old data, the metrics Client.CachePagesAges would show high page age which exceed the ttl. And this violates the regulation. This is caused by: 1. we are not deleting old files when restoring local cache, and old data may still be able to serve users. 2. when we are not able to run invalidate scheduled tasks when we are in restore state(initialDelay = 0). In the meanwhile, fix a metrics calculation: CLIENT_CACHE_PAGES_INVALIDATED ### Why are the changes needed? bug fix ### Does this PR introduce any user facing changes? na pr-link: Alluxio#18568 change-id: cid-c335697a018e0d57182045e96245f1145ed2494c
### What changes are proposed in this pull request? Using name threads is generally a good practice and helps system administrators understand what's going on with the system. Without explicit naming, the JDK-provided pools jave threads named `pool-N-thread-T`. pr-link: Alluxio#18573 change-id: cid-2fb9e71b75de62e7c6a0ead8c83037922957bf68
### What changes are proposed in this pull request? Modified the logic of Alluxio clearing expired metadata. The expiration time of RocksDB is set so that expired metadata can be cleared in time every time RocksDB performs a compaction operation. ### Why are the changes needed? Alluxio has a problem that expired metadata cannot be cleared in time, which may lead to insufficient disk space in the long run. We have modified the logic for clearing expired metadata to ensure that if the user sets an appropriate metadata expiration time, the disk will not continue to accumulate metadata that cannot be cleared. ### Does this PR introduce any user facing changes? None pr-link: Alluxio#18582 change-id: cid-40c506311a76a3c7585227ca2b5f488088ce9486
### What changes are proposed in this pull request? Add HTTP getPage cache hit ratio. This is used for mimicing how much data read from alluxio cache and how much data possibly fallback to UFS given that no that much user get metrics from alluxiofs client. ### Why are the changes needed? To get cache hit ratio of worker http server get page ### Does this PR introduce any user facing changes? Yes, add some metrics pr-link: Alluxio#18594 change-id: cid-e6ee6b8f2d5a751d72e804fe8e067b02cbf3ac2c
Install netcat in Dockerfile as needed. pr-link: Alluxio#18598 change-id: cid-d2cc6b7c6d485a526e038c2faa94830933d3f716
Add a path to the worker http server to enable an easy health check. pr-link: Alluxio#18596 change-id: cid-e12e3a2d1f91152f0af5cb598ba49c65a8abe42c
Add health check RESTful API in Worker HTTP Server. pr-link: Alluxio#18604 change-id: cid-9a2a13e98df8e6f5c7298e4cd32af2747dc365da
### What changes are proposed in this pull request? Fix metrics when delete non existing page ### Why are the changes needed? When deleting non-existing page we see a abnormal spike in PageAge. Still doesn't know why we have abnormal PageAge but since it's non-existing page we would not able to read it as well so it make sense to not record the age. Also we continue to delete page from page store if we didn't find it in metastore so we avoid inconsistency between pagestore and metastore ### Does this PR introduce any user facing changes? na pr-link: Alluxio#18618 change-id: cid-7d4803cb2d8c433d3ad077a7bf9e3074d91b3389
### What changes are proposed in this pull request? add back netty dependency within grpc ### Why are the changes needed? previously we exclude netty dependency since we already have netty-all outside Alluxio#9942 But due to grpc/grpc-java#11284, we sometimes have incompatibility between grpc and netty, and it's better to use shaded netty within grpc so we can be sure that they are compatible. ### Does this PR introduce any user facing changes? na pr-link: Alluxio#18642 change-id: cid-65d86f315e023592060b6a9f864bfe2a972dfe68
Install UCX on dockerfile jdk 8 for github PR test pr-link: Alluxio#18663 change-id: cid-44eb2b7eb1c432ca4dd3b8050fa1dd908f0b124e
Thank you for your pull request. |
OK, I have signed the CLA and e-mail back to [email protected] |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
What changes are proposed in this pull request?
Fix metric reporting error.
When the worker reports metrics to the master, the metric type reported to the master should be consistent with the instanceType
Why are the changes needed?
When I was reading data, I found that some metrics in the master were inaccurate.
For example, the "Remote Alluxio Read" on the master homepage was less than the amount of data read by the client, and also less than the "Bytes Read Remotely" on the worker page.
By adding logs, it can be found that the "Bytes Read Remotely" of the worker is correct. However, after the woker sent it to the master, some
alluxio.grpc.Metric
records were skipped.alluxio.grpc.Metric
is obtained from the worker'salluxio.metrics.MetricsSystem#reportMetrics
and received and processed by the master'salluxio.master.metrics.MetricsStore#putReportedMetrics
.When processing reportMetrics, in
alluxio.metrics.MetricsSystem#SHOULD_REPORT_METRICS
, there are actually bothInstanceType.CLIENT
andInstanceType.Worker
metrics. They will be reported to the master by callingalluxio.metrics.MetricsSystem#reportWorkerMetrics
andalluxio.metrics.MetricsSystem#reportClientMetrics
interfaces by different worker threads. However, in the master, putReportedMetrics will only process the records ofInstanceType.Worker
and ignore the records ofInstanceType.CLIENT
.Since Counter only sends diff, if a Worker's metric is sent by reportClientMetrics, the master's record will record one less data.
Does this PR introduce any user facing changes?
none