Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve AST parser performance #1158

Closed
patrickkusebauch opened this issue Apr 22, 2023 · 17 comments
Closed

Improve AST parser performance #1158

patrickkusebauch opened this issue Apr 22, 2023 · 17 comments

Comments

@patrickkusebauch
Copy link
Collaborator

During a Slack conversation with Rubén Rubio, we found out that Deptrac was the slowest quality gate in his CI pipeline. Even slower than running both PHPStan and Psalm together. This should not be the case. We should be faster than either of these tools. We do significantly less work. All 3 of us are parsing AST (should take about the same), but then we only resolve dependencies, while Psalm and PHPStan do a lot more work resolving types and applying rules on those.

Some numbers that came from a debug build of deptrac with perf counters:

image

AstMap created in 49.504072 seconds
start emitting dependencies "ClassDependencyEmitter"
end emitting dependencies in 0.262179 seconds
start emitting dependencies "UsesDependencyEmitter"
end emitting dependencies in 0.091963 seconds
start flatten dependencies
end flatten dependencies in 0.732068 seconds
formatting dependencies.

Obviously, the most expensive part is the AST parsing. We should look at how it is written, if it can be parallelized and if we can take some of NikicPHPParser's performance suggestions into consideration.

@patrickkusebauch
Copy link
Collaborator Author

Slack thread link for reference: https://symfony-devs.slack.com/archives/C036EU1GZS9/p1681985816572369

@ilnytskyi
Copy link

ilnytskyi commented May 4, 2023

I have been working with NikicPhpParser for my project similar to deptrac before I discovered deptrac :)
I solved similar problem this way:

  1. Collect files from target dir. (just scandir, no symfony components). But, I think it might be negligible
  2. Get number of CPUs with (int)nproc
  3. array_chunk found files by estimated number of chunk based on cpus available
  4. prepared batch callbacks for a process manager class similar to https://github.com/magento/magento2/blob/2.4-develop/app/code/Magento/Indexer/Model/ProcessManager.php that leverages https://www.php.net/manual/en/function.pcntl-fork.php
  5. At this point the only problem was to store data, as php does not mutate objects between processes
  6. I tried to use tmp file and later switched to sockets between parent and child processes to use as IPC
  7. at this point I reconstructed dependency tree form IPC text storage and proceeded with post processing and formatters

The results were stunning.

but in deptrac you have this problem only for the first run, then .deptrac.cache is used.

It looks like you could do something similar for

\Qossmic\Deptrac\Core\Ast\AstLoader::createAstMap

and here

\Qossmic\Deptrac\Core\Ast\Parser\NikicPhpParser\NikicPhpParser::parseFile
\Qossmic\Deptrac\Core\Ast\Parser\Cache\AstFileReferenceFileCache::set

data should be written into a socket or directly into .deptrac.cache then reloaded.

If there are cpus available the first run would be faster on Linux systems.
If child process files, the dependency tree would be broken :( (never happened to me)
Might introduce unexpected behavior for custom event observers when they collect data into a singleton object to use later; the singleton will be emtpy so users must be aware of using IPC for such cases.

To me deptrac works just fine. A few minutes pipeline for static tests is acceptable. But maybe someone else would like to work that faster, especially in case of git hooks prepush/precommit integrations, then the time is valuable.

@patrickkusebauch
Copy link
Collaborator Author

patrickkusebauch commented May 4, 2023

In CI you customarily do not have the cache file and do the whole analysis. Hence the times in the screenshot. I hope to pick the brains of developers of other SA tools like PHPStan and Rector, as they are both Czech (same as me) and easily reachable in some of the Czech PHP Slack communities.

Regardless, thanks for the tips, they might come in handy to anyone looking to implement this.

@patrickkusebauch
Copy link
Collaborator Author

Also, if you have your project public somewhere, I would love to take a look. There might be some useful gems there to integrate. 🙂

@rubenrubiob
Copy link
Contributor

I also dug a little bit into Deptrac analysis. As @ilnytskyi said, the only thing I found to improve the performance was to parallelize the analysis. I did not implement anything, but the only solution I found was to use pcntl_fork.

I used this strategy 10 years ago in some projects and the performance was huge: from running out of resources after hours of executing to finish in just some minutes. In my case, I used shared memory with the shmop_* functions, something like this: https://www.php.net/manual/en/function.pcntl-fork.php#115855

The thing is that pcnt_fork is only available on Linux, not on Mac nor Windows. And I think that most Docker images do not include the extension unless it is explicitly installed. Even though, I guess it is worth trying it.

On the other hand, I solved my performance problem by not including the vendor folder, like they explain here: #506 (comment) I think it is a common use case, so maybe it would be useful to add a section in the documentation explaining how to correctly include vendor dependencies.

@smoench
Copy link
Contributor

smoench commented May 5, 2023

Just one minor improvement could be: removing the is_file here, at least for the parser. As far as I remember we are passing the parser only "valid" files and the false from the file_get_contents would be enough when there is something wrong. This would save some I/O.

Another one for the cache would be switching the hash algo. I came across this PR from rector. They switched to the algo xxh128 (since PHP 8.1) and linked following Benchmark. This means when we are switching in this file sha1_file to hash_file('xxh128', ...) we would gain some performance improvements.

@staabm
Copy link
Contributor

staabm commented May 13, 2023

If you have a workload somewhere which reproduces the slowness I would be happy to help and profile it

@staabm
Copy link
Contributor

staabm commented May 13, 2023

oh my.. if I would be interessted in performance, I would not run my static analysis tooling thru docker ;-).


running it locally on my mac pro m1 takes 0.169secs (without docker)

blog-src git:(main) ✗ time vendor/bin/deptrac analyse --config-file=hexagonal-layers.depfile.yaml --report-uncovered --fail-on-uncovered --no-cache
 51/51 [▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓] 100%


 -------------------- ----- 
  Report                    
 -------------------- ----- 
  Violations           0    
  Skipped violations   0    
  Uncovered            0    
  Allowed              222  
  Warnings             0    
  Errors               0    
 -------------------- ----- 

vendor/bin/deptrac analyse --config-file=hexagonal-layers.depfile.yaml     0.13s user 0.04s system 97% cpu 0.169 total

I wonder why the CI pipeline scans so much files but locally only 51 are scanned? I guess my command is wrong?

@patrickkusebauch
Copy link
Collaborator Author

I would not say people care about performance as much as it does not compute to me why we take longer than PHPStan and Psalm combined. It was me who made it an issue. deptrac should not be the slowest link in the CI pipeline, that is just not a good user experience.

@staabm
Copy link
Contributor

staabm commented May 13, 2023

Yeah, I was joking. If I can reproduce it locally - without docker - I can look into it

@gennadigennadigennadi
Copy link
Member

@staabm it's also me, with my setup it also only parses 51 files(in docker) but more than 6000 through the github action. Maybe it's not a deptrac bottleneck but a github action config problem?

A small improvement for the pipeline would be to use the flag --no-cache, this way there will no writes made by deptrac.

@rubenrubiob
Copy link
Contributor

Sorry, I did not see the comments with the analysis. I improved Deptrac's configuration to only scan src files. You can see the performance degradation mentioned on the first comment if you include vendor in the paths to analyze by Deptrac.

@gennadigennadigennadi
Copy link
Member

I did some profiling my self, but I'm not that experienced in it. But what i found out is, that one big performance penalty, beside the AST parsing, is builing the symfony service container. I'm curiois to see your results @staabm .

@gennadigennadigennadi
Copy link
Member

gennadigennadigennadi commented May 23, 2023

Sorry, I did not see the comments with the analysis. I improved Deptrac's configuration to only scan src files. You can see the performance degradation mentioned on the first comment if you include vendor in the paths to analyze by Deptrac.

Okay, I see. I think we should work on the performance of deptrac in the future. If I got it right the screenshot @patrickkusebauch shared was based on the deptrac-config which parsed the whole vendor dir, right? But the other tools just parsed the src?

@rubenrubiob
Copy link
Contributor

Sorry, I did not see the comments with the analysis. I improved Deptrac's configuration to only scan src files. You can see the performance degradation mentioned on the first comment if you include vendor in the paths to analyze by Deptrac.

Okay, I see. I think we should work on the performance of deptrac in the future. If I got it right the screenshot @patrickkusebauch shared was based on the deptrac-config which parsed the whole vendor dir, right? But the other tools just parsed the src?

Exactly, the first screenshot was taken when Deptrac scanned the vendor folder. And yes, the other tools only scan the src folder. It is as I said in a previous comment:

On the other hand, I solved my performance problem by not including the vendor folder, like they explain here: #506 (comment) I think it is a common use case, so maybe it would be useful to add a section in the documentation explaining how to correctly include vendor dependencies.

@gennadigennadigennadi
Copy link
Member

close: the biggest performance impact is still php-parser.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants