Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[analyzer] Fix performance of getTaintedSymbolsImpl() #89606

Merged
merged 1 commit into from
Apr 23, 2024

Conversation

NagyDonat
Copy link
Contributor

@NagyDonat NagyDonat commented Apr 22, 2024

Previously the function

std::vector<SymbolRef> taint::getTaintedSymbolsImpl(ProgramStateRef State,
                                                    const MemRegion *Reg,
                                                    TaintTagType K,
                                                    bool returnFirstOnly)

(one of the 8 overloaded variants under this name) was handling element regions in a highly inefficent manner: it performed the "also examine the super-region" step twice. (Once in the branch for element regions, and once in the more general branch for all SubRegions -- note that ElementRegion is a subclass of SubRegion.)

As pointer arithmetic produces ElementRegions, it's not too difficult to get a chain of N nested element regions where this inefficient recursion would proudce 2^N calls. I suspect that this issue might be behind #89045 (note that sheervideo.c does very complex pointer arithmetic).

This commit is essentially NFC, apart from the performance improvements and the removal of (probably irrelevant) duplicate entries from the return value of getTaintedSymbols() calls.
Fixes #89045

Previously the function
```
std::vector<SymbolRef> taint::getTaintedSymbolsImpl(ProgramStateRef State,
                                                    const MemRegion *Reg,
                                                    TaintTagType K,
                                                    bool returnFirstOnly)
```
(one of the 8 overloaded variants under this name) was handling element
regions in a highly inefficent manner: it performed the "also examine
the super-region" step twice. (Once in the branch for element regions,
and once in the more general branch for all `SubRegion`s -- note that
`ElementRegion` is a subclass of `SubRegion`.)

As pointer arithmetic produces `ElementRegion`s, it's not too difficult
to get a chain of N nested element regions where this inefficient
recursion would proudce 2^N calls. I suspect that this issue might be
behind llvm#89045
(note that `sheervideo.c` does very complex pointer arithmetic).

This commit is essentially NFC, apart from the performance improvements
and the removal of (probably irrelevant) duplicate entries from the
return value of `getTaintedSymbols()` calls.
@llvmbot llvmbot added clang Clang issues not falling into any other category clang:static analyzer labels Apr 22, 2024
@llvmbot
Copy link
Member

llvmbot commented Apr 22, 2024

@llvm/pr-subscribers-clang-static-analyzer-1

@llvm/pr-subscribers-clang

Author: None (NagyDonat)

Changes

Previously the function

std::vector&lt;SymbolRef&gt; taint::getTaintedSymbolsImpl(ProgramStateRef State,
                                                    const MemRegion *Reg,
                                                    TaintTagType K,
                                                    bool returnFirstOnly)

(one of the 8 overloaded variants under this name) was handling element regions in a highly inefficent manner: it performed the "also examine the super-region" step twice. (Once in the branch for element regions, and once in the more general branch for all SubRegions -- note that ElementRegion is a subclass of SubRegion.)

As pointer arithmetic produces ElementRegions, it's not too difficult to get a chain of N nested element regions where this inefficient recursion would proudce 2^N calls. I suspect that this issue might be behind #89045 (note that sheervideo.c does very complex pointer arithmetic).

This commit is essentially NFC, apart from the performance improvements and the removal of (probably irrelevant) duplicate entries from the return value of getTaintedSymbols() calls.


Full diff: https://github.com/llvm/llvm-project/pull/89606.diff

1 Files Affected:

  • (modified) clang/lib/StaticAnalyzer/Checkers/Taint.cpp (+6-8)
diff --git a/clang/lib/StaticAnalyzer/Checkers/Taint.cpp b/clang/lib/StaticAnalyzer/Checkers/Taint.cpp
index 4edb671753bf45..6362c82b009d72 100644
--- a/clang/lib/StaticAnalyzer/Checkers/Taint.cpp
+++ b/clang/lib/StaticAnalyzer/Checkers/Taint.cpp
@@ -216,21 +216,17 @@ std::vector<SymbolRef> taint::getTaintedSymbolsImpl(ProgramStateRef State,
   std::vector<SymbolRef> TaintedSymbols;
   if (!Reg)
     return TaintedSymbols;
-  // Element region (array element) is tainted if either the base or the offset
-  // are tainted.
+
+  // Element region (array element) is tainted if the offset is tainted.
   if (const ElementRegion *ER = dyn_cast<ElementRegion>(Reg)) {
     std::vector<SymbolRef> TaintedIndex =
         getTaintedSymbolsImpl(State, ER->getIndex(), K, returnFirstOnly);
     llvm::append_range(TaintedSymbols, TaintedIndex);
     if (returnFirstOnly && !TaintedSymbols.empty())
       return TaintedSymbols; // return early if needed
-    std::vector<SymbolRef> TaintedSuperRegion =
-        getTaintedSymbolsImpl(State, ER->getSuperRegion(), K, returnFirstOnly);
-    llvm::append_range(TaintedSymbols, TaintedSuperRegion);
-    if (returnFirstOnly && !TaintedSymbols.empty())
-      return TaintedSymbols; // return early if needed
   }
 
+  // Symbolic region is tainted if the corresponding symbol is tainted.
   if (const SymbolicRegion *SR = dyn_cast<SymbolicRegion>(Reg)) {
     std::vector<SymbolRef> TaintedRegions =
         getTaintedSymbolsImpl(State, SR->getSymbol(), K, returnFirstOnly);
@@ -239,6 +235,8 @@ std::vector<SymbolRef> taint::getTaintedSymbolsImpl(ProgramStateRef State,
       return TaintedSymbols; // return early if needed
   }
 
+  // Any subregion (including Element and Symbolic regions) is tainted if its
+  // super-region is tainted.
   if (const SubRegion *ER = dyn_cast<SubRegion>(Reg)) {
     std::vector<SymbolRef> TaintedSubRegions =
         getTaintedSymbolsImpl(State, ER->getSuperRegion(), K, returnFirstOnly);
@@ -318,4 +316,4 @@ std::vector<SymbolRef> taint::getTaintedSymbolsImpl(ProgramStateRef State,
     }
   }
   return TaintedSymbols;
-}
\ No newline at end of file
+}

Copy link
Contributor

@dkrupp dkrupp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The suggested change make a lot of sense. Thanks.
LGTM.

Copy link
Contributor

@steakhal steakhal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

The patch makes sense. Thank you for the promptly fix.
I've checked and this resolves the FFmpeg issue.
i wonder if this is serious enough to consider hangs as a crash and nominate this PR for backport to clang-18.
Leave me no doubt, we will have it downstream for sure, but upstream users might wanna also have this.
@Xazax-hun WDYT?

@steakhal
Copy link
Contributor

Ah, I think rushed ahead of myself.
I applied the patch to clang-17, where it of course we didn't have any issues even with this broken isTainted.
Now that I applied the patch to our clang-18 based branch, the file analyzes in 1:40, which is still far off from the baseline run 32 seconds. But 1:40 is arguably a lot better than 1.23 hours.

I perfed again, and it shows this:
image

This is much better, and showcases the next slowdown bug I'm hunting for the days since I reported this one :D
The remaining 1 extra minute is lost during Z3 refutation.

So, yea, this PR fixes the bug I reported in #89045, so we can close that once this is merged.
And stay tuned for the next slowdown bug. I can already tell you that the bitwise shift checker introduces new constraints, after which we can now refute more bugs! - but also spends more time for finding a valid bug report in an EQClass , while previously we just picked the first one as we found those path constraints sat. But I'll come back to that once I have something concrete to share.

@Xazax-hun
Copy link
Collaborator

@Xazax-hun WDYT?

If it really takes over an hour to analyze a file in clang-18, I'd agree that this has similar severity to a crash, and I support backporting the fix.

@NagyDonat NagyDonat merged commit ce763bf into llvm:main Apr 23, 2024
7 checks passed
steakhal pushed a commit to steakhal/llvm-project that referenced this pull request Apr 23, 2024
Previously the function
```
std::vector<SymbolRef> taint::getTaintedSymbolsImpl(ProgramStateRef State,
                                                    const MemRegion *Reg,
                                                    TaintTagType K,
                                                    bool returnFirstOnly)
```
(one of the 4 overloaded variants under this name) was handling element
regions in a highly inefficient manner: it performed the "also examine
the super-region" step twice. (Once in the branch for element regions,
and once in the more general branch for all `SubRegion`s -- note that
`ElementRegion` is a subclass of `SubRegion`.)

As pointer arithmetic produces `ElementRegion`s, it's not too difficult
to get a chain of N nested element regions where this inefficient
recursion would produce 2^N calls.

This commit is essentially NFC, apart from the performance improvements
and the removal of (probably irrelevant) duplicate entries from the
return value of `getTaintedSymbols()` calls.

Fixes llvm#89045

(cherry picked from commit ce763bf)
tstellar pushed a commit to steakhal/llvm-project that referenced this pull request Apr 25, 2024
Previously the function
```
std::vector<SymbolRef> taint::getTaintedSymbolsImpl(ProgramStateRef State,
                                                    const MemRegion *Reg,
                                                    TaintTagType K,
                                                    bool returnFirstOnly)
```
(one of the 4 overloaded variants under this name) was handling element
regions in a highly inefficient manner: it performed the "also examine
the super-region" step twice. (Once in the branch for element regions,
and once in the more general branch for all `SubRegion`s -- note that
`ElementRegion` is a subclass of `SubRegion`.)

As pointer arithmetic produces `ElementRegion`s, it's not too difficult
to get a chain of N nested element regions where this inefficient
recursion would produce 2^N calls.

This commit is essentially NFC, apart from the performance improvements
and the removal of (probably irrelevant) duplicate entries from the
return value of `getTaintedSymbols()` calls.

Fixes llvm#89045

(cherry picked from commit ce763bf)
@pointhex pointhex mentioned this pull request May 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
clang:static analyzer clang Clang issues not falling into any other category
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[analyzer] Serious slowdown on FFmpeg sheervideo.c when using ArrayBoundV2
5 participants