Proposal: high level file classification #426
Comments
I think pointing out or emphasizing metadata files like LICENSE or COPYING in scan results would be a great addition. When I am doing analysis of 3rd party stuff, these are the first things I look at, and if a project takes the time to include them, they are almost always correct. If these files appeared somewhere near the top of scan results wherever they are being viewed (HTML app or AboutCode Manager), that would really be helpful during analysis. |
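As a rough illustration of the idea above, here is a minimal sketch that flags legal metadata files by name and sorts them to the top of a result list. The `LEGAL_NAMES` tuple and both function names are hypothetical and are not part of ScanCode; matching on a name prefix is only a heuristic and will produce some false positives.

```python
from pathlib import PurePosixPath

# Hypothetical list of name prefixes treated as legal metadata (an assumption,
# not a list taken from ScanCode).
LEGAL_NAMES = ("license", "licence", "copying", "notice", "copyright")


def is_legal_file(path: str) -> bool:
    """Return True if the file name starts with a well-known legal name."""
    name = PurePosixPath(path).name.lower()
    return name.startswith(LEGAL_NAMES)


def legal_first(paths):
    """Sort paths so legal files come first, then by path within each group."""
    return sorted(paths, key=lambda p: (not is_legal_file(p), p))


if __name__ == "__main__":
    for p in legal_first(["src/main.c", "LICENSE", "docs/COPYING.txt", "README.md"]):
        print(p)
```

A viewer could apply the same sort key to scan results, or a scan could record the boolean as an extra per-file field.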
@pombredanne Would this make more sense as another fileinfo scan field, or as an additional thing added on after the fact, like the |
@MaJuRG sorry for the late reply! a fileinfo field makes the most sense |
From @mjherzog (#873, which is moved here instead): We currently have several "file type" fields returned from a scan:
For this topic, I will ignore Type since this just covers File vs Directory and focus on files only. We need some simpler way to identify the file type in one field to facilitate filtering in AboutCode Manager and other tools. MIME Type and File Type each have pros and cons. In many cases MIME Type seems more useful because it summarizes the type a bit more - e.g. "text/x-shellscript" is probably more useful than corresponding File Types like "Bourne-Again shell script, ASCII text executable" and "POSIX shell script, ASCII text executable" because I primarily want to find all of the script files (which often do not have an extension).
It may be the case that we could get the best result with a new Summary File Type field where the possible values are: Binary, Archive, Text, Media, Source or Script, but I am not sure whether a scan will resolve to only one of these values (presumably we have multiple fields today because of some overlap). The primary use case is that I want to easily filter for Binary and Source code files, which are the primary targets for analysis. The secondary use case is to easily filter for chunks like Script or Media files. This will also be important for filtering DeltaCode results to set up alerts/warnings for code files, but ignore or lower the priority of changes to Script or Media files. I reviewed some scans and noticed many shell script files show up as Text rather than Script, so the current identification of Script: true/false is not going to help much. |
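One way to make such a field resolve to a single value is to check the candidate buckets in a fixed priority order and return the first match. The sketch below assumes the scan already provides a MIME type string, a File Type string, and a source-code flag; the function name, the `is_source` parameter, and the MIME sets are hypothetical, and the priority order is only one possible choice.

```python
import re

# Assumed MIME types, for illustration only; a real mapping would need review.
SCRIPT_MIMES = {"text/x-shellscript"}
ARCHIVE_MIMES = {
    "application/zip",
    "application/x-tar",
    "application/gzip",
    "application/x-gzip",
    "application/x-bzip2",
    "application/x-xz",
    "application/x-7z-compressed",
}


def summary_file_type(mime_type: str, file_type: str, is_source: bool = False) -> str:
    """Return exactly one of: Script, Source, Media, Archive, Text, Binary."""
    mime = (mime_type or "").lower()
    ftype = (file_type or "").lower()
    # Scripts first, so shell scripts are not swallowed by the Text bucket.
    if mime in SCRIPT_MIMES or re.search(r"\bscript\b", ftype):
        return "Script"
    if is_source:
        return "Source"
    if mime.startswith(("image/", "audio/", "video/")):
        return "Media"
    if mime in ARCHIVE_MIMES or re.search(r"\barchive\b", ftype):
        return "Archive"
    if mime.startswith("text/") or re.search(r"\btext\b", ftype):
        return "Text"
    # Anything not recognized above falls back to Binary.
    return "Binary"
```

Because the first matching rule wins, overlapping cases (a shell script is also ASCII text) still resolve to one bucket, at the cost of hiding the overlap.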
Something to consider is ClearlyDefined facets. It would be best to align classifications with these. See
Also, the notion of "scope" for dependencies is closely related. See https://github.com/heremaps/oss-review-toolkit/blob/master/model/src/main/kotlin/Scope.kt#L27 |
Some comments:
|
#1754 Prototype new summary/primary Content Type prototype |
@pombredanne I really want to comment on this and
So maybe 1st way would be easy to implement and sounds practical |
To support #377 and other scan-based deduction and related refinements, an important step is to "classify" the files in the codebase being scanned. This would mean defining a few high level buckets and heuristics to classify a file in a bucket.
With such classification, smarter results could be provided: for instance the license of documentation files or build scripts does not have the same impact as the license of the main code (and may often not be part of a build or redistributed software as used in a system or app).
I am opening this up for discussion to define the classifications. I think there should be as few classifications as possible. They could be part of a hierarchy, but flat is probably better and simpler.
Here is a first shot at what these classes could be:
Note that a file may end up in more than one class... not sure this would be a good thing.
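To make the "more than one class" question concrete, here is a small sketch that stores classes as combinable flags, so a file can carry several at once. The bucket names and the name-based rules are placeholders chosen for illustration only; they are not the classes proposed in this issue.

```python
from enum import Flag, auto


class FileClass(Flag):
    # Placeholder buckets, for illustration only.
    NONE = 0
    CODE = auto()
    DOCUMENTATION = auto()
    BUILD = auto()
    LEGAL = auto()


def classify(path: str) -> FileClass:
    """Return every placeholder class whose simple name heuristic matches."""
    name = path.rsplit("/", 1)[-1].lower()
    classes = FileClass.NONE
    if name.startswith(("license", "copying", "notice")):
        classes |= FileClass.LEGAL
    if name.startswith("readme") or name.endswith((".md", ".rst", ".txt")):
        classes |= FileClass.DOCUMENTATION
    if name in ("makefile", "setup.py", "pom.xml", "build.gradle"):
        classes |= FileClass.BUILD
    if name.endswith((".py", ".c", ".java", ".js")):
        classes |= FileClass.CODE
    return classes
```

With this representation `classify("setup.py")` carries both the build and code flags. A flat, single-valued design would instead have to pick a priority order between them.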
Besides this classification, determining if some file is `deployed` or `not deployed` as part of a production build and `built` vs. `not built` is another topic altogether which would not be covered explicitly here.