GSOC 2022
AboutCode has been accepted as a GSoC mentoring org for 2022! See https://summerofcode.withgoogle.com/programs/2022/organizations/aboutcode
TL;DR See our list of ideas: https://github.com/nexB/aboutcode/wiki/GSOC-2022#our-project-ideas
This page contains information for aspiring contributors interested in participating in and helping with the GSoC 2022 program.
AboutCode is a family of FOSS projects to uncover data ... about software code:
- where does the code come from? which software package?
- what is its license? copyright?
- is the code vulnerable, maintained, well coded?
All these are questions that are important to answer: there are millions of free and open source software components available on the web for reuse.
Knowing where a software package comes from, what its license is and whether it is vulnerable should be a problem of the past such that everyone can safely consume more free and open source software.
Join us to make it so!
Our tools are used to help detect and report the origin and license of source code, packages and binaries as well as discover software and package dependencies, and in the future track security vulnerabilities, bugs and other important software package attributes. This is a suite of command line tools, web-based and API servers and desktop applications.
- ScanCode.io is a web-based application and API to run and review scans in rich scripted ScanPipe pipelines.
- ScanCode Toolkit is a popular command line tool to scan code for licenses, copyrights and packages, used by many organizations and FOSS projects, small and large.
- VulnerableCode is a web-based API and database to collect and track all the known software package vulnerabilities.
- ScanCode Workbench is a JavaScript, Electron-based desktop application to review scan results and document your origin and license conclusions.
- AboutCode Toolkit is a command line tool to document and inventory known packages and licenses and generate attribution docs, typically using the results of analyzed and reviewed scans.
- TraceCode Toolkit is a command line tool to find which source code file is used to create a compiled binary by tracing and graphing a build.
- DeltaCode is a command line tool to compare scans and determine if and where there are material differences that affect licensing.
- FetchCode is a library to reliably fetch any code via HTTP, FTP and version control systems such as git.
- container-inspector is a command line tool to analyze the code in Docker and container images and a low-level library to handle this.
- license-expression is a library to parse, analyze, simplify and render boolean license expressions (such as SPDX).
We have also co-founded and contributed to important projects for other organizations:
- Package URL which is an emerging standard to reference software packages of all types with simple, readable and concise URLs.
- SPDX aka. Software Package Data Exchange, a spec to document the origin and licensing of packages.
- ClearlyDefined to review and help FOSS projects improve their licensing and documentation clarity.
Join the chat online at https://gitter.im/aboutcode-org/discuss (or by IRC or Matrix). Introduce yourself and start the discussion!
For personal issues, you can contact the primary org admin directly: @pombredanne and [email protected]
Please try asking questions the smart way: http://www.catb.org/~esr/faqs/smart-questions.html
Discovering the origin, license and security of code is a vast topic. We primarily use Python with some C/C++, Rust and Go for performance-sensitive code. We use Django and PostgreSQL for web apps and API servers. We use Electron and JavaScript for our ScanCode Workbench.
Our domain includes text analysis and processing (for instance for copyrights and licenses detection), parsing (for package manifest formats), binary analysis (to detect the origin and license of binaries, primarily based on the corresponding source code), Web-based tools and APIs (to expose the tools and libraries as Web Services) and low-level data structures for efficient matching (such as high performance string search automatons).
Incoming students will need the following skills:
- Intermediate to strong Python programming. For some projects, strong C/C++, Go or Rust may be needed.
- Familiarity with git as a version control system. Take the time to learn git!
- Ability to set up your own development environment
- An interest in open source security, licensing and generally software composition analysis.
We are happy to help you get up to speed, and the more you are able to demonstrate ability and skills in advance, the more likely we are to choose your application!
We expect your application to be in the range of 1000 words. Anything less than that will probably not contain enough information for us to determine whether you are the right person for the job. Your proposal should contain at least the following information, plus anything you think is relevant:
- Your name
- Title of your proposal
- Abstract of your proposal
- Detailed description of your idea, including an explanation of why it is innovative and what it will contribute to the project
  - hint: explain your data structures and your planned main processing flows in detail.
- Description of previous work, existing solutions (links to prototypes, bibliography are more than welcome)
- Mention the details of your academic studies, any previous work, internships
- Relevant skills that will help you to achieve the goal (programming languages, frameworks)
- Any previous open-source projects (or even previous GSoC) you have contributed to and links.
- Do you plan to have any other commitments during GSoC that may affect your work? Any vacations/holidays? Will you be available full time to work on your project? (Hint: do not bother applying if this is not a serious main time commitment during the GSoC time frame)
Join the chat online or by IRC at https://gitter.im/aboutcode-org/discuss, introduce yourself and start the discussion!
The best way to demonstrate your capability would be to submit a small patch
ahead of the project selection for an existing issue or a new issue.
We will always consider and prefer a project submission where you have
submitted a patch over any other submission without a patch.
You can pick any project idea from the list below. If you have other ideas that are not in this list, contact the team first to make sure it makes sense.
Here is a list of candidate project ideas for your consideration. Your own ideas are welcomed too! Please chat about them to get early feedback!
NOTE: these ideas are not sorted in any specific order of importance or priority... we are working to improve this.
There are two project lengths:
- Medium (~175 hours)
- Large (~350 hours)
If you are proposing an idea from this ideas list, it should match what is listed here, and additionally please have a discussion with the mentors about your proposed length and timeline.
We have marked our ideas with medium/large but this is tentative and a best guess only. Often both sizes are used to mark a project, as it can be either. Still, most of these are on the larger side: these are large, complex projects, and if you are proposing a medium length project you are likely underestimating the complexity (and how much we'll bug you to make sure everything is up to our standards).
Please also note that the stipend differs based on the project length you select.
We have limited ways to navigate the data in ScanCode.io. The goal of these project(s) is to improve the UI in several areas and in particular:
- enabling better linking to resource details from the graphics view
- provide streamlined simpler resource views that only display the important data and have fewer details (but still provide ways to drill down)
- improve the way match details are visualized in a single resource page such that it is more obvious which license and which copyright were detected where, and the actual license scoring is
This can be a large or medium size project.
This project is to create a new web application in ScanCode.io to help reach conclusions on an analysis project with respect to the origin, license or vulnerabilities of a codebase. This is an important project that comprises:
- design the data models for conclusions
- create a mini framework to run "bots" that can automate reaching "conclusions" on licensing and origin including spotting issues and exceptions
- create the UI to visualize these conclusions and eventually update them by hand
This is a large size project.
ScanCode.io / ScanCode Toolkit: Create web application to scan and review a single license text
Create a web app and JSON REST API to detect licenses in any text, and submit bugs (SCTK-aaS).
This project is to create a web-based mini application to be plugged in ScanCode.io as a new Django app with its own model that would receive a text as an input and run ScanCode Toolkit license and copyright detection on this text. It would display the results such that the matched texts and all matching details are easy to understand and make it visually obvious what was matched and how it was matched. It should also make it easy to provide feedback if the detection is not correct, and to suggest a better detection and possibly the creation of a new license detection rule. It should also integrate and run the "scancode-analyzer" https://github.com/nexB/scancode-analyzer/ to find possible issues automatically. Finally, it should make it easy to report a license detection issue directly from the app, based on the results.
See also: Prototype a license detection view for scancode.io
This is a large or medium size project idea.
This project should extend ScanCode.io such that it can use external storage for the scanned code. The problem is that when you run a large number of projects the volume of storage that is used in ScanCode.io grows a lot. For this we can now archive projects, but we cannot archive the corresponding code that was scanned. The goal of this project is to add a new option in ScanCode.io to also archive to some blob storage the code that was scanned such that:
- this can be done at the same time a project is archived
- it can be possible to restore from this archival a state that is essentially the same as the original project state in terms of files and data
- it would mean to archive the code input of a project or the whole workspace of a project
- as a bonus it should also export the projects data, codebase resources, packages and other models, such that this can be imported in another instance of the same version of Scancode.io
This is a medium or large size project idea.
This project would add SPDX and CycloneDX reporting options to a ScanCode.io project. It could and should reuse as much as possible ScanCode Toolkit reporting capabilities for SPDX and CycloneDX.
This is a medium size project idea.
This project should create a new framework for advanced ScanCode.io pipelines such that it becomes possible to:
- include pluggable new data models specific to a pipeline (for instance to store the debug symbols found in a binary file)
- add pluggable UI for a pipeline that would include ways to navigate the data models
- add pluggable reporting for a pipeline that would include standard reports
As a practical implementation, this project should implement a concrete UI and extension to store and display extended information for Docker images and VM image projects such as the OS, FS and layer details (displayed today as simple plain text)
This is a large size project idea.
ScanCode.io: create a system and web UI to scan ALL the packages from Debian and fix and review all of them
This project would become a prototype to help scan and curate the package licensing of a specific ecosystem. It would include:
- specific pipelines tuned to collect lists of all the packages and organize the scans of these correctly
- specific UI to visualize the queue of scan projects
- specific libraries to detect common licensing issues of this package type
- a UI to organize the community/peer review of all these package scans and issues
- extension to create reports and update the package type manifests (here Debian machine readable copyright files)
- See also: Create web application for massive scanning campaign of a whole package ecosystem
This is a large size project idea.
In particular, this project could add a new pipeline for integration with external matching services. This would include tools such as SoftwareHeritage or Scanoss and other component or package identification integrations. The goal would be to create "pipes" and an improved package scanning pipeline that would include matching.
ScanCode.io: Integrate ORT dependency resolution in ScanCode.io
This is a large project idea.
ScanCode.io: Add web service for software package and project evaluations and comparisons (djangopackages-like)
This project would build on the djangopackages/opencomparison code to provide:
- a general purpose and easy way to create and share package comparison grids
- their scanning integration in ScanCode.io
This is a large or medium project idea.
There are several tasks to perform there and in particular:
- upgrade to the latest version of all the packages
- See Improve Workbench Table View for an extensive description and links to the related issues
- remove the conclusions module
- possibly switch to using TypeScript rather than plain JS
- Also consider alternative UI such as the Opossum UI as a possible merger path https://github.com/opossum-tool/
This is a large size project idea.
ScanCode Toolkit: Create API docs automatically from ScanCode data models
This is a medium to large size project idea.
ScanCode Toolkit/ScanCode.io: Create GitHub SBOM creation action(s)
This is about creating a scan using a GitHub Action, optionally also creating SPDX and CycloneDX outputs.
This is a medium size project idea.
ScanCode Toolkit: Enable detection of private licenses
This is a medium or large size project idea.
This project would design and implement new ways to detect and filter possible ScanCode false positive license detections, possibly using AI and machine learning. A first attempt exists with scancode-analyzer https://github.com/nexB/scancode-analyzer/ and would be furthered and improved. There are also heuristics or rule-based approaches that could be created.
See these issues:
- RFC: Revamp "unknown" license detection: https://github.com/nexB/scancode-toolkit/issues/1675
- RFC: a plan for false positive license detection: https://github.com/nexB/scancode-toolkit/issues/2878
This is a medium or large size project idea.
ScanCode Toolkit: Improve package-specific license detection quality
These can be several medium size project ideas.
See the details page at https://github.com/nexB/aboutcode/wiki/Project-Ideas-Improve-package-license-detection for a list of package types/ecosystems that need some tender loving care.
ScanCode Toolkit: Improve Copyright detection accuracy and speed in ScanCode
This is a medium size project idea.
This project would add support to collect and parse the data from CycloneDX and SPDX SBOMs in a ScanCode Toolkit scan.
This would likely mean to treat these as "package data" as if it were a package manifest.
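As a rough illustration of treating an SBOM as "package data", the sketch below reads the `components` array of a CycloneDX JSON document and maps each entry to a flat dictionary. The function name `cyclonedx_to_package_data` and the output keys are hypothetical and do not reflect the actual ScanCode Toolkit package data model; the sample SBOM is made up for the example.

```python
import json


def cyclonedx_to_package_data(sbom_text):
    """Return one dict per component found in a CycloneDX JSON SBOM."""
    sbom = json.loads(sbom_text)
    if sbom.get("bomFormat") != "CycloneDX":
        raise ValueError("not a CycloneDX JSON document")
    packages = []
    for component in sbom.get("components", []):
        # Hypothetical output shape: only a few common fields are kept.
        packages.append({
            "name": component.get("name"),
            "version": component.get("version"),
            "purl": component.get("purl"),
        })
    return packages


# Made-up sample document used only for this illustration.
SAMPLE = """
{
  "bomFormat": "CycloneDX",
  "specVersion": "1.4",
  "components": [
    {"type": "library", "name": "log4j", "version": "1.2.3",
     "purl": "pkg:maven/log4j/log4j@1.2.3"}
  ]
}
"""

if __name__ == "__main__":
    for package in cyclonedx_to_package_data(SAMPLE):
        print(package)
```

A real implementation would also need to handle SPDX documents, nested components and license fields, which this sketch leaves out.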
This is a large size project idea.
ScanCode Toolkit: Create a high performance multi-pattern matching automaton
This is to have faster license and copyright detection using less memory.
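For readers new to the topic, here is a minimal pure-Python sketch of one well-known multi-pattern matching automaton, Aho-Corasick. It is meant only to show the idea of matching many patterns in a single pass over the text; it is not the ScanCode implementation, and a high performance version would likely use compact arrays and native code rather than Python dicts.

```python
from collections import deque


def build_automaton(patterns):
    """Build Aho-Corasick tables: goto transitions, failure links, outputs."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> state for the longest proper suffix
    out = [[]]    # out[state] -> patterns ending at this state
    for pattern in patterns:
        state = 0
        for char in pattern:
            if char not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][char] = len(goto) - 1
            state = goto[state][char]
        out[state].append(pattern)
    # Breadth-first traversal: failure links are computed level by level.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for char, nxt in goto[state].items():
            queue.append(nxt)
            fallback = fail[state]
            while fallback and char not in goto[fallback]:
                fallback = fail[fallback]
            fail[nxt] = goto[fallback].get(char, 0)
            # Inherit the patterns that end at the suffix state.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text, automaton):
    """Return (start offset, pattern) for every match, in one pass over text."""
    goto, fail, out = automaton
    state = 0
    matches = []
    for index, char in enumerate(text):
        while state and char not in goto[state]:
            state = fail[state]
        state = goto[state].get(char, 0)
        for pattern in out[state]:
            matches.append((index - len(pattern) + 1, pattern))
    return matches


if __name__ == "__main__":
    automaton = build_automaton(["he", "she", "his", "hers"])
    print(find_all("ushers", automaton))
```

With the classic textbook input above, the search reports `she` at offset 1, then `he` and `hers` at offset 2. The cost of the search is proportional to the text length plus the number of matches, regardless of how many patterns are loaded.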
This is a large size project idea.
This project is about the integration of multiple existing plugins and tools with a singular goal: to find which source code was used to create a compiled binary, using symbols, debug symbols, strings or more.
This is a large size project idea and this requires quite a bit of knowledge of binaries and source and build processes.
This project would implement a Language Server Protocol server for license and copyright that would be using ScanCode Toolkit and provide live license and copyright feedback directly in IDEs. It would also provide a plugin for integration in at least one IDE such as Atom, VSCode or Eclipse.
This is a large size project idea.
There are two main categories of projects for VulnerableCode:
- A. COLLECTION: this category is to mine and collect or infer more new and improved data. This includes collecting new data sources, inferring and improving existing data or collecting new primary data (such as finding a fix commit of a vulnerability)
- B. USAGE: this category is about using and consuming the vulnerability database and includes the API proper, the GUI, the integrations, and data sharing, feedback and curation.
against other tools: the "vulntotal" project! (Category A)
There are several "free" vulnerability check tools and services such as OWASP Dependency-Check and Dependency-Track, deps.dev, osv.dev, Sonatype OSS Index, GitLab Vulnerability DB, GitHub Dependabot, and more (Snyk, BlackDuck, WS, etc.). The goal of this project would be to cross-validate against these services and databases, VirusTotal-style, to report whether a queried package/version is found as vulnerable.
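One possible shape for such a cross-check is sketched below. Each data source is modeled as a plain callable that takes a purl and returns a set of vulnerability ids. The source names, the in-memory "databases" and the ids are all made up for the illustration; no real service API is called or implied.

```python
def cross_check(purl, sources):
    """Query every source for `purl` and tabulate what each one reports."""
    report = {}
    for name, lookup in sources.items():
        try:
            report[name] = sorted(lookup(purl))
        except Exception as error:  # one failing service should not stop the rest
            report[name] = f"error: {error}"
    return report


def disagreements(report):
    """Return the ids reported by some sources but not by all of them."""
    results = [set(ids) for ids in report.values() if isinstance(ids, list)]
    if not results:
        return set()
    return set.union(*results) - set.intersection(*results)


# Hypothetical stand-ins for real services.
FAKE_DB_A = {"pkg:pypi/example@1.0": {"VULN-0001", "VULN-0002"}}
FAKE_DB_B = {"pkg:pypi/example@1.0": {"VULN-0001"}}
SOURCES = {
    "source_a": lambda purl: FAKE_DB_A.get(purl, set()),
    "source_b": lambda purl: FAKE_DB_B.get(purl, set()),
}

if __name__ == "__main__":
    report = cross_check("pkg:pypi/example@1.0", SOURCES)
    print(report)
    print("only reported by some sources:", disagreements(report))
```

Keeping each source behind the same small callable interface would make it easy to add or remove services, and the disagreement set is where the most interesting review work would start.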
This is a medium or large size project idea.
The goal of this project is to detect vulnerable packages found in a ScanCode.io project. See https://github.com/nexB/aboutcode/wiki/Project-Ideas-VulnerableCode-ScanCode.io-CI-integration
The project would be to provide a way to effectively mine issues (such as GitHub issues) for possible unreported vulnerabilities.
For a start this should be focused on a few prominent repos. This project should also find Fix Commits.
This could use NLP and machine learning to "understand" vulnerability descriptions: Often security advisories do not provide structured information on which package and package versions are vulnerable.
We could create a system which would infer vulnerable package name and version(s) by parsing the vulnerability description using natural language processing techniques and heuristics.
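A heuristic baseline could start with a handful of regular expressions before any machine learning is involved. The sketch below pulls version constraints out of free-text phrases such as "before 2.4.1" or "through 1.2.3". The phrase list and the `extract_constraints` name are assumptions made for this example; real advisories use many more wordings and would need a much richer rule set.

```python
import re

# Dotted numeric versions only, e.g. 1.2 or 2.4.1 (a deliberate simplification).
VERSION = r"\d+(?:\.\d+)+"

# Each entry maps a phrase pattern to the comparison operator it implies.
PATTERNS = [
    (re.compile(rf"\b(?:before|prior to)\s+(?:version\s+)?({VERSION})", re.I), "<"),
    (re.compile(rf"\b(?:through|up to and including)\s+(?:version\s+)?({VERSION})", re.I), "<="),
]


def extract_constraints(description):
    """Return (operator, version) pairs found in an advisory description."""
    constraints = []
    for regex, operator in PATTERNS:
        for match in regex.finditer(description):
            constraints.append((operator, match.group(1)))
    return constraints


if __name__ == "__main__":
    text = "Example Lib before 2.4.1 allows remote attackers to cause a crash."
    print(extract_constraints(text))
```

Such rules would also give a cheap way to label training data for a later NLP model.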
This is a large size project idea.
VulnerableCode: Add more data sources and mine the graph to find correlations between vulnerabilities (Category A)
See https://github.com/nexB/vulnerablecode#how for background info. We want to search for more vulnerability data sources and consume them. There is a large number of pending tickets for data sources.
This is a large size project idea.
The goal of this web app (integrated in the core VulnerableCode) would be to assist in the curation of vulnerabilities and the operation of VulnerableCode.
The UI would enable reviewers to triage, refine, improve and curate vulnerability data. This could include linking and displaying remote references in place.
The UI should also help display importer and improver errors and provide a way to act on these to fix errors that require data resolution.
There are also data models needed to support an efficient review queue.
This is a large size project idea.
VulnerableCode: Create a UI to browse and query vulnerabilities and vulnerable packages (Category B)
We have today a minimal UI that needs TLC to be usable.
The goal of this project is to:
- design the user experience for vulnerable packages and vulnerability navigation and query
- implement this user experience
This is a large project
A key attraction of VulnerableCode is its built-in support for purl. The goal of this project is to make purl more accessible and visible and:
- enhance the purl2url and url2purl support of the packageurl Python library such that it can process more common package types
- enhance the packageurl Python library to convert more purl-like data to purl and in particular the OSV format, the new NVD 5.0 reference, the ORT coordinates, etc.
- enhance the purl2cpe VulnerableCode utility such that it can process more cases to create better purls. Create a script to publish a continuously updated repository with the purl2cpe data.
- expose a url2purl API service in VulnerableCode to help create correct purls
- expose a purl2url API service in VulnerableCode to help return a list of URLs given a purl.
- publish
This is a large size project idea.
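To make the url2purl direction concrete, here is a small stand-alone sketch that maps two well-known package page URL layouts to purl strings. It does not use or reproduce the packageurl Python library; the function name `sketch_url2purl` is hypothetical, and only the PyPI and RubyGems page layouts are handled.

```python
from urllib.parse import urlparse


def sketch_url2purl(url):
    """Map a few well-known package page URLs to a purl string, else None."""
    parsed = urlparse(url)
    segments = [segment for segment in parsed.path.split("/") if segment]
    # https://pypi.org/project/<name>/[<version>/]
    if parsed.netloc == "pypi.org" and len(segments) >= 2 and segments[0] == "project":
        # PyPI names in a purl are lowercased, with underscores turned into dashes.
        name = segments[1].lower().replace("_", "-")
        version = segments[2] if len(segments) > 2 else None
        return f"pkg:pypi/{name}@{version}" if version else f"pkg:pypi/{name}"
    # https://rubygems.org/gems/<name>[/versions/<version>]
    if parsed.netloc == "rubygems.org" and len(segments) >= 2 and segments[0] == "gems":
        name = segments[1]
        if len(segments) >= 4 and segments[2] == "versions":
            return f"pkg:gem/{name}@{segments[3]}"
        return f"pkg:gem/{name}"
    return None


if __name__ == "__main__":
    print(sketch_url2purl("https://pypi.org/project/packageurl-python/"))
    print(sketch_url2purl("https://rubygems.org/gems/rails/versions/7.0.0"))
```

An API service would essentially wrap a much larger table of such per-ecosystem rules behind a single endpoint.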
VulnerableCode: Create an improver to infer the affected ranges for advisory data sources that only give fixed versions (Category A)
Take for example the AlpineLinux importer: we can only get the fixed versions from it. The aim of this project is to make a generic improver to infer the affected ranges with the help of fixed versions.
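The simplest possible inference can be sketched as follows, under two loud assumptions: every known version older than the fixed version is affected (which may be wrong when a vulnerability was introduced late), and versions are plain dotted numbers. Real ecosystems need proper version semantics, for instance from the univers library, rather than the naive `version_key` used here.

```python
def version_key(version):
    """Naive key for dotted numeric versions such as '1.2.10' (assumption)."""
    return tuple(int(part) for part in version.split("."))


def infer_affected(known_versions, fixed_version):
    """Return the known versions strictly older than the fixed version."""
    fixed = version_key(fixed_version)
    return sorted(
        (version for version in known_versions if version_key(version) < fixed),
        key=version_key,
    )


if __name__ == "__main__":
    known = ["1.0", "1.2.10", "1.2.9", "1.3"]
    print(infer_affected(known, "1.2.10"))
```

In this example the inferred affected versions are `1.0` and `1.2.9`, which could equally be expressed as the range `<1.2.10`.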
VulnerableCode: return SPDX or CycloneDX report for VEX (VEX: Vulnerability Exploitability eXchange) (Category B)
Create scanners which would verify whether a codebase is vulnerable to a vulnerability. Once we know that a vulnerable package is in use, a scanner could check for whether the vulnerable code is called, or if environmental conditions or configuration are conducive to the vulnerability, etc. This could be based on yara rules, OpenVAS or similar. Or based on Eclipse Steady and deeper code analysis, static or dynamic.
DependentCode: Create a mostly universal Package dependencies resolver
FetchCode: Automatically fetch code archives given a purl
FetchCode/ScanCode.io: Build a SoftwareHeritage API Client library
AboutCode Toolkit: Improve AboutCode Toolkit with New/Enhanced Features
Univers: Validate that the univers library can handle all the versions and ranges of all the packages!
Project(s) in this domain would consist of building test suites for all the versions in univers and fixing the failures they uncover. Practically, this means downloading all the versions of all the packages of an ecosystem (for instance PyPI) and validating that we can compare versions as well as the reference package management tool for this ecosystem. For instance in Alpine, see https://git.alpinelinux.org/apk-tools/tree/test/version.sh?h=v2.12.9
Some specific highlights would cover:
- writing code that can collect the list of all the versions of all the packages in a given package ecosystem (for instance PyPI, npm, etc.). This code would likely be in FetchCode or the ScanCode Toolkit packagedcode module. This could be extended to collect all the version ranges.
- write an automated test harness to ensure that the univers library can properly parse (and unparse) all the versions and version ranges of all the packages.
- write an automated test harness to ensure that the univers library can properly sort all the versions of each package in an ecosystem.
- Update the univers library accordingly and create a unit test suite as needed
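A comparison harness of this kind could be as small as the sketch below. Test cases are `(left, operator, right)` triples, and the `key` argument stands in for whatever version class or sort key is under test. The `naive_key` function and the sample cases are assumptions made for the illustration and are not part of univers.

```python
import operator

OPERATORS = {"<": operator.lt, "=": operator.eq, ">": operator.gt}


def naive_key(version):
    """Stand-in for the implementation under test: dotted integers only."""
    return tuple(int(part) for part in version.split("."))


def run_cases(cases, key):
    """Return the cases where `key` disagrees with the expected comparison."""
    failures = []
    for left, expected, right in cases:
        try:
            passed = OPERATORS[expected](key(left), key(right))
        except ValueError:  # the key could not parse one of the versions
            passed = False
        if not passed:
            failures.append((left, expected, right))
    return failures


# Made-up reference cases; a real suite would be collected from an ecosystem.
CASES = [
    ("1.0", "<", "1.1"),
    ("1.10", ">", "1.9"),
    ("2.0", "=", "2.0"),
    ("1.0rc1", "<", "1.0"),
]

if __name__ == "__main__":
    for failure in run_cases(CASES, naive_key):
        print("FAILED:", failure)
```

Here the naive key fails only on the pre-release case `1.0rc1`, which is exactly the kind of gap a large, ecosystem-wide test set is meant to surface.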
CommonCode: Package name and version inference from a file name: get package name and version reliably
This project would provide a more reliable way to infer a package name and version from a package archive name. For instance the simple case of "log4j-1.2.3.jar" could yield type:maven, name:log4j, version:1.2.3. Existing regex-based code in commoncode at https://github.com/nexB/commoncode/blob/main/src/commoncode/version.py is a bit complex to maintain. The project could possibly use some machine learning. In all cases, part of the project is to collect a test dataset of a large number of released archive names from various sources (sf.net, SWH, Debian, Fedora) to use as a test set (and possibly a training set for ML).
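For illustration only, a deliberately small regex baseline for the simple cases might look like the sketch below. It is not the existing commoncode code; the suffix-to-type table, the regex and the `infer_package` name are assumptions, and it will get many real-world names wrong (wheels with platform tags, for example), which is why a large test dataset matters.

```python
import re

# Hypothetical mapping from archive suffix to a guessed package type.
SUFFIX_TYPES = {
    ".tar.gz": None,
    ".tar.bz2": None,
    ".tar.xz": None,
    ".zip": None,
    ".jar": "maven",
    ".gem": "gem",
}

# Shortest possible name, a "-" or "_" separator, then a version starting with a digit.
NAME_VERSION = re.compile(r"^(?P<name>.+?)[-_]v?(?P<version>\d[\w.+~-]*)$")


def infer_package(filename):
    """Guess type, name and version from an archive file name, else None."""
    package_type, stem = None, filename
    for suffix, guessed_type in SUFFIX_TYPES.items():
        if filename.lower().endswith(suffix):
            package_type, stem = guessed_type, filename[: -len(suffix)]
            break
    match = NAME_VERSION.match(stem)
    if match is None:
        return None
    return {"type": package_type, "name": match["name"], "version": match["version"]}


if __name__ == "__main__":
    print(infer_package("log4j-1.2.3.jar"))
    print(infer_package("python-dateutil-2.8.2.tar.gz"))
```

Running a baseline like this against the collected dataset would give a first accuracy figure to beat with better rules or a learned model.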
Browsing pkg:pypi/packageurl-python/ should go to https://pypi.org/project/packageurl-python/
And create/register a Duck Duck Go bang mapper for https://github.com/package-url/packageurl-python/blob/main/src/packageurl/contrib/purl2url.py Also add multiple URLs in purl2url.py
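The mapping described above can be sketched in a few lines. This stand-alone example does not reproduce the purl2url.py module linked above; the function name `sketch_purl2url` is hypothetical, only two package types are handled, and qualifiers and subpaths are simply dropped.

```python
def sketch_purl2url(purl):
    """Return a browsable URL for a few purl types, else None."""
    if not purl.startswith("pkg:"):
        raise ValueError("a purl must start with the 'pkg:' scheme")
    remainder = purl[len("pkg:"):]
    # Drop subpath and qualifiers: this sketch only uses type, name and version.
    remainder = remainder.split("#", 1)[0].split("?", 1)[0].strip("/")
    package_type, _, path = remainder.partition("/")
    name, _, version = path.partition("@")
    name = name.strip("/")
    if package_type == "pypi":
        url = f"https://pypi.org/project/{name}/"
        return f"{url}{version}/" if version else url
    if package_type == "gem":
        url = f"https://rubygems.org/gems/{name}"
        return f"{url}/versions/{version}" if version else url
    return None


if __name__ == "__main__":
    print(sketch_purl2url("pkg:pypi/packageurl-python/"))
    print(sketch_purl2url("pkg:pypi/django@4.0.3"))
```

Returning a list of candidate URLs instead of a single one would be a natural extension for the "multiple URLs" point.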
See https://github.com/nexB/aboutcode/wiki/Project-Ideas-Project-popularity