The service side of clearlydefined.io
Unless you are working on the service itself, you should not need to run it yourself. Instead, use https://dev-api.clearlydefined.io for experimental work or https://api.clearlydefined.io for working with production data.
If you do want to run the service locally, follow these steps.
- Clone the repo
- Set the following environment variables. For more options, see the config file.
- CURATION_GITHUB_BRANCH= put a branch name here. DON'T use master
- CURATION_GITHUB_TOKEN= personal access token with public_repo scope
- HARVEST_AZBLOB_CONNECTION_STRING= Azure blob connection string
- HARVEST_AZBLOB_CONTAINER_NAME= name of container holding harvested data
- API_TOKEN= the token to use for authorizing clients calling this service
- PORT= Defaults to 3000, like a lot of other dev setups. Set this if you are running more than one service that uses that port.
- Run npm install && npm start
When you are done with that, the service will be up and running at http://localhost:3000 (or whatever port you picked). Note as well that there is a handy VS Code launch configuration that runs the service on port 5000 and makes debugging simple.
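A pre-flight check of the environment can catch a missing variable before the service starts. The following is only a sketch, not part of the service (which is started with npm): the `load_settings` helper is hypothetical, and it assumes that all five variables listed above are required, which this document does not actually state.

```python
import os
from typing import Mapping

# Variable names taken from the list above. Treating every one of them as
# required is an assumption made for this sketch.
REQUIRED = (
    "CURATION_GITHUB_BRANCH",
    "CURATION_GITHUB_TOKEN",
    "HARVEST_AZBLOB_CONNECTION_STRING",
    "HARVEST_AZBLOB_CONTAINER_NAME",
    "API_TOKEN",
)


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Validate the environment and return the settings as a dict."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError("missing environment variables: " + ", ".join(missing))
    if env["CURATION_GITHUB_BRANCH"] == "master":
        raise ValueError("CURATION_GITHUB_BRANCH must not be master")
    settings = {name: env[name] for name in REQUIRED}
    # PORT is optional and defaults to 3000.
    settings["PORT"] = int(env.get("PORT", "3000"))
    return settings
```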
- Scan plugin checks if it has already harvested data for a package by calling GET /harvest/...
- If it's already harvested then it stops processing
- If not it performs the scan and uploads the resulting files by calling PUT /harvest/...
- User visits the site and looks up information about a package which calls GET /packages/...
- This gets the harvested data for a package
- It then normalizes the harvested data to the normalized schema
- It then runs the summarizer to condense the normalized schemas into a single normalized schema for the package
- It then loads any curation patch that applies to the package and patches the normalized schema and returns the end result
- They notice an error and edit a patch. A preview of the result of applying the patch is displayed by calling POST /packages/.../preview with the proposed patch
- They submit the patch which calls PATCH /curations/...
- A pull request is initiated and a build process runs against the patch
- The build gets the normalized schema for each of the patches in the pull request by calling GET /packages/..., gets a preview of the result by calling POST /packages/.../preview, and puts a diff in the PR for a curator to review
- A curator reviews the diff and, if they're happy with the end result, merges the PR
- As an optimization, after the merge we could normalize, summarize, and patch the affected package and store the result. GET /packages/... would then simply read that cache rather than doing the work on the fly
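The work behind GET /packages/... in the flow above (normalize, summarize, patch) can be sketched as a small pipeline. This is an illustration only, not the service's code: `get_package` and its parameters are hypothetical names, and the normalizer, summarizer, and patcher are passed in as plain functions because their real behavior is not specified here.

```python
from typing import Any, Callable, Optional

Document = dict[str, Any]


def get_package(
    harvested: list[Document],
    normalize: Callable[[Document], Document],
    summarize: Callable[[list[Document]], Document],
    apply_patch: Callable[[Document, Document], Document],
    curation: Optional[Document] = None,
) -> Document:
    """Run the three steps in order and return the end result."""
    # 1. Normalize each piece of harvested data to the normalized schema.
    normalized = [normalize(item) for item in harvested]
    # 2. Condense the normalized documents into a single one for the package.
    summary = summarize(normalized)
    # 3. Apply the curation patch, if one applies to this package.
    if curation is None:
        return summary
    return apply_patch(summary, curation)
```

The post-merge optimization described in the last step would amount to calling such a function once after the merge and storing its return value.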
package:
  type: string
  name: string
  provider: string
  revision: string
source_location:
  provider: string
  url: string
  revision: string
  path: string
copyright:
  statements: string[]
  holders: string[]
  authors: string[]
license:
  expression: string
TODO
{
"source_location": {
"provider": "",
"url": "",
"revision": "",
"path": ""
},
"copyright": {
"statements": [],
"holders": [],
"authors": []
},
"license": {
"expression": ""
}
}
As a PATCH, you only need to provide the attributes you want to add or update; any attributes not included will be ignored. To explicitly remove an attribute, set its value to null.
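The patch rules above can be illustrated with a short merge function. This document does not say how nested objects or arrays are handled, so the sketch assumes semantics in the style of JSON Merge Patch (RFC 7386): objects merge recursively, arrays and scalars are replaced wholesale, and null (`None` in Python) removes the attribute. The `apply_patch` name is hypothetical.

```python
from typing import Any


def apply_patch(target: dict[str, Any], patch: dict[str, Any]) -> dict[str, Any]:
    """Return a patched copy of target; target itself is not modified."""
    result = dict(target)
    for key, value in patch.items():
        if value is None:
            # null explicitly removes the attribute.
            result.pop(key, None)
        elif isinstance(value, dict):
            # Merge nested objects attribute by attribute (assumed behavior).
            base = result.get(key)
            result[key] = apply_patch(base if isinstance(base, dict) else {}, value)
        else:
            # Scalars and arrays replace the existing value.
            result[key] = value
    return result


current = {
    "copyright": {"statements": ["(c) A"], "holders": ["A"], "authors": []},
    "license": {"expression": "MIT"},
}
patch = {"copyright": {"holders": None}, "license": {"expression": "Apache-2.0"}}
patched = apply_patch(current, patch)
# patched["copyright"] keeps "statements" and "authors" and loses "holders";
# patched["license"]["expression"] is now "Apache-2.0".
```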
TODO: Make sure the attribute names are consistent with AboutCode/ScanCode
TODO: Include a section where the author's identity and reasoning are provided
TODO
Curation patches will be stored in: https://github.com/clearlydefined/curated-data
type (npm)
  provider (npmjs.org)
    name.yaml (redie)
Note that the package name may contain a namespace portion. If it does, the namespace becomes a directory under provider, and packageName.yaml is stored in the namespace directory. For example, a scoped NPM package would have a directory for the scope under provider, with packageName.yaml in the scope directory. Similarly, for Maven, the groupId would be the namespace and the artifactId would be the packageName.
type (git)
  provider (github.com)
    namespace (Microsoft)
      name.yaml (redie)
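The two layouts above, with and without a namespace, can be expressed as one path-building function. This is a minimal sketch; `curation_path` is a hypothetical helper, not something the curated-data repository provides.

```python
from pathlib import PurePosixPath
from typing import Optional


def curation_path(
    type_: str, provider: str, name: str, namespace: Optional[str] = None
) -> PurePosixPath:
    """Build the relative path of a curation file inside the repository."""
    parts = [type_, provider]
    if namespace:
        # The namespace, when present, is a directory under the provider.
        parts.append(namespace)
    parts.append(f"{name}.yaml")
    return PurePosixPath(*parts)


print(curation_path("npm", "npmjs.org", "redie"))
# npm/npmjs.org/redie.yaml
print(curation_path("git", "github.com", "redie", namespace="Microsoft"))
# git/github.com/Microsoft/redie.yaml
```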
TODO
Harvested data will be stored in: https://github.com/clearlydefined/harvested-data
This location is temporary. As harvested data grows, we will likely need to move it out of GitHub to scale.
type
  provider
    namespace -- if none then set to '-'
      name
        revision
          tool
            toolVersion -- this is the native output file. If more than one file then they should be archived together
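The hierarchy above maps onto a path with seven segments, where a missing namespace becomes '-'. The sketch below assumes the segments are joined with '/' and that toolVersion is the final segment as written; `harvest_path` is a hypothetical helper, and the revision and version values in the usage line are made up for illustration.

```python
from typing import Optional


def harvest_path(
    type_: str,
    provider: str,
    namespace: Optional[str],
    name: str,
    revision: str,
    tool: str,
    tool_version: str,
) -> str:
    """Build the storage path for one tool's output for one package revision."""
    segments = [
        type_,
        provider,
        namespace or "-",  # '-' stands in for "no namespace"
        name,
        revision,
        tool,
        tool_version,
    ]
    return "/".join(segments)


# Illustrative values only: the revision and tool version are invented.
print(harvest_path("npm", "npmjs.org", None, "redie", "1.0.0", "ScanCode", "2.2.1"))
# npm/npmjs.org/-/redie/1.0.0/ScanCode/2.2.1
```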
What term should we use instead of package?
- AboutCode says package
- Concerns that "native" source consumers don't consider what they consume as a package
- Defer decision :)
What to name output files?
- If a single file, then output.ext (e.g. output.json)
- If multiple files, then keep their native names.
How to handle different versions of scanners?
- Results are immutable
- Curations are tied to particular "tool configurations"
- Tool configurations are immutable
- Tool configuration and revision should be captured in the output directory
Do we merge results from different versions of ScanCode? How does this impact curation?
- New scan results flag previous curations as needing review (optimization: only if they change the result)
- The summarization process will be configured as to what tool configurations to consider for summarization (this would need to be a priority list)
- The summarization process should take into account the link from the package back to the source
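The priority list of tool configurations mentioned above could work roughly as follows. This is a sketch of one possible reading, not a decided design: `select_results` is a hypothetical name, and it assumes results are keyed by tool configuration and that configurations absent from the priority list are simply not considered.

```python
from typing import Any


def select_results(
    available: dict[str, Any], priority: list[str]
) -> list[tuple[str, Any]]:
    """Return the results to summarize, highest priority first."""
    return [
        (configuration, available[configuration])
        for configuration in priority
        if configuration in available
    ]


# Configuration names follow the examples given in the terminology section.
available = {
    "ScanCode-2.2.1_fastscan": {"license": {"expression": "MIT"}},
    "ScanCode-2.2.1_deepscan": {"license": {"expression": "MIT"}},
}
priority = ["ScanCode-2.2.1_deepscan", "ScanCode-2.2.1_fastscan"]
ordered = [name for name, _ in select_results(available, priority)]
# ordered == ["ScanCode-2.2.1_deepscan", "ScanCode-2.2.1_fastscan"]
```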
Scanning a package where it's actually the source you need to scan: what to store where? For example, Maven supports scanning the sources JAR or scanning the GitHub repository.
- If we can determine the source location from the package, then we'll queue up the source to be scanned
- If we can't determine the source location, and the package contains "source" then we'll scan the package
- Some scanners will run on packages (such as things that inspect package manifest)
- We should track the link from the package to the source
How to handle tags?
- When we scan GitHub we need to track what tags/branches point at which commits
- We will use the long commit (40 character) reference
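Tracking tags and branches against full commit references might look like the sketch below. The `record_ref` helper and the dictionary it fills are hypothetical; the only rule taken from the text is that the commit must be the long, 40 character form.

```python
import re

# A full Git SHA-1 commit reference: exactly 40 hexadecimal characters.
FULL_COMMIT = re.compile(r"[0-9a-f]{40}")


def record_ref(refs: dict[str, str], ref_name: str, commit: str) -> None:
    """Record that a tag or branch points at a commit, rejecting short hashes."""
    commit = commit.lower()
    if not FULL_COMMIT.fullmatch(commit):
        raise ValueError(f"expected a 40 character commit reference, got {commit!r}")
    refs[ref_name] = commit


refs: dict[str, str] = {}
record_ref(refs, "v1.0.0", "0123456789abcdef0123456789abcdef01234567")
```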
Need to define "origin" and/or pick another term
- Propose to use "provider"
How do we handle case sensitivity?
- If no package managers allow different packages whose names differ only by case, then we can be case preserving and case insensitive
- We need to be case preserving because some registry endpoints are case sensitive (e.g. NPM)
Define how to do the linking
- We will store one end (package to source), we will cache/materialize the reverse link as part of a build job (or the like)
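Materializing the reverse link from the stored package-to-source links is a simple inversion. The sketch below assumes the links are available as a mapping from a package identifier to a source identifier; the function name and the identifiers in the example are hypothetical.

```python
from collections import defaultdict


def materialize_reverse_links(package_to_source: dict[str, str]) -> dict[str, list[str]]:
    """Invert package -> source links into source -> [packages]."""
    reverse: dict[str, list[str]] = defaultdict(list)
    for package, source in package_to_source.items():
        reverse[source].append(package)
    # Sort so that the cached result is stable from one build to the next.
    return {source: sorted(packages) for source, packages in reverse.items()}


links = {"package-a": "source-x", "package-b": "source-x", "package-c": "source-y"}
reverse = materialize_reverse_links(links)
# reverse == {"source-x": ["package-a", "package-b"], "source-y": ["package-c"]}
```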
The format of harvested data is tool-specific. Tool output is stored in the tool's native output format. If there is a choice between multiple output formats then the priorities are:
- Machine-readable
- Lossless
- JSON
- git
- maven
- npm
- nuget
- rubygem
- central.maven.org
- github.com
- npmjs.org
- nuget.org
- ScanCode
- Fossology
- provider - the provider of metadata about the package (e.g. npmjs.org, github.com, nuget.org, myget.org)
- revision - used instead of version because it's a superset and valid for source
- tool configuration - a tuple used to describe the combination of a tool and a named configuration. At a minimum, the named configuration should include the version of the tool, but it could also describe the combination of settings used, for example ScanCode-2.2.1_deepscan and ScanCode-2.2.1_fastscan
- type - the format of the package (e.g. git, maven, npm)
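A tool configuration such as ScanCode-2.2.1_deepscan can be split back into its parts. The `tool-version_name` layout used here is inferred only from the two examples above and is not a defined format, so treat this parser, and the `ToolConfiguration` type, as an assumption.

```python
from typing import NamedTuple


class ToolConfiguration(NamedTuple):
    tool: str
    version: str
    configuration: str


def parse_tool_configuration(text: str) -> ToolConfiguration:
    """Split 'Tool-version_name' into its three parts."""
    # The configuration name follows the last underscore.
    tool_and_version, _, configuration = text.rpartition("_")
    # The tool name precedes the first hyphen; the rest is the version.
    tool, _, version = tool_and_version.partition("-")
    if not (tool and version and configuration):
        raise ValueError(f"not a tool configuration: {text!r}")
    return ToolConfiguration(tool, version, configuration)


print(parse_tool_configuration("ScanCode-2.2.1_deepscan"))
# ToolConfiguration(tool='ScanCode', version='2.2.1', configuration='deepscan')
```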
- Swagger to replace most of this doc
- Complete registries
- Complete terminology
Build and run the container.
docker build -t ort .
docker run --mount type=bind,source="<path to repo>",target=/app ort scanner -d /app/output/package-json-dependencies.yml -o /app/output-scanner