Improved mechanism for integrating malware checks and other package checking tools

**What's the problem this feature will solve?**
This system will provide a foundation for integrating various checks to be performed when a new package (or a new version of an existing package) is uploaded. Additionally, the same mechanism can be reused to generate a feed of changes or to notify interested parties when a new package version is released. Existing tools such as OpenSSF Package-Feed rely on the XML RSS feed, which is not suitable for machine processing and automated systems (e.g. free-text descriptions, no links or checksums to identify the exact files changed, etc.).

PyPI currently does not support a means of integrating various checks to be performed when a new package or a new package version is uploaded. Additionally, there is no current mechanism to generate a feed of changes or a notification mechanism for parties to know when a new package version is released.

The current malware analysis system has severe limitations for integration purposes and future development. The first and biggest limitation is that the current malware checks function as plugins/modules that are called as normal Python functions (i.e. they live in the same codebase as Warehouse). In practice this means that a malware check has to be part of the same environment/installation, including the same Python interpreter, dependencies, running process, synchronous mode, and OS with its available resources. It may be impossible to integrate advanced tools/checks, as they have their own dependencies, supported Python versions, or even requirements for a different OS. In some cases it may even be dangerous to run the malware check on the same system as the production Warehouse frontend; dynamic malware analysis, with its risk of sandbox escapes, is one example. Separating the malware check system from Warehouse would make it much more flexible to integrate with external tools/products while minimizing security concerns. The communication protocol (webhook format and async responses) can be designed to also address several other concerns and drawbacks of the current system:

- **Management of malware checks and package checks** - a new check for automated filtering/flagging of packages can be added just by appending a new endpoint to the list of checks, as long as it supports the schema of the protocol.
- **High rate of false positives** - check endpoints can be enabled in a monitoring-only mode, which would only log the verdicts/output so that the number of alerts fired over time can be assessed and the configuration tuned.
- **Async mode** - in some cases it is not possible to reply in real time with a verdict/data about the uploaded/changed package. This is common for dynamic malware analysis or some advanced static analysis checks, which may take several minutes to complete. Async mode can be enabled by providing a pingback webhook that the check contacts later with its response, instead of replying immediately when the webhook is called.
- **Parallelism** - the endpoints for checking the package are expected not to rely on the order or results of previous checks. This would allow all configured checks to run in parallel, minimizing the time for the whole analysis pipeline to complete by contacting all webhook endpoints simultaneously.
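
The parallel and monitoring-only behaviour described in the list above could look roughly like the following. This is a minimal sketch, not Warehouse code: `CheckEndpoint`, `run_checks`, and the idea that an endpoint is a plain callable returning a status string are all assumptions made for illustration. A real implementation would send HTTP requests and handle timeouts and failures.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable


@dataclass
class CheckEndpoint:
    """Hypothetical description of one configured check."""
    name: str
    call: Callable[[dict], str]  # stands in for the webhook request; returns a status
    monitor_only: bool = False   # if True, the verdict is logged but never enforced


def run_checks(endpoints, payload):
    """Contact all endpoints in parallel and split verdicts into enforced/logged."""
    enforced, logged = {}, {}
    with ThreadPoolExecutor(max_workers=max(1, len(endpoints))) as pool:
        # Checks are independent of each other, so they can all be submitted at once.
        futures = {ep.name: pool.submit(ep.call, payload) for ep in endpoints}
    # Leaving the "with" block waits for every submitted check to finish.
    for ep in endpoints:
        status = futures[ep.name].result()
        (logged if ep.monitor_only else enforced)[ep.name] = status
    return enforced, logged
```

Only the `enforced` verdicts would influence the upload; the `logged` ones are there to measure how often a new check fires before it is switched on.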

**Describe the solution you'd like**
Implement a feature that allows defining a set of audit hooks to be called when a specific action occurs, namely: a package has been uploaded, or a new package has been created (e.g. a package namespace has been reserved). This initial list of events can be extended later.

Proposed parts of the system could be the following:
- a webhook generator that generates a payload with data such as the package name, audit event id/name, package version, location of the package (URL), checksum, and in general all the information available in the package JSON file (e.g. the information listed in the "urls" section of https://pypi.org/pypi/requests/json ) for a given release / dist file, if available
- given a webhook payload generator, a webhook management system would be needed as well, where administrators of PyPI can configure these webhooks; the configured webhooks would then be iterated in a defined order and the generated payload sent to each of them
- webhooks may respond with a status that may affect the package that is being uploaded, such as
  - OK: everything is alright, no action needed
  - WARNING: the package may proceed to be uploaded; however, some remarks may be displayed to the author and/or PyPI admins on the package details page. Such cases may include a wheel package that does not fully conform to the PEP standard, invalid checksums (RECORD in wheel), or various files not expected to be in a package (venv directories, sensitive files, leaked credentials, etc.)
  - BLOCKED: package publishing has been blocked and must be approved by a person with the appropriate privileges (PyPI admins). These may be cases such as high-confidence identification of malicious code, a typosquatting package targeting another highly popular project, etc.
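
To make the payload and the OK/WARNING/BLOCKED statuses above more concrete, here is a minimal sketch. The payload field names, the `build_payload` and `overall_status` helpers, and the "most severe status wins" rule are assumptions for illustration only. The `file_info` argument is assumed to look like one entry of the "urls" list in the package JSON file.

```python
import json
from enum import IntEnum


class Status(IntEnum):
    """Statuses from the proposal, ordered by severity."""
    OK = 0
    WARNING = 1
    BLOCKED = 2


def build_payload(event, name, version, file_info):
    """Assemble a webhook payload (hypothetical schema) for one dist file."""
    return {
        "event": event,
        "name": name,
        "version": version,
        "url": file_info["url"],
        "sha256": file_info["digests"]["sha256"],
        "packagetype": file_info["packagetype"],
    }


def overall_status(statuses):
    """Combine the statuses of all webhooks; the most severe one decides."""
    return max(statuses, default=Status.OK)


if __name__ == "__main__":
    file_info = {
        "url": "https://example.invalid/example-1.0.tar.gz",  # placeholder URL
        "digests": {"sha256": "0" * 64},                      # placeholder digest
        "packagetype": "sdist",
    }
    print(json.dumps(build_payload("file-uploaded", "example", "1.0", file_info), indent=2))
    print(overall_status([Status.OK, Status.WARNING]).name)
```

Ordering the statuses as integers keeps the aggregation to a single `max` call, and leaves room for further statuses to be slotted in later.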


**Additional context**
There are also various other scenarios where this audit hook system can be reused:
- typosquatting; external auditing system that takes only package name into consideration

- auditable verdicts; communication between warehouse and external webhook can be logged to provide visibility and auditability into why specific actions took place and by making this log publicly available, the system can be independently monitored/audited

- providing package feeds; there are already several systems, projects, and companies that rely on having an accurate feed of packages uploaded to PyPI. The most common approach, and also the recommended one according to the docs, is to use an XML feed that provides a list of recently updated packages. However, this XML feed does not appear to be reliable enough (it appears to be capped at 100 and can easily be overflowed during a targeted attack, so that a package would never appear in the feed) and does not hold all the information needed to, for example, scan a package's content (no URL, checksum, or package type [sdist/wheel]). An alternative is to use the XML-RPC API, which provides the needed information, but users are discouraged from using it as it is scheduled to be deprecated (https://warehouse.readthedocs.io/api-reference/xml-rpc.html#pypi-s-xml-rpc-methods), hence it is not suitable for production use or for a system that needs to rely on it. Currently, the only reliable mechanism appears to be the mirror sync protocol or a tool built on it (for example Bandersnatch). The webhook mechanism can easily be repurposed for this use case by attaching an external system/listener that would generate the publicly accessible package feed directly from the webhook payload, or even forward/proxy the webhook to other external systems and integrations such as IFTTT, Slack webhooks, CI pipeline triggers, and so on. This would allow users to subscribe to a global package feed or to a specific project feed for any changes (which is already available in the RSS feed but, again, with very limited information) while making the process more efficient: the user/target is not required to constantly poll PyPI for changes (pull mechanism) but is instead notified when a change occurs (push mechanism).
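
As an illustration of the feed use case, a listener attached to the webhook could fan events out to subscribers of either the global feed or a single project. This is only a sketch under assumptions: the `FeedListener` class, the `"*"` key for the global feed, and the payload field names are hypothetical and not part of any existing PyPI API.

```python
from collections import defaultdict

# Fields a feed consumer needs in order to fetch and verify the exact file.
FEED_FIELDS = ("name", "version", "url", "sha256", "packagetype")


class FeedListener:
    """Hypothetical external listener that turns webhook calls into push notifications."""

    def __init__(self):
        # Maps a project name, or "*" for the global feed, to subscriber callbacks.
        self._subscribers = defaultdict(list)

    def subscribe(self, callback, project="*"):
        self._subscribers[project].append(callback)

    def handle_webhook(self, payload):
        """Build a feed entry from the payload and push it to matching subscribers."""
        entry = {field: payload[field] for field in FEED_FIELDS}
        targets = self._subscribers.get("*", []) + self._subscribers.get(payload["name"], [])
        for callback in targets:
            callback(entry)
        return len(targets)
```

A callback here could just as well forward the entry to a chat webhook or trigger a CI pipeline; the point is that subscribers are notified instead of polling.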

There are also several aspects of the webhook mechanism to consider. The reply from the external audit system carries the status codes/verdicts that affect the uploaded package. It should preferably be in JSON format and accompanied by additional data (just returning OK/WARNING/BLOCKED is not enough). This additional data should include all the necessary information about the verdict and, if possible, all the information that was used to generate it. It is also very likely that a system will generate more than one verdict, such as multiple warnings when checking the wheel package format, so the response should be an array of elements (objects/dicts) instead of a single element.

The last aspect to consider is that in some cases, especially dynamic malware analysis or other package introspection/file scanning, an immediate response to a webhook with verdicts may not be possible. Dynamic malware analysis may often take several minutes to complete, which is not feasible for real-time communication and should instead be handled by an additional async response mechanism. This delayed async response may be easily implemented by providing an API endpoint that the auditing system would contact once the analysis is completed. The auditing system may even indicate upfront, when replying to the webhook request, that a response will arrive later (e.g. by using an additional status like "PENDING"/"ASYNC" or something similar).
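
A response along these lines could be validated as sketched below. The exact JSON shape, the `status` key, and the choice of `"PENDING"` as the marker for a delayed answer are assumptions for illustration; the proposal leaves the schema open.

```python
import json

# OK/WARNING/BLOCKED come from the proposal; PENDING is the assumed async marker.
VALID_STATUSES = {"OK", "WARNING", "BLOCKED", "PENDING"}


def parse_response(body):
    """Parse a webhook reply given as a JSON array of verdict objects.

    Returns (verdicts, is_pending), where is_pending means at least one
    verdict will be delivered later through the async response endpoint.
    """
    verdicts = json.loads(body)
    if not isinstance(verdicts, list):
        raise ValueError("expected a JSON array of verdicts")
    for verdict in verdicts:
        if not isinstance(verdict, dict) or verdict.get("status") not in VALID_STATUSES:
            raise ValueError(f"invalid verdict: {verdict!r}")
    is_pending = any(v["status"] == "PENDING" for v in verdicts)
    return verdicts, is_pending
```

For example, a reply of `[{"status": "WARNING", "detail": "RECORD checksum mismatch"}, {"status": "PENDING"}]` would yield two verdicts and mark the check as still waiting for its final result.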


