Skip to content

Conversation

@isasmendiagus
Copy link
Contributor

@isasmendiagus isasmendiagus commented Nov 17, 2025

Summary by CodeRabbit

  • New Features

    • Added --license-sources (-ls) option to filter copyleft inspection; supports both "-ls src1 src2" and repeated "-ls src1 -ls src2" syntaxes.
  • Changed

    • Copyleft dataset switched to OSADL authoritative source, expanding coverage (adds missing "-or-later" variants).
    • Copyleft checks now default to component-level licenses to reduce file-level noise.
  • Documentation

    • Added OSADL dataset attribution and license notice to the README.
  • Bug Fixes

    • Fixed terminal cursor disappearing after aborting a scan (Ctrl+C).

@coderabbitai
Copy link

coderabbitai bot commented Nov 17, 2025

Walkthrough

Adds OSADL-backed copyleft data and a --license-sources option to filter which license sources are considered; threads license_sources through Copyleft and ScanResultProcessor; defaults copyleft checks to component-level sources; replaces manual spinner lifecycles with context-managed spinners; fixes cursor-on-Ctrl+C; updates docs, tests, and version.

Changes

Cohort / File(s) Summary
OSADL Copyleft System
src/scanoss/constants.py, src/scanoss/data/osadl-copyleft.json, src/scanoss/osadl.py
Adds OSADL dataset file and Osadl loader; introduces VALID_LICENSE_SOURCES and DEFAULT_COPYLEFT_LICENSE_SOURCES constants.
Copyleft Inspection Core
src/scanoss/inspection/utils/license_utils.py, src/scanoss/inspection/utils/scan_result_processor.py, src/scanoss/inspection/policy_check/scanoss/copyleft.py
Reworks copyleft detection to use OSADL; adds license_sources parameter to Copyleft and ScanResultProcessor; implements source-based selection (_select_licenses) and include/exclude/explicit filters.
CLI Wiring
src/scanoss/cli.py
Adds --license-sources CLI option and passes license_sources into Copyleft inspection paths.
Spinner & Progress Management
src/scanoss/filecount.py, src/scanoss/scanner.py, src/scanoss/scanners/folder_hasher.py, src/scanoss/scanners/scanner_hfh.py, src/scanoss/threadedscanning.py
Replaces manual spinner start/finish with context-managed spinner usage; introduces atexit/__del__ cleanup and async gRPC scan thread pattern in scanner_hfh.
Tests
tests/test_osadl.py, tests/test_policy_inspect.py
Adds unit tests for Osadl and copyleft license-source filtering behavior.
Docs & Metadata
README.md, CHANGELOG.md, LICENSE, src/scanoss/__init__.py
Adds OSADL dataset attribution (CC-BY-4.0), documents new option and dataset, appends dataset notice to LICENSE, and bumps package version to 1.41.0.

Sequence Diagram(s)

sequenceDiagram
    participant CLI
    participant Copyleft
    participant ScanResultProcessor
    participant LicenseUtil
    participant Osadl

    CLI->>Copyleft: init(license_sources)
    Copyleft->>ScanResultProcessor: init(license_sources)
    Copyleft->>LicenseUtil: init(include/exclude/explicit)
    LicenseUtil->>Osadl: lazy load data (on first call)

    Note over ScanResultProcessor,LicenseUtil: For each component
    ScanResultProcessor->>ScanResultProcessor: selected = _select_licenses(licenses_data)
    alt filtering mode (license_sources provided)
        ScanResultProcessor->>ScanResultProcessor: keep licenses with source in license_sources + 'unknown'
    else priority mode (no license_sources)
        ScanResultProcessor->>ScanResultProcessor: use legacy source priority
    end
    ScanResultProcessor->>LicenseUtil: is_copyleft(spdxid)
    alt explicit list present
        LicenseUtil->>LicenseUtil: check explicit_licenses only
    else
        LicenseUtil->>Osadl: is_copyleft(spdxid)
        Osadl-->>LicenseUtil: boolean
    end
    LicenseUtil-->>ScanResultProcessor: copyleft flag
    ScanResultProcessor-->>Copyleft: aggregated results
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Focus review on:
    • src/scanoss/inspection/utils/license_utils.py — filter precedence, explicit vs include/exclude vs OSADL fallback.
    • src/scanoss/inspection/utils/scan_result_processor.py — correctness of _select_licenses and impacts on output ordering.
    • src/scanoss/osadl.py and src/scanoss/data/osadl-copyleft.json — data loading, normalization, and mapping accuracy.
    • CLI wiring in src/scanoss/cli.py and interactions with legacy copyleft paths.
    • Threaded/async scanner changes to ensure spinner cleanup and error paths remain correct.

Possibly related PRs

Suggested labels

enhancement

Suggested reviewers

  • eeisegn
  • matiasdaloia

Poem

🐰 Through data fields I hopped and read,

OSADL maps where licenses tread.
Sources chosen, filters set,
Spinners tidy, cursor met —
A hop, a scan, the tree well fed.

Pre-merge checks and finishing touches

❌ Failed checks (1 inconclusive)
Check name Status Explanation Resolution
Title check ❓ Inconclusive The title "V1.41.0" is a version number that lacks meaningful detail about the actual changes being made in this pull request. Replace the version-only title with a concise summary of the main changes, such as 'Add OSADL copyleft data and license source filtering' or 'Upgrade to OSADL copyleft database with license source filtering options'.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage ✅ Passed Docstring coverage is 87.50% which is sufficient. The required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch v1.41.0

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link

SCANOSS SCAN Completed 🚀

  • Detected components: 1
  • Undeclared components: 0
  • Declared components: 1
  • Detected files: 87
  • Detected files undeclared: 0
  • Detected files declared: 87
  • Licenses detected: 1
  • Licenses detected with copyleft: 0
  • Policies: ✅ 1 pass (1 total)

View more details on SCANOSS Action Summary

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/scanoss/scanner.py (1)

920-923: Wrap long post‑size warning in scan_wfp_file_threaded to fix E501

The post‑size warning line is over the 120‑character limit. Wrapping it avoids the lint error and keeps the message readable:

-                    if self.debug and cur_size > self.max_post_size:
-                        Scanner.print_stderr(f'Warning: Post size {cur_size} greater than limit {self.max_post_size}')
+                    if self.debug and cur_size > self.max_post_size:
+                        Scanner.print_stderr(
+                            f'Warning: Post size {cur_size} greater than limit {self.max_post_size}'
+                        )

This should clear the remaining E501 violation flagged by the pipeline.

src/scanoss/inspection/utils/scan_result_processor.py (1)

167-189: Add file_spdx_tag to the priority sources list

The review's concern is confirmed: file_spdx_tag exists in real test data but is omitted from priority_sources at line 337. When only file_spdx_tag licenses are present, the code incorrectly falls back to returning all licenses instead of prioritizing that source.

Update line 337 from:

priority_sources = ['component_declared', 'license_file', 'file_header', 'scancode']

to:

priority_sources = ['component_declared', 'license_file', 'file_header', 'file_spdx_tag', 'scancode']
🧹 Nitpick comments (9)
src/scanoss/scanners/folder_hasher.py (1)

179-184: Consider adding a comment to clarify the tree-building logic.

The loop appends file_item to each ancestor directory's files list, which means each file appears in multiple directory nodes. While this is intentional for hierarchical folder hashing (where each directory's hash includes all files in its subtree), a brief comment would make the design more immediately clear to future maintainers.

src/scanoss/threadedscanning.py (1)

107-112: Consider alternatives to __del__ for cleanup.

While the __del__ method provides a safety net for progress bar cleanup, Python destructors have unreliable invocation timing and are generally discouraged for critical cleanup logic. Combined with the atexit registration (line 82), this creates dual cleanup paths that may execute in any order or potentially twice.

Consider:

  • Using context managers (__enter__/__exit__) for deterministic cleanup, allowing usage like with ThreadedScanning(...) as scanner:
  • Documenting that callers should explicitly call complete_bar() after scanning completes
  • Choosing a single cleanup mechanism (atexit OR __del__) rather than both

The bare except Exception: is acceptable in destructors since errors during cleanup cannot be meaningfully handled, but it may hide diagnostic information during development.

src/scanoss/filecount.py (1)

152-155: Address PLC0206 by iterating file_types with .items()

You can clear the PLC0206 warning and improve readability by iterating key/value pairs directly:

-            for k in file_types:
-                d = file_types[k]
-                csv_dict.append({'extension': k, 'count': d[0], 'size(MB)': f'{d[1] / (1 << 20):,.2f}'})
+            for extension, (count, size_bytes) in file_types.items():
+                csv_dict.append(
+                    {
+                        'extension': extension,
+                        'count': count,
+                        'size(MB)': f'{size_bytes / (1 << 20):,.2f}',
+                    }
+                )
src/scanoss/scanners/scanner_hfh.py (1)

113-152: Threaded gRPC scan helper is correct; consider aligning spinner gating with other scanners

_execute_grpc_scan cleanly encapsulates the gRPC call and error handling, and the scan() method’s thread + spinner loop is straightforward and safe (join() after the loop guarantees scan_results visibility).

For consistency with Scanner.scan_folder/scan_files and FileCount.count_files, you might optionally gate the spinner on quiet/TTY and use nullcontext:

+import sys
+from contextlib import nullcontext
@@
-        spinner_ctx = Spinner('Scanning folder...')
-
-        with spinner_ctx as spinner:
+        spinner_ctx = (
+            Spinner('Scanning folder...')
+            if (not self.base.quiet and sys.stderr.isatty())
+            else nullcontext()
+        )
+
+        with spinner_ctx as spinner:
             grpc_thread = threading.Thread(target=self._execute_grpc_scan, args=(hfh_request,))
             grpc_thread.start()
@@
-            while grpc_thread.is_alive():
-                spinner.next()
-                time.sleep(0.1)
+            while grpc_thread.is_alive():
+                if spinner:
+                    spinner.next()
+                time.sleep(0.1)

Not required, but it would make spinner behaviour uniform across the codebase.

tests/test_osadl.py (1)

34-99: Good coverage; tweak final shared‑data assertion

The tests exercise the key Osadl behaviors (loading, case‑insensitive lookups, copyleft vs non‑copyleft, and empty/None handling) nicely. The last assertion in test_multiple_instances_share_data:

self.assertIs(Osadl._shared_copyleft_data, Osadl._shared_copyleft_data)

is a tautology and doesn’t verify that instances share the same data. If you want to assert sharing, consider checking through the instances instead, e.g.:

-        self.assertIs(Osadl._shared_copyleft_data, Osadl._shared_copyleft_data)
+        self.assertIs(osadl1._shared_copyleft_data, osadl2._shared_copyleft_data)

or simply rely on the existing behavioral checks and drop this line.

CHANGELOG.md (1)

8-28: Changelog accurately reflects new copyleft and -ls behavior

The Unreleased notes line up with the code: --license-sources is limited to copyleft inspection, OSADL is now the copyleft authority, -or-later variants are called out, and the default of component‑level sources with -ls override matches the CLI wiring. If you want to nitpick wording, “OSADL authoritative copyleft license data” could be hyphenated as “OSADL‑authoritative copyleft license data,” but that’s purely stylistic.

src/scanoss/cli.py (2)

704-715: --license-sources argument is well‑wired; default handled downstream

The -ls/--license-sources definition with action='extend' and nargs='+' correctly supports both syntaxes you document and constrains values via VALID_LICENSE_SOURCES. Since you don’t set a default here, it assumes Copyleft (or its collaborators) treat license_sources=None as “use DEFAULT_COPYLEFT_LICENSE_SOURCES,” which matches the help text. That implicit contract looks fine but is worth keeping in sync if defaults change later.


1758-1771: New license_sources plumbing into Copyleft looks correct

Passing license_sources=args.license_sources into Copyleft(...) completes the CLI→policy wiring for both raw and legacy copyleft commands, and falls back to None when -ls isn’t used. Behaviorally this matches the changelog. As a minor improvement, consider updating the inspect_copyleft docstring to mention the new license_sources option so the function‑level documentation stays aligned with the CLI help.

src/scanoss/osadl.py (1)

69-76: Load failures silently degrade all copyleft checks to “non‑copyleft”

If _load_copyleft_data fails (I/O, JSON error, packaging issue), it logs to stderr and returns False, but __init__ ignores the return value and is_copyleft simply consults an empty _shared_copyleft_data, treating every license as non‑copyleft. That’s a fairly silent failure mode for something described as “authoritative”.

Consider at least surfacing the failure to callers (e.g., via a boolean attribute or raising on first failure), so upstream logic can detect that OSADL data wasn’t available and either abort or emit a clear warning.

For example:

 class Osadl:
@@
-    _shared_copyleft_data = {}
-    _data_loaded = False
+    _shared_copyleft_data: dict[str, str] = {}
+    _data_loaded: bool = False
+    _load_failed: bool = False
@@
     def __init__(self, debug: bool = False):
@@
-        self._load_copyleft_data()
+        if not self._load_copyleft_data():
+            Osadl._load_failed = True
@@
-        if Osadl._data_loaded:
+        if Osadl._data_loaded:
             return True
@@
-            self.print_stderr(f'ERROR: Problem loading OSADL copyleft data: {e}')
-            return False
+            self.print_stderr(f'ERROR: Problem loading OSADL copyleft data: {e}')
+            return False
@@
         Osadl._data_loaded = True
@@
     def is_copyleft(self, spdx_id: str) -> bool:
@@
-        if not spdx_id:
+        if not spdx_id or Osadl._load_failed:
             return False

That keeps behavior simple while making failures visible to policy/CLI layers if they want to react differently.

Also applies to: 92-107, 108-130

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4060c9f and 7455238.

📒 Files selected for processing (16)
  • CHANGELOG.md (1 hunks)
  • README.md (1 hunks)
  • src/scanoss/cli.py (4 hunks)
  • src/scanoss/constants.py (1 hunks)
  • src/scanoss/data/osadl-copyleft.json (1 hunks)
  • src/scanoss/filecount.py (2 hunks)
  • src/scanoss/inspection/policy_check/scanoss/copyleft.py (4 hunks)
  • src/scanoss/inspection/utils/license_utils.py (1 hunks)
  • src/scanoss/inspection/utils/scan_result_processor.py (3 hunks)
  • src/scanoss/osadl.py (1 hunks)
  • src/scanoss/scanner.py (5 hunks)
  • src/scanoss/scanners/folder_hasher.py (1 hunks)
  • src/scanoss/scanners/scanner_hfh.py (2 hunks)
  • src/scanoss/threadedscanning.py (3 hunks)
  • tests/test_osadl.py (1 hunks)
  • tests/test_policy_inspect.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (10)
src/scanoss/osadl.py (1)
src/scanoss/inspection/utils/license_utils.py (1)
  • is_copyleft (71-102)
src/scanoss/inspection/utils/license_utils.py (2)
src/scanoss/osadl.py (3)
  • Osadl (31-130)
  • print_debug (62-67)
  • is_copyleft (108-130)
src/scanoss/scanossbase.py (2)
  • ScanossBase (28-107)
  • print_debug (58-63)
src/scanoss/filecount.py (2)
src/scanoss/scanossbase.py (2)
  • print_msg (51-56)
  • print_trace (65-70)
src/scanoss/cli.py (1)
  • file_count (1312-1343)
src/scanoss/scanners/scanner_hfh.py (2)
src/scanoss/scanossgrpc.py (1)
  • folder_hash_scan (429-447)
src/scanoss/scanossbase.py (1)
  • print_stderr (45-49)
src/scanoss/inspection/utils/scan_result_processor.py (1)
src/scanoss/inspection/utils/license_utils.py (2)
  • LicenseUtil (30-111)
  • init (47-69)
src/scanoss/scanner.py (3)
src/scanoss/file_filters.py (2)
  • get_filtered_files_from_folder (311-352)
  • get_filtered_files_from_files (354-404)
src/scanoss/threadedscanning.py (3)
  • stop_scanning (154-158)
  • queue_add (141-149)
  • run (168-191)
src/scanoss/scanners/scanner_hfh.py (1)
  • scan (126-152)
src/scanoss/inspection/policy_check/scanoss/copyleft.py (1)
src/scanoss/inspection/utils/scan_result_processor.py (1)
  • ScanResultProcessor (53-412)
tests/test_osadl.py (1)
src/scanoss/osadl.py (2)
  • Osadl (31-130)
  • is_copyleft (108-130)
src/scanoss/scanners/folder_hasher.py (2)
src/scanoss/scanossbase.py (1)
  • print_debug (58-63)
src/scanoss/utils/crc64.py (2)
  • CRC64 (29-96)
  • get_hash_buff (82-96)
tests/test_policy_inspect.py (1)
src/scanoss/inspection/policy_check/scanoss/copyleft.py (2)
  • Copyleft (50-240)
  • run (220-240)
🪛 GitHub Actions: Lint
src/scanoss/filecount.py

[warning] 100-100: PLR0912 Too many branches (15 > 12) in count_files.


[warning] 100-100: PLR0915 Too many statements (54 > 50) in count_files.


[warning] 152-152: PLC0206 Extracting value from dictionary without calling .items().

src/scanoss/scanner.py

[error] 397-397: E501 Line too long (121 > 120).


[error] 416-416: E501 Line too long (124 > 120).


[error] 802-802: E501 Line too long (121 > 120).

🪛 LanguageTool
README.md

[uncategorized] ~140-~140: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ...vecommons.org/licenses/by/4.0/) by the [Open Source Automation Development Lab (OSADL) eG](...

(EN_COMPOUND_ADJECTIVE_INTERNAL)


[uncategorized] ~142-~142: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ...g/). Attribution: A project by the Open Source Automation Development Lab (OSADL) eG. ...

(EN_COMPOUND_ADJECTIVE_INTERNAL)

CHANGELOG.md

[uncategorized] ~17-~17: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ... - Copyleft detection now uses [OSADL (Open Source Automation Development Lab)](https://ww...

(EN_COMPOUND_ADJECTIVE_INTERNAL)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build
🔇 Additional comments (12)
src/scanoss/scanners/folder_hasher.py (1)

162-192: LGTM! Context manager improves resource cleanup.

The switch to a context manager for the progress bar ensures proper cleanup even when exceptions occur or the user interrupts with Ctrl+C. Moving the exception handling inside the loop also improves resilience by allowing file processing to continue when individual files fail.

src/scanoss/threadedscanning.py (2)

25-25: LGTM!

The atexit import is necessary for the cleanup registration added below.


81-82: Review comment contains inaccurate claims about double cleanup and idempotency.

The underlying premise is flawed:

  • Bar.finish() from the progress.bar library is already idempotent—it sets an internal flag and safely handles multiple calls. The suggested fix (adding self.bar = None) is unnecessary.
  • The "double cleanup risk" is not a real issue because both the atexit handler and __del__ are guarded by if self.bar:, and the underlying method is designed to handle repeated invocations safely.

The valid concerns here are:

  • The bound method reference in atexit.register(self.complete_bar) does keep self alive until program exit (minor memory lifecycle consideration).
  • The lack of signal handler means atexit may not fire on Ctrl+C (valid but separate from cleanup idempotency).

The suggested fix does not address the actual memory concern and adds unnecessary code given the idempotent nature of Bar.finish().

Likely an incorrect or invalid review comment.

src/scanoss/filecount.py (1)

100-147: Spinner context usage is sound and keeps behaviour intact

Using spinner_ctx = Spinner(...) if (not self.quiet and self.isatty) else nullcontext() and with spinner_ctx as spinner: cleanly scopes spinner lifetime and avoids manual finish() calls. The if spinner: spinner.next() guard correctly handles the nullcontext() branch, and the counting logic is unchanged aside from indentation.

src/scanoss/scanner.py (1)

1051-1061: Spinner context in wfp_folder matches the new pattern and looks correct

The new spinner_ctx = Spinner(...) if (not self.quiet and self.isatty) else nullcontext() plus with spinner_ctx as spinner: pattern is consistent with scan_folder/scan_files and properly guards spinner.next() behind an if spinner check. The fingerprinting logic is unchanged apart from the visual feedback.

src/scanoss/constants.py (1)

21-22: License source constants are well‑defined for downstream filtering

Defining VALID_LICENSE_SOURCES and DEFAULT_COPYLEFT_LICENSE_SOURCES here centralizes the license source vocabulary and matches how Copyleft and ScanResultProcessor._select_licenses use these strings. No issues spotted.

README.md (1)

139-142: Dataset license notice is clear and appropriately attributed

The new section clearly distinguishes the MIT license of the application from the CC‑BY‑4.0 license of the embedded OSADL copyleft dataset and provides proper attribution and source URL. Looks good.

tests/test_policy_inspect.py (1)

31-645: New copyleft license_sources tests give solid coverage of the filtering semantics

Importing DEFAULT_COPYLEFT_LICENSE_SOURCES/VALID_LICENSE_SOURCES and adding the test_copyleft_policy_*license_sources* cases thoroughly exercise:

  • Default vs explicit license_sources (including None).
  • Single‑source and multi‑source scenarios.
  • Interaction with include/exclude filters and different result files.
  • JSON and markdown/Jira markdown outputs.

The expectations (component counts, status values, and allowed license['source'] sets) are consistent with Copyleft and ScanResultProcessor._select_licenses behaviour. This is a good, targeted test extension for the new OSADL‑backed logic.

src/scanoss/inspection/policy_check/scanoss/copyleft.py (1)

29-102: Copyleft license_sources wiring is consistent with ScanResultProcessor and tests

Importing DEFAULT_COPYLEFT_LICENSE_SOURCES, extending __init__ with license_sources: list = None, and then doing:

self.license_sources = license_sources or DEFAULT_COPYLEFT_LICENSE_SOURCES
self.results_processor = ScanResultProcessor(..., explicit, self.license_sources)

cleanly propagates the configuration into ScanResultProcessor._select_licenses, matching how the new tests drive behaviour. The default now explicitly favours ['component_declared', 'license_file'], which aligns with the intended “component‑level first” semantics. Just be aware that an empty list would also resolve to the default (which seems fine for current usage).

src/scanoss/data/osadl-copyleft.json (1)

1-133: OSADL dataset structure looks compatible with the loader

The JSON shape and keys (including the copyleft mapping and the status values like "Yes" / "Yes (restricted)" / "No") align with the expectations in Osadl._load_copyleft_data and Osadl.is_copyleft. I don’t see structural or syntactic issues in the excerpt, and the included metadata (title/license/attribution/timestamp) satisfies attribution requirements.

src/scanoss/inspection/utils/license_utils.py (1)

71-102: Copyleft decision precedence is clear and matches OSADL semantics

The is_copyleft flow (early False on empty, explicit list override, include > exclude, then fallback to self.osadl.is_copyleft) is straightforward and matches the documented logic. The consistent lowercasing for all filter sets and lookups also aligns with how Osadl normalizes IDs, so you avoid case‑sensitivity issues.

src/scanoss/inspection/utils/scan_result_processor.py (1)

65-81: Remove the defensive license_sources normalization—it is not needed.

The original concern about raw CLI strings being passed is invalid. The --license-sources option uses argparse's action='extend' with choices validation, which guarantees args.license_sources is always a list of validated strings. This list is passed through to ScanResultProcessor consistently, making the suggested defensive code unnecessary.

All call sites pass license_sources as either:

  • A list (from CLI via copyleft inspection)
  • None (other call sites using defaults)

The current code at line 80 is safe as-is.

Likely an incorrect or invalid review comment.

@github-actions
Copy link

SCANOSS SCAN Completed 🚀

  • Detected components: 1
  • Undeclared components: 0
  • Declared components: 1
  • Detected files: 86
  • Detected files undeclared: 0
  • Detected files declared: 86
  • Licenses detected: 1
  • Licenses detected with copyleft: 0
  • Policies: ✅ 1 pass (1 total)

View more details on SCANOSS Action Summary

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
src/scanoss/osadl.py (1)

79-88: Fix OSADL data resource path or all licenses will be treated as non‑copyleft

The loader currently resolves the JSON path as:

f_name = importlib_resources.files(__name__) / 'data/osadl-copyleft.json'

With osadl.py in scanoss/osadl.py, __name__ is scanoss.osadl, so this looks for scanoss/osadl/data/osadl-copyleft.json. The actual file is packaged at scanoss/data/osadl-copyleft.json, so the load will fail, _shared_copyleft_data stays empty, and is_copyleft() will effectively return False for all licenses.

Consider resolving from the top-level scanoss package instead:

-        try:
-            f_name = importlib_resources.files(__name__) / 'data/osadl-copyleft.json'
-            with importlib_resources.as_file(f_name) as f:
-                with open(f, 'r', encoding='utf-8') as file:
-                    data = json.load(file)
+        try:
+            # Resolve resource from the root "scanoss" package
+            # Expects file at scanoss/data/osadl-copyleft.json
+            f_name = importlib_resources.files('scanoss') / 'data' / 'osadl-copyleft.json'
+            with importlib_resources.as_file(f_name) as path:
+                with open(path, 'r', encoding='utf-8') as file:
+                    data = json.load(file)

Also worth re-running tests in an installed environment (not just from source tree) to confirm Osadl._data_loaded is True and _shared_copyleft_data is non-empty after Osadl() construction.

🧹 Nitpick comments (7)
src/scanoss/__init__.py (1)

25-25: Version bump looks consistent with release intent

__version__ = '1.41.0' matches the PR title and the new Unreleased entries in CHANGELOG.md. No issues here; just remember to convert the Unreleased section into a ## [1.41.0] - YYYY-MM-DD entry when you cut the actual release.

CHANGELOG.md (1)

8-28: Changelog entry clearly documents new copyleft behavior and CLI option

The Unreleased section accurately explains the --license-sources option, OSADL-based copyleft data, and the new default to component-level licenses only. This aligns well with the new constants and Copyleft behavior.

Two minor optional polish ideas:

  • When you finalize the release, rename this to ## [1.41.0] - YYYY-MM-DD and add a [1.41.0] compare link at the bottom for consistency with previous versions.
  • Consider adding a short example command (e.g. scanoss-py inspect copyleft -ls component_declared license_file) to make the new option immediately actionable for readers.
LICENSE (1)

23-31: OSADL dataset attribution in LICENSE is clear and well-scoped

The added section cleanly separates the MIT license from the OSADL copyleft dataset, names the data file, states the CC-BY-4.0 license, and notes that the file carries its own header. This is a good, explicit attribution.

If you ever expand to additional third-party datasets, you might reference this section from the README’s Dataset License Notice (or vice versa) to keep documentation in sync, but what you have now is already solid.

README.md (1)

139-142: Dataset license notice is accurate and user-friendly

The new section clearly distinguishes the MIT application license from the OSADL copyleft dataset, references the correct JSON path, and provides both the CC-BY-4.0 and original OSADL URLs. This should give users sufficient clarity on licensing.

Optionally, you could add “See LICENSE for full details” at the end of the paragraph to tie it to the main license file, but it’s not strictly necessary.

tests/test_policy_inspect.py (1)

393-646: License source filtering tests are comprehensive; consider one stronger assertion

The new “Copyleft License Source Filtering Tests” block does a good job validating:

  • default vs. None license_sources fall back to DEFAULT_COPYLEFT_LICENSE_SOURCES
  • single-source filters (component_declared, license_file, file_header) produce the expected status and component set
  • multi-source and “all valid sources” paths behave as intended
  • markdown output, include, exclude, and no-copyleft scenarios are all covered.

Two minor, optional improvements:

  • In test_copyleft_policy_license_sources_with_include_filter, you might explicitly assert that at least one license has spdxid == 'MIT' to prove the include filter is taking effect, not just that some components exist.
  • Where you use assertGreaterEqual on component counts, decide whether you want strict regression detection (exact counts) or flexibility as OSADL data evolves; if stability is expected, tightening those assertions could catch unintended changes earlier.

Overall, this is strong coverage of the new license_sources behavior.

src/scanoss/cli.py (1)

54-69: CLI wiring for --license-sources is coherent; consider explicit default + confirm runtime support

The new license_sources flow (constants import, argparse options on raw/legacy inspect copyleft, and passing args.license_sources into Copyleft) is consistent and should allow source-filtered copyleft checks while keeping behavior unchanged when the option is omitted, assuming Copyleft applies DEFAULT_COPYLEFT_LICENSE_SOURCES when license_sources is None.

Two minor points to verify/consider:

  1. action='extend' for argparse requires Python 3.8+; confirm this matches your supported runtime matrix.
  2. Since the help string documents the default, you could reflect it at the argparse layer via default=DEFAULT_COPYLEFT_LICENSE_SOURCES (leaving Copyleft’s defaulting logic as a safety net). Not required, but it would make the default visible directly on args.license_sources.

Also applies to: 704-714, 1757-1769

src/scanoss/inspection/utils/scan_result_processor.py (1)

74-74: Consider a more specific type hint.

The parameter uses a generic list type hint. For better type safety and IDE support, consider using Optional[List[str]] or list[str] | None.

-        license_sources: list = None,
+        license_sources: list[str] | None = None,
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7455238 and 6db2eac.

📒 Files selected for processing (14)
  • CHANGELOG.md (1 hunks)
  • LICENSE (1 hunks)
  • README.md (1 hunks)
  • src/scanoss/__init__.py (1 hunks)
  • src/scanoss/cli.py (4 hunks)
  • src/scanoss/constants.py (1 hunks)
  • src/scanoss/data/osadl-copyleft.json (1 hunks)
  • src/scanoss/inspection/policy_check/scanoss/copyleft.py (4 hunks)
  • src/scanoss/inspection/utils/license_utils.py (1 hunks)
  • src/scanoss/inspection/utils/scan_result_processor.py (3 hunks)
  • src/scanoss/osadl.py (1 hunks)
  • src/scanoss/scanner.py (5 hunks)
  • tests/test_osadl.py (1 hunks)
  • tests/test_policy_inspect.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • src/scanoss/inspection/policy_check/scanoss/copyleft.py
  • src/scanoss/scanner.py
  • src/scanoss/constants.py
🧰 Additional context used
🧬 Code graph analysis (5)
src/scanoss/inspection/utils/scan_result_processor.py (1)
src/scanoss/inspection/utils/license_utils.py (2)
  • LicenseUtil (30-116)
  • init (47-74)
tests/test_policy_inspect.py (2)
src/scanoss/inspection/policy_check/scanoss/copyleft.py (2)
  • Copyleft (50-240)
  • run (220-240)
src/scanoss/inspection/policy_check/policy_check.py (2)
  • run (87-102)
  • PolicyStatus (33-44)
tests/test_osadl.py (2)
src/scanoss/osadl.py (2)
  • Osadl (31-130)
  • is_copyleft (108-130)
src/scanoss/inspection/utils/license_utils.py (1)
  • is_copyleft (76-107)
src/scanoss/inspection/utils/license_utils.py (2)
src/scanoss/osadl.py (3)
  • Osadl (31-130)
  • print_debug (62-67)
  • is_copyleft (108-130)
src/scanoss/scanossbase.py (2)
  • ScanossBase (28-107)
  • print_debug (58-63)
src/scanoss/osadl.py (1)
src/scanoss/inspection/utils/license_utils.py (1)
  • is_copyleft (76-107)
🪛 LanguageTool
README.md

[uncategorized] ~140-~140: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ...vecommons.org/licenses/by/4.0/) by the [Open Source Automation Development Lab (OSADL) eG](...

(EN_COMPOUND_ADJECTIVE_INTERNAL)


[uncategorized] ~142-~142: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ...g/). Attribution: A project by the Open Source Automation Development Lab (OSADL) eG. ...

(EN_COMPOUND_ADJECTIVE_INTERNAL)

CHANGELOG.md

[uncategorized] ~17-~17: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ... - Copyleft detection now uses [OSADL (Open Source Automation Development Lab)](https://ww...

(EN_COMPOUND_ADJECTIVE_INTERNAL)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build
🔇 Additional comments (8)
tests/test_policy_inspect.py (1)

31-31: ****

The review comment suggests aligning line 31 to use src.scanoss.* instead of scanoss.*. However, the project configuration in pyproject.toml declares known-first-party = ["scanoss"], which specifies that the correct import style is from scanoss.* (not src.scanoss.*). This aligns with how the source code itself imports from constants (src/scanoss/cli.py and src/scanoss/inspection/policy_check/scanoss/copyleft.py both use from scanoss.constants).

Line 31 is actually correct. The inconsistency in the test file is in lines 32-39, which use src.scanoss.* and should be aligned to match the package-style imports. The review comment identified a real inconsistency but suggested the wrong direction for the fix.

Likely an incorrect or invalid review comment.

tests/test_osadl.py (1)

34-82: Good OSADL behavior coverage

The test suite exercises initialization, GPL/MIT decisions, case-insensitivity, and empty/None handling in a way that closely matches Osadl.is_copyleft semantics. This should give good confidence in the OSADL-backed copyleft logic.

src/scanoss/data/osadl-copyleft.json (1)

1-133: OSADL copyleft dataset structure matches loader expectations

The JSON layout (metadata + copyleft mapping) aligns with Osadl._load_copyleft_data (top-level copyleft object, license IDs as keys, status strings as values) and should work correctly with the lowercasing done in the loader.

src/scanoss/inspection/utils/license_utils.py (1)

25-75: OSADL-backed copyleft logic and idempotent filters look solid

The refactor to use Osadl plus include/exclude/explicit filters is clean and matches the documented 4-step decision flow:

  • init() now clears include_licenses, exclude_licenses, and explicit_licenses up front, making it safe to call multiple times and addressing the prior “sticky state” concern.
  • explicit_licenses correctly short-circuits OSADL and other filters; otherwise include → True, exclude → False, and fallback to self.osadl.is_copyleft(spdxid).

Overall this keeps the API stable while making behavior more predictable and easier to reason about.

Also applies to: 76-107

src/scanoss/inspection/utils/scan_result_processor.py (4)

332-332: Verify that 'unknown' source should always be included in filtering mode.

Line 332 automatically adds 'unknown' to the filter set, meaning licenses with an unknown source will always be included regardless of the user's --license-sources selection. This prevents users from excluding unknown-source licenses when filtering.

Please confirm this is the intended behavior. If so, consider adding a comment explaining the rationale (e.g., "Always include unknown sources as a safety fallback").


167-171: Implementation looks good.

The replacement of the previous priority-based logic with the new _select_licenses method is well-integrated. The method cleanly separates filtering mode from priority mode, and the selected licenses are properly processed in the iteration loop.


316-361: LGTM - dual-mode selection logic is sound.

The _select_licenses method correctly implements both filtering and priority modes:

  • Filtering mode (lines 331-334): Filters licenses by source when license_sources is specified
  • Priority mode (lines 336-361): Preserves the original priority-based selection (component_declared > license_file > file_header > scancode)
  • De-duplication within each source (line 351) prevents duplicate license names
  • Graceful fallback to all licenses when no priority sources are found (line 361)

The logic is well-structured and maintains backward compatibility.


331-334: The current implementation is correct and the review comment is inaccurate.

The condition if self.license_sources: correctly implements the intended behavior. While an empty list [] is technically falsy, this is the correct semantic choice. Here's why:

  1. Design pattern prevents empty lists: The Copyleft class (line 92) uses license_sources or DEFAULT_COPYLEFT_LICENSE_SOURCES, so empty lists never reach ScanResultProcessor in practice.

  2. Intended semantics are clear: Filtering mode activates only for non-empty lists. An empty list should fall through to priority mode, not filter to zero results.

  3. The suggested fix would introduce a bug: Changing to if self.license_sources is not None: would cause empty lists to filter licenses to nothing (empty result), contradicting the class design where Copyleft prevents empty lists via its or operator.

  4. No code paths use empty lists: A search of the codebase found no instances of license_sources = [], confirming this is never a real use case.

Likely an incorrect or invalid review comment.

@isasmendiagus isasmendiagus deleted the v1.41.0 branch November 17, 2025 14:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants