Potential optimisation for LicenseCompareHelper.matchingStandardLicenseIdsWithinText()

It occurred to me that there may be a potentially substantial performance optimisation possible within the [`org.spdx.utility.compare.LicenseCompareHelper.matchingStandardLicenseIdsWithinText()`](https://github.com/spdx/Spdx-Java-Library/blob/master/src/main/java/org/spdx/utility/compare/LicenseCompareHelper.java#L472-L562) family of methods, beyond the existing "quick check regex" optimisation.

Specifically, the idea is that when a match is found for a given license within an input text, the subset of text that matched is _removed_ from the input text before further matching is performed.  This not only speeds up subsequent matching (by reducing the size of the input), it also allows for early termination once the input is exhausted (i.e. below the size of the smallest possible match).

This enhancement could (should?) also order matching by approximate license popularity (i.e. the most popular licenses are checked first), since that would tend to result in early termination being reached more frequently in real world use.  [The OSI publishes such a list](https://opensource.org/blog/top-open-source-licenses-in-2024), albeit only for OSI-approved licenses - I'm fairly certain I've seen other sources of license popularity that would be more than enough for this purpose (after all it doesn't have to be precise - even using an approximate popularity order would be a lot better than random, or alphabetical, matching processing).

One possible complication is when two templates match the same text - for example the `GPL-2.0-only` and `GPL-2.0-or-later` templates tend to match the same thing.  If this optimisation were implemented as I'm envisaging, whichever license template is used first would be the only one to show up in the result.  Though if that behaviour is undesirable there may be workarounds - the library might maintain a list of "overlapping" templates, and ensure that when one of them matches a given text, the other overlapping templates are always checked too, prior to the matched text being removed from the input.

Another possible complication is if `LicenseCompareHelper` provides indexes into the original input text where a match was found (I don't recall if it does this or not; if it does I'm not currently using that capability).  If so, there will be some extra bookkeeping needed to make sure that those indexes remain correct, given that each subsequent match attempt might only be looking at a fragmented subset of the original input.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Potential optimisation for LicenseCompareHelper.matchingStandardLicenseIdsWithinText() #341

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Potential optimisation for LicenseCompareHelper.matchingStandardLicenseIdsWithinText() #341

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions