Skip to content

Potential optimisation for LicenseCompareHelper.matchingStandardLicenseIdsWithinText() #341

@pmonks

Description

@pmonks

It occurred to me that there may be a potentially substantial performance optimisation possible within the org.spdx.utility.compare.LicenseCompareHelper.matchingStandardLicenseIdsWithinText() family of methods, beyond the existing "quick check regex" optimisation.

Specifically, the idea is that when a match is found for a given license within an input text, the subset of text that matched is removed from the input text before further matching is performed. This not only speeds up subsequent matching (by reducing the size of the input), it also allows for early termination once the input is exhausted (i.e. below the size of the smallest possible match).

This enhancement could (should?) also order matching by approximate license popularity (i.e. the most popular licenses are checked first), since that would tend to result in early termination being reached more frequently in real world use. The OSI publishes such a list, albeit only for OSI-approved licenses - I'm fairly certain I've seen other sources of license popularity that would be more than enough for this purpose (after all it doesn't have to be precise - even using an approximate popularity order would be a lot better than random, or alphabetical, matching processing).

One possible complication is when two templates match the same text - for example the GPL-2.0-only and GPL-2.0-or-later templates tend to match the same thing. If this optimisation were implemented as I'm envisaging, whichever license template is used first would be the only one to show up in the result. Though if that behaviour is undesirable there may be workarounds - the library might maintain a list of "overlapping" templates, and ensure that when one of them matches a given text, the other overlapping templates are always checked too, prior to the matched text being removed from the input.

Another possible complication is if LicenseCompareHelper provides indexes into the original input text where a match was found (I don't recall if it does this or not; if it does I'm not currently using that capability). If so, there will be some extra bookkeeping needed to make sure that those indexes remain correct, given that each subsequent match attempt might only be looking at a fragmented subset of the original input.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or requestperformanceSpeed, responsiveness, and memory efficiency

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions