-
Notifications
You must be signed in to change notification settings - Fork 41
Description
It occurred to me that there may be a potentially substantial performance optimisation possible within the org.spdx.utility.compare.LicenseCompareHelper.matchingStandardLicenseIdsWithinText() family of methods, beyond the existing "quick check regex" optimisation.
Specifically, the idea is that when a match is found for a given license within an input text, the subset of text that matched is removed from the input text before further matching is performed. This not only speeds up subsequent matching (by reducing the size of the input), it also allows for early termination once the input is exhausted (i.e. below the size of the smallest possible match).
This enhancement could (should?) also order matching by approximate license popularity (i.e. the most popular licenses are checked first), since that would tend to result in early termination being reached more frequently in real world use. The OSI publishes such a list, albeit only for OSI-approved licenses - I'm fairly certain I've seen other sources of license popularity that would be more than enough for this purpose (after all it doesn't have to be precise - even using an approximate popularity order would be a lot better than random, or alphabetical, matching processing).
One possible complication is when two templates match the same text - for example the GPL-2.0-only and GPL-2.0-or-later templates tend to match the same thing. If this optimisation were implemented as I'm envisaging, whichever license template is used first would be the only one to show up in the result. Though if that behaviour is undesirable there may be workarounds - the library might maintain a list of "overlapping" templates, and ensure that when one of them matches a given text, the other overlapping templates are always checked too, prior to the matched text being removed from the input.
Another possible complication is if LicenseCompareHelper provides indexes into the original input text where a match was found (I don't recall if it does this or not; if it does I'm not currently using that capability). If so, there will be some extra bookkeeping needed to make sure that those indexes remain correct, given that each subsequent match attempt might only be looking at a fragmented subset of the original input.