Skip to content

TemplateRegexMatcher.getStartRegex sometimes returns a regex that matches with an index before the license start #244

@sdheh

Description

@sdheh

Version 1.1.11
Example 1 greedy regex after optional:

String licenseText = "ab cd text";
String licenseTemplate = "<<beginOptional>>cd<<endOptional>> <<var;name=\"copyright\";original=\"Copyright (c) <year> <copyright holders>  \";match=\".{0,5000}\">> text";
TemplateRegexMatcher templateRegexMatcher = new TemplateRegexMatcher(licenseTemplate);
String startRegex = templateRegexMatcher.getStartRegex(25);
System.out.println("start regex: " + startRegex);
Matcher matcher = Pattern.compile(startRegex).matcher(licenseText);
if (matcher.find()) {
    System.out.println("start index found: " + matcher.start());
}

Returns

start regex: (?im)(\Qcd\E\s*)?(.{0,5000})\Qtext\E\s*
start index found: 0

but the start index should be 3.

Example 2 greedy regex at start:

String licenseText = "abtext";
String licenseTemplate = "<<var;name=\"copyright\";original=\"Copyright (c) <year> <copyright holders>  \";match=\".{0,5000}\">> text";
TemplateRegexMatcher templateRegexMatcher = new TemplateRegexMatcher(licenseTemplate);
String startRegex = templateRegexMatcher.getStartRegex(25);
System.out.println("start regex: " + startRegex);
Matcher matcher = Pattern.compile(startRegex).matcher(licenseText);
if (matcher.find()) {
    System.out.println("start index found: " + matcher.start());
}

Returns

start regex: (?im)(.?{0,5000})\Qtext\E\s*
start index found: 1

but the start index should be 2.
.?{0,5000} doesn't seem to work as expected. It is an unusual regex that some online regex websites say is invalid: https://regex101.com/r/l3810b/1, regexr.com/81kfo.
https://www.freeformatter.com/java-regex-tester.html says the regular expression is valid.

I think maybe to fix this you could just offer a method for a regex to find the beginning of the non-optional part. Otherwise a changing the regular expressions in these two cases to something like the following could work

(?im)((\Qcd\E\s*)(.{0,5000})\Qtext\E\s*)|(\Qtext\E\s*)
(?im)\Qtext\E\s*

In the first case if there were multiple optional parts it would get even more complicated to do it correctly.

Metadata

Metadata

Assignees

No one assigned

    Labels

    matchingLicense matching and recognition

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions