|
| 1 | ++++ |
| 2 | +author = "Rajul Jha" |
| 3 | +title = "GSoC 2025 Final Project Report" |
| 4 | +date = "2025-08-30" |
| 5 | +description = "Final Project Report of GSoC '25 @Fossology." |
| 6 | +tags = [ |
| 7 | + "gsoc", "final-evaluation", "atarashi", "fossology", "open-source"
| 8 | +] |
| 9 | ++++ |
| 10 | + |
| 11 | + |
| 12 | + |
| 13 | +# Table of contents |
| 14 | +- [Table of contents](#table-of-contents) |
| 15 | + - [About Fossology](#about-fossology) |
| 16 | + - [About Atarashi](#about-atarashi) |
| 17 | + - [Motivation for my project](#motivation-for-my-project) |
| 18 | + - [Keyword Agent: A pre-filtering step](#keyword-agent-a-pre-filtering-step) |
| 19 | + - [Known drawbacks](#known-drawbacks) |
| 20 | + - [My Learnings](#my-learnings) |
| 21 | + - [Acknowledgements](#acknowledgements) |
| 22 | + - [Planning for future](#planning-for-future) |
| 23 | + |
| 24 | + |
| 25 | +## About Fossology |
| 26 | + |
| 27 | +FOSSology is a long-standing, mature project that provides both a toolkit and a system for license compliance scanning. |
| 28 | +As a toolkit, it exposes multiple scanners (like **nomos**, **ojo**, **copyright**) that can be executed via CLI. |
| 29 | +As a system, it provides a web interface and database backend for scanning large repositories, packages, or directories. |
| 30 | + |
| 31 | +Over the years, FOSSology has integrated different kinds of license scanners—some rule-based, some keyword-based, and now machine-learning–driven approaches like **Atarashi**. |
| 32 | + |
| 33 | +## About Atarashi |
| 34 | + |
| 35 | +Atarashi is a FOSSology community project that is available as an independent Python [package](https://pypi.org/project/atarashi/).
| 36 | +It uses information retrieval techniques such as TF-IDF, cosine similarity, Damerau-Levenshtein distance, and
| 37 | +N-gram distance to detect licenses in source code files.
| 38 | +Atarashi implements these algorithms as internal agents, such as a TF-IDF agent and a DLD agent.
| 39 | +Any file or folder can then be scanned by running Atarashi with any of these agents.
| 40 | +It outputs the predicted license after comparing the input against processed data amalgamated from existing sources such as [SPDX sources](https://spdx.org/licenses) and FOSSology's internal [license database](https://raw.githubusercontent.com/fossology/fossology/master/install/db/licenseRef.json).
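To make the similarity idea concrete, here is a minimal sketch of TF-IDF weighting combined with cosine similarity, using only the Python standard library. It is an illustration under stated assumptions, not Atarashi's actual implementation: the function names (`tokenize`, `tfidf_vectors`, `cosine_similarity`, `best_match`), the smoothed IDF formula, and the toy reference snippets are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(documents):
    """Return one sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({
            # Smoothed IDF (an assumption) keeps weights positive for common terms.
            term: (count / total) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(text, references):
    """Return (name, score) of the reference text most similar to `text`."""
    names = list(references)
    # The query is vectorized together with the references so all share one IDF.
    vectors = tfidf_vectors([references[name] for name in names] + [text])
    query = vectors[-1]
    scores = [(cosine_similarity(query, vec), name) for name, vec in zip(names, vectors)]
    score, name = max(scores)
    return name, score


# Toy reference texts (hypothetical, heavily shortened license-like snippets).
REFERENCES = {
    "MIT-like": "permission is hereby granted free of charge to any person obtaining a copy",
    "GPL-like": "this program is free software you can redistribute it and modify it under the terms",
}
```

Calling `best_match("Permission is hereby granted, free of charge, to any person", REFERENCES)` picks the `MIT-like` entry, because the query shares almost all of its terms with that snippet. Computing a score against every reference license is the expensive part, which is why reducing the work per file matters.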
| 41 | + |
| 42 | +## Motivation for my project |
| 43 | + |
| 44 | +While Atarashi demonstrates promising performance with an accuracy of around 80%, this project aims to significantly improve both the accuracy and robustness of its predictions. |
| 45 | + |
| 46 | +Traditional scanners like **nomos** and **ojo** are rule-based. They rely on predefined license texts or regular expressions. While accurate for standard license texts, they struggle with: |
| 47 | +- Slightly modified licenses (common in practice). |
| 48 | +- Large repositories with mixed content. |
| 49 | + |
| 50 | +Atarashi was introduced to fill these gaps by using ML and IR techniques, but it came with its own set of challenges that called for new features and improvements:
| 51 | +- Long scanning times due to expensive calculations on each file.
| 52 | +- No integration with the FOSSology UI and workflows.
| 53 | +- Unoptimized database usage, leading to slow scans.
| 54 | +- No multi-stage detection pipeline to minimize false positives and improve scanning speed.
| 55 | +- Broken comment extraction through Nirjas, which needed to be fixed.
| 56 | + |
| 57 | +My project set out to tackle these issues and provide a clean, scalable solution
| 58 | +for them.
| 59 | + |
| 60 | +## Keyword Agent: A pre-filtering step |
| 61 | + |
| 62 | +To improve scanning times, a new agent called the Keyword Agent was introduced. It reduces the
| 63 | +candidate license set before passing it to other similarity-based agents. The idea is to quickly check whether
| 64 | +license-related text is explicitly present in the file.
| 65 | + |
| 66 | +```yaml |
| 67 | +acknowledg(e|ement|ements)? |
| 68 | +agreement |
| 69 | +as[\s-]is |
| 70 | +copyright |
| 71 | +damages |
| 72 | +deriv(e|ed|ation|ative|es|ing) |
| 73 | +redistribut(e|ion|able|ing)?|distribut(e|ion|able|ing)? |
| 74 | +free software |
| 75 | +grant |
| 76 | +indemnif(i|y|ied|ication|ying)? |
| 77 | +intellectual propert(y|ies)? |
| 78 | +[^e]liabilit(y|ies)? |
| 79 | +licencs? |
| 80 | +mis[- ]?represent |
| 81 | +open source |
| 82 | +patent |
| 83 | +permission |
| 84 | +public[\s-]domain |
| 85 | +require(s|d|ment|ments)? |
| 86 | +same terms |
| 87 | +see[\s:-]*(https?://|file://|www.|[A-Za-z0-9._/-]+) |
| 88 | +source (and|or)? ?binary |
| 89 | +source code |
| 90 | +subject to |
| 91 | +terms and conditions |
| 92 | +warrant(y|ies|ed|ing)? |
| 93 | +without (fee|restrict(ion|ed)?|limit(ation|ed)?) |
| 94 | +severability clause |
| 95 | +``` |
| 96 | + |
| 97 | +For example, the pattern `redistribut(e|ion|able|ing)?|distribut(e|ion|able|ing)?` matches words like
| 98 | +redistribute, distributing, and redistributable. Since these words point directly to the presence of a
| 99 | +license, the Keyword Agent marks the file as a possible license and sends it to the next stage for complete
| 100 | +scanning. If no keyword is found, the agent eliminates the file right there, saving crucial time.
| 101 | +On non-license text, this makes the agent up to **50%** faster!
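The pre-filtering step can be sketched roughly as follows. This is a minimal illustration, not the actual Keyword Agent code: the names `KEYWORD_PATTERNS`, `may_contain_license`, and `scan` are hypothetical, only a handful of the patterns listed above are included, and case-insensitive matching is an assumption.

```python
import re

# A small subset of the keyword patterns listed above; the full list is longer.
KEYWORD_PATTERNS = [
    r"copyright",
    r"redistribut(e|ion|able|ing)?|distribut(e|ion|able|ing)?",
    r"free software",
    r"public[\s-]domain",
    r"warrant(y|ies|ed|ing)?",
    r"without (fee|restrict(ion|ed)?|limit(ation|ed)?)",
]

# Combine everything into one regex so each file is searched in a single pass.
# Each pattern is wrapped in a non-capturing group so its alternations stay local.
# Case-insensitive matching is an assumption made for this sketch.
KEYWORD_RE = re.compile(
    "|".join(f"(?:{pattern})" for pattern in KEYWORD_PATTERNS),
    re.IGNORECASE,
)


def may_contain_license(text):
    """Cheap check: does the text contain at least one license-related keyword?"""
    return KEYWORD_RE.search(text) is not None


def scan(text, expensive_agent):
    """Run the expensive similarity agent only when the keyword check passes.

    `expensive_agent` is any callable that takes the text and returns a result,
    standing in for a similarity-based agent.
    """
    if not may_contain_license(text):
        return None  # Eliminated early; the expensive agent never runs.
    return expensive_agent(text)
```

With this structure, a plain source file such as `def add(a, b): return a + b` is rejected by the regex search alone, while a header containing "Distributed WITHOUT WARRANTY" is handed to the next stage.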
| 102 | + |
| 103 | +<!--  --> |
| 104 | + |
| 105 | +Other stats: |
| 106 | +- Ran the KeywordAgent on NomosTestFiles. |
| 107 | +- Achieved ~99.5% accuracy, confirming robustness of regex pattern matching. |
| 108 | +- Detected minor edge cases (false negatives) which informed the next steps for keyword expansion.
| 109 | + |
| 110 | +## Known drawbacks |
| 111 | + |
| 112 | +* The copyright scanner results include some unnecessary information along with the copyright findings, which is a known issue of the copyright scanner. This needs to be fixed in the future.
| 113 | +* The line number algorithm might struggle to find the line numbers if the diff is not properly formatted or is tampered with. It heavily depends on the diff format. |
| 114 | + |
| 115 | +## My Learnings |
| 116 | + |
| 117 | +* Git was definitely the skill I improved the most during GSoC :) |
| 118 | +* I spent a lot of time working with Python, Docker, and CI/CD tools like GitHub Actions, and I feel way more confident in using them now. |
| 119 | +* Gained hands-on experience with SBOM generation, package parsing, and the integration of FOSSology scanners, which broadened my technical expertise. |
| 120 | +* I learned some valuable lessons on writing clean, maintainable code. I focused on proper formatting, modular programming, and object-oriented techniques. |
| 121 | +* I also got to learn about different packaging methods that are industry standards and why following community norms is so important. It’s the little things that make a big difference in how reliable and compatible your software is. |
| 122 | +* I spent some time optimizing Docker images and learning how to speed up program execution with practices like concurrency. |
| 123 | +* GSoC also helped me improve my documentation game. Writing weekly progress reports, crafting clear commit messages, and keeping a work log became second nature, and it really pays off in keeping everything organized. |
| 124 | +* Attending the weekly community meetings and project calls was a big part of my GSoC experience. They really helped me see the bigger picture and kept me motivated. Plus, these calls were great for making sure everyone was on the same page and moving in the right direction. |
| 125 | + |
| 126 | +## Acknowledgements |
| 127 | + |
| 128 | +I want to express my deepest gratitude to everyone who supported me during my GSoC 2025 journey with FOSSology.
| 129 | + |
| 130 | +First and foremost, I would like to thank my mentors, [Shaheem Azmal M MD](https://github.com/shaheemazmalmmd), [Gaurav Mishra](https://github.com/GMishx), [Avinal Kumar](https://github.com/avinal) and [Kaushalendra Pratap](https://github.com/Kaushl2208) whose guidance, patience, and expertise were invaluable. Your encouragement and feedback helped me grow both technically and personally, and I’m incredibly grateful for all the time and effort you invested in my project. |
| 131 | + |
| 132 | +A huge thank you to my family for their unwavering support and understanding throughout this journey. Your belief in me kept me motivated, and I couldn’t have done this without you. |
| 133 | + |
| 134 | +Finally, I’d like to extend my thanks to the entire FOSSology community. From the very beginning, you were welcoming and always ready to help. Working with such a friendly and knowledgeable group made this experience truly rewarding, and I’m proud to have contributed to this amazing project. |
| 135 | + |
| 136 | +Thank you all for making GSoC 2025 such a memorable and transformative experience for me.
| 137 | + |
| 138 | +## Planning for future |
| 139 | + |
| 140 | +* I realize that writing open source code comes with the responsibility to maintain it, and I am more than
| 141 | +happy to do so.
| 142 | +* The next major goal for me is to wrap up the dependency scan part of the project; NPM and PHP dependencies are the ones I am currently working on.
| 143 | +* In the long run, I plan to stay involved with the community, continue to contribute to open source,
| 144 | +and most importantly, continue to learn new things.