
Commit 3766bbd

committed
Add Projects section
1 parent dc581f9 commit 3766bbd

4 files changed

+172
-4
lines changed

config.toml

Lines changed: 11 additions & 4 deletions
@@ -19,7 +19,7 @@ instagram = 'rjknightmare'
1919

2020
2121
name = 'Rajul Jha'
22-
bio = 'Software Developer'
22+
bio = """Software Engineer, GSoC '24 and '25 @FOSSology"""
2323
favicon = "favicon/2.png"
2424
# monoDarkIcon = true
2525

@@ -46,11 +46,18 @@ identifier = "about"
4646
name = "About"
4747
url = "/about/"
4848
weight = 10
49+
4950
[[menu.main]]
5051
identifier = "contact"
51-
name = "Contact"
52-
url = "/contact/"
53-
weight = 10
52+
name = "Posts"
53+
url = "/posts/"
54+
weight = 20
55+
56+
[[menu.main]]
57+
identifier = "Project"
58+
name = "Projects"
59+
url = "/projects/"
60+
weight = 30
5461

5562

5663
[taxonomies]
Lines changed: 144 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,144 @@
1+
+++
2+
author = "Rajul Jha"
3+
title = "GSoC 2025 Final Project Report"
4+
date = "2025-08-30"
5+
description = "Final Project Report of GSoC '25 @Fossology."
6+
tags = [
7+
"gsoc", "final-evaluation" ,"atarashi" ,"fossology", "open-source"
8+
]
9+
+++
10+
11+
![Logo](/gsoc-24-project-report/foss-gsoc-logo.png)
12+
13+
# Table of contents
14+
- [Table of contents](#table-of-contents)
15+
- [About FOSSology](#about-fossology)
16+
- [About Atarashi](#about-atarashi)
17+
- [Motivation for my project](#motivation-for-my-project)
18+
- [Keyword Agent: A pre-filtering step](#keyword-agent-a-pre-filtering-step)
19+
- [Known drawbacks](#known-drawbacks)
20+
- [My Learnings](#my-learnings)
21+
- [Acknowledgements](#acknowledgements)
22+
- [Planning for future](#planning-for-future)
23+
24+
25+
## About FOSSology
26+
27+
FOSSology is a long-standing, mature project that provides both a toolkit and a system for license compliance scanning.
28+
As a toolkit, it exposes multiple scanners (like **nomos**, **ojo**, **copyright**) that can be executed via CLI.
29+
As a system, it provides a web interface and database backend for scanning large repositories, packages, or directories.
30+
31+
Over the years, FOSSology has integrated different kinds of license scanners: some rule-based, some keyword-based, and now machine-learning-driven approaches like **Atarashi**.
32+
33+
## About Atarashi
34+
35+
Atarashi is a FOSSology community project that works as an independent Python [package](https://pypi.org/project/atarashi/).
36+
It uses information retrieval techniques like TF-IDF, cosine similarity, Damerau-Levenshtein distance, and
37+
N-gram distance to detect licenses in source code files.
38+
Internally, Atarashi implements these algorithms as agents, such as a TF-IDF agent and a DLD agent.
39+
Any file or folder can then be scanned by running Atarashi with any of these agents.
40+
It outputs the predicted license after comparing the input against processed data amalgamated from existing sources like [SPDX sources](https://spdx.org/licenses) and FOSSology's internal [license database](https://raw.githubusercontent.com/fossology/fossology/master/install/db/licenseRef.json).
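
To make the similarity idea concrete, here is a minimal sketch of TF-IDF weighting combined with cosine similarity, using only the Python standard library. It is not Atarashi's actual code: the function names (`tokenize`, `tfidf_vectors`, `cosine`, `best_match`), the smoothed IDF formula, and the two shortened license snippets are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict) per tokenized document."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (an assumption) so that common terms keep a positive weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [
        {t: count / len(doc) * idf[t] for t, count in Counter(doc).items()}
        for doc in docs
    ]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(file_text, licenses):
    """Return (license_name, score) for the reference text most similar to file_text."""
    names = list(licenses)
    docs = [tokenize(licenses[n]) for n in names] + [tokenize(file_text)]
    vectors = tfidf_vectors(docs)
    query = vectors[-1]
    scores = {n: cosine(query, v) for n, v in zip(names, vectors)}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical, heavily shortened reference texts, used only for this example.
LICENSES = {
    "MIT": "Permission is hereby granted, free of charge, to any person "
           "obtaining a copy of this software",
    "Apache-2.0": "Licensed under the Apache License, Version 2.0; you may not "
                  "use this file except in compliance with the License",
}

name, score = best_match(
    "Permission is hereby granted, free of charge, to any person obtaining a copy",
    LICENSES,
)
print(name, round(score, 3))
```

A real agent would precompute the vectors for the reference licenses once instead of rebuilding them for every scanned file, which is the kind of cost the pre-filtering step later in this report tries to avoid.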
41+
42+
## Motivation for my project
43+
44+
While Atarashi demonstrates promising performance with an accuracy of around 80%, this project aims to significantly improve both the accuracy and robustness of its predictions.
45+
46+
Traditional scanners like **nomos** and **ojo** are rule-based. They rely on predefined license texts or regular expressions. While accurate for standard license texts, they struggle with:
47+
- Slightly modified licenses (common in practice).
48+
- Large repositories with mixed content.
49+
50+
Atarashi was introduced to fill these gaps by using ML and IR techniques, but this came with its own set of challenges. It had several shortcomings that needed work:
51+
- Long scanning times due to expensive calculations on these files.
52+
- Not integrated into the FOSSology UI and workflows.
53+
- Lacking optimizations in its database usage, leading to slow scans.
54+
- Not making full use of multi-stage detection pipelines to minimize false positives and improve scanning speeds.
55+
- Comment extraction using Nirjas was broken and needed to be fixed.
56+
57+
The motivation of my project was to tackle these existing issues and provide a neat and scalable solution
58+
for them.
59+
60+
## Keyword Agent: A pre-filtering step
61+
62+
In order to improve the scanning times, a new agent called Keyword Agent was introduced. It reduces the
63+
candidate license set before passing it to other similarity-based agents. The idea is to quickly check whether
64+
license-related text is explicitly present in the file.
65+
66+
```yaml
67+
acknowledg(e|ement|ements)?
68+
agreement
69+
as[\s-]is
70+
copyright
71+
damages
72+
deriv(e|ed|ation|ative|es|ing)
73+
redistribut(e|ion|able|ing)?|distribut(e|ion|able|ing)?
74+
free software
75+
grant
76+
indemnif(i|y|ied|ication|ying)?
77+
intellectual propert(y|ies)?
78+
[^e]liabilit(y|ies)?
79+
licencs?
80+
mis[- ]?represent
81+
open source
82+
patent
83+
permission
84+
public[\s-]domain
85+
require(s|d|ment|ments)?
86+
same terms
87+
see[\s:-]*(https?://|file://|www.|[A-Za-z0-9._/-]+)
88+
source (and|or)? ?binary
89+
source code
90+
subject to
91+
terms and conditions
92+
warrant(y|ies|ed|ing)?
93+
without (fee|restrict(ion|ed)?|limit(ation|ed)?)
94+
severability clause
95+
```
96+
97+
For example, the pattern `redistribut(e|ion|able|ing)?|distribut(e|ion|able|ing)?` matches words like
98+
redistribute, distributing, and redistributable. Since these words directly point to the presence of a
99+
license, the Keyword Agent marks the file as a possible license match and sends it to the next stage for complete
100+
scanning. If no license keyword is found, it eliminates the file right there, saving crucial time.
101+
On non-license text, this makes the agent up to **50%** faster!
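
The pre-filtering step described above can be sketched roughly as follows. This is a hedged illustration, not the Keyword Agent's real implementation: it uses only a subset of the patterns listed earlier, assumes case-insensitive matching, and the names `has_license_keyword`, `scan`, and the `full_scan` callback are hypothetical.

```python
import re

# A small subset of the keyword patterns listed above, kept short for the example.
KEYWORD_PATTERNS = [
    r"copyright",
    r"redistribut(e|ion|able|ing)?|distribut(e|ion|able|ing)?",
    r"warrant(y|ies|ed|ing)?",
    r"public[\s-]domain",
    r"without (fee|restrict(ion|ed)?|limit(ation|ed)?)",
]

# Combine everything into one regex so each file is scanned in a single pass.
# Case-insensitive matching is an assumption made for this sketch.
KEYWORD_RE = re.compile(
    "|".join(f"(?:{pattern})" for pattern in KEYWORD_PATTERNS),
    re.IGNORECASE,
)


def has_license_keyword(text):
    """Return True if any license-related keyword occurs in the text."""
    return KEYWORD_RE.search(text) is not None


def scan(text, full_scan):
    """Run the expensive full_scan callback only when the pre-filter matches."""
    if not has_license_keyword(text):
        return None  # No keyword hit: skip the similarity-based stage entirely.
    return full_scan(text)


print(has_license_keyword("Redistribution and use in source and binary forms"))
print(has_license_keyword("def add(a, b):\n    return a + b"))
```

Because the check is a single regex search, a file without license text never reaches the costlier similarity agents, and that skipped work is where the time savings come from.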
102+
103+
<!-- ![Screenshot](/) -->
104+
105+
Other stats:
106+
- Ran the KeywordAgent on NomosTestFiles.
107+
- Achieved ~99.5% accuracy, confirming robustness of regex pattern matching.
108+
- Detected minor edge cases (false negatives) which informed the next steps for keyword expansion.
109+
110+
## Known drawbacks
111+
112+
* The copyright scanner results include some unnecessary information along with the copyright findings, which is a known issue of the copyright scanner. This needs to be fixed in the future.
113+
* The line number algorithm might struggle to find the line numbers if the diff is not properly formatted or is tampered with. It heavily depends on the diff format.
114+
115+
## My Learnings
116+
117+
* Git was definitely the skill I improved the most during GSoC :)
118+
* I spent a lot of time working with Python, Docker, and CI/CD tools like GitHub Actions, and I feel way more confident in using them now.
119+
* Gained hands-on experience with SBOM generation, package parsing, and the integration of FOSSology scanners, which broadened my technical expertise.
120+
* I learned some valuable lessons on writing clean, maintainable code. I focused on proper formatting, modular programming, and object-oriented techniques.
121+
* I also got to learn about different packaging methods that are industry standards and why following community norms is so important. It’s the little things that make a big difference in how reliable and compatible your software is.
122+
* I spent some time optimizing Docker images and learning how to speed up program execution with practices like concurrency.
123+
* GSoC also helped me improve my documentation game. Writing weekly progress reports, crafting clear commit messages, and keeping a work log became second nature, and it really pays off in keeping everything organized.
124+
* Attending the weekly community meetings and project calls was a big part of my GSoC experience. They really helped me see the bigger picture and kept me motivated. Plus, these calls were great for making sure everyone was on the same page and moving in the right direction.
125+
126+
## Acknowledgements
127+
128+
I want to express my deepest gratitude to everyone who supported me during my GSoC 2025 journey with FOSSology.
129+
130+
First and foremost, I would like to thank my mentors, [Shaheem Azmal M MD](https://github.com/shaheemazmalmmd), [Gaurav Mishra](https://github.com/GMishx), [Avinal Kumar](https://github.com/avinal) and [Kaushalendra Pratap](https://github.com/Kaushl2208) whose guidance, patience, and expertise were invaluable. Your encouragement and feedback helped me grow both technically and personally, and I’m incredibly grateful for all the time and effort you invested in my project.
131+
132+
A huge thank you to my family for their unwavering support and understanding throughout this journey. Your belief in me kept me motivated, and I couldn’t have done this without you.
133+
134+
Finally, I’d like to extend my thanks to the entire FOSSology community. From the very beginning, you were welcoming and always ready to help. Working with such a friendly and knowledgeable group made this experience truly rewarding, and I’m proud to have contributed to this amazing project.
135+
136+
Thank you all for making GSoC 2025 such a memorable and transformative experience for me.
137+
138+
## Planning for future
139+
140+
* I realize that writing open source code comes with the responsibility to maintain it. And I am more than
141+
happy to do so.
142+
* The next major goal for me is to wrap up the dependency scan part of the project, of which NPM and PHP dependencies are the ones I am currently working on.
143+
* In the long run, I plan to stay involved with the community, continue to contribute to open source,
144+
and most importantly, continue to learn newer things.

content/projects/_index.md

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
1+
+++
2+
aliases = ["projects", "portfolio", "showcase", "docs"]
3+
title = "Projects"
4+
author = "Rajul Jha"
5+
tags = ["index"]
6+
+++

content/projects/quizzly.md

Lines changed: 11 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,11 @@
1+
+++
2+
author = "Rajul Jha, Syed Ali Ul Hasan, Aarish Shah Mohsin"
3+
title = "Quizzly.io"
4+
date = "2025-01-15"
5+
description = "Quizzly.io"
6+
tags = [
7+
"quizzly","hackathon","backend"
8+
]
9+
+++
10+
11+
## A modern, AI-powered, personalized quiz management platform
