Skip to content

TEXT-126: Adding Sorensen-Dice similarity algorithm #621

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

kinow
Copy link
Member

@kinow kinow commented Oct 27, 2024

Supersedes #103

The work in #103 seemed to be going in the right direction but needed some review before being merged. I started syncing the code and rebasing it, but noticed that the code had been done on master branch in a fork. However, that master branch was later overwritten by other commits for other pull requests.

This means I am not able to update the pull request (or at least I cannot think of a way of doing so), thus this new pull request with the same change (one commit with the same author, OP).

There were a few merge commits, which I rebased squashing everything into a single commit, keeping just the two files that were being displayed in the GitHub UI diff, and that contain the class and test for the Sorensen-Dice algo.

I will try to move it forward, even if the review comments are not displayed here. In case I do not managed to finish it, anyone is welcome to take it over. Just to make sure to read the feedback in the linked pull request, and address or comment CC'in others.

@kinow kinow marked this pull request as draft October 27, 2024 14:01
@garydgregory
Copy link
Member

Please don't use assert-j ;-) it's not as clean as junit (assert-j makes a mess of trying to do assertEquals for example).

@kinow
Copy link
Member Author

kinow commented Oct 27, 2024

Please don't use assert-j ;-) it's not as clean as junit (assert-j makes a mess of trying to do assertEquals for example).

I hadn't seen that that was used in the test. IDE failed to compile, and then I simply rewrote that as normal JUnit + assertThrows. Not sure why that had been used before, but it's removed. Will send a few separated commits addressing different feedback, will wait for new feedback (if I manage to get it to some alright state), and squash or drop commits as needed 👍 Feel free to commit anything here if you'd like, at any time 👍

Thanks!

@kinow
Copy link
Member Author

kinow commented Oct 27, 2024

I think the CI build might pass this time. I fixed the git conflicts, then ide warnings, checkstyles, javadocs. For today that's all the time I had 🌧️ Going to continue another day, probably starting reading @aherbert 's comments like this one #103 (comment), which I think should be addressed before this is ready for a review & merge. Thanks!

@ameyjadiye
Copy link
Contributor

I know its been very long time I was not able to work on this and #103 disscussion became too long to understand, I'm still keen to get this released. @kinow - is there anything I can help with ?

@kinow
Copy link
Member Author

kinow commented Apr 21, 2025

HI @ameyjadiye , apologies for closing your previous PR. I wanted to make some progress on this, but I couldn't update your branch.

I believe @aherbert had some good points on the algorithm implemented here (e.g. using bigrams). IMHO, we could merge it if we made it very clear this was a bigrams implementation, and added an issue to have another implementation fo the sorensen-dice algorithm without bigrams (similar to other algo implementations here). (But having the bigram implementation is only useful if there is a good reason for that, I think?)

There were other comments about the internals of the possible implementation. From memory, without opening the previous comments/threads, I believe it was about re-using the Jaccard similarity.

So I think we need to 1) first assess if the bigram implementation is really useful, 2) either create a separate issue or replace this by the non-bigram implementation, 3) and if so, check how hard it'd be to go with @aherbert 's suggestion to look into re-using the other similarity score/metric.

Then add tests, and it'd be ready for a final review.

Cheers, and good to see you around again! 👋

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants