-
Notifications
You must be signed in to change notification settings - Fork 264
TEXT-126: Adding Sorensen-Dice similarity algorithm #621
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Please don't use assert-j ;-) it's not as clean as junit (assert-j makes a mess of trying to do assertEquals for example). |
I hadn't seen that that was used in the test. IDE failed to compile, and then I simply rewrote that as normal JUnit + assertThrows. Not sure why that had been used before, but it's removed. Will send a few separated commits addressing different feedback, will wait for new feedback (if I manage to get it to some alright state), and squash or drop commits as needed 👍 Feel free to commit anything here if you'd like, at any time 👍 Thanks! |
I think the CI build might pass this time. I fixed the git conflicts, then ide warnings, checkstyles, javadocs. For today that's all the time I had 🌧️ Going to continue another day, probably starting reading @aherbert 's comments like this one #103 (comment), which I think should be addressed before this is ready for a review & merge. Thanks! |
src/test/java/org/apache/commons/text/similarity/SorensenDiceSimilarityTest.java
Show resolved
Hide resolved
HI @ameyjadiye , apologies for closing your previous PR. I wanted to make some progress on this, but I couldn't update your branch. I believe @aherbert had some good points on the algorithm implemented here (e.g. using bigrams). IMHO, we could merge it if we made it very clear this was a bigrams implementation, and added an issue to have another implementation fo the sorensen-dice algorithm without bigrams (similar to other algo implementations here). (But having the bigram implementation is only useful if there is a good reason for that, I think?) There were other comments about the internals of the possible implementation. From memory, without opening the previous comments/threads, I believe it was about re-using the Jaccard similarity. So I think we need to 1) first assess if the bigram implementation is really useful, 2) either create a separate issue or replace this by the non-bigram implementation, 3) and if so, check how hard it'd be to go with @aherbert 's suggestion to look into re-using the other similarity score/metric. Then add tests, and it'd be ready for a final review. Cheers, and good to see you around again! 👋 |
Supersedes #103
The work in #103 seemed to be going in the right direction but needed some review before being merged. I started syncing the code and rebasing it, but noticed that the code had been done on
master
branch in a fork. However, thatmaster
branch was later overwritten by other commits for other pull requests.This means I am not able to update the pull request (or at least I cannot think of a way of doing so), thus this new pull request with the same change (one commit with the same author, OP).
There were a few merge commits, which I rebased squashing everything into a single commit, keeping just the two files that were being displayed in the GitHub UI diff, and that contain the class and test for the Sorensen-Dice algo.
I will try to move it forward, even if the review comments are not displayed here. In case I do not managed to finish it, anyone is welcome to take it over. Just to make sure to read the feedback in the linked pull request, and address or comment CC'in others.