@JBGruber JBGruber commented Oct 19, 2020

I love the fuzzyjoin package, and today I wanted to learn a little more about how exactly it works. By coincidence, I stumbled across #71 and thought that implementing it would be a good way to understand the inner workings of the package (but feel free to reject this, as it was mainly a practice exercise that turned out better than I expected).

The PR still lacks some tests, but I first wanted to check whether you are interested in adding these functions.

For me, the main reason to work with similarities instead of distances is that they are standardized between 0 and 1 (at least for most methods). I usually work with longer texts of heterogeneous lengths: newspaper articles, for example, vary significantly in length, and trying to find duplicates based on distance alone is basically impossible.
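
As a rough illustration of that point, here is a minimal sketch in plain Python (it is not the code from this PR). It assumes Levenshtein distance and one common normalisation, 1 - distance / length of the longer string; the function names `levenshtein` and `similarity` are made up for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with the classic two-row dynamic programme."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / longest


# Two short, unrelated strings and two long, nearly identical strings.
short_a, short_b = "cat", "dog"
long_a, long_b = "x" * 200 + "cat", "x" * 200 + "dog"

short_distance = levenshtein(short_a, short_b)
long_distance = levenshtein(long_a, long_b)
short_similarity = similarity(short_a, short_b)
long_similarity = similarity(long_a, long_b)

print(short_distance, long_distance)                            # same raw distance
print(round(short_similarity, 3), round(long_similarity, 3))    # very different similarity
```

Both pairs have the same raw distance, so a single distance threshold cannot tell the unrelated short pair from the near-duplicate long pair, while the normalised score separates them clearly.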

@emilBeBri

Very nice, hopefully it will be implemented in the main branch! Thank you.

codecov-commenter commented Jul 19, 2025

Welcome to Codecov 🎉

Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests.

Thanks for integrating Codecov - We've got you covered ☂️
