feat: split-words linter counterpart to merge-words #885
@@ -0,0 +1,128 @@
    use std::sync::Arc;

    use crate::{CharString, Dictionary, Document, FstDictionary};

    use super::{Lint, LintKind, Linter, Suggestion};

    pub struct SplitWords {
        dict: Arc<FstDictionary>,
    }

    impl SplitWords {
        pub fn new() -> Self {
            Self {
                dict: FstDictionary::curated(),
            }
        }
    }

    impl Default for SplitWords {
        fn default() -> Self {
            Self::new()
        }
    }

    impl Linter for SplitWords {
        fn lint(&mut self, document: &Document) -> Vec<Lint> {
            let mut lints = Vec::new();

            // Scratch buffers reused across iterations to avoid reallocating.
            let (mut word1, mut word2) = (CharString::new(), CharString::new());

            for w in document.tokens() {
                if !w.kind.is_word() {
                    continue;
                }

                // A token needs at least two characters to be splittable.
                if w.span.len() < 2 {
                    continue;
                }

                let w_chars = document.get_span_content(&w.span);

                // Only tokens missing from the dictionary are candidates.
                if self.dict.contains_word(w_chars) {
                    continue;
                }

                let mut found = false;

                for i in 1..w_chars.len() {
This looks like [...]

It's definitely never going to be free. Heuristics and optimization can probably help. Benching now...

We could have heuristics such as only doing this for words over a certain length, and only trying splits at several points near the middle. I didn't try to get rid of all allocations and clones, use slices rather than strings as much as possible, etc. I think there's a trade-off. We could also have it turned off by default?
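A rough sketch of what those guards might look like, kept separate from the linter itself. The constants and the capped zigzag are illustrative assumptions, not benchmarked values:

```rust
// Hypothetical heuristics for limiting split-point probing. The thresholds
// are placeholder assumptions, not tuned against a benchmark.
const MIN_SPLIT_LEN: usize = 6; // skip short tokens entirely
const MAX_PROBES: usize = 4; // only try a few points near the middle

fn candidate_split_points(len: usize) -> Vec<usize> {
    if len < MIN_SPLIT_LEN {
        return Vec::new();
    }
    let midpoint = len / 2;
    // Same zigzag as the loop in the diff, but capped at MAX_PROBES offsets.
    (1..=MAX_PROBES)
        .map(|i| if i & 1 == 0 { midpoint + i / 2 } else { midpoint - i / 2 })
        .collect()
}

fn main() {
    // For a 9-character token, probe only split points 4, 5, 3, 6.
    assert_eq!(candidate_split_points(9), vec![4, 5, 3, 6]);
    // Below the length threshold, nothing is probed at all.
    assert!(candidate_split_points(4).is_empty());
}
```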
                    // Zigzag outward from the token's middle: try the midpoint
                    // first, then one to the right, one to the left, and so on.
                    let midpoint = w_chars.len() / 2;
                    let midpoint = if i & 1 == 0 {
                        midpoint + i / 2
                    } else {
                        midpoint - i / 2
                    };

                    let first_half = &w_chars[..midpoint];
                    let second_half = &w_chars[midpoint..];

                    word1.clear();
                    word1.extend_from_slice(first_half);
                    word2.clear();
                    word2.extend_from_slice(second_half);

                    if self.dict.contains_exact_word(&word1) && self.dict.contains_exact_word(&word2) {
                        // Both halves are dictionary words: suggest the open
                        // compound, i.e. the two halves joined by a space.
                        let mut open = word1.clone();
                        open.push(' ');
                        open.extend_from_slice(second_half);

                        lints.push(Lint {
                            span: w.span,
                            lint_kind: LintKind::WordChoice,
                            suggestions: vec![Suggestion::ReplaceWith(open.to_vec())],
                            message: "It seems this is actually two words joined together.".to_owned(),
                            priority: 63,
                        });
                        found = true;
                    }

                    // The following logic won't be useful unless and until
                    // hyphenated words are added to the dictionary.

                    let mut hyphenated = word1.clone();
                    hyphenated.push('-');
                    hyphenated.extend_from_slice(second_half);

                    if self.dict.contains_exact_word(&hyphenated) {
                        lints.push(Lint {
                            span: w.span,
                            lint_kind: LintKind::WordChoice,
                            suggestions: vec![Suggestion::ReplaceWith(hyphenated.to_vec())],
                            message: "It seems this is actually two words joined together.".to_owned(),
                            priority: 63,
                        });
                        found = true;
                    }

                    if found {
                        break;
                    }
                }
            }
            lints
        }

        fn description(&self) -> &str {
            "Accidentally forgetting a space between words is common. This rule looks for valid words that are joined together without whitespace."
        }
    }

    #[cfg(test)]
    mod tests {
        use crate::linting::tests::{assert_lint_count, assert_suggestion_result};

        use super::SplitWords;

        #[test]
        fn heretofore() {
            assert_lint_count(
                "onetwo threefour fivesix seveneight nineten.",
                SplitWords::default(),
                5,
            );
        }

        #[test]
        fn foobar() {
            assert_suggestion_result("moreso", SplitWords::default(), "more so");
        }
    }
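To make the probe order above concrete, here is the midpoint arithmetic run on its own; this is a standalone sketch mirroring the loop, not code from the PR. For an 8-character token (midpoint 4), the scan visits split points middle-out rather than left to right:

```rust
fn main() {
    let len = 8; // an 8-character token, midpoint at index 4
    let midpoint = len / 2;
    let order: Vec<usize> = (1..len)
        .map(|i| if i & 1 == 0 { midpoint + i / 2 } else { midpoint - i / 2 })
        .collect();
    // The middle split is tried first, then the scan fans outward.
    assert_eq!(order, vec![4, 5, 3, 6, 2, 7, 1]);
}
```

For `moreso` (length 6) the probe order is 3, 4, 2, 5, 1, so the `more` + `so` split that the `foobar` test expects is found on the second probe.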
I'm a little confused as to why these must be changed?
I was too. A number of the tests are based on the number of lints found, and some unexpected splits result in pairs of "words" that are both in the dictionary. In this case I think it was "soul" and "d" (line 19,421: `d/~5NXGJ`). I think having both this linter and dictionary comments will help us sort out which dictionary entries really shouldn't be there, or add justifications for entries that don't look like words, explaining why they actually belong.

At the same time, the `split_words` logic can be tweaked. It was for things like this that I made it case sensitive: there are many single-letter uppercase "words", and they are often easy to justify. I also thought of not splitting all the way down to single letters, but that would rule out a whole class of common errors, such as `a way`/`away`, from being spotted. I think what we'll get is false positives that will give us insight into both curating the dictionary and tweaking heuristics in the `split_words` logic.
I'm surprised to see such changes too.

Once again, I'm jumping into a discussion I'm trying to follow without being sure of the exact issue this PR tries to resolve. So please be kind if my remarks are totally off topic 😅

The way I understood this PR, it's about using words that are in the dictionary to work out whether a word that is not in the dictionary is made of words that are present in the dictionary. So if someone added `foo` and `bar`, `foobar` and `barfoo` won't be reported as invalid. But if someone added `mergeable`, we may accept `unmergeable`, but apparently maybe also `mergeableun` and some other strange variations, such as `inmergeable`... I mean, unless I'm wrong, it means that from now on, the following words wouldn't be reported as errors? I can get that `misstakes` is made of two real words, `miss` and `takes`. But would `takesmiss` be accepted?

Maybe I simply don't get the scope of this PR and/or this rule. But from my perspective, the split logic should be limited to words that are longer than a few characters, and I would exclude one-letter "words" from it. I mean, having:

- `aadd` as made of `a + add`
- `ork` as made of `or + k`
- `sould` as made of `soul + d`

And maybe 2-letter words (and maybe 3-letter ones) should come from a manually maintained list.

All of this makes me think there is a problem somewhere, either in the logic of this PR or simply in my understanding. Once again, I might be totally out of scope, but seeing that test files were updated because the rule was updated makes me think there is a problem.
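A minimal sketch of that proposal, assuming the split logic gained a per-fragment check; the allowlist contents and cutoffs below are placeholders, not a vetted list:

```rust
// Sketch of the proposed guard: one-letter fragments must come from a tiny
// allowlist, and 2-3 letter fragments from a manually maintained list.
// Everything here is a placeholder assumption, not part of the PR.
const ONE_LETTER_WORDS: &[&str] = &["a", "I"];
const SHORT_WORD_LIST: &[&str] = &["an", "as", "at", "in", "is", "it", "of", "on", "or", "to"];

fn fragment_is_plausible(fragment: &str) -> bool {
    match fragment.chars().count() {
        0 => false,
        1 => ONE_LETTER_WORDS.contains(&fragment),
        2 | 3 => SHORT_WORD_LIST.contains(&fragment),
        _ => true, // longer fragments fall through to the dictionary check
    }
}

fn main() {
    assert!(fragment_is_plausible("a")); // keeps "away" -> "a way" findable
    assert!(!fragment_is_plausible("d")); // blocks "sould" -> "soul" + "d"
    assert!(!fragment_is_plausible("k")); // blocks "ork" -> "or" + "k"
    assert!(fragment_is_plausible("soul"));
}
```

Keeping `a` in the one-letter allowlist would preserve the `a way`/`away` class of errors mentioned above while still blocking splits like `soul` + `d`.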
No no. It detects places where you either missed hitting the spacebar or wrongly think something like `incase` or `everytime` is a standard word. So it would flag `misstakes` and `takesmiss` (which would already be flagged as misspellings) and add suggestions like "did you mean `miss takes`?". It only does these checks for words that are not in the dictionary. It'll find things written as compounds by Germanic-language speakers who forget they're written as two words in English, as well as product names and trademarks like 'vscode' and 'wifi' that people write without a space but that are officially written with a space or hyphen.

The tests it changes are all brittle tests that depend on the number of lints found not changing.
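If that's the intended behaviour, it could be pinned down with tests in the same style as the PR's test module. A hypothetical sketch, assuming the curated dictionary contains `in`, `case`, `every`, and `time`:

```rust
// Hypothetical additions to the PR's test module, using the same helpers
// it already imports (assert_suggestion_result, SplitWords).
#[test]
fn incase() {
    assert_suggestion_result("incase", SplitWords::default(), "in case");
}

#[test]
fn everytime() {
    assert_suggestion_result("everytime", SplitWords::default(), "every time");
}
```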