5 changes: 3 additions & 2 deletions harper-comments/tests/language_support.rs
@@ -49,7 +49,7 @@ create_test!(merged_lines.ts, 1);
create_test!(javadoc_clean_simple.java, 0);
create_test!(javadoc_complex.java, 5);
create_test!(issue_132.rs, 1);
create_test!(laravel_app.php, 2);
create_test!(laravel_app.php, 3);
create_test!(ignore_shebang_1.sh, 0);
create_test!(ignore_shebang_2.sh, 0);
create_test!(ignore_shebang_3.sh, 0);
@@ -58,7 +58,8 @@ create_test!(common.mill, 1);

// Checks that some comments are masked out
create_test!(ignore_comments.rs, 1);
create_test!(ignore_comments.c, 1);
// Both spell_check and split_words linters flag this now
create_test!(ignore_comments.c, 2);

// These are to make sure nothing crashes.
create_test!(empty.js, 0);
@@ -17,7 +17,7 @@ public static void main(String[] args) {
}

/**
* This doc has a link in it: {@link this sould b ignor} but not tis
* This doc has a link in it: {@link this sould b ignor} but not thsi
*
* @param name this is anoher test.
*/
6 changes: 3 additions & 3 deletions harper-comments/tests/language_support_sources/jsdoc.ts
@@ -1,14 +1,14 @@
/** This is a doc comment.
* Since there are no keywords it _sould_ be checked. */
* Since there are no keywords it _shuld_ be checked. */
Collaborator:

I'm a little confused as to why these must be changed?

Collaborator Author:

> I'm a little confused as to why these must be changed?

I was too. A number of the tests are based on the number of lints found, and some unexpected splits result from pairs of "words" that are in the dictionary. In this case I think it was "soul" and "d". (Line 19,421 d/~5NXGJ)

I think having both this linter and dictionary comments will help us sort out which dictionary entries really shouldn't be there, or add justifications explaining why entries that don't look like words actually belong.

At the same time, the split_words logic can be tweaked. It was for things like this that I made it case-sensitive. There are many single-letter uppercase "words" and they are often easy to justify. I also thought about not splitting all the way down to single letters, but that would rule out a whole class of common errors, such as a way/away, from being spotted.

I think what we'll get is false positives that will give us insight into both curating the dictionary and tweaking heuristics in the split_words logic.
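
For concreteness, a rough sketch of that interaction using the test helpers added in this PR (it assumes "soul" and "d" really are both entries in the curated dictionary, per the line referenced above):

    // Would live alongside the tests at the bottom of split_words.rs.
    #[test]
    fn sould_is_flagged_as_a_split_word() {
        // "sould" is not a dictionary word, but "soul" and "d" both are,
        // so the linter flags it and suggests the split form.
        assert_lint_count("sould", SplitWords::default(), 1);
        assert_suggestion_result("sould", SplitWords::default(), "soul d");
    }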

Contributor:

I'm surprised to see such changes too.

Once again, I'm jumping into a discussion I'm trying to follow without being sure of the exact issue this PR tries to resolve.

So please be kind if my remarks are totally off topic 😅

The way I understood this PR, it's about using the words in the dictionary to decide whether a word that is not in the dictionary is actually made up of words that are in the dictionary.

So if someone added foo and bar, foobar and barfoo won't be reported as invalid.

But if someone added mergeable, we may accept unmergeable, but apparently maybe also mergeableun and some other strange variations, such as inmergeable...

I mean, unless I'm wrong, it means that from now on, the following words wouldn't be reported as errors:

  • tis
  • sould
  • aadd
  • ork
  • misstakes

I can get that misstakes is made of two real words, miss and takes.

But would takesmiss be accepted?

Maybe I simply don't get the scope of this PR and/or this rule.

But from my perspective, the split logic should be limited to words that are longer than a few characters.

But I would exclude one-letter "words" from this logic. I mean having:

  • a, which leads to accepting aadd as a + add
  • k, which leads to accepting ork as or + k
  • d, which leads to accepting sould as soul + d

And maybe 2-letter (and perhaps 3-letter) word parts should only count if they come from a manually maintained list (see the sketch at the end of this comment).

All this makes me think something is off, either in the logic of this PR or simply in my understanding.

Once again, I might be totally off base.
But seeing that test files were updated because the rule was updated makes me think there is a problem.
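
A minimal sketch of the kind of guard I have in mind, written as a hypothetical helper (the names first_half, second_half and min_part_len mirror the PR's loop, but the function itself is not part of this diff):

    /// Hypothetical helper: only accept a candidate split when both halves reach
    /// a minimum length, so single letters like "a", "k" or "d" never justify one.
    fn halves_are_plausible(first_half: &[char], second_half: &[char], min_part_len: usize) -> bool {
        first_half.len() >= min_part_len && second_half.len() >= min_part_len
    }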

Collaborator Author:

No no. It detects places where you either missed hitting the spacebar or wrongly think something like incase or everytime is a standard word. So it would flag 'misstakes' and 'takesmiss' (which would already be flagged as misspellings) and add suggestions like "did you mean 'miss takes'?". It only does these checks for words that are not in the dictionary. It'll find things written as compounds by speakers of Germanic languages who forget they're written as two words in English, as well as product names and trademarks like 'vscode' and 'wifi' that people write without a space but that are officially written with a space or hyphen.

The tests it changes are all brittle tests that depend on the number of lints found not changing.
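
As a hedged illustration of the intended behaviour (assuming 'everytime' and 'incase' are not in the curated dictionary while their parts are), the test helpers from this PR could express it like this:

    // Would sit in the tests module of split_words.rs, next to the existing cases.
    #[test]
    fn joined_words_get_split_suggestions() {
        // Missed-spacebar style errors: the joined form is not a word,
        // but both halves are, so a "two words" suggestion is offered.
        assert_suggestion_result("everytime", SplitWords::default(), "every time");
        assert_lint_count("Email me incase it breaks.", SplitWords::default(), 1);
    }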

function test(){}

/** This is also a doc comment.
* @class this sould be unchecked. */
class Clazz { }

/** Here is another example: {@link this sould also b unchecked}. But this _sould_ be.*/
/** Here is another example: {@link this sould also b unchecked}. But this _shuold_ be.*/

/** However, tis should be checked, while {@link tis should not} */
/** However, thsi should be checked, while {@link tis should not} */

/**
* The following examples should be ignored by Harper.
@@ -9,4 +9,4 @@ int test() {}
*/
int arbitrary() {}

/// Let's aadd a cuple spelling errors for good measure.
/// Let's putin a cuple spelling errors for good measure.
@@ -9,5 +9,5 @@ function test() {}
*/
function arbitrary() {}

// Let's aadd a cuple spelling errors for good measure.
// Let's putin a cuple spelling errors for good measure.

2 changes: 2 additions & 0 deletions harper-core/src/linting/lint_group.rs
@@ -60,6 +60,7 @@ use super::somewhat_something::SomewhatSomething;
use super::spaces::Spaces;
use super::spell_check::SpellCheck;
use super::spelled_numbers::SpelledNumbers;
use super::split_words::SplitWords;
use super::that_which::ThatWhich;
use super::the_how_why::TheHowWhy;
use super::the_my::TheMy;
@@ -350,6 +351,7 @@ impl LintGroup {
insert_pattern_rule!(SomewhatSomething, true);
insert_struct_rule!(Spaces, true);
insert_struct_rule!(SpelledNumbers, false);
insert_struct_rule!(SplitWords, true);
insert_pattern_rule!(ThatWhich, true);
insert_pattern_rule!(TheHowWhy, true);
insert_struct_rule!(TheHowWhy, true);
2 changes: 2 additions & 0 deletions harper-core/src/linting/mod.rs
@@ -64,6 +64,7 @@ mod somewhat_something;
mod spaces;
mod spell_check;
mod spelled_numbers;
mod split_words;
mod suggestion;
mod that_which;
mod the_how_why;
@@ -130,6 +131,7 @@ pub use somewhat_something::SomewhatSomething;
pub use spaces::Spaces;
pub use spell_check::SpellCheck;
pub use spelled_numbers::SpelledNumbers;
pub use split_words::SplitWords;
pub use suggestion::Suggestion;
pub use that_which::ThatWhich;
pub use the_how_why::TheHowWhy;
128 changes: 128 additions & 0 deletions harper-core/src/linting/split_words.rs
@@ -0,0 +1,128 @@
use std::sync::Arc;

use crate::{CharString, Dictionary, Document, FstDictionary};

use super::{Lint, LintKind, Linter, Suggestion};

pub struct SplitWords {
    dict: Arc<FstDictionary>,
}

impl SplitWords {
    pub fn new() -> Self {
        Self {
            dict: FstDictionary::curated(),
        }
    }
}

impl Default for SplitWords {
    fn default() -> Self {
        Self::new()
    }
}

impl Linter for SplitWords {
    fn lint(&mut self, document: &Document) -> Vec<Lint> {
        let mut lints = Vec::new();

        let (mut word1, mut word2) = (CharString::new(), CharString::new());

        for w in document.tokens() {
            if !w.kind.is_word() {
                continue;
            }

            if w.span.len() < 2 {
                continue;
            }

            let w_chars = document.get_span_content(&w.span);

            if self.dict.contains_word(w_chars) {
                continue;
            }

            let mut found = false;

            for i in 1..w_chars.len() {
Collaborator:

This looks like O(n^2), which could have significant performance considerations. Run cargo bench on master and on this branch to get a good understanding of the performance implications.

Collaborator Author:

> This looks like O(n^2), which could have significant performance considerations. Run cargo bench on master and on this branch to get a good understanding of the performance implications.

It's definitely never going to be free. Heuristics and optimization can probably help. Benching now...
master before pulling latest changes

     Running benches/parse_demo.rs (target/release/deps/parse_demo-e61b92e973b62c69)
Benchmarking parse_essay: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.0s, enable flat sampling, or reduce sample count to 60.
parse_essay             time:   [1.1415 ms 1.1487 ms 1.1594 ms]
Found 11 outliers among 100 measurements (11.00%)
  6 (6.00%) high mild
  5 (5.00%) high severe

lint_essay              time:   [4.0558 ms 4.0617 ms 4.0686 ms]
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) high mild
  8 (8.00%) high severe

lint_essay_uncached     time:   [36.347 ms 37.231 ms 38.571 ms]
Found 9 outliers among 100 measurements (9.00%)
  2 (2.00%) high mild
  7 (7.00%) high severe

split_words without merging latest changes

     Running benches/parse_demo.rs (target/release/deps/parse_demo-934b9b6b76825e17)
Benchmarking parse_essay: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.2s, enable flat sampling, or reduce sample count to 60.
parse_essay             time:   [1.1756 ms 1.1879 ms 1.2016 ms]
                        change: [+2.3471% +3.5357% +4.9174%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  6 (6.00%) high mild
  3 (3.00%) high severe

lint_essay              time:   [4.6598 ms 5.4398 ms 6.8017 ms]
                        change: [+15.204% +33.930% +75.811%] (p = 0.01 < 0.05)
                        Performance has regressed.
Found 15 outliers among 100 measurements (15.00%)
  5 (5.00%) high mild
  10 (10.00%) high severe

lint_essay_uncached     time:   [36.381 ms 36.485 ms 36.601 ms]
                        change: [-5.4025% -2.0040% +0.3988%] (p = 0.23 > 0.05)
                        No change in performance detected.
Found 9 outliers among 100 measurements (9.00%)
  6 (6.00%) high mild
  3 (3.00%) high severe

master after pulling latest changes

     Running benches/parse_demo.rs (target/release/deps/parse_demo-db6a88add7c2fdb6)
Benchmarking parse_essay: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.0s, enable flat sampling, or reduce sample count to 60.
parse_essay             time:   [1.1495 ms 1.1665 ms 1.1878 ms]
                        change: [-3.5494% -1.8548% +0.1785%] (p = 0.05 > 0.05)
                        No change in performance detected.
Found 17 outliers among 100 measurements (17.00%)
  9 (9.00%) high mild
  8 (8.00%) high severe

lint_essay              time:   [4.1378 ms 4.2216 ms 4.3449 ms]
                        change: [-38.141% -22.394% -8.9018%] (p = 0.02 < 0.05)
                        Performance has improved.
Found 16 outliers among 100 measurements (16.00%)
  6 (6.00%) high mild
  10 (10.00%) high severe

lint_essay_uncached     time:   [36.746 ms 37.151 ms 37.752 ms]
                        change: [+0.6915% +1.8258% +3.5692%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high severe

We could have heuristics such as only doing this for words over a certain length, and only trying splits at several points near the middle. I didn't try to get rid of all allocations and clones, or use slices rather than strings as much as possible, etc. I think there's a trade-off.

We could also have it turned off by default?
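
Not part of this diff, but a rough sketch of those two heuristics (a minimum word length, plus a capped number of probe points fanning out from the middle); the function name and thresholds are made up:

    /// Hypothetical: return the only split positions worth probing for a word of
    /// `len` characters, mirroring the midpoint fan-out below but capped at
    /// `max_probes` and skipped entirely for short words.
    fn candidate_splits(len: usize, max_probes: usize, min_word_len: usize) -> Vec<usize> {
        if len < min_word_len {
            return Vec::new();
        }
        let mid = len / 2;
        (1..=max_probes)
            .map(|i| if i % 2 == 0 { mid + i / 2 } else { mid.saturating_sub(i / 2) })
            .filter(|&p| p > 0 && p < len)
            .collect()
    }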

                let midpoint = w_chars.len() / 2;
                let midpoint = if i & 1 == 0 {
                    midpoint + i / 2
                } else {
                    midpoint - i / 2
                };

                let first_half = &w_chars[..midpoint];
                let second_half = &w_chars[midpoint..];

                word1.clear();
                word1.extend_from_slice(first_half);
                word2.clear();
                word2.extend_from_slice(second_half);

                if self.dict.contains_exact_word(&word1) && self.dict.contains_exact_word(&word2) {
                    let mut open = word1.clone();
                    open.push(' ');
                    open.extend_from_slice(second_half);

                    lints.push(Lint {
                        span: w.span,
                        lint_kind: LintKind::WordChoice,
                        suggestions: vec![Suggestion::ReplaceWith(open.to_vec())],
                        message: "It seems this is actually two words joined together.".to_owned(),
                        priority: 63,
                    });
                    found = true;
                }

                // The following logic won't be useful unless and until hyphenated words are added to the dictionary

                let mut hyphenated = word1.clone();
                hyphenated.push('-');
                hyphenated.extend_from_slice(second_half);

                if self.dict.contains_exact_word(&hyphenated) {
                    lints.push(Lint {
                        span: w.span,
                        lint_kind: LintKind::WordChoice,
                        suggestions: vec![Suggestion::ReplaceWith(hyphenated.to_vec())],
                        message: "It seems this is actually two words joined together.".to_owned(),
                        priority: 63,
                    });
                    found = true;
                }

                if found {
                    break;
                }
            }
        }
        lints
    }

    fn description(&self) -> &str {
        "Accidentally forgetting a space between words is common. This rule looks for valid words that are joined together without whitespace."
    }
}

#[cfg(test)]
mod tests {
    use crate::linting::tests::{assert_lint_count, assert_suggestion_result};

    use super::SplitWords;

    #[test]
    fn heretofore() {
        assert_lint_count(
            "onetwo threefour fivesix seveneight nineten.",
            SplitWords::default(),
            5,
        );
    }

    #[test]
    fn foobar() {
        assert_suggestion_result("moreso", SplitWords::default(), "more so");
    }
}
2 changes: 1 addition & 1 deletion harper-core/tests/test_sources/chinese_lorem_ipsum.md
@@ -1,4 +1,4 @@
The following text was generated using [a Chinese lorem ipsum generator](https://pinkylam.me/generator/chinese-lorem-ipsum/).
The following text was generated using [a Chinese lorrm ipsum generator](https://pinkylam.me/generator/chinese-lorem-ipsum/).

食棵支每躲種。奶象打星爪子二細喜才記行在發像原斤!頁固點子衣點豆看身蝴看苗急午公何足,筆娘經色蝶行元香也要。麻了綠尼固世,色北書目登功;因告黑。

2 changes: 1 addition & 1 deletion harper-core/tests/test_sources/pr_504.md
@@ -4,4 +4,4 @@ These say "This is in Greek/Georgian/Thai" in those languages:
ეს ქართულად.
นี่มันภาษาไทย

This is English with misstakes.
This is English with erors.
@@ -55,13 +55,13 @@
#titleblock(
title: "A fluid dynamic model for glaier flow",
authors: ("Grant Lemons", "John Doe", "Jane Doe"),
abstract: lorem(80),
abstract: lorrm(80),
doc,
)
]

= Introduction
#lorem(300)

= Related ork
= Related wrk
#lorem(200)