@paulkernfeld paulkernfeld commented Sep 16, 2020

  • Move code in from wiki-ai/bwds repo
  • Upgrade to current versions of revscoring and other wikimedia libs
  • Add some unit tests

Phabricator task: T131861

@paulkernfeld paulkernfeld changed the title WIP: bad word detection system [DO NOT MERGE] bad word detection system Sep 16, 2020
@paulkernfeld
Author

I'm going to mark this as "ready for review" to try to get Travis to build it. Once I get aspell working (wikimedia/revscoring#498) I'll be able to build on my own machine.

@paulkernfeld paulkernfeld marked this pull request as ready for review September 16, 2020 20:26
@paulkernfeld paulkernfeld changed the title [DO NOT MERGE] bad word detection system bad word detection system Sep 18, 2020
@paulkernfeld
Author

@halfak I think this PR is pretty close. If you could help me out with the feature extraction step in the bot_gen function, that would be awesome.

@paulkernfeld
Author

All right, I think I figured it out!

  • I see that there are actually some "badwords" features mentioned in revscoring. For example, the feature dependency SVG mentions a feature called added_badwords_ratio. Do we want to use those features for this?
  • To check that the feature actually works, I tested it against the real English Wikipedia API. Should I skip that test or convert it to use offline data?
  • Right now my tests write out the cache files. Should I do that in a temp directory?
  • Before, the script needed to be invoked with a particular language but now it doesn't. Does this mean that I made a mistake, or is this just the march of progress?
  • Do we need dump_based_detection or only this script?
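
On the cache-file question above, one option would be to write the cache into a temporary directory that is removed when the test finishes. A minimal sketch, assuming a hypothetical `write_cache` helper and file name `badwords_cache.json` (neither is taken from this PR; they only stand in for whatever the script actually writes):

```python
import json
import tempfile
import unittest
from pathlib import Path


def write_cache(words, cache_dir):
    """Hypothetical stand-in for the script's cache-writing step."""
    path = Path(cache_dir) / "badwords_cache.json"
    # Sort so the file contents are deterministic across runs.
    path.write_text(json.dumps(sorted(words)))
    return path


class CacheTest(unittest.TestCase):
    def test_cache_written_to_temp_dir(self):
        # The directory and everything in it is deleted on exit.
        with tempfile.TemporaryDirectory() as tmp:
            path = write_cache({"foo", "bar"}, tmp)
            self.assertEqual(json.loads(path.read_text()), ["bar", "foo"])
        # Nothing is left behind in the working tree.
        self.assertFalse(path.exists())
```

This keeps the repo clean between runs, at the cost of the script needing to accept the cache location as a parameter.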

@paulkernfeld
Author

Hey @accraze. Not urgent at all from my perspective, but are you someone who could review this?

2 participants