Status for pmatch-based analysis/tokenisation #52

Issues:
- [X] Ambiguous input
    - Seems to work fine
- [X] Ambiguous multiword expressions with ambiguous tokenisation
    - Seems to work – represented within lexc now; hfst-tokenise also
      supports forms on the analyses now
- [X] Ambiguous multiword expressions need reorganising after CG
    - The module cg-mwesplit takes wordforms from readings and turns them into
      new cohorts
- [X] Unknown words
    - The set-difference method only works for words without
      flag diacritics (even though we should be working only on the form-side?)
      and leads to binary blow-up: with only lower unknowns, we get 45M;
      lower+upper gives 67M, while no unknowns gives 27M
    - Fixed instead by treating empty analyses as unknown-tokens in
      hfst-tokenise, and outputting unmatched strings with a prefix
- [ ] Treat input that's within superblanks as unmatched
    - probably requires a change in hfst-tokenise itself
- [X] Try >1 space for ambiguous MWEs? Represented within lexc now
- [ ] Try set-difference-unknowns method with regular hfst commands?
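
Two of the fixes above (empty analyses treated as unknown tokens, and MWE wordforms turned into new cohorts after CG) can be illustrated with a small Python model. This is only a sketch under stated assumptions: the `Cohort` class, the `UNKNOWN_PREFIX` value and both function names are hypothetical, and the data layout is a simplification, not the actual stream format or code of hfst-tokenise or cg-mwesplit.

```python
from dataclasses import dataclass, field

# Hypothetical marker; the real prefix used by hfst-tokenise is not stated here.
UNKNOWN_PREFIX = "?"


@dataclass
class Cohort:
    wordform: str
    # Each reading is a list of (wordform, analysis) parts. A reading that
    # covers a multiword expression token by token has more than one part.
    readings: list = field(default_factory=list)


def mark_unknown(cohort):
    """Give a cohort with no analyses a single prefixed 'unknown' reading."""
    if cohort.readings:
        return cohort
    unknown_reading = [(cohort.wordform, UNKNOWN_PREFIX + cohort.wordform)]
    return Cohort(cohort.wordform, [unknown_reading])


def split_mwe(cohort):
    """Turn the parts of the one surviving reading into separate cohorts.

    Assumes disambiguation has already reduced the cohort to one reading;
    anything else is returned unchanged.
    """
    if len(cohort.readings) != 1 or len(cohort.readings[0]) < 2:
        return [cohort]
    return [
        Cohort(form, [[(form, analysis)]])
        for form, analysis in cohort.readings[0]
    ]


if __name__ == "__main__":
    unmatched = mark_unknown(Cohort("xyzzy"))
    print(unmatched.readings)

    mwe = Cohort(
        "in spite of",
        [[("in", "in+Pr"), ("spite", "spite+N"), ("of", "of+Pr")]],
    )
    print([c.wordform for c in split_mwe(mwe)])
```

Keeping the unknown marker on the analysis side means the surface form passes through untouched, which avoids the transducer growth reported for the set-difference approach.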

Moved here from top of gramcheck tokeniser header.

@unhammer, @lynnda-hill - FYI
