- `code`: Contains all the code (and some of the data) for the project. Large models are not included.
  - `bert`: Contains the parameters of the trained BERT models, as well as their training, testing, and evaluation data and some related scripts from the aforementioned link.
  - `marmot`: Contains the data needed for training, evaluating, and testing a Marmot model.
  - `preproc_bert.py`: A script that preprocesses files for BERT by removing unnecessary tokens (those that describe the range of agglutinations of tokens).
  - `functions.py`: A Python file with functions that are reused across multiple Jupyter Notebooks (also to keep the notebooks shorter).
  - `requirements.txt`: A list of necessary libraries/modules (not complete).
  - `morfeusz_concraft_preannotation.ipynb`: A notebook that was used to generate the CoNLL-U files with lemma, UPOS, and XPOS information for manual annotation.
  - `NKJP-vocabulary-check.ipynb`: A notebook where the vocabulary of the memoir and of the test set is checked against the NKJP.
  - `bigram-and-trigram-statistics.ipynb`: A notebook where bigram and trigram counts of UPOS and XPOS tags are obtained for the test set and the memoir.
  - `korba_data_extraction.ipynb`: A notebook where Korba data is extracted. This data is not used due to a different tagset!
  - `*.ipynb`: Other notebooks detailing the evaluation of different taggers and lemmatizers.
- `data`: Contains the majority of the data used in the project.
  - `mistakes`: Contains the outputs of the Jupyter Notebooks listing the mistakes made by specific taggers or lemmatizers.
  - `ngram_stats`: Contains the n-gram statistics.
  - `ud-treebanks`: Contains the UD treebanks, of which PDB is used. Not included in this repository due to size.
  - `korba_corpus`: Contains the Korba Corpus, a corpus of 17th and 18th century Polish. Not included in the repository due to size and not used due to a different tagset.
  - `memoirs.txt`: The text of the memoir, with one line corresponding to one paragraph.
  - `memoirs_annotated*.txt`: Files containing different stages and/or amounts of annotation. Refer to `memoirs_annotated_10k.txt` for the most recent version of the UPOS annotation in that format.
  - `memoirs*.conllu`: Files containing preannotation or annotation in the CoNLL-U format. Refer to `memoirs_3k_corrected.conllu` for the fully corrected version (UPOS, XPOS, lemma), and to `memoirs_10k_corrected.conllu` for a version with 10k tokens of manual UPOS annotation; not all XPOS and lemma annotation in that file has been reviewed.
  - `korba*.txt`: Two files with selected Korba data that was not used due to differences in the tagset.
- `thesis-plan`: Contains different versions of the thesis plan and its presentation.
- `thesis/latex-2022`: Contains the LaTeX files and the PDF of the most recent version of the thesis itself.
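
Two of the steps listed above operate directly on CoNLL-U text and can be illustrated briefly. The following is a minimal sketch, not the code from `preproc_bert.py` or the notebooks. It assumes that the "unnecessary tokens" are CoNLL-U multiword-token range lines (IDs such as `1-2`) and that n-grams are counted within sentences only. The function names `drop_range_lines` and `upos_ngrams` and the sample sentence are hypothetical.

```python
from collections import Counter


def drop_range_lines(conllu_text):
    """Remove multiword-token range lines (IDs such as "1-2") from CoNLL-U text.

    Assumption: these are the lines the preprocessing step is meant to drop.
    """
    kept = []
    for line in conllu_text.splitlines():
        if line and not line.startswith("#"):
            token_id = line.split("\t", 1)[0]
            if "-" in token_id:
                continue  # range line: skip it
        kept.append(line)
    return "\n".join(kept)


def upos_ngrams(conllu_text, n):
    """Count UPOS n-grams per sentence (column 4 of each plain token line)."""
    counts = Counter()
    sentence = []
    # The extra "" flushes the last sentence if the text lacks a final blank line.
    for line in conllu_text.splitlines() + [""]:
        if not line.strip():
            counts.update(zip(*(sentence[i:] for i in range(n))))
            sentence = []
        elif not line.startswith("#"):
            cols = line.split("\t")
            if cols[0].isdigit():  # ignores range lines and empty nodes
                sentence.append(cols[3])
    return counts


# Hypothetical sample: "Zrobiłem" split into two syntactic words.
rows = [
    ["1-2", "Zrobiłem", "_", "_", "_", "_", "_", "_", "_", "_"],
    ["1", "Zrobił", "zrobić", "VERB", "_", "_", "0", "root", "_", "_"],
    ["2", "em", "być", "AUX", "_", "_", "1", "aux", "_", "_"],
    ["3", "to", "to", "PRON", "_", "_", "1", "obj", "_", "_"],
    ["4", ".", ".", "PUNCT", "_", "_", "1", "punct", "_", "_"],
]
sample = "# text = Zrobiłem to.\n" + "\n".join("\t".join(r) for r in rows) + "\n"

if __name__ == "__main__":
    print(drop_range_lines(sample))
    print(upos_ngrams(sample, 2).most_common())
```

Counting within sentence boundaries keeps tag pairs from spanning two sentences; whether the notebook does the same is not stated here.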
Turtilla/swe-ma-thesis
About
MA thesis from the MLT program at the University of Gothenburg (graduation year 2023)