# HW4: Phrase-Based Translation

You will need to download the `data.tar.gz` file from the course website and *uncompress* it into the `data` folder inside `hw4` (if you put it elsewhere, change the location in the code). You should then be able to run:

```
python data.py
```

The current assignment description is available [here](http://sameersingh.org/courses/statnlp/wi17/assignments.html#hw4).

## Files

There are quite a few files in this folder:

* `lm.py`: As in HW2, this code provides an implementation of a trigram language model with Kneser-Ney smoothing, along with the parameters of such a model trained on a large corpus of English documents. Note that since we are computing $P(f|e)P(e)$, we do not need a language model of French to perform the decoding. The language model file is in the ARPA format: it lists 1-, 2-, and 3-grams together with their log probabilities and backoff weights (the backoff lookup is sketched below).
You can load and query the language model using the main function of this file.
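
To make the ARPA backoff scheme concrete, here is a minimal sketch of the standard lookup recursion. The `logprob` and `backoff` dictionaries are hypothetical stand-ins for the tables `lm.py` builds internally, not its actual API:

```python
import math

# Minimal sketch of ARPA-style trigram lookup with backoff. `logprob`
# maps an n-gram tuple to its log10 probability and `backoff` maps a
# context tuple to its backoff weight; both names are assumptions.

def trigram_logprob(logprob, backoff, w1, w2, w3):
    """log10 P(w3 | w1, w2) under the standard ARPA backoff scheme."""
    if (w1, w2, w3) in logprob:          # exact trigram is listed
        return logprob[(w1, w2, w3)]
    if (w2, w3) in logprob:              # back off to the bigram
        return backoff.get((w1, w2), 0.0) + logprob[(w2, w3)]
    # back off all the way to the unigram (unknown words get -inf)
    return (backoff.get((w1, w2), 0.0) + backoff.get((w2,), 0.0)
            + logprob.get((w3,), -math.inf))
```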

* `phrase.py`: Code for the French-to-English phrase table. Each line of the table file contains a French phrase, an English phrase, and a number of scores for different *features* of the pair. The code reads this file and computes a single score $g_p$ for each phrase pair. It also provides a handy method to get all possible phrase translations for a given sentence: `phrases()` corresponds to `Phrases` in the pseudocode (see the sketch below).
You can investigate the translation table as shown in the main function.
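
The sketch below shows the two jobs this file performs, under assumptions about the file format: the Moses-style `french ||| english ||| feature scores` layout and the equal weighting of features are guesses, not necessarily what `phrase.py` actually reads.

```python
import math

# Hypothetical sketch: turn a phrase-table line into one log-space
# score g_p, and enumerate every translatable span of a sentence
# (`Phrases` in the pseudocode). Format and weighting are assumptions.

def parse_line(line):
    french, english, feats = line.strip().split(" ||| ")
    # combine the feature probabilities into one score by summing logs
    g = sum(math.log(float(x)) for x in feats.split())
    return tuple(french.split()), english, g

def phrases(table, sentence, max_len=5):
    """All (s, t, english, g) where words s..t of `sentence` have an
    entry in `table` (1-indexed, inclusive spans, as in the pseudocode)."""
    out = []
    for s in range(1, len(sentence) + 1):
        for t in range(s, min(s + max_len - 1, len(sentence)) + 1):
            for english, g in table.get(tuple(sentence[s - 1:t]), []):
                out.append((s, t, english, g))
    return out
```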

* `decoder.py`: Implementation of the multiple-stack decoding algorithm.
This implementation follows the notation of the pseudocode (and Collins' notes) as closely as possible, deviating only where needed for an optimized implementation.
The code implements a working monotonic decoder that does not take the language model into account.
Pay particular attention to the code for finding compatible phrases (`Compatible`), computing the language model score (`lm_score`), and computing the distortion score (`dist_score`).
Two places where the code deviates from the pseudocode: the set of phrases to consider for each position $r$ is precomputed in `index_phrases`, and the state carries extra fields to make equality comparisons efficient (`key` in `State`); a skeleton of the overall search appears below. You will need to develop a reasonable understanding of this code, so please post privately or publicly on Piazza if there is anything you cannot follow.
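
For orientation, here is a skeleton of the multi-stack beam search following the structure of Collins' notes rather than `decoder.py` itself. The helpers `compatible` and `transition` are hypothetical stand-ins for `Compatible` and the $ps(q, p)$ transition (which would fold in `lm_score` and `dist_score`); only the search and recombination structure is shown.

```python
from collections import namedtuple

# State fields mirror the pseudocode: the last two English words (for
# the trigram LM), the coverage bitmap, the end position of the last
# phrase, and the running score.
State = namedtuple("State", "e1 e2 bitmap r score")

def key(q):
    # the fields that determine all future scores; two states with equal
    # keys can be recombined, keeping only the higher-scoring one
    return (q.e1, q.e2, q.bitmap, q.r)

def decode(sentence, compatible, transition, beam=100):
    n = len(sentence)
    stacks = [dict() for _ in range(n + 1)]   # stack i: states covering i words
    q0 = State("<s>", "<s>", (False,) * n, 0, 0.0)
    stacks[0][key(q0)] = q0
    for i in range(n):
        # prune to the `beam` best states, then expand each of them
        for q in sorted(stacks[i].values(), key=lambda q: -q.score)[:beam]:
            for p in compatible(q):            # phrases that fit q's bitmap
                q2 = transition(q, p)          # ps(q, p) in the pseudocode
                j = sum(q2.bitmap)             # words covered after adding p
                old = stacks[j].get(key(q2))
                if old is None or q2.score > old.score:
                    stacks[j][key(q2)] = q2    # recombination
    return max(stacks[n].values(), key=lambda q: q.score)
```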

* `submission.py`: Skeleton code for the submission. It contains the three types of decoders, of which only the first, `MonotonicDecoder`, works as intended. You have to implement the remaining functions in this skeleton.

* `data.py`: This code reads in the files for the translation model, reads French sentences from `test.fr` and the corresponding English translations from `test.en`, runs the model on the French sentences, and computes the BLEU score of the predictions.
It also contains some simple French words and phrases to translate, just to test your decoder.

* `bleu_score.py`: Code for computing the BLEU score for each translation/prediction pair (this code was ported from NLTK by Zhengli Zhao); a simplified sketch of the metric is below.
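
To show the quantity being computed, here is a stripped-down sentence-level BLEU (single reference, uniform weights, no smoothing); the actual port in `bleu_score.py` handles the general case.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(reference, hypothesis, max_n=4):
    log_prec = []
    for n in range(1, max_n + 1):
        hyp, ref = ngrams(hypothesis, n), ngrams(reference, n)
        overlap = sum((hyp & ref).values())     # clipped n-gram matches
        if overlap == 0:
            return 0.0                          # unsmoothed: one zero precision kills the score
        log_prec.append(math.log(overlap / sum(hyp.values())))
    # brevity penalty: 1 if the hypothesis is long enough, else exp(1 - r/c)
    bp = min(1.0, math.exp(1 - len(reference) / len(hypothesis)))
    return bp * math.exp(sum(log_prec) / max_n)  # geometric mean of precisions
```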