
Commit 02a0bd7

hw4: readme
1 parent 111e19b commit 02a0bd7


hw4/README.md

Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
# HW4: Phrase-Based Translation
You will need to download the `data.tar.gz` file from the course website and *uncompress* it into the `data` folder inside `hw4` (if you put it elsewhere, change the location in the code). You should then be able to run:

```
python data.py
```
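
If you prefer to script the extraction step, here is a minimal sketch using Python's standard library (the archive and folder names are taken from the description above; adjust the paths to your setup):

```
# Minimal extraction sketch; assumes data.tar.gz sits in the hw4 folder.
import tarfile

with tarfile.open("data.tar.gz", "r:gz") as archive:
    archive.extractall("data")   # uncompress into the data folder inside hw4
```
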
The current assignment description is available [here](http://sameersingh.org/courses/statnlp/wi17/assignments.html#hw4).
## Files
There are quite a few files in this folder:
* `lm.py`: Similar to the assignment in HW2, this code provides an implementation of a trigram language model with Kneser-Ney smoothing, along with the parameters of such a model trained on a very large corpus of English documents. Note that since we are computing $P(f|e)P(e)$, the language model $P(e)$ is over English, so we do not require a language model of French in order to perform the decoding. The language model file, in a format known as ARPA, consists of 1-, 2-, and 3-grams with their log probabilities and backoff scores.
You can load and query the language model using the main function of this file.
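
For example, a minimal query might look like the sketch below (the class and method names here are assumptions, not necessarily the actual interface; check the main function for the real names):

```
# Hypothetical usage sketch; the actual names in lm.py may differ.
from lm import LanguageModel   # assumed entry point

lm = LanguageModel("data/lm.arpa")               # assumed path to the ARPA file
# log P(house | the red) under the smoothed trigram model
print(lm.cond_logprob("house", ("the", "red")))
```
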
* `phrase.py`: Code for the French-to-English phrase table. Each line in the phrase-table file contains a pair of phrases, along with a number of scores for different *features* of the pair. The code reads this file and computes a single score $g_p$ for each pair of phrases. It also provides a handy method to get all the possible phrase translations for a given sentence, i.e. `phrases()` corresponds to `Phrases` in the pseudocode.
You can investigate the translation table as shown in the main function.
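
A possible usage sketch (again, all names below are assumptions; see the main function of `phrase.py` for the actual interface):

```
# Hypothetical sketch; the actual names in phrase.py may differ.
from phrase import PhraseTable   # assumed entry point

table = PhraseTable("data/phrase-table")   # assumed path to the phrase-table file
french = "nous avons une maison".split()
# phrases() should yield every candidate translation of a span of the sentence
for start, end, english, g_p in table.phrases(french):
    print(start, end, english, g_p)
```
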
* `decoder.py`: Implementation of the multiple-stack decoding algorithm.
This implementation attempts to follow the notation of the pseudocode (and Collins' notes) as closely as possible, deviating as needed for an optimized implementation.
The code implements a working monotonic decoder that does not take the language model into account; keep this in mind especially when you are looking at the code for finding compatible phrases (`Compatible`), computing the language model score (`lm_score`), and computing the distortion score (`dist_score`).
Some code that differs from the pseudocode includes precomputing the set of phrases that should be considered for position $r$ (`index_phrases`) and extra fields in the state that make equality comparisons efficient (`key` in `State`). You will need to develop a reasonable understanding of this code, so please post privately or publicly on Piazza if anything is unclear; a generic sketch of the control flow is given below.
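
To make the control flow concrete, here is a generic sketch of multiple-stack decoding. This is *not* the code in `decoder.py`; every name in it is made up for illustration:

```
# Generic multiple-stack decoding sketch (illustrative only, not decoder.py).
# One stack per number of translated source words; states with equal keys
# are recombined, keeping only the highest-scoring one.
def decode(n_words, initial_state, extend, score):
    stacks = [dict() for _ in range(n_words + 1)]
    stacks[0][initial_state.key] = initial_state
    for i in range(n_words):
        for state in stacks[i].values():
            for new_state in extend(state):        # apply every compatible phrase
                stack = stacks[new_state.covered]  # stack indexed by words covered
                old = stack.get(new_state.key)
                if old is None or score(new_state) > score(old):
                    stack[new_state.key] = new_state
    # best complete hypothesis: all source words covered
    return max(stacks[n_words].values(), key=score)
```
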
* `submission.py`: Skeleton code for your submission. It contains three types of decoders, of which only the first, `MonotonicDecoder`, works as intended. You will need to implement the remaining functions in this skeleton.

* `data.py`: This is the code that reads in the files related to the translation model, reads French sentences from `test.fr` and the corresponding English translations from `test.en`, runs the model on the French sentences, and computes the Bleu score of the predictions.
It also contains some simple French words and phrases to translate, just to test your decoder.

* `bleu_score.py`: Code for computing the Bleu score for each translation/prediction pair (this code was ported from NLTK by Zhengli Zhao).
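
If you want to score an individual pair yourself, a possible sketch (the function name and signature are assumptions based on the NLTK origin; check `bleu_score.py` for the actual interface):

```
# Hypothetical sketch; the actual interface in bleu_score.py may differ.
from bleu_score import sentence_bleu   # assumed NLTK-style function

reference = "we have a house".split()
prediction = "we have house".split()
print(sentence_bleu([reference], prediction))   # score between 0 and 1
```
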
