The dataset used to train the system is r/Fakeddit, which contains both image and text data.
ATTENTION: In order to load the original tsv files as dataframes, use `pd.read_csv('filepath', sep='\t')`.
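For example (the filename below is just a placeholder):

```python
import pandas as pd

# The Fakeddit metadata files are tab-separated, so the default comma
# separator would parse each row into a single column.
df = pd.read_csv('multimodal_train.tsv', sep='\t')
print(df.shape)
```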
preprocessing.py:
- `findCorruptImages()`: Because the images were downloaded over multiple sessions, every time a session ended abruptly the image being downloaded would become corrupt; hence the need to find and delete those images (see the first sketch after this list).
- `dropUnusedRows()`: The dataset is huge (roughly 1 million rows, which means roughly 1 million images) and not all of it was used, so this function checks the directory of the images (train, test, etc.) and keeps only the rows of the csv files whose image ids are found in the directory (see the second sketch after this list).
- `removeDatasetBias()`: The part of the dataset that was initially downloaded had a larger number of `False` (not fake) images than `True` (fake) images, so this function removes the bias and makes the number of `0`s equal to the number of `1`s in the `2_way_label` column of the csv (see the third sketch after this list).
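A minimal sketch of the idea behind `findCorruptImages()` (not the exact implementation; the directory layout is assumed):

```python
import os
from PIL import Image

def find_corrupt_images(image_dir):
    """Delete every file in image_dir that Pillow cannot fully decode."""
    for name in os.listdir(image_dir):
        path = os.path.join(image_dir, name)
        try:
            with Image.open(path) as img:
                img.load()  # force a full decode, not just a header read
        except OSError:
            # A download truncated by an aborted session raises OSError here.
            print(f'Deleting corrupt image: {path}')
            os.remove(path)
```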
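A sketch of `dropUnusedRows()`; the `id` column name and the `<id>.jpg` naming scheme are assumptions based on Fakeddit's conventions:

```python
import os
import pandas as pd

def drop_unused_rows(tsv_path, image_dir):
    """Keep only the rows whose image id has a matching file on disk."""
    df = pd.read_csv(tsv_path, sep='\t')
    downloaded = {os.path.splitext(f)[0] for f in os.listdir(image_dir)}
    df = df[df['id'].isin(downloaded)]          # assumed id column
    df.to_csv(tsv_path, sep='\t', index=False)  # overwrite in place
```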
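And a sketch of `removeDatasetBias()`, written here as a random downsampling of the majority class:

```python
import pandas as pd

def remove_dataset_bias(df, seed=0):
    """Balance the 2_way_label column by downsampling the larger class."""
    n = df['2_way_label'].value_counts().min()         # minority-class size
    balanced = df.groupby('2_way_label').sample(n=n, random_state=seed)
    return balanced.sample(frac=1, random_state=seed)  # shuffle the rows
```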
image_downloader.py: This script downloads the images of the dataset; it was taken from the github repo of the paper's authors and modified to search for already existing images and skip them, as well as to skip images when the server is not responding.
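A rough sketch of those two modifications (the rest of the script is the authors' original code; the names here are placeholders):

```python
import os
import requests

def download_image(url, image_id, image_dir, timeout=5):
    path = os.path.join(image_dir, f'{image_id}.jpg')
    if os.path.exists(path):
        return  # already downloaded in a previous session, skip it
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()
    except requests.RequestException:
        return  # server not responding (or HTTP error): skip this image
    with open(path, 'wb') as f:
        f.write(response.content)
```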
resnet.py: The implementation of the ResNet network, taken from this github repo, with the 18-layer version added.
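Assuming the implementation follows the usual `ResNet(block, layers)` interface (torchvision's classes are used below so the snippet runs on its own), adding the 18-layer version boils down to:

```python
from torchvision.models.resnet import ResNet, BasicBlock

def resnet18(num_classes=2):
    # ResNet-18: BasicBlock with two blocks per stage. Deeper variants only
    # change the block type and per-stage counts (e.g. [3, 4, 6, 3] with
    # Bottleneck blocks for ResNet-50).
    return ResNet(BasicBlock, [2, 2, 2, 2], num_classes=num_classes)
```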
dataset.py: A custom class that loads the images and labels into tensors for training the model, written following the pytorch documentation.
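A minimal sketch of such a class, following the custom-dataset pattern from the pytorch documentation (the column names and the image naming scheme are assumptions):

```python
import os
import pandas as pd
import torch
from PIL import Image
from torch.utils.data import Dataset

class FakedditImageDataset(Dataset):
    def __init__(self, tsv_path, image_dir, transform=None):
        self.df = pd.read_csv(tsv_path, sep='\t')
        self.image_dir = image_dir
        self.transform = transform

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        image = Image.open(os.path.join(self.image_dir, f"{row['id']}.jpg"))
        if self.transform:
            image = self.transform(image)
        label = torch.tensor(int(row['2_way_label']), dtype=torch.long)
        return image, label
```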
get_random_subset_of_dataset.py: The downloaded images were still too many and training took about 30 minutes per epoch (on an NVIDIA GTX 1650 graphics card), so the number of images had to be reduced even further; this is where this script comes in (a sketch follows below).
- After running it, `preprocessing.py` needs to be run again in order to remove the dataset bias and create new csv files with only the necessary number of rows.
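A sketch of what the script does (the subset size and directory names are placeholders):

```python
import os
import random
import shutil

def get_random_subset(src_dir, dst_dir, n_images, seed=0):
    """Copy a random sample of the downloaded images into a new directory."""
    random.seed(seed)
    files = os.listdir(src_dir)
    os.makedirs(dst_dir, exist_ok=True)
    for name in random.sample(files, min(n_images, len(files))):
        shutil.copy(os.path.join(src_dir, name), os.path.join(dst_dir, name))
```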
image_classification.py: This is where the training of the ResNet model for image classification happens.
- In the transforms, `Lambda` transforms are used because some images contained either fewer or more than 3 channels after their conversion to tensors, and our ResNet model takes 3-channel inputs (see the first sketch after this list).
- `CrossEntropyLoss()` is used, which is common in binary classification problems, along with `SGD()` optimization and `ReduceLROnPlateau()` with `patience = 1` for the learning rate. The latter means that if the validation loss does not decrease for two consecutive epochs, the learning rate is multiplied by $10^{-1}$ (see the second sketch after this list).
- tqdm is used to show a progress bar while training the network.
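One way to write such a `Lambda` (a sketch of the idea, not necessarily the exact transform in the script): grayscale tensors get their single channel repeated, RGBA tensors lose the extra channel.

```python
from torchvision import transforms

def to_three_channels(x):
    if x.shape[0] == 1:   # grayscale -> repeat the channel three times
        return x.repeat(3, 1, 1)
    if x.shape[0] > 3:    # e.g. RGBA -> keep only the first 3 channels
        return x[:3]
    return x

transform = transforms.Compose([
    transforms.Resize((224, 224)),  # input size is a placeholder
    transforms.ToTensor(),
    transforms.Lambda(to_three_channels),
])
```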
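And a sketch of the training setup; `model`, the data loaders, and all hyperparameter values other than `patience = 1` are placeholders:

```python
import torch
from torch import nn, optim
from torch.optim.lr_scheduler import ReduceLROnPlateau
from tqdm import tqdm

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# factor=0.1 multiplies the learning rate by 10^-1; patience=1 tolerates
# one epoch without improvement before reducing it.
scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=1)

for epoch in range(num_epochs):
    model.train()
    for images, labels in tqdm(train_loader, desc=f'Epoch {epoch}'):
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()

    model.eval()
    val_loss = 0.0
    with torch.no_grad():
        for images, labels in val_loader:
            val_loss += criterion(model(images), labels).item()
    scheduler.step(val_loss)  # drives the ReduceLROnPlateau schedule
```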