Problem Statement: Suggest tags based on the content of a question posted on Stack Overflow.
Data Source : https://www.kaggle.com/c/facebook-recruiting-iii-keyword-extraction/data
Youtube : https://youtu.be/nNDqbUhtIRg
Research paper : https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tagging-1.pdf
Research paper : https://dl.acm.org/citation.cfm?id=2660970&dl=ACM&coll=DL
- Predict as many tags as possible with high precision and recall.
- Incorrect tags could impact customer experience on StackOverflow.
- No strict latency constraints
All of the data is in 2 files: Train and Test.
Train.csv contains 4 columns: Id, Title, Body, Tags. The questions are randomized and contain a mix of verbose text sites as well as sites related to math and programming. The number of questions from each site may vary, and no filtering has been performed on the questions (such as closed questions).
Test.csv contains the same columns but without the Tags, which you are to predict.
Size of Train.csv - 6.75GB
Size of Test.csv - 2GB
Number of rows in Train.csv = 6034195
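Files of this size are awkward to load in one go. The following is a minimal sketch, using only the Python standard library, of streaming the first N records of a CSV into a smaller file. The function name `sample_first_rows` and the in-memory demo rows are hypothetical and are not taken from the project code. Because the questions are described as randomized, taking the first N rows is assumed here to be a reasonable stand-in for a random sample.

```python
import csv
import io

def sample_first_rows(src, dst, n_rows):
    """Copy the header and the first n_rows records from src to dst."""
    # Question bodies can be long, so raise the csv module's per-field limit.
    csv.field_size_limit(10**9)
    reader = csv.reader(src)
    writer = csv.writer(dst)
    writer.writerow(next(reader))  # header row: Id, Title, Body, Tags
    written = 0
    for row in reader:
        if written >= n_rows:
            break
        writer.writerow(row)
        written += 1
    return written

# Tiny in-memory stand-in for Train.csv (hypothetical rows).
demo = io.StringIO(
    'Id,Title,Body,Tags\n'
    '1,"t1","<p>line one\nline two</p>","c++ c"\n'
    '2,"t2","<p>b</p>","java"\n'
    '3,"t3","<p>c</p>","python"\n'
)
out = io.StringIO()
copied = sample_first_rows(demo, out, n_rows=2)
```

With real files, `src` and `dst` would be opened with `open(path, newline="", encoding="utf-8")` so that quoted multi-line Body fields are read and written correctly.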
Title: Implementing Boundary Value Analysis of Software Testing in a C++ program? Body :\n\n#include< iostream>\n #include< stdlib.h>\n\n using namespace std;\n\n int main()\n {\n int n,a[n],x,c,u[n],m[n],e[n][4];\n cout<<"Enter the number of variables";\n cin>>n;\n\n cout<<"Enter the Lower, and Upper Limits of the variables";\n for(int y=1; y<n+1; y++)\n {\n cin>>m[y];\n cin>>u[y];\n }\n for(x=1; x<n+1; x++)\n {\n a[x] = (m[x] + u[x])/2;\n }\n c=(n*4)-4;\n for(int a1=1; a1<n+1; a1++)\n {\n\n e[a1][0] = m[a1];\n e[a1][1] = m[a1]+1;\n e[a1][2] = u[a1]-1;\n e[a1][3] = u[a1];\n }\n for(int i=1; i<n+1; i++)\n {\n for(int l=1; l<=i; l++)\n {\n if(l!=1)\n {\n cout<<a[l]<<"\\t";\n }\n }\n for(int j=0; j<4; j++)\n {\n cout<<e[i][j];\n for(int k=0; k<n-(i+1); k++)\n {\n cout<<a[k]<<"\\t";\n }\n cout<<"\\n";\n }\n } \n\n system("PAUSE");\n return 0; \n }\n
<p>The answer should come in the form of a table like</p>\n\n <pre><code> 1 50 50\n 2 50 50\n 99 50 50\n 100 50 50\n 50 1 50\n 50 2 50\n 50 99 50\n 50 100 50\n 50 50 1\n 50 50 2\n 50 50 99\n 50 50 100\n </code></pre>\n\n <p>if the no of inputs is 3 and their ranges are\n 1,100\n 1,100\n 1,100\n (could be varied too)</p>\n\n <p>The output is not coming,can anyone correct the code or tell me what\'s wrong?</p>\n'
Tags : 'c++ c'
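As the example shows, the Tags field is a space-separated string, so each question carries a set of labels rather than a single class. A minimal sketch of turning such strings into binary indicator vectors over the k most frequent tags is given below. The helper names `top_k_tags` and `to_indicator` and the demo data are hypothetical, and restricting to the top k tags is an assumption about how a multi-label model could be set up, not a description of the project's code.

```python
from collections import Counter

def top_k_tags(tag_strings, k):
    """Return the k most frequent tags across all questions."""
    counts = Counter(tag for s in tag_strings for tag in s.split())
    return [tag for tag, _ in counts.most_common(k)]

def to_indicator(tag_string, vocabulary):
    """Encode one question's tags as a 0/1 vector over the vocabulary."""
    present = set(tag_string.split())
    return [1 if tag in present else 0 for tag in vocabulary]

# Hypothetical Tags column with four questions.
tags_column = ["c++ c", "c# .net", "c++ c", "java"]
vocab = top_k_tags(tags_column, k=2)
matrix = [to_indicator(s, vocab) for s in tags_column]
```

Questions whose tags all fall outside the chosen vocabulary end up with an all-zero row, which is one reason recall can drop when only a subset of tags is modeled.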
- We first downloaded the train and the test data from : https://www.kaggle.com/c/facebook-recruiting-iii-keyword-extraction/data
- As the dataset was too big (around 8 GB), we decided to work with a sample of 100k data points.
- We saved the 100k points as a CSV file and performed exploratory data analysis (EDA) on them.
- The most frequent tag (c#) is used 331505 times.
- Since some tags occur much more frequently than others, micro-averaged F1-score is the appropriate metric for this problem.
- 384 tags are used more than 100 times
- 1 tag is used more than 7000 times
- Maximum number of tags per question: 5
- Minimum number of tags per question: 1
- Avg. number of tags per question: 2.883030
- Most questions have 2 or 3 tags
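Statistics of the kind listed above can be derived from the Tags column with a frequency count. The sketch below uses made-up data and a hypothetical helper `tag_statistics`; it illustrates the computation only and does not reproduce the project's numbers.

```python
from collections import Counter

def tag_statistics(tag_strings, threshold):
    """Summarize tag frequencies and tags-per-question counts."""
    tag_lists = [s.split() for s in tag_strings]
    per_question = [len(tags) for tags in tag_lists]
    counts = Counter(tag for tags in tag_lists for tag in tags)
    return {
        "most_frequent": counts.most_common(1)[0],
        "tags_above_threshold": sum(1 for c in counts.values() if c > threshold),
        "max_tags": max(per_question),
        "min_tags": min(per_question),
        "avg_tags": sum(per_question) / len(per_question),
    }

# Hypothetical Tags column with five questions.
demo_tags = ["c++ c", "c# .net asp.net", "c# linq", "java", "c# wpf"]
stats = tag_statistics(demo_tags, threshold=1)
```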
Here is a comparison of our models.
No of Tags | Model | Precision | Recall | F1-measure | Hyperparameter |
---|---|---|---|---|---|
500 | Logistic Regression | 0.6464 | 0.3602 | 0.4626 | 100 |
500 | Linear SVM | 0.8033 | 0.2022 | 0.3231 | 0.00001 |
500 | Linear SVM | 0.8046 | 0.2027 | 0.3239 | 0.0001 |
5500 | Logistic Regression | 0.7205 | 0.2298 | 0.3485 | 0.00001 |
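For reference, micro-averaging pools true positives, false positives, and false negatives over all (question, tag) pairs before computing precision and recall, and F1 is the harmonic mean of the two. The sketch below shows this with hypothetical tag sets; `micro_prf` and `f1_from_pr` are illustrative names, not functions from the project. The last line recomputes F1 from the precision and recall of the first table row as a consistency check.

```python
def f1_from_pr(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

def micro_prf(true_sets, pred_sets):
    """Micro-averaged precision, recall and F1 over sets of tags."""
    pairs = list(zip(true_sets, pred_sets))
    tp = sum(len(t & p) for t, p in pairs)  # correctly predicted tags
    fp = sum(len(p - t) for t, p in pairs)  # predicted but not true
    fn = sum(len(t - p) for t, p in pairs)  # true but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_from_pr(precision, recall)

# Hypothetical ground truth and predictions for three questions.
true_tags = [{"c++", "c"}, {"c#", ".net"}, {"java"}]
pred_tags = [{"c++"}, {"c#", "linq"}, set()]
precision, recall, f1 = micro_prf(true_tags, pred_tags)

# Consistency check against the first table row (P = 0.6464, R = 0.3602).
row_f1 = f1_from_pr(0.6464, 0.3602)
```

The pattern in the table, high precision with low recall, corresponds to a model that predicts few tags per question but is usually right about the ones it does predict.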