ankanD1601/Stack-Overflow-Tag-predictor

Stack-Overflow-Tag-predictor

Problem Statement: Suggest tags for a question posted on Stack Overflow, based on the content of the question.

Source / useful links

Data Source : https://www.kaggle.com/c/facebook-recruiting-iii-keyword-extraction/data
YouTube : https://youtu.be/nNDqbUhtIRg
Research paper : https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tagging-1.pdf
Research paper : https://dl.acm.org/citation.cfm?id=2660970&dl=ACM&coll=DL

Real World / Business Objectives and Constraints

  1. Predict as many tags as possible with high precision and recall.
  2. Incorrect tags could impact customer experience on StackOverflow.
  3. No strict latency constraints.

Data Overview

Refer: https://www.kaggle.com/c/facebook-recruiting-iii-keyword-extraction/data
All of the data is in 2 files: Train and Test.
Train.csv contains 4 columns: Id, Title, Body, Tags.
Test.csv contains the same columns but without the Tags, which you are to predict.
Size of Train.csv - 6.75GB
Size of Test.csv - 2GB
Number of rows in Train.csv = 6034195
The questions are randomized and come from a mix of verbose text sites as well as sites related to math and programming. The number of questions from each site may vary, and no filtering has been performed on the questions (for example, closed questions are not removed).
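
As a rough illustration of the file layout described above, the sketch below reads a limited number of rows with Python's standard `csv` module. The helper name `read_rows` and the in-memory sample rows are made up for this example; only the four column names come from the dataset description.

```python
import csv
import io
from itertools import islice


def read_rows(handle, limit):
    """Return up to `limit` rows as dicts keyed by the CSV header."""
    # DictReader understands quoted fields, so a Body that spans
    # several lines is still parsed as a single row.
    reader = csv.DictReader(handle)
    return list(islice(reader, limit))


# Tiny in-memory stand-in for Train.csv (made-up rows, same four columns).
sample_csv = io.StringIO(
    'Id,Title,Body,Tags\n'
    '1,"How to loop?","<p>line one\nline two</p>","python loops"\n'
    '2,"Pointer question","<p>text</p>","c++ c"\n'
    '3,"Third","<p>x</p>","java"\n'
)

rows = read_rows(sample_csv, limit=2)
for row in rows:
    # Tags are stored as one space-separated string per question.
    print(row["Id"], row["Title"], row["Tags"].split())
```

For the real file, the same function could be given a handle such as `open("Train.csv", newline="", encoding="utf-8")`, assuming the file is UTF-8 encoded. Reading lazily with `islice` avoids loading the whole 6.75GB file into memory.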

Example Data Point

Title:  Implementing Boundary Value Analysis of Software Testing in a C++ program?
Body : 

        #include<
        iostream>\n
        #include<
        stdlib.h>\n\n
        using namespace std;\n\n
        int main()\n
        {\n
                 int n,a[n],x,c,u[n],m[n],e[n][4];\n         
                 cout<<"Enter the number of variables";\n         cin>>n;\n\n         
                 cout<<"Enter the Lower, and Upper Limits of the variables";\n         
                 for(int y=1; y<n+1; y++)\n         
                 {\n                 
                    cin>>m[y];\n                 
                    cin>>u[y];\n         
                 }\n         
                 for(x=1; x<n+1; x++)\n         
                 {\n                 
                    a[x] = (m[x] + u[x])/2;\n         
                 }\n         
                 c=(n*4)-4;\n         
                 for(int a1=1; a1<n+1; a1++)\n         
                 {\n\n             
                    e[a1][0] = m[a1];\n             
                    e[a1][1] = m[a1]+1;\n             
                    e[a1][2] = u[a1]-1;\n             
                    e[a1][3] = u[a1];\n         
                 }\n         
                 for(int i=1; i<n+1; i++)\n         
                 {\n            
                    for(int l=1; l<=i; l++)\n            
                    {\n                 
                        if(l!=1)\n                 
                        {\n                    
                            cout<<a[l]<<"\\t";\n                 
                        }\n            
                    }\n            
                    for(int j=0; j<4; j++)\n            
                    {\n                
                        cout<<e[i][j];\n                
                        for(int k=0; k<n-(i+1); k++)\n                
                        {\n                    
                            cout<<a[k]<<"\\t";\n               
                        }\n                
                        cout<<"\\n";\n            
                    }\n        
                 }    \n\n        
                 system("PAUSE");\n        
                 return 0;    \n
        }\n
        
\n\n
    <p>The answer should come in the form of a table like</p>\n\n
    <pre><code>       
    1            50              50\n       
    2            50              50\n       
    99           50              50\n       
    100          50              50\n       
    50           1               50\n       
    50           2               50\n       
    50           99              50\n       
    50           100             50\n       
    50           50              1\n       
    50           50              2\n       
    50           50              99\n       
    50           50              100\n
    </code></pre>\n\n
    <p>if the no of inputs is 3 and their ranges are\n
    1,100\n
    1,100\n
    1,100\n
    (could be varied too)</p>\n\n
    <p>The output is not coming,can anyone correct the code or tell me what\'s wrong?</p>\n'

Tags : 'c++ c'

Our Approach

  1. We first downloaded the train and test data from: https://www.kaggle.com/c/facebook-recruiting-iii-keyword-extraction/data
  2. Because the dataset is too big (around 8 GB), we decided to work with 100k data points.
  3. We saved the 100k points as a CSV file and then did the EDA.
  4. The most frequent tag (c#) is used 331505 times.
  5. Since some tags occur much more frequently than others, micro-averaged F1-score is the appropriate metric for this problem.
  6. 384 tags are used more than 100 times.
  7. 1 tag is used more than 7000 times.
  8. Maximum number of tags per question: 5
  9. Minimum number of tags per question: 1
  10. Average number of tags per question: 2.883030
  11. Most questions have 2 or 3 tags.
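
The tag statistics listed above can be computed along the following lines. This is a minimal sketch using only the standard library, not the code from the repository; the function names and the toy tag strings are assumptions made for illustration.

```python
from collections import Counter


def tag_statistics(tag_strings):
    """Return tag counts plus min, max and average tags per question."""
    # Each entry is a space-separated tag string such as 'c++ c'.
    tag_lists = [s.split() for s in tag_strings]
    counts = Counter(tag for tags in tag_lists for tag in tags)
    sizes = [len(tags) for tags in tag_lists]
    return counts, min(sizes), max(sizes), sum(sizes) / len(sizes)


def tags_used_more_than(counts, threshold):
    """Count how many distinct tags occur more than `threshold` times."""
    return sum(1 for c in counts.values() if c > threshold)


# Toy input (made up); the real input would be the Tags column.
toy_tags = ["c++ c", "c# .net", "c# asp.net mvc", "python"]
counts, min_tags, max_tags, avg_tags = tag_statistics(toy_tags)

print(counts.most_common(1))           # most frequent tag and its count
print(tags_used_more_than(counts, 1))  # tags used more than once
print(min_tags, max_tags, avg_tags)
```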

Summary

Here is a comparison of our models.

| No. of Tags | Model | Precision | Recall | F1-measure | Hyperparameter |
|---|---|---|---|---|---|
| 500 | Logistic Regression | 0.6464 | 0.3602 | 0.4626 | 100 |
| 500 | Linear SVM | 0.8033 | 0.2022 | 0.3231 | 0.00001 |
| 500 | Linear SVM | 0.8046 | 0.2027 | 0.3239 | 0.0001 |
| 5500 | Logistic Regression | 0.7205 | 0.2298 | 0.3485 | 0.00001 |
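
For reference, micro-averaged precision, recall and F1 pool the true positives, false positives and false negatives over all questions and tags before computing the ratios, so frequent tags weigh more than rare ones. The sketch below is a plain-Python illustration under that definition, not the evaluation code used for the table; the function name and the toy labels are made up.

```python
def micro_scores(true_tags, predicted_tags):
    """Micro-averaged precision, recall and F1 for multi-label predictions."""
    tp = fp = fn = 0
    for truth, pred in zip(true_tags, predicted_tags):
        truth, pred = set(truth), set(pred)
        tp += len(truth & pred)   # tags predicted and correct
        fp += len(pred - truth)   # tags predicted but wrong
        fn += len(truth - pred)   # correct tags that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example (made up): three questions with their true and predicted tags.
y_true = [{"c++", "c"}, {"python"}, {"c#", ".net"}]
y_pred = [{"c++"}, {"python", "django"}, {"c#"}]

precision, recall, f1 = micro_scores(y_true, y_pred)
print(precision, recall, f1)
```

F1 is the harmonic mean of precision and recall, so each F1 value in the table follows from the other two columns: for the first row, 2 × 0.6464 × 0.3602 / (0.6464 + 0.3602) ≈ 0.4626.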
