Stack-Overflow-Tag-predictor

Problem Statement

Suggest tags based on the content of a question posted on Stack Overflow.

Source / useful links

Data Source : https://www.kaggle.com/c/facebook-recruiting-iii-keyword-extraction/data
Youtube : https://youtu.be/nNDqbUhtIRg
Research paper : https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tagging-1.pdf
Research paper : https://dl.acm.org/citation.cfm?id=2660970&dl=ACM&coll=DL

Real World / Business Objectives and Constraints

Predict as many tags as possible with high precision and recall.
Incorrect tags could impact customer experience on Stack Overflow.
No strict latency constraints.

Data Overview

Refer: https://www.kaggle.com/c/facebook-recruiting-iii-keyword-extraction/data
All of the data is in 2 files: Train and Test.

Train.csv contains 4 columns: Id, Title, Body, Tags.

Test.csv contains the same columns but without the Tags, which you are to predict.

Size of Train.csv - 6.75GB

Size of Test.csv - 2GB

Number of rows in Train.csv = 6034195

The questions are randomized and contain a mix of verbose text sites as well as sites related to math and programming. The number of questions from each site may vary, and no filtering has been performed on the questions (for example, closed questions are still included).
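
As a minimal sketch of how rows in this layout could be read, the snippet below parses a tiny in-memory sample with Python's standard csv module. It assumes that the Tags column holds space-separated tag names, as in the example data point in this README. The helper name iter_questions and the sample rows are illustrative and not taken from the original notebooks. For the real multi-gigabyte Train.csv, the same generator could be given an open file handle so that rows are streamed instead of loaded all at once.

```python
import csv
import io
from itertools import islice

# Tiny stand-in for Train.csv (hypothetical rows, same four columns).
SAMPLE_CSV = '''Id,Title,Body,Tags
1,"How to sort a list?","<p>I have a list, and need it sorted.</p>","python list sorting"
2,"Segfault in loop","<p>My loop crashes.</p>
<pre><code>int a[3];</code></pre>","c++ c"
'''


def iter_questions(handle, limit=None):
    """Yield one dict per question, with Tags split into a list of tag names."""
    reader = csv.DictReader(handle)
    # islice stops after `limit` rows; limit=None reads every row.
    for row in islice(reader, limit):
        row["Tags"] = row["Tags"].split()
        yield row


questions = list(iter_questions(io.StringIO(SAMPLE_CSV)))
for q in questions:
    print(q["Id"], q["Title"], q["Tags"])
```

Passing a limit (for example limit=100000) is one simple way to take a fixed number of leading rows; whether the 100k points used in this project were taken that way is not stated here.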

Example Data Point

Title:  Implementing Boundary Value Analysis of Software Testing in a C++ program?
Body : 
        #include<
        iostream>\n
        #include<
        stdlib.h>\n\n
        using namespace std;\n\n
        int main()\n
        {\n
                 int n,a[n],x,c,u[n],m[n],e[n][4];\n         
                 cout<<"Enter the number of variables";\n         cin>>n;\n\n         
                 cout<<"Enter the Lower, and Upper Limits of the variables";\n         
                 for(int y=1; y<n+1; y++)\n         
                 {\n                 
                    cin>>m[y];\n                 
                    cin>>u[y];\n         
                 }\n         
                 for(x=1; x<n+1; x++)\n         
                 {\n                 
                    a[x] = (m[x] + u[x])/2;\n         
                 }\n         
                 c=(n*4)-4;\n         
                 for(int a1=1; a1<n+1; a1++)\n         
                 {\n\n             
                    e[a1][0] = m[a1];\n             
                    e[a1][1] = m[a1]+1;\n             
                    e[a1][2] = u[a1]-1;\n             
                    e[a1][3] = u[a1];\n         
                 }\n         
                 for(int i=1; i<n+1; i++)\n         
                 {\n            
                    for(int l=1; l<=i; l++)\n            
                    {\n                 
                        if(l!=1)\n                 
                        {\n                    
                            cout<<a[l]<<"\\t";\n                 
                        }\n            
                    }\n            
                    for(int j=0; j<4; j++)\n            
                    {\n                
                        cout<<e[i][j];\n                
                        for(int k=0; k<n-(i+1); k++)\n                
                        {\n                    
                            cout<<a[k]<<"\\t";\n               
                        }\n                
                        cout<<"\\n";\n            
                    }\n        
                 }    \n\n        
                 system("PAUSE");\n        
                 return 0;    \n
        }\n
        \n\n
    <p>The answer should come in the form of a table like</p>\n\n
    <pre><code>       
    1            50              50\n       
    2            50              50\n       
    99           50              50\n       
    100          50              50\n       
    50           1               50\n       
    50           2               50\n       
    50           99              50\n       
    50           100             50\n       
    50           50              1\n       
    50           50              2\n       
    50           50              99\n       
    50           50              100\n
    </code></pre>\n\n
    <p>if the no of inputs is 3 and their ranges are\n
    1,100\n
    1,100\n
    1,100\n
    (could be varied too)</p>\n\n
    <p>The output is not coming,can anyone correct the code or tell me what\'s wrong?</p>\n'

Tags : 'c++ c'

Our Approach

We first downloaded the train and test data from: https://www.kaggle.com/c/facebook-recruiting-iii-keyword-extraction/data
As the dataset was too big (around 8 GB), we decided to work with 100k data points.
We saved the 100k points as a CSV file and performed EDA on them.
The most frequent tag (i.e. c#) is used 331505 times.
Since some tags occur much more frequently than others, micro-averaged F1-score is the appropriate metric for this problem.
384 tags are used more than 100 times.
1 tag is used more than 7000 times.
Maximum number of tags per question: 5
Minimum number of tags per question: 1
Avg. number of tags per question: 2.883030
Most questions have 2 or 3 tags.
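
The tag statistics listed above can be computed with a short counting pass. The following is a sketch under stated assumptions: the tag lists are made-up examples rather than rows from the dataset, and the function name tag_statistics and its min_count parameter are hypothetical names chosen for illustration.

```python
from collections import Counter

# Hypothetical tag lists for five questions (not taken from the dataset).
tag_lists = [
    ["c#", ".net"],
    ["c#", "asp.net", "mvc"],
    ["java"],
    ["c#", "java", "android", "eclipse", "xml"],
    ["python", "c#"],
]


def tag_statistics(tag_lists, min_count):
    """Summarize tag usage: frequency of each tag and tags per question."""
    tag_counts = Counter(tag for tags in tag_lists for tag in tags)
    tags_per_question = [len(tags) for tags in tag_lists]
    return {
        # (tag, count) pair for the single most frequent tag
        "most_frequent": tag_counts.most_common(1)[0],
        # number of distinct tags used strictly more than min_count times
        "tags_used_more_than_min_count": sum(
            1 for count in tag_counts.values() if count > min_count
        ),
        "max_tags": max(tags_per_question),
        "min_tags": min(tags_per_question),
        "avg_tags": sum(tags_per_question) / len(tags_per_question),
    }


stats = tag_statistics(tag_lists, min_count=1)
for name, value in stats.items():
    print(name, value)
```

On real data, min_count would be set to thresholds such as 100 or 7000 to reproduce counts of the kind reported in this section.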

Summary

Here is a comparison of our models.

No of Tags	Model	Precision	Recall	F1-measure	Hyperparameter
500	Logistic Regression	0.6464	0.3602	0.4626	100
500	Linear SVM	0.8033	0.2022	0.3231	0.00001
500	Linear SVM	0.8046	0.2027	0.3239	0.0001
5500	Logistic Regression	0.7205	0.2298	0.3485	0.00001
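
To make the metric concrete, here is a minimal pure-Python sketch of micro-averaged precision, recall, and F1 for multi-label tag prediction. Micro-averaging pools true positives, false positives, and false negatives over all questions and tags before computing the scores. The function names and the toy predictions are assumptions for illustration; the original notebooks may well have used a library routine instead.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def micro_precision_recall_f1(true_tags, predicted_tags):
    """Pool TP/FP/FN over every question, then compute the three scores."""
    tp = fp = fn = 0
    for truth, pred in zip(true_tags, predicted_tags):
        truth, pred = set(truth), set(pred)
        tp += len(truth & pred)   # tags predicted and correct
        fp += len(pred - truth)   # tags predicted but wrong
        fn += len(truth - pred)   # true tags that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score(precision, recall)


# Toy example (hypothetical tags, not results from this project).
true_tags = [["c++", "c"], ["python"], ["java", "android"]]
predicted_tags = [["c++"], ["python", "list"], ["java"]]

precision, recall, f1 = micro_precision_recall_f1(true_tags, predicted_tags)
print(round(precision, 4), round(recall, 4), round(f1, 4))

# Recomputing F1 from the first table row's precision and recall.
print(round(f1_score(0.6464, 0.3602), 4))
```

The F1-measure column in the table agrees with the harmonic mean of the Precision and Recall columns up to rounding of the reported values.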
