An analysis of the UCI Online News Popularity dataset using data mining and machine learning techniques. The project explores which article characteristics influence how widely online articles are shared.
- Content from articles published by Mashable
- Acquisition date: January 8, 2015
- The dataset contains statistics derived from the articles, not the original article content
- The performance values reported with the dataset were estimated using a Random Forest classifier with a rolling-window assessment
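
As a rough illustration of what a rolling-window assessment looks like, the sketch below generates train/test index pairs over time-ordered rows. It is written in Python purely for illustration (the project itself works in R environments), and the function name `rolling_window_splits` and the window sizes are hypothetical, not the settings used to produce the reported performance values.

```python
from typing import Iterator, List, Tuple


def rolling_window_splits(
    n_samples: int, train_size: int, test_size: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) pairs over time-ordered rows.

    Each window trains on `train_size` consecutive rows and tests on the
    `test_size` rows that follow. The window then slides forward by
    `test_size` rows, so every test block is evaluated exactly once.
    """
    if train_size <= 0 or test_size <= 0:
        raise ValueError("train_size and test_size must be positive")
    start = 0
    while start + train_size + test_size <= n_samples:
        train_end = start + train_size
        train_idx = list(range(start, train_end))
        test_idx = list(range(train_end, train_end + test_size))
        yield train_idx, test_idx
        start += test_size


# Toy sizes chosen only to keep the output readable (assumption).
splits = list(rolling_window_splits(n_samples=10, train_size=4, test_size=2))
for train_idx, test_idx in splits:
    print(train_idx, "->", test_idx)
```

A model would be fitted on each training window and scored on the block that follows it, so later articles are never used to predict earlier ones.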
The dataset contains 61 attributes:
- 58 predictive features
- 2 non-predictive fields
- 1 target variable
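
The 58 / 2 / 1 breakdown can be sketched as a small helper that sorts column names into roles. This is an illustrative Python sketch, not the project's R code. It assumes the two non-predictive fields are named `url` and `timedelta`, that the target column is named `shares`, and that header names may carry stray leading spaces; none of these details are stated above, and `split_columns` is a hypothetical name.

```python
from typing import Dict, List

# Assumed column names (not stated in this README).
NON_PREDICTIVE = ("url", "timedelta")
TARGET = "shares"


def split_columns(columns: List[str]) -> Dict[str, List[str]]:
    """Group column names into non-predictive, predictive, and target roles."""
    # Strip whitespace in case the header carries leading spaces.
    cleaned = [name.strip() for name in columns]
    return {
        "non_predictive": [c for c in cleaned if c in NON_PREDICTIVE],
        "target": [c for c in cleaned if c == TARGET],
        "predictive": [
            c for c in cleaned if c not in NON_PREDICTIVE and c != TARGET
        ],
    }


# Toy header with only a few columns, for illustration.
header = ["url", " timedelta", " n_tokens_title", " num_imgs", " shares"]
roles = split_columns(header)
print(roles)
```

Run against the full header, the three groups should contain 2, 58, and 1 names respectively if the assumptions hold, which makes the helper a cheap sanity check before modeling.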

Feature categories:
- Article Metadata: URL, publication timing
- Content Statistics: Word counts, uniqueness metrics
- Media Elements: Links, images, videos
- Channel Categories: Lifestyle, Tech, Business, etc.
- Keyword Performance: Minimum, maximum, and average shares of an article's keywords
- Temporal Features: Day of week indicators
- Semantic Analysis: LDA topic proximity
- Sentiment Analysis: Polarity, subjectivity metrics
- Target Variable: Number of article shares

Challenges:
- Scale Management: Handling a large-scale dataset
- Data Quality: Identifying and removing outliers
- Computational Constraints: Resource limitations for complex models
- Performance Optimization: Balancing accuracy with processing time
- Business Perspective: Translating technical insights into business value
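
The outlier-removal step mentioned under Data Quality could look something like the following. This is a minimal Python sketch using the common interquartile-range (IQR) rule. The README does not say which outlier method the project used, so the rule, the multiplier `k = 1.5`, the function name `remove_outliers_iqr`, and the sample share counts are all assumptions made for illustration.

```python
import statistics
from typing import List


def remove_outliers_iqr(values: List[float], k: float = 1.5) -> List[float]:
    """Keep only values within [Q1 - k*IQR, Q3 + k*IQR]."""
    # quantiles(n=4) returns the three quartile cut points.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lower <= v <= upper]


# Made-up share counts with one extreme value.
shares = [593, 711, 1200, 1400, 1500, 2100, 3200, 843300]
cleaned = remove_outliers_iqr(shares)
print(cleaned)
```

Share counts tend to be heavily right-skewed, so a rule like this can discard genuinely viral articles; whether that is acceptable depends on the business question being asked.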

Technical Skills:
- Data preparation and modeling best practices
- Efficient implementation of machine learning algorithms
- Working with various R environments (RStudio, Jupyter, Colab)

Professional Development:
- Academic reporting in ACM format
- Research methodology for data science projects
- Leveraging data science communities (Kaggle, KDnuggets)

Business Application:
- Viewing data through a business-impact lens
- Narrative construction from analytical findings
- Balancing the roles of business analyst and data scientist

R environments:
- RStudio
- Jupyter with R kernel
- Google Colab with R kernel