Web Traffic Time Series Forecasting

piyush kirtivardhan
5 min read · Oct 23, 2020

Wikipedia is one of the largest and most trusted online platforms for people around the world to learn and share knowledge on virtually any topic.

Almost everyone in the world uses Wikipedia in one way or another. Each month, Wikipedia pages receive over 18 billion visits, making it one of the most visited websites in the world according to Alexa.com. It features knowledge on a wide range of topics and grows at a rate of over 1.9 edits per second, performed by editors from all over the world. Currently, the English Wikipedia contains 6,150,827 articles and averages 601 new articles per day. This amount of data can be analyzed in many ways.

Table of Contents:

  1. ML problem statement
  2. Dataset Introduction
  3. EDA
  4. Models
  5. Summary & Future work
  6. Learnings
  7. References

Problem Statement

Predict the future visits for multiple time series of Wikipedia web pages, and analyze the traffic patterns to understand page visits more closely.

Source: https://www.kaggle.com/c/web-traffic-time-series-forecasting/overview

Dataset Introduction

The original dataset has about 145k time series, each recording the daily number of views of a different Wikipedia page. The dataset looks like:

Data Field Explanation

The dataset contains a RangeIndex of 145,063 entries (0 to 145062) and 804 columns (Page, then one column per date through 2017-09-10), with dtypes float64(803) and object(1), and a memory footprint of 889.8+ MB. The columns in the table are:

Page: unique identifier for each web page

Date columns: daily view counts from 2015-07-01 to 2017-09-10
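The layout described above can be sketched with a toy pandas frame (the two page names are just illustrative examples of the naming scheme, not necessarily rows from the file):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the competition's wide format: one row per page and one
# float64 column per date; the real data has 145,063 rows and 803 date columns.
train = pd.DataFrame({
    "Page": ["2NE1_zh.wikipedia.org_all-access_spider",
             "AKB48_en.wikipedia.org_desktop_all-agents"],
    "2015-07-01": [18.0, np.nan],   # NaN: no recorded views for that day
    "2015-07-02": [11.0, 25.0],
})

print(train.shape)                   # (pages, 1 + number of dates)
print(train.dtypes.value_counts())   # object for Page, float64 for date columns
```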

EDA

EDA stands for exploratory data analysis. It is used to understand and explore the data, remove any extraneous data through pre-processing techniques, and identify which features will be important for building ML models.

By expanding each page name, we find the following:
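The expansion relies on splitting the page name, which in this dataset has the form article_project_access_agent. A minimal parser:

```python
# Page names look like "<article>_<project>_<access>_<agent>",
# e.g. "2NE1_zh.wikipedia.org_all-access_spider".
def parse_page(page: str) -> dict:
    # Split from the right, because article titles may contain underscores.
    article, project, access, agent = page.rsplit("_", 3)
    lang = project.split(".")[0]  # "zh" from "zh.wikipedia.org"
    return {"article": article, "lang": lang, "access": access, "agent": agent}

print(parse_page("2NE1_zh.wikipedia.org_all-access_spider"))
```

Note that for non-Wikipedia projects in the data (e.g. commons.wikimedia.org), the "lang" field will hold the subdomain ("commons") rather than a language code.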

Desktop access generates more visits than mobile or all-access, but more pages are covered by all-access than by any other access type.

Looking at day-wise web traffic, there is no big difference in the number of visits from day to day, though during the 2016 Olympics (July to August) we see a noticeably higher number of views.

The year 2016 has a higher average number of views than the other two years, 2015 and 2017.

Weekends get slightly more views than weekdays, but the difference is small: about 70 percent of views fall on the 5 weekdays, which is roughly 14 percent per day, while about 30 percent fall on the 2 weekend days, roughly 15 percent per day. So the per-day views are almost the same on weekdays and weekends.
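The per-day arithmetic behind that comparison can be checked directly (the 70/30 split is the approximate figure from the EDA above):

```python
# Approximate shares of weekly views from the EDA.
weekday_share, weekend_share = 0.70, 0.30

per_weekday_day = weekday_share / 5     # 5 weekdays  -> ~14% of views per day
per_weekend_day = weekend_share / 2     # 2 weekend days -> ~15% per day

print(per_weekday_day, per_weekend_day)
```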

Models:

Transformation of categorical values

SGDRegressor Model

DecisionTreeRegressor

RandomForestRegressor

AdaBoostRegressor

Dense Model

Conv1D-LSTM model

Models with their Kaggle scores:

For the SGDRegressor, DecisionTreeRegressor, RandomForestRegressor and AdaBoostRegressor, we used only 75 days of data for training out of the 803 days available (due to RAM limitations).
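A rough sketch of that setup, using synthetic data in place of the real series (the window slicing and next-day target are assumptions, since the post doesn't spell out the exact framing):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: 100 series of 803 daily view counts.
views = rng.poisson(50, size=(100, 803)).astype(float)

window = views[:, -75:]                 # keep only the last 75 days (RAM-friendly)
X, y = window[:, :-1], window[:, -1]    # previous days -> final day's views

X = StandardScaler().fit_transform(X)
model = SGDRegressor(max_iter=1000, random_state=0).fit(X, y)
print(model.predict(X[:2]))             # predicted views for the first two pages
```

The same X/y pair can be fed to the tree-based regressors without the scaling step.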

Here we will talk more about the Dense model:
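The post doesn't show the Dense model's exact architecture, so as a rough stand-in here is a small fully connected regressor using scikit-learn's MLPRegressor on synthetic data (the layer sizes are guesses, not the author's; a Keras stack of Dense layers would be analogous):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 74))                       # 74 lag features per page
y = X.sum(axis=1) + rng.normal(scale=0.1, size=200)  # synthetic target

# Two fully connected hidden layers, trained to map lag features to views.
dense = MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=300,
                     random_state=0).fit(X, y)
print(dense.predict(X[:3]))
```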

Summary & Future Work

⦁ From the table above we can see that the Dense model got the highest Kaggle score, so we can use the trained Dense model for prediction on unseen data.

⦁ EDA: We analyzed the dataset using various visualization techniques such as bar plots and line graphs. We found many missing/NaN values; many pages were only added to the server near the end of 2016, so their earlier values are NaN.

⦁ Feature Engineering: We removed all missing and NaN values and extracted new features from existing ones, such as month, year, day, weekend, access type, language, and agent.

⦁ Data preparation: We used only 75 days of data out of 803, because of RAM limitations.

⦁ Data preprocessing: We transformed all numerical features using standardization. For categorical features we used label encoding.
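Those two transforms can be sketched like this (toy values in place of the real features):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler

views = np.array([[10.0], [20.0], [30.0]])        # a toy numerical feature
access = ["desktop", "mobile-web", "all-access"]  # a toy categorical feature

scaled = StandardScaler().fit_transform(views)  # standardize: zero mean, unit variance
codes = LabelEncoder().fit_transform(access)    # label encode: one integer per category

print(scaled.ravel(), codes)
```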

⦁ Models applied: Before applying any ML algorithm we did hyperparameter tuning to find the best parameters, then applied various machine learning algorithms: SGDRegressor, DecisionTreeRegressor, RandomForestRegressor, and AdaBoostRegressor. We also tried deep learning models: a Dense model and a Conv1D-LSTM model.

⦁ Future work: Because we could only use a small fraction of the data due to RAM limitations, we did not get the desired results; training on the whole dataset should give better results. We can also engineer more features and do more extensive hyperparameter tuning.

Learnings

My primary takeaways from this competition:

⦁ Spend more time on Kaggle discussions, and participate frequently. It builds a loop: work, discuss; discuss, work.

⦁ I need to do more EDA and find more features such as language, day of month, day of year, day of week, holidays, and long weekends.

⦁ I need to focus more on modeling and tuning in the future.

⦁ Don’t be afraid of time-series problems. At the end of the day, it’s only a regression problem!

References

⦁ How to Develop LSTM Models for Time Series Forecasting (https://machinelearningmastery.com/how-to-develop-lstm-models-for-time-series-forecasting/)

⦁ Deep Learning Models for Univariate Time Series Forecasting (https://machinelearningmastery.com/how-to-develop-deep-learning-models-for-univariate-time-series-forecasting/)

⦁ Web Traffic Forecasting (https://towardsdatascience.com/web-traffic-forecasting-f6152ca240cb)

⦁ Top submissions of the competition:

https://github.com/sjvasquez/web-traffic-forecasting

https://www.kaggle.com/arjunsurendran/using-lstm-on-training-data

https://github.com/Arturus/kaggle-web-traffic

⦁ Applied Machine Learning course (https://www.appliedaicourse.com/course/11/Applied-Machine-learning-course)

Link to GitHub: https://github.com/pkirti/Web-Traffic-Time-Series-Forecasting

LinkedIn profile: https://www.linkedin.com/in/piyush-kirtivardhan-5085821b0/

Thanks for your precious time
