PORTFOLIO

Data Pipeline, Python

This web scraper project collects data from the Ofsted website on primary and secondary public schools. The idea came from combining data mining and data pipelines with education: the goal is to automate data collection (school info, rating, snapshot) and updates, storage, and remote monitoring. The pipeline uses several tools and engines to collect, store, process, and display both the data and the system metrics.

Technologies Used:

  • Python 3.9.7
  • Chromedriver: latest
  • Chrome: latest
  • AWS S3
  • AWS RDS and pgAdmin 4 / PostgreSQL
  • AWS EC2 instance
  • Docker and Dockerd set up on the EC2 instance
  • Prometheus
  • Node-Exporter
  • Grafana Dashboard

Football Match Outcome Prediction, Python

The Football Match Outcome Prediction project processes a large number of files containing information about football matches played since 1990. The data is cleaned so that it can be fed to a model. Different models are then trained on the dataset, and the best-performing model is selected. Finally, the hyperparameters of this model are tuned to improve its performance.

Technologies Used:

  • Pandas
  • Seaborn
  • Selenium
  • Webdriver / Chrome
  • scikit-learn

Trained Models:

  • Logistic Regression
  • Random Forest
  • Decision Tree
  • SVM
  • AdaBoost on Decision Tree
  • AdaBoost on Logistic Regression
  • Gradient Boost
  • MLP
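
The select-then-tune workflow described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the synthetic dataset from make_classification stands in for the cleaned match data, the three-class target (for example home win / draw / away win) is an assumption, only three of the listed models are included, and the parameter grids are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cleaned match dataset (assumption: three
# outcome classes such as home win / draw / away win).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)

# Hypothetical candidate set: a subset of the models listed above.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Step 1: compare models by mean 5-fold cross-validated accuracy.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}
best_name = max(scores, key=scores.get)

# Step 2: tune the selected model; the grids are illustrative only.
param_grids = {
    "logistic_regression": {"C": [0.1, 1.0, 10.0]},
    "decision_tree": {"max_depth": [3, 5, None]},
    "random_forest": {"max_depth": [3, 5, None]},
}
search = GridSearchCV(
    candidates[best_name], param_grids[best_name], cv=5, scoring="accuracy"
)
search.fit(X, y)

print("best model:", best_name, "cv accuracy:", round(scores[best_name], 3))
print("tuned params:", search.best_params_, "cv accuracy:", round(search.best_score_, 3))
```

Each grid includes the model's default setting, so the tuned cross-validated score should not fall below the untuned one on the same folds.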

Human Activity Recognition Project (ML) using R

The goal of the project is to predict the manner in which the participants performed the exercise. The data is split into two sets, one for training and one for testing. After an initial data overview and preparation, the machine learning algorithms are applied to reach a satisfactory level of performance, first on the training set and then on the test set.

Trained Models:

  • Decision Tree (rpart)
  • Random Forest (rf)
  • Stochastic Gradient Boosting Models

Confusion matrices and accuracy levels are used for evaluation.


A/B Testing, conversion rate test with Z-stat

The goal of the project: to determine whether a change to the web page increases the outcome of interest.

Task: suppose you are working for an e-commerce company and the marketing team is trying to decide if they should launch a new webpage.

They ran an A/B test and need help analyzing the results.

They provided you with this dataset, which contains the following fields:

  • user_id: the user_id of the person visiting the website
  • timestamp: the time at which the user visited the website
  • group: treatment vs control; treatment saw the new landing page, control saw the old landing page
  • landing_page: new vs old landing page, labeled 'new_page'/'old_page'
  • converted: 0/1 flag denoting whether the user visiting the page ended up converting

Given this information, you're asked to come up with a recommendation for the marketing team -- should the marketing team adopt the new landing page?

The team wants the landing page with the highest conversion rate.

Project summary:

data preparation: remove NA values, align values

define statistics: sample size, alpha value

visualisation: sns.kdeplot to show the probability distribution of the two sample sets

apply the z-statistic to evaluate the p-value

RESULT: the calculated p-value is above the alpha value, so the difference in conversion rate is not statistically significant; it is therefore recommended not to proceed with the new_page launch.
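
The z-stat step can be illustrated with a two-proportion z-test. The sketch below is one common formulation and not necessarily the exact calculation used in the project: the function name two_proportion_z_test is hypothetical, the two-sided test and alpha = 0.05 are assumptions, and the counts are made up for illustration rather than taken from the dataset.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for H0: both groups convert at the same rate."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal: P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up counts for illustration only; group a = control (old_page),
# group b = treatment (new_page).
alpha = 0.05
z, p_value = two_proportion_z_test(conv_a=120, n_a=1000, conv_b=130, n_b=1000)

# Adopt the new page only if it converts better and the difference is significant.
decision = "adopt new_page" if (p_value < alpha and z > 0) else "keep old_page"
print(f"z = {z:.3f}, p = {p_value:.3f}, decision: {decision}")
```

With these illustrative counts the p-value is well above alpha, which mirrors the kind of outcome reported in the result above: the observed difference is not enough to justify switching pages.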