I consider myself a data science enthusiast who is always eager to learn new topics and implement new ideas. I also write about data science, machine learning, deep learning and statistics. Below is a list of the projects I have done and some of the stories from my blog on Medium.
Projects
Image Classification with Deep Learning
Motivation: Computer vision is a highly important field in data science, with applications ranging from self-driving cars to cancer diagnosis. Convolutional neural networks (CNNs) are commonly used for computer vision and image classification tasks. I implemented a CNN with Keras to perform a binary classification task and explained each step of a convolutional neural network along with the theory behind it.
Data: The images are taken from the Caltech 101 dataset.
Achievements:
- Preprocessed the images using Keras's ImageDataGenerator, which generates batches of tensor image data with real-time data augmentation by applying random transformations (such as rotations and shifts) to each batch. Data augmentation increases the diversity of the dataset, which helps the model achieve more accurate results and prevents overfitting.
- Built a CNN model with convolution and pooling layers as well as a flattening layer and a dense layer.
- The model achieved 99% accuracy on the training set and 98.1% accuracy on the test set.
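The pipeline above can be sketched as follows. This is a minimal, illustrative version, not the exact original model: the 64x64 input size, layer widths, and augmentation ranges are assumptions, and the dummy batch stands in for images loaded from Caltech 101.

```python
# Minimal sketch of the augmentation + CNN workflow described above.
# Assumes 64x64 RGB inputs and binary labels; sizes are illustrative.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Real-time augmentation: random rotations and shifts applied per batch.
datagen = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=20,
    width_shift_range=0.1,
    height_shift_range=0.1,
)

model = Sequential([
    Conv2D(32, (3, 3), activation="relu", input_shape=(64, 64, 3)),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation="relu"),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(64, activation="relu"),
    Dense(1, activation="sigmoid"),  # binary classification output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Dummy batch to show the augmented-batch workflow; real training would
# read the Caltech 101 images, e.g. via flow_from_directory.
x = np.random.rand(8, 64, 64, 3).astype("float32")
y = np.random.randint(0, 2, size=(8,))
batch_x, batch_y = next(datagen.flow(x, y, batch_size=8))
model.fit(batch_x, batch_y, epochs=1, verbose=0)
```

In practice the generator is wrapped around the training directory so every epoch sees slightly different versions of the same images.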
Cryptocurrency Prediction with Deep Learning
Motivation: Although the first decentralized cryptocurrency (Bitcoin) was created in 2009, the idea of digital money arose in the 1980s. In recent years, cryptocurrencies have gained tremendous popularity. Like traditional currencies, the value of a cryptocurrency changes over time. Using historical data, I implemented a recurrent neural network with LSTM (long short-term memory) layers to predict the future trend of a cryptocurrency's value.
Data: There is a huge dataset of cryptocurrency market prices on Kaggle. I used only a part of it: the historical price data of Litecoin.
Achievements:
- Preprocessed the data and converted it to a format that can be used as input to an LSTM layer
- Built a deep learning model with Keras that includes LSTM layers and a dense layer
- The model reduced the loss below 0.002 on the training set and predicted the trend on the test set.
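The preprocessing and model steps above can be sketched like this. The 90-step window matches the timesteps mentioned below; the layer sizes and the synthetic series standing in for scaled Litecoin prices are illustrative assumptions.

```python
# Sketch of the sliding-window preprocessing and LSTM model described above.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

def make_windows(series, timesteps=90):
    """Turn a 1-D price series into (samples, timesteps, 1) LSTM inputs
    with the next value as the target."""
    X, y = [], []
    for i in range(len(series) - timesteps):
        X.append(series[i : i + timesteps])
        y.append(series[i + timesteps])
    X = np.array(X)[..., np.newaxis]  # add the single-feature dimension
    return X, np.array(y)

prices = np.sin(np.linspace(0, 20, 300))  # stand-in for scaled Litecoin prices
X, y = make_windows(prices, timesteps=90)

model = Sequential([
    LSTM(32, return_sequences=True, input_shape=(90, 1)),
    LSTM(32),
    Dense(1),  # regression output: the next price
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=1, verbose=0)
```

Each training sample is a window of 90 consecutive prices, and the target is the price immediately after the window, which is what lets the network learn the trend.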
How to Improve: We can build a more robust and accurate model by collecting more data. We can also adjust the number of nodes in a layer, add more LSTM layers, or increase the number of timesteps, which was 90 in our model.
Churn Prediction
Motivation: Churn prediction is a common use case in machine learning. It is critical for a business to understand why and when customers are likely to churn (i.e. leave the company). A robust and accurate churn prediction model helps businesses take action to prevent customers from leaving.
Data: I used the telco customer churn dataset available on Kaggle. The dataset includes 20 features (independent variables) and 1 target (dependent) variable for 7043 customers.
Achievements:
- With an extensive exploratory data analysis, I examined the characteristics of the features as well as the relationships among them, then eliminated the redundant features.
- Encoded categorical features so that they can be used as input to a machine learning model, and normalized the numerical values.
- Implemented two models:
- Ridge classifier: achieved 76.1% accuracy on the test set
- Random forests: initially achieved 84.2% accuracy, which I increased to 90% with hyperparameter tuning.
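The preprocessing and modeling steps above can be sketched as below. The tiny synthetic frame and its column names stand in for the Telco churn data and are hypothetical; the point is the encode-categoricals, normalize-numericals, fit-two-models flow.

```python
# Hedged sketch of the churn pipeline: encode categoricals, normalize
# numericals, then fit a ridge classifier and a random forest.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import RidgeClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

rng = np.random.default_rng(0)
df = pd.DataFrame({                                         # hypothetical columns
    "contract": rng.choice(["month-to-month", "one-year"], 200),
    "tenure": rng.integers(0, 72, 200),
    "monthly_charges": rng.uniform(20, 120, 200),
    "churn": rng.integers(0, 2, 200),                       # target variable
})
X, y = df.drop(columns="churn"), df["churn"]

prep = ColumnTransformer([
    ("cat", OneHotEncoder(), ["contract"]),                 # encode categoricals
    ("num", MinMaxScaler(), ["tenure", "monthly_charges"]), # normalize numericals
])

ridge = Pipeline([("prep", prep), ("clf", RidgeClassifier())]).fit(X, y)
forest = Pipeline([("prep", prep),
                   ("clf", RandomForestClassifier(random_state=0))]).fit(X, y)
```

Wrapping the preprocessing in a Pipeline keeps the encoding and scaling fitted only on training data, which avoids leakage when evaluating on the test set.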
How to Improve: The fuel of machine learning models is data, so collecting more data is always helpful for improving the model. We can also try a wider range of parameters in GridSearchCV, because a small adjustment to a parameter may slightly improve the model's performance.
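A wider GridSearchCV search might look like the sketch below. The grid values are illustrative, not the ones used in the project, and the synthetic data stands in for the churn features.

```python
# Sketch of hyperparameter tuning with GridSearchCV; grid values are
# illustrative, not the project's actual search space.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [4, 8, None],
    "min_samples_split": [2, 5],
}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)  # best combination found by cross-validation
```

The cost grows multiplicatively with each added parameter value (here 2 x 3 x 2 = 12 fits per fold), which is the trade-off of widening the grid.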
Predicting Used Car Prices
Motivation: In Turkey, used cars are usually sold on a website called "sahibinden", which means "from the owner". Dealers also use this website to sell and buy used cars, so it shapes the used car market to some extent. The most critical part of selling a used car is determining the optimal price. Many websites give an estimate of a used car's value, but it is better to also survey the market before setting the price. Moreover, other factors affect the price, such as location, how quickly you want to sell the car, whether anyone smoked in the car, and so on. Before posting an ad on the website, it is best to look through the prices of similar cars. However, this process can be exhausting because there are many ads online. Therefore, I decided to take advantage of the convenience offered by machine learning and create a model that predicts used car prices based on the data available on "sahibinden".
Data: I scraped data for a particular brand and model from the "sahibinden.com" website. The dataset includes 7 features and the price (target variable) of 6731 cars.
Achievements:
- Collected raw data using web scraping techniques.
- Cleaned the raw data to make it suitable for analysis.
- With an extensive exploratory data analysis, I detected the effect of each feature on the price.
- Implemented two models:
- Linear regression: achieved an R-squared score of 0.84.
- Random forests: achieved an R-squared score of 0.90.
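The model comparison above can be sketched as follows. The synthetic regression data stands in for the 7 scraped car features, so the R-squared scores printed here are not the project's results.

```python
# Sketch of comparing linear regression and random forests by R-squared
# on a held-out test set; synthetic data stands in for the car dataset.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=7, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in (LinearRegression(), RandomForestRegressor(random_state=0)):
    model.fit(X_train, y_train)
    score = r2_score(y_test, model.predict(X_test))
    print(type(model).__name__, round(score, 2))
```

Evaluating both models on the same held-out split is what makes the two R-squared scores directly comparable.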
How to Improve: There are many ways to improve a machine learning model. I think the most fundamental and effective one is gathering more data. In our case, we could (1) collect data for more cars, (2) collect more information about the cars already in the dataset, or both. For the first, there are other websites for selling used cars, so we could increase the size of the dataset by adding new cars. For the second, we could scrape more data about the cars from the "sahibinden" website: clicking on an ad opens another page with detailed information and pictures, where people describe the car's problems, any previous accidents or repairs, and so on.
Another way to improve the model is to adjust its hyperparameters. We can use RandomizedSearchCV to find optimal hyperparameter values.
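A RandomizedSearchCV run might look like the sketch below. The parameter distributions and the synthetic data are illustrative assumptions.

```python
# Sketch of RandomizedSearchCV: sample a fixed number of random parameter
# combinations instead of exhaustively searching a grid.
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

X, y = make_regression(n_samples=300, n_features=7, noise=10, random_state=0)

param_dist = {
    "n_estimators": randint(50, 200),  # sampled, not enumerated
    "max_depth": randint(3, 15),
}
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_dist,
    n_iter=5,   # only 5 random combinations, far cheaper than a full grid
    cv=3,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```

Because it samples from distributions rather than enumerating a grid, RandomizedSearchCV can cover wide hyperparameter ranges at a fixed, predictable cost.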