Art Toy Price Prediction on Shopee
March - April 2024
A machine learning solution designed to predict the price of 'Art Toys' on the Shopee platform. This project focuses on developing and training an ML model, processing data, and selecting suitable algorithms and methodologies to improve accuracy. It aims to determine optimal pricing, analyze the impact of price changes on sales, and assess how various factors influence pricing.
Data Collection
We begin by collecting data through web scraping from the Shopee website (https://shopee.co.th/). Selenium automates and controls the web browser that loads the pages, and BeautifulSoup parses the retrieved HTML content. Each retrieved data row represents a top-selling product on the Shopee platform when searching for "Art Toy."
The dataset consists of six features as follows:
Product
Promotion Price
Percent Discount
Sold Per Month
Location
Rating
Click to view the Data Collection (Web Scraping) Code.
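As a rough illustration of how Selenium and BeautifulSoup can divide the work, the sketch below loads a page in a browser and then parses the returned HTML. It is not the project's scraper: the CSS selectors, the field names, and both function names are hypothetical placeholders, and Shopee's real markup would need to be inspected before anything like this could work.

```python
from bs4 import BeautifulSoup

# Hypothetical selectors for illustration only; the real page markup differs.
ITEM_SELECTOR = "div.product-card"
FIELD_SELECTORS = {
    "product": ".product-name",
    "promotionPrice": ".product-price",
    "soldPerMonth": ".product-sold",
    "location": ".product-location",
}


def fetch_page_source(url):
    """Load a page in a Selenium-controlled browser and return its HTML."""
    # Imported lazily because it needs Selenium and a local Chrome install.
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()


def parse_products(html):
    """Extract one dict per product card; missing fields become None."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.select(ITEM_SELECTOR):
        row = {}
        for field, selector in FIELD_SELECTORS.items():
            node = card.select_one(selector)
            row[field] = node.get_text(strip=True) if node else None
        rows.append(row)
    return rows
```

Keeping the browser step separate from the parsing step means the parser can be tried on a saved HTML file without launching a browser.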
Next, explore and clean the data: convert product names to text strings, remove the '฿' currency symbol, expand price values abbreviated with 'k' (thousands) into plain numbers, and derive each rating from the count of star images.
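The numeric part of this cleanup could look something like the sketch below. The function name `parse_amount` and the exact input formats (for example '฿1.2k') are assumptions made for illustration; the scraped strings may differ.

```python
import re


def parse_amount(text):
    """Convert a scraped string such as '฿1.2k' or '฿279' to a float."""
    if text is None:
        return None
    # Drop the currency symbol, thousands separators, and surrounding spaces.
    cleaned = text.replace("฿", "").replace(",", "").strip().lower()
    match = re.fullmatch(r"(\d+(?:\.\d+)?)(k?)", cleaned)
    if match is None:
        return None  # leave unparseable values missing for later imputation
    value = float(match.group(1))
    if match.group(2) == "k":
        value *= 1000  # 'k' stands for thousands
    return value
```

Returning None for anything unrecognized keeps bad rows visible as missing values instead of silently turning them into zeros.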
Data Processing
Separate numerical and categorical columns.
Add a price_cut column that splits products into low and high price groups. Products with a promotion price less than or equal to 279.60 (the 40th percentile) are labeled 0 (low price); products with a promotion price greater than 347.40 (the 60th percentile) are labeled 1 (high price). Products in the middle range between these thresholds are dropped.
Convert the categorical 'location' feature into binary indicator columns.
Fill in missing values using SimpleImputer.
Scale features using MinMaxScaler.
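The processing steps above can be sketched roughly as follows. This is an assumption-laden illustration, not the project's code: the column names `promotionPrice`, `rating`, and `location` are guesses, the thresholds are recomputed from the data instead of being hard-coded, the mean imputation strategy is assumed, and excluding the price itself from the features (since the label is derived from it) is a choice made here.

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

NUMERIC_COLS = ["percentDiscount", "soldPerMonth", "rating"]  # assumed names
CATEGORICAL_COLS = ["location"]  # assumed name


def prepare(df):
    """Label price groups, encode location, impute, and scale."""
    # 40th and 60th percentiles of the promotion price.
    low, high = df["promotionPrice"].quantile([0.4, 0.6])

    # Keep only the low and high groups; the middle range is dropped.
    df = df[(df["promotionPrice"] <= low) | (df["promotionPrice"] > high)].copy()
    df["price_cut"] = (df["promotionPrice"] > high).astype(int)

    # One binary column per location value.
    features = pd.get_dummies(
        df[NUMERIC_COLS + CATEGORICAL_COLS],
        columns=CATEGORICAL_COLS,
        prefix="",
        prefix_sep="",
        dtype=float,
    )

    # Fill missing values, then rescale every feature to [0, 1].
    imputed = SimpleImputer(strategy="mean").fit_transform(features)
    scaled = MinMaxScaler().fit_transform(imputed)
    X = pd.DataFrame(scaled, columns=features.columns, index=features.index)
    return X, df["price_cut"]
```

In a stricter setup the imputer and scaler would be fitted on the training split only, so that no information from the test split leaks into preprocessing.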
Feature Selection
SelectKBest: Mutual Information: ['percentDiscount', 'soldPerMonth', 'Overseas']
Correlation: ['percentDiscount', 'Overseas', 'จังหวัดกรุงเทพมหานคร Bangkok']
SelectKBest: Chi-squared: ['จังหวัดกรุงเทพมหานคร Bangkok', 'percentDiscount', 'Overseas']
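For reference, the three rankings listed above could be produced along the lines of this sketch. The helper names and `k=3` are assumptions; the `random_state` is fixed only to make the mutual information estimate repeatable. Note that the chi-squared test requires non-negative features, which MinMax-scaled data satisfies.

```python
from functools import partial

import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

# Scoring functions accepted by SelectKBest.
SCORERS = {
    "mutual_info": partial(mutual_info_classif, random_state=0),
    "chi2": chi2,
}


def select_k_best(X, y, score_func, k=3):
    """Return the names of the k highest-scoring feature columns."""
    selector = SelectKBest(score_func=score_func, k=k).fit(X, y)
    return list(X.columns[selector.get_support()])


def top_correlated(X, y, k=3):
    """Return the k columns with the largest absolute correlation to y."""
    return list(X.corrwith(y).abs().nlargest(k).index)
```

Calling `select_k_best(X, y, SCORERS["chi2"])` on a DataFrame `X` and label Series `y` returns the selected column names in their original column order, not ranked by score.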
Model Training
Algorithm Selection
K-Nearest Neighbor (KNN): Effective for non-linear relationships; simple to understand
Naive Bayes: Works well when features are independent
Artificial Neural Networks (ANN): Handles complex non-linear relationships; suitable for various types of data
Random Forest: Robust across different datasets
After selecting candidate algorithms for predicting the sale price of 'Art Toys' on the Shopee platform, the next steps are model training and hyperparameter tuning, using techniques such as grid search, to optimize each model's performance.
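A grid search over the four algorithm families might be set up as in the sketch below. Everything specific here is an assumption: the hyperparameter grids are illustrative, GaussianNB stands in for whichever Naive Bayes variant was used, scikit-learn's MLPClassifier stands in for the ANN, and ROC AUC is chosen as the tuning metric.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

# Candidate models paired with illustrative (assumed) hyperparameter grids.
CANDIDATES = {
    "KNN": (KNeighborsClassifier(), {"n_neighbors": [3, 5, 7]}),
    "Naive Bayes": (GaussianNB(), {"var_smoothing": [1e-9, 1e-7]}),
    "ANN": (
        MLPClassifier(max_iter=500, random_state=0),
        {"hidden_layer_sizes": [(16,), (32,)]},
    ),
    "Random Forest": (
        RandomForestClassifier(random_state=0),
        {"n_estimators": [50, 100], "max_depth": [None, 5]},
    ),
}


def tune_models(X_train, y_train, cv=5, scoring="roc_auc"):
    """Grid-search every candidate and return the fitted searches by name."""
    fitted = {}
    for name, (model, grid) in CANDIDATES.items():
        search = GridSearchCV(model, grid, cv=cv, scoring=scoring)
        fitted[name] = search.fit(X_train, y_train)
    return fitted
```

Each returned search exposes `best_params_`, `best_score_`, and a refitted `best_estimator_` that can be evaluated on a held-out test split.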
Result and Evaluation
For performance evaluation, the models are compared using the classification report and the ROC AUC score. The ROC curve is also plotted for both the "low" and "high" price classes. The performance of each model is summarized as follows:
Model Performance Comparison Matrix
It can be concluded that Naive Bayes achieved the best score of 0.755, while Random Forest performed the best in terms of accuracy (0.758), recall (0.76), and precision (0.76).
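The evaluation described in this section could be computed per model with a helper like the one sketched here. The function name `evaluate` is hypothetical, and the sketch assumes a fitted scikit-learn classifier that supports `predict_proba`, with label 0 meaning "low" and label 1 meaning "high".

```python
from sklearn.metrics import classification_report, roc_auc_score, roc_curve


def evaluate(model, X_test, y_test):
    """Summarize a fitted binary classifier on held-out data."""
    y_pred = model.predict(X_test)
    # Column 1 holds the predicted probability of the "high" class (label 1).
    y_score = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, y_score)
    return {
        "report": classification_report(
            y_test, y_pred, target_names=["low", "high"]
        ),
        "roc_auc": roc_auc_score(y_test, y_score),
        "roc_curve": (fpr, tpr),  # points to plot for the "high" class
    }
```

The ROC curve for the "low" class can be obtained the same way by scoring with column 0 of `predict_proba` and passing `pos_label=0` to `roc_curve`.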
