Problem Statement
- The goal of this project is to develop a recommender system that provides product recommendations.
- Your focus in this exercise should be on the following:
The following steps are recommended for solving this problem:
- Data Collection: Gather historical order data from the database or e-commerce platform. Collect healthiness information through surveys or health records (if available).
- Data Preprocessing: Clean and preprocess the historical order and healthiness data, handle missing values, and encode categorical variables.
- Recommender System Development:
  - Choose and implement the appropriate recommender system algorithm (collaborative filtering, content-based, or hybrid).
  - Split the data into training and testing sets.
  - Train and fine-tune the recommender system using historical order data.
- Feature Engineering:
  - Normalize or scale numerical features as necessary.
- Healthiness Prediction Model Development:
  - Choose a suitable machine learning/deep learning algorithm.
  - Split the healthiness data into training and testing sets.
- Testing and Validation:
  - Evaluate the performance of the recommender system using appropriate metrics.
  - Conduct user testing and collect feedback for system improvements.
- Documentation and Deployment:
  - Prepare comprehensive documentation for the project, including the models’ technical details and usage instructions.
  - Deploy the system on a web server or cloud platform for public use.
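As a rough illustration of the recommender system step, the sketch below implements a small item-based collaborative filter using only the Python standard library. It is one possible starting point, not a prescribed method: the toy ratings, the function names (`build_item_vectors`, `cosine`, `recommend`), and the choice of cosine similarity are all assumptions made for this example.

```python
import math
from collections import defaultdict

# Toy (user_id, recipe_id, rating) triples; hypothetical data for illustration only.
ratings = [
    ("u1", "r1", 5), ("u1", "r2", 4),
    ("u2", "r1", 4), ("u2", "r2", 5), ("u2", "r3", 2),
    ("u3", "r2", 1), ("u3", "r3", 5),
]

def build_item_vectors(triples):
    """Map each recipe to a sparse {user: rating} vector."""
    vectors = defaultdict(dict)
    for user, item, rating in triples:
        vectors[item][user] = rating
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse {user: rating} vectors."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[u] * b[u] for u in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

def recommend(user, triples, top_n=3):
    """Rank recipes the user has not rated by similarity-weighted ratings."""
    vectors = build_item_vectors(triples)
    seen = {item: rating for u, item, rating in triples if u == user}
    scores = {}
    for candidate in vectors:
        if candidate in seen:
            continue  # never recommend something the user already rated
        scores[candidate] = sum(
            cosine(vectors[candidate], vectors[item]) * rating
            for item, rating in seen.items()
        )
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(recommend("u1", ratings))
```

A content-based or hybrid variant would replace the `{user: rating}` vectors with recipe features such as ingredients or nutrition values; the ranking logic could stay the same.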
Tasks/Activities List
Your code should cover the following activities and analyses:
- Data Collection
- Data Preprocessing
- Feature Engineering
- Recommender System
- Evaluation
- Integration
- User Interface/API
- Testing and Validation
- Model Refinement
- Documentation
- Deployment Plan
Success Metrics
Below are the metrics for the successful submission of this case study.
- The accuracy of the model on the test data set should be > 75% (Subjective in nature)
- Add methods for Hyperparameter tuning.
- Perform model validation.
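One way to approach hyperparameter tuning and validation is a small grid search scored on a held-out validation set. The sketch below is an assumption-laden example: the model (per-recipe mean ratings shrunk toward the global mean), the `damping` hyperparameter, the toy data, and the use of RMSE are all illustrative choices. RMSE stands in for the accuracy target above, which would need its own definition for a recommender.

```python
import math
from collections import defaultdict

def fit_damped_means(train, damping):
    """Per-recipe mean rating shrunk toward the global mean.

    `damping` is the hyperparameter: larger values pull sparse recipes
    harder toward the global mean.
    """
    global_mean = sum(r for _, _, r in train) / len(train)
    sums, counts = defaultdict(float), defaultdict(int)
    for _, item, rating in train:
        sums[item] += rating
        counts[item] += 1
    item_means = {
        item: (sums[item] + damping * global_mean) / (counts[item] + damping)
        for item in sums
    }
    return global_mean, item_means

def rmse(model, data):
    """Root mean squared error; unseen recipes fall back to the global mean."""
    global_mean, item_means = model
    errors = [(item_means.get(item, global_mean) - r) ** 2 for _, item, r in data]
    return math.sqrt(sum(errors) / len(errors))

def grid_search(train, valid, grid):
    """Return (best_damping, best_rmse), selected by validation RMSE."""
    results = {d: rmse(fit_damped_means(train, d), valid) for d in grid}
    best = min(results, key=results.get)
    return best, results[best]

# Hypothetical toy data: (user_id, recipe_id, rating) triples.
train = [("u1", "r1", 5), ("u2", "r1", 5), ("u1", "r2", 1),
         ("u3", "r2", 2), ("u2", "r3", 4)]
valid = [("u3", "r1", 4), ("u2", "r2", 2), ("u1", "r3", 3)]

best_damping, best_rmse = grid_search(train, valid, grid=[0, 1, 5])
print(best_damping, round(best_rmse, 3))
```

The test set should be touched only once, after the hyperparameter has been fixed on the validation set, so that the reported score is not optimistically biased.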
Bonus Points
- Package your solution as a zip file that includes a README explaining how to install and run the end-to-end pipeline.
- Demonstrate your documentation skills by describing how the solution benefits our company.
Data
We construct a dataset by collecting data from Allrecipes.com. The website was chosen because it is one of the largest food-oriented social networks, with 1.5 billion visits per year. In particular, we crawl 52,821 recipes from 27 categories posted between 2000 and 2018. For each recipe, we crawl its ingredients, image, and the corresponding user ratings. To ensure data quality, we filter out recipes that have no image, contain repeated ingredients, or have zero reviews. This yields raw_data, which includes 1,160,267 users, 49,698 recipes with 38,131 ingredients, and 3,794,003 interactions, and contains the following files:
- recipe_file: each recipe is presented on one line, which consists of recipe id, name, average reviewer rating, image URL, number of reviews, ingredients, cooking directions, nutrition information, and reviews.
- interaction_file: each interaction is presented on one line, which consists of user id, recipe id, rating, and dateLastModified.
- images: 49,698 images, each named recipe_id.jpg.
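Reading the interaction file could look roughly like the sketch below. The field order follows the description above, but the delimiter is not stated there, so the tab separator is an assumption, as are the `Interaction` class, the `parse_interaction` helper, and the sample line. The timestamp is kept as raw text because its format is not specified.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interaction:
    user_id: str
    recipe_id: str
    rating: float
    date_last_modified: str  # raw text; the timestamp format is not specified

def parse_interaction(line, sep="\t"):
    """Parse one interaction line; `sep` is an assumed delimiter."""
    user_id, recipe_id, rating, date_last_modified = line.rstrip("\n").split(sep)
    return Interaction(user_id, recipe_id, float(rating), date_last_modified)

# Hypothetical sample line, not taken from the real file.
sample = "user42\trecipe7\t5\t2015-06-01"
print(parse_interaction(sample))
```

If the real file uses a different delimiter or includes a header row, only `sep` and the calling loop would need to change.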
In addition, to evaluate the performance of recommendation models, we hold out the latest 30% of the interaction history as the test set and split the remaining data into training (60%) and validation (10%) sets. We then retain only the users who occur in both the training and test sets, which gives 68,768 users, 45,630 recipes with 33,147 ingredients, and 1,093,845 interactions. We call this dataset core_data; it consists of the following files:
- recipe_file: each recipe is presented on one line, which consists of recipe id, name, image URL, ingredients, cooking directions, and nutrition information.
- interaction_file (train.rating/valid.rating/test.rating): each interaction is presented on one line, which consists of user id, recipe id, rating, and dateLastModified.
- images: 45,630 images, each named recipe_id.jpg.
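The 60/10/30 split and the user filter described above can be sketched as follows. The description does not say whether the split is global or per user, nor how the remaining 70% is divided, so this example assumes a single global chronological ordering with sortable timestamps. The function names and toy data are hypothetical.

```python
def chronological_split(interactions, train_pct=60, valid_pct=10):
    """Split (user, recipe, rating, timestamp) tuples by time.

    The earliest `train_pct` percent go to training, the next `valid_pct`
    percent to validation, and the latest remainder to the test set.
    """
    ordered = sorted(interactions, key=lambda x: x[3])
    n = len(ordered)
    train_end = n * train_pct // 100
    valid_end = n * (train_pct + valid_pct) // 100
    return ordered[:train_end], ordered[train_end:valid_end], ordered[valid_end:]

def keep_shared_users(train, valid, test):
    """Keep only users that occur in both the training and the test set."""
    shared = {x[0] for x in train} & {x[0] for x in test}

    def keep(rows):
        return [x for x in rows if x[0] in shared]

    return keep(train), keep(valid), keep(test)

# Hypothetical toy data: ten interactions with integer timestamps 0..9.
data = [(f"u{i % 4}", f"r{i}", 5, i) for i in range(10)]

train, valid, test = chronological_split(data)
print(len(train), len(valid), len(test))

train, valid, test = keep_shared_users(train, valid, test)
print(len(train), len(valid), len(test))
```

Splitting by time rather than at random keeps future interactions out of the training set, which mirrors how a deployed recommender only ever sees past behavior.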