4/26/2023

Kudos

It was found that the target variable (Kudos) is skewed right (and also bimodal), and there are some extreme values above ~100. When using certain models, for example linear regression, we would like to see more of a normal distribution in the target variable. A Box-Cox transform could therefore be used to produce a more 'normal-looking' distribution in the target variable (normality can then be checked by plotting a Q-Q plot, for example).

An isolation forest model was used to detect outliers, and out of the 800 training points, 17 outliers were found. Upon further investigation, these were predominantly activity uploads that were initially set to private but then switched to public at a later date. This meant that other users couldn't see or interact with the activity. These were removed from the training set.

Datapoints Kept and Dropped (image by author)

Features such as distance, moving_time, and average_speed_mpk seem to share a similar distribution to the one we have with kudos_count. This was confirmed when looking at a correlation matrix. The photo count feature only has a few data points in each nonzero category, so it was changed to a binary feature of contains photos/no photos.

Workout type seems to correlate strongly with Kudos; however, there aren't that many data points for workout type 1 (a race). Intuitively, I also believe that races receive more Kudos in general. SMOTE was therefore used to oversample this workout type to give more predictive power to this feature, and it was later found that this reduced the RMSE.

By looking at the time distribution between activities, it was found that runs that are quickly followed in succession by other runs tend to receive less kudos than runs that were the only activity that day. To add to this, the longest activity of the day tends to receive more kudos than the other runs that day.

Code to Create Stratified K-Fold Cross-Validation (written by author)

Using this cross-validation scheme, I was able to confidently compare different models and select the most promising ones without worrying about overfitting the validation sets. This scheme was also used to tune the hyperparameters of the promising models (though the folds were reshuffled before hyperparameter tuning). It is worth noting that the only time the model was evaluated on the test set was right at the end of the project, when I wanted to get a real gauge of the true performance.

There was not a huge amount of missing values in the dataset, and most of them came from features that are fairly new to Strava or features I didn't always use. For example, the workout_type feature didn't always have a 'race' option (which turned out to have a lot of predictive power!). Before imputing missing values, it's always really important to first investigate them and try to find out if there is a reason why they are missing - it might be important!

After some inspection, I manually entered some missing values for workout_type, as I believed this feature to be very important and some older data didn't have labels. I chose to manually enter the values because any simple imputation technique would have assigned all the missing values to the default Strava category of 'easy run'. A KNN imputer was also considered, but there weren't many unlabelled data points, so manual entry was quicker. A heuristic function was also used to fill in activities with titles that contain race positions, e.g. 10th or 1st, as these activities were obviously races. This was done because races seemed to always have a big influence on Kudos.

The general imputation scheme used on the whole dataset was: all missing values in numeric features were imputed using the median strategy (the mean was also tried, but the median worked better), and any missing values in a categorical feature were assigned a new category, 'NONE' - this seems to work well most of the time when dealing with missing categories.
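The Box-Cox step described above maps onto scipy.stats.boxcox (an assumption - the post doesn't name a library). Box-Cox also requires strictly positive input, so a sketch would shift the counts by one to allow zero-kudos activities; the fake gamma-distributed target below stands in for the real data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Fake right-skewed Kudos counts standing in for the real target.
kudos = rng.gamma(shape=2.0, scale=15.0, size=800)

# Box-Cox requires strictly positive values, so shift by 1
# to allow activities with zero kudos.
transformed, lam = stats.boxcox(kudos + 1)

# The transform pulls in the long right tail.
print("skewness before:", stats.skew(kudos))
print("skewness after: ", stats.skew(transformed))
```

Normality of the transformed target can then be eyeballed with the Q-Q plot mentioned above, e.g. via stats.probplot(transformed, plot=ax).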
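The outlier-removal step maps naturally onto scikit-learn's IsolationForest (assuming that implementation was used - the post doesn't say). A minimal sketch on fake data, with a few extreme rows standing in for the private-then-public uploads:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# 800 fake training points: ordinary activities plus a handful of
# extreme rows playing the role of the flagged uploads.
normal = rng.normal(loc=0.0, scale=1.0, size=(783, 4))
odd = rng.normal(loc=8.0, scale=0.5, size=(17, 4))
X = np.vstack([normal, odd])

# contamination is the expected outlier share: 17 / 800 from the post.
iso = IsolationForest(contamination=17 / 800, random_state=0)
labels = iso.fit_predict(X)   # -1 = outlier, 1 = inlier

X_clean = X[labels == 1]      # drop the flagged rows before training
```

In practice the flagged rows would be inspected by hand (as in the post) rather than dropped blindly.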
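The stratified K-fold scheme for a continuous target like Kudos is usually done by binning the target and stratifying on the bin index; the original code isn't reproduced here, so this is a sketch under assumptions (scikit-learn, five quantile bins, placeholder data):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(42)
kudos = rng.gamma(shape=2.0, scale=15.0, size=800)  # fake continuous target
X = rng.normal(size=(800, 3))                       # placeholder features

# StratifiedKFold needs class labels, so bin the continuous target
# into quintiles and stratify on the bin index.
edges = np.quantile(kudos, [0.2, 0.4, 0.6, 0.8])
strata = np.digitize(kudos, edges)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
folds = list(skf.split(X, strata))

# Each validation fold now mirrors the target distribution:
for _, val_idx in folds:
    assert len(np.unique(strata[val_idx])) == 5
```

Reshuffling before hyperparameter tuning, as the post does, is just a new random_state on the StratifiedKFold.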
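The general imputation scheme described above (median for numeric features, a new 'NONE' category for categorical ones) is only a few lines of pandas; the column names here are hypothetical:

```python
import pandas as pd

# Hypothetical slice of the activities table.
df = pd.DataFrame({
    "distance": [5.0, None, 10.0, 8.0],            # numeric -> median
    "workout_type": ["race", None, "easy", None],  # categorical -> 'NONE'
})

for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(df[col].median())
    else:
        df[col] = df[col].fillna("NONE")
```

Fitting the medians on the training split only (and reusing them on validation/test) avoids leaking information across the cross-validation folds.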