
Calorie Expenditure Prediction Using Supervised Machine Learning Models

This project uses supervised machine learning models to predict the number of calories burned during a workout based on biometric and activity features such as age, weight, duration, heart rate, and body temperature. The goal is to support personalized fitness insights through accurate energy expenditure estimation.

📚 Import Required Libraries

import pandas as pd
from pathlib import Path
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_log_error

🗂️ Load and Preview the Dataset

test_path = Path('Predict_Calorie') / 'test.csv'
train_path = Path('Predict_Calorie') / 'train.csv'
sub_path = Path('Predict_Calorie') / 'gender_submission.csv'  # not used in the rest of this post
test = pd.read_csv(test_path)
train = pd.read_csv(train_path)
print("Train shape:", train.shape)
print("Test shape:", test.shape)
train.head()
Train shape: (750000, 9)
Test shape: (250000, 8)
   id     Sex  Age  Height  Weight  Duration  Heart_Rate  Body_Temp  Calories
0   0    male   36   189.0    82.0      26.0       101.0       41.0     150.0
1   1  female   64   163.0    60.0       8.0        85.0       39.7      34.0
2   2  female   51   161.0    64.0       7.0        84.0       39.8      29.0
3   3    male   20   192.0    90.0      25.0       105.0       40.7     140.0
4   4  female   38   166.0    61.0      25.0       102.0       40.6     146.0

🔍 Exploratory Data Analysis (EDA)

train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 750000 entries, 0 to 749999
Data columns (total 9 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   id          750000 non-null  int64  
 1   Sex         750000 non-null  object 
 2   Age         750000 non-null  int64  
 3   Height      750000 non-null  float64
 4   Weight      750000 non-null  float64
 5   Duration    750000 non-null  float64
 6   Heart_Rate  750000 non-null  float64
 7   Body_Temp   750000 non-null  float64
 8   Calories    750000 non-null  float64
dtypes: float64(6), int64(2), object(1)
memory usage: 51.5+ MB
train.describe()
                  id            Age         Height         Weight       Duration     Heart_Rate      Body_Temp       Calories
count  750000.000000  750000.000000  750000.000000  750000.000000  750000.000000  750000.000000  750000.000000  750000.000000
mean   374999.500000      41.420404     174.697685      75.145668      15.421015      95.483995      40.036253      88.282781
std    216506.495284      15.175049      12.824496      13.982704       8.354095       9.449845       0.779875      62.395349
min         0.000000      20.000000     126.000000      36.000000       1.000000      67.000000      37.100000       1.000000
25%    187499.750000      28.000000     164.000000      63.000000       8.000000      88.000000      39.600000      34.000000
50%    374999.500000      40.000000     174.000000      74.000000      15.000000      95.000000      40.300000      77.000000
75%    562499.250000      52.000000     185.000000      87.000000      23.000000     103.000000      40.700000     136.000000
max    749999.000000      79.000000     222.000000     132.000000      30.000000     128.000000      41.500000     314.000000

📊 Distribution of Calories Burned

plt.figure(figsize=(8, 4))
sns.histplot(train['Calories'], bins=50, kde=True)
plt.title("Distribution of Calories Burned")
plt.xlabel("Calories")
plt.ylabel("Frequency")
plt.show()

[Figure: Distribution of Calories Burned (histogram with KDE)]

📦 Boxplot: Calories by Sex

plt.figure(figsize=(6, 4))
sns.boxplot(data=train, x='Sex', y='Calories')
plt.title("Calories Burned by Sex")
plt.show()

[Figure: Calories Burned by Sex (boxplot)]

🧮 Correlation Heatmap of Numeric Features

plt.figure(figsize=(10, 6))
sns.heatmap(train.select_dtypes(include='number').corr(), annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Feature Correlation Heatmap")
plt.show()

[Figure: Feature Correlation Heatmap]
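
Each cell of the heatmap is a pairwise correlation coefficient; pandas' DataFrame.corr() computes the Pearson coefficient by default. As a rough illustration, the sketch below computes it with the standard library only. The helper name pearson is hypothetical, and the inputs are just the five preview rows of Duration and Calories shown earlier, so the result is not the full-dataset value from the heatmap.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys):
        raise ValueError("inputs must have the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalised covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalised variances)
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)


# Duration and Calories from the five rows printed by train.head()
duration = [26.0, 8.0, 7.0, 25.0, 25.0]
calories = [150.0, 34.0, 29.0, 140.0, 146.0]
print(round(pearson(duration, calories), 3))
```

Even on five rows the coefficient is close to 1, which matches the intuition that longer workouts burn more calories.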

🎯 Purpose of the Model

The goal of this machine learning model is to predict the number of calories burned during a workout session based on biometric and activity-related inputs such as age, weight, height, duration, heart rate, and body temperature.

By accurately estimating calorie expenditure, the model can support applications in fitness tracking, personalized health insights, and activity planning, helping users and systems better understand energy usage in physical activities.

🏷️ Encode Categorical Variables

le = LabelEncoder()
train['Sex'] = le.fit_transform(train['Sex'])  # male=1, female=0
test['Sex'] = le.transform(test['Sex'])
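
The mapping noted in the comment above follows from how LabelEncoder orders its classes: it sorts the distinct labels and numbers them from zero. A minimal pure-Python sketch of that behaviour, where the helper name fit_label_mapping is hypothetical and not part of scikit-learn:

```python
def fit_label_mapping(values):
    """Map each distinct label to an integer code, assigned in sorted order."""
    classes = sorted(set(values))
    return {label: code for code, label in enumerate(classes)}


# Sex values from the five rows printed by train.head()
sex_column = ["male", "female", "female", "male", "female"]

mapping = fit_label_mapping(sex_column)
encoded = [mapping[value] for value in sex_column]

print(mapping)
print(encoded)
```

Because "female" sorts before "male", it receives code 0. Reusing the mapping fitted on the training data for the test data keeps the codes consistent between the two files.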

🎯 Define Features and Target

X = train.drop(columns=['Calories', 'id'])
y = train['Calories']
X_test = test.drop(columns=['id'])

🧪 Function to Evaluate Performance

def evaluate_and_plot(model, name, X_train, X_val, y_train, y_val):
    """Fit the model, print its validation RMSLE, and plot actual vs. predicted calories."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_val)
    # mean_squared_log_error rejects negative values, so clip both arrays at zero
    y_pred = np.clip(y_pred, 0, None)
    y_val_clipped = np.clip(y_val, 0, None)

    rmsle = np.sqrt(mean_squared_log_error(y_val_clipped, y_pred))
    print(f"{name} - Validation RMSLE:", round(rmsle, 4))

    # Scatter of actual vs. predicted values; the dashed red diagonal marks perfect predictions
    plt.figure(figsize=(6, 6))
    sns.scatterplot(x=y_val, y=y_pred, alpha=0.3)
    plt.plot([y_val.min(), y_val.max()], [y_val.min(), y_val.max()], color='red', linestyle='--')
    plt.xlabel("Actual Calories")
    plt.ylabel("Predicted Calories")
    plt.title(f"{name} - Actual vs. Predicted Calories")
    plt.show()

    return name, rmsle, model
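
For reference, the RMSLE reported by this function is the square root of the mean squared difference between log(1 + predicted) and log(1 + actual). The standard-library sketch below spells that out. The helper name rmsle is an assumption for illustration, and the predicted values are made up rather than taken from any model in this post.

```python
import math


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error, using log(1 + x) on both inputs."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    squared_log_errors = [
        (math.log1p(pred) - math.log1p(true)) ** 2
        for true, pred in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


actual = [150.0, 34.0, 29.0, 140.0, 146.0]     # Calories from the train.head() rows
predicted = [145.0, 36.0, 30.0, 150.0, 140.0]  # made-up predictions for illustration

print(round(rmsle(actual, predicted), 4))
```

Because the error is measured on a log scale, relative differences matter more than absolute ones: missing a 34-calorie workout by 2 calories costs more than missing a 150-calorie workout by 5.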

🔀 Train/Test Split for Validation

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
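
With test_size=0.2, one fifth of the 750,000 training rows is held out, leaving 600,000 rows for fitting and 150,000 for validation. The sketch below mimics a shuffled hold-out split with the standard library only. The helper holdout_split is hypothetical, and its shuffling will not reproduce the exact rows scikit-learn selects for random_state=42.

```python
import math
import random


def holdout_split(n_rows, test_size=0.2, seed=42):
    """Return (train_indices, val_indices) for a shuffled hold-out split."""
    indices = list(range(n_rows))
    # A seeded generator makes the shuffle repeatable between runs
    random.Random(seed).shuffle(indices)
    n_val = math.ceil(n_rows * test_size)
    return indices[n_val:], indices[:n_val]


train_idx, val_idx = holdout_split(750_000)
print(len(train_idx), len(val_idx))
```

Fixing the seed matters here because all three models are compared on the same validation rows.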

🌲 Train a Random Forest Regressor

rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)  # optional: evaluate_and_plot() fits the model again
RandomForestRegressor(random_state=42)

📏 Evaluate Model with RMSLE

evaluate_and_plot(rf_model, "Random Forest", X_train, X_val, y_train, y_val)
Random Forest - Validation RMSLE: 0.0634

[Figure: Random Forest - Actual vs. Predicted Calories]

('Random Forest',
 np.float64(0.06339603326021119),
 RandomForestRegressor(random_state=42))

🌟 Gradient Boosting Regressor

from sklearn.ensemble import GradientBoostingRegressor
gbr_model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=42)

📏 Evaluate Model with RMSLE

evaluate_and_plot(gbr_model, "Gradient Boosting", X_train, X_val, y_train, y_val)
Gradient Boosting - Validation RMSLE: 0.1294

[Figure: Gradient Boosting - Actual vs. Predicted Calories]

('Gradient Boosting',
 np.float64(0.12937499268300726),
 GradientBoostingRegressor(random_state=42))

⚡ XGBoost Regressor

from xgboost import XGBRegressor
xgb_model = XGBRegressor(n_estimators=100, learning_rate=0.1, objective='reg:squarederror', random_state=42)

📏 Evaluate Model with RMSLE

evaluate_and_plot(xgb_model, "XGBoost", X_train, X_val, y_train, y_val)
XGBoost - Validation RMSLE: 0.0682

[Figure: XGBoost - Actual vs. Predicted Calories]

('XGBoost',
 np.float64(0.06817993870283824),
 XGBRegressor(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              feature_weights=None, gamma=None, grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=0.1, max_bin=None, max_cat_threshold=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
              max_leaves=None, min_child_weight=None, missing=nan,
              monotone_constraints=None, multi_strategy=None, n_estimators=100,
              n_jobs=None, num_parallel_tree=None, ...))

🤩 Summary

Model                  RMSLE    Performance Summary
🌲 Random Forest       0.0634   ✅ Best overall: highly accurate with low error, robust to overfitting
⚡ XGBoost             0.0682   🔼 Close second: excellent performance, slight trade-off for training speed
🌟 Gradient Boosting   0.1294   🟡 Significantly higher error: may benefit from tuning or more trees

📌 Conclusion: Why Random Forest Performed Best

Among the three models tested, the Random Forest Regressor achieved the lowest validation RMSLE (0.0634), making it the most accurate at predicting calorie expenditure on the held-out 20% of the training data.

🔍 Why did Random Forest perform best?

  • It handles nonlinear relationships and feature interactions very well without requiring extensive tuning.
  • It's robust to noise and outliers, which is useful when working with real-world biometric data.
  • It averages over many decision trees, reducing overfitting and yielding stable, reliable results.

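In contrast, Gradient Boosting and XGBoost can be more sensitive to parameter settings and typically require more fine-tuning to reach optimal performance. Both were run here with only n_estimators=100 and learning_rate=0.1 set, so the ranking reflects these settings rather than each model's tuned best.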
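
The averaging argument in the list above can be illustrated without any tree code. The sketch below is a toy simulation, not a random forest: each "estimator" is assumed to be the true value plus independent Gaussian noise, and the spread of a single estimator is compared with the spread of the average of 100. The function name simulate_spread and all of its parameters are illustrative assumptions. Real trees in a forest are correlated, so the reduction in practice is smaller than in this idealised setting.

```python
import random
import statistics


def simulate_spread(n_estimators, n_trials=2000, true_value=100.0, noise_sd=10.0, seed=0):
    """Standard deviation of the averaged prediction across repeated trials."""
    rng = random.Random(seed)
    averages = []
    for _ in range(n_trials):
        # Each estimator predicts the true value plus independent noise
        preds = [true_value + rng.gauss(0.0, noise_sd) for _ in range(n_estimators)]
        averages.append(sum(preds) / n_estimators)
    return statistics.stdev(averages)


single = simulate_spread(1)
ensemble = simulate_spread(100)

print(f"single estimator spread: {single:.2f}")
print(f"average of 100 spread:   {ensemble:.2f}")
```

With independent noise the spread shrinks roughly with the square root of the number of estimators, which is the idealised version of why an ensemble of 100 trees gives steadier predictions than any one tree.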

This post is licensed under CC BY 4.0 by the author.
