Introduction
I’ve published a Kaggle dataset that tracks the year-over-year evolution of MLB pitchers’ arsenals from 2020 to 2025. This dataset enables analysis of how pitchers adjust their pitch mix, velocity, and spin rates over time.
Dataset Link: https://www.kaggle.com/datasets/yasunorim/mlb-pitcher-arsenal-2020-2025
Dataset Overview
This dataset provides per-pitch-type metrics for MLB pitchers, aggregated by pitcher and season across six seasons (2020-2025).
Basic Information
- Period: 2020-2025 seasons (6 seasons)
- Rows: 4,253 rows (pitcher × season combinations)
- Columns: 111 columns
- Format: Wide format (1 row = 1 pitcher × 1 season)
- Filter: Only pitchers with 100+ pitches in a season
- Usability Score: 10.0/10 on Kaggle
Data Source
Data is sourced from MLB Advanced Media (Statcast) via the pybaseball library and aggregated by pitcher × season × pitch type.
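As a rough illustration of what "aggregated by pitcher × season × pitch type" means, the sketch below collapses a tiny hand-made pitch-level table into one wide row per pitcher-season. The toy rows, the `to_wide` helper, and the exact aggregation are assumptions for illustration and are not taken from the collection notebook; only `pitch_type` and `release_speed` follow Statcast column names.

```python
import pandas as pd

# Toy pitch-level data (made up); real Statcast rows have many more columns.
pitches = pd.DataFrame({
    "player_id": [1, 1, 1, 1, 2, 2],
    "season": [2024, 2024, 2024, 2024, 2024, 2024],
    "pitch_type": ["FF", "FF", "FF", "SL", "SI", "SI"],
    "release_speed": [95.0, 96.0, 94.0, 85.0, 92.0, 93.0],
})

def to_wide(pitches: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: one row per pitcher-season, columns like FF_usage_pct."""
    keys = ["player_id", "season"]
    per_type = (
        pitches.groupby(keys + ["pitch_type"])
        .agg(n=("release_speed", "size"), avg_speed=("release_speed", "mean"))
        .reset_index()
    )
    # Usage rate: share of the pitcher's pitches in that season, in percent.
    totals = per_type.groupby(keys)["n"].transform("sum")
    per_type["usage_pct"] = 100 * per_type["n"] / totals
    wide = per_type.pivot(index=keys, columns="pitch_type",
                          values=["usage_pct", "avg_speed"])
    # Flatten (metric, pitch) column pairs into "<pitch>_<metric>" names.
    wide.columns = [f"{pitch}_{metric}" for metric, pitch in wide.columns]
    return wide.reset_index()

wide = to_wide(pitches)
print(wide)
```

In this sketch, a pitch type that a pitcher never threw comes out as NaN in the wide table. Whether the published CSV uses NaN or 0 for that case is worth checking before averaging across pitchers.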
Data Structure
Identifier Columns (3 columns)
- player_id: MLB player ID
- player_name: Player name
- season: Season year (2020-2025)
Pitch Metrics (18 pitch types × 6 metrics = 108 columns)
For each pitch type, the dataset includes 6 metrics:
- usage_pct: Usage rate (0-100%)
- avg_speed: Average velocity (mph)
- avg_spin: Average spin rate (rpm)
- whiff_rate: Whiff rate (0-1)
- avg_pfx_x: Average horizontal movement (inches, gravity-adjusted)
- avg_pfx_z: Average vertical movement (inches, gravity-adjusted)
Pitch Types (18 types)
FF (Four-seam), SI (Sinker), FC (Cutter), SL (Slider), CU (Curveball), CH (Changeup), FS (Splitter), KC (Knuckle Curve), FO (Forkball), EP (Eephus), KN (Knuckleball), ST (Sweeper), SV (Slurve), and five others.
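Because every metric lives in its own column, some analyses are easier after reshaping to a long table with one row per pitcher, season, and pitch type. The sketch below shows one way to do that on made-up rows. It assumes the columns follow a `<pitch>_<metric>` pattern (as in `SL_usage_pct` and `FF_avg_speed`, which appear in the examples in this post); the toy values and pitcher names are hypothetical.

```python
import pandas as pd

# Toy wide-format rows (made up) using the "<pitch>_<metric>" naming pattern.
wide = pd.DataFrame({
    "player_id": [1, 2],
    "player_name": ["Pitcher A", "Pitcher B"],
    "season": [2024, 2024],
    "FF_usage_pct": [60.0, 45.0],
    "FF_avg_speed": [95.1, 93.4],
    "SL_usage_pct": [40.0, None],
    "SL_avg_speed": [85.2, None],
})

id_cols = ["player_id", "player_name", "season"]

# Stack every metric column into (column name, value) pairs.
long = wide.melt(id_vars=id_cols, var_name="column", value_name="value")
# Split "FF_usage_pct" into pitch type "FF" and metric "usage_pct" at the first underscore.
long[["pitch_type", "metric"]] = long["column"].str.split("_", n=1, expand=True)
# One row per pitcher, season, and pitch type, with one column per metric.
tidy = (
    long.pivot(index=id_cols + ["pitch_type"], columns="metric", values="value")
    .dropna(how="all")  # drop pitch types the pitcher has no data for
    .reset_index()
)
print(tidy)
```

Splitting at the first underscore only works if pitch codes never contain an underscore, which holds for the two-letter codes listed above.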
Usage
Downloading the Data
Download the CSV directly from Kaggle:
import pandas as pd
# Download via the Kaggle API (the leading "!" runs a shell command in a notebook cell)
!kaggle datasets download -d yasunorim/mlb-pitcher-arsenal-2020-2025 --unzip
# Load data
df = pd.read_csv('pitcher_arsenal_evolution_2020_2025.csv')
Analysis Notebook
A comprehensive analysis notebook is also available:
https://www.kaggle.com/code/yasunorim/pitcher-arsenal-analysis
The notebook includes:
- Individual pitcher arsenal trend analysis
- MLB-wide pitch type trends (2020-2025)
- Velocity change analysis
- Correlation heatmaps
Use Cases
1. Pitcher Arsenal Trend Analysis
Track year-over-year changes in a specific pitcher’s repertoire:
# Yusei Kikuchi's pitch usage evolution
kikuchi = df[df['player_name'].str.contains('Kikuchi', case=False)]
kikuchi.plot(x='season', y=['SL_usage_pct', 'FF_usage_pct', 'CH_usage_pct'])
For Kikuchi, slider usage rose from ~20% in 2019 (the season before this dataset begins) to over 40% in 2022-2025, a significant shift in pitch mix strategy.
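The year-over-year change can also be computed directly rather than read off a plot. The following is a minimal sketch on made-up numbers for a single pitcher; the values are hypothetical and only the column names follow the examples above.

```python
import pandas as pd

# Toy data (made up) for one pitcher; column names follow the post's examples.
pitcher = pd.DataFrame({
    "season": [2022, 2020, 2021],
    "SL_usage_pct": [41.0, 22.0, 30.0],
    "FF_usage_pct": [38.0, 50.0, 45.0],
})

usage_cols = ["SL_usage_pct", "FF_usage_pct"]

# Sort by season first so that diff() compares consecutive years.
pitcher = pitcher.sort_values("season").set_index("season")
# Change in percentage points from the previous season (first season is NaN).
yoy_change = pitcher[usage_cols].diff()
print(yoy_change)
```

Note that `diff()` compares adjacent rows, not adjacent calendar years. If a pitcher is missing a season (for example, by falling under the 100-pitch filter), the difference spans the gap.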
2. Injury Recovery Analysis
Analyze changes in pitch mix and velocity before and after injury:
# Compare seasons for a specific pitcher (123456 is a placeholder player_id)
pitcher_data = df[df['player_id'] == 123456]
pitcher_data[['season', 'FF_avg_speed', 'FF_usage_pct']]
3. MLB-Wide Trend Analysis
Visualize league-wide pitch type trends:
# Calculate yearly average usage rates
yearly_avg = df.groupby('season')[['FF_usage_pct', 'SI_usage_pct', 'SL_usage_pct']].mean()
yearly_avg.plot(kind='line', marker='o')
Recent trends show increasing usage of sliders and sweepers across MLB.
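One caveat when averaging usage across pitchers: `mean()` skips missing values, so the result depends on how pitchers who do not throw a pitch are encoded. The sketch below uses made-up sweeper (`ST_usage_pct`) values and assumes that unused pitch types are stored as NaN, which should be verified against the actual CSV. It compares three league-level summaries.

```python
import pandas as pd

# Toy data (made up): NaN means the pitcher did not throw that pitch type.
toy = pd.DataFrame({
    "season": [2021, 2021, 2022, 2022],
    "ST_usage_pct": [None, 20.0, 30.0, 10.0],
})

by_season = toy.groupby("season")["ST_usage_pct"]

trend = pd.DataFrame({
    # mean() skips NaN, so this averages only pitchers who throw the pitch.
    "mean_among_users": by_season.mean(),
    # Treat missing as 0% to average over every pitcher in the season.
    "mean_all_pitchers": toy["ST_usage_pct"].fillna(0).groupby(toy["season"]).mean(),
    # Share of pitchers with any recorded usage of the pitch.
    "share_of_pitchers": toy["ST_usage_pct"].gt(0).groupby(toy["season"]).mean(),
})
print(trend)
```

In the toy data the average among users is flat while the share of pitchers throwing the pitch doubles, so the three measures can tell different stories about the same trend.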
4. Machine Learning Features
Use arsenal metrics as features for pitcher performance prediction:
# Pitch diversity: number of pitch types with more than 5% usage
df['pitch_diversity'] = (df[[col for col in df.columns if 'usage_pct' in col]] > 5).sum(axis=1)
Data Collection Method
The data collection notebook is also available for reproducing the dataset with updated seasons:
https://www.kaggle.com/datasets/yasunorim/mlb-pitcher-arsenal-2020-2025?select=pitcher_arsenal_evolution_2020_2025.ipynb
Collection Process
from pybaseball import statcast_pitcher
import pandas as pd
# Example: Collect arsenal data for a pitcher
pitcher_id = 660271 # Shohei Ohtani
season = 2025
# Get pitch-by-pitch data
pitches = statcast_pitcher(
    start_dt=f"{season}-03-01",
    end_dt=f"{season}-11-30",
    player_id=pitcher_id
)
# Aggregate by pitch type (named aggregation avoids reusing the grouping column)
arsenal = pitches.groupby('pitch_type').agg(
    avg_speed=('release_speed', 'mean'),
    avg_spin=('release_spin_rate', 'mean'),
    avg_pfx_x=('pfx_x', 'mean'),
    avg_pfx_z=('pfx_z', 'mean'),
    pitch_count=('release_speed', 'size')
)
Conclusion
I hope this dataset is useful for baseball analytics research and machine learning projects. If you have any questions or feedback, feel free to leave a comment or start a discussion on Kaggle!
Links
- Dataset: https://www.kaggle.com/datasets/yasunorim/mlb-pitcher-arsenal-2020-2025
- Analysis Notebook: https://www.kaggle.com/code/yasunorim/pitcher-arsenal-analysis
- GitHub Repository: https://github.com/yasumorishima/kaggle-datasets