MLB Pitcher Arsenal Evolution Dataset (2020-2025)

Introduction

I’ve published a Kaggle dataset that tracks the year-over-year evolution of MLB pitchers’ arsenals from 2020 to 2025. This dataset enables analysis of how pitchers adjust their pitch mix, velocity, and spin rates over time.

Dataset Link: https://www.kaggle.com/datasets/yasunorim/mlb-pitcher-arsenal-2020-2025

Dataset Overview

This dataset provides Statcast metrics aggregated per pitch type for each MLB pitcher across six seasons (2020-2025).

Basic Information

Period: 2020-2025 seasons (6 seasons)
Rows: 4,253 rows (pitcher × season combinations)
Columns: 111 columns
Format: Wide format (1 row = 1 pitcher × 1 season)
Filter: Only pitchers with 100+ pitches in a season
Quality Score: 10.0/10 on Kaggle

Data Source

Data is sourced from MLB Advanced Media (Statcast) via the pybaseball library and aggregated by pitcher × season × pitch type.

Data Structure

Identifier Columns (3 columns)

player_id: MLB player ID
player_name: Player name
season: Season year (2020-2025)
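After loading the file, a quick sanity check can confirm that it matches the layout described above (one row per pitcher × season, seasons within 2020-2025). The sketch below is a minimal example, not part of the dataset's own tooling; the helper name check_layout is hypothetical.

```python
import pandas as pd

ID_COLS = ["player_id", "player_name", "season"]

def check_layout(df: pd.DataFrame) -> dict:
    """Summarize whether a frame matches the documented wide layout."""
    missing = [col for col in ID_COLS if col not in df.columns]
    if missing:
        raise KeyError(f"missing identifier columns: {missing}")
    return {
        # True when no (player_id, season) pair appears more than once
        "one_row_per_pitcher_season": not df.duplicated(["player_id", "season"]).any(),
        # True when every season value lies in the documented range
        "seasons_in_range": bool(df["season"].between(2020, 2025).all()),
        # (rows, columns); the published file is described as 4,253 x 111
        "shape": df.shape,
    }
```

Calling check_layout(df) on the loaded CSV should report a shape of (4253, 111) if the file matches the description in this post.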

Pitch Metrics (18 pitch types × 6 metrics = 108 columns)

For each pitch type, the dataset includes 6 metrics:

usage_pct: Usage rate (0-100%)
avg_speed: Average velocity (mph)
avg_spin: Average spin rate (rpm)
whiff_rate: Whiff rate (0-1)
avg_pfx_x: Average horizontal movement (inches, gravity-adjusted)
avg_pfx_z: Average vertical movement (inches, gravity-adjusted)

Pitch Types (18 types)

FF (Four-seam), SI (Sinker), FC (Cutter), SL (Slider), CU (Curve), CH (Changeup), FS (Splitter), KC (Knuckle Curve), FO (Forkball), EP (Eephus), KN (Knuckleball), ST (Sweeper), SV (Slurve), and more.
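Because every metric column follows the pattern of pitch code, underscore, metric name (for example FF_usage_pct or SL_avg_speed), the wide table can be reshaped into one row per pitcher × season × pitch type when that is more convenient. The following is a minimal sketch under that naming assumption; the function name arsenal_to_long is hypothetical, and it assumes pitch types a pitcher did not throw are stored as missing values.

```python
import pandas as pd

ID_COLS = ["player_id", "player_name", "season"]

def arsenal_to_long(wide: pd.DataFrame) -> pd.DataFrame:
    """Reshape the wide table to one row per pitcher x season x pitch type."""
    melted = wide.melt(id_vars=ID_COLS, var_name="column", value_name="value")
    # Drop cells for pitch types the pitcher did not throw (assumed to be NaN).
    melted = melted.dropna(subset=["value"])
    # Split "FF_usage_pct" into pitch type "FF" and metric "usage_pct".
    parts = melted["column"].str.split("_", n=1, expand=True)
    melted["pitch_type"] = parts[0]
    melted["metric"] = parts[1]
    long = melted.pivot_table(
        index=ID_COLS + ["pitch_type"], columns="metric", values="value"
    ).reset_index()
    long.columns.name = None
    return long
```

The long form makes it easy to filter or group by pitch_type directly instead of selecting columns by prefix.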

Usage

Downloading the Data

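Download the CSV from Kaggle with the Kaggle API, then load it with pandas. The "!" prefix runs a shell command from a notebook cell; drop it when running in a terminal.

import pandas as pd

# Download and unzip via the Kaggle API (the download arrives as a zip archive)
!kaggle datasets download -d yasunorim/mlb-pitcher-arsenal-2020-2025 --unzip

# Load data
df = pd.read_csv('pitcher_arsenal_evolution_2020_2025.csv')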

Analysis Notebook

A comprehensive analysis notebook is also available:

https://www.kaggle.com/code/yasunorim/pitcher-arsenal-analysis

The notebook includes:

Individual pitcher arsenal trend analysis
MLB-wide pitch type trends (2020-2025)
Velocity change analysis
Correlation heatmaps

Use Cases

1. Pitcher Arsenal Trend Analysis

Track year-over-year changes in a specific pitcher’s repertoire:

# Yusei Kikuchi's pitch usage evolution (na=False skips rows with a missing name)
kikuchi = df[df['player_name'].str.contains('Kikuchi', case=False, na=False)].sort_values('season')
kikuchi.plot(x='season', y=['SL_usage_pct', 'FF_usage_pct', 'CH_usage_pct'], marker='o')

For Kikuchi, slider usage increased from ~20% in 2019 (a season that predates this dataset) to over 40% in 2022-2025, showing a significant shift in pitch mix strategy.
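To quantify a shift like this rather than read it off a plot, the year-over-year change in any metric can be computed per pitcher. This is a minimal sketch; the helper name yoy_change is hypothetical, and it leaves the change blank whenever the previous row is not the immediately preceding season (for example after a missed year).

```python
import pandas as pd

def yoy_change(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Year-over-year difference of one metric for each pitcher."""
    ordered = df.sort_values(["player_id", "season"])
    grouped = ordered.groupby("player_id")
    delta = grouped[column].diff()
    # Only keep differences between consecutive seasons.
    consecutive = grouped["season"].diff() == 1
    out = ordered[["player_id", "player_name", "season", column]].copy()
    out[f"{column}_yoy"] = delta.where(consecutive)
    return out
```

For example, yoy_change(df, 'SL_usage_pct') adds an SL_usage_pct_yoy column holding the change in slider usage, in percentage points, from the previous season.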

2. Injury Recovery Analysis

Analyze changes in pitch mix and velocity before and after injury:

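# Season-by-season fastball velocity and usage for one pitcher
# (123456 is a placeholder; substitute a real MLB player ID)
pitcher_data = df[df['player_id'] == 123456].sort_values('season')
pitcher_data[['season', 'FF_avg_speed', 'FF_usage_pct']]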
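A direct before/after comparison can be laid out as a small table. The sketch below assumes you already know which two seasons bracket the injury (the dataset itself contains no injury information); compare_seasons is a hypothetical helper name, and it raises a KeyError if the pitcher has no row for either season.

```python
import pandas as pd

def compare_seasons(df: pd.DataFrame, player_id: int, before: int, after: int,
                    columns: list) -> pd.DataFrame:
    """Side-by-side metrics for one pitcher in two seasons, plus the change."""
    rows = df[df["player_id"] == player_id].set_index("season")
    pair = rows.loc[[before, after], columns]
    table = pair.T  # one row per metric, one column per season
    table.columns = ["before", "after"]
    table["change"] = table["after"] - table["before"]
    return table
```

For instance, compare_seasons(df, 123456, 2022, 2024, ['FF_avg_speed', 'FF_usage_pct']) would show how four-seam velocity and usage differ between the two seasons for the placeholder ID used above.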

3. MLB-Wide Trend Analysis

Visualize league-wide pitch type trends:

# Calculate yearly average usage rates (unweighted mean across pitchers)
yearly_avg = df.groupby('season')[['FF_usage_pct', 'SI_usage_pct', 'SL_usage_pct', 'ST_usage_pct']].mean()
yearly_avg.plot(kind='line', marker='o')

Recent trends show increasing usage of sliders and sweepers across MLB.
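One caveat when averaging usage columns: pandas skips missing values in mean(), so if pitch types a pitcher never threw are stored as NaN (an assumption worth checking against the file), the plain average describes only the pitchers who throw that pitch. The sketch below, with the hypothetical helper league_usage, lets you choose whether missing usage counts as 0%.

```python
import pandas as pd

def league_usage(df: pd.DataFrame, pitch_types: list,
                 treat_missing_as_zero: bool = True) -> pd.DataFrame:
    """Mean usage rate per season for the given pitch codes."""
    cols = [f"{pitch}_usage_pct" for pitch in pitch_types]
    usage = df[["season"] + cols]
    if treat_missing_as_zero:
        # Count pitchers who do not throw a pitch as 0% usage.
        usage = usage.fillna({col: 0.0 for col in cols})
    return usage.groupby("season")[cols].mean()
```

Comparing league_usage(df, ['SL', 'ST']) with league_usage(df, ['SL', 'ST'], treat_missing_as_zero=False) separates how widely a pitch is adopted from how heavily its users rely on it.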

4. Machine Learning Features

Use arsenal metrics as features for pitcher performance prediction:

# Pitch diversity: number of pitch types a pitcher throws more than 5% of the time
usage_cols = [col for col in df.columns if col.endswith('_usage_pct')]
df['pitch_diversity'] = (df[usage_cols] > 5).sum(axis=1)
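A threshold count treats a 6% pitch the same as a 60% pitch. As an alternative feature, the Shannon entropy of the usage distribution captures how evenly a pitcher spreads usage across the arsenal. This is a sketch of one possible feature rather than anything shipped with the dataset; usage_entropy is a hypothetical name, and missing usage values are assumed to mean the pitch was not thrown.

```python
import numpy as np
import pandas as pd

def usage_entropy(df: pd.DataFrame) -> pd.Series:
    """Shannon entropy (natural log) of each row's pitch usage distribution."""
    usage_cols = [col for col in df.columns if col.endswith("_usage_pct")]
    shares = df[usage_cols].fillna(0.0).clip(lower=0.0)
    totals = shares.sum(axis=1)
    # Normalize so each row sums to 1; rows with no usage at all become NaN.
    probs = shares.div(totals.where(totals > 0), axis=0)
    # Replace zeros by 1 inside the log so that 0 * log(0) contributes 0.
    terms = probs * np.log(probs.where(probs > 0, 1.0))
    return -terms.sum(axis=1, min_count=1)
```

A pitcher who throws a single pitch type scores 0, and the value grows as usage is spread more evenly over more pitch types.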

Data Collection Method

The data collection notebook is also available for reproducing the dataset with updated seasons:

https://www.kaggle.com/datasets/yasunorim/mlb-pitcher-arsenal-2020-2025?select=pitcher_arsenal_evolution_2020_2025.ipynb

Collection Process

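from pybaseball import statcast_pitcher
import pandas as pd

# Example: Collect arsenal data for a pitcher
pitcher_id = 660271  # Shohei Ohtani
season = 2025

# Get pitch-by-pitch data
pitches = statcast_pitcher(
    start_dt=f"{season}-03-01",
    end_dt=f"{season}-11-30",
    player_id=pitcher_id
)

# Aggregate by pitch type (groupby drops rows with a missing pitch_type)
arsenal = pitches.groupby('pitch_type').agg(
    avg_speed=('release_speed', 'mean'),
    avg_spin=('release_spin_rate', 'mean'),
    avg_pfx_x=('pfx_x', 'mean'),
    avg_pfx_z=('pfx_z', 'mean'),
    pitch_count=('release_speed', 'size'),  # number of pitches of each type
)

# Usage rate per pitch type (0-100%)
arsenal['usage_pct'] = arsenal['pitch_count'] / arsenal['pitch_count'].sum() * 100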
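To get from a per-pitch-type table like this to the dataset's wide layout, each pitch type's metrics have to be spread into columns named by pitch code and metric. The sketch below shows one way to do that; it is an illustration rather than the code in the collection notebook, the name flatten_arsenal is hypothetical, and it assumes the table is indexed by pitch_type with columns already renamed to the dataset's metric names (usage_pct, avg_speed, and so on).

```python
import pandas as pd

def flatten_arsenal(arsenal: pd.DataFrame, player_id: int, player_name: str,
                    season: int) -> pd.DataFrame:
    """Turn a pitch_type-indexed metric table into a single wide row."""
    row = {"player_id": player_id, "player_name": player_name, "season": season}
    for pitch_type, metrics in arsenal.iterrows():
        for metric, value in metrics.items():
            # e.g. pitch type "FF" and metric "avg_speed" -> "FF_avg_speed"
            row[f"{pitch_type}_{metric}"] = value
    return pd.DataFrame([row])
```

Concatenating one such row per pitcher and season with pd.concat yields a table in the same shape as the published CSV, with NaN wherever a pitcher did not throw a given pitch type.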

Conclusion

I hope this dataset is useful for baseball analytics research and machine learning projects. If you have any questions or feedback, feel free to leave a comment or start a discussion on Kaggle!