MLB Pitcher Arsenal Evolution Dataset (2020-2025)

Comprehensive pitch arsenal metrics tracking the evolution of MLB pitchers’ repertoires from 2020 to 2025
baseball
mlb
kaggle
dataset
pitching
arsenal
Author

Yasunori Morishima

Published

February 9, 2026

Introduction

I’ve published a Kaggle dataset that tracks the year-over-year evolution of MLB pitchers’ arsenals from 2020 to 2025. This dataset enables analysis of how pitchers adjust their pitch mix, velocity, and spin rates over time.

Dataset Link: https://www.kaggle.com/datasets/yasunorim/mlb-pitcher-arsenal-2020-2025

Dataset Overview

This dataset provides per-pitch-type metrics, aggregated from pitch-by-pitch Statcast data, for MLB pitchers across six seasons (2020-2025).

Basic Information

  • Period: 2020-2025 seasons (6 seasons)
  • Rows: 4,253 rows (pitcher × season combinations)
  • Columns: 111 columns
  • Format: Wide format (1 row = 1 pitcher × 1 season)
  • Filter: Only pitchers with 100+ pitches in a season
  • Quality Score: 10.0/10 on Kaggle
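
Since the table is described as one row per pitcher × season, a quick structural check after loading can confirm that. The sketch below is only an illustration: the helper name check_arsenal_frame is hypothetical, and the small demo frame is a synthetic stand-in for the real CSV, not actual data.

```python
import pandas as pd

def check_arsenal_frame(df: pd.DataFrame) -> dict:
    """Summarize basic structural properties of the wide-format table."""
    key_cols = ["player_id", "season"]
    return {
        "n_rows": len(df),
        "n_cols": df.shape[1],
        # Should be 0 if each pitcher appears at most once per season
        "duplicate_keys": int(df.duplicated(subset=key_cols).sum()),
        "seasons": sorted(df["season"].unique().tolist()),
    }

# Synthetic stand-in for the real CSV (values are made up for illustration)
demo = pd.DataFrame({
    "player_id": [1, 1, 2],
    "player_name": ["A", "A", "B"],
    "season": [2020, 2021, 2020],
    "FF_usage_pct": [55.0, 50.0, 40.0],
})
summary = check_arsenal_frame(demo)
print(summary)
```

On the real file, the row and column counts would be expected to match the figures listed above, and duplicate_keys should be zero.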

Data Source

Data is sourced from MLB Advanced Media (Statcast) via the pybaseball library and aggregated by pitcher × season × pitch type.


Data Structure

Identifier Columns (3 columns)

  • player_id: MLB player ID
  • player_name: Player name
  • season: Season year (2020-2025)

Pitch Metrics (18 pitch types × 6 metrics = 108 columns)

For each pitch type, the dataset includes 6 metrics:

  1. usage_pct: Usage rate (0-100%)
  2. avg_speed: Average velocity (mph)
  3. avg_spin: Average spin rate (rpm)
  4. whiff_rate: Whiff rate (0-1)
  5. avg_pfx_x: Average horizontal movement (inches, gravity-adjusted)
  6. avg_pfx_z: Average vertical movement (inches, gravity-adjusted)

Pitch Types (18 types)

FF (Four-seam), SI (Sinker), FC (Cutter), SL (Slider), CU (Curve), CH (Changeup), FS (Splitter), KC (Knuckle Curve), FO (Forkball), EP (Eephus), KN (Knuckleball), ST (Sweeper), SV (Slurve), and others (13 of the 18 types are listed here).
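
Wide format is convenient for per-pitcher lookups, but many plotting and modeling tools prefer one row per pitcher × season × pitch type. The sketch below reshapes the table under the assumption that every metric column is named as pitch code, underscore, metric name (as in SL_usage_pct or FF_avg_speed). The helper name wide_to_long and the demo values are hypothetical.

```python
import pandas as pd

ID_COLS = ["player_id", "player_name", "season"]

def wide_to_long(df: pd.DataFrame) -> pd.DataFrame:
    """Reshape to one row per pitcher x season x pitch type."""
    long = df.melt(id_vars=ID_COLS, var_name="column", value_name="value")
    # Split "FF_usage_pct" into pitch code "FF" and metric "usage_pct"
    parts = long["column"].str.split("_", n=1, expand=True)
    long["pitch_type"] = parts[0]
    long["metric"] = parts[1]
    # Pitch types whose metrics are all missing drop out of the result
    tidy = long.pivot_table(
        index=ID_COLS + ["pitch_type"], columns="metric", values="value"
    ).reset_index()
    tidy.columns.name = None
    return tidy

# Synthetic example row: CH is never thrown, so its metrics are missing
demo = pd.DataFrame({
    "player_id": [1], "player_name": ["A"], "season": [2020],
    "FF_usage_pct": [60.0], "FF_avg_speed": [95.0],
    "SL_usage_pct": [40.0], "SL_avg_speed": [85.0],
    "CH_usage_pct": [float("nan")], "CH_avg_speed": [float("nan")],
})
tidy = wide_to_long(demo)
print(tidy)
```

This relies on unthrown pitch types being stored as missing values; if the real file stores them as zeros instead, a filter on usage_pct > 0 would be needed.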


Usage

Downloading the Data

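Download the CSV from Kaggle using the Kaggle API (the leading ! runs a shell command from a notebook cell):

import pandas as pd

# Download and unzip via the Kaggle CLI (requires Kaggle API credentials)
!kaggle datasets download -d yasunorim/mlb-pitcher-arsenal-2020-2025 --unzip

# Load data
df = pd.read_csv('pitcher_arsenal_evolution_2020_2025.csv')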

Analysis Notebook

A comprehensive analysis notebook is also available:

https://www.kaggle.com/code/yasunorim/pitcher-arsenal-analysis

The notebook includes:

  • Individual pitcher arsenal trend analysis
  • MLB-wide pitch type trends (2020-2025)
  • Velocity change analysis
  • Correlation heatmaps

Use Cases

1. Pitcher Arsenal Trend Analysis

Track year-over-year changes in a specific pitcher’s repertoire:

# Yusei Kikuchi's pitch usage evolution (sorted so the x-axis runs in season order)
kikuchi = df[df['player_name'].str.contains('Kikuchi', case=False, na=False)].sort_values('season')
kikuchi.plot(x='season', y=['SL_usage_pct', 'FF_usage_pct', 'CH_usage_pct'], marker='o')

For Kikuchi, slider usage rose from roughly 20% in 2019 (a season that predates this dataset, which starts in 2020) to over 40% in 2022-2025, a substantial shift in pitch mix.

2. Injury Recovery Analysis

Analyze changes in pitch mix and velocity before and after injury:

# Select all seasons for one pitcher (123456 is a placeholder; substitute a real MLB player ID)
pitcher_data = df[df['player_id'] == 123456].sort_values('season')
pitcher_data[['season', 'FF_avg_speed', 'FF_usage_pct']]
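
To turn the before/after comparison into numbers, season-over-season differences can be computed directly. This is a minimal sketch: the helper name season_deltas, the player ID, and the demo values are all hypothetical.

```python
import pandas as pd

def season_deltas(df: pd.DataFrame, player_id: int, cols: list) -> pd.DataFrame:
    """Change in each metric relative to the pitcher's previous listed season."""
    one = (
        df.loc[df["player_id"] == player_id]
        .sort_values("season")
        .set_index("season")
    )
    # diff() compares each row with the row before it; the first season has no baseline
    return one[cols].diff().dropna(how="all")

# Synthetic example (made-up values, hypothetical player ID)
demo = pd.DataFrame({
    "player_id": [1, 1, 1],
    "season": [2021, 2022, 2023],
    "FF_avg_speed": [95.0, 93.5, 94.5],
    "FF_usage_pct": [50.0, 45.0, 48.0],
})
deltas = season_deltas(demo, player_id=1, cols=["FF_avg_speed", "FF_usage_pct"])
print(deltas)
```

If a pitcher misses a season entirely (for example, falling below the 100-pitch filter while injured), the difference spans the gap, so it is worth checking which seasons are actually present before interpreting a delta.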

3. MLB-Wide Trend Analysis

Visualize league-wide pitch type trends:

# Calculate yearly average usage rates
yearly_avg = df.groupby('season')[['FF_usage_pct', 'SI_usage_pct', 'SL_usage_pct']].mean()
yearly_avg.plot(kind='line', marker='o')

Recent trends show increasing usage of sliders and sweepers across MLB.
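
One caveat when averaging usage columns: pandas skips missing values in mean(), so if pitch types a pitcher never throws are stored as missing (an assumption about the file, not something confirmed here), the plain average describes only pitchers who throw that pitch. The sketch below contrasts the two readings; the helper name league_usage and the demo values are hypothetical.

```python
import pandas as pd

def league_usage(df: pd.DataFrame, cols: list, treat_missing_as_zero: bool = True) -> pd.DataFrame:
    """Average usage per season, optionally counting unthrown pitches as 0%."""
    data = df[["season"] + cols].copy()
    if treat_missing_as_zero:
        data[cols] = data[cols].fillna(0.0)
    return data.groupby("season")[cols].mean()

# Synthetic example: two of four pitchers throw a sweeper at 20% usage
demo = pd.DataFrame({
    "season": [2024, 2024, 2024, 2024],
    "ST_usage_pct": [20.0, 20.0, None, None],
})
across_all = league_usage(demo, ["ST_usage_pct"], treat_missing_as_zero=True)
among_users = league_usage(demo, ["ST_usage_pct"], treat_missing_as_zero=False)
print(across_all, among_users, sep="\n")
```

Both numbers can be useful: the first tracks how common a pitch is across all qualifying pitchers, the second how heavily its users lean on it. Note that either way each pitcher counts equally regardless of how many pitches he threw.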

4. Machine Learning Features

Use arsenal metrics as features for pitcher performance prediction:

# Pitch diversity: number of pitch types thrown more than 5% of the time
df['pitch_diversity'] = (df[[col for col in df.columns if col.endswith('_usage_pct')]] > 5).sum(axis=1)
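
A count above a threshold treats a 6% pitch and a 50% pitch the same. As an alternative feature, the Shannon entropy of the pitch mix also reflects how evenly the pitches are used. This is a sketch of one possible feature, not part of the dataset; the function name pitch_mix_entropy and the demo values are hypothetical.

```python
import numpy as np
import pandas as pd

def pitch_mix_entropy(df: pd.DataFrame) -> pd.Series:
    """Shannon entropy (in bits) of each row's pitch usage distribution."""
    usage_cols = [c for c in df.columns if c.endswith("_usage_pct")]
    shares = df[usage_cols].fillna(0.0).div(100.0)
    # Renormalize so each row sums to 1 even if usage does not total exactly 100
    total = shares.sum(axis=1)
    shares = shares.div(total.where(total > 0), axis=0)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = shares * np.log2(shares)
    # 0 * log2(0) is treated as 0, the usual convention for entropy
    return -terms.fillna(0.0).sum(axis=1)

# Synthetic example: an even two-pitch mix versus a single-pitch pitcher
demo = pd.DataFrame({
    "FF_usage_pct": [50.0, 100.0],
    "SL_usage_pct": [50.0, None],
})
entropy = pitch_mix_entropy(demo)
print(entropy)
```

An even two-pitch mix gives 1 bit and a single-pitch arsenal gives 0; larger values indicate a more varied, more balanced repertoire.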

Data Collection Method

The data collection notebook is also available for reproducing the dataset with updated seasons:

https://www.kaggle.com/datasets/yasunorim/mlb-pitcher-arsenal-2020-2025?select=pitcher_arsenal_evolution_2020_2025.ipynb

Collection Process

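from pybaseball import statcast_pitcher
import pandas as pd

# Example: Collect arsenal data for a pitcher
pitcher_id = 660271  # Shohei Ohtani
season = 2025

# Get pitch-by-pitch data
pitches = statcast_pitcher(
    start_dt=f"{season}-03-01",
    end_dt=f"{season}-11-30",
    player_id=pitcher_id
)

# Aggregate by pitch type (named aggregation keeps the output columns unambiguous)
arsenal = pitches.groupby('pitch_type').agg(
    avg_speed=('release_speed', 'mean'),
    avg_spin=('release_spin_rate', 'mean'),
    avg_pfx_x=('pfx_x', 'mean'),
    avg_pfx_z=('pfx_z', 'mean'),
    n_pitches=('release_speed', 'size')
)

# Usage rate (0-100%) from pitch counts
arsenal['usage_pct'] = 100 * arsenal['n_pitches'] / arsenal['n_pitches'].sum()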
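
The aggregation above yields one row per pitch type, while the published table has one row per pitcher × season. One way to bridge the two is to flatten the per-pitch-type table into a single wide row. The sketch below assumes the pitch-code-underscore-metric column naming seen in the usage examples; the helper name arsenal_to_wide_row and the demo values are hypothetical, and the author's notebook may do this differently.

```python
import pandas as pd

def arsenal_to_wide_row(arsenal: pd.DataFrame, player_id: int, season: int) -> pd.DataFrame:
    """Flatten a per-pitch-type table (indexed by pitch_type) into one wide row."""
    row = {"player_id": player_id, "season": season}
    for pitch_type, metrics in arsenal.iterrows():
        for metric, value in metrics.items():
            # e.g. pitch type "SL" and metric "usage_pct" become "SL_usage_pct"
            row[f"{pitch_type}_{metric}"] = value
    return pd.DataFrame([row])

# Synthetic per-pitch-type table (made-up values, hypothetical player ID)
demo_arsenal = pd.DataFrame(
    {"avg_speed": [95.0, 85.0], "usage_pct": [60.0, 40.0]},
    index=pd.Index(["FF", "SL"], name="pitch_type"),
)
wide = arsenal_to_wide_row(demo_arsenal, player_id=1, season=2025)
print(wide.columns.tolist())
```

Rows built this way for each pitcher and season can be combined with pd.concat; pitch types a pitcher never throws then show up as missing values in the combined table.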

Conclusion

I hope this dataset is useful for baseball analytics research and machine learning projects. If you have any questions or feedback, feel free to leave a comment or start a discussion on Kaggle!