
Statsbomb Sports Data Collection

Real-time sports data collection achieved 10x growth through architecture-as-data: domain logic as configuration, UI workflows as state machines, and claims-based metadata. What we built worked: we expanded to new sports without rewriting code. The lesson came from what took longer: distributed team ownership arrived two years late, which meant racing the clock when external constraints intervened.

Tags: Real-Time Data Collection · Collaborative Workflows · Contextual UI Adaptation · XState · Kafka · ANTLR

Timeline: 2018-2022 • Principal engineer championing initiatives, Cairo team of 20


TL;DR: What We Built

Real-time sports data collection system that scaled from 100 to 1000+ collectors while reducing collection time by 75% (16 hours → 4 hours). Achieved through architecture-as-data: domain logic as configuration (DSLs), UI workflows as state machines, and claims-based metadata resolution. Expanded from soccer to American football with minimal code changes.

Key Results: 99% error prevention, near real-time latency (~20s), 10x collector scale without proportional staffing.

Core Lesson: Team building multiplies architectural leverage—waiting until year three to expand beyond two-person partnership meant racing external constraints before completing customer-facing features.

Problem: Manual Collection Doesn’t Scale

Week one. Ali and I sat down with paper. We hand-wrote sequencing rules for soccer matches: which events could follow which, what counted as legal, how phases transitioned. It was painful. But necessary.

The industry-standard tool—Dartfish—couldn’t scale to what we needed. Button-heavy interface forced mouse clicks for every selection. It had keyboard shortcuts, but they weren’t contextual. Each key could only map to one event type—globally, across the entire data spec. No overlap possible. With 50+ event types but only 40 usable keys, we hit a ceiling. Collectors memorized which arbitrary key triggered which event, at all times, regardless of game context. Cognitive overload.

Two collectors per match, each following a team, manually entering 3000+ events over 16 hours. Duplicated effort. No prevention, only validation after mistakes happened.

Years of daily conversations with collectors revealed their actual workflows and friction points. They needed keyboard flows that built muscle memory. Contextual mappings that reduced cognitive load. Collaboration patterns that eliminated rework.

The dataspec itself needed to be data, not code, so product managers could express rules in one place that drove the entire system.

Event storming sessions mapped the domain. Five bounded contexts emerged: match metadata, event collection, people coordination, media management, contextual aggregation. One insight stood out: arbitrary aggregation from atomic facts. Individual passes and dribbles were atomic events. Multiple dribbles by the same player became a “carry”—a player-level durational fact. Team possession spanned all team durational facts (carry, foul-won, opponent-out) into a higher-level aggregation. Turnovers derived from possession changes. The system needed to support any aggregation pattern, not just what we knew upfront.
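The carry derivation described above can be sketched in a few lines. This is a toy illustration, not the production schema; the event shape and function name are hypothetical:

```javascript
// Hypothetical atomic event shape: { type, player, team, t }
// Collapse consecutive dribbles by the same player into a "carry" durational fact.
function deriveCarries(events) {
  const carries = [];
  let current = null; // the carry currently being extended, if any
  for (const e of events) {
    if (e.type === 'dribble' && current && current.player === e.player) {
      current.end = e.t; // same player keeps dribbling: extend the open carry
    } else if (e.type === 'dribble') {
      current = { type: 'carry', player: e.player, team: e.team, start: e.t, end: e.t };
      carries.push(current);
    } else {
      current = null; // any other event closes the carry
    }
  }
  return carries;
}
```

Higher layers (possession, turnover) then consume these durational facts the same way this function consumes atomic dribbles, which is what made arbitrary aggregation possible.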

Match Metadata was different. Players, clubs, referees, stadiums—reference data that thousands of collectors needed but only 5 people managed. Crowd-sourced data meant conflicting information: same player, different spellings; same club, multiple sources. The system required automated golden entity resolution—5 people couldn’t manually reconcile metadata for thousands of collectors across hundreds of matches weekly.

Product managers couldn’t define new collection requirements without engineering bottlenecks. Collectors waited weeks for tooling updates. Operations needed to scale from 100 collectors to thousands without proportional support staff.

Why This Mattered to Statsbomb’s Business

Statsbomb provided granular sports analytics to professional clubs, broadcasters, and betting operators—markets where data velocity creates competitive advantage. Teams analyzing opponent patterns hours after matches finished fell behind. Broadcasters needed live insights for real-time commentary. Betting markets priced during play, not after final whistle.

The 75% efficiency gain wasn’t just operational—it unlocked new revenue streams. Real-time collection enabled incremental analysis during live matches—broadcasters could comment on possession patterns as they emerged, teams could track opponent tactics in-match. Multi-sport expansion (soccer to American football) without proportional engineering cost meant entering adjacent markets with existing infrastructure. 10x operational scale without linear staffing growth preserved margins while growing coverage.

The architectural separation between rules and execution created a strategic moat: product managers could customize data specifications per client without engineering rewrites. A Premier League club wanting detailed positioning data and a Championship club needing only basic events both ran on the same system—different configurations, same codebase.

Architecture: Separation as First Principle

Architecture emerges when you chase why/where users struggle, not what features they request.

Domain Logic as Configuration

We separated three concerns: entry validation (dataspec), sequencing logic (DSL), and aggregation rules (grouping DSL). Product managers wrote rules in readable syntax. The system compiled to execution logic.

Example: Soccer Possession Rules
# Topological Dependency Layers:
# Layer 1: complete-pass, shot-complete, carry, counter-attack-start (atomic derivatives)
# Layer 2: possession, tiki-taka (depend on carry)
# Layer 3: turnover (depends on shot-complete, possession)

# Rule 1: Successful pass completion (atomic event derivative - Layer 1)
# When: Player A passes, Player B receives successfully
# Derive: complete-pass metric with distance, angle, pressure
Derive event complete-pass from events sequence:
  - pass team mine
    then reception field outcome is complete team mine

# Rule 2: Shot on target (atomic event derivative - Layer 1)
# When: Shot followed by goalkeeper save or goal
# Derive: shot-complete (distinguishes blocked/off-target from on-target)
Derive event shot-complete from events sequence:
  - shot team mine
    then goal-keeper field outcome is won team opponent OR goal team mine

# Rule 3: Player carry (player-level durational - Layer 1)
# When: Same player performs multiple consecutive dribbles
# Derive: carry (player-level durational fact)
Derive event carry from events sequence:
  - reception player X
    then dribble player X +  # One or more dribbles by same player

# Rule 4: Defensive recovery counter-attack (simple derived - Layer 1)
# When: Team intercepts, then successfully passes
# Derive: counter-attack-start (high-value transition)
Derive event counter-attack-start from events sequence:
  - interception field outcome is won team mine
    then pass team mine

# Rule 5: Team possession (team-level aggregation - Layer 2, depends on Rule 3)
# When: Team has active carry, foul-won, or ball out by opponent
# Derive: possession (spans all team durational facts)
Derive event possession from events sequence:
  - carry team mine +
    ~ reception team mine
    ~ foul-committed team opponent
    ~ out team opponent # ~ Means carries or more including team passes and play restarts

# Rule 6: Tiki-taka analysis (complex sequence - Layer 2, depends on Rule 3)
# When: Team completes 3+ consecutive passes without loss
# Derive: tiki-taka
Derive event tiki-taka from events sequence:
  - pass team mine +
    ~ carry team mine # ~ Means passes or more including carries between passes

# Rule 7: Turnover (higher-level derivative - Layer 3, depends on Rules 2 & 5)
# When: Possession shifts from one team to opponent OR opponent shoots on target
# Derive: turnover metric
Derive event turnover from events sequence:
  - possession team mine
    then possession team opponent* # zero or more possessions by the other team
    then shot-complete team opponent

Example: American Football Drive Segmentation
# Topological Dependency Layers:
# Layer 1: snap-complete, red-zone-conversion, third-down-success, drive-stall (atomic derivatives)
# Layer 2: offensive-drive (depends on snap-complete)
# Layer 3: scoring-drive (depends on offensive-drive)

# Rule 1: Successful snap play (atomic event derivative - Layer 1)
# When: Snap followed by successful gain (run or pass completion)
# Derive: snap-complete
Derive event snap-complete from events sequence:
  - snap team mine
    then run field outcome is complete team mine

# Rule 2: Red zone conversion (atomic derivative - Layer 1)
# When: Team enters red zone (20-yard line) and scores touchdown
# Derive: red-zone-conversion
Derive event red-zone-conversion from events sequence:
  - snap field yard-line is less-than 20 team mine
    then touchdown team mine

# Rule 3: Third-down success (atomic derivative - Layer 1)
# When: Successful play on 3rd down maintains possession (first down reset)
# Derive: third-down-success
Derive event third-down-success from events sequence:
  - snap field down is 3 team mine
    then snap field down is 1 team mine # Reset to first down

# Rule 4: Drive stall (atomic derivative - Layer 1)
# When: Team fails to convert on 4th down or punts
# Derive: drive-stall
Derive event drive-stall from events sequence:
  - snap team mine +
    then turnover-on-downs team mine OR punt team mine

# Rule 5: Offensive drive (team-level aggregation - Layer 2, depends on Rule 1)
# When: Team gains possession and runs offensive plays
# Derive: offensive-drive (spans all plays until scoring/turnover)
Derive event offensive-drive from events sequence:
  - kickoff-return field outcome is complete team mine
    ~ punt-return field outcome is complete team mine
    ~ turnover-recovery team mine
    then snap team mine +
    ~ snap-complete team mine # ~ Means snaps or more including completions
    until touchdown team mine OR field-goal team mine OR turnover team opponent

# Rule 6: Scoring drive (higher-level derivative - Layer 3, depends on Rule 5)
# When: Offensive drive ends with touchdown or field goal
# Derive: scoring-drive
Derive event scoring-drive from events sequence:
  - offensive-drive team mine
    then touchdown team mine OR field-goal field outcome is good team mine

When we expanded to American football, product managers wrote new dataspecs and grouping rules.
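To make the compilation step concrete, here is a toy interpreter for the simplest rule form. The production DSL was parsed with ANTLR and supported `+`, `~`, `OR`, and `until`; this sketch handles only strict `then` chains, and the predicates are hand-written stand-ins:

```javascript
// Toy interpreter for a "Derive event X from events sequence" rule.
// Each step is a predicate; "then" means strictly consecutive events.
function deriveFromSequence(name, steps, events) {
  const derived = [];
  for (let i = 0; i + steps.length <= events.length; i++) {
    if (steps.every((matches, j) => matches(events[i + j]))) {
      derived.push({ type: name, start: events[i].t, end: events[i + steps.length - 1].t });
    }
  }
  return derived;
}

// Rule 1 from the soccer example: pass by my team, then a complete reception by my team
const completePass = [
  (e) => e.type === 'pass' && e.team === 'mine',
  (e) => e.type === 'reception' && e.outcome === 'complete' && e.team === 'mine',
];
```

The point of the real compiler was exactly this separation: product managers edited rule text, and only the interpreter knew how to scan event streams.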

Visualizing the Data Flow

Domain experts don’t think in code—they think in sequences. We used visual timelines to translate their mental models into technical architecture. This diagram shows how atomic events (Layer 0) flow through time, and higher-level facts derive from event patterns.

The functional pipeline splits into 3 paths, then recombines. Path 1: map filters events to one team. Path 2: reduce derives possession phases from team events. Path 3: reduce aggregates match statistics. The overlap function then stacks all 3 layers (events + phases + stats) onto a single timeline—like layers in Photoshop. Same temporal axis, multiple analytical dimensions visible simultaneously.
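The three-path pipeline can be sketched as plain functional composition. Function names and the phase-merging heuristic here are illustrative, not the production API:

```javascript
// Path 1: map — filter the raw stream down to one team's events
const forTeam = (team) => (events) => events.filter((e) => e.team === team);

// Path 2: reduce — fold team events into phases (events within `gap` seconds merge)
function phases(teamEvents, gap = 10) {
  return teamEvents.reduce((ps, e) => {
    const last = ps[ps.length - 1];
    if (last && e.t - last.end <= gap) last.end = e.t;
    else ps.push({ start: e.t, end: e.t });
    return ps;
  }, []);
}

// Path 3: reduce — aggregate simple match statistics by event type
const stats = (teamEvents) =>
  teamEvents.reduce((acc, e) => ({ ...acc, [e.type]: (acc[e.type] || 0) + 1 }), {});

// Overlap: stack all three layers onto one timeline object (same temporal axis)
const overlap = (events, phaseList, statTotals) =>
  ({ events, phases: phaseList, stats: statTotals });
```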

The Live-Collection-App: UX as First Principle

The first thing we built wasn’t backend architecture. It was the tool collectors would use every day—an Electron desktop app that replaced Dartfish’s rigid forms with keyboard-first workflows.

Real-Time Collaboration

GraphQL subscriptions via WebSocket enabled Google Docs-style collaboration. Multiple collectors worked concurrently—one collector led by publishing the base event name, others contributed their assigned data pieces (player positions, freeze frames, detailed attributes). Role separation by design meant no conflicts were possible. Freeze frame positioning and manual location data coexisted as valid refinements at different granularities.

Concurrent collection follows a pub-sub pattern: the lead collector establishes the base event identity, and specialized collectors subscribe and contribute domain-specific data. Role separation prevents conflicts: each collector owns distinct data fields, with no overlapping writes.
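A minimal sketch of that pattern, with the transport stripped away (the real system used GraphQL subscriptions over WebSocket; the channel API and field names below are hypothetical):

```javascript
// Role-separated concurrent collection: lead publishes the base event,
// specialists patch disjoint slices keyed by their role.
function createEventChannel() {
  const subscribers = [];
  let event = null;
  return {
    publishBase(base) {
      event = { ...base, partials: {} };
      subscribers.forEach((fn) => fn(event)); // fan out to subscribed collectors
    },
    contribute(role, data) {
      if (event.partials[role] !== undefined) {
        throw new Error(`role ${role} already wrote`); // one owner per slice
      }
      event.partials[role] = data;
    },
    subscribe(fn) { subscribers.push(fn); },
    snapshot() { return event; },
  };
}
```

Because each role writes a distinct key, there is no merge logic and no conflict resolution to get wrong.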

State Machines for Impossible States

We used XState to model collection workflows as explicit finite state machines. Traditional event handlers create implicit state explosions (if (editing && !watching && hasBase)). State machines make all transitions explicit and testable.

// Main collection workflow (simplified): parallel regions model independent concerns
import { createMachine } from 'xstate';

const mainRoomMachine = createMachine({
  id: 'main-room',
  type: 'parallel',
  states: {
    videoMode: { initial: 'manual', states: { manual: {}, loop: {}, editing: {} } },
    collection: { initial: 'watching', states: { watching: {}, collectingBase: {}, addingPartials: {} } },
    freezeFrame: { initial: 'idle', states: { idle: {}, active: {}, reviewing: {} } }
  }
});

Five major state machines orchestrated the app: main room, event entry, player customization, freeze frame queue, and batch validation. This prevented illegal states by design—you couldn’t enter a freeze frame without an active event, couldn’t submit without required fields.
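The "illegal by design" idea can be shown without the library. This is a sketch, not the real statechart; the transition table and guard are illustrative:

```javascript
// Legal transitions per region; anything else throws instead of silently corrupting state.
const legal = {
  freezeFrame: {
    idle: ['active'],
    active: ['reviewing', 'idle'],
    reviewing: ['idle'],
  },
};

function transition(region, from, to, context) {
  // Guard: entering a freeze frame requires an active base event
  if (region === 'freezeFrame' && to === 'active' && !context.activeEvent) {
    throw new Error('cannot enter freeze frame without an active event');
  }
  if (!legal[region][from].includes(to)) {
    throw new Error(`illegal transition ${from} -> ${to}`);
  }
  return to;
}
```

In XState these are ordinary guarded transitions; the benefit is that the full set of reachable states is enumerable and testable.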

30+ Keyboard Shortcuts: Muscle Memory Over Mouse Clicks

We used Mousetrap with module scoping. Same key, different action per context:

// Context-aware keyboard mappings
const keyboardShortcuts = {
  'main-room': {
    't': 'Create new event (pass, shot, tackle)',
    'p': 'Add player to event',
    '/': 'Toggle video loop mode',
    '←→': 'Rewind/forward 5 seconds'
  },
  'freeze-frame': {
    't': 'Toggle team view (home/away)',
    '1-9': 'Select player by jersey number',
    'Enter': 'Confirm positions'
  }
};

// Same key 't', different meaning per context.
// bindShortcuts is our helper around Mousetrap.bind: it activates a map
// only while its module has focus, so bindings never collide globally.
bindShortcuts('main-room', keyboardShortcuts['main-room']);
bindShortcuts('freeze-frame', keyboardShortcuts['freeze-frame']);

Collectors touch-typed events like Vim users touch-type code. Video control (arrow keys), event creation (letter keys), navigation (/, Z), editing (Shift+E), freeze frames (number keys). Hands stayed on keyboard. Eyes stayed on video.

Contextual UI Adaptation

The interface adapted based on game state and previous decisions. When a collector selected “Pass,” the system showed only valid pass types for that player’s position. Auto-filled mandatory fields with single valid options. Skipped irrelevant choices entirely.

// After DSL sequence rules run and determine valid next options
// If only one option remains valid, auto-fill and skip the question
// Example: "Pass" event followed by rules determining only "Reception" is valid → auto-fill
if (hasOnlyOneValidOption(currentField, validOptions)) {
  return progressToNextField(allFields, currentField, validOptions[0]);
}

Cognitive load minimization: collectors never saw irrelevant options.

Video Loop Mode

Press / to toggle between manual playback and auto-replay. In loop mode, the system repeated a 5-second window around the active event. No manual rewinding. Collectors filled fields while watching the replay, then pressed / to move forward.
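The window arithmetic is trivial but worth pinning down (a sketch; parameter names are hypothetical):

```javascript
// Loop mode: auto-replay a 5-second window centred on the active event (times in seconds).
function loopWindow(eventTime, span = 5, videoLength = Number.MAX_SAFE_INTEGER) {
  const start = Math.max(0, eventTime - span / 2); // clamp at the start of the video
  return { start, end: Math.min(videoLength, start + span) };
}
```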

Computer Vision Assistance

Freeze frames (positioning data for all 22 players) were semi-automated. Collectors triggered frame extraction, CV service detected players and ball, system overlaid predictions on pitch canvas, collectors corrected misidentifications. Hybrid approach: 80-90% automation, human correction for edge cases.

This tool became the foundation. The dataspec drove UI validation, the DSL defined legal event sequences, the state machines enforced correctness. Concurrent collection during live matches with sub-minute latency. The backend scaled to support what the UX tool showed collectors needed.

Backend Evolution: From Batch to Real-Time

Breaking matches down by decision rather than team increased correctness without additional effort. Computer vision assisted input, contextual keyboard mappings reduced cognitive load, and automated linting caught preventable errors. Collectors focused on judgment over correction—handling edge cases where human expertise mattered rather than catching mistakes.

Beyond Simple Sequences: Event Dependency Graphs

Events don’t just flow sequentially—they form directed acyclic graphs. A foul-committed event might depend on a previous clearance, which itself connects to atomic pass events. The system modeled both temporal relationships (PREV/NEXT) and logical dependencies (DEPENDS_ON).

Solid lines show temporal flow (PREV/NEXT: what happened before and after). Dashed lines show phase validation rules (which transitions are legal). Only Layer 0 atomic events drive phase transitions. The state machine prevents invalid sequences: you can't pass during a "loose" phase or recover during a "possession" phase. Derived events like "carry" are computed separately from sequences of Layer 0 events.
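The two edge types can be modeled directly on event nodes. A sketch with hypothetical shapes (the production graph lived in the backend, not in-memory like this):

```javascript
// Events carry temporal links (PREV/NEXT from stream order) plus logical
// DEPENDS_ON edges supplied separately as [from, to] pairs.
function buildGraph(events, dependencies) {
  const nodes = new Map(
    events.map((e) => [e.id, { ...e, prev: null, next: null, dependsOn: [] }])
  );
  events.forEach((e, i) => {
    const node = nodes.get(e.id);
    node.prev = i > 0 ? events[i - 1].id : null;
    node.next = i < events.length - 1 ? events[i + 1].id : null;
  });
  for (const [from, to] of dependencies) nodes.get(from).dependsOn.push(to);
  return nodes;
}
```

Keeping the two relationship kinds distinct is what lets temporal queries ("what came before this foul?") and dependency queries ("what does this foul's derivation rest on?") stay independent.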

Match Metadata: Temporal Stages with Automated Entity Resolution

The metadata challenge required different architectural thinking. Match metadata had temporal stages that preceded live collection:

Stage 1: Competition Schedules (Days/Weeks Before Kickoff)

We organized competition schedules daily and observed the web for updates—fixtures announced, kickoff times changed, venues moved. Web scrapers monitored official sources. Any schedule changes propagated automatically.

Stage 2: Matchday Lineups (Hours Before Kickoff)

Ahead of kickoff, we scraped matchday lineups and match details—starting XI, formations, referees, stadium conditions. This is where most new entities entered the system.

Stage 3: Automatic Entity Streaming

Whenever scrapers or collectors encountered a new entity—player spelling variant, unknown referee, club name mismatch—the system automatically streamed it to the info resolution team (5 people managing metadata for thousands of collectors). No manual data entry. No waiting for batch imports.

With thousands of collectors crowd-sourcing data across multiple sources, conflicts were inevitable: the same player under different spellings, the same club from multiple sources. Five people couldn't manually reconcile every conflict, so the system needed to resolve golden entities automatically.

Rather than key-value pairs, metadata became claims from actors with provenance (who), confidence (reliability), and temporality (when). Each claim carried its source context. When multiple sources asserted conflicting facts, the graph structure made conflicts explicit and queryable.

The 5-person metadata team’s workflow: query for pending conflicts → review side-by-side with full context → resolve once → system cascades to all dependent data.
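A sketch of what "pending conflicts" means in a claims model. The claim shape and function names are illustrative, not the production data model:

```javascript
// A claim: { entityId, field, value, source, confidence, observedAt }
// Group claims by (entity, field); any group asserting more than one value is a conflict.
function pendingConflicts(claims) {
  const byKey = new Map();
  for (const c of claims) {
    const key = `${c.entityId}:${c.field}`;
    if (!byKey.has(key)) byKey.set(key, []);
    byKey.get(key).push(c);
  }
  const conflicts = [];
  for (const [key, group] of byKey) {
    if (new Set(group.map((c) => c.value)).size > 1) conflicts.push({ key, claims: group });
  }
  return conflicts;
}

// Resolution records a golden value; dependents re-read the golden record (cascade not shown).
function resolve(conflict, chosenValue) {
  return { key: conflict.key, golden: chosenValue, resolvedAt: Date.now() };
}
```

Because every claim keeps its provenance and confidence, the review queue can show the metadata team exactly which sources disagree and how reliable each one has been.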

Claims from data sources (amber) flow into the system. Conflicts (blue) route to a queue. The metadata team (emerald) resolves ambiguity. Golden entities cascade updates to dependent systems automatically.

The system prevented duplicate entities by design, caught conflicts automatically, and escalated true ambiguity. The 5-person team focused on the 1-2% of ambiguous cases requiring domain expertise (is “FC Barcelona” the same as “Barcelona FC”? yes; but “Manchester United” vs “Manchester City”? no).

Impact: From Bottleneck to Multiplier

Design for the 80% common case; production iteration reveals the 20% edge cases.

Quantitative Results

Real-time collection: Concurrent collection during live matches replaced post-match sequential processing.

Minimal engineering bottleneck: When we expanded to American football, product managers wrote new dataspecs and grouping rules with minimal code changes.

Automated validation: Linting and contextual validation caught errors automatically. Collectors focused on judgment calls where human expertise mattered—event type conflicts, ambiguous data points—rather than catching preventable mistakes.

Non-linear leverage: Operations scaled 10x without proportional staffing increases.

The system scaled horizontally. Product managers shipped features.

Qualitative Changes

Collector experience: Keyboard flows built muscle memory. Contextual mappings reduced cognitive load. Computer vision assisted input.

Client customization: Different granularity requirements—basic event tracking or advanced positioning data—handled through configuration.

Metadata team stayed small: The 5-person team queried pending conflicts, reviewed side-by-side, resolved once, and the system cascaded updates.

The Team Building Challenge: Learning to Lead

The architecture worked. But building it was only half the story.

Year One: Finding the Core Partnership

Week one: Ali and I hand-wrote sequencing rules. Early months: Adham joined as co-architect. The foundational architecture—data as configuration, state machines, DSLs—emerged from our collaborative exploration. Adham pushed me to chase ideas I wasn’t confident enough to pursue alone.

Month one backend: a single Go endpoint called sync—batch, offline. We started building the live-collection-app with a junior engineer transitioning from aerospace. No formal process. Just the urgent need to replace Dartfish.

Years Two-Three: Scaling from Partnership to Distributed Team

Adham and I could execute together, but the domain kept expanding. The transition from two-person partnership to distributed team took longer than it needed—I was learning how to lead engineers while building production systems. The tension: I didn’t want to compromise on the architectural ideas the domain required, but keeping that thinking between just us became a bottleneck. Every new subsystem, every architectural decision, required our direct involvement.

By year three, we had built the hiring and mentoring systems needed to scale beyond the core partnership.

The Breakthrough: Distributed Ownership

When Waheed, Hadeel, and Abdallah joined, something shifted. They didn’t just implement—they took ownership and iterated beyond what we started with:

Adham (Services Architecture Lead)
  • Designed all backend services architecture
  • Led Kafka event streaming implementation
  • Built GraphQL-based web scraper (modeled DOM operations as GraphQL queries—querying the web like an API)
Waheed (Metadata System)
  • Implemented claims-based entity resolution
  • Built automatic conflict detection
  • Owned golden entity cascade logic
Hadeel (Collection Service and Dataspec)
  • Implemented real-time collection service
  • Owned Kafka integration and GraphQL subscriptions
  • Built WebSocket infrastructure for concurrent collaboration
  • Collaborated with PMs on evolving the dataspec capabilities
Abdallah (Desktop Application and Dataspec)
  • Maintained the Electron live-collection-app
  • Collaborated with PMs on evolving the dataspec capabilities

They took the foundational concepts and applied them to contexts we hadn’t explored yet.

The External Constraint

In 2022, I left Egypt. By then: 1000+ collectors, internal DSLs shipped, American football expansion worked. The customer-facing DSL that would let clients define their own derived facts remained unfinished when I left—team scaling became the bottleneck, not the architecture.

The Lesson

Team building isn’t separate from technical work. It’s the multiplier that makes ambitious architecture sustainable. Waiting until year three to expand beyond the two-person partnership meant racing against time when external constraints intervened.

The lesson: distributed ownership from day one, not year three. The architecture proved its value—scaling to 1000+ collectors across multiple sports. The team multiplied that value. Waheed, Hadeel, and Abdallah should have started alongside Adham from year one. The technical concepts would have propagated faster, subsystem ownership would have emerged earlier, and the customer-facing DSL would have shipped.

Architecture shapes what’s possible. But people make it real.

Lessons: Context Shapes What Works

What Worked in This Context

Architecture emerged from observing where collectors struggled—watching hesitation, workarounds, and slowdowns. Separation of rules from execution paid off at 1000+ collectors. At 100, it felt like over-engineering. Designing for 80% common cases with flexible extension points prevented premature optimization.

What I’d Change

Invest in team building from day one, not year three. The two-year delay meant racing the clock when I relocated. Adham, Waheed, Hadeel, and Abdallah iterated beyond the original design; they should have started sooner.

Event storming from Day 1, not month 6. Understanding bounded contexts (match metadata, event collection, people coordination, media management, contextual aggregation) shaped every technical decision.

When These Patterns Apply

These aren’t universal truths—they worked because of specific conditions:

Scale threshold: Separation creates leverage at 1000+ users, not 100. At 50 users with stable requirements, build directly. At 10x growth projections with variant needs, invest in separation.

Domain complexity: Sports analytics, medical diagnosis, legal review—domains where experts have tacit knowledge worth externalizing. The more complex the expertise, the more value in DSLs that let non-engineers express rules.

Multi-variant requirements: Multiple sports, client customization, regional variations. When every customer wants different behavior, configuration prevents maintaining N codebases.

The meta-lesson: More domain knowledge → better problem diagnosis → better architectural decisions → concepts that create value instead of complexity. Context always determines what works.

Behind the Scenes

The human side of building production systems: our team in Cairo (2018-2022), where “make illegal states unrepresentable” went from paper sketches to production reality.

Photos from 2018-2021: Team collaboration, technical discussions, and the people who took foundational concepts beyond their initial vision.