Statsbomb Sports Data Collection
Real-time sports data collection achieved 10x growth through architecture-as-data: domain logic as configuration, UI workflows as state machines, and claims-based metadata. What we built worked: it expanded to new sports without rewriting code. The lesson came from what took longer: building distributed team ownership two years late meant racing against time when external constraints intervened.
Timeline: 2018-2022 • Principal engineer championing initiatives, Cairo team of 20
TL;DR: What We Built
Real-time sports data collection system that scaled from 100 to 1000+ collectors while reducing collection time by 75% (16 hours → 4 hours). Achieved through architecture-as-data: domain logic as configuration (DSLs), UI workflows as state machines, and claims-based metadata resolution. Expanded from soccer to American football with minimal code changes.
Key Results: 99% error prevention, near real-time latency (~20s), 10x collector scale without proportional staffing.
Core Lesson: Team building multiplies architectural leverage. Waiting until year three to expand beyond a two-person partnership meant racing against external constraints before the customer-facing features were complete.
Problem: Manual Collection Doesn’t Scale
Week one. Ali and I sat down with paper. We hand-wrote sequencing rules for soccer matches—which events could follow which, what validated as legal, how phases transitioned. It was painful. But necessary.
The industry-standard tool—Dartfish—couldn’t scale to what we needed. Button-heavy interface forced mouse clicks for every selection. It had keyboard shortcuts, but they weren’t contextual. Each key could only map to one event type—globally, across the entire data spec. No overlap possible. With 50+ event types but only 40 usable keys, we hit a ceiling. Collectors memorized which arbitrary key triggered which event, at all times, regardless of game context. Cognitive overload.
Two collectors per match, each following a team, manually entering 3000+ events over 16 hours. Duplicated effort. No prevention, only validation after mistakes happened.
Years of daily conversations with collectors revealed their actual workflows and friction points. They needed keyboard flows that built muscle memory. Contextual mappings that reduced cognitive load. Collaboration patterns that eliminated rework.
The dataspec itself needed to be data, not code, so product managers could express rules in one place that drove the entire system.
Event storming sessions mapped the domain. Five bounded contexts emerged: match metadata, event collection, people coordination, media management, contextual aggregation. One insight stood out: arbitrary aggregation from atomic facts. Individual passes and dribbles were atomic events. Multiple dribbles by the same player became a “carry”, a player-level durational fact. Team possession rolled up all team durational facts (carry, foul-won, opponent-out) into a higher-level aggregation. Turnovers derived from possession changes. The system needed to support any aggregation pattern, not just what we knew upfront.
Match Metadata was different. Players, clubs, referees, stadiums—reference data that thousands of collectors needed but only 5 people managed. Crowd-sourced data meant conflicting information: same player, different spellings; same club, multiple sources. The system required automated golden entity resolution—5 people couldn’t manually reconcile metadata for thousands of collectors across hundreds of matches weekly.
Product managers couldn’t define new collection requirements without engineering bottlenecks. Collectors waited weeks for tooling updates. Operations needed to scale from 100 collectors to thousands without proportional support staff.
Why This Mattered to Statsbomb’s Business
Statsbomb provided granular sports analytics to professional clubs, broadcasters, and betting operators—markets where data velocity creates competitive advantage. Teams analyzing opponent patterns hours after matches finished fell behind. Broadcasters needed live insights for real-time commentary. Betting markets priced during play, not after final whistle.
The 75% efficiency gain wasn’t just operational—it unlocked new revenue streams. Real-time collection enabled incremental analysis during live matches—broadcasters could comment on possession patterns as they emerged, teams could track opponent tactics in-match. Multi-sport expansion (soccer to American football) without proportional engineering cost meant entering adjacent markets with existing infrastructure. 10x operational scale without linear staffing growth preserved margins while growing coverage.
The architectural separation between rules and execution created a strategic moat: product managers could customize data specifications per client without engineering rewrites. A Premier League club wanting detailed positioning data and a Championship club needing only basic events both ran on the same system—different configurations, same codebase.
Architecture: Separation as First Principle
Architecture emerges when you chase why/where users struggle, not what features they request.
Domain Logic as Configuration
We separated three concerns: entry validation (dataspec), sequencing logic (DSL), and aggregation rules (grouping DSL). Product managers wrote rules in readable syntax. The system compiled to execution logic.
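To make "rules as data, compiled to execution logic" concrete, here is a minimal JavaScript sketch. The rule shape (`derive`, `sequence`, `repeat`, `samePlayer`) and the `compileRule` helper are hypothetical names invented for illustration; they are not the production DSL or its compiler, and the sample events are made up.

```javascript
// Hypothetical rule-as-data shape: each step matches an event type,
// optionally repeated one or more times ("+"), optionally by the same player.
const carryRule = {
  derive: 'carry',
  sequence: [
    { type: 'reception' },
    { type: 'dribble', repeat: '+' },
  ],
  samePlayer: true,
};

// Compile a rule into a function that scans an event list and returns derived facts.
function compileRule(rule) {
  return function run(events) {
    const derived = [];
    let start = 0;
    while (start < events.length) {
      let pos = start;
      let ok = true;
      // An event matches a step when the type fits and, if required,
      // the player is the one who started the candidate span.
      const matches = (event, step) =>
        event !== undefined &&
        event.type === step.type &&
        (!rule.samePlayer || event.player === events[start].player);
      for (const step of rule.sequence) {
        if (!matches(events[pos], step)) {
          ok = false;
          break;
        }
        pos += 1;
        // "+" keeps consuming further matching events.
        if (step.repeat === '+') {
          while (matches(events[pos], step)) pos += 1;
        }
      }
      if (ok) {
        derived.push({ type: rule.derive, player: events[start].player, from: start, to: pos - 1 });
        start = pos; // continue after the matched span
      } else {
        start += 1;
      }
    }
    return derived;
  };
}

// Invented sample stream: B receives, dribbles twice, then passes.
const events = [
  { type: 'pass', player: 'A' },
  { type: 'reception', player: 'B' },
  { type: 'dribble', player: 'B' },
  { type: 'dribble', player: 'B' },
  { type: 'pass', player: 'B' },
];

const carries = compileRule(carryRule)(events);
console.log(carries);
```

The point of the sketch is the split: the rule is plain data that a non-engineer could edit, while the matcher is generic code that never mentions soccer.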
Example: Soccer Possession Rules
# Topological Dependency Layers:
# Layer 1: complete-pass, shot-complete, carry, counter-attack-start (atomic derivatives)
# Layer 2: possession, tiki-taka (depend on carry)
# Layer 3: turnover (depends on shot-complete, possession)
# Rule 1: Successful pass completion (atomic event derivative - Layer 1)
# When: Player A passes, Player B receives successfully
# Derive: complete-pass metric with distance, angle, pressure
Derive event complete-pass from events sequence:
- pass team mine
then reception field outcome is complete team mine
# Rule 2: Shot on target (atomic event derivative - Layer 1)
# When: Shot followed by goalkeeper save or goal
# Derive: shot-complete (distinguishes blocked/off-target from on-target)
Derive event shot-complete from events sequence:
- shot team mine
then goal-keeper field outcome is won team opponent OR goal team mine
# Rule 3: Player carry (player-level durational - Layer 1)
# When: Same player performs multiple consecutive dribbles
# Derive: carry (player-level durational fact)
Derive event carry from events sequence:
- reception player X
then dribble player X + # One or more dribbles by same player
# Rule 4: Defensive recovery counter-attack (simple derived - Layer 1)
# When: Team intercepts, then successfully passes
# Derive: counter-attack-start (high-value transition)
Derive event counter-attack-start from events sequence:
- interception field outcome is won team mine
then pass team mine
# Rule 5: Team possession (team-level aggregation - Layer 2, depends on Rule 3)
# When: Team has active carry, foul-won, or ball out by opponent
# Derive: possession (spans all team durational facts)
Derive event possession from events sequence:
- carry team mine +
~ reception team mine
~ foul-committed team opponent
~ out team opponent # ~ means: one or more carries, including team passes and play restarts in between
# Rule 6: Tiki-taka analysis (complex sequence - Layer 2, depends on Rule 3)
# When: Team completes 3+ consecutive passes without loss
# Derive: tiki-taka
Derive event tiki-taka from events sequence:
- pass team mine +
~ carry team mine # ~ means: one or more passes, including carries between passes
# Rule 7: Turnover (higher-level derivative - Layer 3, depends on Rules 2 & 5)
# When: Possession shifts from one team to opponent OR opponent shoots on target
# Derive: turnover metric
Derive event turnover from events sequence:
- possession team mine
then possession team opponent* # zero or more possessions by the other team
then shot-complete team opponent
Example: American Football Drive Segmentation
# Topological Dependency Layers:
# Layer 1: snap-complete, red-zone-conversion, third-down-success, drive-stall (atomic derivatives)
# Layer 2: offensive-drive (depends on snap-complete)
# Layer 3: scoring-drive (depends on offensive-drive)
# Rule 1: Successful snap play (atomic event derivative - Layer 1)
# When: Snap followed by successful gain (run or pass completion)
# Derive: snap-complete
Derive event snap-complete from events sequence:
- snap team mine
then run field outcome is complete team mine
# Rule 2: Red zone conversion (atomic derivative - Layer 1)
# When: Team enters red zone (20-yard line) and scores touchdown
# Derive: red-zone-conversion
Derive event red-zone-conversion from events sequence:
- snap field yard-line is less-than 20 team mine
then touchdown team mine
# Rule 3: Third-down success (atomic derivative - Layer 1)
# When: Successful play on 3rd down maintains possession (first down reset)
# Derive: third-down-success
Derive event third-down-success from events sequence:
- snap field down is 3 team mine
then snap field down is 1 team mine # Reset to first down
# Rule 4: Drive stall (atomic derivative - Layer 1)
# When: Team fails to convert on 4th down or punts
# Derive: drive-stall
Derive event drive-stall from events sequence:
- snap team mine +
then turnover-on-downs team mine OR punt team mine
# Rule 5: Offensive drive (team-level aggregation - Layer 2, depends on Rule 1)
# When: Team gains possession and runs offensive plays
# Derive: offensive-drive (spans all plays until scoring/turnover)
Derive event offensive-drive from events sequence:
- kickoff-return field outcome is complete team mine
~ punt-return field outcome is complete team mine
~ turnover-recovery team mine
then snap team mine +
~ snap-complete team mine # ~ means: one or more snaps, including completions
until touchdown team mine OR field-goal team mine OR turnover team opponent
# Rule 6: Scoring drive (higher-level derivative - Layer 3, depends on Rule 5)
# When: Offensive drive ends with touchdown or field goal
# Derive: scoring-drive
Derive event scoring-drive from events sequence:
- offensive-drive team mine
then touchdown team mine OR field-goal field outcome is good team mine
When we expanded to American football, product managers wrote new dataspecs and grouping rules.
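The "Topological Dependency Layers" comments in both examples imply an evaluation order: a derived event can only be computed after everything it depends on. A minimal sketch of how such layers could be computed from declared dependencies follows. The `dependsOn` table is transcribed from the soccer example's comments; the `computeLayers` function is a hypothetical illustration, not the system's actual scheduler.

```javascript
// Dependencies as stated in the soccer example's comments.
// Rules with an empty list derive directly from atomic (Layer 0) events.
const dependsOn = {
  'complete-pass': [],
  'shot-complete': [],
  'carry': [],
  'counter-attack-start': [],
  'possession': ['carry'],
  'tiki-taka': ['carry'],
  'turnover': ['shot-complete', 'possession'],
};

// Layer = 1 + the deepest layer among a rule's dependencies
// (memoized depth-first walk with cycle detection).
function computeLayers(graph) {
  const layers = {};
  const visiting = new Set();

  function layerOf(name) {
    if (layers[name] !== undefined) return layers[name];
    if (visiting.has(name)) throw new Error(`dependency cycle at ${name}`);
    visiting.add(name);
    const deps = graph[name] ?? [];
    const layer = 1 + Math.max(0, ...deps.map((dep) => layerOf(dep)));
    visiting.delete(name);
    layers[name] = layer;
    return layer;
  }

  Object.keys(graph).forEach((name) => layerOf(name));
  return layers;
}

const layers = computeLayers(dependsOn);
console.log(layers);
```

Evaluating rules layer by layer guarantees that `possession` exists before `turnover` is derived, and a cycle in the rule set fails loudly instead of looping.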
Visualizing the Data Flow
Domain experts don’t think in code—they think in sequences. We used visual timelines to translate their mental models into technical architecture. This diagram shows how atomic events (Layer 0) flow through time, and higher-level facts derive from event patterns.
graph TD
L0["All Events<br/>Both teams"]
MAP["map<br/>Filter team"]
L1["Team Events<br/>One team"]
RED1["reduce<br/>Phases"]
L2["Possession<br/>Phases"]
RED2["reduce<br/>Stats"]
L3["Match<br/>Records"]
OVL["overlap<br/>3 layers"]
L4["Final<br/>Timeline"]
L0 -->|1| MAP
MAP --> L1
L1 -->|2| RED1
RED1 --> L2
L0 -.->|3| RED2
RED2 --> L3
L2 -->|Layer 1| OVL
L1 -.->|Layer 2| OVL
L3 -.->|Layer 3| OVL
OVL --> L4
classDef data fill:#FEF3C7,stroke:#F59E0B,stroke-width:2px,color:#78350F
classDef func fill:#DBEAFE,stroke:#0EA5E9,stroke-width:2px,color:#0C4A6E
classDef final fill:#D1FAE5,stroke:#10B981,stroke-width:3px,color:#065F46
class L0,L1,L2,L3 data
class MAP,RED1,RED2,OVL func
class L4 final
Text description of relationships
- Layer 0: Atomic event stream (pass, reception, dribble, shot, goal-keeper) with team colors
- Map transformation: teamAFacts = map(getTeamFacts(teamA), layer-0) filters to black team only
- Reduce transformation: teamADurationalFacts = reduce(durationalFactsReducer, teamAFacts) derives possession phases (carry, defense)
- Reduce transformation: officialRecords = reduce(officialRecordsReducer, layer-0) creates match records
- Overlap transformation: teamACompoundFacts = overlap(teamAFacts, teamADurationalFacts, officialRecords) combines all layers
The functional pipeline splits into 3 paths, then recombines. Path 1: map filters events to one team. Path 2: reduce derives possession phases from team events. Path 3: reduce aggregates match statistics. The overlap function then stacks all 3 layers (events + phases + stats) onto a single timeline—like layers in Photoshop. Same temporal axis, multiple analytical dimensions visible simultaneously.
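The three paths and the final overlap can be sketched in a few lines of JavaScript. The function names follow the text description above (`getTeamFacts`, `durationalFactsReducer`, `officialRecordsReducer`, `overlap`), but their bodies, the event fields, and the sample data are assumptions made for illustration. The text calls the first step `map`; the sketch uses `Array.prototype.filter` because the step selects a subset. Treating adjacent integer timestamps as "consecutive" is also a simplification.

```javascript
// Layer 0: atomic events for both teams (sample data, invented for illustration).
const layer0 = [
  { t: 1, team: 'A', type: 'reception', player: 7 },
  { t: 2, team: 'A', type: 'dribble', player: 7 },
  { t: 3, team: 'A', type: 'dribble', player: 7 },
  { t: 4, team: 'B', type: 'interception', player: 4 },
  { t: 5, team: 'A', type: 'shot', player: 9 },
];

// Path 1: keep one team's events.
const getTeamFacts = (team) => (events) => events.filter((e) => e.team === team);

// Path 2: fold consecutive dribbles by the same player into a durational "carry".
function durationalFactsReducer(phases, e) {
  if (e.type !== 'dribble') return phases;
  const last = phases[phases.length - 1];
  if (last && last.player === e.player && last.end === e.t - 1) {
    return [...phases.slice(0, -1), { ...last, end: e.t }];
  }
  return [...phases, { type: 'carry', player: e.player, start: e.t, end: e.t }];
}

// Path 3: fold all events into match-level counts.
function officialRecordsReducer(records, e) {
  return { ...records, [e.type]: (records[e.type] ?? 0) + 1 };
}

// Overlap: stack events and phases on one shared time axis, alongside the records.
function overlap(events, phases, records) {
  const timeline = events.map((e) => ({
    t: e.t,
    event: e.type,
    phase: phases.find((p) => p.start <= e.t && e.t <= p.end)?.type ?? null,
  }));
  return { timeline, records };
}

const teamAFacts = getTeamFacts('A')(layer0);
const teamADurationalFacts = teamAFacts.reduce(durationalFactsReducer, []);
const officialRecords = layer0.reduce(officialRecordsReducer, {});
const teamACompoundFacts = overlap(teamAFacts, teamADurationalFacts, officialRecords);
console.log(JSON.stringify(teamACompoundFacts, null, 2));
```

Because every step is a pure function over the same Layer 0 stream, a new analytical layer is one more reducer plus one more argument to `overlap`, with no change to the existing paths.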
The Live-Collection-App: UX as First Principle
The first thing we built wasn’t backend architecture. It was the tool collectors would use every day—an Electron desktop app that replaced Dartfish’s rigid forms with keyboard-first workflows.
Real-Time Collaboration
GraphQL subscriptions via WebSocket enabled Google Docs-style collaboration. Multiple collectors worked concurrently—one collector led by publishing the base event name, others contributed their assigned data pieces (player positions, freeze frames, detailed attributes). Role separation by design meant no conflicts were possible. Freeze frame positioning and manual location data coexisted as valid refinements at different granularities.
Concurrent collection follows a pub-sub pattern: the lead collector establishes the base event identity, and specialized collectors subscribe and contribute domain-specific data. Role separation prevents conflicts: each collector owns distinct data fields, with no overlapping writes.
graph TD
LEAD[Lead Collector<br/>Creates base event] -->|GraphQL<br/>subscription| SUB[Base Event Published<br/>shot: abc123<br/>team: A, time: 45:30]
SUB -->|WebSocket<br/>broadcast| POS[Position Collector<br/>Subscribes]
SUB -->|WebSocket<br/>broadcast| FREEZE[Freeze Frame Collector<br/>Subscribes]
SUB -->|WebSocket<br/>broadcast| ATTR[Attributes Collector<br/>Subscribes]
POS -->|Contributes<br/>player_positions| MERGE[Merge Service<br/>Combines data]
FREEZE -->|Contributes<br/>freeze_frame_data| MERGE
ATTR -->|Contributes<br/>detailed_attributes| MERGE
MERGE --> FINAL[Complete Event<br/>shot: abc123<br/>All data pieces merged]
classDef atomic fill:#FEF3C7,stroke:#F59E0B,stroke-width:2px,color:#78350F
classDef derived fill:#DBEAFE,stroke:#0EA5E9,stroke-width:2px,color:#0C4A6E
classDef system fill:#D1FAE5,stroke:#10B981,stroke-width:2px,color:#065F46
class LEAD,POS,FREEZE,ATTR atomic
class SUB derived
class MERGE,FINAL system
Text description of relationships
- Lead Collector (amber) publishes base event (shot, event_id: abc123, timestamp, team)
- GraphQL Subscription (blue) broadcasts base event to all active collectors via WebSocket
- Position Collector (amber) subscribes and contributes player_positions data field
- Freeze Frame Collector (amber) subscribes and contributes freeze_frame_data field
- Attributes Collector (amber) subscribes and contributes detailed_attributes field
- Merge Service (emerald) combines all contributions into complete event record
- No conflicts possible: each collector writes to distinct data fields (role separation by design)
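The "no conflicts by design" property comes down to disjoint field ownership. A minimal sketch of a merge step that enforces it is shown below. The field names come from the diagram; the `fieldOwner` table, role names, and `mergeContributions` function are hypothetical, and the real merge service may work quite differently.

```javascript
// Hypothetical ownership table: each field may be written by exactly one role.
const fieldOwner = {
  player_positions: 'position',
  freeze_frame_data: 'freeze-frame',
  detailed_attributes: 'attributes',
};

// Merge contributions into the base event, rejecting writes outside a role's field.
function mergeContributions(baseEvent, contributions) {
  const merged = { ...baseEvent };
  for (const { role, field, value } of contributions) {
    if (fieldOwner[field] !== role) {
      throw new Error(`${role} may not write ${field}`);
    }
    merged[field] = value;
  }
  return merged;
}

// Base event as published by the lead collector in the diagram.
const base = { id: 'abc123', type: 'shot', team: 'A', time: '45:30' };

// Sample contributions (values invented for illustration).
const contributions = [
  { role: 'attributes', field: 'detailed_attributes', value: { bodyPart: 'left-foot' } },
  { role: 'position', field: 'player_positions', value: [{ player: 9, x: 88, y: 40 }] },
  { role: 'freeze-frame', field: 'freeze_frame_data', value: { frame: 1234 } },
];

const complete = mergeContributions(base, contributions);
const completeReversed = mergeContributions(base, [...contributions].reverse());
console.log(complete);
```

Since each role writes a different key, arrival order does not change the result, which is why `complete` and `completeReversed` carry the same data.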
State Machines for Impossible States
We used XState to model collection workflows as explicit finite state machines. Traditional event handlers create implicit state explosions (if (editing && !watching && hasBase)). State machines make all transitions explicit and testable.
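import { createMachine } from 'xstate';

// Main collection workflow: three parallel regions.
// Sketch only: child states are abbreviated, and treating the first listed
// state as each region's initial state is an assumption.
const mainRoomMachine = createMachine({
  id: 'main-room',
  type: 'parallel',
  states: {
    videoMode: { initial: 'manual', states: { manual: {}, loop: {}, editing: {} } },
    collection: { initial: 'watching', states: { watching: {}, collectingBase: {}, addingPartials: {} } },
    freezeFrame: { initial: 'idle', states: { idle: {}, active: {}, reviewing: {} } }
  }
});
Five major state machines orchestrated the app: main room, event entry, player customization, freeze frame queue, and batch validation. This prevented illegal states by design: you couldn’t enter a freeze frame without an active event, and you couldn’t submit without required fields.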
30+ Keyboard Shortcuts: Muscle Memory Over Mouse Clicks
We used Mousetrap with module scoping. Same key, different action per context:
// Context-aware keyboard mappings
const keyboardShortcuts = {
'main-room': {
't': 'Create new event (pass, shot, tackle)',
'p': 'Add player to event',
'/': 'Toggle video loop mode',
'←→': 'Rewind/forward 5 seconds'
},
'freeze-frame': {
't': 'Toggle team view (home/away)',
'1-9': 'Select player by jersey number',
'Enter': 'Confirm positions'
}
};
// Same key 't', different meaning per context
bindShortcuts('main-room', keyboardShortcuts['main-room']);
bindShortcuts('freeze-frame', keyboardShortcuts['freeze-frame']);
Collectors touch-typed events like Vim users touch-type code. Video control (arrow keys), event creation (letter keys), navigation (/, Z), editing (Shift+E), freeze frames (number keys). Hands stayed on keyboard. Eyes stayed on video.
Contextual UI Adaptation
The interface adapted based on game state and previous decisions. When a collector selected “Pass,” the system showed only valid pass types for that player’s position. Auto-filled mandatory fields with single valid options. Skipped irrelevant choices entirely.
// After DSL sequence rules run and determine valid next options
// If only one option remains valid, auto-fill and skip the question
// Example: "Pass" event followed by rules determining only "Reception" is valid → auto-fill
if (hasOnlyOneValidOption(currentField, validOptions)) {
return progressToNextField(allFields, currentField, validOptions[0]);
}
Cognitive load minimization: collectors never saw irrelevant options.
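The snippet above relies on helpers that are not shown. As a self-contained illustration of the same idea, the sketch below walks the fields in order, skips fields with no valid options, auto-fills fields with exactly one, and stops at the first field that needs a human decision. The field names, the `validOptionsFor` stand-in for the DSL, and the `advance` function are all hypothetical.

```javascript
// Hypothetical stand-in for the DSL: given the answers so far, list valid options per field.
function validOptionsFor(field, answers) {
  if (field === 'eventType') return ['pass', 'shot'];
  if (field === 'height') return answers.eventType === 'pass' ? ['ground', 'low', 'high'] : [];
  if (field === 'nextEvent') return answers.eventType === 'pass' ? ['reception'] : ['goal', 'save', 'block'];
  return [];
}

// Walk fields in order and return the first one that needs a human decision.
function advance(fields, answers) {
  const filled = { ...answers };
  for (const field of fields) {
    if (field in filled) continue; // already answered
    const options = validOptionsFor(field, filled);
    if (options.length === 0) continue; // irrelevant here: never shown
    if (options.length === 1) {
      filled[field] = options[0]; // single valid option: auto-fill
      continue;
    }
    return { answers: filled, ask: field, options };
  }
  return { answers: filled, ask: null, options: [] };
}

const fields = ['eventType', 'height', 'nextEvent'];

// A pass still needs a height, so the collector is asked for it.
const afterPass = advance(fields, { eventType: 'pass' });

// Once height is chosen, the only valid next event is auto-filled and nothing is left to ask.
const afterHeight = advance(fields, { ...afterPass.answers, height: 'ground' });

// A shot has no height options, so that field is skipped entirely.
const afterShot = advance(fields, { eventType: 'shot' });
console.log(afterPass.ask, afterHeight.answers, afterShot.ask);
```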
Video Loop Mode
Press / to toggle between manual playback and auto-replay. In loop mode, the system repeated a 5-second window around the active event. No manual rewinding. Collectors filled fields while watching the replay, then pressed / to move forward.
Computer Vision Assistance
Freeze frames (positioning data for all 22 players) were semi-automated. Collectors triggered frame extraction, CV service detected players and ball, system overlaid predictions on pitch canvas, collectors corrected misidentifications. Hybrid approach: 80-90% automation, human correction for edge cases.
This tool became the foundation. The dataspec drove UI validation, the DSL defined legal event sequences, the state machines enforced correctness. Concurrent collection during live matches with sub-minute latency. The backend scaled to support what the UX tool showed collectors needed.
Backend Evolution: From Batch to Real-Time
Breaking matches down by decision rather than team increased correctness without additional effort. Computer vision assisted input, contextual keyboard mappings reduced cognitive load, and automated linting caught preventable errors. Collectors focused on judgment over correction—handling edge cases where human expertise mattered rather than catching mistakes.
Beyond Simple Sequences: Event Dependency Graphs
Events don’t just flow sequentially—they form directed acyclic graphs. A foul-committed event might depend on a previous clearance, which itself connects to atomic pass events. The system modeled both temporal relationships (PREV/NEXT) and logical dependencies (DEPENDS_ON).
graph TD
subgraph "Defense Phase"
CL["clearance<br/>team: A<br/>Layer 0"]
end
subgraph "Loose Phase (No Control)"
BR["ball-recovery<br/>team: B<br/>outcome: complete<br/>Layer 0"]
FF["fifty-fifty<br/>team: A vs B<br/>outcome: won-B<br/>Layer 0"]
end
subgraph "Possession Phase (Team B)"
PASS["pass<br/>team: B<br/>Layer 0"]
DRIB["dribble<br/>team: B<br/>Layer 0"]
end
CL -->|PREV<br/>triggers loose| BR
CL -->|PREV<br/>triggers loose| FF
BR -->|triggers<br/>possession| PASS
BR -->|triggers<br/>possession| DRIB
FF -->|triggers<br/>possession| PASS
CL -.->|phase rule:<br/>loose allows| BR
CL -.->|phase rule:<br/>loose allows| FF
BR -.->|phase rule:<br/>possession allows| PASS
BR -.->|phase rule:<br/>possession allows| DRIB
FF -.->|phase rule:<br/>possession allows| PASS
classDef defense fill:#FEF3C7,stroke:#F59E0B,stroke-width:2px,color:#78350F
classDef loose fill:#DBEAFE,stroke:#0EA5E9,stroke-width:2px,color:#0C4A6E
classDef possession fill:#D1FAE5,stroke:#10B981,stroke-width:2px,color:#065F46
class CL defense
class BR,FF loose
class PASS,DRIB possession
Text description of relationships
- clearance (team A, Layer 0 atomic) triggers loose phase - neither team controls ball
- Loose phase allows: ball-recovery or fifty-fifty (Layer 0 atomic temporal successors)
- ball-recovery (outcome: complete, Layer 0) triggers possession phase - team B controls ball
- Possession phase allows: pass, dribble, reception (Layer 0 atomic events only)
- fifty-fifty (outcome: won-B, Layer 0) also triggers possession phase - alternative path
- Phase state machine uses only Layer 0 events - derived events (carry, turnover) computed separately
Solid lines show temporal flow (PREV - what happened next). Dashed lines show phase validation rules (what transitions are legal). Only Layer 0 atomic events drive phase transitions. The state machine prevents invalid sequences—you can’t pass during “loose” phase, can’t recover during “possession” phase. Derived events like “carry” are computed separately from sequences of Layer 0 events.
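The phase rules in the diagram can be written as a small transition table plus a validator. This is a dependency-free sketch, not the production state machine: the `phaseRules` table covers only the phases and events named above, the assumption that pass, dribble, and reception keep the team in the possession phase is mine, and `validateSequence` is a hypothetical helper.

```javascript
// For each phase: which Layer 0 events are legal, and which phase each one leads to.
// Self-loops inside "possession" are an assumption; the diagram only shows what the phase allows.
const phaseRules = {
  defense: { clearance: 'loose' },
  loose: { 'ball-recovery': 'possession', 'fifty-fifty': 'possession' },
  possession: { pass: 'possession', dribble: 'possession', reception: 'possession' },
};

// Replay a sequence of event types from a starting phase and report the first illegal one.
function validateSequence(startPhase, eventTypes) {
  let phase = startPhase;
  for (const [index, type] of eventTypes.entries()) {
    const allowed = phaseRules[phase] ?? {};
    if (!Object.prototype.hasOwnProperty.call(allowed, type)) {
      return { valid: false, index, phase, rejected: type };
    }
    phase = allowed[type];
  }
  return { valid: true, phase };
}

// Legal: clearance, then a recovery, then possession events.
const legal = validateSequence('defense', ['clearance', 'ball-recovery', 'pass', 'dribble']);

// Illegal: a pass during the loose phase.
const illegal = validateSequence('defense', ['clearance', 'pass']);
console.log(legal, illegal);
```

In a collection tool the same table can serve two purposes: rejecting invalid input and listing the only event types worth offering in the current phase.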
Match Metadata: Temporal Stages with Automated Entity Resolution
The metadata challenge required different architectural thinking. Match metadata had temporal stages that preceded live collection:
Stage 1: Competition Schedules (Days/Weeks Before Kickoff)
We organized competition schedules daily and observed the web for updates—fixtures announced, kickoff times changed, venues moved. Web scrapers monitored official sources. Any schedule changes propagated automatically.
Stage 2: Matchday Lineups (Hours Before Kickoff)
Ahead of kickoff, we scraped matchday lineups and match details—starting XI, formations, referees, stadium conditions. This is where most new entities entered the system.
Stage 3: Automatic Entity Streaming
Whenever scrapers or collectors encountered a new entity—player spelling variant, unknown referee, club name mismatch—the system automatically streamed it to the info resolution team (5 people managing metadata for thousands of collectors). No manual data entry. No waiting for batch imports.
With thousands of collectors crowd-sourcing data across multiple sources, conflicts were inevitable: same player, different spellings; same club, multiple sources. The system needed to resolve golden entities automatically—5 people couldn’t manually reconcile every conflict.
Rather than key-value pairs, metadata became claims from actors with provenance (who), confidence (reliability), and temporality (when). Each claim carried its source context. When multiple sources asserted conflicting facts, the graph structure made conflicts explicit and queryable.
The 5-person metadata team’s workflow: query for pending conflicts → review side-by-side with full context → resolve once → system cascades to all dependent data.
graph TD
C1[Collector A] -->|claims| N1[Nationality: Egypt]
C2[Scraped Source] -->|claims| N2[Nationality: UK]
API[Third-party API] -->|claims| N1
N1 -->|resolves to| GE[Golden Entity: Player X]
N2 -->|conflict| QUEUE[Pending Conflicts Queue]
QUEUE --> MT[Metadata Team Review]
MT -->|resolves| GE
GE -->|cascades to| DEP1[Match Data]
GE -->|cascades to| DEP2[Historical Stats]
classDef atomic fill:#FEF3C7,stroke:#F59E0B,stroke-width:2px,color:#78350F
classDef derived fill:#DBEAFE,stroke:#0EA5E9,stroke-width:2px,color:#0C4A6E
classDef system fill:#D1FAE5,stroke:#10B981,stroke-width:2px,color:#065F46
class C1,C2,API atomic
class N1,N2 derived
class QUEUE,MT,GE,DEP1,DEP2 system
Text description of relationships
- Data sources (amber): Collector A, Scraped Source, and Third-party API submit nationality claims
- Claims (blue): "Nationality: Egypt" and "Nationality: UK" represent conflicting data
- Conflict resolution: Conflicting claim (Nationality: UK) routes to Pending Conflicts Queue
- System nodes (emerald): Metadata Team Review resolves conflicts and updates Golden Entity
- Cascade: Golden Entity (Player X) automatically updates dependent systems (Match Data, Historical Stats)
- Agreed claims resolve directly to golden entity; conflicts require human review
Claims from data sources (amber) flow into the system. Conflicts (blue) route to a queue. The metadata team (emerald) resolves ambiguity. Golden entities cascade updates to dependent systems automatically.
The system prevented duplicate entities by design, caught conflicts automatically, and escalated true ambiguity. The 5-person team focused on the 1-2% of ambiguous cases requiring domain expertise (is “FC Barcelona” the same as “Barcelona FC”? yes; but “Manchester United” vs “Manchester City”? no).
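A minimal sketch of the claims idea follows: each claim records who asserted a value, with what confidence, and when; claims that agree resolve directly, and disagreements go to a review queue. The claim fields, sample values, dates, and the `resolveClaims` function are invented for illustration. The policy here is deliberately stricter than the diagram (any disagreement on an attribute queues the whole attribute for review); the production resolution rules are not described in enough detail to reproduce.

```javascript
// Sample claims (invented): provenance (source), confidence, and temporality (at).
const claims = [
  { entity: 'player-x', attribute: 'nationality', value: 'Egypt', source: 'collector-a', confidence: 0.7, at: '2021-03-01' },
  { entity: 'player-x', attribute: 'nationality', value: 'UK', source: 'scraped-source', confidence: 0.5, at: '2021-03-02' },
  { entity: 'player-x', attribute: 'nationality', value: 'Egypt', source: 'third-party-api', confidence: 0.9, at: '2021-03-03' },
  { entity: 'player-x', attribute: 'position', value: 'Forward', source: 'collector-a', confidence: 0.8, at: '2021-03-01' },
];

// Group claims per entity attribute; agreed values become golden, disagreements are queued.
function resolveClaims(allClaims) {
  const groups = new Map();
  for (const claim of allClaims) {
    const key = `${claim.entity}/${claim.attribute}`;
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(claim);
  }

  const golden = {};
  const pendingConflicts = [];
  for (const [key, group] of groups) {
    const distinctValues = new Set(group.map((c) => c.value));
    if (distinctValues.size === 1) {
      golden[key] = group[0].value;
    } else {
      // Keep every claim with its source context so a reviewer sees them side by side.
      pendingConflicts.push({ key, claims: group });
    }
  }
  return { golden, pendingConflicts };
}

// A reviewer's decision is recorded once and becomes the golden value.
function resolveConflict(state, key, value) {
  return {
    golden: { ...state.golden, [key]: value },
    pendingConflicts: state.pendingConflicts.filter((c) => c.key !== key),
  };
}

const state = resolveClaims(claims);
const reviewed = resolveConflict(state, 'player-x/nationality', 'Egypt');
console.log(state.pendingConflicts.length, reviewed.golden);
```

Keeping the raw claims rather than overwriting a single value is what makes the conflict queryable: the queue entry carries every source's assertion, not just the latest one.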
Impact: From Bottleneck to Multiplier
Design for the 80% common case; production iteration reveals the 20% edge cases.
Quantitative Results
Real-time collection: Concurrent collection during live matches replaced post-match sequential processing.
Minimal engineering bottleneck: When we expanded to American football, product managers wrote new dataspecs and grouping rules with minimal code changes.
Automated validation: Linting and contextual validation caught errors automatically. Collectors focused on judgment calls where human expertise mattered—event type conflicts, ambiguous data points—rather than catching preventable mistakes.
Non-linear leverage: Operations scaled 10x without proportional staffing increases.
The system scaled horizontally. Product managers shipped features.
Qualitative Changes
Collector experience: Keyboard flows built muscle memory. Contextual mappings reduced cognitive load. Computer vision assisted input.
Client customization: Different granularity requirements—basic event tracking or advanced positioning data—handled through configuration.
Metadata team stayed small: The 5-person team queried pending conflicts, reviewed side-by-side, resolved once, and the system cascaded updates.
The Team Building Challenge: Learning to Lead
The architecture worked. But building it was only half the story.
Year One: Finding the Core Partnership
Week one: Ali and I hand-wrote sequencing rules. Early months: Adham joined as co-architect. The foundational architecture—data as configuration, state machines, DSLs—emerged from our collaborative exploration. Adham pushed me to chase ideas I wasn’t confident enough to pursue alone.
Month one backend: a single Go endpoint called sync—batch, offline. We started building the live-collection-app with a junior engineer transitioning from aerospace. No formal process. Just the urgent need to replace Dartfish.
Years Two-Three: Scaling from Partnership to Distributed Team
Adham and I could execute together, but the domain kept expanding. The transition from two-person partnership to distributed team took longer than it needed—I was learning how to lead engineers while building production systems. The tension: I didn’t want to compromise on the architectural ideas the domain required, but keeping that thinking between just us became a bottleneck. Every new subsystem, every architectural decision, required our direct involvement.
By year three, we had built the hiring and mentoring systems needed to scale beyond the core partnership.
The Breakthrough: Distributed Ownership
When Waheed, Hadeel, and Abdallah joined, something shifted. They didn’t just implement—they took ownership and iterated beyond what we started with:
Adham (Services Architecture Lead)
- Designed all backend services architecture
- Led Kafka event streaming implementation
- Built GraphQL-based web scraper (modeled DOM operations as GraphQL queries—querying the web like an API)
Waheed (Metadata System)
- Implemented claims-based entity resolution
- Built automatic conflict detection
- Owned golden entity cascade logic
Hadeel (Collection Service and Dataspec)
- Implemented real-time collection service
- Owned Kafka integration and GraphQL subscriptions
- Built WebSocket infrastructure for concurrent collaboration
- Collaborated with PMs on evolving the dataspec capabilities
Abdallah (Desktop Application and Dataspec)
- Maintained the Electron live-collection-app
- Collaborated with PMs on evolving the dataspec capabilities
They took the foundational concepts and applied them to contexts we hadn’t explored yet.
The External Constraint
In 2022, I left Egypt. By then: 1000+ collectors, internal DSLs shipped, American football expansion worked. The customer-facing DSL that would let clients define their own derived facts remained unfinished when I left—team scaling became the bottleneck, not the architecture.
The Lesson
Team building isn’t separate from technical work. It’s the multiplier that makes ambitious architecture sustainable. Waiting until year three to expand beyond the two-person partnership meant racing against time when external constraints intervened.
The lesson: distributed ownership from day one, not year three. The architecture proved its value—scaling to 1000+ collectors across multiple sports. The team multiplied that value. Waheed, Hadeel, and Abdallah should have started alongside Adham from year one. The technical concepts would have propagated faster, subsystem ownership would have emerged earlier, and the customer-facing DSL would have shipped.
Architecture shapes what’s possible. But people make it real.
Lessons: Context Shapes What Works
What Worked in This Context
Architecture emerged from observing where collectors struggled—watching hesitation, workarounds, and slowdowns. Separation of rules from execution paid off at 1000+ collectors. At 100, it felt like over-engineering. Designing for 80% common cases with flexible extension points prevented premature optimization.
What I’d Change
Invest in team building from day one, not year three. The two-year delay meant racing against time when I relocated. Adham, Waheed, Hadeel, and Abdallah iterated beyond the original design; they should have started sooner.
Event storming from Day 1, not month 6. Understanding bounded contexts (match metadata, event collection, people coordination, media management, contextual aggregation) shaped every technical decision.
When These Patterns Apply
These aren’t universal truths—they worked because of specific conditions:
Scale threshold: Separation creates leverage at 1000+ users, not 100. At 50 users with stable requirements, build directly. At 10x growth projections with variant needs, invest in separation.
Domain complexity: Sports analytics, medical diagnosis, legal review—domains where experts have tacit knowledge worth externalizing. The more complex the expertise, the more value in DSLs that let non-engineers express rules.
Multi-variant requirements: Multiple sports, client customization, regional variations. When every customer wants different behavior, configuration prevents maintaining N codebases.
The meta-lesson: More domain knowledge → better problem diagnosis → better architectural decisions → concepts that create value instead of complexity. Context always determines what works.
Behind the Scenes
The human side of building production systems: our team in Cairo (2018-2022), where “make illegal states unrepresentable” went from paper sketches to production reality.
Photos from 2018-2021: Team collaboration, technical discussions, and the people who took foundational concepts beyond their initial vision.