Debugging a Sportsbook Integration: Four Bugs, One Database Session
We recently went through a full audit of our sportsbook callback integration - the server-side endpoints that the provider calls when a player places a bet, wins, loses, or gets a refund. What started as a single bug report from the provider, turned into a deep dive through race conditions, SQLAlchemy session mechanics, and the subtle cost of committing too often in an async Python application.
This is a walkthrough of every bug we found and fixed, roughly in the order we found them.
The Setup
When a player places a bet, the provider sends a signed JWT callback to our server. Our server deducts the balance, records the transaction, and returns the updated balance. There are about a dozen callback endpoints: bet/make, bet/win, bet/refund, bet/rollback, bet/discard, ping, balance, and more.
The stack: FastAPI (async), SQLAlchemy (synchronous ORM with autoflush=False), PostgreSQL, a shared get_db dependency that yields a fresh SessionLocal() per request and commits on success / rolls back on exception.
Bug 1: Five parallel bets timing out
The provider’s test suite includes a test that fires five bet/make requests concurrently. Ours was timing out.
The culprit: create_single_bet was calling db.commit() after inserting each individual SingleBet. For a betslip with three selections, that meant three commits before the betslip itself was committed. SQLAlchemy’s synchronous commit() blocks the event loop for the duration of the round trip to PostgreSQL. Five concurrent requests x four commits each = twenty sequential event-loop blocks happening in what should have been parallel processing.
The fix: remove the commits from create_single_bet. Instead, stage the SingleBet objects with db.add() and let create_bet_slip’s existing final commit write everything in one shot.
# Before: 4 commits per bet (N selections + 1 betslip)
def create_single_bet(db, bet, ...):
db_bet = SingleBet(...)
db.add(db_bet)
db.commit() # <- blocking, per-selection
db.refresh(db_bet)
return db_bet
# After: 1 commit for everything
def create_single_bet(db, bet, ...):
db_bet = SingleBet(...)
db.add(db_bet)
return db_bet # committed later by create_bet_slip
Bug 2: Twenty parallel bets timing out
After fixing the five-parallel test, the twenty-parallel test started timing out. The five-parallel fix reduced commits per request, but there were still five commits in the bet/make path overall:
- Reference entity creation (
create_sport,create_tournament, etc.) - one commit each if the entity was new create_bet_slip- one commitupdate_user_finance_and_transaction- one commit (unavoidable - shared function)create_transactioninside finance - one commit (unavoidable)SportsbookTransaction- one commitBetmodel - one commit
We reduced this to two unavoidable commits by changing the SportsbookTransaction insert to a flush (which assigns the auto-increment ID without committing) and deferring the Bet insert to the get_db dependency’s final commit at request teardown:
# SportsbookTransaction: flush instead of commit
db.add(sportsbook_transaction_model)
db.flush() # <- ID assigned, not yet committed
db.refresh(sportsbook_transaction_model)
# Bet model: stage only, committed by get_db
db.add(bet_model)
# no commit here - get_db commits on response
The two remaining commits - finance update and create_transaction - live in a shared service used across the whole platform, so we left them alone.
Bug 3: The Race Condition We Found Along the Way
While fixing the parallel bet timeouts, we noticed that some bets were failing intermittently - not with a timeout, but with HTTP 500 errors. This only happened when the reference tables (sports, tournaments, categories) were empty, i.e., on a fresh database or after clearing data between test runs.
The root cause was a classic check-then-insert race:
def create_sport(db, bet):
existing = db.query(Sport).filter(Sport.id == bet.sport_id).first()
if existing is None:
sport = Sport(id=bet.sport_id, name=bet.sport_name)
db.add(sport)
db.commit() # <- crash if another session commits the same PK first
With requests arriving simultaneously, all sessions query sports at the same instant, all find nothing, and all attempt to insert the same row (same sport ID as primary key). The first one to commit wins, and the others get PostgreSQL’s duplicate key value violates unique constraint -> IntegrityError -> HTTP 500.
The fix was to replace the check-then-insert pattern with PostgreSQL’s INSERT ... ON CONFLICT DO NOTHING, executed directly within the current session’s open transaction - no separate commit needed:
from sqlalchemy.dialects.postgresql import insert as pg_insert
def create_sport(db, bet):
stmt = pg_insert(Sport.__table__).values(
id=bet.sport_id,
name=bet.sport_name,
created_at=datetime.utcnow(),
updated_at=datetime.utcnow(),
).on_conflict_do_nothing(index_elements=['id'])
db.execute(stmt)
# no commit - everything committed by create_bet_slip at the end
PostgreSQL serializes concurrent inserts of the same key at the database level: the second session to arrive blocks briefly until the first commits, then silently skips the insert. No application-level errors, no individual commits per entity.
Bug 4: Deadlocks and Stale Data
In a sportsbook integration, a single “Bet Make” request often needs to update multiple tables: user finance, transaction history, and betslip details. Our original code used standard SQLAlchemy SELECT ... FOR UPDATE patterns within a single transaction.
Under parallel load, this created two critical issues:
When multiple bets for the same user arrived simultaneously, they queued up waiting for the finance row lock. In our async server, any await (like publishing an event to a message broker) yields the event loop. If a request held a lock and then yielded, a competing request could block the entire event loop while waiting for that lock, preventing the first request from ever resuming to release it.
Fix: We introduced explicit commits immediately after critical financial operations and before any asynchronous calls (like await publish_event).
What We Learned
- SQLAlchemy’s
with_for_update()is necessary but not sufficient. We use it to serialize concurrent balance deductions - it’s aSELECT ... FOR UPDATEthat locks the user’s finance row so only one request at a time can read-and-modify the balance. But that lock only helps for the finance update. Reference table inserts need their own concurrency strategy, which is whereON CONFLICT DO NOTHINGfits. - Committing inside a helper function is a trap. Every
db.commit()inside a function that’s called from a request handler has two costs: it blocks the event loop (SQLAlchemy is synchronous;commit()is a network round trip to PostgreSQL), and it commits everything currently staged in the session - including objects staged by callers that weren’t expecting them to be committed yet. The right pattern is to stage everything withdb.add()ordb.flush()(for ID generation) and commit once at the handler’s exit point. - The
get_dbdependency is a natural commit point. FastAPI’s dependency injection system callsget_db’sfinallyblock after the response is sent. Deferring the final commit there means the DB transaction stays open as short as possible (only while the response is being prepared), and any unhandled exception automatically triggers a rollback. - Flush is your friend for ID generation.
db.flush()sends the INSERT to PostgreSQL within the open transaction, which means PostgreSQL assigns the auto-increment primary key, but the row isn’t visible to other sessions yet. You get the ID you need for foreign keys without paying for a commit. - Locks vs. Awaits. Never hold a row lock while the event loop can yield. Committing early releases these locks, allowing parallel requests for the same user to proceed while the first request handles non-critical secondary logic.
Testing Parallel Behavior
To reproduce and verify these fixes, we wrote a parallel load test script (test/parallel_test.py) that generates RSA-signed JWT payloads simulating the provider’s callback format and fires N concurrent requests using asyncio.gather. Key design decisions:
- All request bodies are built before the concurrent phase starts, so JWT signing time doesn’t skew the latency measurements.
- The script generates a throwaway RSA key pair and prints the public key so you can paste it into the development config’s
PUBLIC_KEYfor that test run.
After all fixes: 20/20 requests succeed, wall time well under the timeout threshold.