Git for Your Data Lake — Why Agents Need Isolation and Rollback | Ciro Greco | PyAI Conf 2026

"Unlike code, data is non-local by default. If you change something, there's a bunch of production systems that depend on what you change. And unfortunately, we do not have git."

In this talk from PyAI Conf 2026, Ciro Greco, co-founder and CEO of Bauplan and adjunct professor at Columbia University, explains why AI coding assistants hit a wall when they try to touch production data, and what to do about it. In the approach he presents, every pipeline run happens on an isolated branch, merges are atomic, and failed runs never reach production. The talk includes a live demo of an AI agent building and deploying a data pipeline with full git-like version control for the data lake.

0:00 - Why data infrastructure needs to be rebuilt for AI agents

0:44 - Software engineering matured fast with agents; data work is still catching up

1:36 - Why Claude Code works so well: local files, terminal feedback, and git

2:16 - The backfill problem: asking an agent to modify a production table

3:00 - Four missing pieces: isolation, atomicity, observability, and rollback

3:40 - "Data is non-local by default and we don't have git for it"

4:28 - Infrastructure that makes isolation and atomic updates automatic

5:07 - Agents generate 100x the code and run 100x the workload of a person

5:55 - Principle one: make everything git-like with branches, versions, and snapshots

6:44 - Principle two: squeeze your entire data platform into a Python package

7:24 - Everything is Python — tables, infrastructure, the outer loop

8:15 - Live demo: data branches as zero-copy versions of your data lake

9:08 - Commit history and time travel across your entire data lake

9:54 - The agent workflow: branching, running, and iterating from the terminal

10:42 - Building a backfill pipeline with an AI agent in real time

12:03 - Agent generates a Python pipeline script with declarative columns

12:33 - The agent runs, reads terminal output, and iterates until the pipeline works

13:46 - Verifying the new table exists on the feature branch but not on main

14:17 - Merging the branch: atomic publish to production

15:07 - Time travel and undo: nothing is permanent, everything is reversible

15:57 - Q&A begins

16:48 - How branches and hashing work with production writes and new data

17:28 - Atomic merges: multi-table publishes that either fully land or don't

17:59 - How customers actually adopt git semantics for data (mostly through the agent)

19:24 - Agents run 50-60 queries where a human runs 4-5

19:58 - Skills-based automation: data quality tests, log fetching, auto-fix branches

20:29 - Why data can't live in actual git: it's too large and always in the cloud

21:50 - "As far as the agent knows, it's just a git CLI for data"
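
The branch, merge, and rollback workflow outlined in the chapters above can be illustrated with a toy in-memory model. This is only a sketch of the semantics the talk describes (isolated branches, atomic publish, undo). It is not Bauplan's API: `ToyCatalog`, `run_pipeline`, and every method name here are hypothetical, and a real data lake would store table snapshots in cloud object storage rather than in Python dictionaries.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical toy model for illustration only; not Bauplan's API.
Snapshot = Dict[str, Tuple[int, ...]]  # table name -> immutable rows


@dataclass
class ToyCatalog:
    branches: Dict[str, Snapshot] = field(default_factory=lambda: {"main": {}})
    history: List[Snapshot] = field(default_factory=list)  # earlier states of main

    def create_branch(self, name: str, source: str = "main") -> None:
        # Zero-copy in spirit: only the name -> rows mapping is copied,
        # the row tuples themselves are shared between branches.
        self.branches[name] = dict(self.branches[source])

    def write(self, branch: str, table: str, rows: Tuple[int, ...]) -> None:
        self.branches[branch][table] = rows

    def merge(self, branch: str) -> None:
        # Atomic publish: main switches to a new snapshot in one assignment,
        # so all tables from the branch land together or not at all.
        self.history.append(self.branches["main"])
        self.branches["main"] = {**self.branches["main"], **self.branches[branch]}
        del self.branches[branch]

    def rollback(self) -> None:
        # Undo the most recent merge by restoring the previous snapshot of main.
        self.branches["main"] = self.history.pop()


Step = Callable[[ToyCatalog, str], None]


def run_pipeline(catalog: ToyCatalog, branch: str, steps: List[Step]) -> bool:
    """Run steps on an isolated branch; merge only if every step succeeds."""
    catalog.create_branch(branch)
    try:
        for step in steps:
            step(catalog, branch)
    except Exception:
        del catalog.branches[branch]  # failed run never reaches main
        return False
    catalog.merge(branch)
    return True


def backfill(catalog: ToyCatalog, branch: str) -> None:
    catalog.write(branch, "orders_backfill", (10, 20))


def broken_step(catalog: ToyCatalog, branch: str) -> None:
    catalog.write(branch, "orders", ())  # destructive write, but only on the branch
    raise ValueError("data quality check failed")


catalog = ToyCatalog()
catalog.write("main", "orders", (1, 2, 3))
published = run_pipeline(catalog, "feature", [backfill])
rejected = run_pipeline(catalog, "broken", [broken_step])
```

In this sketch the failed run leaves `main` untouched because its writes only ever touched its own branch, and `catalog.rollback()` would restore `main` to its state before the `feature` merge.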

LINKS:
https://www.bauplanlabs.com/

Video from the channel Py AI - Meetup and Conference Series
