Decoding 1951 census: LLMs extract messy tables
Extracting structured data from historical documents sounds simple — until you encounter tables that change layout every few pages, handwritten annotations, inconsistent schemas, and OCR pipelines that completely lose the plot.
In this talk, Aryan Srivastava (Development Data Lab) walks through building an LLM-powered extraction pipeline for India’s 1951 Population Census handbooks — combining contextual reasoning, schema templates, rule-based systems, and evaluation frameworks to reliably parse messy, format-variant tables at scale.
The session covers:
• why traditional OCR pipelines fail on historical tabular data
• using LLMs to infer document structure and semantics
• balancing automation with rule-based validation
• evaluation strategies for extraction reliability
• lessons from processing large-scale census archives
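The rule-based validation mentioned above can be illustrated with a small sketch. Everything here is hypothetical: the schema template, column names, and consistency rule are invented for illustration and are not the speaker's actual pipeline. The idea is that rows extracted by an LLM are checked against a fixed schema (expected columns and types) plus cross-field rules, so extraction errors surface as explicit violations instead of silently corrupting the dataset:

```python
# Hypothetical sketch: rule-based validation of LLM-extracted census table rows.
# Schema template: expected columns and their Python types.
SCHEMA = {
    "district": str,
    "males": int,
    "females": int,
    "total": int,
}

def validate_row(row: dict) -> list[str]:
    """Return a list of rule violations for one extracted row (empty = valid)."""
    errors = []
    # Structural checks: every schema column present, with the right type.
    for col, typ in SCHEMA.items():
        if col not in row:
            errors.append(f"missing column: {col}")
        elif not isinstance(row[col], typ):
            errors.append(f"bad type for {col}: {row[col]!r}")
    # Cross-field consistency rule: total should equal males + females.
    if not errors and row["males"] + row["females"] != row["total"]:
        errors.append("total != males + females")
    return errors

# Illustrative rows, as if returned by an LLM extraction step.
rows = [
    {"district": "Ahmedabad", "males": 120, "females": 110, "total": 230},
    {"district": "Surat", "males": 90, "females": 85, "total": 180},  # inconsistent total
]
for r in rows:
    print(r["district"], validate_row(r))
```

In a real pipeline the violations would typically feed back into a retry prompt or a human-review queue rather than just being printed.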
A practical talk for engineers, researchers, and data practitioners working with unstructured documents, retrieval pipelines, or production LLM systems.
If you enjoy deep practitioner discussions on AI systems, data infrastructure, and production engineering, become part of The Fifth Elephant community:
https://hasgeek.com/fifthelephant#memberships
Video "Decoding 1951 census: LLMs extract messy tables" from the Hasgeek TV channel
Duration: 00:40:09