Productionizing Unstructured Data for AI and Analytics

A large Delta Lake frequently includes a mix of structured and unstructured data. Data teams use Apache Spark™ to analyze structured data, but often struggle to apply the same analysis to unstructured, unlabeled data (e.g. images, video). Teams are forced to use expensive, manual processes to transform unstructured data into something more useful: they either pay a third party to label their data, buy a labeled dataset, or narrow the scope of their project to leverage public datasets. If data teams had faster and more cost-effective ways to convert unstructured data into structured data, they could support more advanced use cases built around their companies’ unique, unstructured datasets.

In this talk, we demonstrate how teams can easily prepare unstructured data for AI and analytics in Databricks. We leverage the LabelSpark library (a connector between Databricks and Labelbox) to connect an unstructured dataset to Labelbox, programmatically set up an ontology for labeling, and return the labeled dataset in a Spark DataFrame. Labeling can be done by humans, AI models in Databricks, or a combination of both. We will also show a model-assisted labeling workflow that allows humans to easily inspect and correct a model’s predicted labels. This process can reduce the amount of unstructured data you need to achieve strong model performance.
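The model-assisted labeling step described above can be illustrated with a minimal sketch. It does not use the LabelSpark or Labelbox APIs; the `Prediction` class, the `split_for_review` and `apply_corrections` helpers, and the 0.8 confidence threshold are all hypothetical names and values chosen for illustration. The idea is only that confident model predictions can be accepted as-is, while uncertain ones are queued for a human to inspect and correct.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    asset_id: str      # hypothetical identifier for an image or video frame
    label: str         # label proposed by the model
    confidence: float  # model confidence in [0, 1]


def split_for_review(predictions, threshold=0.8):
    """Split predictions into auto-accepted ones and a human review queue.

    The threshold is an assumed value, not one taken from the talk.
    The review queue is sorted so the least confident items come first.
    """
    auto_accepted = [p for p in predictions if p.confidence >= threshold]
    needs_review = sorted(
        (p for p in predictions if p.confidence < threshold),
        key=lambda p: p.confidence,
    )
    return auto_accepted, needs_review


def apply_corrections(predictions, corrections):
    """Overlay human corrections (asset_id -> label) on model predictions.

    Returns plain dict rows that record whether each final label came
    from the model or from a human reviewer.
    """
    return [
        {
            "asset_id": p.asset_id,
            "label": corrections.get(p.asset_id, p.label),
            "source": "human" if p.asset_id in corrections else "model",
        }
        for p in predictions
    ]
```

Rows shaped like the output of `apply_corrections` are structured records, so in principle they could be loaded into a Spark DataFrame for analysis alongside the rest of the lake; how LabelSpark itself returns labels is not shown here.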

Labelbox is a training data platform that allows companies to quickly produce structured data from unstructured data. Combining Databricks and Labelbox gives you an end-to-end environment for unstructured data workflows: a query engine built around Delta Lake, fast annotation tools, and a powerful machine learning compute environment.

To learn more, visit www.labelbox.com/databricks-partner

Connect with us:
Website: https://databricks.com
Facebook: https://www.facebook.com/databricksinc
Twitter: https://twitter.com/databricks
LinkedIn: https://www.linkedin.com/company/databricks
Instagram: https://www.instagram.com/databricksinc/

Databricks is proud to announce that Gartner has named us a Leader in both the 2021 Magic Quadrant for Cloud Database Management Systems and the 2021 Magic Quadrant for Data Science and Machine Learning Platforms. Download the reports here: https://databricks.com/databricks-named-leader-by-gartner

Video "Productionizing Unstructured Data for AI and Analytics" from the Databricks channel
Video information
September 21, 2021, 20:04:32
00:25:35