Productionizing Unstructured Data for AI and Analytics
A large Delta Lake frequently includes a mix of structured and unstructured data. Data teams use Apache Spark™ to analyze structured data, but often struggle to apply the same analysis to unstructured, unlabeled data (e.g., images, video). Teams are forced to use expensive, manual processes to transform unstructured data into something more useful: they either pay a third party to label their data, buy a labeled dataset, or narrow the scope of their project to leverage public datasets. If data teams had faster and more cost-effective ways to convert unstructured data into structured data, they could support more advanced use cases built around their companies’ unique, unstructured datasets.
In this talk, we demonstrate how teams can easily prepare unstructured data for AI and analytics in Databricks. We leverage the LabelSpark library (a connector between Databricks and Labelbox) to connect an unstructured dataset to Labelbox, programmatically set up an ontology for labeling, and return the labeled dataset in a Spark DataFrame. Labeling can be done by humans, AI models in Databricks, or a combination of both. We will also show a model-assisted labeling workflow that allows humans to easily inspect and correct a model’s predicted labels. This process can reduce the amount of unstructured data you need to achieve strong model performance.
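As a rough illustration of the model-assisted labeling idea, the sketch below uses plain Python rather than the LabelSpark or Labelbox APIs. The `Prediction` class, the `split_for_review` and `apply_corrections` helpers, the asset IDs, and the 0.8 confidence threshold are all hypothetical names and values chosen for this example; routing only low-confidence predictions to a human reviewer is an assumption on our part, not a description of how Labelbox works.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Prediction:
    asset_id: str      # hypothetical identifier for an image or video
    label: str         # label predicted by the model
    confidence: float  # model confidence in [0, 1]


def split_for_review(
    predictions: List[Prediction], threshold: float = 0.8
) -> Tuple[List[Prediction], List[Prediction]]:
    """Split predictions into (accepted, needs_review) by confidence.

    The threshold is an assumed triage rule for this sketch.
    """
    accepted = [p for p in predictions if p.confidence >= threshold]
    needs_review = [p for p in predictions if p.confidence < threshold]
    return accepted, needs_review


def apply_corrections(
    predictions: List[Prediction], corrections: Dict[str, str]
) -> Dict[str, str]:
    """Return final labels; a human correction overrides the model's label."""
    return {p.asset_id: corrections.get(p.asset_id, p.label) for p in predictions}


predictions = [
    Prediction("img_001", "cat", 0.97),
    Prediction("img_002", "dog", 0.55),
    Prediction("img_003", "cat", 0.62),
]
accepted, needs_review = split_for_review(predictions)

# A reviewer inspects the low-confidence items and corrects one of them.
corrections = {"img_003": "dog"}
final_labels = apply_corrections(predictions, corrections)
```

In this sketch the reviewer only looks at two of the three assets, and the final label set combines the model's predictions with the single human correction.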
Labelbox is a training data platform that allows companies to quickly produce structured data from unstructured data. Combining Databricks and Labelbox gives you an end-to-end environment for unstructured data workflows: a query engine built around Delta Lake, fast annotation tools, and a powerful machine learning compute environment.
To learn more, visit www.labelbox.com/databricks-partner
Connect with us:
Website: https://databricks.com
Facebook: https://www.facebook.com/databricksinc
Twitter: https://twitter.com/databricks
LinkedIn: https://www.linkedin.com/company/databricks
Instagram: https://www.instagram.com/databricksinc/
Databricks is proud to announce that Gartner has named us a Leader in both the 2021 Magic Quadrant for Cloud Database Management Systems and the 2021 Magic Quadrant for Data Science and Machine Learning Platforms. Download the reports here: https://databricks.com/databricks-named-leader-by-gartner
Video "Productionizing Unstructured Data for AI and Analytics" from the Databricks channel