Julie Michelman - Pandas, Pipelines, and Custom Transformers
Description
Using pandas and scikit-learn together can be a bit clunky. For complex preprocessing, the scikit-learn Pipeline conveniently chains together transformers. However, it converts your DataFrame to a NumPy array. In this talk, we will walk through pandas DataFrames, scikit-learn preprocessing and Pipelines, and how to use custom transformers to stay in pandas land.
GitHub Link: https://github.com/jem1031/pandas-pipelines-custom-transformers
Abstract
For data science in Python, the pandas DataFrame is a common choice to store and manipulate data sets. It has named columns, each of which can contain a different data type, and an index to identify rows and assist in joining. The scikit-learn package is the major machine learning library in Python. It implements a wide variety of popular algorithms for feature engineering, supervised learning, and unsupervised learning. Perhaps even more important to its success, scikit-learn provides a uniform interface for these transformers and estimators, making it easy to swap one for another.
Many scikit-learn transformers will take and return pandas DataFrames, but some return only NumPy arrays, which means losing the column names and row index. A few important examples are the meta-transformers Pipeline and FeatureUnion. The Pipeline chains together transformers to be applied in order. The FeatureUnion combines the results of transformers that can be applied in parallel. With these, the entire feature engineering process can be stored in one object and easily applied to new data sets.
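A minimal sketch of the behaviour described above, assuming scikit-learn's default output settings. The toy DataFrame and its column names are invented for illustration and are not taken from the talk:

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical toy data with named columns and a labelled index.
df = pd.DataFrame(
    {"height": [1.5, 1.7, 1.9], "weight": [50.0, 65.0, 80.0]},
    index=["a", "b", "c"],
)

# Pipeline: transformers applied one after another.
pipe = Pipeline([("scale", StandardScaler())])
piped = pipe.fit_transform(df)

# FeatureUnion: transformers applied side by side, results stacked column-wise.
union = FeatureUnion([("std", StandardScaler()), ("minmax", MinMaxScaler())])
unioned = union.fit_transform(df)

# With default settings both results are plain arrays:
# the column names and the row index are gone.
print(type(piped).__name__, piped.shape)
print(type(unioned).__name__, unioned.shape)
```

Each fitted object can then be reused on new data with `transform`, which is what makes it convenient to store the whole preprocessing process in one place.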
Luckily, scikit-learn also provides the ability to write your own custom transformers. It is as simple as defining a new class that implements the fit and transform methods. We can use this to create pandas-friendly versions of the Pipeline and FeatureUnion, as well as add transformations that are not already provided.
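As a rough illustration of the idea, here is a sketch of two custom transformers that return DataFrames. The class names `ColumnSelector` and `DFStandardScaler` and the toy data are hypothetical choices for this example; they are not necessarily the classes used in the talk or its repository:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class ColumnSelector(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: keep only the named columns."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        # Nothing to learn; fit only has to return self.
        return self

    def transform(self, X):
        return X[self.columns]


class DFStandardScaler(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: standardize columns, keeping the DataFrame."""

    def fit(self, X, y=None):
        # Learn per-column mean and population standard deviation.
        self.mean_ = X.mean()
        self.std_ = X.std(ddof=0)
        return self

    def transform(self, X):
        # pandas aligns on column names, so labels and index survive.
        return (X - self.mean_) / self.std_


df = pd.DataFrame(
    {"height": [1.5, 1.7, 1.9], "weight": [50.0, 65.0, 80.0], "name": ["x", "y", "z"]},
    index=["a", "b", "c"],
)

pipe = Pipeline([
    ("select", ColumnSelector(["height", "weight"])),
    ("scale", DFStandardScaler()),
])
out = pipe.fit_transform(df)
print(out)
```

Because every step here returns a DataFrame, the standard Pipeline passes it through unchanged, so the result keeps its column names and row index. A pandas-friendly FeatureUnion could follow the same pattern, for example by joining the parts with `pd.concat(..., axis=1)`.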
www.pydata.org
PyData is an educational program of NumFOCUS, a 501(c)(3) non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R.
PyData conferences aim to be accessible and community-driven, with novice to advanced level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases.