Processing 1M Identity Graphs per Second with Spark Structured Streaming
Adobe Experience Platform’s Identity Graph links over 70 billion identities, processes 1M+ records per second, and handles terabytes of data daily to support identity resolution at scale. With over 25 deployments across seven regions and multiple clouds (Azure & AWS), we use Spark Structured Streaming and Delta tables to ensure stability, low latency, and reliable performance.

As our customer data has grown exponentially, so has the need for more sophisticated and scalable data pipelines. Over the past three years, we’ve scaled our ingestion pipeline by 10x. In this session, we’ll share the architecture, data patterns, and techniques that allowed us to scale, along with the lessons we’ve learned along the way.

We will cover:
- Leveraging micro-batching to reduce duplicate data by 70-80%
- Capturing and monitoring key metrics to track query performance and ensure system stability
- How Delta Lake enabled rate limiting and anomalous identity filtering to maintain system stability
- Managing schema evolution and enforcement to handle changing table schemas (a schema-evolution sketch follows this abstract)
- Using VACUUM to ensure regulatory compliance effectively and reliably from the application standpoint
- Multi-cloud pipeline abstraction
- Async task processing: optimized data ingestion into FoundationDB, Adobe's persistent Identity Graph store
- Minimal disruption to latency with a custom deployment mechanism

Learn how these optimizations enable Adobe Real-Time Customer Data Platform to deliver personalization at scale while maintaining privacy and compliance.
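Of the bullets above, schema evolution is the most directly expressible in code. A minimal Delta Lake sketch, assuming a DataFrame `df` whose schema has gained new columns and an illustrative table path (not Adobe's actual configuration):

```python
# Append a batch whose schema gained new columns; with mergeSchema enabled,
# Delta merges the new columns into the table schema instead of failing
# the write. Path and mode are illustrative.
(
    df.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save("/delta/identity-events")
)
```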
Speakers:
- Akanksha Nagpal (Sr Software Engineer, Adobe)
## TL;DR
Adobe Experience Platform addressed the challenge of processing over 1 million identity graphs per second by leveraging Spark Structured Streaming and Delta Lake, scaling its ingestion pipeline by 10x while maintaining privacy compliance. This approach enabled efficient data handling with reduced latency and kept the system stable, even during peak loads.
## Opening
Imagine processing identity data at the scale of a bustling city like San Francisco every second—Adobe Experience Platform does precisely that. With over 70 billion identities linked and terabytes of data flowing through its systems daily, Adobe faces the daunting task of keeping this data fresh and compliant with privacy standards. This session dives into how the team tackled these challenges using Spark Structured Streaming and Delta Lake, sharing insights from their journey of scaling data pipelines by 10x while maintaining performance and compliance.
## What You'll Learn (Key Takeaways)
- **Leveraging Micro-Batching for Efficiency** – By optimizing Spark micro-batch intervals and implementing deterministic deduplication, Adobe reduced redundant data processing by over 80%, stabilizing workloads and minimizing resource consumption (a deduplication sketch follows this list).
- **Async Task Processing for Latency Reduction** – Implementing an asynchronous execution model for data ingestion allowed Adobe to offload I/O-heavy tasks, resulting in more balanced resource utilization and reduced latency without increasing infrastructure costs (see the async ingestion sketch below).
- **Addressing Data Skew with Repartitioning** – Adobe solved data skew issues by introducing explicit repartitioning logic, achieving better parallelism and reducing task imbalance by over 40% (see the repartitioning sketch below).
- **Ensuring Compliance with Delta Lake** – Through Delta Lake's VACUUM feature and marker files, Adobe effectively managed data retention and regulatory compliance, ensuring secure and compliant data deletion processes (see the VACUUM sketch below).
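The talk doesn't share code, but watermark-bounded deduplication in Structured Streaming is a standard pattern; here is a minimal PySpark sketch, where the Kafka source, the `event_id`/`event_time` columns, and the 10-minute window are illustrative assumptions, not Adobe's actual schema:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("identity-dedup").getOrCreate()

# Streaming source; broker and topic names are illustrative.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "identity-events")
    .load()
)

# Derive a deterministic key per event; `event_id` and `event_time`
# are assumed names, not Adobe's schema.
parsed = events.selectExpr(
    "CAST(value AS STRING) AS payload",
    "timestamp AS event_time",
).withColumn("event_id", F.sha2("payload", 256))

# Watermark-bounded deduplication: Spark keeps dedup state only for events
# newer than the watermark, so state stays bounded while duplicates that
# arrive within 10 minutes of each other are dropped.
deduped = (
    parsed
    .withWatermark("event_time", "10 minutes")
    .dropDuplicates(["event_id", "event_time"])
)
```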
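For async ingestion, one common shape of the idea (continuing the sketch above; the `write_chunk` function is a hypothetical stand-in for a FoundationDB batch write, since Adobe's client code is not public) is to overlap I/O inside each partition with a small thread pool so executor cores aren't blocked waiting on the store:

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK = 500  # rows per write call; illustrative

def write_chunk(rows):
    # Hypothetical stand-in for a batched FoundationDB write.
    pass

def write_partition(rows):
    # Overlap network I/O within one partition using a small thread pool,
    # so the executor core isn't idle while a write is in flight.
    with ThreadPoolExecutor(max_workers=4) as pool:
        chunk, futures = [], []
        for row in rows:
            chunk.append(row)
            if len(chunk) == CHUNK:
                futures.append(pool.submit(write_chunk, chunk))
                chunk = []
        if chunk:
            futures.append(pool.submit(write_chunk, chunk))
        for f in futures:
            f.result()  # propagate failures so Spark retries the task

def ingest_batch(batch_df, batch_id):
    batch_df.foreachPartition(write_partition)

query = (
    deduped.writeStream
    .foreachBatch(ingest_batch)
    .option("checkpointLocation", "/chk/identity-ingest")
    .start()
)
```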
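The talk doesn't name the skewed keys, so the repartitioning sketch below is generic: it assumes a hypothetical skewed column `identity_ns` and uses salting, with 32 salt buckets and 256 partitions as illustrative numbers:

```python
from pyspark.sql import functions as F

# Hypothetical: a few identity namespaces dominate the stream, so tasks
# handling those keys run long. A random salt spreads hot keys across
# many partitions for better parallelism.
balanced = (
    deduped
    .withColumn("salt", (F.rand() * 32).cast("int"))
    .repartition(256, F.col("identity_ns"), F.col("salt"))
)
```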
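Finally, Delta Lake's VACUUM physically removes data files that are no longer referenced by the table log once they age past the retention window, which is what makes application-level deletion durable for compliance. A minimal sketch, with the table path and 7-day retention as illustrative values:

```python
from delta.tables import DeltaTable

# Physically delete files dropped from the Delta log and older than the
# retention window, so removed identities are gone from storage as well.
DeltaTable.forPath(spark, "/delta/identity-events").vacuum(retentionHours=168)
```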
## Q&A Highlights
Q: How did you evaluate Spark versus Flink, and why choose Spark for this use case?
A: We opted for Spark because it integrates seamlessly with our existing batch pipelines, reducing duplicate code and management overhead. Spark also benefits from managed services on platforms like Databricks, which was crucial for operational efficiency.
Q: Can you explain what identity graphs are and how they work with Spark?
A: Identity graphs unify fragmented identifiers into a single profile, crucial for personalization. Spark's distributed processing handles the large-scale data ingestion efficiently, supporting our proprietary algorithms for identity resolution.
Q: How do you ensure data freshness and avoid stale data issues with in-memory snapshots?
A: We use Netflix Hollow for in-memory snapshots and implement a custom heartbeat check to ensure data remains fresh, preventing stale data from affecting long-running applications.
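The Hollow integration itself isn't shown in the talk; as a purely illustrative stand-in, a freshness heartbeat can be as simple as comparing the last snapshot publish time against a budget, with the 15-minute threshold and function name being assumptions:

```python
import time

SNAPSHOT_MAX_AGE_SECS = 15 * 60  # freshness budget; illustrative

# Hypothetical heartbeat: the snapshot producer stamps each publish time,
# and long-running consumers alert (or refuse to serve) if it goes stale.
def check_snapshot_freshness(last_publish_epoch_secs: float) -> None:
    age = time.time() - last_publish_epoch_secs
    if age > SNAPSHOT_MAX_AGE_SECS:
        raise RuntimeError(
            f"Identity snapshot is {age:.0f}s old; refusing to serve stale data"
        )
```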
Q: How does your deployment strategy ensure no downtime?
A: We use a blue-green deployment strategy, leveraging dependency injection and monitoring new deployments closely to ensure a smooth transition without any service disruption.
By sharing their journey, Adobe provides valuable insights into scaling real-time data processing while maintaining compliance, offering a robust example for data streaming practitioners seeking to optimize their own pipelines.