Data Pipeline Observability, A Streamlit Application


Overview  

One of our recent client projects struggled to observe their data pipelines across two vendors – Matillion & DBT. This article describes how the CloudEQS team built a Streamlit application on top of Snowflake to give data engineers, support engineers, and leadership a centralized place to observe the health of their ETL pipelines.

Business Problem 

Data engineering teams often rely on multiple data pipeline vendors to get their data analytics- and AI-ready. Because these solutions are disparate, metadata silos persist. Without a centralized observability layer, teams spend more time troubleshooting ETL pipelines and less time building data pipelines and solving business problems.

Why It’s Important 

Data pipeline observability addresses these challenges by providing organizations with the tools and insights needed to understand their data flow fully. Here’s why it’s essential: 

  1. Proactive Monitoring: Observability enables teams to detect issues before they escalate into larger problems.  
  2. Enhanced Data Quality: With observability, teams can implement data validation checks and establish benchmarks for quality. This leads to cleaner data, which in turn supports more reliable analytics. 
  3. Collaboration Across Teams: Improved visibility fosters collaboration between data engineers, analysts, and business stakeholders.  
  4. Compliance and Governance: Observability tools can automate data lineage tracking, ensuring organizations can demonstrate compliance with regulations and maintain trust with stakeholders. 

Solution Approach 

Leveraging metadata-driven design, the CloudEQS team developed a solution to this problem. Here are the steps we followed:

  1. Metadata Tables: Leveraging the Matillion & DBT APIs, we built metadata tables that consolidate ETL job names, schedules, job statuses, run times, etc.  
  2. Snowflake Roles: Before creating a Streamlit app within a Snowflake account, we needed to create roles with access to the database & schema where the metadata tables and master staging tables live.  
  3. Streamlit Codebase: Leveraging the simplicity of Snowflake’s integration with Streamlit, the team turned the underlying Python code into a web application. 
  4. Consolidated Insights: The application dashboard surfaces insights across Matillion & DBT.  
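As an illustration of the metadata-consolidation step above, here is a minimal Python sketch of normalizing job-run records from the two vendors into one common schema. The field names and payload shapes below are assumptions for illustration, not the actual Matillion or DBT API responses:

```python
from datetime import datetime, timezone

# Common schema for the (hypothetical) metadata table:
# job_name, vendor, status, started_at, duration_seconds

def normalize_matillion(run: dict) -> dict:
    """Map a Matillion-style job-run payload to the common schema.
    Field names here are illustrative, not the real API response."""
    return {
        "job_name": run["jobName"],
        "vendor": "Matillion",
        "status": run["state"].upper(),
        "started_at": datetime.fromtimestamp(run["startTime"] / 1000, tz=timezone.utc),
        "duration_seconds": (run["endTime"] - run["startTime"]) / 1000,
    }

def normalize_dbt(run: dict) -> dict:
    """Map a DBT-style run payload to the common schema (illustrative fields)."""
    return {
        "job_name": run["job"]["name"],
        "vendor": "DBT",
        "status": run["status"].upper(),
        "started_at": datetime.fromisoformat(run["started_at"]),
        "duration_seconds": run["duration"],
    }

def consolidate(matillion_runs: list, dbt_runs: list) -> list:
    """Merge both vendors' runs into one list, newest first."""
    rows = [normalize_matillion(r) for r in matillion_runs]
    rows += [normalize_dbt(r) for r in dbt_runs]
    return sorted(rows, key=lambda r: r["started_at"], reverse=True)
```

In the actual application, rows in this shape are written to the metadata tables in Snowflake, where the Streamlit dashboard queries them.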

Expected Business Outcomes 

The outcomes our client has achieved with this new observability layer include: 

  1. Leadership-level observability of ETL pipeline health  
  2. The ability to search ETL jobs within a given timeframe 
  3. Streamlined communication to data consumers  
  4. Faster resource allocation to resolve ETL failures  
  5. An enhanced knowledge base of ETL jobs & failures 
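The timeframe search among these outcomes can be sketched as a simple filter over consolidated job-run rows. The column names (`job_name`, `status`, `started_at`) are assumptions for illustration, not the application's actual schema:

```python
from datetime import datetime
from typing import Optional

def search_jobs(rows: list, start: datetime, end: datetime,
                status: Optional[str] = None) -> list:
    """Return job runs whose start time falls in [start, end],
    optionally filtered by status (case-insensitive)."""
    hits = [r for r in rows if start <= r["started_at"] <= end]
    if status is not None:
        hits = [r for r in hits if r["status"] == status.upper()]
    return hits
```

In a Streamlit dashboard, the `start`, `end`, and `status` arguments would typically come from date-picker and select-box widgets.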

What’s Next? 

For now, this application is a Connected Application Powered by Snowflake: a Snowflake user can access the application codebase from CloudEQS’s domain and integrate it into their own Snowflake instance.

The next step in this application’s evolution is to make it a Native Application on the Snowflake Marketplace, so that any Snowflake, Matillion & DBT customer can self-serve it.

Conclusion 

The CloudEQS team addressed a client’s challenge in observing data pipelines across Matillion and DBT by building a centralized Streamlit application on Snowflake. This solution consolidates metadata from various ETL jobs, enhancing visibility and enabling proactive monitoring, improved data quality, and streamlined collaboration among teams. The application allows leadership to monitor ETL pipeline health, search job statuses, and streamline communication for quicker resolution of failures.


Asheesh is our Head of Delivery and seasoned Data Professional with 30+ years of experience.
