dlt April ’24 updates: Growing together, one commit at a time
1. About our community
New ways we amplify the community
The community is growing and creating content about dlt, and we will showcase this content more often, particularly when it is broadly useful. As a first step, we created a YouTube channel to collect video content. Find it here: https://www.youtube.com/@dltHub/playlists.
Your content not only helps the community and builds your own learning and brand; it sometimes takes the place of our own demos, freeing us to focus on moving dlt forward in other areas. For example, Francesco and Willi, who worked closely with us on the REST API source, created tutorial videos on how to use it. We will feature their videos as part of the REST API source launch, which will include a variety of usage examples.
How to get featured by dltHub?
If you would like us to consider featuring one of your OSS projects, blog posts, events, or community offers, let us know in Slack in #sharing-and-contributing.
Next event - PyCon US, May 16th - 18th
Our team presented at PyCon Germany, and we enjoyed hosting after-parties and meeting the community!
Our next stop is PyCon US in Pittsburgh next week. Meet Matt or Violetta at our booth on Startup Row from May 16 to 18, or DM them on our Slack.
Highlighted contributions
This month, we saw more community engagement, and a new destination was donated!
🔥🔥🔥 Add Dremio as a destination by @maxfirman in #1026 🔥🔥🔥
Fix Athena Iceberg's trailing location by @romanperesypkin in #1230
Fix typo with switched column names in schema evolution docs page by @b-per in #1132
Adding images and wordsmithing to Prefect walkthrough by @WillRaphaelson in #1276
Update example connection string by @MiConnell in #1188
Remove upper bound on dlt dependency in Kafka source by @aksestok in #415
Add optional type annotation to GitHub source for pyright adherence by @cmpadden in #408
Add include_custom_props to HubSpot source by @cmpadden in #404
2. Recent product developments and new features
The long-awaited REST API source is here!
We have been talking about this one for a while, and it’s finally here. If you have been paying attention, you may have already found the source in our verified sources, along with some videos on the YouTube channel we created. We are adding usage examples and better docs, and we will announce it next Tuesday, May 14th.
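For a quick preview, the source is configured declaratively. The sketch below models the shape of that config as a plain dict; the base URL, endpoint paths, and resource names are placeholder assumptions for illustration, not from the announcement. With dlt installed, you would pass such a dict to rest_api_source(...).

```python
# Illustrative sketch of the declarative config the REST API source accepts.
# All concrete values here (base_url, "berry", "pokemon") are placeholders.
config = {
    "client": {
        "base_url": "https://pokeapi.co/api/v2/",
    },
    "resource_defaults": {
        "write_disposition": "replace",
    },
    "resources": [
        # Shorthand: a bare string means endpoint path == resource name.
        "berry",
        # Full form: name plus an explicit endpoint definition.
        {
            "name": "pokemon",
            "endpoint": {"path": "pokemon", "params": {"limit": 100}},
        },
    ],
}

# Both the shorthand string and the full dict describe one REST endpoint.
named = [r if isinstance(r, str) else r["name"] for r in config["resources"]]
print(named)  # → ['berry', 'pokemon']
```

The appeal of the declarative form is that adding an endpoint is one more entry in `resources`, rather than a new Python function per endpoint.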
You asked for SCD2 (Slowly Changing Dimension Type 2), and here it is!
Responding to your requests, we're excited to introduce SCD2 (Slowly Changing Dimension Type 2) capabilities — now available for more dynamic data handling! Learn all about implementing our SCD2 strategy here.
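To illustrate what the strategy does, here is a minimal pure-Python sketch of the SCD2 idea: a changed row closes out the previous version and appends a new open-ended one. The `_dlt_valid_from`/`_dlt_valid_to` column names mirror the dlt docs, but the helper itself is a hypothetical illustration, not dlt code; in dlt you enable the behavior via a resource's `write_disposition={"disposition": "merge", "strategy": "scd2"}`.

```python
from datetime import datetime, timezone

# Hypothetical sketch of the SCD2 mechanics (not dlt internals): instead of
# overwriting a changed row, the current version is closed out by setting
# _dlt_valid_to, and a new open-ended version is appended.
def scd2_upsert(history, key, new_attrs, now):
    for row in history:
        if row["id"] == key and row["_dlt_valid_to"] is None:
            if row["attrs"] == new_attrs:
                return history  # unchanged: keep the current version open
            row["_dlt_valid_to"] = now  # close the superseded version
    history.append({"id": key, "attrs": new_attrs,
                    "_dlt_valid_from": now, "_dlt_valid_to": None})
    return history

now = datetime(2024, 5, 1, tzinfo=timezone.utc)
h = scd2_upsert([], "c1", {"plan": "free"}, now)
later = datetime(2024, 5, 2, tzinfo=timezone.utc)
h = scd2_upsert(h, "c1", {"plan": "pro"}, later)
print(len(h))  # → 2: one closed version, one open
```

The payoff is that the table keeps full history: querying rows where `_dlt_valid_to` is NULL gives the current state, while a point-in-time query filters on the validity window.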
New sources
New backends for sql_database: we added pandas, pyarrow, and connectorx backends; you can read more about them here. In our benchmarks, they provided up to a 30x speedup.
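For orientation, the backend is selected via a backend argument on the source, shown commented out below since it requires dlt and the backend's extras installed (the exact import path may also differ between dlt versions). The tuple just records the options named above plus the default, which per the dlt docs is sqlalchemy; treat that default as an assumption if your version differs.

```python
# Real usage looks roughly like this (requires dlt + the backend's extra,
# e.g. pyarrow; import path is an assumption and may vary by dlt version):
#
#   from dlt.sources.sql_database import sql_database
#   source = sql_database(credentials, backend="pyarrow")
#
# The options announced above, plus the assumed default:
BACKENDS = ("sqlalchemy", "pandas", "pyarrow", "connectorx")
default = BACKENDS[0]
print(default)  # → sqlalchemy
```

The arrow-based backends are where the speedup comes from: they move rows in columnar batches instead of materializing Python objects row by row.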
Postgres CDC enables real-time data replication from PostgreSQL databases. This source is designed to capture incremental changes, allowing for efficient data synchronization without the overhead of full database loads. It's particularly useful for scenarios where maintaining data freshness is critical, such as in data warehousing and real-time analytics. Link (website docs pending).
Google Ads: We had this one sitting on a branch for a long time, waiting for a test credential from Google that never came, so here it is without daily tests: https://dlthub.com/docs/dlt-ecosystem/verified-sources/google_ads
New destinations
Dremio acts as a query acceleration layer that connects to various data sources. It helps simplify and speed up data analysis by allowing SQL queries to be executed directly on the data source, reducing the need for data movement and preparation. Dremio benefits organizations looking to enhance their BI tools' performance without extensive ETL processes.
Clickhouse is a high-performance columnar database management system optimized for OLAP. It’s particularly good at handling large-scale tables such as event data or activity schema architectures.
💡 Read the detailed commit/release logs here: dlt, Sources.
3. Coming up next
REST API Source + OpenAPI generator = ready-made pipelines
This month, we are focusing on using the REST API source to generate many sources, aiming for 100+ pipelines. If you’d like us to include your favorite OpenAPI spec, let us know on Slack.
ML destinations
By popular demand, LanceDB is the next destination we are working on. Discussions are also ongoing around Hugging Face, IBIS, native PyIceberg, and Delta table destinations.
Does your company want to upgrade its Pythonic data platform? Apply for a design partnership with dltHub
A design partnership offers companies a fast track to fix issues, prioritize OSS tickets, or build custom sources, plugins, and data transformations. It also helps us understand our customers to better serve their needs.
Our mission is to help data teams build Python-first data platforms, and in recent weeks we have seen increasing demand from companies in our community for help doing that. One driver is the rise of AI and the need to make new sources of data available to LLMs.
Last week we started a design partnership program for teams working with OSS dlt. If you are interested in a design partnership, let us know in the #3-technical-help Slack channel.
Future sources and destinations
We want to change the way you think about pipelines.
While pipelines will never be fully disposable or trivial to build, we aim for a future where moving data is no longer a challenge. A big step in this direction is the source generation project, which aims to standardize how data producers make data available.
Much like the OpenAPI standard enabled generating Swagger docs, it could also enable generating pipelines. Imagine just pointing dlt at the docs and being done!
💡 Get involved: See our short-term roadmap here and tell us what you need.
Closing words
Our work at dlt lies at the intersection of human curiosity and technological potential.
Our team is committed to balancing innovative research with building practical, reliable data tools.
The latest updates to our library have been heavily influenced by your feedback and the demands of real-world usage.
Adrian Brudaru
Co-founder, dlt