dlt June ’24 updates: Major Release, Workshops, and Community Contributions
Community updates
REST API source is very popular
We are hearing about a lot of usage of the REST API generic source and we see a lot of downloads of the template. We are responding to your feedback and improving the source with your requirements. If you are missing anything about it, let us know and together we can make it into the best REST API generic source there is.
RAGs are becoming common in data engineering
We are seeing a lot of downloads of the inbox source and other sources of unstructured data, such as the scraper, intended for usage with RAGs.
To better support this, we added LanceDB support and ran a workshop on how to build them. More detail can be found further below.
Learn about dlt + LanceDB in this Data Talks Club LLM Zoomcamp workshop.
We added LanceDB support, and with it we showcased how easy it is to create modern data products with LLMs. Running from a free Colab notebook, we demoed how to create a production-grade RAG in under 2h.
Check it out here:
Workshop YouTube video: Open source data ingestion for RAGs with dlt - Akela Drissner
Workshop materials: Github repository, Slides, Colab Notebook
Conferences
EuroPython in Prague
We were at EuroPython this week, where we had a lot of fun with interesting technical talks, meme competitions, and Python quizzes at our booth.
Check out our LinkedIn posts to see more photos from the event!
Subscribe to our YouTube channel so as not to miss the recording of Adrian's and Violetta’s talk: “From pandas to production: ELT with dlt”. The channel also contains highlights from the conference, tutorials and workshops!
Upcoming: San Francisco, Seattle
Our CEO Matthaus is travelling to San Francisco in the week of July 29th to August 2nd. Reach out if you want to meet him. Next month our CTO Marcin will be at DuckCon (15th of August in Seattle), and our friends from SDF will be there too. Marcin will be working out of the MotherDuck office for the week, so let him know if you want to meet him.
First freelance & consulting technical certifications
Many of you in our community are agencies or freelancers. We want to work more closely with you and support you in your work delivering to 3rd parties. To this end, we have created a first version of a technical certification for freelancers and consultancies.
This is a stepping stone to becoming a partner. We are preparing an equivalent business certification, and a partnership framework that will define our collaboration options. So if you are interested, apply first for this step and we will let you know as soon as the other elements are in place.
Apply here.
Workshops for data professionals are next on the list
We have been working on content for a 4h data loading workshop for data professionals.
Our ask: We would like to understand the level of commitment and the preferred ways to participate and learn in such a workshop. If you could give us 3 minutes of your time to answer a few short related questions, we would appreciate it. Respondents will be notified once we have selected the format and schedule.
Community contribution highlights:
Content:
Daniel’s tutorial on incremental loading was picked up by Towards Data Science! Thanks a lot Daniel! https://towardsdatascience.com/3-essential-questions-to-address-when-building-an-api-involved-incremental-data-loading-script-03723cad3411
Haq Nawaz created another video as part of his series. Thank you Haq!
Fran Lozano created a plug and play repo and Medium post for Stripe→Postgres. Thank you Fran! https://franloza.medium.com/what-is-the-easiest-way-to-move-data-from-stripe-to-postgres-54d5e489bda8
Finally, Jorrit Sandbrink, who supports us as a freelance engineer, wrote an article about how dlt uses Apache Arrow under the hood: https://dlthub.com/docs/blog/how-dlt-uses-apache-arrow
Code & docs:
Add fallback value for tz in row_tuples_to_arrow (sql_database helpers) @khoadaniel dlt-hub/verified-sources#493
Add stargazers GraphQL query for GitHub dlt-hub/verified-sources#483 by @cybermaxs
Fix: allow loggeradapter in addition to logger in logcollector by @matsmhans1 in #1483
Docs: Fixed markdown issue in duckdb.md by @PabloCastellano in #1528
Docs: Update grouping-resources.md docs by @axellpadilla in #1538
💡 Sign up for the future Python ELT workshop here: Link
2. What we did - Major version release 🙌
This is a major release (0.4 -> 0.5) in our versioning scheme, so please review the breaking changes. Most of them are relevant only for platform builders that use dlt internals. Some of the long-deprecated components were removed as well. Read more here in the release notes.
LanceDB support
Why do we keep raving about it? We love that it’s part of the composable ecosystem that dlt also heavily leans on. This composable ecosystem is an open source standard that aims to modernise how we work with data in open source across applications like the BI-focused Modern Data Stack or AI or ML platforms. It’s similar to DuckDB in that you can run it in-process, which makes it a great building block for modern embedded applications or data products. Read more about it here:
Delta tables on filesystem!
You can now write the Delta table format to the filesystem destination. PR, Docs.
import dlt

@dlt.resource(table_name="a_delta_table", table_format="delta")
def a_resource():
    ...
Improvements to core sources
We made improvements to REST API and SQL sources. We will keep doing so based on the GitHub issues you open.
Improvements to existing things:
Snowflake technical certification requirements drove some performance enhancements, such as faster ingestion where normalisation is not needed: external files, including JSONL, CSV, and Parquet, can be imported and loaded directly. The Snowflake destination now also supports the CSV format.
Case sensitivity support for multiple destinations, along with fully customisable naming conventions.
Schema migrations are now faster, using a single SQL statement where possible instead of a per-column/per-table approach.
Various performance updates and bugfixes.
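To picture the schema migration change in the list above, here is a rough sketch that contrasts one statement per column with one statement for all new columns. It is an illustration only, not dlt's internal code: the function names, the table and the columns are invented, and the multi-column `ALTER TABLE` syntax shown is not accepted by every SQL dialect (hence "where possible").

```python
from typing import Dict, List


def per_column_statements(table: str, columns: Dict[str, str]) -> List[str]:
    """Build one ALTER TABLE statement per new column."""
    return [
        f'ALTER TABLE "{table}" ADD COLUMN "{name}" {sql_type};'
        for name, sql_type in columns.items()
    ]


def single_statement(table: str, columns: Dict[str, str]) -> str:
    """Build a single ALTER TABLE statement that adds all new columns."""
    clauses = ", ".join(
        f'ADD COLUMN "{name}" {sql_type}' for name, sql_type in columns.items()
    )
    return f'ALTER TABLE "{table}" {clauses};'


# Hypothetical new columns discovered during a load
new_columns = {"signup_source": "TEXT", "last_seen_at": "TIMESTAMP"}

for statement in per_column_statements("users", new_columns):
    print(statement)  # one round trip to the database per column
print(single_statement("users", new_columns))  # one round trip in total
```

Fewer statements mean fewer round trips to the destination, which is where the speed-up comes from on wide tables.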
To help prioritise, we consider GitHub issues above Slack requests because they show a deeper commitment to collaboration on requirements and testing of the feature.
💡 Read the detailed commit/release logs here: dlt, Sources.
3. Coming up next
Towards a paid offering.
We are learning from data platform builders about your needs, so we can create something of value around our open core.
If your team is doing one of the following 3 things and is interested in help, let us know. The themes are:
Legacy modernisation / migrate from SaaS vendors.
Building a data platform & integration with your ecosystem and infrastructure.
Democratize data access securely with PII data handling and other similar requirements.
💡 Book a call with our support engineer Violetta or get in touch with her on Slack.
A short recap of what else is coming
We are improving the REST API source with your feedback.
Workshop for data engineers: Take a zero to hero ELT with dlt workshop (around 4h), principles first, dlt second. Please fill in this application if you are interested in taking the course. We will try our best to cater to your preferred format.
Meet Marcin at DuckCon (15th of August in Seattle).
And a variety of other things, read more in our GitHub project roadmap.
Thanks for being a part of the dltHub community!
Keep cool❄️🥶😎 this summer by building your data platform without catching fire and reducing your maintenance with dlt.
PS: dlt is the most downloaded ELT library with 0.5m monthly downloads. Let’s make it the top ELT tool.
💡 Get involved: See our short-term roadmap here and tell us what you need.