Data Engineering Podcast

Technology Education

Episodes

Showing 401-494 of 494

Digging Into Data Replication At Fivetran

12 Aug 2019

Contributed by Lukas

Summary The extract and load pattern of data replication is the most commonly needed process in data engineering workflows. Because of the myriad sou...

Solving Data Discovery At Lyft

05 Aug 2019

Contributed by Lukas

Summary Data is only valuable if you use it for something, and the first step is knowing that it is available. As organizations grow and data sources...

Simplifying Data Integration Through Eventual Connectivity

29 Jul 2019

Contributed by Lukas

Summary The ETL pattern that has become commonplace for integrating data from multiple sources has proven useful, but complex to maintain. For a smal...

Straining Your Data Lake Through A Data Mesh

22 Jul 2019

Contributed by Lukas

Summary The current trend in data management is to centralize the responsibilities of storing and curating the organization’s information to a ...

Data Labeling That You Can Feel Good About With CloudFactory

15 Jul 2019

Contributed by Lukas

Summary Successful machine learning and artificial intelligence projects require large volumes of data that is properly labelled. The challenge is th...

Scale Your Analytics On The Clickhouse Data Warehouse

08 Jul 2019

Contributed by Lukas

Summary The market for data warehouse platforms is large and varied, with options for every use case. ClickHouse is an open source, column-oriented d...

Stress Testing Kafka And Cassandra For Real-Time Anomaly Detection

02 Jul 2019

Contributed by Lukas

Summary Anomaly detection is a capability that is useful in a variety of problem domains, including finance, internet of things, and systems monitori...

The Workflow Engine For Data Engineers And Data Scientists

25 Jun 2019

Contributed by Lukas

Summary Building a data platform that works equally well for data engineering and data science is a task that requires familiarity with the needs of ...

Maintaining Your Data Lake At Scale With Spark

17 Jun 2019

Contributed by Lukas

Summary Building and maintaining a data lake is a choose your own adventure of tools, services, and evolving best practices. The flexibility and free...

Managing The Machine Learning Lifecycle

10 Jun 2019

Contributed by Lukas

Summary Building a machine learning model can be difficult, but that is only half of the battle. Having a perfect model is only useful if you are abl...

Evolving An ETL Pipeline For Better Productivity

04 Jun 2019

Contributed by Lukas

Summary Building an ETL pipeline can be a significant undertaking, and sometimes it needs to be rebuilt when a better option becomes available. In th...

Data Lineage For Your Pipelines

27 May 2019

Contributed by Lukas

Summary Some problems in data are well defined and benefit from a ready-made set of tools. For everything else, there’s Pachyderm, the platform...

Build Your Data Analytics Like An Engineer With DBT

20 May 2019

Contributed by Lukas

Summary In recent years the traditional approach to building data warehouses has shifted from transforming records before loading, to transforming th...

Using FoundationDB As The Bedrock For Your Distributed Systems

07 May 2019

Contributed by Lukas

Summary The database market continues to expand, offering systems that are suited to virtually every use case. But what happens if you need something...

Running Your Database On Kubernetes With KubeDB

29 Apr 2019

Contributed by Lukas

Summary Kubernetes is a driving force in the renaissance around deploying and running applications. However, managing the database layer is still a s...

Unpacking Fauna: A Global Scale Cloud Native Database

22 Apr 2019

Contributed by Lukas

Summary One of the biggest challenges for any business trying to grow and reach customers globally is how to scale their data storage. FaunaDB is a c...

Index Your Big Data With Pilosa For Faster Analytics

15 Apr 2019

Contributed by Lukas

Summary Database indexes are critical to ensure fast lookups of your data, but they are inherently tied to the database engine. Pilosa is rewriting t...

Serverless Data Pipelines On DataCoral

08 Apr 2019

Contributed by Lukas

Summary How much time do you spend maintaining your data pipeline? How much end user value does that provide? Raghu Murthy founded DataCoral as a way...

Why Analytics Projects Fail And What To Do About It

01 Apr 2019

Contributed by Lukas

Summary Analytics projects fail all the time, resulting in lost opportunities and wasted resources. There are a number of factors that contribute to ...

Building An Enterprise Data Fabric At CluedIn

25 Mar 2019

Contributed by Lukas

Summary Data integration is one of the most challenging aspects of any data platform, especially as the variety of data sources and formats grows. Ent...

A DataOps vs DevOps Cookoff In The Data Kitchen

18 Mar 2019

Contributed by Lukas

Summary Delivering a data analytics project on time and with accurate information is critical to the success of any business. DataOps is a set of pra...

Customer Analytics At Scale With Segment

04 Mar 2019

Contributed by Lukas

Summary Customer analytics is a problem domain that has given rise to its own industry. In order to gain a full understanding of what your users are ...

Deep Learning For Data Engineers

25 Feb 2019

Contributed by Lukas

Summary Deep learning is the latest class of technology that is gaining widespread interest. As data engineers we are responsible for building and ma...

Speed Up Your Analytics With The Alluxio Distributed Storage System

19 Feb 2019

Contributed by Lukas

Summary Distributed storage systems are the foundational layer of any big data stack. There are a variety of implementations which support different ...

Machine Learning In The Enterprise

11 Feb 2019

Contributed by Lukas

Summary Machine learning is a class of technologies that promise to revolutionize business. Unfortunately, it can be difficult to identify and execut...

Cleaning And Curating Open Data For Archaeology

04 Feb 2019

Contributed by Lukas

Summary Archaeologists collect and create a variety of data as part of their research and exploration. Open Context is a platform for cleaning, curat...

Managing Database Access Control For Teams With strongDM

29 Jan 2019

Contributed by Lukas

Summary Controlling access to a database is a solved problem… right? It can be straightforward for small teams and a small number of storage en...

Building Enterprise Big Data Systems At LEGO

21 Jan 2019

Contributed by Lukas

Summary Building internal expertise around big data in a large organization is a major competitive advantage. However, it can be a difficult process ...

TimescaleDB: The Timeseries Database Built For SQL And Scale - Episode 65

14 Jan 2019

Contributed by Lukas

Summary The past year has been an active one for the timeseries market. New products have been launched, more businesses have moved to streaming an...

Performing Fast Data Analytics Using Apache Kudu - Episode 64

07 Jan 2019

Contributed by Lukas

Summary The Hadoop platform is purpose built for processing large, slow moving data in long-running batch jobs. As the ecosystem around it has grow...

Simplifying Continuous Data Processing Using Stream Native Storage In Pravega with Tom Kaitchuck - Episode 63

31 Dec 2018

Contributed by Lukas

Summary As more companies and organizations are working to gain a real-time view of their business, they are increasingly turning to stream process...

Continuously Query Your Time-Series Data Using PipelineDB with Derek Nelson and Usman Masood - Episode 62

24 Dec 2018

Contributed by Lukas

Summary Processing high velocity time-series data in real-time is a complex challenge. The team at PipelineDB has built a continuous query engine t...

Advice On Scaling Your Data Pipeline Alongside Your Business with Christian Heinzmann - Episode 61

17 Dec 2018

Contributed by Lukas

Summary Every business needs a pipeline for their critical data, even if it is just pasting into a spreadsheet. As the organization grows and gains...

Putting Apache Spark Into Action with Jean Georges Perrin - Episode 60

10 Dec 2018

Contributed by Lukas

Summary Apache Spark is a popular and widely used tool for a variety of data oriented projects. With the large array of capabilities, and the compl...

Apache Zookeeper As A Building Block For Distributed Systems with Patrick Hunt - Episode 59

03 Dec 2018

Contributed by Lukas

Summary Distributed systems are complex to build and operate, and there are certain primitives that are common to a majority of them. Rather than r...

Set Up Your Own Data-as-a-Service Platform On Dremio with Tomer Shiran - Episode 58

26 Nov 2018

Contributed by Lukas

Summary When your data lives in multiple locations, belonging to at least as many applications, it is exceedingly difficult to ask complex question...

Stateful, Distributed Stream Processing on Flink with Fabian Hueske - Episode 57

19 Nov 2018

Contributed by Lukas

Summary Modern applications and data platforms aspire to process events and data in real time at scale and with low latency. Apache Flink is a true...

How Upsolver Is Building A Data Lake Platform In The Cloud with Yoni Iny - Episode 56

11 Nov 2018

Contributed by Lukas

Summary A data lake can be a highly valuable resource, as long as it is well built and well managed. Unfortunately, that can be a complex and time-...

Self Service Business Intelligence And Data Sharing Using Looker with Daniel Mintz - Episode 55

05 Nov 2018

Contributed by Lukas

Summary Business intelligence is a necessity for any organization that wants to be able to make informed decisions based on the data that they coll...

Using Notebooks As The Unifying Layer For Data Roles At Netflix with Matthew Seal - Episode 54

29 Oct 2018

Contributed by Lukas

Summary Jupyter notebooks have gained popularity among data scientists as an easy way to do exploratory analysis and build interactive reports. How...

Of Checklists, Ethics, and Data with Emily Miller and Peter Bull (Cross Post from Podcast.__init__) - Episode 53

22 Oct 2018

Contributed by Lukas

Summary As data science becomes more widespread and has a bigger impact on the lives of people, it is important that those projects and products ar...

Improving The Performance Of Cloud-Native Big Data At Netflix Using The Iceberg Table Format with Ryan Blue - Episode 52

15 Oct 2018

Contributed by Lukas

Summary With the growth of the Hadoop ecosystem came a proliferation of implementations for the Hive table format. Unfortunately, with no formal sp...

Combining Transactional And Analytical Workloads On MemSQL with Nikita Shamgunov

09 Oct 2018

Contributed by Lukas

Summary One of the most complex aspects of managing data for analytical workloads is moving it from a transactional database into the data warehouse. W...

Building A Knowledge Graph From Public Data At Enigma With Chris Groskopf - Episode 50

01 Oct 2018

Contributed by Lukas

Summary There are countless sources of data that are publicly available for use. Unfortunately, combining those sources and making them useful in a...

A Primer On Enterprise Data Curation with Todd Walter - Episode 49

24 Sep 2018

Contributed by Lukas

Summary As your data needs scale across an organization the need for a carefully considered approach to collection, storage, organization, and access...

Take Control Of Your Web Analytics Using Snowplow With Alexander Dean - Episode 48

17 Sep 2018

Contributed by Lukas

Summary Every business with a website needs some way to keep track of how much traffic they are getting, where it is coming from, and which actions...

Keep Your Data And Query It Too Using Chaos Search with Thomas Hazel and Pete Cheslock - Episode 47

10 Sep 2018

Contributed by Lukas

Summary Elasticsearch is a powerful tool for storing and analyzing data, but when using it for logs and other time oriented information it can beco...

An Agile Approach To Master Data Management with Mark Marinelli - Episode 46

03 Sep 2018

Contributed by Lukas

Summary With the proliferation of data sources to give a more comprehensive view of the information critical to your business it is even more impor...

Protecting Your Data In Use At Enveil with Ellison Anne Williams - Episode 45

27 Aug 2018

Contributed by Lukas

Summary There are myriad reasons why data should be protected, and just as many ways to enforce it in transit or at rest. Unfortunately, there is s...

Graph Databases In Production At Scale Using DGraph with Manish Jain - Episode 44

20 Aug 2018

Contributed by Lukas

Summary The way that you store your data can have a huge impact on the ways that it can be practically used. For a substantial number of use cases,...

Putting Airflow Into Production With James Meickle - Episode 43

13 Aug 2018

Contributed by Lukas

Summary The theory behind how a tool is supposed to work and the realities of putting it into practice are often at odds with each other. Learning ...

Taking A Tour Of PostgreSQL with Jonathan Katz - Episode 42

06 Aug 2018

Contributed by Lukas

Summary One of the longest running and most popular open source database projects is PostgreSQL. Because of its extensibility and a community focus...

Mobile Data Collection And Analysis Using Ona And Canopy With Peter Lubell-Doughtie - Episode 41

30 Jul 2018

Contributed by Lukas

Summary With the attention being paid to the systems that power large volumes of high velocity data it is easy to forget about the value of data co...

Ceph: A Reliable And Scalable Distributed Filesystem with Sage Weil - Episode 40

16 Jul 2018

Contributed by Lukas

Summary When working with large volumes of data that you need to access in parallel across multiple instances you need a distributed filesystem tha...

Building Data Flows In Apache NiFi With Kevin Doran and Andy LoPresto - Episode 39

08 Jul 2018

Contributed by Lukas

Summary Data integration and routing is a constantly evolving problem and one that is fraught with edge cases and complicated requirements. The Apa...

Leveraging Human Intelligence For Better AI At Alegion With Cheryl Martin - Episode 38

02 Jul 2018

Contributed by Lukas

Summary Data is often messy or incomplete, requiring human intervention to make sense of it before being usable as input to machine learning projec...

Package Management And Distribution For Your Data Using Quilt with Kevin Moore - Episode 37

25 Jun 2018

Contributed by Lukas

Summary Collaboration, distribution, and installation of software projects is largely a solved problem, but the same cannot be said of data. Every ...

User Analytics In Depth At Heap with Dan Robinson - Episode 36

17 Jun 2018

Contributed by Lukas

Summary Web and mobile analytics are an important part of any business, and difficult to get right. The most frustrating part is when you realize t...

CockroachDB In Depth with Peter Mattis - Episode 35

11 Jun 2018

Contributed by Lukas

Summary With the increased ease of gaining access to servers in data centers across the world has come the need for supporting globally distributed...

ArangoDB: Fast, Scalable, and Multi-Model Data Storage with Jan Steeman and Jan Stücke - Episode 34

04 Jun 2018

Contributed by Lukas

Summary Using a multi-model database in your applications can greatly reduce the amount of infrastructure and complexity required. ArangoDB is a st...

The Alooma Data Pipeline With CTO Yair Weinberger - Episode 33

28 May 2018

Contributed by Lukas

Summary Building an ETL pipeline is a common need across businesses and industries. It’s easy to get one started but difficult to manage as n...

PrestoDB and Starburst Data with Kamil Bajda-Pawlikowski - Episode 32

21 May 2018

Contributed by Lukas

Summary Most businesses end up with data in a myriad of places with varying levels of structure. This makes it difficult to gain insights from acro...

Brief Conversations From The Open Data Science Conference: Part 2 - Episode 31

14 May 2018

Contributed by Lukas

Summary The Open Data Science Conference brings together a variety of data professionals each year in Boston. This week’s episode consists of...

Brief Conversations From The Open Data Science Conference: Part 1 - Episode 30

07 May 2018

Contributed by Lukas

Summary The Open Data Science Conference brings together a variety of data professionals each year in Boston. This week’s episode consists of...

Metabase Self Service Business Intelligence with Sameer Al-Sakran - Episode 29

30 Apr 2018

Contributed by Lukas

Summary Business Intelligence software is often cumbersome and requires specialized knowledge of the tools and data to be able to ask and answer qu...

Octopai: Metadata Management for Better Business Intelligence with Amnon Drori - Episode 28

23 Apr 2018

Contributed by Lukas

Summary The information about how data is acquired and processed is often as important as the data itself. For this reason metadata management syst...

Data Engineering Weekly with Joe Crobak - Episode 27

15 Apr 2018

Contributed by Lukas

Summary The rate of change in the data engineering industry is alternately exciting and exhausting. Joe Crobak found his way into the work of data ...

Defining DataOps with Chris Bergh - Episode 26

08 Apr 2018

Contributed by Lukas

Summary Managing an analytics project can be difficult due to the number of systems involved and the need to ensure that new information can be del...

ThreatStack: Data Driven Cloud Security with Pete Cheslock and Patrick Cable - Episode 25

01 Apr 2018

Contributed by Lukas

Summary Cloud computing and ubiquitous virtualization have changed the ways that our applications are built and deployed. This new environment requ...

MarketStore: Managing Timeseries Financial Data with Hitoshi Harada and Christopher Ryan - Episode 24

25 Mar 2018

Contributed by Lukas

Summary The data that is used in financial markets is time oriented and multidimensional, which makes it difficult to manage in either relational o...

Stretching The Elastic Stack with Philipp Krenn - Episode 23

19 Mar 2018

Contributed by Lukas

Summary Search is a common requirement for applications of all varieties. Elasticsearch was built to make it easy to include search functionality i...

Database Refactoring Patterns with Pramod Sadalage - Episode 22

12 Mar 2018

Contributed by Lukas

Summary As software lifecycles move faster, the database needs to be able to keep up. Practices such as version controlled migration scripts and it...

The Future Data Economy with Roger Chen - Episode 21

05 Mar 2018

Contributed by Lukas

Summary Data is an increasingly sought after raw material for business in the modern economy. One of the factors driving this trend is the increase...

Honeycomb Data Infrastructure with Sam Stokes - Episode 20

26 Feb 2018

Contributed by Lukas

Summary One of the sources of data that often gets overlooked is the systems that we use to run our businesses. This data is not used to directly p...

Data Teams with Will McGinnis - Episode 19

19 Feb 2018

Contributed by Lukas

Summary The responsibilities of a data scientist and a data engineer often overlap and occasionally come to cross purposes. Despite these challenge...

TimescaleDB: Fast And Scalable Timeseries with Ajay Kulkarni and Mike Freedman - Episode 18

11 Feb 2018

Contributed by Lukas

Summary As communications between machines become more commonplace the need to store the generated data in a time-oriented manner increases. The ma...

Pulsar: Fast And Scalable Messaging with Rajan Dhabalia and Matteo Merli - Episode 17

04 Feb 2018

Contributed by Lukas

Summary One of the critical components for modern data infrastructure is a scalable and reliable messaging system. Publish-subscribe systems have b...

Dat: Distributed Versioned Data Sharing with Danielle Robinson and Joe Hand - Episode 16

29 Jan 2018

Contributed by Lukas

Summary Sharing data across multiple computers, particularly when it is large and changing, is a difficult problem to solve. In order to provide a si...

Snorkel: Extracting Value From Dark Data with Alex Ratner - Episode 15

22 Jan 2018

Contributed by Lukas

Summary The majority of the conversation around machine learning and big data pertains to well-structured and cleaned data sets. Unfortunately, tha...

CRDTs and Distributed Consensus with Christopher Meiklejohn - Episode 14

15 Jan 2018

Contributed by Lukas

Summary As we scale our systems to handle larger volumes of data, geographically distributed users, and varied data sources the requirement to dist...

Citus Data: Distributed PostGreSQL for Big Data with Ozgun Erdogan and Craig Kerstiens - Episode 13

08 Jan 2018

Contributed by Lukas

Summary PostGreSQL has become one of the most popular and widely used databases, and for good reason. The level of extensibility that it supports h...

Wallaroo with Sean T. Allen - Episode 12

25 Dec 2017

Contributed by Lukas

Summary Data oriented applications that need to operate on large, fast-moving streams of information can be difficult to build and scale due to the...

SiriDB: Scalable Open Source Timeseries Database with Jeroen van der Heijden - Episode 11

18 Dec 2017

Contributed by Lukas

Summary Time series databases have long been the cornerstone of a robust metrics system, but the existing options are often difficult to manage in ...

Confluent Schema Registry with Ewen Cheslack-Postava - Episode 10

10 Dec 2017

Contributed by Lukas

Summary To process your data you need to know what shape it has, which is why schemas are important. When you are processing that data in multiple ...

data.world with Bryon Jacob - Episode 9

03 Dec 2017

Contributed by Lukas

Summary We have tools and platforms for collaborating on software projects and linking them together, wouldn’t it be nice to have the same ca...

Data Serialization Formats with Doug Cutting and Julien Le Dem - Episode 8

22 Nov 2017

Contributed by Lukas

Summary With the wealth of formats for sending and storing data it can be difficult to determine which one to use. In this episode Doug Cutting, crea...

Buzzfeed Data Infrastructure with Walter Menendez - Episode 7

14 Nov 2017

Contributed by Lukas

Summary Buzzfeed needs to be able to understand how its users are interacting with the myriad articles, videos, etc. that they are posting. This le...

Astronomer with Ry Walker - Episode 6

06 Aug 2017

Contributed by Lukas

Summary Building a data pipeline that is reliable and flexible is a difficult task, especially when you have a small team. Astronomer is a platform...

Rebuilding Yelp's Data Pipeline with Justin Cunningham - Episode 5

18 Jun 2017

Contributed by Lukas

Summary Yelp needs to be able to consume and process all of the user interactions that happen in their platform in as close to real-time as possibl...

ScyllaDB with Eyal Gutkind - Episode 4

18 Mar 2017

Contributed by Lukas

Summary If you like the features of Cassandra DB but wish it ran faster with fewer resources then ScyllaDB is the answer you have been looking for....

Defining Data Engineering with Maxime Beauchemin - Episode 3

05 Mar 2017

Contributed by Lukas

Summary What exactly is data engineering? How has it evolved in recent years and where is it going? How do you get started in the field? In this ep...

Dask with Matthew Rocklin - Episode 2

22 Jan 2017

Contributed by Lukas

Summary There is a vast constellation of tools and platforms for processing and analyzing your data. In this episode Matthew Rocklin talks about ho...

Pachyderm with Daniel Whitenack - Episode 1

14 Jan 2017

Contributed by Lukas

Summary Do you wish that you could track the changes in your data the same way that you track the changes in your code? Pachyderm is a platform for...

Introducing The Show

08 Jan 2017

Contributed by Lukas

Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure. Go to dataengineeringpodcast.com to subscr...
