July 11, 2022

Matt Rocklin: Data Science, Dask, Scale | Turn the Lens #12

Jeff Frick
People love using Python ... but when they hit larger datasets ...
everything breaks ...
Dask was designed to scale the existing ecosystem of tools ...
now we get to solve not just things on our laptops, we get to solve things on the largest computers in the world

- Matt Rocklin
Episode Description

Matt Rocklin has been at the center of the Python renaissance in enterprise data science, at Anaconda, NVIDIA, and now Coiled, Matt's startup focused on helping enterprises more easily deploy parallelized Python in production at scale. Artificial Intelligence (AI) and Machine Learning (ML) are two of the most important technologies of our time, and data scientists write the algorithms that make them possible. Coiled is enabling more data scientists to do their job better by removing friction, both for scaling (to the cloud) and for working within the confines of corporate governance, compliance, and economic requirements. In this far-ranging interview, Matt discusses the continuing advances in data science, open source in the enterprise, the Python resurgence, and the challenges of building a company during a global pandemic.
Thanks Matt.

Date of record:

Chapters


00:00 Intro

00:29 Introducing Matt Rocklin, Coiled

01:16 Organizations are sitting on treasure troves of data, but gaining insight and value from it is hard

01:26 Data scientists pore through the data to construct new policies, operational systems, & algorithms

01:49 Python is the standard data science, data engineering, machine learning language today

02:02 Matt, open source maintainer for NumPy, Pandas, Scikit-learn, Jupyter, etc.

02:09 Focus on the problem of scaling up the existing community to work on larger data sets

03:11 Open source as a force multiplier, a way to help others be more productive

03:27 Dask origin story

03:38 People love Python, but when they go from a gigabyte to large data sets, everything breaks

03:53 Dask was designed to fix the disconnect, to scale the existing ecosystem of tools

04:03 Dask has breathed new life into the Python space, opening up completely different problem sets

04:51 Matt compares CPUs to GPUs in machine learning

05:28 NVIDIA RAPIDS, Dask, and changing the economics of AI to open up the problem set

05:56 NVIDIA RAPIDS, scale, not one GPU but hundreds

06:50 Parallel also means less power, lower costs, opens up a different set of problems

06:55 1,000x speed up for only a 10x bump in price

07:00 Better economics opens up a broader set of problems

09:38 Coiled origin story

09:38 Coiled was designed to make scaling Python easy in the cloud, most commonly for data science and machine learning

09:47 Coiled is based around the open source technology Dask, which is the pre-eminent Python native solution for distributed computing

10:23 People love Dask, but elevating it within an organization is hard, often because of DevOps or organizational challenges

11:12 Coiled offers a hosted and managed Dask offering in the cloud

11:19 A consistent challenge for data science teams, easily moving from desktop to cloud scale

11:55 Coiled solves the problems IT cares about: security, user management, cost management, etc.

12:46 IT has a mandate to support scalable machine learning platforms, and Coiled fits their needs

12:54 IT likes Coiled because we solve one problem really well, and get out of the way for everything else

13:10 Coiled is also designed for the individual user, designed to make it easy for anyone to scale out Python anywhere

14:09 Adding new data sets to algorithms

14:51 By allowing yourself to hit the full data firehose, you can make much more accurate models

16:51 By building applications for the largest organizations in the world, we're also able to serve the broader public

16:57 Our Mission, providing computation and accessibility for everybody

17:36 Matt on modern data tooling changing the way we think about analytics

17:53 Everything is broken apart so you can look at different kinds of applications and technologies

18:32 Open data

19:24 Impact of cloud infrastructure

21:21 Matt on the challenges in moving from developer to CEO

21:27 Chess and the CEO

26:24 Matt on open source

Transcript

>> Jeff: All right, good. So I'll just count it down and we'll just jump in.

>> Matt: Sounds great.

>> Hey, welcome everybody. Jeff Frick here from Turn the Lens coming to you from the home studio. And we're really excited to have somebody who's kind of at the cutting edge of what's going on in AI and machine learning and data science. You hear about algorithms all the time but somebody has actually got to write those algorithms and they don't necessarily get them right the first time. So, sometimes we might have to experiment or tweak them or work on them. And so it's a really exciting process and coming to us from Austin today is Matt Rocklin. He is the founder and CEO of Coiled. Matt, great to see you.

>> Hey, Jeff.

>> So you've been in the middle of this forever. You've got a long history with Python and before we get into kind of what's going on, explain for the people that aren't familiar with Python, kind of the role of the data scientist. 'Cause we talk a lot about applied AI in giving me an auto response on my Google Maps or my Google Mail or auto reroute on my Google Maps. But for the people actually writing the algorithms in the day-to-day world of trying to figure out what the algorithm is going to be, how does that work? And what's that kind of like in the day of a data scientist building algorithms?

>> Yeah, so I think actually, so just to take a step back for a second, like, all sorts of organizations today are sitting on these mountains of data, these treasure troves that they have. But gaining insight and gaining company value out of them is actually really hard and that's why they hire data scientists. To go through, pore through those mountains and construct hopefully, some PowerPoint that changes company policy or some operational system that actually makes decisions day by day. This could be as simple as a PowerPoint slide or it could be an advanced algorithm like an Uber or Lyft figuring out where the closest car is. Those groups are often using... More and more they're using Python. So Python is by far the standard data science, data engineering, machine learning language today. And they're using a bunch of algorithms that are actually already pre-built, often by a bunch of the open source communities.

That's where I come from. So for folks familiar with libraries like NumPy, Pandas, Scikit-learn and Jupyter, I've been in that space for a long time as an open source maintainer. And I've mostly thought about the problem: how do we scale up that existing community to work on larger sets of data? So often people are using Python very happily, they want to scale out, and it becomes a bit of a problem. So that's really my focus and that adds a whole other layer of complexity in lots of different ways.

>> So let's jump into it then 'cause then you were one of the founding creators of Dask which as I understand is really the ability now to take Python beyond the physical limitations of your memory or whatever you're running on your desktop to parallelize it, right? Which is, we know kind of the great thing that's so important about computing now is parallelization. And that allows all this massive scale and cloud computing and all the rest. So when you started on kind of looking at parallelization, was it to fix something? A problem that you had? Did you just see kind of the big potential if I could parallelize Python and enable it to be applied to many more problems, bigger problems? What kind of drew you into Dask?

>> Yeah, so I was drawn into open source just 'cause I found I was so much more productive in that space by enabling other groups to be productive themselves. Previously, I was a graduate student researcher working on power systems, renewable energy, the weather, that kind of thing. I started Dask when I was at Anaconda, and at that point we already had a very broad set of users who were asking us to fix problems. And the very common pain point you would run into is, people love using Python. They love all the capabilities that it provides, but when they hit that larger dataset, when they go from a gigabyte to 10 gigabytes or 100 gigabytes or to a petabyte, everything breaks. And there's this real disconnect that occurs, and Dask was designed to fix that disconnect, to scale the existing ecosystem of tools, to scale those out to handle these larger problems. And that's really breathed sort of a new breath of life into the Python space, 'cause now we get to solve not just things on a laptop, we get to solve things on the largest computers in the world. That opens up just a completely different set of problems for us.

>> Right. And then you took a little journey to NVIDIA and we all know NVIDIA from the GPUs, right? And back in the day, GPUs were known for gaming, right? But now GPUs have had this massive breakthrough as an engine, as an enabler for this massive compute. So when you took the tour at NVIDIA and you looked at the power of GPUs, for the layman, why are GPUs so much more powerful than a CPU, why is that such a game-changing technology? And then in the application of the stuff you're working on, Python and Dask, how did that change what you could do with those tools?

>> Yeah, there's a whole history of that. We give a whole talk on that. Yeah. So the CPU is a really general purpose processor that's designed to do a sequence of things all at once, or all one at a time. A GPU is like a small army of really dumb processors. It's great if you want to change pixels on a screen, there's lots of pixels to change at once. Turns out to also be great for a lot of interesting machine learning applications. And it's really been interesting watching NVIDIA shift from a gaming and graphics company to an AI and a data company. If you look at their keynotes these days, it's like 80% data processing, which is really different. What's been interesting, so I worked on the RAPIDS team and what's really exciting about RAPIDS is that... So NVIDIA made a bunch of money because of deep learning, turns out deep learning is a really good application for GPUs, really touches that sweet spot, but you can ask the question, well, what else can we do with GPUs outside of deep learning?

And it turns out you can do a lot, right? Even more sort of general data engineering and data processing technologies. So things like parsing CSV files or JSON data or doing database joins are also really well-suited to GPUs. So NVIDIA made a technology suite called RAPIDS to apply GPUs to these problems. And they wanted to scale it out. They wanted to hit not one GPU, but 100s of GPUs. And so they chose to use Dask, right? So Dask is this very general purpose toolkit to build distributed systems. And so they brought me in, I hired out a team of a bunch of folks and went to town, scaling out RAPIDS onto multiple machines, really just blew the competition out of the water. It's actually been really amazing, the kind of advances we're able to make that are game-changing in our ability to solve new and different kinds of problems.

>> So really, really a step function in terms of capability because of the massive parallelization versus the sequential. Is that probably the best way for people to compare the two, it's really the sequential versus parallel between those kind of CPU architectures?

>> Yeah. And as a result of that being parallel, they'll be much lower power. They'll be much lower cost. They're able to get just like a 1000x speed up for only a 10x bump in price. So it just opens up a different set of problems you couldn't have solved before.

>> Right. So in terms of kind of... I'm going to use the word porting but I have all these old vocabulary words. You will laugh but kind of porting it to the GPU from what had been running on CPUs mainly on laptops. Is that easy to do? Is that one of these great things about open source that you can move? 'Cause I'm asking in the context of we hear about RISC-V, microprocessor architecture, Amazon has the Graviton processor. Now we know there's the TensorFlow, the TPU out there. So it's really in my mind is kind of an XPU. Kind of new world order in terms of a different way to apply computing technology again to these big problems.

>> Yeah. No, it's incredibly hard. Fortunately, that wasn't my job. So, GPUs have it a bit harder than some other systems like ARM chips which are also very exciting. We're doing this interview on an M1 chip, M1 on a Mac, which is also new custom silicon. It's changing the game of consumer electronics. But yeah, on CUDA, it's actually quite hard but fortunately NVIDIA is very, very good at accelerating these technologies. What's wonderful though about the Python space is that we've constructed a level playing field for all the hardware vendors to come in and compete for our cycles and for our user base, right. Everyone likes using Python, and NVIDIA can come in. I don't care how it's written internally, as long as they're able to out-compete Intel or ARM or Apple. And that's what's really exciting, is that now the Python space is large enough that all the hardware vendors are competing for our attention span.

From a Dask perspective, we're actually a level above all of that. We don't really care about how those algorithms work. We care about how to parallelize them and parallelization ends up being sort of a more uniform problem to solve than the individual architectures.

>> Hmm Hmm. So you're there, you're having fun with GPUs and then you decide, "Hey, maybe I should start a company." And so February, 2020, I don't know how much kind of insight you could see on COVID or if you were keeping an eye on the East where if we'd been paying closer attention, we would've seen it coming, but you started a company in February. I'm looking at your LinkedIn, COVID hits in March. So first off, why start the company? Kind of what is the origin story behind Coiled? And then specifically, maybe some of the challenges, you started in February and then everything gets locked down about four weeks after you started the company. Certainly not an easy challenge, like starting a company isn't hard enough already?

>> Yeah. Everything turned out in the end, but yeah. So Coiled is a company that was designed to make scaling Python easy in the cloud, most commonly for data science and machine learning capabilities. As I mentioned, we're based around the open source technology Dask, which is the preeminent Python native solution today for distributed computing. Dask has this really long history; Coiled is super young. We're really propelled forward by the open source Dask user base and by a lot of common trends. We've mentioned GPUs but also the growth of AI, machine learning, and data science, the movement to the cloud, and just the growth of Python generally in the enterprise. We started Coiled mostly out of demand. We've found that many users... People love using Dask. It's got incredible grassroots adoption, but elevating it within an organization ends up being hard, either for DevOps reasons or for organizational challenges, and Coiled is really designed to solve those problems.

>> So it's really, really in enabling people to move from the individual person working on their laptop to now kind of making it corporate friendly or enterprise friendly or big institution friendly. So that begs the question of all the kind of typical things, right? So support, service, best practices, consulting, et cetera. So what are some of the things you guys help with, or is it really more about IT control and security and manageability and that side? Is it more for the data scientist side or more for kind of the IT side?

>> Yeah. So we really serve both masters. Coiled is right in the middle of those two groups. At a smaller scale, mostly what we do is provide a hosted and managed Dask offering in the cloud. That's our primary product. And the common story that we see is that there's a small data science team, between two and 10 people in an organization. And there's some early adopter in that group who loves using Dask, maybe using Dask for six to 12 months on a workstation, they've kind of broken out their cloud kit. They've deployed Kubernetes. They love it, but it's overwhelming to do that kind of work. They've sort of done this accidental foray into DevOps. And now they're trying to unwind. They see Coiled and we just provide all those solutions for them and so much more. So, I mean if they actually want to scale out in the organization, once you've got a few groups, IT tends to get involved, and we also solve all the problems that IT cares about. Things like security, auth, cost management, user management. And so we're really sort of a future-proofed way for a data science team to get results quickly.

>> And when do you usually get called in? Is it by the frustrated research team that's trying to move beyond their laptops and get access to the corporate cloud or is it from the IT guy that says, "Hey, these people are banging on me to try to get access to the cloud to bust out their programs. I need more control. Come help me." Who generally reaches out to you guys to bring you in on a project?

>> Yeah. So we see maybe three different categories but definitely centering on that data science team lead. That's the most common story but in a group that's more advanced, maybe they've got a few teams internally already using Dask, they've already gone to IT. And IT now has this mandate to support scalable Python machine learning platforms internally and Coiled slots in and fits a lot of their needs. IT really likes Coiled because we solve one problem. We solve it really well. And we get out of the way for everything else. We're not very opinionated on how you manage your models or on how you run the notebooks. We do one thing really well. On the other end of that spectrum, Coiled is also designed for individual users. We're very much a developer tool and we want to make it easy for anybody to scale out Python anywhere. And so we also sort of target individuals who are just swiping a credit card on the cloud.

>> Right. So I know you're working on the infrastructure to enable people to do cool things. But I wonder if there's some cool things that you've come across either in helping customers or getting engaged in specific projects that you can share with the audience in terms of, what are some of the actual outputs of these data science projects that people are running on this technology?

>> Yeah. Let me mention like three briefly. So a classic example, I think every bank does this, but credit risk modeling in banks. So a credit card company needs to figure out what line of credit to give you or what rate to give you. In order to figure out that number, they want a massive machine learning model. Using Dask, companies like Capital One are able to train that model on fundamentally different datasets of much larger scale than they were before.

>> Right. I was going to say 'cause they've been doing credit risk analysis forever, right? So what's the big thing that's changed now that they can do that they couldn't do five years ago?

>> Yeah. So I won't say specifically for Capital One but generally what we see in banks is, you'll switch from user-level data. So, who are you? Your name is Jeff. You're a certain age within a certain zip code. And they'll switch to transaction-level data, every single credit card swipe that you've ever done historically. And so that's the kind of shift where there's a lot more information in that, right? Like if your transactions shift from buying Netflix to buying vodka, that might be a good signal that maybe they should be reducing your limits. So there's things like that. By allowing yourself to hit sort of the full firehose of data, you can end up making much more accurate models.

>> That's a great example. All right. That was a good one. What other one? Do you have a couple more?

>> Yeah. Example two. So the Air Force is a new Coiled customer. This is the 309th airborne division, they're based out of Hill Air Force Base. And they do maintenance for all the aircraft in the Air Force. And so that's sort of a standard predictive maintenance play but with a little bit of a twist. They've got both the standard time series on all of their parts, sort of like an IoT kind of application, but they've also got video on the wings. They've got video in the cockpit. Their parts undergo a level of stress, very high G-forces, that is quite different from what you'd find in a lot of industrial settings. So they've got this much more advanced data pipeline that they want to build, and Dask really thrives in that more custom or bespoke setting.

And then (indistinct) I mentioned is actually just an individual. So we mentioned the sort of split between these IT departments and individuals. So there's a gentleman, Andrew Terrio, who was at the time an independent consultant. He was working for Bloomberg Media on the U.S. 2020 elections. This is back in late October, trying to figure out voter turnout. Classic story, he was in his apartment on his laptop using Scikit-learn to train the models on a few precincts, got everything working right and needed to scale out to every precinct in the United States. What started out as a few-minute job switched to being an overnight job. And that's fine if you've got lots of time, but he was under a really tight deadline 'cause the election's coming in a few days.

>> And he was working for Bloomberg. So he's trying to get data out for Bloomberg to publish with some prediction or whatever.

>> Correct. Yeah. This is Bloomberg Media.

>> Yeah, yeah, yeah.

>> Yeah. And so this is... Using Coiled, he was able to sign up for Coiled, use Coiled to get those results back in around 10 or 20 minutes. And so that really speaks to our focus on the developer experience and our focus on accessibility. What I really like about this comparison is that by building applications for, let's say, some of the largest organizations in the world, we're also able to serve the broader public. And that's really core to our mission of providing computation and accessibility for everybody so that they can better understand the world around them.

>> Right. I just want to double down on kind of what you said, using additional datasets and more datasets, and you talked about camera feeds for the Air Force. So not that specifically but are you seeing more analysis of video and audio now being factored into these models beyond kind of the standard time series stuff that we were seeing before? 'Cause that's a huge opportunity when we start to be able to really see into audio and see into video and start to pull that data out.

>> Yeah, no, absolutely. I think as data tooling has advanced, we've changed from thinking mostly about business applications, where you've got a SQL database and you're really restricted to SQL, to thinking about like Hadoop and Spark, where we were thinking about a table of records that you can analyze in sort of a MapReduce kind of way, to something like Dask and Python, where now suddenly everything is broken apart and you can look at all different kinds of applications with all different kinds of technologies. So Python comes from a very scientific background where there's decades of image analysis research or three-dimensional volume research or research into complicated graph systems and using technologies like Dask, we're able to really open up all of those much more rich kinds of data to operate at scale. And that's something that's really exciting. There's new domains in some biomedical imaging, genomics analysis, satellite imagery, all of this stuff is now suddenly possible. All these much richer datasets are possible because of technologies like Dask and Python.

>> Right. The other piece, you're talking about the different datasets, that I saw in a different interview somewhere getting ready for this, is that you talked about kind of open data. And I think you used the example of the city of Chicago making public open data. So again, it's a really interesting trend. I don't know that a lot of people are paying attention to that, that not only do you have the capability to do things to the data that you couldn't do before but there's also new datasets available, or maybe just more accessible is a better word, where now people can go in and do their own analysis on other datasets and make relationships, find correlations that potentially before, it just wasn't really an option. So what are some of the cool datasets you guys see out there and where are you seeing some of this application of, I don't want to say the citizen data scientist, you've still got to be pretty sophisticated, but you do have access to these datasets that you can now start to do your own analysis on.

>> Yeah, no, that's a great example. I think what's really accelerated a lot of this is the cloud, right? The cloud is this massive shared hard drive that we all can look at. We can all write to and we can all read from. And that means that there's just so much more available to us, allows us to cross different kinds of data with each other. Now what's challenging though is accessing that hard drive and making changes to that hard drive, right? That requires a lot of infrastructure and it's not really very accessible to the average developer out there or the average data scientist and Coiled really brings a lot of those capabilities much closer to home. We think a lot more about that proximate computing so that you feel like you're just operating on your laptop but actually you're transforming a 100-terabyte dataset on S3 that everyone in your organization and possibly if you want to, everyone in the world can also read.

>> Right. I want to shift gears and talk a little bit about kind of you and your journey. You're a developer and you've been playing with code and you're a smart guy, you've got your doctorate degree, and now you're running a company, decided to be a CEO, very different kind of a challenge. I wonder if you can share, because I also think having a technical founder is a real asset and not all companies can pull it off and not all technical founders can pull it off and you're getting into it and learning as you go. So I wonder if you can share some thoughts about making that transition and really having a different kind of role than kind of hands on keys, to, as you said, helping others with hands on keys do their job better.

>> Yeah. I mean, a technical founder can be an asset or liability depending on the technical founder. It definitely increases the risk of the company. Yeah. I mean, first of all, like, running an open source organization of the scale of Dask, it's not actually so dissimilar from running a company, right? We've got budgets, we've got lawyers, we've got 30 employees, we've got org charts, whatever it is. We don't have money. We actually can't pay anybody to do anything. It's a different game. But to your point though, I think being able to step back and make space for others is something that I've really, really struggled with personally. The analogy I like here is that it's a lot like playing chess. So like as an individual developer, you're like a pawn on a chess board, and then you get to be a team lead and you're like a knight, you can kind of do a bit more things. You have a bit more leverage. And as I sort of evolved my career, I became a bishop and then a rook. And I had thought that, as being an executive in a company, I'd be like a queen and have this amazing leverage over the board.

And it turns out that actually I became a king, this useless piece that just gets in the way. I'm mostly a liability. And the most strategic play I can make is to castle, to get out of the way and put a rook into place. And so that's been a learning experience for me. I think a lot of my ego has had to be stripped away but it's also an experience that I've heard is common among lots of people in my position.

>> Right. Well, it's good to get on a new learning curve, right? You've already got your PhD in the computer science and the data science. So it's good to get back at the bottom end of a steep learning curve.

>> I don't know. Too much growth is a bad thing.

>> So give us kind of the basics. So you've been at it since February of 2020. So kind of where's the company now in terms of number of employees? I think I saw something too where you're 100% distributed and that was an objective from early on. So give us kind of the 101 on the state of Coiled. If I looked you up on TechCrunch, what's it going to say?

>> Yeah, I'll maybe start back in February, 2020. So we raised a bit of seed money. It's like 5 million. And we burned that very slowly early on. I come from more of a consulting background. So we had five to seven employees for most of 2020 and it quickly became obvious that demand was just huge relative to the size of our team. Everybody right now wants to do scalable Python and we've been in that space for five years. Like we're very much out in front of that. And so just demand coming in was much more than that team could handle. So right now we're at about, I want to say, like 25 or 30 employees, which has been facilitated by a recent raise. We raised about 20 million led by Bessemer.

>> Congratulations.

>> Yeah, no, it's been transformative to the company honestly. So yeah, we're about I want to say 30 people right now, most of that has been put into engineering development. Mostly right now we're focusing on really just honing that developer experience and the onboarding experience. Really focused right now on getting small teams engaged as quickly and as easily as possible.

>> That's great. And then again, back to the kind of the foundational cloud piece of this, so you've launched as a cloud service, right? I mean, that's the fundamental go to market. Do you have a multi-cloud strategy? Do you have an on-prem strategy? Where do those two pieces fit?

>> Yeah, so we do multi-cloud. So Dask usage right now is actually spread relatively uniformly across the three major clouds. We deploy on all of them and our customers like that 'cause it reduces lock-in, especially the larger customers. We started out being open to on-prem. We happen to just know like the IT departments in every major bank in New York City and that was our original customer base. COVID actually affected that, right? When COVID happened there was about three months where everyone thought like, "Oh, we're going to wait and see." And so that actually pushed us more towards these smaller companies who were able to move faster, and it ended up being a wonderful outcome for us. Actually, we've gotten so much information out of those small groups. They move so much faster that it's allowed us to develop our products much more quickly.

>> Right. I'm just curious how much do you hear digital transformation in these engagements? 'Cause clearly you're a big piece of it but you're kind of the engine behind a lot of it. That's not necessarily kind of exposed to the end user, right? You're delivering it but digital transformation clearly a giant progress in the marketplace disrupting everybody. And I'm sure it's a huge tailwind behind you guys.

>> Yeah, definitely. I mean, what you say about we're not affecting end users, that's a core part of our business as well. Like we're very much a best of breed product. We do one thing well, we scale Python, but often that's not the whole story, right? We're in the mix with a bunch of other products in a company and a bunch of other developers internally building out system solutions. By being best of breed, we're able to have a tremendous reach because we're able to solve all different kinds of problems in a very general way. But we really do need a lot of partners like services groups or partners within an organization to make our products successful. This mimics how Python works which is this ecosystem of tools. We're an ecosystem of products.

>> Right. So I want to close on something I saw in another one of your interviews researching for this where you made a statement that the combination of open source plus a commercial company can do things together that neither of them could do by themselves. And I wonder if you could kind of unpack that a little bit for us 'cause you'd seen it in your past life and clearly you believe that. 'Cause now you started Coiled on that same kind of mission, so where does that belief come from? What are some of the examples that you'd like to state as clearly that's part of your foundational kind of mission?

>> Yeah. So I mean clearly open source has out competed proprietary software in terms of feature set, right? Deployments of Spark, of DASK, of other technologies, outstrips Oracle at this stage. People are very excited about these new technologies, just so much in-

>> And certainly the innovation engine cannot even be touched, right?

>> That being said, a pure open source play, like I did that for five years and it is incredibly good at making solid and usable technology, but there are also organizational challenges it just doesn't meet. So for one, it's actually really hard to build always-deployed services that can scale out to serve the world. You really need people to be paid long-term to solve those problems. We saw this in organizations like Pangeo. We started Pangeo to serve a lot of earth science needs. And we threw DASK and JupyterHub on the cloud with Kubernetes and it was fantastic, but all of these problems came up and it was clearly something that the open source community was not well-prepared to handle. Coiled solves a lot of those needs. Additionally, just talking to organizations is a skillset that open source communities don't have.

It's quite hard for Dell or for NVIDIA or for Ford to engage with the Pandas community or the DASK community. You really need a for-profit entity that's there and able to negotiate, able to present the needs and the mission of that community to the for-profit world. Doing both open source and for-profit is hard. We've seen it fail many times, but we're working our butts off to make it work at this time.

>> Yeah. Well, I want to get your take on this. Typically we'll hear about a big company that's not necessarily dedicated to an open source project, but they have contributors. We'll just pick Google to pick on somebody, right? There'll be people working on Kubernetes or people working on TensorFlow, et cetera. And it's always interesting to me, how do you compensate them or how do you manage their time to contribute to the open source project which is so important to their personal persona and their own kind of feedback to their own work? That's one challenge. You're in a really interesting different situation where you're fundamentally built around these open source projects. So in terms of allocating time and how people work and how they're prioritizing and how much are they working on the open source piece of the Coiled offering versus the enterprise piece and the ugly stuff like you said, administration consoles and the stuff that's not the open source components. How do you manage that? 'Cause I would imagine, do you still have a lot of people that are really active contributors to the core 'cause that's such an important piece of your total package?

>> Yeah. DASK is maybe unique among open source packages in that we've always been paid by for-profit companies to build out DASK. We had our first client, I think it was D.E. Shaw the hedge fund, like a few months after starting the project. We've always been responsive to companies because we've always been on the cusp of what everybody needs. We're very much driven by the day-to-day necessities of companies. So it's actually not hard for us to find companies willing to pay us to maintain the core. We actually get to pick and choose who we allow to pay in that context.

>> Right. So in our situation is actually not-

>> (indistinct)

>> Of your own people. How much do people have some set aside time of 10% or 20% to work on the core? Is it just so mixed up that you really can't separate it out that way?

>> Yeah. No, it's a separate group of people who work on DASK but even then, like they're paid. We have customers that pay us to work on DASK. And there's a separate group of people who actually have a different skill set, the skill set of enterprise auth or keeping cloud services running all the time or security. And it's actually a different subset from the people who would work on open source DASK. Most of our engineers aren't actually working on DASK. Most of our engineers are working on this Coiled product that supports all of the work from all the DASK engineers around the world. We're very much more of a high leverage organization focusing on that infrastructure piece and all of that is proprietary code.

>> Right. Right. All right, now. Well, this has been a super catch-up. I'll give you the last word, unlike when you started the company I think we're coming out of COVID a little bit. So things are going to change in the positive direction I think, but I wonder if you can just tell us kind of what are your priorities for the next several months. Whatever the right timeframe is, kind of what are you working on? What are you excited about? And now you got some fresh powder to work with to give us a little bit about kind of Coiled 2021, 2022.

>> Yeah. So on the open source DASK side, I'm really excited about the increasing growth that we see, not in DASK itself but in other projects adopting DASK. So for example, like Microsoft announced last month their Planetary Computer, which puts out a lot of satellite imagery and helps scientists understand how our earth changes. And the Chan Zuckerberg Initiative by Priscilla Chan and Mark Zuckerberg, they're working on napari for biomedical imaging. Jeff Hammerbacher, the Cloudera founder, is now working at Related Sciences, looking at DASK for population genomics. These are all really key problems that aren't just business problems, they're also humanistic problems that are all now sort of DASK first. And it's really exciting to see that work happen, not just for data scientists, but also for genuine scientists. So I'm excited about that growth. On the Coiled side, like we're just trying to keep up. All of those different endeavors on both the scientific and the for-profit side all bring in these new communities that all need supported DevOps.

They all need supported solutions. They all need cloud products that allow them to run efficiently and easily in the cloud. And so just trying to keep up with that is our main focus for the next year.

>> Yeah. That's great. It's great you mentioned Jeff Hammerbacher 'cause I think he's famous. I'm pretty sure it's attributed to him, I don't know that it's actually him, that the greatest minds of my generation are trying to get you to click on an ad. So it's nice to see some of the smart people out there working on some of these big problems 'cause there are big problems that need to be solved beyond clicking on an ad. So this is great. That's good news. Well, Matt, thank you. Thank you for the update. It sounds like exciting times. And again, it's nice to get the new dry powder so you can start to solve all the problems your customers have 'cause it sounds like you guys are nothing but busy.

>> Great. All right. Well, thanks a lot Matt.

>> Thanks Jeff.

>> All right. He's Matt, I'm Jeff. You're watching Turn the Lens with Jeff Frick. We'll see you next time. Thanks for watching. All right.

Links and References

Post - Just add "easy"​ ... economically expanding the machine learning problem space, Jeff Frick, LinkedIn, July 2021

Matt Rocklin, LinkedIn Profile, Blog, MatthewRocklin.com, GitHub

Coiled.io

Python.org

DASK.org

50 Best Open Data Sources Ready to be Used Right Now, Devin Pickell, Learn Hub, March 2019

Anaconda

Apache Arrow

Apache Flink

Apache Hadoop

Apache Spark

Apple M1

ARM

AWS and NVIDIA to bring Arm-based Graviton2 instances with GPUs to the Cloud, Geoff Murase, AWS Blog, April 2021

AWS Graviton Processor, AWS Blog Search "Graviton"

Bokeh

A brief history of Python, Vasilisa Sheromova, Exyte Blog, Nov 2020

Chan Zuckerberg Initiative

Citizens Police Data Project, Invisible Institute

Cloud Tensor Processing Units (TPUs)

Dask.DataFrame

Data Science for Social Good: Best Sources for Free Open Data, Ioana Spanache, Towards Data Science

Famous Actuaries, Travelling Actuary

Guido van Rossum, Main creator of Python, BDFL (benevolent dictator for life), LinkedIn

History of Python, GeeksforGeeks, May 2019

Intel

Invisible Institute Relaunches The Citizens Police Data Project, Jamie Kalven, The Intercept, Aug 2018

IPython

James Anderson, Insurance Hall of Fame Inductee, 2001, Insurance Hall of Fame, International Insurance Society

James Hamilton Announces New Amazon EC2, M6g, C6g, and R6g Instances, Powered by AWS Graviton2, James Hamilton, Amazon Web Services YouTube, June 2020

Jeffrey Hammerbacher, LinkedIn, Wikipedia, GitHub, Twitter, Hammer Lab, O'Reilly, Amazon

Jupyter.org

JupyterHub

Keras

Kubernetes

Matplotlib.org

Matplotlib Basemap

NetworkX

NumPy.org

NVIDIA Cloud and Data Center

NVIDIA RAPIDS

Observable

Observable: An Earthquake Globe in Ten Minutes, Jeremy, Observable, Jan 2018

O'Reilly Strata Data Conference (fka Strata + Hadoop World), O'Reilly

Pangeo

Pandas.pydata.org

Prefect

The Planetary Computer by Microsoft

Python Scalability: A Convenient Truth, Travis Oliphant, Continuum Analytics (renamed Anaconda June, 2017), Mar 2016

PyTorch

Qualitative comparison between Dask and Vaex, Jovan Veljanoski, Towards Data Science, June 2021

Scikit-learn.org

SciPy.org

Scrapy

Seaborn

TensorFlow

Top 10 Open Data Resources Online, InvestInTech

Vaex.io

Xarray

XGBoost

______________________________________________________________________________________________________

#MattRocklin #Coiled #Dask #Python #TurnTheLens #JeffFrick #DataScience #ArtificialIntelligence #AI #MachineLearning #ML

______________________________________________________________________________________________________

Disclaimer and Disclosure*

DISCLOSURE*: This interview was sponsored by Coiled. Neither Coiled nor other sponsors have editorial control over the content.

Quotations are attributed to the original authors and sources.

All products, product names, companies, logos, names, brands, service names, trademarks, and registered trademarks (collectively, *identifiers) are the property of their respective owners. All *identifiers used are for identification purposes only. Use of these *identifiers does not imply endorsement. Other trademarks and trade names may be used in this document to refer to either the entities claiming the marks and/or names of their products and are the property of their respective owners.

We disclaim proprietary interest in the marks and names of others. No representation is made or warranty given as to their content. User assumes all risks of use.  
