Open Source Contributors Analysis and Deep Dive into OSS Databases

In recent months, the VC and OSS communities have actively discussed the best open-source metrics for startups (e.g. good posts by Bessemer and Basis Set Ventures). Which parameter best estimates user adoption of OS startups?

There is no single correct answer; only a holistic view works. To paraphrase Albert Camus’ famous quote, perhaps the easiest way of making an open-source repository’s acquaintance is to ascertain how the people in it work [commit], how they love [star], and how they die [retention].

While stars are a metric representing fans (minimal involvement), commits characterize the repo’s most advanced users (maximal engagement). User adoption could be estimated using these metrics, and I have already covered stars in my previous posts (1, 2). So it’s time to dive into the other side of the spectrum — contributors who actually create open-source software.

A fresh approach to OSS analytics

Product analysts and VCs typically use a set of standard user metrics (e.g. MAU and ARPU) to assess digital products like SaaS or marketplaces, but I’ve never seen them publicly applied to OSS contribution.

Participation in open-source projects is a special kind of digital service, so why not try this? Let’s imagine that contributors are users and commits are their purchases. Then it’s relatively easy to calculate classic parameters for an OSS project: active and total users, growth, churn, retention, ARPU, LTV, etc.
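To make the analogy concrete, here is a minimal sketch of how a commit log could be turned into "user" metrics. The commit data, the function name `user_metrics`, and the choice of "commits per active contributor" as the ARPU analogue are illustrative assumptions, not the exact method behind the charts in this post.

```python
from collections import Counter
from datetime import date

# Hypothetical commit log: (contributor, commit date) pairs.
commits = [
    ("alice", date(2020, 1, 15)),
    ("alice", date(2020, 3, 2)),
    ("bob", date(2020, 6, 30)),
    ("carol", date(2019, 5, 1)),
]

def user_metrics(commits, year):
    """Treat contributors as users and commits as purchases for one calendar year."""
    # Commits ("purchases") per contributor within the year
    in_year = Counter(author for author, day in commits if day.year == year)
    # Everyone who has committed up to and including that year
    seen_so_far = {author for author, day in commits if day.year <= year}
    active = len(in_year)
    purchases = sum(in_year.values())
    return {
        "total_users": len(seen_so_far),
        "active_users": active,
        "commits": purchases,
        # ARPU analogue (assumption): average commits per active contributor
        "commits_per_active_user": purchases / active if active else 0.0,
    }

metrics_2020 = user_metrics(commits, 2020)
print(metrics_2020)
```

Churn, retention and LTV analogues can be derived from the same (contributor, date) pairs by comparing consecutive periods.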

Notably, contribution patterns differ across types of software: for example, it is easier to commit something meaningful to a JavaScript front-end framework than to a C-based high-performance web server. Thus it’s worth comparing repos only within a specific domain. For a few reasons, I chose open-source databases to illustrate contribution metrics.

Why open source databases?

Firstly, almost all known databases (735) are listed on dbdb.io, which can be scraped automatically. After filtering for open-source projects and enriching the data with the GitHub API, one gets 358 repos. An excellent dataset to play with!

Secondly, in January 2021, open source databases finally won the popularity battle with closed source rivals on db-engines.com, the most popular website for DBMS selection.

Source: db-engines.com (Jan 2021)

Databases are essential software powering almost all modern businesses, and with the paradigm shift from proprietary database vendors to open source, the world becomes more antifragile. Time for a small celebration!

Finally, I am a big fan of databases and am considering deals in this field, and Runa has MariaDB, founded by MySQL’s creator, in its portfolio. We ❤️ databases.

Contributor dynamics

We will focus on the dynamics of database repos because static data is often misleading. For instance, you will see how a mature project could have high absolute metrics but be stagnating or even declining due to new rivals.

Active contributors

Let an active contributor (AC) be a user who made a commit in the last 12 months. Most ACs (58% for database repos) add 1–2 commits to a repo and vanish, while the bulk of the code (69% of commits) is created by contributors with 3+ commits in a target repo.

We’ll call them active qualified contributors (AQC) and measure them separately. They are the advanced developers of a repo, exceeding average contributors in the quantity and usually also in the quality of their commits.
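The two definitions above can be sketched in a few lines. This is only an illustration under stated assumptions: the commit log is hypothetical, the function name `active_and_qualified` is made up, and the 3+ commit threshold is counted over a contributor's whole history in the repo (one possible reading of the definition), not only within the 12-month window.

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical commit log for one repo: (contributor, commit date) pairs.
commits = [
    ("alice", date(2019, 2, 1)),
    ("alice", date(2019, 9, 12)),
    ("alice", date(2020, 6, 1)),
    ("bob", date(2020, 11, 20)),
    ("carol", date(2018, 1, 5)),
    ("carol", date(2018, 3, 9)),
    ("carol", date(2018, 7, 30)),
]

def active_and_qualified(commits, as_of, window_days=365, min_commits=3):
    """Return (AC, AQC) sets as of a given date."""
    # Lifetime commit counts per contributor (assumption: threshold is lifetime-based)
    lifetime = Counter(author for author, day in commits if day <= as_of)
    # Active = at least one commit within the trailing window
    start = as_of - timedelta(days=window_days)
    active = {author for author, day in commits if start < day <= as_of}
    # Qualified = active contributors with min_commits or more commits in the repo
    qualified = {author for author in active if lifetime[author] >= min_commits}
    return active, qualified

ac, aqc = active_and_qualified(commits, as_of=date(2020, 12, 31))
print(sorted(ac), sorted(aqc))
```

Here carol has three commits but none in the trailing year, so she is neither an AC nor an AQC, while bob is active but not yet qualified.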

About 19K contributors have ever made at least one commit to the main database repos, but only 5.6K were active in 2020, and just 3.7K of them were AQCs. The share of qualified contributors among active ones has fluctuated around 70% since 2015. They represent the core of the database development community.

The share of qualified contributors among “active” and “new” contributors declines in 2020 partly because some of the new contributors in 2020 will make their third commit in 2021 and will only then be counted as qualified contributors.

With the open-source DBMS market roughly estimated at $11B+ in 2020, open-source databases generate ~$2M in annual revenue per AC, while Oracle has just $1M in revenue per engineer. Not surprising, but still impressive 🙂
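The ~$2M figure follows directly from the two numbers quoted in this section. The variable names below are illustrative; the inputs are the rough $11B market estimate and the 5.6K active contributors in 2020.

```python
# Figures quoted in the text above
market_size_usd = 11e9   # "$11B+" rough market estimate for 2020
ac_count = 5_600         # "5.6K were active" in 2020

revenue_per_ac = market_size_usd / ac_count
print(f"${revenue_per_ac / 1e6:.2f}M per active contributor")
```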

The community of database contributors grows, but where exactly do they contribute? There are 176 active database repos with non-zero contributor activity in 2020, and we will consider them further.

Top 20 open-source databases by active contributors in 2020. The lines represent the number of contributors who made at least 1 commit in the preceding 12 months.

There are 4 clear leaders having 200+ active contributors in 2020:

  1. ElasticSearch, an enterprise search & analytics engine developed and commercialized by Elastic (NYSE: ESTC, $14B market cap), which went public in 2018. The project is based on the Apache Lucene library.

  2. Clickhouse, a column-oriented OLAP database developed by Yandex (NASDAQ:YNDX, $26B market cap) and commercialized by a few early-stage startups, the most notable of which is Altinity.

  3. Apache Spark, an analytics engine for big data processing developed at UC Berkeley and commercialized by Databricks (valued at $28B in 2021). In the Q2 2015 to Q1 2016 period it had a record 486 ACs; it then reached maturity and started to lose active contributors.

  4. TiDB, a distributed HTAP database developed and commercialized by PingCAP (raised $270M Series D in 2020).

The second leading group consists of CockroachDB, Prometheus, MongoDB, and TrinoDB (formerly PrestoSQL), which all had 150–170 active contributors in 2020. All others have fewer than 135 ACs.

Acquisition

The median database acquired 7 new contributors in 2020, but the overall inflow is very concentrated. The top 17 databases (out of 176 “active”) gained 50.1% of all new contributors in 2020. Here they are:

The four leaders remain the same but are now headed by Clickhouse, which in 2021 will likely become the number one database globally by active contributors. Another interesting detail is the close positions of PrestoDB and TrinoDB (formerly PrestoSQL), which have a non-trivial joint history.
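The concentration of inflow described in this section (a small group of repos capturing half of all new contributors) can be measured with a short helper. The repo names and counts below are made up for illustration, and `repos_covering_share` is a hypothetical function name.

```python
from statistics import median

# Hypothetical number of new contributors per repo in one year.
new_contributors = {"repo_a": 50, "repo_b": 30, "repo_c": 10, "repo_d": 10}

def repos_covering_share(counts, share=0.5):
    """Return the smallest set of top repos whose new contributors reach `share` of the total."""
    total = sum(counts.values())
    if total == 0:
        return [], 0.0
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    covered, top = 0, []
    for repo, count in ranked:
        if covered >= share * total:
            break
        top.append(repo)
        covered += count
    return top, covered / total

top_repos, covered_share = repos_covering_share(new_contributors)
print(median(new_contributors.values()), top_repos, covered_share)
```

In this toy dataset a single repo already covers half of the inflow, while the median repo gains far fewer contributors than the leader.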

Growth

Both the number of total contributors (TC) and the number of active contributors (AC) are essential, but they carry different meanings. While a high TC tells us about a large and diverse codebase created in the past, a high AC signals developers’ current interest and their belief in the strong future potential of an OS project.

We have data for the first 30+ quarters of life of active OSS databases and can derive meaningful AC and TC benchmarks, similar to what we did with star growth. They are well approximated by linear regression, especially from one year after the first commit onwards:

Coloured lines represent three percentiles: 25%, 50% (median) and 75%. They divide the chart into 4 zones and can be used for assessment. For instance, if your database repo is above the green line, then it belongs to the top quartile and performs better than 75% of database repos at this stage of development.
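As a sketch of how such a benchmark could be applied, the snippet below places one repo into a quartile zone relative to its peers at the same age. The peer values are invented, `benchmark_zone` is a hypothetical helper, and the "inclusive" quantile method is just one reasonable choice; the charts in this post may use a different interpolation.

```python
from statistics import quantiles

# Hypothetical active-contributor counts of peer repos at the same age (in quarters).
peers_at_same_age = [2, 4, 6, 8, 10]

def benchmark_zone(value, peers):
    """Return the quartile zone of `value` among `peers` (1 = bottom, 4 = top)."""
    # 25th, 50th and 75th percentiles of the peer group
    q25, q50, q75 = quantiles(peers, n=4, method="inclusive")
    if value > q75:
        return 4
    if value > q50:
        return 3
    if value > q25:
        return 2
    return 1

print(benchmark_zone(9, peers_at_same_age))
```

Repeating the calculation for every quarter since the first commit gives the three percentile lines, and a repo's zone can then be tracked over time.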

Let’s turn to the current market: which databases lead by active qualified contributors (AQC) and their growth in 2020? Here they are:

Only repositories with 5+ active qualified contributors in 2019 are shown.

Retention

All contributors tend to churn, but the best open-source projects keep them engaged for longer. For instance, the top-25% database repos retain more than 31% of active contributors 1 year after their first commit, while the bottom-25% repos retain less than 11%.

Coloured lines represent three percentiles of the retention rates: 25%, 50% (median) and 75%. These rates are calculated for contributors’ quarterly cohorts as the ratio of the number of active contributors in the Nth quarter to the number in the first quarter. By definition, all lines start at 100% (the first quarter).
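The cohort calculation described above can be sketched as follows. The commit log is hypothetical, and `quarter_index` and `cohort_retention` are illustrative names; a cohort here is the set of contributors whose first commit falls into the same calendar quarter.

```python
from collections import defaultdict
from datetime import date

# Hypothetical commit log: (contributor, commit date) pairs.
commits = [
    ("alice", date(2019, 1, 10)),
    ("alice", date(2020, 2, 1)),
    ("bob", date(2019, 2, 1)),
    ("carol", date(2019, 3, 5)),
    ("carol", date(2019, 5, 1)),
]

def quarter_index(day):
    """Map a date to a running quarter number so that quarters can be subtracted."""
    return day.year * 4 + (day.month - 1) // 3

def cohort_retention(commits):
    """Map each cohort's first quarter to {quarters since first commit: share still active}."""
    active_quarters = defaultdict(set)
    for author, day in commits:
        active_quarters[author].add(quarter_index(day))
    # Group contributors by the quarter of their first commit
    cohorts = defaultdict(list)
    for quarters in active_quarters.values():
        cohorts[min(quarters)].append(quarters)
    retention = {}
    for first, members in cohorts.items():
        last = max(max(quarters) for quarters in members)
        retention[first] = {
            offset: sum(first + offset in quarters for quarters in members) / len(members)
            for offset in range(last - first + 1)
        }
    return retention

retention = cohort_retention(commits)
q1_2019 = quarter_index(date(2019, 1, 1))
print(retention[q1_2019])
```

Offset 0 is always 100%, and offset 4 corresponds to retention one year after the first commit.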

Let’s consider all repos with 20+ active contributors in both 2019 and 2020, and sort them by retention rate 1 year after the first commit. This gives us the notable repos with the highest contributor retention.

Postgres is the undisputed leader, with a stable core of maintainers, while most of the other repos with high retention have either a strong company or the Apache Foundation behind them.

Engagement

A typical engagement metric is DAU/MAU, but given that commits occur less often than regular user interactions in apps, we’ll use the ratio of average daily active contributors to all active contributors in 2020. The top-20 repos by engagement with 50+ contributors are:

For instance, Neo4J has an engagement rate of 8%, which means that an average active contributor committed something to its repository on 29 days during 2020.
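The engagement ratio and its "days per contributor" interpretation can be checked with a small sketch. The commit log and the function name `engagement_rate` are hypothetical; a contributor counts as active on a day if they have at least one commit dated that day.

```python
from datetime import date

# Hypothetical commit log: (contributor, commit date) pairs.
commits = [
    ("alice", date(2020, 3, 1)),
    ("alice", date(2020, 3, 1)),   # second commit on the same day counts once
    ("alice", date(2020, 7, 9)),
    ("bob", date(2020, 10, 2)),
]

def engagement_rate(commits, year):
    """Average daily active contributors divided by all active contributors in `year`."""
    days_in_year = (date(year + 1, 1, 1) - date(year, 1, 1)).days
    # Distinct (contributor, day) pairs: one entry per contributor per active day
    contributor_days = {(author, day) for author, day in commits if day.year == year}
    contributors = {author for author, _ in contributor_days}
    if not contributors:
        return 0.0
    avg_daily_active = len(contributor_days) / days_in_year
    return avg_daily_active / len(contributors)

rate = engagement_rate(commits, 2020)
# Multiplying by the days in the year gives average active days per contributor
print(rate * 366)
# The same conversion for a rate of 8%: 0.08 * 366 is roughly 29 days
print(round(0.08 * 366))
```

The ratio is therefore equivalent to the average share of days in the year on which an active contributor committed.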

Concentration

Open-source projects are community-driven, but not all community members are equal: often a few developers are responsible for almost all of the code.

In her recent book “Working in Public”, Nadia Eghbal suggests classes for OS projects depending on the growth of contributors and users. It is a convenient framework, but it hardly describes the balance between contributors. So I would like to suggest one more approach: measuring the decentralization of open-source projects with the Herfindahl–Hirschman Index (HHI), which is often used by US and EU governments to measure market concentration.

The index is the sum of the squared market shares of all players (in percent), but instead of revenues, one can use commits in an OSS repo. For a repo with only one contributor (monopoly), HHI is 10000 = (100)². For two exactly equal contributors (duopoly), it is 5000 = (50)² + (50)², and so on. Now we can apply HHI to database repos and see how their decentralization evolves over time:

Unlike growth rates and the other metrics above, high or low concentration is neither good nor bad in itself: both models can work. But it is interesting to see that even very concentrated database projects starting at HHI ~ 10000 eventually become community-driven. That’s the way open source works.
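For reference, the HHI of a repo can be computed from per-contributor commit counts in a few lines. This is a minimal sketch with a hypothetical function name `hhi`; each contributor's share of commits is expressed in percent and squared, then the squares are summed.

```python
def hhi(commit_counts):
    """Herfindahl-Hirschman Index over per-contributor commit counts (shares in percent)."""
    total = sum(commit_counts)
    if total == 0:
        return 0.0
    return sum((100 * count / total) ** 2 for count in commit_counts)

print(hhi([120]))            # single contributor (monopoly)
print(hhi([60, 60]))         # two equal contributors (duopoly)
print(hhi([25, 25, 25, 25])) # four equal contributors
```

Computing this per quarter on the commits made in that quarter gives a concentration curve over the life of a project.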

Instead of a conclusion

As mentioned above, OS projects require a holistic approach. But if I had to focus on only one contribution metric, I’d use active qualified contributors (AQC) and their growth. Both parts of this definition are important:

  • Active contributors: a forward-looking metric representing developers’ interest in the project’s future (not in the past, like “total contributors”).

  • Qualified contributors: core developers and the essential part of an OSS project’s community. They are the ones who create most of the open-source code.

Thanks to Julia Schottenstein (Principal, NEA), Robert Hodges (CEO, Altinity) and Erin Price-Wright (Principal, Index Ventures) who reviewed the draft of this article and provided valuable feedback.

