Short bio

Dr. Malkhi’s research over two decades spans broad aspects of the reliability and security of distributed systems, with a recent focus on blockchains and advances in financial technology. Her work has resulted in over 150 publications and has had a strong impact on computing technology.

A select sample of contributions includes:

Presently, Malkhi serves as Distinguished Scientist at Chainlink Labs (since 2022). From 2019 to 2022, Malkhi served in three roles in the Diem (Libra) project: CTO at the Diem Association, Lead Maintainer of the Diem open-source project, and Lead Researcher at Novi. In 2014, after the closing of the Microsoft Research Silicon Valley lab, Malkhi co-founded VMware Research and was a Principal Researcher at VMware until June 2019. Prior to that, Malkhi was a partner principal researcher at Microsoft Research, 2004-2014; a tenured Associate Professor (promoted 2003) at the Hebrew University of Jerusalem, 1999-2007; and a senior researcher at AT&T Labs, 1995-1999.

Selected academic roles and distinctions:

Technology Impact

For over two decades, my work has, by choice, straddled foundational and applied research. I have published over 150 papers; recent ones are listed on my homepage, and DBLP keeps track of the rest.

I was fortunate to bring several scientific results to fruition within leading industrial platforms. Below, I tell the stories of four technologies I participated in creating.

HotStuff and DiemBFT – Co-Inventor and Technical Lead, at VMware 2016, Diem (Libra) 2019

The blockchain world has brought renewed interest in scaling and hardening solutions to the long-standing problem of asynchronous Byzantine Fault Tolerant (BFT) consensus.

In 2016, while designing the blockchain infrastructure for VMware’s blockchain project, we observed that all BFT solutions contain quadratic voting steps. Why is this so bad? When Byzantine consensus protocols were originally conceived, a typical target system size was n=4 or n=7, tolerating one or two faults. But scaling BFT consensus to n=2000 means that even on a “good day”, when communication is timely and only a handful of failures occur, quadratic steps require n^2 = 4,000,000 messages. A cascade of failures might bring the communication complexity to a whopping n^3 = 8,000,000,000 transmissions for a single consensus decision. No matter how good the engineering, and however much we tweak and batch the system, these theoretical measures are a roadblock to scalability.
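The arithmetic behind these figures can be checked with a minimal sketch. The function names are illustrative, and treating the cascade case as a plain cubic count is an assumption made only because the quoted figure equals n^3; it is not a model of any specific protocol.

```python
def quadratic_messages(n: int) -> int:
    """One all-to-all voting step: each of n replicas sends to each of n replicas."""
    return n * n


def cubic_messages(n: int) -> int:
    """Assumed reading of the cascade case: n quadratic steps in a row."""
    return n * quadratic_messages(n)


if __name__ == "__main__":
    for n in (4, 7, 2000):
        print(f"n={n}: quadratic={quadratic_messages(n):,} cubic={cubic_messages(n):,}")
```

At n=4 or n=7 both counts are small enough to ignore, which is why the quadratic steps only became a concern at blockchain scale.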

Around that time, blockchain startups outside academic circles were producing tremendous innovation. Two of their protocols caught our attention: Tendermint and Casper. These protocols dramatically simplified the view-change mechanism by introducing a synchronous delay when a leader starts. I observed that by adding one more phase to Tendermint, we could keep the advantage of simplicity while avoiding the delay it introduced. The result is HotStuff: BFT Consensus in the Lens of Blockchain, the first responsive BFT solution with a linear view-change. It is named after a cartoon character from the same family as Casper.

Beyond improving communication complexity, HotStuff embodies a minimalist algorithmic framework that bridges classical BFT solutions and the blockchain world; the entire protocol is captured in less than half a page of pseudo-code. HotStuff became popular in the blockchain developer community not only due to its linearity, but also (and perhaps mostly) due to its simplicity and developer-friendly design. Diem (Libra) adopted it to drive its blockchain infrastructure, as did (to our knowledge) Flow, Celo, and Cypherium.

Flexible Paxos – Co-inventor, at VMware 2016

In the summer of 2016, I hosted a research intern named Heidi Howard from Cambridge, UK. I told her about the CorfuDB protocol and encouraged her to think about the performance benefit of separating the sequencer role from the rest of the system. The result was a stunning revelation, which we named Flexible Paxos: Quorum Intersection Revisited:

Each of the phases of Paxos may use non-intersecting quorums. Only quorums from different phases are required to intersect. Majority quorums are not necessary as intersection is required only across phases.

Everyone in the field of distributed systems knows that quorums in Paxos must intersect, so what gives? What Heidi observed is that Paxos, which lies at the foundation of many production systems, is conservative. Within each phase of Paxos, it is safe to use disjoint quorums, and majority quorums are not necessary. Since the second phase of Paxos (replication) is far more common than the first phase (leader election), we can use Flexible Paxos to reduce the size of the commonly used second-phase quorums. By no longer requiring replication quorums to intersect with each other, we removed an important limit on scalability. Through smart quorum construction and pragmatic system design, we enabled a new breed of scalable, resilient and performant consensus algorithms. The algorithmic core of LogDevice, a production scale-out messaging bus at Facebook, is based on it, as is the more flexible Paxos of YouTube’s distributed MySQL backbone.
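The intersection requirement can be illustrated with a minimal sketch for the simplest case, where a quorum is any set of a fixed size. The function names and the restriction to size-based quorums are assumptions made for illustration; Flexible Paxos itself permits other quorum constructions.

```python
from itertools import combinations


def phases_intersect(n: int, q1: int, q2: int) -> bool:
    """Any q1 acceptors and any q2 acceptors out of n share a member iff q1 + q2 > n."""
    return q1 + q2 > n


def same_phase_may_be_disjoint(n: int, q: int) -> bool:
    """Two quorums of size q can avoid each other entirely iff 2 * q <= n."""
    return 2 * q <= n


def phases_intersect_brute_force(n: int, q1: int, q2: int) -> bool:
    """Exhaustive check of the same property; only practical for small n."""
    acceptors = range(n)
    return all(
        set(a) & set(b)
        for a in combinations(acceptors, q1)
        for b in combinations(acceptors, q2)
    )


if __name__ == "__main__":
    n, q1, q2 = 6, 5, 2  # large leader-election quorum, small replication quorum
    print(phases_intersect(n, q1, q2))        # phase 1 always meets phase 2
    print(same_phase_may_be_disjoint(n, q2))  # phase-2 quorums need not overlap
```

With six acceptors, replication quorums of two are disjoint from one another yet still meet every leader-election quorum of five, which is the trade described above: the rare phase pays for the cheaper common phase.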

CorfuDB – Initiator and Technical Lead, at Microsoft 2012 and VMware 2014

In 2012, Phil Bernstein approached me at Microsoft Research with the following observation. RAM has grown cheap and large enough to hold a complete database index in memory. Therefore, one can build a fully replicated transaction processing engine by storing a database index completely in memory and persisting index modifications to a shared commit-log. His team prototyped an in-memory index called Hyder. The key enabler for this vision would be a reliable, high-throughput distributed log, which Phil wanted to stripe across an array of SSDs. Unfortunately (yet fortunately for me), the initial design of his distributed commit-log was flawed. While fixing the design, I extracted a foundational insight that motivated me to establish and lead the CorfuDB project.
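The idea of an in-memory index backed by a shared commit-log can be sketched in a few lines. This is a single-process toy under stated assumptions: the SharedLog and Replica classes and their methods are hypothetical names, not the Hyder or CorfuDB API, and a Python list stands in for a replicated, durable log.

```python
class SharedLog:
    """Toy append-only log; a stand-in for a replicated, durable commit-log."""

    def __init__(self):
        self._entries = []

    def append(self, entry) -> int:
        """Append an entry and return its position in the log."""
        self._entries.append(entry)
        return len(self._entries) - 1

    def read_from(self, position: int):
        """Return every entry at or after the given position."""
        return self._entries[position:]


class Replica:
    """Keeps a full in-memory index that is rebuilt purely by replaying the log."""

    def __init__(self, log: SharedLog):
        self.log = log
        self.index = {}
        self.applied = 0  # number of log entries already applied

    def put(self, key, value) -> None:
        # Writes go to the log only; the index changes when the log is replayed.
        self.log.append((key, value))

    def sync(self) -> None:
        for key, value in self.log.read_from(self.applied):
            self.index[key] = value
            self.applied += 1

    def get(self, key):
        self.sync()
        return self.index.get(key)


if __name__ == "__main__":
    log = SharedLog()
    a, b = Replica(log), Replica(log)
    a.put("x", 1)
    b.put("x", 2)
    print(a.get("x"), b.get("x"))  # both replicas see the same log order
```

Because every replica applies the same entries in the same order, the log, rather than any one replica, is the source of truth.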

CorfuDB is a database-less database built around a global, reliable, high-throughput distributed commit-log. The CorfuDB log serves as the source of ground truth around which one builds distributed control-planes for large clusters. The key paradigm underlying CorfuDB is a reliable log that operates at high throughput; this is the foundational insight I took from Hyder. I built the first CorfuDB PoC at Microsoft under an open-source license, and later drove it to production at VMware. At VMware, CorfuDB serves as the distributed control-plane for NSX-T, a leading SDN product that has a market volume of over $1B. At Facebook, CorfuDB was re-engineered as Delos, a control plane underlying a dynamic cluster storage backend system.

You might wonder what happened to Phil’s in-memory fully replicated DB. Several years later, it became the backbone of the SQL Azure cloud database.

Fairplay – Co-Inventor at Hebrew University of Jerusalem, 2004

In 2004, Noam Nisan and I asked ourselves whether cryptographic primitives that had been considered completely impractical were actually becoming practical. With my PhD student Yaron Sella, we implemented the MPC protocol, while Noam supervised his graduate students in implementing a language that compiles into a binary circuit. The first fully implemented Fairplay MPC platform was up and running shortly after. By 2008, the millionaires’ problem, mini auctions, and other problems could be solved over an interconnect in seconds. Since then, the Fairplay source code has been downloaded by hundreds of academic groups, and over the past decade it has sparked a wave of crypto-engineering projects that bring crypto theory into practice, including heavyweight methods such as oblivious RAM, ZK proofs and PCP.

Academic descendants