The ants and the pheromones

February 8, 2021 (updated October 20, 2025) ~ adriancolyer

TL;DR: this is the last edition of The Morning Paper for now. Plus: one strand of research you won’t want to miss! I was listening to a BBC Radio 4 podcast recently (More or Less: Behind the Stats – Ants and Algorithms) in which host Tim Harford interviews David Sumpter about his recent …

An overview of end-to-end entity resolution for big data

December 14, 2020 (updated October 20, 2025) ~ adriancolyer

An overview of end-to-end entity resolution for big data, Christophides et al., ACM Computing Surveys, Dec. 2020, Article No. 127. The ACM Computing Surveys are always a great way to get a quick orientation in a new subject area, and hot off the press is this survey on the entity resolution (aka record linking) problem. It’s an …

Bias in word embeddings

December 8, 2020 (updated October 20, 2025) ~ adriancolyer

Bias in word embeddings, Papakyriakopoulos et al., FAT*’20. There are no (stochastic) parrots in this paper, but it does examine bias in word embeddings, and how that bias carries forward into models that are trained using them. There are definitely some dangers to be aware of here, but also some cause for hope as we …

Seeing is believing: a client-centric specification of database isolation

November 30, 2020 (updated October 20, 2025) ~ adriancolyer

Seeing is believing: a client-centric specification of database isolation, Crooks et al., PODC’17. Last week we looked at Elle, which detects isolation anomalies by setting things up so that the inner workings of the database, in the form of the direct serialization graph (DSG), can be externally recovered. Today’s paper choice, ‘Seeing is believing’, also deals …

Elle: inferring isolation anomalies from experimental observations

November 23, 2020 (updated October 20, 2025) ~ adriancolyer

Elle: inferring isolation anomalies from experimental observations, Kingsbury & Alvaro, VLDB’20. Is there anything more terrifying, and at the same time more useful, to a database vendor than Kyle Kingsbury’s Jepsen? As the abstract to today’s paper choice wryly puts it, “experience shows that many databases do not provide the isolation guarantees they claim.” Jepsen captures …

Achieving 100Gbps intrusion prevention on a single server

November 16, 2020 (updated October 20, 2025) ~ adriancolyer

Achieving 100 Gbps intrusion prevention on a single server, Zhao et al., OSDI’20. Papers-we-love is hosting a mini-event this Wednesday (18th) where I’ll be leading a panel discussion including one of the authors of today’s paper choice: Justine Sherry. Please do join us if you can. We always want more! This stems from a combination of the Jevons paradox …

Virtual consensus in Delos

November 9, 2020 (updated October 20, 2025) ~ adriancolyer

Virtual consensus in Delos, Balakrishnan et al. (Facebook, Inc.), OSDI’20. Before we dive into this paper, if you click on the link above and then download and open up the paper PDF, you might notice the familiar red/orange splash of USENIX, and appreciate the fully open access. USENIX is a nonprofit organisation committed to making content and …

Helios: hyperscale indexing for the cloud & edge (part II)

November 2, 2020 (updated October 20, 2025) ~ adriancolyer

Helios: hyperscale indexing for the cloud & edge, Potharaju et al., PVLDB’20. Last time out we looked at the motivations for a new reference blueprint for large-scale data processing, as embodied by Helios. Today we’re going to dive into the details of Helios itself. As a reminder: Helios is a distributed, highly scalable system used at Microsoft for …

Helios: hyperscale indexing for the cloud & edge – part 1

October 26, 2020 (updated October 19, 2025) ~ adriancolyer

Helios: hyperscale indexing for the cloud & edge, Potharaju et al., PVLDB’20. On the surface this is a paper about fast data ingestion from high-volume streams, with indexing to support efficient querying. As a production system within Microsoft capturing around a quadrillion events and indexing 16 trillion search keys per day, it would be interesting in its own right, …

The case for a learned sorting algorithm

October 19, 2020 (updated October 19, 2025) ~ adriancolyer

The case for a learned sorting algorithm, Kristo, Vaidya, et al., SIGMOD’20. We’ve watched machine learning thoroughly pervade the web giants, make serious headway in large consumer companies, and begin its push into the traditional enterprise. ML, then, is rapidly becoming an integral part of how we build applications of all shapes and sizes. But what about systems …