2020 Highlights

Posted: Dec 28, 2020
◷ 4 minute read

Here is my annual roundup of some of the most interesting things I came across in 2020. Topics I have already written about during the year are not repeated here.

Bartosz Ciechanowski’s Cameras and Lenses

An absolutely fantastic introduction to the invention and technical development of cameras. It starts from first-principles concepts in physics and optics, then walks step by step through how photographic apparatus evolved into what it is today. The interactive animations are very well made and clearly demonstrate the ideas they are meant to convey.

Apart from teaching me about the physics of cameras (which, admittedly, I was already familiar with), it serves as an example of the kind of high-quality teaching material that we desperately need yet remains in short supply today. Producing it requires the intersection of three already-limited sets of people: those with a true passion for a topic, those who can articulate and explain it excellently, and those competent enough (and with enough free time) to actually create the content. Modern technology has certainly made producing such materials easier than ever, but people in that three-set intersection remain extremely scarce.

Reverse Engineering the SARS-CoV-2 Vaccine

This is an interesting piece that walks through some of the basics of how the BioNTech/Pfizer SARS-CoV-2 vaccine was made, but through the lens of a computer programmer. I have always been fascinated by the similarities and differences between artificial computing (e.g. Turing machines, programming, software) and biological computing (e.g. DNA, proteins, cells), and this article does a great job of showcasing them. The parts about biology are somewhat rudimentary, and given my much more limited knowledge of the field, I cannot adequately judge the accuracy of some of the inevitable simplifications. Yet the cross-pollination of computing and biology remains, in my eyes, one of the most natural and fruitful of such pairings, and this article does an excellent job of reinforcing that idea.
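
As a toy illustration of that analogy (my own sketch, not from the article): mRNA can be read like bytecode, with the ribosome acting as an interpreter that executes three-letter codon "instructions" into amino acids, and stop codons acting as a halt. The codon table below is a small excerpt of the real genetic code.

```python
# A toy "ribosome interpreter": reads an mRNA string codon by codon,
# much like a VM stepping through bytecode three letters at a time.
# The table is a tiny excerpt of the real genetic code.
CODON_TABLE = {
    "AUG": "Met",  # start codon; also encodes methionine
    "UUU": "Phe", "GGC": "Gly", "GAA": "Glu",
    "UAA": None, "UAG": None, "UGA": None,  # stop codons: halt
}

def translate(mrna: str) -> list[str]:
    protein = []
    for i in range(0, len(mrna) - 2, 3):  # step in 3-letter "instructions"
        amino_acid = CODON_TABLE.get(mrna[i:i + 3])
        if amino_acid is None:  # stop codon (or codon not in this excerpt)
            break
        protein.append(amino_acid)
    return protein

print(translate("AUGUUUGGCGAAUAA"))  # ['Met', 'Phe', 'Gly', 'Glu']
```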

Masnick’s Impossibility Theorem (of Content Moderation)

Mike Masnick formulated this “theorem” in late 2019, and it basically states that “content moderation at scale is impossible to do well”. The rationale is that you will always piss off a non-negligible number of people, regardless of which moderation action you take (i.e. take the content down or leave it up), when you are dealing with a large enough group of people (the “at scale” part).

It reminded me of this piece by @vgr, also from this year, on the role of narratives in society. Through a particular lens of analysis, he noted that (grand) narratives are analogous to highways, and the people who share them are like cars on the highway sharing the same destination. Narratives exist to get as many people onto a particular highway to a specific destination as possible, yet people don’t all want to go to the same place. The idea that “none of us is free until all of us is free” doesn’t work. So, as the author concludes, universal liberty really must be a system of highways.

In particular, this “system of highways” approach doesn’t seem to work for content moderation. If everyone gets their own “highway”, i.e. their own content moderation preferences, then in a sense there is no longer any actual moderation. It solves part of the problem, namely that nobody is shown what they do not wish to see, but not the whole problem. Arguably, one of the key properties desired of a content moderation system is precisely that it is universal: if a piece of content is bad, nobody should see it. So everyone must be on the same “highway”, hence the impossibility part of the theorem.
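
Here is a minimal sketch of that distinction (the data and function names are hypothetical, purely for illustration): per-user preferences hide content only from the users who opted out, while a universal decision removes it for everyone. Under the per-user “highway” model, content that most users consider bad is still served to everyone who didn’t block it.

```python
# Sketch: per-user "highways" vs. universal moderation (illustrative only).
posts = [{"id": 1, "text": "harmless"}, {"id": 2, "text": "bad"}]

def per_user_feed(posts, user_blocklist):
    # Each user filters only by their own preferences ("own highway"):
    # disliked content disappears for them but stays up for everyone else.
    return [p for p in posts if p["id"] not in user_blocklist]

def universal_feed(posts, global_removals):
    # One moderation decision applies to all users ("same highway").
    return [p for p in posts if p["id"] not in global_removals]

print(per_user_feed(posts, user_blocklist={2}))    # user A never sees post 2
print(per_user_feed(posts, user_blocklist=set()))  # user B still sees post 2
print(universal_feed(posts, global_removals={2}))  # nobody sees post 2
```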

But perhaps Masnick simply used too strict a definition of “do well”. The core of the theorem relies on the fact that at a large enough scale, even a small (e.g. 1%) dissatisfaction rate with a moderation decision translates into a large number of angry people in absolute terms. Some of these people will always be vocal about it, so it will always seem like the system is doing poorly, but I don’t think the goal of the system should be to satisfy 100% of people. Making a moderation decision that 99% of people agree with can be considered “doing well”, despite the vocal minority of dissenters. Maybe the theorem should instead read: “it is impossible for content moderation at scale to maintain a positive perceived reputation, no matter how closely the system approaches perfection.”
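
To make the “at scale” arithmetic concrete, here is a quick back-of-the-envelope calculation (the numbers are my own illustration, not Masnick’s):

```python
# Why "at scale" dooms perceived reputation: illustrative numbers only.
users = 500_000_000   # users exposed to a given moderation decision
rate = 0.01           # 1% dissatisfaction, i.e. 99% agree with the decision

angry = round(users * rate)
print(f"{angry:,} people unhappy with a single 99%-approved decision")
# -> 5,000,000 people unhappy: more than enough vocal dissent to make
#    the system *look* like it is failing, even when it is doing well.
```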

Does GPT-2 Know Your Phone Number?

This analysis of GPT-2 helped shed some light on one of the key questions that has always troubled me with such large language models: are they simply memorizing and reciting text from their training data? The TL;DR seems to be: yes, the model does memorize text, though not a lot (at least 0.1% of generated texts are recitations), yet even that poses serious problems for copyright and privacy.

An interesting part of this research is how they had to do it. Since they did not have direct access to GPT-2’s training data, they had to rely on public search indexes (such as Google) to find instances of memorization. I feel like this underestimates the amount of memorization detected, which probably explains why the 0.1% figure above is stated as a lower bound. The researchers also did not select generated content for memorization checks by quality, but by the confidence level reported by the model. It’s possible that selecting by human-perceived quality would yield a different memorization rate (probably higher, if I had to guess).
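
As a hedged sketch of that recipe as I understand it (the helper functions here are hypothetical stand-ins for a real model API and search backend, not the researchers’ actual code): generate many samples, rank them by the model’s own confidence, then check the most confident ones against a public search index.

```python
import math

def perplexity(logprobs: list[float]) -> float:
    # Lower perplexity = higher model confidence in the text, which the
    # researchers used as a signal of possible memorization.
    return math.exp(-sum(logprobs) / len(logprobs))

def memorization_candidates(generate_sample, token_logprobs,
                            search_index_hits, n=1000, top_k=20):
    # generate_sample, token_logprobs, and search_index_hits are
    # hypothetical stand-ins for a model API and a web search backend.
    samples = [generate_sample() for _ in range(n)]
    # Rank by model confidence: most confident (lowest perplexity) first.
    ranked = sorted(samples, key=lambda s: perplexity(token_logprobs(s)))
    # Without access to the training data, fall back to a public search
    # index: an exact hit on the web suggests the text was memorized.
    return [s for s in ranked[:top_k] if search_index_hits(s) > 0]
```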