Tools to identify types or regimes for which the system fails to operate successfully
Exploratory Data Science・DevOps・Design Research

Work In Progress

Our Elements guide is still in progress, and therefore lacks full visual and technical assets. We hope to release them by summer of 2020. Thanks for reading Lingua Franca!


The dynamic and emergent properties of AI can affect users in unpredictable ways. Managing this complexity requires its own tools and techniques, some of which may derive from the specifics of each modeling algorithm. In most cases, however, managers of an AI system will need a general-purpose forensics tool to identify which kinds of users the system fails to serve well. Because of the data-driven nature of AI, pockets can emerge where the system fails on a whole category of users. For example, a music recommendation service may fail egregiously at recommending Reggaeton while performing well on every other genre. The failure may stem from insufficient training data, or from other music categories overlapping too heavily with Reggaeton. Either way, many such unpredictable cases will spring up in a large-scale system. To combat this, operators of an AI system must be able to identify individual failure cases and then test various hypotheses to discover the root cause of the incorrect behavior.
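One simple way to surface such pockets is to scan per-category failure rates and flag categories that sit far above the overall baseline. The sketch below illustrates the idea; the category names, the `min_support` and `threshold` parameters, and the notion of a "failed" interaction are all illustrative assumptions, not part of any particular product.

```python
from collections import defaultdict

def failure_pockets(interactions, min_support=50, threshold=0.2):
    """Flag categories whose failure rate sits well above the overall rate.

    `interactions` is an iterable of (category, failed) pairs, where
    `failed` marks an unsuccessful interaction (e.g. a skipped
    recommendation). All names and thresholds here are illustrative.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for category, failed in interactions:
        totals[category] += 1
        if failed:
            failures[category] += 1

    overall = sum(failures.values()) / sum(totals.values())
    pockets = {}
    for category, n in totals.items():
        if n < min_support:
            continue  # too few samples to judge this category
        rate = failures[category] / n
        if rate > overall + threshold:
            pockets[category] = rate
    return pockets
```

A pocket flagged this way is only a lead, not a diagnosis: it tells the operator where to look, after which the hypotheses above (sparse training data, overlapping categories) still need to be tested.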


Modern, large-scale AI systems drive interactions across a stunningly wide variety of contexts, needs, and typologies of human behavior. This ‘scalability’ necessarily requires that a single system or algorithm dynamically serve diverse needs. Recommender systems on platforms such as Amazon do precisely this, giving users a list of similar products no matter which category the user is browsing. However, data collected across many domains and genres may not fit so neatly into the paradigm of a single algorithm. For example, a list of similar books may require slightly different assumptions from the AI than a list of similar clothes. These subtle differences between domains can degrade the algorithm’s performance on small pockets of products and categories. Additionally, a broad data collection effort may have to combine multiple datasets built on different assumptions. Perhaps data collected in Asia used credit card information while data collected in South America used cash transactions. Such subtle differences may not produce glaring errors, but they can create a more insidious set of edge cases and errors (see Errata).
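Divergent assumptions between merged datasets can often be caught by comparing how a feature is distributed in each source before the data is combined. The sketch below does this for a hypothetical `payment` field; the field name, the record layout, and the two regional sources are assumptions for illustration.

```python
def feature_shift(source_a, source_b, feature):
    """Compare how often each value of `feature` appears in two datasets
    about to be merged. Large gaps suggest the sources encode the feature
    under different assumptions. Records are plain dicts; all names are
    illustrative.
    """
    def proportions(records):
        counts = {}
        for record in records:
            value = record.get(feature)
            counts[value] = counts.get(value, 0) + 1
        total = len(records)
        return {value: count / total for value, count in counts.items()}

    props_a = proportions(source_a)
    props_b = proportions(source_b)
    values = set(props_a) | set(props_b)
    # Absolute gap in proportion for every value seen in either source.
    return {v: abs(props_a.get(v, 0.0) - props_b.get(v, 0.0)) for v in values}
```

In the credit-card-versus-cash scenario above, a check like this would show the `payment` field skewing heavily in opposite directions between the two regions, prompting the operator to reconcile the schemas before training on the combined data.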

Forensics tools allow an operator or manager of such a large-scale AI system to spot niche failure cases by letting them visualize, manipulate, cluster, and compare interaction data. The crucial task is to distinguish random errors (such as an app crashing unexpectedly) from errors that share a ‘signature’. These signatures may not have a direct causal link, given the non-linearity of AI, but they will generally share attributes such as geography, demographics, and category. Once the forensic investigation yields a testable hypothesis, the operator can validate it and correct the behavior.
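A minimal version of this signature hunt is to group error reports by a tuple of shared attributes and keep only groups too large to be random noise. The sketch below assumes error reports are plain dicts; the attribute keys and the `min_cluster` cutoff are illustrative choices, not a prescribed method.

```python
from collections import Counter

def error_signatures(errors, keys=("geo", "category"), min_cluster=5):
    """Group error reports by a shared signature of attributes.

    A signature that recurs far more often than chance suggests a
    systematic failure rather than random noise. `errors` is a list of
    dicts; `keys` and `min_cluster` are illustrative assumptions.
    """
    signature_counts = Counter(
        tuple(error.get(k) for k in keys) for error in errors
    )
    # Keep only signatures with enough repeats to merit investigation.
    return {sig: n for sig, n in signature_counts.items() if n >= min_cluster}
```

Richer tools would cluster on many more dimensions at once, but even this coarse grouping turns a pile of individual complaints into a short list of candidate hypotheses.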


Many companies have created a ‘god view’ over their products for the purpose of forensics. However, this kind of interface raises serious privacy and ethical concerns, and has caused backlash against companies employing it[1]. Instead, companies may want to consider an auditable forensics process in which operators must request access to datasets along with a rationale. Additionally, many forensics techniques, such as clustering, can operate on anonymized data where personally identifiable information (PII) is not revealed.
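One common way to enable this is pseudonymization: replacing PII fields with a keyed hash so that records from the same user still cluster together, without revealing who the user is. The sketch below uses Python's standard `hmac` module; the field names and the salt are illustrative, and in practice the salt would be kept secret and rotated.

```python
import hashlib
import hmac

def pseudonymize(record, pii_fields=("user_id", "email"), salt=b"rotate-me"):
    """Replace PII fields with a keyed hash.

    The same input value always maps to the same token, so clustering and
    comparison still work, but the original identity is not exposed.
    Field names and the salt are illustrative assumptions.
    """
    out = dict(record)
    for field in pii_fields:
        if field in out:
            digest = hmac.new(salt, str(out[field]).encode(), hashlib.sha256)
            out[field] = digest.hexdigest()[:16]  # short stable token
    return out
```

Because the mapping is deterministic under a fixed salt, an operator can still count how many distinct users a failure signature touches, while the audit process controls who may ever re-identify them.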

Further Resources


  1. Uber Allegedly Stalked Users For Party-Goers’ Viewing Pleasure on Forbes ↩︎