Concept Explanations

Learn to quantify the importance of high-level concepts via Testing with Concept Activation Vectors (TCAV).

Testing with Concept Activation Vectors (TCAV)

Testing with Concept Activation Vectors (TCAV) is an interpretability method for understanding which signals our neural network models use for prediction. It quantifies the importance of high-level concepts (e.g., color, gender, race) for a prediction class, which is how humans naturally communicate!

Typical interpretability methods, such as saliency maps, class activation maps (CAMs), and counterfactuals, require one particular image that we are interested in understanding. TCAV, in contrast, explains whether a whole class of examples is sensitive to certain human-defined concepts. For example, TCAV for the class zebra measures how sensitive the model's zebra predictions are to the presence of stripes in the input images.

Concept Activation Vectors (CAVs)

TCAV uses directional derivatives (derivatives taken in a particular direction) to quantify the degree to which a user-defined concept, such as stripes, is important to a classification result, such as zebra. The algorithm derives Concept Activation Vectors (CAVs) by training a linear classifier to separate examples belonging to a concept from random counterexamples, as shown in the sketch below.
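
As a minimal sketch of this step, the snippet below fits a logistic regression on stand-in activations and reads off a concept direction from its weights. The names `concept_acts` and `random_acts` are hypothetical placeholders for activations you would extract from a real network.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-ins for network activations; in practice these come
# from running concept examples (e.g., striped images) and random
# counterexamples through the model and flattening a layer's activations.
rng = np.random.default_rng(0)
concept_acts = rng.normal(loc=1.0, size=(50, 128))  # concept examples (P_C)
random_acts = rng.normal(loc=0.0, size=(50, 128))   # counterexamples (N_C)

# Fit a linear classifier to separate the two sets of activations.
X = np.vstack([concept_acts, random_acts])
y = np.array([1] * 50 + [0] * 50)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# The CAV is the direction orthogonal to the decision boundary,
# normalized to unit length.
cav = clf.coef_[0] / np.linalg.norm(clf.coef_[0])
print(cav.shape)  # (128,)
```

Logistic regression is one common choice here; any classifier with a linear decision boundary yields a CAV in the same way.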

Mathematically, given a human-defined concept $C$, such as striped textures, TCAV receives a set of positive examples $P_C = \{X_1, X_2, \dots, X_N\}$ (e.g., photos of striped objects) and a set of negative examples $N_C = \{X_1, X_2, \dots, X_M\}$ (e.g., photos of dotted or zigzagged objects) as input.

It then fits a linear classifier $\theta_C$ to distinguish the last-layer activations of the two sets $P_C$ and $N_C$.
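
As a sketch following the original TCAV paper (Kim et al., 2018), writing $f(x)$ for the last-layer activations of an input $x$, $w_C$ for the learned weights of $\theta_C$, and $h_k$ for the logit of class $k$ (e.g., zebra), the CAV is the unit normal of the separating hyperplane, and concept sensitivity is the directional derivative of the logit along it:

$$v_C = \frac{w_C}{\lVert w_C \rVert}, \qquad S_{C,k}(x) = \nabla h_k\bigl(f(x)\bigr) \cdot v_C$$

In the original paper, the TCAV score for class $k$ is then the fraction of class-$k$ inputs with positive sensitivity $S_{C,k}(x)$.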
