I recently spent a couple of days trying to train a convolutional neural network (CNN) to rerank documents using ColBERT similarity matrices as images. It appeared to double ColBERT's performance on various splits of BRIGHT, a reasoning-intensive retrieval benchmark (see below).

Method (nDCG@10)      Economics   AoPS    LeetCode   Robotics
CNN-Sim (Ours)        0.248       0.492   0.270      0.281
ColBERTv2             0.128       0.234   0.310      0.134

Table 1: Reranking performance (nDCG@10) with BM25 top-100 candidates (with gold documents additionally infused) on the test sets. For most splits, the CNN led to a 2× increase in nDCG@10.

Unfortunately, what initially seemed like a promising direction ended up being a dead end for reasons that, in hindsight, were obvious. The model learned spurious correlations and collapsed completely when tested on larger candidate sets. In this blog post, I'll detail my motivations for this direction, my implementation, and what went wrong.

Reasoning-Intensive Information Retrieval

Recent advances in information retrieval have shifted focus towards reasoning-intensive information retrieval (RiiR): queries that require intensive reasoning to retrieve the most pertinent documents from a large corpus. An example would be a medical query that requires extensive search through medical journals and reasoning beyond semantic matching to determine the proper diagnosis.

Before I dive into how CNNs got mixed up in all of this, I'll first give a rough outline of how ColBERT, an older retrieval model, works.

ColBERT is a late interaction model: it encodes queries and documents independently (like a dual encoder) but retains token-level embeddings rather than collapsing them into a single vector. The "late" refers to when the query-document interaction occurs: after encoding, at scoring time, but still at token granularity. Specifically, ColBERT builds a similarity matrix between query and document tokens, then derives a relevance score via MaxSim: the sum of maximum similarities for each query token across all document tokens.
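To make the scoring concrete, here is a minimal sketch of MaxSim (not ColBERT's actual implementation), assuming L2-normalized token embeddings so that dot products are cosine similarities:

```python
import torch

def maxsim_score(q_emb: torch.Tensor, d_emb: torch.Tensor) -> torch.Tensor:
    """MaxSim sketch: q_emb is (num_query_tokens, dim), d_emb is
    (num_doc_tokens, dim), both assumed L2-normalized. The similarity
    matrix below is the object visualized throughout this post."""
    sim = q_emb @ d_emb.T               # (num_query_tokens, num_doc_tokens)
    return sim.max(dim=1).values.sum()  # best doc token per query token, summed
```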

The Problem Formulation

A natural question arises: do similarity matrices for query/gold-document pairings look different from their query/non-gold-document counterparts? More specifically, if we construct a similarity matrix and "highlight" the MaxSim at each row, do we see visual cues that allow us to differentiate between query/gold-doc, query/normal-doc, and query/bad-doc pairings? Here, I used BM25, a lexical method that is extremely cheap and fast, to initially rank the documents. I took the "normal doc" to be BM25's top-1 result and the "bad doc" to be one that falls outside the top 100. I tried this on a few queries (see an example below):

[Image: ColBERT similarity matrices visualization]

Figure 1: A query from the Art of Problem Solving (AoPS) split of BRIGHT with similarity matrices for its gold document, normal document, and bad document.

Figure 1 reveals some subtle visual cues. If we define tokens used as the number of document tokens that get "highlighted" (i.e., are the maximum of at least one row) and max single usage as the number of rows won by the single most-used document token, then query/gold-document pairings tend to be higher on the former and lower on the latter compared to query/non-gold-document pairings. That is, there's less "column" domination: the similarity matrix for gold documents looks flat because many tokens contribute, while for non-gold documents there's clearer token dominance, visible as "spikes."
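For concreteness, here's how those two statistics can be computed from a similarity matrix (a sketch; the function name and exact definitions are mine):

```python
import numpy as np

def highlight_stats(sim: np.ndarray) -> tuple[int, int]:
    """sim is a (num_query_tokens, num_doc_tokens) similarity matrix.
    Returns (tokens_used, max_single_usage):
      - tokens_used: distinct document tokens that are the max of at least one row
      - max_single_usage: number of rows won by the single most-used document token
    """
    winners = sim.argmax(axis=1)        # MaxSim column index per query row
    tokens_used = int(len(np.unique(winners)))
    max_single_usage = int(np.bincount(winners).max())
    return tokens_used, max_single_usage
```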

Given these visual cues, I was curious whether a CNN could learn, from similarity matrices alone, to distinguish between gold documents and non-gold documents for a given query.

I'll note here that the idea of treating text matching as image recognition isn't new. Pang, Lan, and colleagues explored this direction extensively in the mid-2010s: MatchPyramid (2016) applied CNNs to word-level interaction matrices, and DeepRank (2017) added detection and aggregation stages, to name a few. Results were mixed; the approach never decisively beat traditional methods, and dense retrievers and late interaction models like ColBERT largely superseded this line of work. I was curious whether revisiting the visual intuition with ColBERT's richer token-level representations might yield something, especially for reasoning-intensive queries where the patterns seemed visually distinct.

Building the Pipeline

The goal was to see if a lightweight CNN could serve as an efficient final reranking step, even with minimal training data. I focused on the four hardest splits of BRIGHT: LeetCode, AoPS, Economics, and Robotics, training a separate CNN for each using only 30% of queries (30–42 queries per split). During training, each CNN only saw samples (32–64 per query) drawn from BM25's top 100 documents plus the gold documents. The gold documents have to be injected because BM25 has poor recall for reasoning-intensive queries. At test time, I applied the same setup: retrieve BM25's top 100, inject the gold documents, shuffle, and let the CNN rerank, as sketched below.
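Roughly, the candidate construction looked like this (a simplified sketch; the helper name and document-ID representation are illustrative):

```python
import random

def build_candidates(bm25_top_k: list[str], gold_ids: list[str], seed: int = 0) -> list[str]:
    """Candidate pool used at both train and test time: BM25's top-k
    (k=100 here) plus any gold documents it missed, deduplicated and
    shuffled before the CNN reranks them."""
    pool = list(dict.fromkeys(bm25_top_k + gold_ids))  # dedupe, preserve order
    random.Random(seed).shuffle(pool)
    return pool
```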

Training the CNN

The CNN treats ColBERT similarity matrices as 128×128 grayscale images. Specifically, ColBERT (frozen) produces a similarity matrix, which is resized to 128×128 and treated as a grayscale image. A 4-layer convolutional stack with batch normalization progressively increases channels (1 → 32 → 64 → 128 → 256) while halving spatial dimensions (128 → 64 → 32 → 16 → 8) via max pooling. The final 8×8×256 = 16,384 features are flattened and passed through an MLP head (16384 → 256 → 128 → 1) to produce a single relevance score. Training used a listwise cross-entropy loss over the candidate documents, the AdamW optimizer, a learning rate of 1e-4, and 15 epochs.
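Below is a sketch of the architecture and loss as described above; the kernel sizes, activation choices, and multi-gold handling in the loss are assumptions, not a faithful reproduction of my exact code.

```python
import torch
import torch.nn as nn

class SimMatrixCNN(nn.Module):
    """4 conv blocks (1 -> 32 -> 64 -> 128 -> 256 channels) with batch norm,
    each halving the 128x128 input via max pooling (128 -> 64 -> 32 -> 16 -> 8),
    followed by an MLP head (16384 -> 256 -> 128 -> 1)."""

    def __init__(self):
        super().__init__()
        channels = [1, 32, 64, 128, 256]
        blocks = []
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            blocks += [
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),  # kernel size assumed
                nn.BatchNorm2d(c_out),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2),
            ]
        self.features = nn.Sequential(*blocks)
        self.head = nn.Sequential(
            nn.Flatten(),                      # 8 * 8 * 256 = 16384
            nn.Linear(16384, 256), nn.ReLU(inplace=True),
            nn.Linear(256, 128), nn.ReLU(inplace=True),
            nn.Linear(128, 1),
        )

    def forward(self, sim: torch.Tensor) -> torch.Tensor:
        # sim: (batch, 1, 128, 128) resized ColBERT similarity matrices
        return self.head(self.features(sim)).squeeze(-1)

def listwise_ce_loss(scores: torch.Tensor, gold_mask: torch.Tensor) -> torch.Tensor:
    """Listwise cross-entropy over one candidate list: softmax the CNN scores
    and push probability mass onto the gold document(s)."""
    log_probs = torch.log_softmax(scores, dim=-1)
    return -(log_probs * gold_mask).sum() / gold_mask.sum()
```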

The Collapse

Given the promising results in Table 1, my immediate next step was to test a trained model on an expanded candidate set. I wanted to see how the CNN performed when given BM25's top 250, 500, 750, and 1000 (same test queries; only the candidate depth changes). As shown below, the system collapsed immediately. I then retrained the CNN models, this time sampling from BM25's top 1000 instead of the top 100. This mitigated the collapse, but the results were still unimpressive: the k=1000 model was much more robust but lost the sharp edge at top-100. What was going on?

Method (nDCG@10)             k=100   k=250   k=500   k=750   k=1000
CNN-Sim (trained k=100)      0.248   0.10    0.04    0.02    0.02
CNN-Sim (trained k=1000)     0.165   0.114   0.093   0.088   0.086
ColBERTv2                    0.128   0.108   0.099   0.095   0.095

Table 2: CNN performance on the Economics split across candidate set sizes. The k=100 model collapses catastrophically; the k=1000 model is more robust but loses its edge.

Visualizing the Collapse

CNN-Sim shows a 2× improvement over ColBERTv2 when reranking BM25's top-100, but catastrophically fails when the candidate set expands. The model trained on k=100 learned spurious correlations that don't generalize, and the same holds for the model trained on k=1000.

Pipeline Augmentation

I decided to change the pipeline: have ColBERT first rerank the documents given BM25's top 1000 (with gold documents infused), then have the CNN rerank ColBERT's rankings.
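In pseudocode, the augmented pipeline looked roughly like this (all object names and methods are illustrative placeholders, not a real API):

```python
def two_stage_rerank(query, bm25, colbert, cnn, gold_ids, k_first=1000, k_second=100):
    """Sketch of the augmented pipeline: BM25 retrieves its top-1000 (with gold
    documents infused), ColBERT reranks them, and the CNN rescores the
    similarity matrices of ColBERT's top-100."""
    candidates = list(dict.fromkeys(bm25.top_k(query, k=k_first) + gold_ids))
    colbert_top = sorted(candidates, key=lambda d: colbert.score(query, d),
                         reverse=True)[:k_second]
    return sorted(colbert_top,
                  key=lambda d: cnn.score(colbert.sim_matrix(query, d)),
                  reverse=True)
```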

Unfortunately, this proved unfruitful. The CNN models trained on k=100 didn't improve ColBERT's rankings. Initially, I believed recall might have been the issue, but using a stronger intermediary, such as Qwen3-Reranker-0.6B, didn't help either (Table 3).

Base Model    Method                 Econ    AoPS    LC      Robotics
ColBERTv2     CNN-Sim (Ours)         0.072   0.185   0.162   0.091
              ColBERTv2              0.095   0.156   0.283   0.077
              Recall %               38.1%   41.4%   48.4%   33.9%
Qwen3-0.6B    CNN-Sim (Ours)         0.102   0.232   0.198   0.181
              Qwen3-Reranker-0.6B    0.156   0.156   0.372   0.121
              Recall %               56.9%   42.9%   78.6%   54.4%

Table 3: Pipeline results: CNN reranking the base model's top-100 (ColBERTv2 or Qwen3-Reranker-0.6B). Recall is the percentage of gold documents the base model ranked in its top 100.

The Culprit: Distributional Shift

ColBERT's/Qwen's top-100 looks completely different from BM25's top-100 in terms of similarity matrix structure. ColBERT's/Qwen's "mistakes" are semantically plausible but wrong for reasoning, a totally different error pattern from BM25's lexical false positives. Essentially, the goal was to train a CNN to find patterns that translate to relevance, but in reality it learned patterns that merely track the degree of lexical matching. Within BM25's top 100, high lexical matches correlate with irrelevance, while the injected gold documents, by virtue of not being in the top 100, don't have high lexical matches. When expanding to the top 1000, the bottom 900 also lack high lexical matches, so whatever signal the CNN learned collapses: gold documents and the bottom 900 become indistinguishable.

Put differently, the CNN learned "flat, uniform matrices → relevant," capturing what I'd call "texture statistics" rather than semantic relevance. Within BM25's top 100, gold documents exhibit dense but structured similarity patterns, while BM25 negatives show sharp lexical spikes. The CNN found a texture shortcut separating these classes. However, at ranks outside the top 100, many documents have uniform similarity (vague semantic overlap) with low variance, exactly what the CNN learned to prefer.
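One way to probe this hypothesis (illustrative only; these are not features the CNN explicitly computes) is to summarize each matrix with a few crude "texture" statistics and compare gold documents, BM25 negatives, and deep-ranked documents:

```python
import numpy as np

def texture_summary(sim: np.ndarray) -> dict:
    """Crude texture statistics for a similarity matrix: flat, low-variance
    matrices vs. spiky, column-dominated ones."""
    winners = sim.argmax(axis=1)
    return {
        "variance": float(sim.var()),                      # overall flatness
        "row_max_std": float(sim.max(axis=1).std()),       # spread of per-row peaks
        "col_dominance": int(np.bincount(winners).max()),  # rows won by the top doc token
    }
```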

The Fundamental Limitation

This failure reveals a deeper issue. Similarity matrices capture which tokens align and how many align, but not how they logically connect. For reasoning-intensive retrieval, relevance depends on token interactions within the document, structure that is invisible to the query-document similarity matrix. A gold document and a plausible-but-wrong document may have identical similarity matrices, both flat and distributed. The difference lies in the reasoning structure: how evidence chains together to answer the query. That structure exists within the document, in how its content fits together to answer the query, but it isn't visible in a similarity matrix alone.

Conclusion

I tried tweaking the pipeline by training a CNN on ColBERT's top-100 instead of BM25's top-100, but this didn't lead to any improvements over using ColBERT alone. This tracks with the observations above: the similarity matrix is not the right medium for judging relevance visually, since it's hard to disambiguate gold documents from decent documents. Maybe there are other structures worth exploring?

I skipped giving implementation details for the sake of brevity, but if you have any ideas/questions or just want to chat about this stuff (or information retrieval generally), please feel free to email me!
