╔═══════════════════════════════════════════════════════════╗
║ Leftmost-Longest-Common-Substring-Based Alignment with DP ║
╚═══════════════════════════════════════════════════════════╝

[2026-02-18@dgpb]

Problem Statement

Find all hierarchical longest common substring matches between two sequences.

Hierarchical matching:
1. Find the leftmost longest common substring (the "anchor")
2. Recursively find matches before and after the anchor
3. Repeat until no matches remain

Example:

seq1: "a_bb_ccc_a_ab_dddd"
seq2: "a+bb+dddd+ccc"

Result: [(0,0,1), (2,2,2), (14,5,4)]
  Match 1: 'a'    @ (0,0)  len=1
  Match 2: 'bb'   @ (2,2)  len=2
  Match 3: 'dddd' @ (14,5) len=4

ccc is unmatched because it conflicts with dddd, which, being the longest, takes priority.
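This hierarchical scheme is the one difflib.SequenceMatcher implements (Algorithm 1 below), so the example can be reproduced directly with the standard library:

```python
from difflib import SequenceMatcher

sm = SequenceMatcher(None, "a_bb_ccc_a_ab_dddd", "a+bb+dddd+ccc", autojunk=False)

# get_matching_blocks() returns the hierarchical matches plus a
# terminating (len1, len2, 0) sentinel, which is stripped here
blocks = [tuple(b) for b in sm.get_matching_blocks()[:-1]]
print(blocks)  # [(0, 0, 1), (2, 2, 2), (14, 5, 4)]
```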

This document discusses 3 algorithms for computing hierarchical longest common substring matches:

  1. FindOneBest (difflib): Pure recursion, one best match per DP
  2. FindBestLayer: Improved recursion, all best length matches per DP
  3. FindMultiLayer: DP with many groups of best length matches (this document)

All three produce identical results.
The focus is on FindMultiLayer, which reduces the recursion-depth multiplier to at most O(n^(1/4)) for small alphabets and O(1) for large alphabets.


Algorithm 1 - FindOneBest (difflib)

def find_one_best_recurse(seq1, seq2, offset1=0, offset2=0):
    if not seq1 or not seq2:
        return []
    
    pos1, pos2, length = find_longest_one_best(seq1, seq2)
    if length == 0:
        return []
    
    left = find_one_best_recurse(seq1[:pos1], seq2[:pos2], offset1, offset2)
    current = [(offset1 + pos1, offset2 + pos2, length)]
    right = find_one_best_recurse(
        seq1[pos1+length:], seq2[pos2+length:],
        offset1 + pos1 + length, offset2 + pos2 + length
    )
    
    return left + current + right

find_longest_one_best: Standard DP tracking one best match per i (leftmost in j).
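A minimal sketch of find_longest_one_best, assuming plain sequences and omitting difflib's b2j index and junk handling:

```python
def find_longest_one_best(seq1, seq2):
    """Leftmost longest common substring: returns (pos1, pos2, length)."""
    best_i = best_j = best_k = 0
    j2len = {}          # j2len[j] = length of the common run ending at seq2[j]
    for i, elt in enumerate(seq1):
        newj2len = {}
        for j, elt2 in enumerate(seq2):
            if elt2 != elt:
                continue
            k = newj2len[j] = j2len.get(j - 1, 0) + 1
            if k > best_k:
                # strict '>' keeps the leftmost match among equal lengths
                best_i, best_j, best_k = i - k + 1, j - k + 1, k
        j2len = newj2len
    return best_i, best_j, best_k
```

On the sequences from the example above, `find_longest_one_best("a_bb_ccc_a_ab_dddd", "a+bb+dddd+ccc")` returns the anchor `(14, 5, 4)`, which find_one_best_recurse then subdivides around.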

Complexity: O(n₁ × n₂ × depth)
- Best: O(n₁ × n₂) - single match
- Worst: O(n₁ × n₂ × min(n₁,n₂)) - unbalanced tree


Algorithm 2 - FindBestLayer

An improved recursive approach that finds all ordered, non-overlapping best-length matches.

An enhanced DP returns all leftmost, non-overlapping, ordered blocks of the best length in a single pass.
These are the same blocks Algorithm 1 would collect over as many DP passes as there are blocks of that length.

def find_best_layer(seq1, seq2):
    j2len = {}
    best_k = 0
    results = []
    last_i_pk = -1
    last_j_pk = -1
    i = start1
    while i < stop1:
        elt = seq1[i]
        p2 = seq2_positions[elt]
        new_best = False
        i_cur = -1
        j_cur = -1

        newj2len = {}
        j2lenpop = j2len.pop
        for j in p2:
            if OUT_OF_RANGE:
                continue

            k = j2lenpop(j - 1, 0) + 1
            newj2len[j] = k
            if k > best_k:
                best_k = k
                i_cur = i
                j_cur = j
                new_best = True
            elif i_cur == -1 and k == best_k:
                if i >= last_i_pk and j >= last_j_pk:
                    i_cur = i
                    j_cur = j

        if i_cur != -1:
            if new_best:
                results.clear()
            results.append((i_cur, j_cur))
            last_i_pk = i_cur + best_k
            last_j_pk = j_cur + best_k

        j2len = newj2len
        i += 1

Key difference: FindOneBest keeps only leftmost j; FindBestLayer keeps all j with max length, giving better subdivisions.

Complexity improvement: Eliminates the O(n) worst case by finding better subdivisions. However, the O(√n) worst case remains when many matches have different lengths (the algorithm must still recurse through the length hierarchy).
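Filling in the placeholders (full ranges, a precomputed seq2_positions index, no junk handling), the pseudocode above can be made runnable as a sketch:

```python
from collections import defaultdict

def find_best_layer(seq1, seq2):
    # One DP pass returning all ordered, non-overlapping matches of the
    # best length as (start1, start2, length), leftmost first.
    seq2_positions = defaultdict(list)
    for j, elt in enumerate(seq2):
        seq2_positions[elt].append(j)

    j2len = {}
    best_k = 0
    results = []                  # (i_end, j_end) pairs at length best_k
    last_i_pk = last_j_pk = -1
    for i, elt in enumerate(seq1):
        new_best = False
        i_cur = j_cur = -1
        newj2len = {}
        j2lenpop = j2len.pop
        for j in seq2_positions.get(elt, ()):
            k = j2lenpop(j - 1, 0) + 1
            newj2len[j] = k
            if k > best_k:
                best_k = k
                i_cur, j_cur = i, j
                new_best = True
            elif i_cur == -1 and k == best_k:
                # same length: keep only if ordered after the last match
                if i >= last_i_pk and j >= last_j_pk:
                    i_cur, j_cur = i, j
        if i_cur != -1:
            if new_best:
                results.clear()   # shorter matches are obsolete
            results.append((i_cur, j_cur))
            last_i_pk = i_cur + best_k
            last_j_pk = j_cur + best_k
        j2len = newj2len

    # convert end positions to (start1, start2, length)
    return [(i - best_k + 1, j - best_k + 1, best_k) for i, j in results]
```

For example, `find_best_layer("a_b_c", "a.b.c")` returns all three length-1 blocks in one pass, which is what removes the O(n) recursion depth of Algorithm 1.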


Algorithm 3 - FindMultiLayer

A maximal match is a match that cannot be extended in either direction.

Core idea:

  1. A single DP pass collects all maximal matches at all lengths.
  2. They are then layered in batches of equal length, from longest to shortest.
  3. Collection memory is limited to O(n); when the limit is hit, the batch of the shortest length is pruned.
  4. If only one length bucket exists when the memory limit is hit, the algorithm switches to FindBestLayer mode.

Why it works - two cases:

  1. Many same-length matches: the memory limit is hit, and FindBestLayer handles many same-length matches efficiently.
  2. Many different-length matches: the longest buckets are small (few long matches exist), so layering places a large number of matches in one pass.

Main DP Loop

# 1. Multi-Layer
collected = {}      # dict[length, list[tuple[i_end, j_end]]]
min_ck = 1          # Minimum collected match length

# Set memory limit; memult is the configurable memory multiplier
memleft = int((len(seq1) + len(seq2)) * memult)
if memleft < 100:
    memleft = 100
i = start1
while i < stop1:
    elt = seq1[i]
    p2 = seq2_positions[elt]

    # 1.1. DP loop
    newj2len = {}
    j2lenpop = j2len.pop
    for j in p2:
        if OUT_OF_RANGE:
            continue
        newj2len[j] = j2lenpop(j - 1, 0) + 1

    # 1.2. Collect maximal matches
    # j2len holds entries not extended (popped) in this iteration
    #   = maximal matches ending at i-1
    i_m1 = i - 1
    for j, k in j2len.items():
        if k >= min_ck:
            collected.setdefault(k, []).append((i_m1, j))
            memleft -= 1

    j2len = newj2len
    i += 1

    # 1.3. Trim if needed
    while memleft < 0:
        if len(collected) == 1:
            # Break to single layer
            break

        # Drop shortest length bucket
        # Matches at min_k length are discarded;
        # min_ck ensures future matches at this length are not collected
        min_k = min(collected)
        memleft += len(collected.pop(min_k))
        min_ck = min_k + 1
    else:
        # else runs only if inner while completed
        #   without break (memory still OK)
        continue

    # Break to single layer
    break

else:
    # Finished successfully
    # Append maximal matches from last iteration
    i_m1 = stop1 - 1
    for j, k in j2len.items():
        if k >= min_ck:
            collected.setdefault(k, []).append((i_m1, j))

    # Layer and return
    results = self.layer_maximal_matches(collected, min_ck)
    return results

# 2. Broke out - continue with single layer (Algorithm 2)
best_k = max(collected)
...

Layering Phase

  1. Start from the longest-length match group.
  2. Add its matches to the results, leftmost first and non-overlapping, as the original problem requires.
  3. Take the next-best length group and repeat, additionally checking whether each match can be trimmed and re-inserted.

A match is trimmed when it partially overlaps an already-placed result.
Its start or end is adjusted to fit in the adjacent gap, and it is re-queued at the reduced length.

The implementation has a few edge cases, but the logic is simple:

collected = {length: [match1, match2, ...], ...}
results = []
for k in sorted(collected, reverse=True):
    matches = collected.pop(k)
    for match in matches:
        if inserts_cleanly(match, results):
            insert(match, results)
        else:
            new_match, new_k = trim_match(match, results)
            if new_k >= min_valid_k:
                collected.setdefault(new_k, []).append(new_match)
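The trim rule can be sketched against a single placed block (a hypothetical, simplified trim_match; the real implementation searches results for the adjacent gap and handles all placed neighbours):

```python
def trim_match(match, placed):
    # Trim match = (i, j, k) so it no longer overlaps the placed block
    # (pi, pj, pk) in either sequence, keeping whichever end loses less.
    # Returns (i, j, k') with k' == 0 when nothing survives.
    i, j, k = match
    pi, pj, pk = placed
    # elements to cut from the front so the match starts after `placed`
    cut_front = max(pi + pk - i, pj + pk - j, 0)
    # elements to cut from the back so the match ends before `placed`
    cut_back = max(i + k - pi, j + k - pj, 0)
    if cut_back <= cut_front:
        cut = min(cut_back, k)      # keep the head, shorten the length
        return (i, j, k - cut)
    cut = min(cut_front, k)         # keep the tail, shift the start
    return (i + cut, j + cut, k - cut)
```

For instance, a length-4 match ending inside a placed length-4 block, `trim_match((3, 3, 4), (5, 5, 4))`, is trimmed to `(3, 3, 2)` and re-queued in the length-2 bucket.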

Complexity / Worst Case

Space: O(n₂ + M), where M = total maximal matches collected (bounded by memleft cap, which is O(n₁ + n₂) in practice).

The cap memleft = n₁ + n₂ can be lowered (e.g. to (n₁ + n₂) / 2) at the cost of pruning more length buckets earlier.
Quality degrades gracefully — little is lost as small lengths have disproportionately many cross-matches.
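That claim can be sanity-checked with a brute-force count of maximal matches per length (illustration only; the DP pass above collects the same matches in a single O(n₁ × n₂) sweep):

```python
from collections import Counter

def maximal_matches_by_length(seq1, seq2):
    # Count maximal matches (extendable in neither direction) per length.
    counts = Counter()
    n1, n2 = len(seq1), len(seq2)
    for i in range(n1):
        for j in range(n2):
            if seq1[i] != seq2[j]:
                continue
            if i and j and seq1[i - 1] == seq2[j - 1]:
                continue              # extends left: not a maximal start
            k = 0
            while i + k < n1 and j + k < n2 and seq1[i + k] == seq2[j + k]:
                k += 1
            counts[k] += 1
    return dict(counts)

# short lengths dominate the collection
print(maximal_matches_by_length("a_aa", "a.aa"))   # {1: 7, 2: 1}
```

Even on this tiny pair, seven of the eight maximal matches have length 1, so pruning the shortest bucket first sacrifices the least useful matches.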

DP Time: O(n₁ × n₂)
Layering Time: O(M)
Recursion-depth multiplier: O(1) for large alphabets; small-alphabet worst case O(n^(1/4)), observed empirically on:

seq1 = 'a_aa_aaa_aaaa_aaaaa_aaaaaa_' ...
seq2 = 'a.aa.aaa.aaaa.aaaaa.aaaaaa.' ...

Performance Benchmarks

Performance across different sequence lengths (ms, CPython 3.13.4, macOS 11.7.11):

All benchmarks are produced with autojunk=False.

Assumption: the DP for the large-alphabet cases is treated as running in O(n).
In reality these cases are not purely O(n), since blocks are made up of many repeated characters;
it is only different blocks that are made up of different characters.


n = 750

Case (ms)                   FindOneBest  FindBestLayer  FindMultiLayer
rng01                                91            117              58
aaa / aaa                           100            118              97
a_a_a / a.a.a                     3,032             32              24
a_aa_aaa / a.aa.aaa                 671            919             105
aaa_aa_a / aaa.aa.a                 671            694              99
abc / abc                             1              1               1
a_b_c / a.b.c                        46              2               1
a_bb_ccc / a.bb.ccc                  40             47               5
aaa_bb_c / aaa.bb.c                  42             49               5
ccc_bb_a / a.bb.ccc                   4              5               4
a.bb.ccc / ccc_bb_a                   4              5               4
global-block-shuffle-LA               6              8               4
local-block-shuffle-LA               13             16               5
parallel-block-shuffle-LA            18             16               4
diff-simulation                      32             31              11

n = 1500

Case (ms)                   FindOneBest  FindBestLayer  FindMultiLayer
rng01                               323            451             223
aaa / aaa                           417            479             429
a_a_a / a.a.a                    22,647            126             112
a_aa_aaa / a.aa.aaa               3,703          5,055             472
aaa_aa_a / aaa.aa.a               3,826          3,852             509
abc / abc                             2              2               2
a_b_c / a.b.c                       180              5               3
a_bb_ccc / a.bb.ccc                 150            184              11
aaa_bb_c / aaa.bb.c                 153            196              13
ccc_bb_a / a.bb.ccc                  12             16              13
a.bb.ccc / ccc_bb_a                  11             14              11
global-block-shuffle-LA              23             28              14
local-block-shuffle-LA               33             40              14
parallel-block-shuffle-LA            42             51              14
diff-simulation                     110            103              35

n = 10,000

Case (ms)                   FindOneBest  FindBestLayer  FindMultiLayer
abc / abc                            18             16              12
a_b_c / a.b.c                       ---            ---              18
a_bb_ccc / a.bb.ccc                 ---            ---             162
aaa_bb_c / aaa.bb.c                 ---            ---             168
ccc_bb_a / a.bb.ccc                 183            306             173
a.bb.ccc / ccc_bb_a                 174            230             175
global-block-shuffle-LA             398            474             170
local-block-shuffle-LA            1,009          1,115             180
parallel-block-shuffle-LA         1,076          1,194             164
diff-simulation                   1,750          1,856             536

Small Alphabet Cases

Random Binary — rng01

All algorithms ~O(n²) with similar constants.

Single Match — aaa / aaa

All algorithms ~O(n²) with similar constants.

Pathological O(n) Recursion — a_a_a / a.a.a

FindOneBest: O(n²) × O(n) = O(n³) due to O(n) recursion depth → 22,647ms at n=1500.
FindBestLayer/FindMultiLayer: O(n²) — eliminates depth multiplier entirely.

Pathological O(√n) Recursion — a_aa_aaa / a.aa.aaa and aaa_aa_a / aaa.aa.a

FindOneBest/FindBestLayer: O(n²) × O(√n) due to O(√n) recursion depth.
FindMultiLayer: O(n²) × O(n^(1/4)) - worst case.


Large Alphabet Cases

Single Match — abc / abc

Trivial, one match.
All algorithms O(n).

Pathological O(n) Recursion — a_b_c / a.b.c

FindOneBest: O(n) × O(n) = O(n²) — n matches of length 1, each triggers O(n) DP.
FindBestLayer/FindMultiLayer: O(n) — single pass finds all matches.

Pathological O(√n) Recursion — a_bb_ccc / a.bb.ccc and aaa_bb_c / aaa.bb.c

FindOneBest/FindBestLayer: O(DP) × O(√n).
FindMultiLayer: O(DP) — eliminates depth multiplier.

Single Match — ccc_bb_a and a.bb.ccc variants

Single match invalidating all others.
All algorithms ~O(DP) × O(1).


Block Cases (Large Alpha)

global/local/parallel-block-shuffle

All algorithms ~O(n), small absolute times (measurement noise dominates).
FindMultiLayer competitive throughout.

diff-simulation

1 block of largest size, 2 blocks of 2nd largest, 3 blocks of the next, etc...

All shuffled in parallel with single character separators between them.

(Real life diffs are simpler.)

FindMultiLayer is 3x faster than FindOneBest and FindBestLayer.


Memory Benchmarks (n = 500)

Case (kB)                   FindOneBest  FindBestLayer  FindMultiLayer  FindMultiLayer,
                                                                        mem=(n1+n2)//8
rng01                                43             49             116              63
aaa / aaa                            78             79             198             109
a_a_a / a.a.a                        75             83             135              92
a_aa_aaa / a.aa.aaa                  69             77             153             111
aaa_aa_a / aaa.aa.a                  69             74             149              89
abc / abc                            85             86              86              88
a_b_c / a.b.c                        90             95             123             110
a_bb_ccc / a.bb.ccc                  30             32              91              39
aaa_bb_c / aaa.bb.c                  31             40              88              44
ccc_bb_a / a.bb.ccc                  30             29              88              35
a.bb.ccc / ccc_bb_a                  30             30              90              34
global-block-shuffle-LA              30             30              89              36
local-block-shuffle-LA               30             30              88              39
parallel-block-shuffle-LA            30             31              89              40
diff-simulation                      46             41              99              53
  1. FindOneBest worst case is 90 kB.
  2. FindBestLayer worst case is 95 kB.
  3. FindMultiLayer worst case is 198 kB, or 111 kB with the (n1 + n2) // 8 limit.

All methods have O(n) memory complexity.


Summary

                 Complexity   Small-Alpha Recursion (*)  Large-Alpha Recursion (*)
FindOneBest      very simple  O(√n) - O(n)               O(√n) - O(n)
FindBestLayer    simple       some O(√n)                 some O(√n)
FindMultiLayer   moderate     rare O(n^(1/4))            mostly O(1)

(*) Beyond O(n log n).

  1. For small alphabets, FindMultiLayer's worst-case recursion multiplier is O(n^(1/4)).
  2. For large alphabets, FindMultiLayer eliminates recursion entirely.
  3. The implementation complexity of FindMultiLayer is reasonable for the improvement it delivers.

Even with autojunk=True, FindMultiLayer is still left with a total O(n²) worst case.

The worst case can be constructed as follows:

  1. Ensure the maximal number of element pairs
  2. Ensure that no element hits the autojunk threshold
  3. Construct as many blocks of different lengths as possible

# N is the target sequence length
each = max(1, N // 100 - 10)
b = N // each + 1
chars = ''.join([each * chr(i + 100) for i in range(b)])
blocks = []
i = 1
s = j = 0
while s + i + 1 < N:
    blocks.append(chars[j:j+i])
    j += i
    s += i + 1
    i += 1
seq1 = '-'.join(blocks)
seq2 = '.'.join(blocks)

Runtimes for sequences constructed with code above:

             FindOneBest   FindMultiLayer   FindMultiLayer
                           mem=(n1+n2)      mem=(n1+n2)//8
size 10K          10.6 s           0.15 s            0.2 s
size 15K          24.1 s           0.4 s             0.4 s
size 20K          60.5 s           0.6 s             0.8 s
size 50K         ~ 13 m            4.4 s             6.6 s
size 100K        ~ 1.5 h           20 s              31 s

difflib.SequenceMatcher with autojunk=True is used in many applications that occasionally hit pathological cases.

FindMultiLayer offers a non-trivial improvement that would remove most of these issues.