╔═══════════════════════════════════════════════════════════╗
║ Leftmost-Longest-Common-Substring-Based Alignment with DP ║
╚═══════════════════════════════════════════════════════════╝

[2026-02-18@dgpb]

Problem Statement

Find all hierarchical longest common substring matches between two sequences.

Hierarchical matching:
1. Find the leftmost longest common substring (the "anchor")
2. Recursively find matches before and after the anchor
3. Repeat until no matches remain

Example:

seq1: "a_bb_ccc_a_ab_dddd"
seq2: "a+bb+dddd+ccc"

Result: [(0,0,1), (2,2,2), (14,5,4)]
  Match 1: 'a'    @ (0,0)  len=1
  Match 2: 'bb'   @ (2,2)  len=2
  Match 3: 'dddd' @ (14,5) len=4

ccc is unmatched because it conflicts with dddd, which, being the longest, takes priority.
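This hierarchical scheme is the one difflib.SequenceMatcher implements (Algorithm 1 below), so the example can be reproduced directly with the standard library:

```python
from difflib import SequenceMatcher

sm = SequenceMatcher(None, "a_bb_ccc_a_ab_dddd", "a+bb+dddd+ccc", autojunk=False)

# get_matching_blocks() returns the hierarchical matches plus a
# terminating (len1, len2, 0) sentinel, which is stripped here
blocks = [tuple(b) for b in sm.get_matching_blocks()[:-1]]
print(blocks)  # [(0, 0, 1), (2, 2, 2), (14, 5, 4)]
```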

This document discusses 3 algorithms for computing hierarchical longest common substring matches:

  1. FindOneBest (difflib): Pure recursion, one best match per DP
  2. FindBestLayer: Improved recursion, all best length matches per DP
  3. FindMultiLayer: DP with many groups of best length matches (this document)

All three produce identical results.
The focus is on FindMultiLayer, which reduces the recursion-depth multiplier to at most O(n^(1/4)) for small alphabets and O(1) for large alphabets.


Algorithm 1 - FindOneBest (difflib)

def find_one_best_recurse(seq1, seq2, offset1=0, offset2=0):
    if not seq1 or not seq2:
        return []
    
    pos1, pos2, length = find_longest_one_best(seq1, seq2)
    if length == 0:
        return []
    
    left = find_one_best_recurse(seq1[:pos1], seq2[:pos2], offset1, offset2)
    current = [(offset1 + pos1, offset2 + pos2, length)]
    right = find_one_best_recurse(
        seq1[pos1+length:], seq2[pos2+length:],
        offset1 + pos1 + length, offset2 + pos2 + length
    )
    
    return left + current + right

find_longest_one_best: Standard DP tracking one best match per i (leftmost in j).
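A minimal sketch of find_longest_one_best, assuming plain sequences and omitting difflib's b2j index and junk handling:

```python
def find_longest_one_best(seq1, seq2):
    """Leftmost longest common substring: returns (pos1, pos2, length)."""
    best_i = best_j = best_k = 0
    j2len = {}          # j2len[j] = length of the common run ending at seq2[j]
    for i, elt in enumerate(seq1):
        newj2len = {}
        for j, elt2 in enumerate(seq2):
            if elt2 != elt:
                continue
            k = newj2len[j] = j2len.get(j - 1, 0) + 1
            if k > best_k:
                # strict '>' keeps the leftmost match among equal lengths
                best_i, best_j, best_k = i - k + 1, j - k + 1, k
        j2len = newj2len
    return best_i, best_j, best_k
```

On the sequences from the example above, `find_longest_one_best("a_bb_ccc_a_ab_dddd", "a+bb+dddd+ccc")` returns the anchor `(14, 5, 4)`, which find_one_best_recurse then subdivides around.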

Complexity: O(n₁ × n₂ × depth)
- Best: O(n₁ × n₂) - single match
- Worst: O(n₁ × n₂ × min(n₁,n₂)) - unbalanced tree


Algorithm 2 - FindBestLayer

An improved recursive approach that finds all ordered, non-overlapping best-length matches.

An enhanced DP returns all leftmost, non-overlapping, ordered blocks of the best length in a single pass.
These are the same blocks Algorithm 1 would collect over as many DP passes as there are blocks of that length.

def find_best_layer(seq1, seq2):
    j2len = {}
    best_k = 0
    results = []
    last_i_pk = -1
    last_j_pk = -1
    i = start1
    while i < stop1:
        elt = seq1[i]
        p2 = seq2_positions[elt]
        new_best = False
        i_cur = -1
        j_cur = -1

        newj2len = {}
        j2lenpop = j2len.pop
        for j in p2:
            if OUT_OF_RANGE:
                continue

            k = j2lenpop(j - 1, 0) + 1
            newj2len[j] = k
            if k > best_k:
                best_k = k
                i_cur = i
                j_cur = j
                new_best = True
            elif i_cur == -1 and k == best_k:
                if i >= last_i_pk and j >= last_j_pk:
                    i_cur = i
                    j_cur = j

        if i_cur != -1:
            if new_best:
                results.clear()
            results.append((i_cur, j_cur))
            last_i_pk = i_cur + best_k
            last_j_pk = j_cur + best_k

        j2len = newj2len
        i += 1

Key difference: FindOneBest keeps only leftmost j; FindBestLayer keeps all j with max length, giving better subdivisions.

Complexity improvement: Eliminates the O(n) worst case by finding better subdivisions. However, the O(√n) worst case remains when many matches have different lengths (the algorithm must still recurse through the length hierarchy).
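Filling in the placeholders (full ranges, a precomputed seq2_positions index, no junk handling), the pseudocode above can be made runnable as a sketch:

```python
from collections import defaultdict

def find_best_layer(seq1, seq2):
    # One DP pass returning all ordered, non-overlapping matches of the
    # best length as (start1, start2, length), leftmost first.
    seq2_positions = defaultdict(list)
    for j, elt in enumerate(seq2):
        seq2_positions[elt].append(j)

    j2len = {}
    best_k = 0
    results = []                  # (i_end, j_end) pairs at length best_k
    last_i_pk = last_j_pk = -1
    for i, elt in enumerate(seq1):
        new_best = False
        i_cur = j_cur = -1
        newj2len = {}
        j2lenpop = j2len.pop
        for j in seq2_positions.get(elt, ()):
            k = j2lenpop(j - 1, 0) + 1
            newj2len[j] = k
            if k > best_k:
                best_k = k
                i_cur, j_cur = i, j
                new_best = True
            elif i_cur == -1 and k == best_k:
                # same length: keep only if ordered after the last match
                if i >= last_i_pk and j >= last_j_pk:
                    i_cur, j_cur = i, j
        if i_cur != -1:
            if new_best:
                results.clear()   # shorter matches are obsolete
            results.append((i_cur, j_cur))
            last_i_pk = i_cur + best_k
            last_j_pk = j_cur + best_k
        j2len = newj2len

    # convert end positions to (start1, start2, length)
    return [(i - best_k + 1, j - best_k + 1, best_k) for i, j in results]
```

For example, `find_best_layer("a_b_c", "a.b.c")` returns all three length-1 blocks in one pass, which is what removes the O(n) recursion depth of Algorithm 1.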


Algorithm 3 - FindMultiLayer

A maximal match is a match that cannot be extended in either direction.

Core idea:

  1. A single DP pass collects all maximal matches at all lengths.
  2. They are then layered in batches of equal length, from longest to shortest.
  3. Collection memory is limited to O(n); when the limit is hit, the batch of the shortest length is pruned.
  4. If only one length bucket exists when the memory limit is hit, the algorithm switches to FindBestLayer mode.

Why it works - two cases:

  1. Many same-length matches: the memory limit is hit, and FindBestLayer handles many same-length matches efficiently.
  2. Many different-length matches: the longest buckets are small (few long matches exist), so layering places a large number of matches in one pass.

Main DP Loop

# 1. Multi-Layer
collected = {}      # dict[length, list[tuple[i_end, j_end]]]
min_ck = 1          # Minimum collected match length

# Set memory limit; memult is the configurable memory multiplier
memleft = int((len(seq1) + len(seq2)) * memult)
if memleft < 100:
    memleft = 100
i = start1
while i < stop1:
    elt = seq1[i]
    p2 = seq2_positions[elt]

    # 1.1. DP loop
    newj2len = {}
    j2lenpop = j2len.pop
    for j in p2:
        if OUT_OF_RANGE:
            continue
        newj2len[j] = j2lenpop(j - 1, 0) + 1

    # 1.2. Collect maximal matches
    # j2len holds entries not extended (popped) in this iteration
    #   = maximal matches ending at i-1
    i_m1 = i - 1
    for j, k in j2len.items():
        if k >= min_ck:
            collected.setdefault(k, []).append((i_m1, j))
            memleft -= 1

    j2len = newj2len
    i += 1

    # 1.3. Trim if needed
    while memleft < 0:
        if len(collected) == 1:
            # Break to single layer
            break

        # Drop shortest length bucket
        # Matches at min_k length are discarded;
        # min_ck ensures future matches at this length are not collected
        min_k = min(collected)
        memleft += len(collected.pop(min_k))
        min_ck = min_k + 1
    else:
        # else runs only if inner while completed
        #   without break (memory still OK)
        continue

    # Break to single layer
    break

else:
    # Finished successfully
    # Append maximal matches from last iteration
    i_m1 = stop1 - 1
    for j, k in j2len.items():
        if k >= min_ck:
            collected.setdefault(k, []).append((i_m1, j))

    # Layer and return
    results = self.layer_maximal_matches(collected, min_ck)
    return results

# 2. Broke out - continue with single layer (Algorithm 2)
best_k = max(collected)
...

Layering Phase

  1. Start from the longest-length match group.
  2. Add its matches to the results, leftmost first and non-overlapping, as the original problem requires.
  3. Take the next-best length group and repeat, additionally checking whether each match can be trimmed and re-inserted.

A match is trimmed when it partially overlaps an already-placed result.
Its start or end is adjusted to fit in the adjacent gap, and it is re-queued at the reduced length.

The implementation has a few edge cases, but the logic is simple:

collected = {length: [match1, match2, ...], ...}
results = []
for k in sorted(collected, reverse=True):
    matches = collected.pop(k)
    for match in matches:
        if inserts_cleanly(match, results):
            insert(match, results)
        else:
            new_match, new_k = trim_match(match, results)
            if new_k >= min_valid_k:
                collected.setdefault(new_k, []).append(new_match)
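The trim rule can be sketched against a single placed block (a hypothetical, simplified trim_match; the real implementation searches results for the adjacent gap and handles all placed neighbours):

```python
def trim_match(match, placed):
    # Trim match = (i, j, k) so it no longer overlaps the placed block
    # (pi, pj, pk) in either sequence, keeping whichever end loses less.
    # Returns (i, j, k') with k' == 0 when nothing survives.
    i, j, k = match
    pi, pj, pk = placed
    # elements to cut from the front so the match starts after `placed`
    cut_front = max(pi + pk - i, pj + pk - j, 0)
    # elements to cut from the back so the match ends before `placed`
    cut_back = max(i + k - pi, j + k - pj, 0)
    if cut_back <= cut_front:
        cut = min(cut_back, k)      # keep the head, shorten the length
        return (i, j, k - cut)
    cut = min(cut_front, k)         # keep the tail, shift the start
    return (i + cut, j + cut, k - cut)
```

For instance, a length-4 match ending inside a placed length-4 block, `trim_match((3, 3, 4), (5, 5, 4))`, is trimmed to `(3, 3, 2)` and re-queued in the length-2 bucket.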

Complexity / Worst Case

Space: O(n₂ + M), where M = total maximal matches collected (bounded by memleft cap, which is O(n₁ + n₂) in practice).

The cap memleft = n₁ + n₂ can be lowered (e.g. to (n₁ + n₂) / 2) at the cost of pruning more length buckets earlier.
Quality degrades gracefully — little is lost as small lengths have disproportionately many cross-matches.
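That claim can be sanity-checked with a brute-force count of maximal matches per length (illustration only; the DP pass above collects the same matches in a single O(n₁ × n₂) sweep):

```python
from collections import Counter

def maximal_matches_by_length(seq1, seq2):
    # Count maximal matches (extendable in neither direction) per length.
    counts = Counter()
    n1, n2 = len(seq1), len(seq2)
    for i in range(n1):
        for j in range(n2):
            if seq1[i] != seq2[j]:
                continue
            if i and j and seq1[i - 1] == seq2[j - 1]:
                continue              # extends left: not a maximal start
            k = 0
            while i + k < n1 and j + k < n2 and seq1[i + k] == seq2[j + k]:
                k += 1
            counts[k] += 1
    return dict(counts)

# short lengths dominate the collection
print(maximal_matches_by_length("a_aa", "a.aa"))   # {1: 7, 2: 1}
```

Even on this tiny pair, seven of the eight maximal matches have length 1, so pruning the shortest bucket first sacrifices the least useful matches.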

DP Time: O(n₁ × n₂)
Layering Time: O(M)
Recursion-depth multiplier: O(1) for large alphabets; small-alphabet worst case O(n^(1/4)), observed empirically on:

seq1 = 'a_aa_aaa_aaaa_aaaaa_aaaaaa_' ...
seq2 = 'a.aa.aaa.aaaa.aaaaa.aaaaaa.' ...

Performance Benchmarks

Performance across different sequence lengths (ms, CPython 3.13.4, macOS 11.7.11):

All benchmarks are produced with autojunk=False.

Assumption: the DP for the large-alphabet cases is treated as running in O(n).
In reality these cases are not purely O(n), since blocks are made up of many repeated characters;
it is only different blocks that are made up of different characters.


n = 750

Case (ms)                   FindOneBest  FindBestLayer  FindMultiLayer
rng01                                91            117              58
aaa / aaa                           100            118              97
a_a_a / a.a.a                     3,032             32              24
a_aa_aaa / a.aa.aaa                 671            919             105
aaa_aa_a / aaa.aa.a                 671            694              99
abc / abc                             1              1               1
a_b_c / a.b.c                        46              2               1
a_bb_ccc / a.bb.ccc                  40             47               5
aaa_bb_c / aaa.bb.c                  42             49               5
ccc_bb_a / a.bb.ccc                   4              5               4
a.bb.ccc / ccc_bb_a                   4              5               4
global-block-shuffle-LA               6              8               4
local-block-shuffle-LA               13             16               5
parallel-block-shuffle-LA            18             16               4
diff-simulation                      32             31              11

n = 1500

Case (ms)                   FindOneBest  FindBestLayer  FindMultiLayer
rng01                               323            451             223
aaa / aaa                           417            479             429
a_a_a / a.a.a                    22,647            126             112
a_aa_aaa / a.aa.aaa               3,703          5,055             472
aaa_aa_a / aaa.aa.a               3,826          3,852             509
abc / abc                             2              2               2
a_b_c / a.b.c                       180              5               3
a_bb_ccc / a.bb.ccc                 150            184              11
aaa_bb_c / aaa.bb.c                 153            196              13
ccc_bb_a / a.bb.ccc                  12             16              13
a.bb.ccc / ccc_bb_a                  11             14              11
global-block-shuffle-LA              23             28              14
local-block-shuffle-LA               33             40              14
parallel-block-shuffle-LA            42             51              14
diff-simulation                     110            103              35

n = 10,000

Case (ms)                   FindOneBest  FindBestLayer  FindMultiLayer
abc / abc                            18             16              12
a_b_c / a.b.c                       ---            ---              18
a_bb_ccc / a.bb.ccc                 ---            ---             162
aaa_bb_c / aaa.bb.c                 ---            ---             168
ccc_bb_a / a.bb.ccc                 183            306             173
a.bb.ccc / ccc_bb_a                 174            230             175
global-block-shuffle-LA             398            474             170
local-block-shuffle-LA            1,009          1,115             180
parallel-block-shuffle-LA         1,076          1,194             164
diff-simulation                   1,750          1,856             536

Small Alphabet Cases

Random Binary — rng01

All algorithms ~O(n²) with similar constants.

Single Match — aaa / aaa

All algorithms ~O(n²) with similar constants.

Pathological O(n) Recursion — a_a_a / a.a.a

FindOneBest: O(n²) × O(n) = O(n³) due to O(n) recursion depth → 22,647ms at n=1500.
FindBestLayer/FindMultiLayer: O(n²) — eliminates depth multiplier entirely.

Pathological O(√n) Recursion — a_aa_aaa / a.aa.aaa and aaa_aa_a / aaa.aa.a

FindOneBest/FindBestLayer: O(n²) × O(√n) due to O(√n) recursion depth.
FindMultiLayer: O(n²) × O(n^(1/4)) - worst case.


Large Alphabet Cases

Single Match — abc / abc

Trivial, one match.
All algorithms O(n).

Pathological O(n) Recursion — a_b_c / a.b.c

FindOneBest: O(n) × O(n) = O(n²) — n matches of length 1, each triggers O(n) DP.
FindBestLayer/FindMultiLayer: O(n) — single pass finds all matches.

Pathological O(√n) Recursion — a_bb_ccc / a.bb.ccc and aaa_bb_c / aaa.bb.c

FindOneBest/FindBestLayer: O(DP) × O(√n).
FindMultiLayer: O(DP) — eliminates depth multiplier.

Single Match — ccc_bb_a and a.bb.ccc variants

Single match invalidating all others.
All algorithms ~O(DP) × O(1).


Block Cases (Large Alpha)

global/local/parallel-block-shuffle

All algorithms ~O(n), small absolute times (measurement noise dominates).
FindMultiLayer competitive throughout.

diff-simulation

1 block of largest size, 2 blocks of 2nd largest, 3 blocks of the next, etc...

All shuffled in parallel with single character separators between them.

(Real life diffs are simpler.)

FindMultiLayer is 3x faster than FindOneBest and FindBestLayer.


Memory Benchmarks (n = 500)

Case (kB)                   FindOneBest  FindBestLayer  FindMultiLayer  FindMultiLayer,
                                                                        mem=(n1+n2)//8
rng01                                43             49             116              63
aaa / aaa                            78             79             198             109
a_a_a / a.a.a                        75             83             135              92
a_aa_aaa / a.aa.aaa                  69             77             153             111
aaa_aa_a / aaa.aa.a                  69             74             149              89
abc / abc                            85             86              86              88
a_b_c / a.b.c                        90             95             123             110
a_bb_ccc / a.bb.ccc                  30             32              91              39
aaa_bb_c / aaa.bb.c                  31             40              88              44
ccc_bb_a / a.bb.ccc                  30             29              88              35
a.bb.ccc / ccc_bb_a                  30             30              90              34
global-block-shuffle-LA              30             30              89              36
local-block-shuffle-LA               30             30              88              39
parallel-block-shuffle-LA            30             31              89              40
diff-simulation                      46             41              99              53
  1. FindOneBest worst case is 90 kB.
  2. FindBestLayer worst case is 95 kB.
  3. FindMultiLayer worst case is 198 kB, or 111 kB with the (n1 + n2) // 8 limit.

All methods have O(n) memory complexity.


Summary

                 Complexity   Small-Alpha Recursion (*)  Large-Alpha Recursion (*)
FindOneBest      very simple  O(√n) - O(n)               O(√n) - O(n)
FindBestLayer    simple       some O(√n)                 some O(√n)
FindMultiLayer   moderate     rare O(n^(1/4))            mostly O(1)

(*) Beyond O(n log n).

  1. For small alphabets, FindMultiLayer's worst-case recursion multiplier is O(n^(1/4)).
  2. For large alphabets, FindMultiLayer eliminates recursion entirely.
  3. The implementation complexity of FindMultiLayer is reasonable for the improvement it delivers.

Even with autojunk=True, FindMultiLayer is still left with a total O(n²) worst case.

The worst case can be constructed as follows:

  1. Ensure the maximal number of element pairs
  2. Ensure that no element hits the autojunk threshold
  3. Construct as many blocks of different lengths as possible

# N is the target sequence length
each = max(1, N // 100 - 10)
b = N // each + 1
chars = ''.join([each * chr(i + 100) for i in range(b)])
blocks = []
i = 1
s = j = 0
while s + i + 1 < N:
    blocks.append(chars[j:j+i])
    j += i
    s += i + 1
    i += 1
seq1 = '-'.join(blocks)
seq2 = '.'.join(blocks)

Runtimes for sequences constructed with code above:

             FindOneBest   FindMultiLayer   FindMultiLayer
                           mem=(n1+n2)      mem=(n1+n2)//8
size 10K          10.6 s           0.15 s            0.2 s
size 15K          24.1 s           0.4 s             0.4 s
size 20K          60.5 s           0.6 s             0.8 s
size 50K         ~ 13 m            4.4 s             6.6 s
size 100K        ~ 1.5 h           20 s              31 s

difflib.SequenceMatcher with autojunk=True is used in many applications that occasionally hit pathological cases.

FindMultiLayer offers a non-trivial improvement that would remove most of these issues.