One thing it builds, another it destroys
Since the early 2000s, the Central Election Commission of Russia, the
ultimate arbiter of Russian elections, has published detailed
election results and related data down to full records from each polling
station. For most of the existence of the service, the only functional
output format was HTML, so researchers who studied the data (see,
e.g., Kobak et al. (2016), Enikolopov et al. (2013), and other
sources referenced in Shen’s living review) mostly had to resort to
scraping the website, which was never particularly pleasant.
Nevertheless, useful (if politically provocative) results were
obtained, and collections of more easily accessible datasets were
gradually amassed.
The winds first began to change after the 2018 gubernatorial elections.
Amid allegations of widespread fraud, original versions of election
records that had later been modified were discovered to be readily
available in the public system. A hasty frontend patch appears to have
been applied several days later to bar access to any and all addresses
containing the string `version`, though it could easily be bypassed
using standard alternate encoding techniques.
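If, as seems plausible, the filter matched the literal substring in the
raw request URL, percent-encoding a single character of it would have
been enough to slip past while still reaching the same resource once the
server decoded the path. A minimal sketch of the idea (the URL itself is
invented for illustration):

```python
from urllib.parse import unquote

blocked = "http://www.example.org/servlet/ex_report?type=version"  # invented URL
bypass = blocked.replace("version", "%76ersion")  # "%76" decodes to "v"

assert "version" not in bypass     # a naive substring filter lets this through
assert unquote(bypass) == blocked  # yet the server sees the very same request
```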
In December 2019, the second revision of the regulations governing
the publication of election data introduced subtly different wording
that excluded any mention of automated access, this being almost the
only change relative to the preceding version from 2010. A subpar but
nonetheless inconvenient mandatory CAPTCHA was imposed on all visitors
shortly afterwards and subsequently underwent several rounds of
relaxation and tightening as analysts cried foul. The next year saw
the introduction of IP-address-based rate limiting of about 100
requests per hour, invisible to normal visitors but thoroughly
thwarting any attempt at real-time large-scale downloads without the
use of proxies (obtaining the complete precinct-level data for a single
federal election requires fetching almost 3000 summary reports, and
gathering supplementary information such as early-voting numbers can
require visiting the pages of all 98000 precincts).
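To put the limit in perspective, here is a back-of-the-envelope
calculation based on the figures above (the limit itself is only
approximate):

```python
REQUESTS_PER_HOUR = 100   # approximate per-address rate limit

summary_reports = 3_000   # precinct-level results for one federal election
precinct_pages = 98_000   # one page per precinct for supplementary data

print(summary_reports / REQUESTS_PER_HOUR)       # 30.0 -- hours, from one address
print(precinct_pages / REQUESTS_PER_HOUR / 24)   # ~40.8 -- days, from one address
```

Hence the proxies: a single unassisted address would spend more than a
day on the summary reports alone.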
The third revision, put into effect shortly before the federal
parliamentary elections of 17–19 September 2021, contained a further
change of wording that excluded mentions of users being able to search
the data or copy it to their machines and formally required
“protection” from automated tools. The change was at first thought to
be a largely symbolic formalization of existing practice and a
declaration of intent until, on 19 September, it came to light
that the Commission had introduced a form of obfuscation into its web
pages.
In a graphical browser with JavaScript enabled, the results appeared
correctly, but attempting to copy them to the clipboard yielded
gibberish or, worse, wrong numeric values, even though no interception
of clipboard events was taking place. Direct inspection of the HTML
markup revealed different and even more mangled numbers, misplaced
table cells, garbage characters in alphabetic strings, and a soup of
seemingly meaningless nested `span` elements.
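As a taste of the general class of trick involved, consider decoy
elements that CSS hides in a browser but that a naive scraper cannot
tell apart from real content. The snippet below is an invented
illustration rather than markup taken from the actual pages:

```python
from lxml.html import fragment_fromstring

# Invented example: the class name and the digits are made up.
cell = fragment_fromstring(
    '<div>1<span class="x">7</span>2<span class="x">9</span>5</div>'
)
print(cell.text_content())  # a scraper sees "17295"
# A browser applying the (hypothetical) rule `.x { display: none }`
# renders "125" instead.
```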
This note describes the techniques used by the code on the page to
deobfuscate the mangled markup for the user’s consumption. It also
doubles as example Python code for performing that deobfuscation without
a full web browser. It was written by Alexander Shpilkin and
describes research current as of 21 September. Basic knowledge of web
technology is required, but the Python code can be ignored without harm.
The canonical distribution point for this document as of the present
version is on GitHub, which hosts the Markdown source, the
extracted Python code, and the web version generated by Docco.
All of this is freely redistributable and modifiable without legal
restrictions as per the Creative Commons CC0 1.0 public domain
dedication, although the author asks you to exercise your judgment
regarding the degree of dissemination in light of the Commission’s
hostile behaviour and to follow common-sense attribution practices.
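The code begins with the handful of standard-library and third-party
modules it relies on: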
from collections import namedtuple               # lightweight record types
from fontTools.ttLib import TTFont               # parsing font files
from io import BytesIO                           # in-memory binary streams
from lxml.html import document_fromstring        # parsing HTML documents
from lxml.etree import tostring                  # serializing markup back out
from re import finditer, compile as re_compile   # regular expressions
from requests import get                         # HTTP requests
from sys import stdin, stdout                    # standard input and output