One thing it builds, another it destroys
Since the early 2000s, the Central Election Commission of Russia, the
ultimate arbiter of Russian elections, has published detailed
election results and related data down to full records from each polling
station. For most of the existence of the service, the only functional
output format was HTML, so researchers who studied the data (see,
e.g., Kobak et al. (2016), Enikolopov et al. (2013), and other
sources referenced in Shen’s living review) mostly had to resort to
scraping the website, which was never particularly pleasant.
Nevertheless, useful (if politically provocative) results were
obtained, and collections of more easily accessible datasets were
gradually amassed.
The winds first began to change after the 2018 gubernatorial elections.
Amid allegations of widespread fraud, original versions of election
records that had later been modified were discovered to be readily
available in the public system. A hasty frontend patch appears to have
been applied several days later to bar access to any and all addresses
containing the string `version`, though it could easily be bypassed
using standard alternate encoding techniques.
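If, as seems plausible, the filter matched the literal substring in the
raw request URL, percent-encoding a single character of it would have
been enough to slip past while still reaching the same resource once the
server decoded the path. A minimal sketch of the idea (the URL itself is
invented for illustration):

```python
from urllib.parse import unquote

blocked = "http://www.example.org/servlet/ex_report?type=version"  # invented URL
bypass = blocked.replace("version", "%76ersion")  # "%76" decodes to "v"

assert "version" not in bypass     # a naive substring filter lets this through
assert unquote(bypass) == blocked  # yet the server sees the very same request
```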
In December 2019, the second revision of the regulations governing
the publication of election data introduced subtly different wording
that excluded any mention of automated access, this being almost the
only change relative to the preceding version from 2010. A subpar but
nonetheless inconvenient mandatory CAPTCHA was imposed on all visitors
shortly afterwards and subsequently underwent several rounds of
relaxation and tightening as analysts cried foul. The next year saw
the introduction of IP-address-based rate limiting of about 100
requests per hour, invisible to normal visitors but thoroughly
thwarting any attempt at real-time large-scale downloads without the
use of proxies (obtaining the complete precinct-level data for a single
federal election requires fetching almost 3000 summary reports, and
gathering supplementary information such as early-voting numbers can
require visiting the pages of all 98000 precincts).
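To put the limit in perspective, here is a back-of-the-envelope
calculation based on the figures above (the limit itself is only
approximate):

```python
REQUESTS_PER_HOUR = 100   # approximate per-address rate limit

summary_reports = 3_000   # precinct-level results for one federal election
precinct_pages = 98_000   # one page per precinct for supplementary data

print(summary_reports / REQUESTS_PER_HOUR)       # 30.0 -- hours, from one address
print(precinct_pages / REQUESTS_PER_HOUR / 24)   # ~40.8 -- days, from one address
```

Hence the proxies: a single unassisted address would spend more than a
day on the summary reports alone.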
The third revision, put into effect shortly before the federal
parliamentary elections of 17–19 September 2021, contained a further
change of wording that excluded mentions of users being able to search
the data or copy it to their machines and formally required
“protection” from automated tools. The change was at first thought to
be a largely symbolic formalization of existing practice and a
declaration of intent until, on 19 September, it came to light
that the Commission had introduced a form of obfuscation into its web
pages.
In a graphical browser with JavaScript enabled, the results appeared
correctly, but attempting to copy them to the clipboard yielded
gibberish or, worse, wrong numeric values, even though no interception
of clipboard events was taking place. Direct inspection of the HTML
markup revealed different and even more mangled numbers, misplaced
table cells, garbage characters in alphabetic strings, and a soup of
seemingly meaningless nested `span` elements.
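As a taste of the general class of trick involved, consider decoy
elements that CSS hides in a browser but that a naive scraper cannot
tell apart from real content. The snippet below is an invented
illustration rather than markup taken from the actual pages:

```python
from lxml.html import fragment_fromstring

# Invented example: the class name and the digits are made up.
cell = fragment_fromstring(
    '<div>1<span class="x">7</span>2<span class="x">9</span>5</div>'
)
print(cell.text_content())  # a scraper sees "17295"
# A browser applying the (hypothetical) rule `.x { display: none }`
# renders "125" instead.
```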
This note describes the techniques used by the code on the page to
deobfuscate the mangled markup for the user’s consumption. It also
doubles as example Python code for performing that deobfuscation without
a full web browser. It was written by Alexander Shpilkin and
describes research current as of 21 September. Basic knowledge of web
technology is required, but the Python code can be ignored without harm.
The canonical distribution point for this document as of the present
version is on GitHub, which hosts the Markdown source, the
extracted Python code, and the web version generated by Docco.
All of this is freely redistributable and modifiable without legal
restrictions as per the Creative Commons CC0 1.0 public domain
dedication, although the author asks you to exercise your judgment
regarding the degree of dissemination in light of the Commission’s
hostile behaviour and to follow common-sense attribution practices.
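The code begins with the handful of standard-library and third-party
modules it relies on: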
from collections import namedtuple               # lightweight record types
from fontTools.ttLib import TTFont               # parsing font files
from io import BytesIO                           # in-memory binary streams
from lxml.html import document_fromstring        # parsing HTML documents
from lxml.etree import tostring                  # serializing markup back out
from re import finditer, compile as re_compile   # regular expressions
from requests import get                         # HTTP requests
from sys import stdin, stdout                    # standard input and output