How To Count Isomers
Mathematics is rich in technique and arguments. In this great variety one of the most basic tools is counting. Strangely enough, it is one of the most difficult. – Herstein, Topics in Algebra
Orgo
Early in the first semester of every organic chemistry class (“orgo”) students are taught to enumerate the variations that can occur in small straight-chain hydrocarbons like pentane and hexane. It’s called finding the structural isomers of alkanes. This often-infuriating exercise is meant to drive home the concept of molecular shapes and bond counting. But it also contains a hidden math puzzle that tormented researchers for the better part of a century.
Before elaborating it helps to get a basic idea of the isomer exercise itself starting with the skeletal drawings shown below. It’s a simple idea. Either end of any line segment represents a carbon atom. Every vertex (carbon) has four edges (its valence). What’s not shown is assumed to be hydrogen. That’s about it. For more, see [IR].
The next part is a little more involved. The skeletal drawings below correspond to the chemical formula for hexane, $C_6H_{14}$, a hydrocarbon in the alkane family (alkanes have the chemical formula $C_nH_{2n+2}$). The lines can be maneuvered in various ways. The challenge is to do so without altering the chemical formula or violating the valence of carbon (at most four bonds). The chemistry details are best left to the various links in the reference section at the bottom ([KA], et al), but as shown, it’s mostly a visual puzzle.
For orgo students, exercises like this tend to be a brute force endeavor and part of the challenge is knowing when to stop. Butane, for example, is fairly simple: there are only two solutions, as shown above. Yet that simplicity fades as the number of carbons increase. Apart from dealing with more subdivisions, there’s also a deeper challenge. For example, looking at the answer for hexane, one may wonder whether a bond could be attached at the fourth spot too. This would then yield six isomers instead of five.
But imagine both are spinning around in space. Could they be distinguished? If one flips around a bit, won’t it mirror the other? This is where some students get in trouble: they fail to notice the symmetries.
Symmetries are also the key to the underlying math conundrum: How does one enumerate the distinguishable elements in a set that has these type of symmetries? For the orgo problem, the challenge is to only count the distinct isomers for a given hydrocarbon. That is, how does one count without overcounting?
Related question: is there a closed formula that spits out the number of isomers of a given size, $n$? And final question: how fast does that number grow with $n$?
These turned out to be difficult problems. It took almost seventy years from when they were first articulated to when they finally got resolved. And as it often happens with open questions in mathematics, the pursuit led to many innovations in the field with applications far beyond the problem itself.
History
Chemistry is an acknowledged origin for the beginning and development of graph theory – Balaban
[Counting] can sometimes be done by a brute force case-by-case exhaustion but such a routing is invariably dull and violates a mathematician’s sense of aesthetics. One prefers the light, deft, delicate touch to the hammer blow. – Herstein, Topics in Algebra
The now ubiquitous skeletal form of organic molecules was devised by August Kekulé in the 1850s. This inspired (or was inspired by perhaps) contemporaneous work in graph theory with the emergence of trees attributed to Kirchhoff, i.e., graphs without cycles. Pairing these, skeletal drawings transform straight-chain hydrocarbons into connected acyclic graphs whose vertices have a degree of 4 or less (i.e., “4-ary”).
Initial work on this was done by Arthur Cayley through a series of papers that included “On the Analytic Forms Called Trees” [AC] and “On the Mathematical Theory of Isomers” [CI]. These were the first mathematical forays into molecular shapes as trees. In these papers, he correctly concluded that there was no closed-form formula for the alkane isomer counting problem. He further developed an iterative technique for obtaining the number of “topologically different planted trees.” A planted tree has an endpoint that is distinguished from the others. A planted tree is just a special case of a rooted tree, in which one point, the root, is distinguished. Conversely a unrooted tree is one in which no such root is identified. In chemistry terms, an alcohol is an example of a rooted tree, as is an alkyl; whereas, an alkane is an example of a unrooted tree. The unrooted tree (e.g., alkanes) proved to be more difficult to enumerate than its rooted cousins.
His counting formula for the number of topologically distinct planted trees is shown below, the derivation of which is found in the 1857 paper [AC].
$ \prod_{i=1}^n(1 – x^i)^{-T_{i-1}} = 1 + \sum_{i=1}^n T_ix^{i} $
It may not be obvious how to use this or how this relates to the counting problem. As will be shown, it’s actually the coefficients of these terms that are of interest here not the value of $x$. Each coefficient $i$ represents the count for the correspondingly sized tree. For example, the coefficient on $T_4$ will be the number of distinct rooted trees of size 4.
Power series like this are called generating functions. Generating functions provide a formal power series representation of sequences, which means that the $x^n$ terms serve as placeholders since only the coefficients are being studied. To put it more colorfully, “A generating function is a clothesline on which we hang up a sequence of numbers for display” [HW].
To get started, plug in initial known values for these coefficients (based on eyeballing trees of that size). Starting simple, $T_1 = 1$ and for notational convenience, let $T_0=1$. The next coefficient would be found by expanding terms and solving:
$ (1 — x)^{-T_0}(1 — x^2)^{-T_1} = 1 + T_1*x +T_2 * x^2 $, or
$(1 — x)^{-1}(1 — x^2)^{-1} = 1 + x +T_2 * x^2$.
The binomal theorem makes this doable: $\frac{1}{1-x} = \sum_{k=0}^{\infty} x^k$. Combining terms gives:
$ (\sum_{k=0}^{\infty} x^k )( \sum_{k=0}^{\infty} x^2k) = 1 + x + 2x^2 +…. = 1 + x +T_2 * x^2 $
The left-hand product is easy if you write it columnar form and then group like terms. Stop once you reach the squared term since we’re interested in that terms’ coefficient, $T_2$. It follows immediately that $T_2 = 2$. Subsequent terms follow the same pattern, which means that each term requires the computation of the one that precedes it.
As expected, the algebra gets gruesome as the terms accumulate. Cayley only got to $N=11$ and some of those had errors, but it was an important first strike on this problem. Cayley later extended this work by producing other iterative formulas to compute free tree counts with the restriction of valence added [AC3]. In doing so, he computed the first several values of the alkane (or “paraffin,” as it was known then) isomer counts. In retrospect this technique did not prove particularly useful: “Cayley’s approach to counting alkanes … turns out to be an awkward way to attack the problem … and may explain why no one else has used this approach” [RS]. Nonetheless, Cayley’s formulations helped shape modern graph theory.
After a period of dormancy on this topic, Henry Henze, a chemist at University of Texas at Austin, published the 1931 paper with colleague C.M.J. Blair, “The Number of Isomeric Hydrocarbons of the Methane Series” [HH]. Their work greatly improved on previous attempts by using a recursion algorithm that summed the isomer count of various constituent alkyls (rooted trees). Henze and Blair later developed counting solutions for alkane stereoisomers, unsaturated hydrocarbons and many other important aliphatic compounds too.
The Hungarian mathematician George Pólya continued this work in 1936 with his “Combinatorial Enumeration of Groups, Graphs and Chemical Compounds” [GP]. This landmark paper expanded on these pioneering works by introducing an innovative mathematical framework (Pólya Enumerations) that proved useful to a wide variety of spatially-aware counting problems, not just isomers. It’s worth noting that a similar theory had been developed independently in 1927 by J. H. Redfield but his paper languished in obscurity for decades. Conversely, Pólya’s theorem became the basis for the majority of future work in this area. Still, many now refer to it as the Redfield-Pólya Theorem.
In the 1960s, mathematicians Frank Harary, Robert Norman, et al, developed a new recursive solution to this problem by combining Pólya’s counting theorem with their own results in graph theory. This solution has been since widely re-printed is now regarded as the de facto technique. These authors went on to develop enumerative solutions for a wide body of chemical compounds, as described in the 1973 text, Chemical Applications of Graph Theory [AB].
By this point computers were now available to churn through these formulae too, allowing researchers to generate values far beyond what could be reasonably performed by hand (as Cayley and Henze found out). But as will be shown, they quickly ran into limits of that computational power when solving this problem. Still, it put an end to the numerical errors that plagued earlier papers.
The problem appeared to be solved but then in 1999 Cal Tech mathematician Dr. Eric Rains and his colleague Dr. Neil Sloane at Bell Labs revisited Cayley’s work on centered and bi-centered trees by pairing it with Pólya’s enumerative technique. In a terse two page proof they developed a new solution that is beautifully efficient. The alkane isomer count sequence is now registered as A602 in the Online Encyclopedia of Integer Sequence. See [RS] for details.
French mathematician Phillipe Flajolet and Princeton computer scientist Robert Sedgewick have since developed a precise framework for the more general problems of this type from which the field of Analytical Combinatorics [FAC] was born. This is the from the preface of their textbook of the same name:
Analytic combinatorics aims to enable precise quantitative predictions of the properties of large combinatorial structures. The theory has emerged over recent decades as essential both for the analysis of algorithms and for the study of scientific models in many disciplines, including probability theory, statistical physics, computational biology and information theory. With a careful combination of symbolic enumeration methods and complex analysis, drawing heavily on generating functions, results of sweeping generality emerge that can be applied in particular to fundamental structures such as permutations, sequences, strings, walks, paths, trees, graphs and maps.
If interested, they offer free textbooks and online courses on these subjects. But before jumping into that, it’s helpful to get more historical perspective by looking in detail at Pólya’s work on the subject since his work is so fundamental to everything that followed.
Pólya and Symmetry
Pólya begins his 1936 paper [GP] with a counting problem.
Suppose you have six balls with three different colors, three red, two blue, one yellow… In how many ways can you assign the six balls to the six vertices of an octahedron which moves freely in space?
This may seem like a basic multinomial combinatorics question. And indeed if the question was, how many ways can three red, two blue and one yellow ball be placed into one of six labelled boxes, the answer would be:
$\frac{6!}{3!2!1!}=60$.
But he added the wrinkle of it being an octahedron floating in space. Imagine making a particular color choice and then rotating the octahedron in one of the ways shown below. Would that specified coloring remain distinct among the other colorings?
After posing the question, he outlined a solution and then developed the theory to prove it. For the purposes of this post, it may be simpler to demonstrate the technique using a simpler problem, one found in [JJ] [KT] and others.
The Two Color Square Problem
Imagine you’re going to paint the vertices of a square with one of two colors, red and blue. Each corner can be either color. Try to enumerate all the ways this can happen. Then look at the permutations that arise when that square is rotated and reflected in the plane.
Before worrying about the spatial aspects, start by enumerating the ways to assign colors. This part is easy. Since there are four vertices that can be assigned to one of two colors, there are 16 (=2⁴) outcomes as shown below.
What spatial patterns emerge? There’s a big hint in the yellow rows. Colorings within those rows only differ by their orientation, otherwise they’re identical. But what do these rows represent mathematically?
A little group theory
Start by considering the available rigid motions on a square. That’s three rotations, four reflections and the identity (no change) for a total of eight motions (see below).
The set of permutations on the set of vertices, $\{1, 2, 3, 4\}$ form an algebraic group called the dihedral group of a square, denoted $D_4$. $D_4$ is a type of permutation group, a group whose members are permutations of a finite set.
Permutations like those in $D_4$ can further be expressed in a short-hand called cycle notation. For example, (13)(2)(4) indicates that vertex 1 maps to vertex 3 and vertex 3 maps to vertex 2 while both 2 and 4 remain unchanged. Using this notation, $D_4$ can be represented as the following set:
$\displaylines{D_4 = \{(1)(2)(3)(4), (1 2 3 4), (1 3)(2 4), (1 4 3 2), \\
(2)(4)(1 3), (1)(3)(2 4), (1 2)(3 4), (1 4)(2 3) \}}$
Cycle notation can be further simplified by combining each cycle into a single term (a monomial) $f_i^j$, where the $i$ is the number of elements inside each parenthetical (also called the order) and $j$ is the number of times it occurs.
For example, the expression (2)(4)(13) has two first order permutations composed with one second order permutation. This is denoted $f_1^2f_2$, meaning two one element cycles composed with one two element cycle.
The complete set of cycle indices for $D_4$ is shown below.
$\displaylines{(1)(2)(3)(4) = f_1^4 \\
(1 2 3 4) = f_4 \\
(1 3)(2 4) = f_2^2 \\
(1 4 3 2) = f_4 \\
(2)(4)(1 3) =f_1^2f_2 \\
(1)(3)(2 4) =f_1^2f_2 \\
(1 2)(3 4) = f_2^2 \\
(1 4)(2 3) = f_2^2}$
Collectively these monomials can be grouped algebraically to produce the following polynomial expression:
$ f_1^4 + 2f_1^2f_2 + 3f_2^2+ 2f_4 $
These are then normalized by the size or order of the group, $|D_4|=8$ (note that for dihedral groups, the order is the twice the number of vertices).
The revised polynomial is called a cycle index for $D_4$.
$ Z_{D_4}(f_1, f_2, f_3, f_4) = \frac{1}{8}(f_1^4 + 2f_1^2f_2 + 3f_2^2+ 2f_4 )$
Or more generally,
If $X$ is a set with $|X|=r$, let $G$ is a group acting on $X$ and let $j_k(g)$ be the number of cycles of g of length k, then the cycle index is
$ Z_G(f_1, f_2, …, f_r) = \frac{1}{|G|} \sum_{g \in G} z_g \prod_kf^{j_k(g)}$
To get back to the coloring question a few more terms are needed.
Given a group $G$ and an arbitrary set $X$, a group action of $G$ on $X$ is a map $G \times X \to X$ given by $(g,x)=gx$. That is, $gx = y$ where $g \in G$ and $x,y \in X$.
For any element $x \in X$, the set of elements obtained through the group action of $G$ on $x$ is called the orbit of that element under $G$. That is, $O_{x, G} = \{ y : gx = y, g \in G, y \in X\}$.
For example, let C be the set of colorings in Figure 5. It’s members are ordered sets such as {r, b, b, b}. For the group action of $D_4$ on $C$, the orbit of {r, b, b, b} is the set {{b, r, b, b}, {b, b, r, b}, {b, b, b, r}}. This is just the second row in that figure.
Orbits partition the set with respect to that group action. This means that every element in the set of colorings belongs to one and only one orbit. Consequently, each yellow row in Figure 5 is a member of an orbit. The number of orbits therefore counts the number of distinguishable elements with respect to these motions. Or in this case, the number of distinguishable colorings. It should be no surprise then that determining the number orbits is essential to solving the isomer problem.
Here comes Pólya
Observe that there is no mention of color in the cycle index above. That’s because this cycle index only encodes information about the group action of $D_4$ on $X$, the vertices of this square. Pólya’s ingenious idea was to compose these monomials with a representation of the coloring [GP].
This is accomplished by composing a second generating function into the cycle index. A set of objects $Y$ (for example, colors or functional groups) called figures have integral weights assigned to them ($w: Y \to \{0, 1, 2…\}$). This produces a figure counting function, $c(x) = c_0 + c_1 x + c_2 x^2 + … + c_mx^m$, where the coefficient $c_i$ is the number of elements in $Y$ with weight $i$. Typically these coefficients are just $c_i=1$ because there’s only one color in the set of colors $Y$ associated with each weight. But it was designed to be generalizable. See [SA] for example.
One version of the Pólya-Redfield Enumeration Theorem states:
Let $X$ be a set with $|X|=r$ and let $G$ be a group acting on $X$ with a cycle index $Z_G(f_1, …, f_r)$. Let $c(x)$ be a figure counting series. The generating function $F_G(c(x))$ that enumerates the orbits of $c(x)$ with respect to the permutation group $G$ acting on $X$ is:
$F_G(c(x)) = Z_G(c(x), c(x^2), …, c(x^r))$
Note that $c(x^i) = c_0 + c_1x^i + c_2x^{2i} + … + c_mx^{mi}$. For example, if $c(x) = 1 + x $ then $c(x^3) = 1 + x^3$.
The set of figures in the square coloring problem are just the colors red and blue, $Y=\{r, b\}$. We can assign weight(red) = 0 and weight(blue) = 1. The corresponding figure counting function is then $c(x) = 1 + x$ since there’s a one-to-one mapping of the elements of $Y$ to $\{0,1\}$. Thus $c_0=c_1=1$.
Alternatively, you can use the colors themselves as the terms in the counting function. For the two color problem, this is just $r + b$, which is more intuitive. Both versions are used in the literature because Pólya’s paper represented these figures as multivariable objects. For simple coloring problems like this there’s no loss of meaning in the reduction to one variable.
Composing the latter version into the corresponding cycle index $\frac{1}{8}(f_1^4 + 2f_1^2f_2 + 3f_2^2+ 2f_4 )$ produces the following generating function:
$ \frac{1}{8}((b + r)^4 + 2(b + r)^2(b^2 + r^2) +3(b^2 + r^2)^2 + 2(b^4 + r^4))$
which after regrouping leaves
$ b^4 + b^3 r + 2 b^2 r^2 + b r^3 + r^4$
This coefficients of this generating function correspond to the counts of the corresponding monomial. Based on the coefficients above, there is only one topologically distinguishable way to paint the vertices all red or all blue as 1 is the coefficient on both the b⁴ and r⁴ terms. There are two ways to distinctly paint the corners with two reds and two blues (coefficient on b²r²), one way for three blues and one red (coefficient on rb³), and one way for one blue and three reds (coefficient on br³). The sum of the coefficients is six, as six of the sixteen total colorings are distinguishable, as expected (see Figure 5). This is our orbit count too.
Pólya’s paper referred to these orbits as configurations. To be precise, they are not defined in the same way, but they can be shown to be equivalent [MB]. The net effect is the same set of unique colorings under this group action. Configurations are also referred to as patterns and the analysis of counts as the pattern inventory. Related to this is Burnside’s Lemma [KT], which only predicts the number of orbits and does not provide the pattern inventory like Pólya’s does.
Returning to the original octahedron three color problem (r, b, y), the symmetries for the vertices of an octahedron consists of 24 rigid motions [GP] which produce the following cycle index
$ \frac{1}{24}(f_1^6 + 6f_1^2f_4 + 3f_1^2f_2^2 + 6f_2^3 + 8f_3^2)$
For the three colors, red, blue and yellow, the figure counting function will be $r + b + y$. The result after reducing the polynomial is shown below (use a computer unless you really, really enjoy doing high school algebra). And note, this was done using sage.
Here the r³b²y term has a coefficient of 3. Ergo, there are only 3 distinguishable colorings using 3 red, 2 blue and one yellow ball. In other words, only 3 the 60 possible colorings are topologically distinct. Quite a reduction!
Summing the coefficients in the resulting generating function also tells us that of the 729 (=3⁶) possible ways to apply one of three colors to the 6 vertices, only 57 are topologically distinguishable, i.e., there are 57 orbits.
This framework provides a huge step toward solving this orgo question. If one can count distinct colorings on a shape then the same can be done with isomers. But first more tools…
Recursion and Generating Functions
The generating function method described in the previous section works well for a fixed sized, but for the isomer problem the permutations are nested. For example, if we’re interested in a carbon chain of length 14, we’ll need the counts for the chains of size 13, 12, 11, and so on.
The brute force method of removing elements one by one and reattaching what remains onto a shorter chain (see Figure 1) is an example of a recursive process. Mathematically, this is a function with one or more initial conditions that generates a new value from its previous (or next) terms. A famous example is $a_n = a_{n-1} + a_{n-2}$ with $a_0 = a_1 = 1$, which produces the Fibonacci series, $1, 1, 2, 3, 5, 8, 13, 21, …$.
This can also be expressed as a generating function, the first few terms of which are
$F(x) = 1 + x + 2x^2 + 3x^3 + 5x^4 + 8x^5 + … $
Or more generally,
$F(x) = \sum a_k x^k$
Multiplying F(x) by x shifts the series as shown
$F(x) = \sum a_{k-1} x^k$
Similarly $x^2 F(x)$ is
$F(x) = \sum a_{k-2} x^k$
Since $a_n – a_{n-1} – a_{n-2} = 0$ it follows that $F(x) — xF(x) — x^2F(x) = 0$ or $F(x) =\frac{1}{(1 – x – x^2)}$
This expression is kind of neat in and of itself in that the term on the left contains the Fibonacci sequence on the right.
$\frac{1}{(1 – x – x^2)} = 1 + x + 2x^2 + 3x^3 + 5x^4 + 8x^5 + … $
The trick for the isomer problem will be to impose recursion on the previously described cycle indexes. As a setup, suppose that a recursive relationship exists between generating functions for successive values of $n$, say
$ A_n(x) = 1 + \frac{x}{6}(A_{n-1}(x)^3 + 3A_{n-1}(x)A_{n-1}(x^2) + 2A_{n-1}(x^3)) $
As a generating function, $A_n(x)$ may also be written as a polynomial per the previous discussion of cycle indices.
$ A_n(x) = \sum_0^{n}{a_k x^k}$
To solve, apply the same principal described for solving Cayley’s equation. Start with a known value of $a_k$ and apply those values to the first terms of the equation, leaving a placeholder for the next value. Additionally, set all higher coefficients to zero.
$A_2(x) = 1 + a_1x $
Next, substitute this expression into the recursive formula and expand terms.
$ A_2(x) = 1 + \frac{x}{6}(A_1(x)^3 + 3A_1(x)A_1(x^2) + 2A_1(x^3)) $
$ A_2(x) = 1 + \frac{x}{6}(1 + 3 + 2) = 1 + x $
It follows that $a_1$, the coefficient on $x$, is also 1. Next, set both into the first 3 terms of the equation, leaving a spot for the next value:
$A_3(x) = 1 + x + a_2x^2 $
Now substitute this expression into the recursive formula and expand terms.
$ A_3(x) = 1 + \frac{x}{6}(A_2(x)^3 + 3A_2(x)A_2(x^2) + 2A_2(x^3)) $
$= 1 + \frac{x}{6}((1 + x)^3 + 3(1 + x)(1 + x^2) + 2(1 + x^3)) $
$ A_3(x) = 1 + \frac{x}{6}(6 + 6x + 6x^2 + 6x^3) $
$= 1 + x + x^2 + …$
Once the coefficients are aligned, the next value can be inferred, which is $a_2 = 1$. The higher order terms are discarded since it was assumed the higher order terms were all zero. Rinse, lather, repeat. The next value is $a_3 =2$, for example. The algebra will become unbearable after a few iterations but thankfully tools like Sage, Wolframalpha and MatLab can churn on symbolic equations with ease.
Details on these sorts of problems can be found at [GF]. There’s also an entire online textbook on this subject [HW].
This particular recursion will appear again in the alkyl discussion below. It can also be shown that this particular relationship can be expressed in a product form similar to the one that Cayley devised in the previous section.
Constructing a Solution for Alkyls
Alkyls have the chemical formula, $C_nH_{2n-1}$. These are just alkanes with a missing hydrogen atom. As shown below this missing atom on the left provides a free-bonding site. The other three spots, denoted $R_i$, are placeholders for either additional hydrogen atoms or additional alkyls molecules that have at most n-1 carbon atoms (i.e., the carbons in all spots sums to n-1 since the center atom is a carbon too).
In Cayley’s language, alkyls are rooted trees whose root is the free-bonding site. These free-bonding sites also provide an analog to the coloring from earlier: each site can be composed with a figure. In this case, the figure is just another alkyl group. This implies recursion. Thus the generating function for the (n-1)th term, $A_{n-1}(x)$, is the figure counting function for the nth term, $A_n(x)$. These are then composed using Pólya’s Enumeration Theorem.
To adjust for the carbon difference between $n-1$ and $n$, multiply the cycle index by $x$, e.g., $xA_{n-1}(x)$. This shifts the terms by one ($x$ becomes $x^2$, and so on) thereby keeping the coefficient counts consistent. Because of this shift, the equation includes a replacement term (the leading “$1 + …$”) to represent the lone hydrogen that disappeared in the multiplication. This may seem counterintuitive but in the world of generating functions it’s a standard technique.
Next, there are three vertices in the alkyl molecule, which suggests the S3 permutation group (like a triangle). The cycle index for a triangle is even easier than that of a square. There are six elements in its permutation group and three monomials in its cycle index. Our figures (i.e., colors) are the $A_{n-1}(x)$ terms. After substituting, the cycle index becomes
$ A_n(x) = 1 + xZ_{S_3}(A_{n-1}(x)) $
$ A_n(x) = 1 + \frac{x}{6}(A_{n-1}(x)^3 + 3A_{n-1}(x)A_{n-1}(x^2) + 2A_{n-1}(x^3)) $
Iterating on this produces the following generating function whose coefficients are the counts for the respective alkyl molecules. Note this is the solution found in the example in the recursion section.
$ A(x) = 1 + x + x^2 + 2x^3 + 4x^4 + 8x^5 + 17x^6 + … $
Constructing a Solution for Alkanes
Counting the number of distinguishable acyclic alkane isomers is the same as counting the number of distinguishable unrooted and unlabeled 4-ary trees. The alkyl solution, $A(x)$, handles rooted and unlabeled 3-ary trees. This serves as the basis for several classic solutions. The first by Henze and Blair in the 1930s recursively splits the alkane into a variety of tree subcomponents and then sums the results [HH]. The more well known solutions are those of Harary and Norman’s from the 1960s and a newer one by Rains and Sloane [RS]. These also split the alkane (unrooted tree) by various symmetry properties and then sum the tree components into a single generating function. Finally, there’s the approach using analytic combinatorics described by Sedgewick and Flajolet in their text of the same name [FAC]. This latter approach differs from the chemistry-minded solutions in that they treat it as a special case of tree enumerations using the language of combinatoric classes. It’s a rather beautiful theory that possibly trumps everything else before it. But first the classic solutions.
The Harary/Norman solution partitions the alkane into two classes. In the first, a single carbon atom is labelled to allow for a rooted tree enumeration on the resulting alkyls. The second uses a labelled edge and then studies the enumeration of the resulting two rooted trees that it connects.
They then apply a novel symmetry relationship between equivalence classes of edges and vertices that provide a summing formula for these components. The resulting recursive expression is shown below. Details are found in various texts listed below [SR][PP][GE].
$$\begin{align}
N(x) = \frac{x}{24}(A(x)^4 + 6A^2(x)A(x^2) + 3A^2(x)A(x^3) + 6A(x^4)) \, – \, \\
\frac{1}{2}[(A(x) – 1)^2 + A(x^2) + \,1]
\end{align}$$
Cayley’s early attempt tried to split the alkane into centered and bi-centered trees. This is the basis for the Rains/Sloane solution. Here they construct what Cayley was unable to sort out, which is a decomposition between centered rooted trees and bi-centered rooted trees. The trick was to focus on the diameter of those trees once they were split apart. The nice thing about this solution is that it does not require anything beyond Polya enumerations and the resulting cycle indexes. The key points are now described.
The core term is a generating function for (k-1)-ary rooted trees with height at most $h$. For a given size of $n$ this is denoted $T_{h, n}$ and the generating function is $T_{h}$. A recursive relation is deduced that includes a cycle index term for the permutation group, $S_{k-1}$. In the alkane (carbon) case, $k=4$, which is the $S_3$ group from the alkyl discussion above.
$T_{h+1} = 1 + xS_{k-1}(T_h(x))$
The next term represents the center case. Here the interest is in organizing these centered trees by their diameter, $d=2h$. If the center node is deleted, what remains are k rooted trees, at least two of which have height of $h-1$. That is, $2*(h – 1) + 2 = 2h$, accounting for the two removed edges. Because the function $T_{h}$ enumerates trees whose height is at most $h$, they find this center count by using the following clever summation:
$C_{2h}(x) = (1 + xS_{k}(T_{h-1}(x))) -$
$(1 + xS_{k}(T_{h-2}(x))) -$
$(T_{h-1}(x) – T_{h-2}(x))(T_{h-1}(x) -1) $
To understand this, first note that “at least two with a height of h-1” can be found by first subtracting those with heights “at most h-1” from those with heights “at most h-2”. The last term addresses an overcounting of those arrangements where only one tree of height h-1 occurs instead. The goal is at least two.
For the bi-centered case, splitting a tree of diameter $2h-1$ leaves a pair of rooted trees of height $h$. This can be arranged using the $S_2$ permutation of the remaining two trees whose heights are exactly $h$.
$B_{2h -1}(x) = S_2(T_{h}(x) – T_{h-1}(x))$
These are then both summed over $h$ as a generating function. The special case of alkanes is solved for $k=4$.
The last approach is the one that appears to be the new standard in this topic. Analytic Combinatorics [FAC] provides a mathematical framework in which combinatoric problems are generalized into a system that relates combinatorial classes (a set of discrete objects and a size function) to generating functions. Certainly not something that can be easily summarized in the context of this larger topic but it is absolutely relevant too. This technique yields Cayley’s product series in a few simple steps, for example, and it can be used to model the alkane isomer problem too. In the end, recursive computations are still required but the theory is both innovative and elegant.
Regards of the technique used the alkane generating function is shown below. The coefficients on $x^n$ correspond to the isomer counts for alkanes of size $n$.;
$ N(x) = 1 + x + x^2 + x^3 + 2x^4 + 3x^5 + 5x^6 + 9x^7 + 18x^8 + 35x^9+… $
Computations
While the above derivations provide a mathematical representation of the solution, implementing this as a computer program can be quite different. Algorithms define a precise sequence of steps to produce a desired output in a performant manner. This latter requirement often stands in contrast to the mathematical expression itself. For example, an algorithm can grind through the repetitious steps of testing whether an isomer has already been discovered and can optimize that discovery through memoization. There has been a separate thread of research in the computer science community on this problem, sometimes without awareness of the work done by aforementioned mathematicians/chemists. This is because it lands squarely in the domain of divide and conquer problems familiar to all computer scientists. Specifically, it’s a variation on counting the nodes in binary and 2-3 trees. The former maps to the famed Catalan numbers.
So how good are these algorithms? A 2002 analysis of several prominent alkane enumeration tree algorithms found them to perform at or around $O(n^4S_n)$ [AHM]. Certainly not a performance metric that you’d want in a frequently-run application or one that operates on a large data set. But these sorts of programs usually only need to run once.
Tree recursion examples: This one was written in Java. This one was written in Python.
Ohers have implemented the program using the mathematical solution derived in the previous section. This version in python written by Mingda Zhang uses the Harary method. To that end, here are a few notes about implementing these symbolic expressions.
First, the coefficients are, of course, stored in an array that grows as the program recurses. But dealing with convolved terms like $A[x]*A[x^2]$ requires an understanding of how to do convolutions on generating functions. A good reference is found starting on slide 24 of these lecture notes. Based on this rule, it follows that
When $A(x) = \sum_{n=0}^{\infty}a_n x^n$ and $A(x^2) = \sum_{n=0}^{\infty}a_n x^2$.
Then $C(x) = A(x)A(x^2)$ has coefficients of the form $c_n = \sum_{k=0}^{n}a_k a_{2*(n-k)}$
Note that the individual coefficients are finite sums. Thus computing these terms requires a loop per the above expression. This is exactly what Zhang’s python script is doing.
Lastly, careful handling of large integers is required to run programs like this (BigInteger in Java, CDN library in C++, etc.; Python handles this natively). As shown in the next section, these counts get insanely large even at small values of $n$.
Asymptotics
Early studies of the problem were limited to small values of $n$ as anything larger proved too complex to compute by hand. This should give a hint to the final question: how fast does it grow?
Here’s a better hint. For $n=10$ (“decane”), there are 75 isomers. For $n=30$ (“triacontane”) there are 4,111,846,763. And at $n=100$ (“hectane”) it starts to approach the number atoms on our planet ~$10^{40}$. This suggests that as $n$ gets large, the number of isomers exceeds the number of available atoms to construct them, which is of course ludicrous. The reality is that many of these isomers (if not most when dealing with larger numbers) can not exist for physical reasons (steric hindrance, etc.). Or many (or most) simply will never exist even if they could. The upshot is that these counts are theoretical not factual. Nonetheless, the predicted value for hectane is
$5,921,072,038,125,809,849,884,993,369,103,538,010,139$
Number of Carbon Atoms | Number of Structural Isomers |
---|---|
1, Methane | 1 |
2, Ethane | 1 |
3, Propane | 1 |
4, Butane | 2 |
5, Pentane | 3 |
6, Hexane | 5 |
7, Septane | 9 |
8, Octane | 18 |
9, Nonane | 35 |
10, Decane | 75 |
20, Isocane | 366,319 |
50, Pentacontane | 1,117,743,651,746,953,270 |
100, Hectane | 5.92107203812581×1039 |
Pólya also solved the asymptotic question in his famed paper. He did this by considering the radius of convergence of a related generating function and from that constructing upper bounds. He found the growth is $O(q^n)$, where $q$ is inverse of the corresponding finite radius of convergence, $\rho$ and where $\frac{1}{e}>\frac{1}{\rho}$.
Specifically, he showed that the number of indistinguishable acyclic alkane isomers grows at $(\rho^n n^{5/2})^{-1}$. In his paper, he bounded the value at $0.35 < \rho < 0.36$.
$ N(x) < (\rho^n n^{5/2})^{-1} \,\,, \rho \approx .35 $
Experiments in the graphs below show a good match of $\rho = 0.35668$ up to $N=100$ at least. This was done using Google Sheets and includes error plots for the model fit.
For an elegant mathematical analysis, see page 477 of [FAC] or read the original analysis in [GP].
Closing Thoughts
To therefore answer the questions posed: a) yes it’s solvable, b) but there is no closed-form formula, and c) the values grow exponentially. The foundational work is found in the not-so gory details are found in the surprisingly readable translation (from German) of Pólya’s original paper [GP] with supplemental essays by computational chemist, R. C. Read [AB]. There are also many illuminating papers by the mathematician Frank Horary and his colleagues. Additional references are found below.
As noted, these techniques can be applied to the enumeration of a wide range of molecules, not just alkanes. This includes alcohols, phenols and many of the aliphatic compounds. These techniques can also be applied to stereoisomer enumerations too. See the [SR] for a detailed dissection. At the same time, chemists have pointed out that many of these isomers can not exist due to steric hinderance and other molecular constraints. So, while the math produces a well-argued numeric value, the physical reality is likely far lower.
One final note. It is still unknown whether this isomer counting problem is NP-Complete. [FAC]
References
Detailed coverage of the math and chemistry in this post:
[SR] “Sandia Report: Enumerating Molecules,” Faulon, Visco, Roe, 2004 (excellent and comprehensive!).
[GP] Combinatorial Enumeration of Groups, Graphs and Chemical Compounds, Pólya, G., 1936. Translated to english with additional essays by Read, R. C., 1987. (textbook)
[GE] Graphical Enumeration, Harary, F., Palmer, E., 1973 (textbook).
[AB] Chemical Applications of Graph Theory, Balaban. A. T., Read, R. C., and others, 1975 (textbook).
[HH] “The Number of Isomeric Hydrocarbons of the Methane Series,” Henze/Blair, 1931.
[CI] “On the Mathematical Theory of Isomers,” Cayley, A., 1874.
[AC] “On the Analytic Forms Called Trees,” Cayley, A., 1881.
[AC3] “On the analytical forms called Trees, with application to the theory of chemical combinations“, Cayley, A., 1896.
[KT] Explainer on Pólya’s Enumerations, Keller/Trotter (online book).
[MB] Polya’s Enumeration Theorem and its Applications, Bell, M., Master’s Thesis, University of Helsinki, 2015.
[SA] How to Count, An Exposition of Polya’s Theory of Enumeration, Anand, S, 2002.
[RS] “On Cayley’s Enumeration of Alkanes (or 4-Valent Trees),” Rains/Sloane, 2002.
[PP] Enumerating Chemical Compounds, Prochazka, P. YouTube.
[JJ] MIT Lecture on Enumerations, online pdf
[OS] Orbits and Stabilizers Primer, online pdf
[HW] GeneratingFunctionology, online textbook by H. Wilf.
[GF] Princeton CompSci Tutorial on Generating Functions
[LT] Libretexts Chemistry on Alkane Isomers
[IR] On Skeletal Drawings
[FAC] Analytic Combinatorics, Flajolet, Sedgewick, 2009.
[AHM] Chemical trees enumeration algorithms, Aringhieri, R., Hansen, P., and Malucelli, F. (2003). Quarterly Journal of the Belgian, French and Italian Operations Research Societies, 1(1):67–83.
If you enjoyed this post please click the like button, share it, or leave a comment. It would mean a lot considering how much effort went into writing it. Thanks!