# Analysing contingency tables in R vs NumPy

Sharing an insight about how analysis metrics are calculated

## Introduction
The basic unit of analysis in Crunch.io is the CrunchCube, also referred to as just the "cube". The cube represents a contingency table, and is generated by the Crunch.io backend in response to a client's query. The clients that consume these results can be any third-party libraries that have access to Crunch.io.

At Crunch.io, we develop a few different libraries which consume CrunchCubes and produce second order analyses based on them. Those libraries must all produce the same results, so that users don’t see different calculations depending on which client they are using. The methods implementing the calculations, however, differ from client to client, since they are written in different programming languages. This post covers one subtle difference we found when comparing algorithms across Crunch’s Python package `cr.cube` and Crunch’s R client library `rcrunch`.

## Second order analysis
A cube typically contains counts, which represent the observed frequencies of combinations of two (or more) categories. It's best to illustrate this with an example:

| Gender vs Education | No HS | HS | College | 2-year | 4-year | Post-grad |
| --- | --- | --- | --- | --- | --- | --- |
| **Male** | 3 | 19 | 12 | 7 | 5 | 6 |
| **Female** | 1 | 24 | 8 | 1 | 4 | 2 |

Users, however, typically care about percentages (or other types of measurements), which are referred to as "second order analysis". The percentages can be calculated in three directions: row, column, and table percentages. Each is calculated by dividing the observed frequencies (the counts in the cells) by the corresponding base (or margin). Calculating the base correctly is the key to getting the right numbers for all the different types of measurements. These calculations are done on the client side, not on the Crunch.io backend (for various reasons; one exception is the exporter).

## Bases - Margins
The distinction between the terms "base" and "margin" is somewhat arbitrary, and they're often used interchangeably; we can equate them for the purpose of this demonstration. We will also consider that they're always calculated from the unweighted counts of a particular cube (for clarity only). The term "base" is usually used for the denominator needed to obtain the correct percentage (observed probability). The term "margin" usually designates the summaries displayed below or alongside the table.

### Column percentages
This is the default "direction" in which percentages are calculated. A column percentage represents the observed probability that an entry in a given column belongs to a particular category (row). In the above example, taking the `"College"` column, the probability of a college graduate being `"Male"` is 60%. This number is calculated as 12 / (12 + 8): the observed frequency, divided by the base, which is the total of the observed column.

The margin that can be displayed below the table is, in this case, the same thing as the base. That is, the margin _is_ the denominator of the "column percentages" measurement. Observe that the value of the denominator is obtained by summing the table across rows, separately for each column (the columns are not summed together).

| Gender vs Education | No HS | HS | College | 2-year | 4-year | Post-grad |
| --- | --- | --- | --- | --- | --- | --- |
| **Male** | 3 | 19 | 12 | 7 | 5 | 6 |
| **Female** | 1 | 24 | 8 | 1 | 4 | 2 |
| **Unweighted N** | **4** | **43** | **20** | **8** | **9** | **8** |

The actual column percentages are:

| Gender vs Education | No HS | HS | College | 2-year | 4-year | Post-grad |
| --- | --- | --- | --- | --- | --- | --- |
| **Male** | 75 | 44 | 60 | 88 | 56 | 75 |
| **Female** | 25 | 56 | 40 | 12 | 44 | 25 |
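
If you want to reproduce these numbers, here's a minimal NumPy sketch (the array and its name `counts` are just for illustration; the post's real code examples come further below). Dividing the counts by the column totals gives the table above, up to rounding:
```
>>> import numpy as np
>>> counts = np.array([
... [3, 19, 12, 7, 5, 6],
... [1, 24, 8, 1, 4, 2],
... ])
>>> (counts / counts.sum(0) * 100).round()
array([[75., 44., 60., 88., 56., 75.],
       [25., 56., 40., 12., 44., 25.]])
```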

### Row percentages
The values are calculated in essentially the same way as the column percentages, with the roles of rows and columns now reversed. So, to calculate the probability of a `"Male"` participant having graduated from college, we divide the observed frequency (12) by the total of all males (3 + 19 + 12 + 7 + 5 + 6).

The margin that can be displayed alongside the table, as in the previous example, _is_ the base, and is obtained by summing the table across columns (the rows are not summed together).

| Gender vs Education | No HS | HS | College | 2-year | 4-year | Post-grad | Unweighted N |
| --- | --- | --- | --- | --- | --- | --- | --- |
| **Male** | 3 | 19 | 12 | 7 | 5 | 6 | **52** |
| **Female** | 1 | 24 | 8 | 1 | 4 | 2 | **40** |

The actual row percentages are:

| Gender vs Education | No HS | HS | College | 2-year | 4-year | Post-grad |
| --- | --- | --- | --- | --- | --- | --- |
| **Male** | 6 | 37 | 23 | 13 | 10 | 12 |
| **Female** | 2 | 60 | 20 | 2 | 10 | 5 |
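
The same sketch works for rows (continuing with the hypothetical `counts` array from the column-percentages sketch). Note `keepdims=True`: it keeps the summed dimension around as size 1, so the division broadcasts one row total per row:
```
>>> (counts / counts.sum(1, keepdims=True) * 100).round()
array([[ 6., 37., 23., 13., 10., 12.],
       [ 2., 60., 20.,  2., 10.,  5.]])
```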

### Table percentages
Table percentages answer the question: "What is the probability of any given observation being of both category A and category B?" In our example: what is the probability of an observation being `"Male"` **and** having graduated from `"College"`? We arrive at the correct number with the following division: 12 / ((3 + 19 + 12 + 7 + 5 + 6) + (1 + 24 + 8 + 1 + 4 + 2)). The inner parentheses only illustrate that we're adding up *all* of the values to obtain the denominator.

The base is displayed in the lower right corner of the table. It is obtained by summing the frequencies across both rows and columns. It can also be obtained by first calculating the row percentage base and then summing all of its values (the same goes for the column percentage base), as is visible from the table below.

| Gender vs Education | No HS | HS | College | 2-year | 4-year | Post-grad | Unweighted N |
| --- | --- | --- | --- | --- | --- | --- | --- |
| **Male** | 3 | 19 | 12 | 7 | 5 | 6 | 52 |
| **Female** | 1 | 24 | 8 | 1 | 4 | 2 | 40 |
| **Unweighted N** | 4 | 43 | 20 | 8 | 9 | 8 | **92** |

The actual table percentages are:

| Gender vs Education | No HS | HS | College | 2-year | 4-year | Post-grad |
| --- | --- | --- | --- | --- | --- | --- |
| **Male** | 3 | 21 | 13 | 8 | 5 | 7 |
| **Female** | 1 | 26 | 9 | 1 | 4 | 2 |
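
And the corresponding sketch for table percentages, dividing the same hypothetical `counts` by the grand total:
```
>>> (counts / counts.sum() * 100).round()
array([[ 3., 21., 13.,  8.,  5.,  7.],
       [ 1., 26.,  9.,  1.,  4.,  2.]])
```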

### Calculating bases (denominators)
For the purposes of this illustration, let's assume that contingency table frequencies are stored in a 2D array-like structure. The real structure of cubes in R and Python is more elaborate, of course, with many additional properties, methods, etc. But the actual calculation ultimately happens on these simple structures. Note: the tables can also be 1D or 3D, depending on the type of analysis, but for this demonstration a 2D example will suffice.

So, the goal is to come up with a unique way of saying: "perform the summation of the frequencies, in the desired direction(s)". The following illustrates how this is achieved in both R and Python, and what the differences are. Hint: the actual difference lies in the interpretation of the input arguments to the summation methods (direction vs. dimension).

#### `margin.table` in R
Let's begin by creating the contingency table from the above example, as a 2D matrix. (You could also fetch the _actual_ cube from Crunch.io, but that's not covered in this post.)
```
> cube <- matrix(c(3, 19, 12, 7, 5, 6, 1, 24, 8, 1, 4, 2), nrow = 2, byrow = TRUE)
> cube
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 3 19 12 7 5 6
[2,] 1 24 8 1 4 2
```
We now arrive at the correct base values by using R's `margin.table` function with the _appropriate_ argument values. Note: the parameter to the function is called `margin` (we'll see what happens in NumPy a bit later). The following snippet calculates the bases for column, row, and table percentages; the corresponding `margin` values are 2, 1, and NULL (the default when the argument is omitted).
```
> # base for column percentages
> margin.table(cube, 2)
[1] 4 43 20 8 9 8
>
> # base for row percentages
> margin.table(cube, 1)
[1] 52 40
>
> # base for table percentages
> margin.table(cube)
[1] 92
```

Analysing these arguments, one could arrive at the following conclusion: `margin = 1` means summing across each row, `margin = 2` means summing down each column, and `margin = NULL` means summing across both (in other words: summing all the values in the table). Spoiler alert: we will see in a second that this definition needs refining when we look at higher-dimensional cubes.

#### `numpy.sum` in Python
Let's create the same example as in R, using NumPy. If you want to try the following snippets in an interactive session, an import statement is necessary first: `import numpy as np`.
```
>>> cube = np.array([
... [3, 19, 12, 7, 5, 6],
... [1, 24, 8, 1, 4, 2],
... ])
>>> cube
array([[ 3, 19, 12, 7, 5, 6],
[ 1, 24, 8, 1, 4, 2]])
```
The correct percentage bases are obtained using the `np.sum` function. The argument to this function, contrary to the R world, is called `axis`. We'll demonstrate later that this parameter is _essentially_ different in its meaning from R's `margin`. But for now, let's continue with the example, and the deduction trap one might fall into. Here are the correct base values for the different percentages:
```
>>> np.sum(cube, 0) # base for column percentages
array([ 4, 43, 20, 8, 9, 8])
>>>
>>> np.sum(cube, 1) # base for row percentages
array([52, 40])
>>>
>>> np.sum(cube) # base for table percentages
92
```
Accounting for Python being 0-indexed and R being 1-indexed, we arrive at the following conclusion: "columns are 0, rows are 1, and table is None". In the case of `numpy.sum`, this _is_ correct.

#### The catch
Based on the previous (limited) deductions, one might be tempted to think that NumPy is _wrong_, or that it's _backwards_ in comparison to R. Of course, if you're coming from Python, this almost sounds like an insult. And on top of that, the indexing kind of works the same. Let's index a 2D array. If you fix the first (or, in NumPy, the zeroth) index, and let the other take all values, you get the same results in R and NumPy (and Matlab, and probably others). Here it is in action:
```
> cube[1,]
[1] 3 19 12 7 5 6
```
```
>>> cube[0, :]
array([ 3, 19, 12, 7, 5, 6])
```
So why, then, if the rows and columns _are_ what you'd expect them to be, do these methods behave differently? Well, it's because they're different methods, with different arguments that mean different things. Let's illustrate with a 3D array of ones. The example will demonstrate how `axis` differs from `margin`, how the dimensions get reduced by summing over different things, and how to arrive at equivalent results in R and NumPy.

Here's the R version of a 2 x 3 x 4 array of ones:
```
> cube <- array(1, c(2, 3, 4))
> cube
, , 1

[,1] [,2] [,3]
[1,] 1 1 1
[2,] 1 1 1

, , 2

[,1] [,2] [,3]
[1,] 1 1 1
[2,] 1 1 1

, , 3

[,1] [,2] [,3]
[1,] 1 1 1
[2,] 1 1 1

, , 4

[,1] [,2] [,3]
[1,] 1 1 1
[2,] 1 1 1
```
And here's NumPy's:
```
>>> cube = np.ones((2, 3, 4))
>>> cube
array([[[1., 1., 1., 1.],
[1., 1., 1., 1.],
[1., 1., 1., 1.]],

[[1., 1., 1., 1.],
[1., 1., 1., 1.],
[1., 1., 1., 1.]]])
```
They _look_ different, because the default way of displaying them differs between R and NumPy. R slices on the last dimension (i.e. it displays matrices made of the first 2 dimensions, sliced along the remaining ones). NumPy slices on the first dimension, and displays matrices made of the last 2 dimensions. That's why we _see_ different things. But they're essentially the same 3D array: 24 elements, all equal to 1, with shape 2 x 3 x 4.
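
As a quick sanity check: R's first displayed slice (`, , 1`) corresponds to slicing NumPy's _last_ dimension, not its first:
```
>>> cube[:, :, 0]  # R's `, , 1` slice: the first 2 x 3 matrix
array([[1., 1., 1.],
       [1., 1., 1.]])
```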

Let's try to calculate bases for percentages. We'll consider the first (0th) dimension to represent tabs, because that's how cubes get formulated in Crunch.io. So the default way of slicing in NumPy is what you should concentrate on when thinking about this (even though they're still the exact same arrays). We want to obtain column bases for each of the slices. We have 2 slices of shape 3 x 4. Thus, we need two row vectors of shape 1 x 4 (the 1 gets flattened, but we'll keep it for illustration purposes).

In R, based on our previous deductions about what we consider a "column", let's try something like:
```
> margin.table(cube, 1)
[1] 12 12
```
and in NumPy, based on the same 2D experience, let's try something like:
```
>>> np.sum(cube, 0)
array([[2., 2., 2., 2.],
[2., 2., 2., 2.],
[2., 2., 2., 2.]])
```
But wait, these results are completely different! It's not just that the summation takes place across dimensions that are interpreted differently; the resulting matrices have a _different_ number of dimensions. OK, so what's the catch? The answer is that R's `margin` tells the `margin.table` function which dimension to _keep_, while it performs summation across all other dimensions. The result of the previous command in R is the same as:
```
> cube[,1,1] + cube[,1,2] + cube[,1,3] + cube[,1,4] +
+ cube[,2,1] + cube[,2,2] + cube[,2,3] + cube[,2,4] +
+ cube[,3,1] + cube[,3,2] + cube[,3,3] + cube[,3,4]
[1] 12 12
```
We can clearly see that the argument we passed to `margin` gets interpreted as "the index of the dimension that's allowed to keep all of its elements".

NumPy's `axis`, on the other hand, tells `np.sum` which dimension to _flatten_: it performs summation over that dimension, keeping all the others. The result of the previous command is the same as:
```
>>> cube[0, :, :] + cube[1, :, :]
array([[2., 2., 2., 2.],
[2., 2., 2., 2.],
[2., 2., 2., 2.]])
```
Again, we can clearly see that the interpretation of the parameter is different: in NumPy's case it means something like "allow all other dimensions to keep their elements, and sum across the provided one".

Ok, so how do we arrive at the equivalent? And more importantly, how do we accomplish our goal for this demonstration (getting two 1 x 4 rows)?

In R, since we want to go from shape 2 x 3 x 4 to shape 2 x 1 x 4, we need to provide the indices of the dimensions that we get to keep after the summation. These are dimensions 1 and 3. The command for doing that is:
```
> margin.table(cube, c(1, 3))
[,1] [,2] [,3] [,4]
[1,] 3 3 3 3
[2,] 3 3 3 3
```

In NumPy, we need to pass in the index of the dimension we want flattened by the summation. In this case, that's just the columns dimension (per slice). The index of the columns dimension (for each slice) is 1 (because it's 0-indexed; think of it as 2 in R). The command becomes:
```
>>> np.sum(cube, 1)
array([[3., 3., 3., 3.],
[3., 3., 3., 3.]])
```
We have finally arrived at the equivalent behavior of R's and NumPy's methods. Hooray!!!
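
Now that both interpretations are clear, we can capture the relationship in a small helper. This is just a sketch (a hypothetical function, not part of `cr.cube` or `rcrunch`): a `margin_table` that mimics R's "which dimensions to keep" semantics on top of `np.sum`, using 0-indexed margins:
```
>>> import numpy as np
>>> def margin_table(a, margin=None):
...     """Sum away every dimension NOT listed in `margin` (0-indexed)."""
...     if margin is None:
...         return a.sum()  # no dimensions kept: the grand total
...     keep = {margin} if np.isscalar(margin) else set(margin)
...     return a.sum(axis=tuple(i for i in range(a.ndim) if i not in keep))
...
>>> cube = np.ones((2, 3, 4))
>>> margin_table(cube, (0, 2))  # like R's margin.table(cube, c(1, 3))
array([[3., 3., 3., 3.],
       [3., 3., 3., 3.]])
>>> margin_table(cube, 0)  # like R's margin.table(cube, 1)
array([12., 12.])
>>> margin_table(cube)  # like R's margin.table(cube)
24.0
```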

Here are some additional examples for calculating column, row, and table bases. The last example calculates the total base, summing all the slices, columns, and rows together (although this is rarely used in cube analysis). The calculations are performed in the same order in both languages, so corresponding commands produce the same results in R and in NumPy.

R:
```
> margin.table(cube, c(1, 3))
[,1] [,2] [,3] [,4]
[1,] 3 3 3 3
[2,] 3 3 3 3
>
> margin.table(cube, c(1, 2))
[,1] [,2] [,3]
[1,] 4 4 4
[2,] 4 4 4
>
> margin.table(cube, 1)
[1] 12 12
>
> margin.table(cube)
[1] 24
```

NumPy:
```
>>> np.sum(cube, 1)
array([[3., 3., 3., 3.],
[3., 3., 3., 3.]])
>>>
>>> np.sum(cube, 2)
array([[4., 4., 4.],
[4., 4., 4.]])
>>> np.sum(cube, (1, 2))
array([12., 12.])
>>>
>>> np.sum(cube, (0, 1, 2))
24.0
```

## Takeaway
It's not easy to develop complex analytical software, especially when it's not specified in the greatest detail, in a style that would accommodate even a golden retriever (i.e. the person writing this post, at the time of implementing the software, in the domain of contingency tables). And that level of detail in a specification is almost never available in the real world.

Do your diligence, and strive to understand the basics properly. That's what the masters are best at.