# Stata and R in a jupyter notebook¶

The jupyter notebook project is now designed to be a ‘language agnostic’ web-application front-end for any one of many possible software language kernels. We’ve been mostly using python but there are in fact several dozen other language kernels that can be made to work with it including Julia, R, Matlab, C, Go, Fortran and Stata.

The ecosystem of libraries and packages for scientific computing with python is huge and constantly growing but there are still many statistics and econometrics applications that are available as built-in or user-written modules in Stata that have not yet been ported to python or are just simply easier to use in Stata. On the other hand there are some libraries such as python pandas and different visualization libraries such as seaborn or matplotlib that give features that are not available in Stata.

Fortunately you don’t have to choose between using Stata or python, you can use them both together, to get the best of both worlds.

# Jupyter running an R kernel¶

R is a powerful open source software environment for statistical
computing. R has R markdown which
allows you to create R-markdown
notebooks
similar in concept to jupyter notebooks. But you can also run R inside a
jupyter notebook (indeed the name ‘Jupyter’ is from **Ju**lia,
i**Pyt**hon and **R**).

See my notebook with notes on Research Discontinuity Design for an example of a jupyter notebook running R. To install an R kernel see the IRkernel project.

# Jupyter with a Stata Kernel¶

Kyle Barron has created a stata_kernel that offers several useful features including code-autocompletion, inline graphics, and generally fast responses.

For this to work you must have a working licensed copy of Stata version 14 or greater on your machine.

# Python and Stata combined in the same notebook¶

Sometimes it may be useful to combine python and Stata in the same
notebook. Ties de Kok has written a nice python library called
ipystata that allows one to
execute Stata code in codeblocks inside an ipython notebook when
preceded by a `%%stata`

magic command.

This workflow allows you to pass data between python and Stata sessions and to display Stata plots inline. Compared to the stata_kernel option the response times are not quite as fast.

The remainder of this notebook illustrates the use of ipystata.

## A sample ipystata session¶

For more details see the example notebook and documentation on the ipystata repository.

```
In [1]:
```

```
%matplotlib inline
import seaborn as sns
import pandas as pd
import statsmodels.formula.api as smf
import ipystata
```

The following opens a Stata session where we load a dataset and
summarize the data. The `-o`

flag following the `%%Stata``` magic
instructs it to output or return the dataset in Stata memory as a pandas
dataframe in python.

```
In [2]:
```

```
%%stata -o life_df
sysuse lifeexp.dta
summarize
```

```
(Life expectancy, 1998)
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
region | 68 1.5 .7431277 1 3
country | 0
popgrowth | 68 .9720588 .9311918 -.5 3
lexp | 68 72.27941 4.715315 54 79
gnppc | 63 8674.857 10634.68 370 39980
-------------+---------------------------------------------------------
safewater | 40 76.1 17.89112 28 100
```

Let’s confirm the data was returned as a pandas dataframe:

```
In [3]:
```

```
life_df.head(3)
```

```
Out[3]:
```

region | country | popgrowth | lexp | gnppc | safewater | |
---|---|---|---|---|---|---|

0 | Eur & C.Asia | Albania | 1.2 | 72 | 810.0 | 76.0 |

1 | Eur & C.Asia | Armenia | 1.1 | 74 | 460.0 | NaN |

2 | Eur & C.Asia | Austria | 0.4 | 79 | 26830.0 | NaN |

A simple generate variable command and ols regression in Stata:

```
In [4]:
```

```
%%stata -o life_df
gen lngnppc = ln(gnppc)
regress lexp lngnppc
```

```
(5 missing values generated)
Source | SS df MS Number of obs = 63
-------------+---------------------------------- F(1, 61) = 97.09
Model | 873.264865 1 873.264865 Prob > F = 0.0000
Residual | 548.671643 61 8.99461709 R-squared = 0.6141
-------------+---------------------------------- Adj R-squared = 0.6078
Total | 1421.93651 62 22.9344598 Root MSE = 2.9991
------------------------------------------------------------------------------
lexp | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
lngnppc | 2.768349 .2809566 9.85 0.000 2.206542 3.330157
_cons | 49.41502 2.348494 21.04 0.000 44.71892 54.11113
------------------------------------------------------------------------------
```

And the same regression using statsmodels and pandas:

```
In [5]:
```

```
model = smf.ols(formula = 'lexp ~ lngnppc',
data = life_df)
results = model.fit()
print(results.summary())
```

```
OLS Regression Results
==============================================================================
Dep. Variable: lexp R-squared: 0.614
Model: OLS Adj. R-squared: 0.608
Method: Least Squares F-statistic: 97.09
Date: Tue, 08 May 2018 Prob (F-statistic): 3.13e-14
Time: 14:25:48 Log-Likelihood: -157.57
No. Observations: 63 AIC: 319.1
Df Residuals: 61 BIC: 323.4
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 49.4150 2.348 21.041 0.000 44.719 54.111
lngnppc 2.7683 0.281 9.853 0.000 2.207 3.330
==============================================================================
Omnibus: 21.505 Durbin-Watson: 1.518
Prob(Omnibus): 0.000 Jarque-Bera (JB): 46.140
Skew: -1.038 Prob(JB): 9.57e-11
Kurtosis: 6.643 Cond. No. 52.7
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
```

### Back to ipython¶

Let’s change one of the variables in the dataframe in python:

```
In [6]:
```

```
life_df.popgrowth = life_df.popgrowth * 100
```

```
In [7]:
```

```
life_df.popgrowth.mean()
```

```
Out[7]:
```

```
97.20587921142578
```

And now let’s push the modified dataframe into the Stata dataset with
the `-d`

flag:

```
In [8]:
```

```
%%stata -d life_df
summarize
```

```
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
index | 68 33.5 19.77372 0 67
region | 68 .5 .7431277 0 2
country | 0
popgrowth | 68 97.20588 93.11918 -50 300
lexp | 68 72.27941 4.715315 54 79
-------------+---------------------------------------------------------
gnppc | 63 8674.857 10634.68 370 39980
safewater | 40 76.1 17.89112 28 100
lngnppc | 63 8.250023 1.355677 5.913503 10.59613
```

A Stata plot:

```
In [9]:
```

```
%%stata -d life_df --graph
graph twoway (scatter lexp lngnppc) (lfit lexp lngnppc)
```

Now on the python side use lmplot from the seaborn library to graph a similar scatter and fitted line but by region.

```
In [10]:
```

```
sns.set_style("whitegrid")
g=sns.lmplot(y='lexp', x='lngnppc', col='region', data=life_df,col_wrap=2)
```