Predicting the Status of Marijuana Legalization in U.S. States
For the past 20 years, marijuana legalization has received a great deal of attention, and support for recreational use has risen steadily during that time: as of 2013, a majority of Americans support legalization. However, many states go through several stages before ultimately legalizing. The four stages are prohibition, decriminalization, medicalization, and legalization.
The Models
Linear Discriminant Analysis
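Imagine you plot points on a two-dimensional graph (two features). LDA generates a straight line that separates the groups of labels, placing it where the centers of the groups sit as far apart as possible relative to the spread within each group, so that as few cases as possible are misclassified. Think OLS regression, but with categories as the outcome.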
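To make the "straight line between two groups" idea concrete, here is a minimal sketch of two-class LDA on two features, written out by hand. The data points are made up for illustration and are not the state-level data used in this exercise. The sketch also assumes both groups are equally likely, so the line passes through the midpoint between the two group centers.

```python
# Two-class LDA on two features (toy data, equal priors assumed).

def mean(points):
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]

def pooled_covariance(groups, means):
    # Add up squared deviations inside each group, then divide by
    # (number of points - number of groups).
    sxx = sxy = syy = 0.0
    total = 0
    for pts, m in zip(groups, means):
        for x, y in pts:
            dx, dy = x - m[0], y - m[1]
            sxx += dx * dx
            sxy += dx * dy
            syy += dy * dy
        total += len(pts)
    d = total - len(groups)
    return sxx / d, sxy / d, syy / d

def fit_lda(class0, class1):
    m0, m1 = mean(class0), mean(class1)
    sxx, sxy, syy = pooled_covariance([class0, class1], [m0, m1])
    det = sxx * syy - sxy * sxy
    dx, dy = m1[0] - m0[0], m1[1] - m0[1]
    # w = S^-1 (m1 - m0), using the closed-form inverse of a 2x2 matrix.
    w = [(syy * dx - sxy * dy) / det, (-sxy * dx + sxx * dy) / det]
    # The dividing line w.x + b = 0 goes through the midpoint of the two means.
    mid = [(m0[0] + m1[0]) / 2, (m0[1] + m1[1]) / 2]
    b = -(w[0] * mid[0] + w[1] * mid[1])
    return w, b

def predict(w, b, point):
    return 1 if w[0] * point[0] + w[1] * point[1] + b > 0 else 0

# Made-up points for two groups of labels.
class0 = [(1, 2), (2, 1), (2, 3), (3, 2)]
class1 = [(6, 5), (7, 7), (8, 6), (7, 4)]

w, b = fit_lda(class0, class1)
print(predict(w, b, (2, 2)), predict(w, b, (7, 5)))
```

A point gets label 1 when it falls on the positive side of the line and label 0 otherwise.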
Logistic Regression
Using a two-dimensional graph, the logistic regression (logit) passes the straight line through a transformation (the logistic function, which turns the line into an S-shaped probability curve between 0 and 1). The curve is fit so that the predicted probabilities match the observed labels as closely as possible, and each case is assigned to a group depending on which side of a cutoff (usually 0.5) its probability falls.
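The transformation itself is short enough to write out. The sketch below shows how a straight line in two features becomes a probability and then a label. The coefficients `B0`, `B1`, and `B2` are hypothetical values chosen for illustration; in practice they would be estimated from the training data.

```python
import math

def logistic(z):
    """Squash any real number into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients for the line B0 + B1*x1 + B2*x2.
B0, B1, B2 = -4.0, 0.5, 1.0

def probability(x1, x2):
    # Evaluate the straight line, then transform it into a probability.
    return logistic(B0 + B1 * x1 + B2 * x2)

def predict(x1, x2, threshold=0.5):
    # Assign label 1 when the probability reaches the cutoff.
    return 1 if probability(x1, x2) >= threshold else 0

print(probability(2, 1), predict(2, 1))
print(probability(6, 3), predict(6, 3))
```

A case sitting exactly on the line (where the line evaluates to zero) gets a probability of 0.5, which is why 0.5 is the natural cutoff.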
Support Vector Machines
SVMs attempt to separate the groups defined by the labels using a line (or a transformation of that line) that maximizes the distance between the dividing line and the nearest cases in each group. SVMs have an advantage over LDA (and OLS-type regressions): when the data can't easily be separated by a linear function, SVMs employ a "kernel" that re-expresses the dataset as a function of itself (squaring a feature, for example), which makes it easier to come up with an equation for a dividing line between the groups of labels.
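Here is a small sketch of why re-expressing the data helps. It uses one made-up feature where label 1 sits in the middle and label 0 sits on both sides, so no single cut point separates them. Squaring the feature fixes that. Note that this writes the transformation out explicitly, whereas a real kernel achieves the same effect without ever building the new feature; the data and function names are illustrative assumptions only.

```python
# One feature, two labels: label 1 in the middle, label 0 on either side.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
labels = [0, 0, 1, 1, 1, 0, 0]

def separable_by_threshold(values, labels):
    """True if one cut point puts each label entirely on its own side.

    Assumes no two points with different labels share the same value.
    """
    ordered = [lab for _, lab in sorted(zip(values, labels))]
    changes = sum(1 for a, b in zip(ordered, ordered[1:]) if a != b)
    return changes == 1

# Re-express the data as a function of itself: x -> x squared.
squared = [x * x for x in xs]

print(separable_by_threshold(xs, labels))       # False
print(separable_by_threshold(squared, labels))  # True

# In one dimension, the widest-margin cut is halfway between the
# closest pair of opposite-label points.
inner_max = max(v for v, lab in zip(squared, labels) if lab == 1)  # 1.0
outer_min = min(v for v, lab in zip(squared, labels) if lab == 0)  # 4.0
cut = (inner_max + outer_min) / 2     # 2.5
margin = (outer_min - inner_max) / 2  # 1.5
print(cut, margin)
```

The cut at 2.5 on the squared feature corresponds to two cut points on the original feature, which is something a single straight line could not express.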
K Nearest Neighbors
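We can think of KNN as a sort of correlation- or network-based technique: the algorithm reads in all the features, plots them in n-dimensional space, and identifies the k cases closest to the one being classified, then assigns it the label most common among those neighbors.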
Decision Tree
Think: flowchart. The algorithm creates a dividing line (slicing the data on any feature, at any value) that puts the largest possible percentage of cases under one label, then creates another dividing line within the remaining cases on the same basis, and so on. Because it can keep slicing until every training case is accounted for, this method is prone to overfitting.
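A single slice of the flowchart can be sketched as a search for the cut point that leaves the two sides as "pure" as possible. This sketch measures purity with Gini impurity, which is one common choice and an assumption here, not necessarily what was used in this exercise. The feature values are made up.

```python
def gini(labels):
    """Gini impurity for 0/1 labels: 0 when every case shares one label."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n  # share of cases with label 1
    return 2 * p * (1 - p)

def best_split(values, labels):
    """Try a cut between each pair of adjacent distinct values; keep the purest."""
    pairs = sorted(zip(values, labels))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no room for a cut between identical values
        threshold = (pairs[i][0] + pairs[i - 1][0]) / 2
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        # Weight each side's impurity by how many cases it holds.
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical feature (say, a percentage) and 0/1 labels.
values = [10, 20, 30, 40, 50, 60]
labels = [0, 0, 0, 1, 0, 1]

threshold, score = best_split(values, labels)
print(threshold, score)
```

A full tree repeats this search inside each side of the cut. Left unchecked, it keeps going until every leaf holds a single label, which is where the overfitting comes from.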
The Exercise
To predict whether a state will legalize marijuana for recreational use in the future, we need to study data from the past. Here, I use cross-sectional state-level data (aggregated over time up until 2016). I randomly select 80 percent of the data as the training set, from which the algorithms learn, and hold out the remaining 20 percent as the test set, on which each algorithm's predictions are checked and its accuracy calculated. I then apply the most accurate algorithm to hypothetical cases. I use the following label and features, based on state-level factors and prior research:
Label:
Features:
Name | Accuracy | Standard Deviation |
---|---|---|
Linear Discriminant Analysis | 0.641026 | 0.191880 |
Logistic Regression | 0.564103 | 0.158062 |
Support Vector Machines | 0.641026 | 0.072524 |
K Nearest Neighbors | 0.769231 | 0.062807 |
Decision Tree | 0.641026 | 0.191880 |
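The 80/20 split and the accuracy score described in the exercise can be sketched as follows, using K Nearest Neighbors as the classifier. This is a bare-bones illustration under stated assumptions: the data are two synthetic clusters, not the state-level data behind the table, and the helper names, `k=3`, and the random seeds are all choices made for this example. It will not reproduce the numbers above.

```python
import math
import random
from collections import Counter

def train_test_split(rows, test_share=0.2, seed=0):
    """Shuffle a copy of the rows and hold out a share of them for testing."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    n_test = round(len(rows) * test_share)
    return rows[n_test:], rows[:n_test]

def knn_predict(train, features, k=3):
    """Label a case by majority vote among its k closest training cases."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], features))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def accuracy(train, test, k=3):
    """Share of held-out cases whose predicted label matches the true one."""
    correct = sum(
        1 for features, label in test
        if knn_predict(train, features, k) == label
    )
    return correct / len(test)

# Synthetic data: 25 cases around (0, 0) with label 0,
# and 25 cases around (5, 5) with label 1.
rng = random.Random(42)
rows = []
for _ in range(25):
    rows.append(((rng.gauss(0, 1), rng.gauss(0, 1)), 0))
    rows.append(((rng.gauss(5, 1), rng.gauss(5, 1)), 1))

train, test = train_test_split(rows)
score = accuracy(train, test)
print(len(train), len(test), score)
```

Because the test cases are never seen during training, the accuracy on them is a fairer estimate of how the algorithm would do on new cases than accuracy on the training set would be.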