Performing multinomial logistic regression in R is straightforward, primarily leveraging the multinom() function from the nnet package. This method allows you to model an outcome variable with more than two nominal categories based on one or more predictor variables.
Getting Started: Installation and Loading the nnet Package
Before running a multinomial logistic regression model, you need to ensure the nnet package is installed and loaded into your R session. If you haven't installed it yet, use the install.packages() function.
# Install the nnet package if you haven't already
install.packages("nnet")
# Load the nnet package
library(nnet)
Building Your Multinomial Logistic Regression Model
The multinom() function provides a convenient way to fit multinomial logit models. Its syntax is quite similar to other regression functions in R, like lm() or glm(), making it intuitive for those familiar with R's modeling framework. However, it's crucial to remember that for multinomial outcomes, you use multinom() instead of glm().
A basic model can be specified by providing your categorical outcome variable (dependent variable) on the left side of the ~ operator and your independent variables (predictors) on the right.
Syntax:
model_name <- multinom(outcome_variable ~ predictor1 + predictor2 + ..., data = your_data_frame)
Key Considerations for multinom():
- Outcome Variable: Must be a factor with three or more levels. The function uses the first level of the factor as the reference category (for factors built from character data, this is the first level alphabetically); see the quick check below.
- Predictor Variables: Can be numeric, character, or factors; multinom() creates dummy variables for factor predictors internally.
- data Argument: Always specify the data frame containing your variables for clarity and to avoid issues with variable scope.
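As a quick sanity check before fitting, you can confirm the outcome's type and level order (a minimal sketch using the placeholder names from the syntax template above):
# Confirm the outcome is a factor and inspect its levels;
# the first level listed is the default reference category
is.factor(your_data_frame$outcome_variable)
levels(your_data_frame$outcome_variable)
# If it is not a factor yet, convert it:
# your_data_frame$outcome_variable <- factor(your_data_frame$outcome_variable)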
Example: Modeling Academic Program Choice
Let's consider a hypothetical scenario where we want to model a student's choice of academic program (e.g., General, Academic, Vocational) based on their socioeconomic status (SES) and scores on a standardized test.
First, let's create some dummy data for demonstration purposes:
# Create example data
set.seed(123) # for reproducibility
n <- 500
academic_data <- data.frame(
program = factor(sample(c("General", "Academic", "Vocational"), n, replace = TRUE, prob = c(0.4, 0.35, 0.25))),
ses = factor(sample(c("Low", "Middle", "High"), n, replace = TRUE, prob = c(0.3, 0.4, 0.3))),
test_score = round(rnorm(n, mean = 60, sd = 15))
)
# test_score is already numeric from round(rnorm()); this conversion is just a defensive check
academic_data$test_score <- as.numeric(academic_data$test_score)
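Before fitting, a quick optional look at the structure and the outcome distribution can help catch coding issues:
# Inspect variable types and the distribution of the outcome
str(academic_data)
table(academic_data$program)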
Now, we can run the multinomial logistic regression:
# Run the multinomial logistic regression model
model_program_choice <- multinom(program ~ ses + test_score, data = academic_data)
# View a summary of the model results
summary(model_program_choice)
Interpreting Model Output
The summary() output for a multinom object can be quite dense. It reports coefficients and standard errors for each predictor, for each non-reference category compared to the reference category. Note that, unlike summary() for glm objects, it does not print z-values or p-values, so these are usually computed by hand.
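A common follow-up is therefore to compute Wald z-statistics and two-tailed p-values from the summary components (a minimal sketch using the model fitted above):
# Wald z-statistics and two-tailed p-values from the summary components
model_summary <- summary(model_program_choice)
z <- model_summary$coefficients / model_summary$standard.errors
p <- 2 * (1 - pnorm(abs(z)))
round(p, 4)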
- Coefficients (Log-Odds): These represent the change in the log-odds of being in a specific outcome category (compared to the reference category) for a one-unit increase in the predictor, holding other predictors constant.
- Reference Category: By default, multinom() uses the first level of the outcome factor as the reference category. You can explicitly set the reference level with relevel() if needed, e.g., academic_data$program <- relevel(academic_data$program, ref = "General") (a refit sketch follows).
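For instance, to make "General" the baseline for the interpretation below, you could relevel the outcome and refit the earlier example model (a small optional sketch):
# Set "General" as the reference level and refit the model
academic_data$program <- relevel(academic_data$program, ref = "General")
model_program_choice <- multinom(program ~ ses + test_score, data = academic_data)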
Example Interpretation Snippet:
If "General" is the reference program, a coefficient for test_score
for the "Academic" program would indicate how a one-unit increase in test_score
affects the log-odds of choosing "Academic" versus "General".
For easier interpretation, you might want to calculate odds ratios. An odds ratio is obtained by exponentiating the coefficients (exp(coef)).
# Calculate odds ratios
exp(coef(model_program_choice))
This will give you the odds ratio for each predictor relative to the reference category for each non-reference outcome category. Note that coef() for a multinom fit returns a matrix with one row per non-reference outcome category and one column per predictor, so, for example, exp(coef(model_program_choice)["Academic", "test_score"]) would show how the odds of choosing "Academic" vs. "General" change for a one-unit increase in test_score.
Post-Estimation Analysis and Prediction
Once your model is built, you can use it for various post-estimation tasks, such as predicting probabilities for new data or assessing model fit.
Predicting Probabilities
To predict the probabilities of each outcome category for new observations, use the predict() function with type = "probs".
# Predict probabilities for the original data
predicted_probs <- predict(model_program_choice, newdata = academic_data, type = "probs")
# View the first few rows of predicted probabilities
head(predicted_probs)
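You can also score new observations; the values below are a made-up student used purely for illustration:
# Predict probabilities for one hypothetical new student
new_student <- data.frame(
  ses = factor("Middle", levels = levels(academic_data$ses)),
  test_score = 75
)
predict(model_program_choice, newdata = new_student, type = "probs")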
Predicting Classes
To predict the most likely outcome class for new observations, use predict() with type = "class".
# Predict the most likely program choice for the original data
predicted_classes <- predict(model_program_choice, newdata = academic_data, type = "class")
# Compare actual vs. predicted (first few)
head(data.frame(Actual = academic_data$program, Predicted = predicted_classes))
Model Evaluation
You can evaluate your model's performance using metrics such as accuracy, precision, recall, or F1-score, often derived from a confusion matrix.
# Create a confusion matrix
conf_matrix <- table(Actual = academic_data$program, Predicted = predicted_classes)
print(conf_matrix)
# Calculate accuracy
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
cat("Model Accuracy:", round(accuracy, 3), "\n")
Key Considerations for Robust Analysis
- Multicollinearity: Check for high correlations among your independent variables, as this can affect coefficient estimates (a quick check is sketched after this list).
- Assumptions: While multinomial logistic regression is less restrictive than OLS, it still assumes independence of observations and linearity in the log-odds for continuous predictors.
- Sample Size: Ensure an adequate sample size, especially if you have many predictors or outcome categories.
- Alternative Packages: While nnet::multinom() is widely used, other packages like mlogit offer more advanced features, such as handling choice-specific attributes or panel data.
- For more detailed control over model fitting and different types of discrete choice models, explore the mlogit package: mlogit package on CRAN.
- For general regression in R, the official documentation for glm from the stats package provides a good base: glm function on RDocumentation.
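As a quick check for the multicollinearity point above, one rough option is to inspect correlations among the model's design-matrix columns (a sketch on the example data; a more formal route would be variance inflation factors on an auxiliary linear model):
# Correlations among the model's numeric design columns
# (dummy-coded factors included), dropping the intercept column
X <- model.matrix(program ~ ses + test_score, data = academic_data)[, -1]
round(cor(X), 2)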
Summary of multinom() Usage
Aspect | Description
---|---
Function Call | multinom(formula, data, ...)
Package | nnet
Outcome | Categorical variable with 3+ levels (a factor in R); the first factor level is the default reference category.
Predictors | Can be numeric or factors; multinom() creates dummy variables internally for factor predictors.
Interpretation | Coefficients are on the log-odds scale, comparing each non-reference category to the reference category. Exponentiate them (exp(coef)) for odds ratios.
Prediction | Use predict(model, newdata, type = "probs") for probabilities or type = "class" for the most likely category.
Advantages | Easy to use, familiar syntax for R users, handles standard multinomial logistic regression efficiently.
In essence, performing multinomial logistic regression in R with the multinom() function from the nnet package is a straightforward process, providing a robust tool for analyzing categorical outcomes with multiple levels.