zaro

How to Filter Data in Pandas in Python?

Published in Pandas Data Filtering 4 mins read

Filtering data in pandas DataFrames is a fundamental operation for selecting rows or columns based on specific criteria. Pandas provides several powerful and flexible methods to achieve this, catering to different filtering needs.

Here are the primary ways to filter data in pandas, based on common techniques:

1. Filtering by Logical Operators (Boolean Indexing)

One of the most common ways to filter is by creating boolean masks using logical operators. This involves checking conditions on one or more columns and using the resulting boolean Series or DataFrame to select rows.

  • You apply a condition directly to a column (e.g., df['column_name'] > 10).
  • This condition returns a True or False value for each row.
  • Passing this boolean Series to the DataFrame selects only the rows where the value is True.
  • Combine multiple conditions using logical operators: & (AND), | (OR), ~ (NOT).
    • Conditions must be enclosed in parentheses.

Example:

# Assuming 'df' is your pandas DataFrame
# Filter rows where the 'Age' column is greater than 30
filtered_df = df[df['Age'] > 30]

# Filter rows where 'City' is 'New York' AND 'Salary' is less than 60000
filtered_df = df[(df['City'] == 'New York') & (df['Salary'] < 60000)]

This method leverages the underlying data values (df.values conceptually, though you interact via column names like df.column or df['column']) to create the filtering condition.

2. Filtering by a List of Values (isin())

The isin() method is ideal for selecting rows where a column's value is present in a specified list of values.

  • It checks if each element in a Series is contained in a sequence of values.
  • It returns a boolean Series that can be used for indexing.

Example:

# Filter rows where the 'Status' column is either 'Pending' or 'Processing'
valid_statuses = ['Pending', 'Processing']
filtered_df = df[df['Status'].isin(valid_statuses)]

3. Filtering Based on String Patterns (str Accessor)

For columns containing strings, pandas provides a special .str accessor that allows you to apply string methods for filtering.

  • Common methods include startswith(), endswith(), contains(), match(), etc.
  • These methods return a boolean Series based on the string operation result.

Example:

# Filter rows where the 'Email' column ends with '@example.com'
filtered_df = df[df['Email'].str.endswith('@example.com')]

# Filter rows where the 'Description' column contains the word 'urgent' (case-insensitive)
filtered_df = df[df['Description'].str.contains('urgent', case=False, na=False)] # na=False handles missing values

4. Filtering Based on a Query Expression (query())

The query() method allows you to filter a DataFrame using a string expression, which can often make filtering code more readable, especially with complex conditions.

  • The expression is evaluated within the DataFrame's context.
  • Column names can be used directly within the string.

Example:

# Filter rows using a query string
filtered_df = df.query('Age > 30 and City == "New York"')

# Using a variable in the query
min_salary = 50000
filtered_df = df.query('Salary >= @min_salary') # Prefix variable with '@'

5. Filtering by Largest or Smallest Values (nlargest(), nsmallest())

The nlargest() and nsmallest() methods are convenient for selecting the top N or bottom N rows based on values in a specified column.

  • They return a new DataFrame containing the n rows with the largest or smallest values in the given column(s).

Example:

# Get the 5 rows with the highest 'Score'
top_performers = df.nlargest(5, 'Score')

# Get the 3 rows with the lowest 'Cost'
lowest_cost_items = df.nsmallest(3, 'Cost')

6. Filtering by Label or Index (loc[], iloc[])

While primarily used for selecting data by labels (index/column names) or integer positions, loc[] and iloc[] can also be used for filtering based on index values or in conjunction with boolean arrays.

  • loc[] filters by label. You can pass a boolean Series (like those created in method 1) directly to loc[].
  • iloc[] filters by integer position. It requires integer-based indexing or a boolean array that aligns with the DataFrame's integer positions.

Example:

# Using loc with a boolean condition (same as method 1, but explicit with loc)
filtered_df = df.loc[df['Category'] == 'Electronics']

# Selecting a specific row by index label
single_row = df.loc['Row_Label_5']

# Selecting rows by integer position (not strictly filtering based on value, but index)
first_three_rows = df.iloc[0:3]

Each of these methods offers a different approach to slicing and dicing your data, allowing you to efficiently isolate the specific information you need within your pandas DataFrames. Choosing the right method depends on the complexity of your condition and the nature of the data you are filtering.