Pandas: How to Select Rows Based on Column Values

If you have ever worked with datasets, you know that you often need to extract specific information from the data as a whole. If you use Python and handle data with Pandas, filtering is quite easy: Pandas offers many different ways to select rows from a DataFrame based on column values.

In this tutorial, we will explore various techniques for selecting rows from a DataFrame. By the end, you will be able to filter and extract specific rows of interest from your data, making it easier to analyze and draw insights.

Method 1: Boolean Indexing

The most straightforward method to select rows based on column values is boolean indexing. To create a boolean mask, you can use comparison operators (>, <, >=, <=, ==, !=) or other boolean operators (| for OR, & for AND, ~ for NOT). You can then use this mask to select the corresponding rows from the DataFrame by passing it inside square brackets.

You can also combine multiple conditions using parentheses and boolean operators, as shown below where we select rows for two cities. This lets you build more complex boolean masks that select rows based on multiple conditions.

import pandas as pd

# create a sample DataFrame
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David'],
                   'Age': [25, 30, 35, 40], 
                   'City': ['New York', 'Paris', 'London', 'Sydney']})

# select rows where Age is greater than 30
df_gt_30 = df[df['Age'] > 30]

# select rows where City is 'Paris' or 'Berlin'
df_paris_berlin = df[(df['City'] == 'Paris') | (df['City'] == 'Berlin')]

The sample DataFrame above has three columns (Name, Age, and City) and four rows. The resulting DataFrame after boolean indexing will contain only the selected rows but all columns. If you also want to choose specific columns, you can use loc and pass a list of column names as the second argument. Consider the following example:

df_gt_30_names = df.loc[df['Age'] > 30, ['Name', 'City']]

This will select only the Name and City columns for the rows where Age is greater than 30.
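The NOT operator mentioned earlier can be combined with AND in the same way. Here is a small sketch on the same sample DataFrame (the cutoff of 35 and the excluded city are arbitrary choices for illustration):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David'],
                   'Age': [25, 30, 35, 40],
                   'City': ['New York', 'Paris', 'London', 'Sydney']})

# & combines the two conditions; ~ negates the City check
mask = (df['Age'] <= 35) & ~(df['City'] == 'Paris')
df_subset = df[mask]
print(df_subset)  # Alice and Charlie
```

Remember that each condition must be wrapped in its own parentheses, because & and ~ bind more tightly than the comparison operators.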

Boolean indexing can also be used with the loc method. In Pandas, the loc method is used to select data from a DataFrame based on the labels of rows and columns. Boolean indexing with the loc method allows you to select rows based on a boolean condition.

Here’s an example:

import pandas as pd

# create a DataFrame
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David'],
                   'Age': [25, 30, 35, 40], 
                   'City': ['New York', 'Paris', 'London', 'Sydney']})

# boolean indexing with loc
condition = df['Age'] > 30
result = df.loc[condition]
print(result)

In the above code, the condition variable holds a boolean Series, which is True for rows where the Age is greater than 30 and False for rows where the Age is less than or equal to 30. The loc method then selects the rows where the condition is True.

The result of this code will be a new DataFrame that contains only the rows where the Age is greater than 30:

      Name  Age    City
2  Charlie   35  London
3    David   40  Sydney

Method 2: Using isin for a list of values

If you have a list of values and you want to select the DataFrame rows whose column value matches any value in that list, you can use the isin method.

Suppose you have a DataFrame df with columns “Name” and “Age”:

import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'Dave'], 
                   'Age': [25, 30, 35, 40]})

Now you want to select only the rows where the Name column is either Alice or Charlie. The following code does the job:

selected_rows = df[df['Name'].isin(['Alice', 'Charlie'])]

This creates a new DataFrame, selected_rows, containing only the rows where the Name column is either Alice or Charlie.

Note that the isin method takes a list of values to match against and returns a boolean Series that can be used to select the desired rows from the DataFrame using the indexing operator [].
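Because isin returns a boolean Series, you can also negate it with ~ to select the rows that are not in the list. A quick sketch using the same DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'Dave'],
                   'Age': [25, 30, 35, 40]})

# ~ inverts the boolean Series, keeping rows whose Name is NOT in the list
excluded = df[~df['Name'].isin(['Alice', 'Charlie'])]
print(excluded)  # Bob and Dave
```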

Method 3: Where Method

Another way to filter a DataFrame in Pandas is to use the where method. The where method returns a new DataFrame that is a copy of the original, but the values in the rows that do not meet the specified condition are set to NaN. This means that the resulting DataFrame will have the same shape as the original DataFrame, but some of its values will be missing.

Consider the following code:

import pandas as pd 

# create a DataFrame 
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David'], 
        'Age': [25, 30, 35, 40], 
        'City': ['New York', 'Paris', 'London', 'Sydney']}) 

# Use the where method to select rows where Age > 30
df_filtered = df.where(df['Age'] > 30) 
print(df_filtered)

Output:

      Name   Age    City
0      NaN   NaN     NaN
1      NaN   NaN     NaN
2  Charlie  35.0  London
3    David  40.0  Sydney

As you can see, it has returned a new DataFrame where the rows that do not meet the condition have been set to NaN. Note that the Age column has become a float column, because integer columns cannot hold NaN without being converted.

To remove the rows with missing values, you can use the dropna method:

df_filtered = df.where(df['Age'] > 30).dropna() 
print(df_filtered)

Output:

      Name   Age    City
2  Charlie  35.0  London
3    David  40.0  Sydney
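The where method also accepts an other argument that replaces the values in non-matching rows instead of leaving NaN. A small sketch (the '-' placeholder is an arbitrary choice for illustration):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David'],
                   'Age': [25, 30, 35, 40],
                   'City': ['New York', 'Paris', 'London', 'Sydney']})

# values in rows where the condition is False become '-' instead of NaN
df_marked = df.where(df['Age'] > 30, other='-')
print(df_marked)
```

This keeps the original shape and avoids the float conversion caused by NaN, at the cost of mixing types in the columns.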

Method 4: Query Method

If you use SQL often, you might want to use a query to filter a DataFrame. The query method will save you. This method provides a way to filter and select rows from a DataFrame using a string expression. It allows you to write a query-like syntax to filter rows based on certain conditions, making the code more concise and readable.

The syntax of the query() method is as follows:

DataFrame.query(expr, inplace=False, **kwargs)

where:

  • expr: A string expression to filter rows based on certain conditions.
  • inplace: A boolean value indicating whether to modify the original DataFrame or return a new one. If set to True, the original DataFrame is modified in place.
  • **kwargs: Additional parameters to be passed to the query function.

Here is an example:

import pandas as pd

# create a sample dataframe
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [25, 30, 35, 40],
    'gender': ['F', 'M', 'M', 'M']
})

# select rows where age is greater than 30 and gender is male using query method
selected_rows = df.query('age > 30 and gender == "M"')

print(selected_rows)

Output:

      name  age gender
2  Charlie   35      M
3    David   40      M

Note that the query expression is written as a string enclosed in quotes, and the method returns a new DataFrame with the selected rows. This approach can make the code more readable and maintainable, especially for complex conditions, by separating the filtering logic from the rest of the code.
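The query method can also reference local Python variables by prefixing them with @, which is handy when the threshold is computed elsewhere in your code. A brief sketch (min_age is a made-up variable for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [25, 30, 35, 40],
    'gender': ['F', 'M', 'M', 'M']
})

min_age = 30  # local variable, referenced inside the expression with @
selected = df.query('age > @min_age')
print(selected)  # Charlie and David
```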

Time Comparison

Now let’s compare the running times of the methods we have discussed. If you are looking for the most time-efficient method, this section will help you.

Let’s first generate a sample DataFrame with 10 million rows and two columns:

import pandas as pd 
import numpy as np 
df = pd.DataFrame(np.random.randint(0, 10, size=(10000000, 2)), columns=['A', 'B'])

We will compare the following methods for selecting rows based on column conditions:

  1. Using boolean indexing with the loc method
  2. Using the query method
  3. Using the where method

Following is a time comparison for selecting rows where the value of column A is greater than 5:

import timeit 
# Boolean indexing with loc 
start_time = timeit.default_timer() 
df.loc[df['A'] > 5] 
print("Time taken by boolean indexing with loc: {:.5f} seconds".format(timeit.default_timer() - start_time)) 
# Query method 
start_time = timeit.default_timer() 
df.query('A > 5') 
print("Time taken by query method: {:.5f} seconds".format(timeit.default_timer() - start_time)) 
# Where method 
start_time = timeit.default_timer() 
df.where(df['A'] > 5).dropna() 
print("Time taken by where method: {:.5f} seconds".format(timeit.default_timer() - start_time))

Output:

Time taken by boolean indexing with loc: 0.16906 seconds
Time taken by query method: 0.21511 seconds
Time taken by where method: 0.69927 seconds

From this, we can see that boolean indexing with loc is the fastest method, followed by the query method and then the where method. These timings are rough estimates; the exact numbers will vary depending on the size and complexity of the DataFrame and the specific conditions used for selection.
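Since a single run can be noisy, a more reliable comparison averages several repetitions with timeit.timeit. A sketch of that approach (using a smaller DataFrame than above so it finishes quickly; the exact numbers will differ on your machine):

```python
import timeit

import numpy as np
import pandas as pd

# smaller frame than above so the benchmark finishes quickly
df = pd.DataFrame(np.random.randint(0, 10, size=(100000, 2)), columns=['A', 'B'])

# average each method over several runs to smooth out timing noise
n = 10
t_loc = timeit.timeit(lambda: df.loc[df['A'] > 5], number=n) / n
t_query = timeit.timeit(lambda: df.query('A > 5'), number=n) / n
t_where = timeit.timeit(lambda: df.where(df['A'] > 5).dropna(), number=n) / n
print(f"loc: {t_loc:.5f}s  query: {t_query:.5f}s  where: {t_where:.5f}s")
```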

Conclusion

That concludes this tutorial. Do practice these methods on your own; that is how you will master them and learn to apply them in different scenarios. The ability to select rows based on column values is an essential skill for any data analyst or data scientist working with Pandas, and it can significantly improve the efficiency and accuracy of your data analysis workflow. If you liked this tutorial, consider checking out our Python tutorials page, where we regularly post content for beginners and advanced developers. You’re sure to find something interesting there.