5 🐼 Welcome to the Grand Library of DataFrames!

Step into a world where data transforms into organized knowledge. This is the Grand Library of DataFrames, a place where we’ll learn to manage, explore, and understand our data using the powerful magic of Pandas in Python.

Think of a DataFrame as a grand catalog within this library, filled with structured information. Let’s begin our magical journey by opening the gates to this library.

5.0.1 Creating Your First Catalog (DataFrame)

Every library needs a catalog to keep track of its treasures. In our Grand Library, we create catalogs (DataFrames) from various sources, like enchanted scrolls (dictionaries), lists of artifacts, or even ancient texts (CSV files).

Let’s create a simple catalog of magical artifacts:
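
As a minimal sketch of the "lists of artifacts" route mentioned above, a catalog can also be built from a list of records, one dictionary per artifact. The variable names `artifact_records` and `records_df` are illustrative only; the data mirrors the artifacts used later in this chapter.

```python
import pandas as pd

# Each dictionary is one row; its keys become the column names.
artifact_records = [
    {'Artifact Name': 'Phoenix Feather', 'Power Level': 10, 'Location': 'Forbidden Forest'},
    {'Artifact Name': 'Dragon Scale', 'Power Level': 9, 'Location': 'Dragon Mountains'},
]
records_df = pd.DataFrame(artifact_records)

print(records_df.shape)           # (number of rows, number of columns)
print(list(records_df.columns))   # column names taken from the dictionary keys
```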

5.0.2 Exploring Your Catalog

Now that we have our magical catalog, let’s learn how to browse through its entries.

Peeking at the first or last entries:

You can quickly peek at the first few artifacts using the .head() spell or the last few with the .tail() spell.

Understanding Your Catalog’s Secrets:

To truly understand the nature of your catalog, you can use special incantations.

The .info() spell reveals a concise summary of the catalog, including the type of magic (data type) in each column and how many entries are not empty. The .describe() spell conjures up descriptive statistics of the numerical aspects of your artifacts, like the average power level.

Focusing on Specific Columns:

Sometimes you only need to focus on specific types of information in your catalog, like just the ‘Artifact Name’ or ‘Power Level’. You can select a single column by calling its name in square brackets [] or multiple columns by listing their names.

5.1 Introduction to Pandas DataFrames

Pandas is a powerful open-source library for data analysis and manipulation in Python. Its core data structure is the DataFrame, which is a two-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or a SQL table.

Let’s start by importing the pandas library.

import pandas as pd

print("The gates to the Grand Library of DataFrames are now open!")

The gates to the Grand Library of DataFrames are now open!

5.1.1 Creating a DataFrame

You can create a DataFrame from various data sources, such as dictionaries, lists, or CSV files. Here’s an example of creating a DataFrame from a dictionary:

data = {'Artifact Name': ['Phoenix Feather', 'Dragon Scale', 'Unicorn Horn', 'Griffin Claw'],
        'Power Level': [10, 9, 8, 7],
        'Location': ['Forbidden Forest', 'Dragon Mountains', 'Mystical Meadow', 'Sky Peaks']}
magic_artifacts_df = pd.DataFrame(data)
print("Behold! Your first magical catalog (DataFrame):")
display(magic_artifacts_df)

Behold! Your first magical catalog (DataFrame):

	Artifact Name	Power Level	Location
0	Phoenix Feather	10	Forbidden Forest
1	Dragon Scale	9	Dragon Mountains
2	Unicorn Horn	8	Mystical Meadow
3	Griffin Claw	7	Sky Peaks

5.1.2 Basic DataFrame Operations

Once you have a DataFrame, you can perform various operations on it.

Viewing data:

You can view the first few rows using .head() and the last few rows using .tail().

print("Peeking at the first 2 artifacts:")
display(magic_artifacts_df.head(2))

print("\nLooking at the last artifact:")
display(magic_artifacts_df.tail(1))

Peeking at the first 2 artifacts:

	Artifact Name	Power Level	Location
0	Phoenix Feather	10	Forbidden Forest
1	Dragon Scale	9	Dragon Mountains


Looking at the last artifact:

	Artifact Name	Power Level	Location
3	Griffin Claw	7	Sky Peaks

Getting information about the DataFrame:

.info() provides a concise summary of the DataFrame, including the data types of each column and the number of non-null values. .describe() generates descriptive statistics of the numerical columns.

print("Unveiling the catalog's information:")
magic_artifacts_df.info()  # info() prints its summary directly and returns None, so it needs no display()

print("\nDescribing the magical properties (numerical columns):")
display(magic_artifacts_df.describe())

Unveiling the catalog's information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Artifact Name  4 non-null      object
 1   Power Level    4 non-null      int64 
 2   Location       4 non-null      object
dtypes: int64(1), object(2)
memory usage: 228.0+ bytes


Describing the magical properties (numerical columns):

	Power Level
count	4.000000
mean	8.500000
std	1.290994
min	7.000000
25%	7.750000
50%	8.500000
75%	9.250000
max	10.000000
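
To see where these numbers come from, the sketch below recomputes a few of them by hand for the four power levels. It assumes only the values shown in the table above; the variable names are illustrative. Note that the std row is the sample standard deviation, which divides by n - 1 rather than n.

```python
import pandas as pd

power = pd.Series([10, 9, 8, 7], name='Power Level')

mean = power.sum() / len(power)
# Sample variance: squared deviations from the mean, divided by n - 1.
sample_var = ((power - mean) ** 2).sum() / (len(power) - 1)
sample_std = sample_var ** 0.5

print(mean)                   # matches the "mean" row
print(round(sample_std, 6))   # matches the "std" row
print(power.quantile(0.25))   # matches the "25%" row
```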

Selecting columns:

You can select a single column using square brackets [] or multiple columns using a list of column names.

print("\nFocusing on just the names of the artifacts:")
display(magic_artifacts_df['Artifact Name'])

print("\nExamining the power levels and locations:")
display(magic_artifacts_df[['Power Level', 'Location']])


Focusing on just the names of the artifacts:

	Artifact Name
0	Phoenix Feather
1	Dragon Scale
2	Unicorn Horn
3	Griffin Claw

dtype: object


Examining the power levels and locations:

	Power Level	Location
0	10	Forbidden Forest
1	9	Dragon Mountains
2	8	Mystical Meadow
3	7	Sky Peaks
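
The two forms of column selection return different kinds of objects, which matters when you chain further operations. A minimal sketch, using a shortened two-row version of the catalog (the variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Artifact Name': ['Phoenix Feather', 'Dragon Scale'],
                   'Power Level': [10, 9]})

names = df['Artifact Name']          # a single label returns a Series
names_frame = df[['Artifact Name']]  # a list of labels returns a DataFrame, even with one name

print(type(names).__name__)
print(type(names_frame).__name__)
```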

Selecting rows:

You can select rows by their position using .iloc[] (integer-based) or by their index label using .loc[] (label-based).

print("\nRetrieving the first artifact in the catalog using iloc (by position):")
display(magic_artifacts_df.iloc[0])


Retrieving the first artifact in the catalog using iloc (by position):

	0
Artifact Name	Phoenix Feather
Power Level	10
Location	Forbidden Forest

dtype: object

print("\nGetting the second and third artifacts by their position using iloc:")
display(magic_artifacts_df.iloc[1:3])


Getting the second and third artifacts by their position using iloc:

	Artifact Name	Power Level	Location
1	Dragon Scale	9	Dragon Mountains
2	Unicorn Horn	8	Mystical Meadow
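
One difference between the two accessors is easy to miss: a slice with .iloc[] excludes its end position, while a slice with .loc[] includes its end label. The sketch below illustrates this on a catalog that still has the default numerical index; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Artifact Name': ['Phoenix Feather', 'Dragon Scale', 'Unicorn Horn', 'Griffin Claw'],
                   'Power Level': [10, 9, 8, 7]})

by_position = df.iloc[1:3]  # positions 1 and 2; the end position is excluded
by_label = df.loc[1:3]      # labels 1, 2 and 3; the end label is included

print(len(by_position))
print(len(by_label))
```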

# .loc selects by label, so first make 'Artifact Name' the index of a new DataFrame
magic_artifacts_indexed_df = magic_artifacts_df.set_index('Artifact Name')

# Examples using .loc with the indexed DataFrame
print("\nRetrieving the 'Dragon Scale' artifact using loc (by magical name):")
display(magic_artifacts_indexed_df.loc['Dragon Scale'])

print("\nRetrieving multiple artifacts using loc:")
display(magic_artifacts_indexed_df.loc[['Phoenix Feather', 'Unicorn Horn']])

print("\nRetrieving 'Power Level' and 'Location' for 'Dragon Scale' using loc:")
display(magic_artifacts_indexed_df.loc['Dragon Scale', ['Power Level', 'Location']])

print("\nRetrieving 'Power Level' for multiple artifacts using loc:")
display(magic_artifacts_indexed_df.loc[['Phoenix Feather', 'Unicorn Horn'], 'Power Level'])

print("\nRetrieving all columns for artifacts from 'Dragon Scale' to 'Griffin Claw' using loc:")
display(magic_artifacts_indexed_df.loc['Dragon Scale':'Griffin Claw', :])

# Selecting rows and columns together: loc takes labels, iloc takes integer positions
print("\nRetrieving 'Power Level' for 'Dragon Scale' using loc (row and column label):")
display(magic_artifacts_indexed_df.loc['Dragon Scale', 'Power Level'])

print("\nRetrieving 'Power Level' for 'Dragon Scale' using iloc (row and column index):")
display(magic_artifacts_indexed_df.iloc[1, 0]) # Dragon Scale is at index 1, Power Level is at index 0

print("\nRetrieving 'Power Level' for 'Dragon Scale' and 'Unicorn Horn' using loc (row labels and column label):")
display(magic_artifacts_indexed_df.loc[['Dragon Scale', 'Unicorn Horn'], 'Power Level'])

print("\nRetrieving 'Power Level' for 'Dragon Scale' and 'Unicorn Horn' using iloc (row indices and column index):")
display(magic_artifacts_indexed_df.iloc[[1, 2], 0]) # Dragon Scale at 1, Unicorn Horn at 2, Power Level at 0

print("\nRetrieving 'Power Level' and 'Location' for 'Dragon Scale' and 'Unicorn Horn' using loc (row labels and column labels):")
display(magic_artifacts_indexed_df.loc[['Dragon Scale', 'Unicorn Horn'], ['Power Level', 'Location']])

print("\nRetrieving 'Power Level' and 'Location' for 'Dragon Scale' and 'Unicorn Horn' using iloc (row indices and column indices):")
display(magic_artifacts_indexed_df.iloc[[1, 2], [0, 1]]) # Dragon Scale at 1, Unicorn Horn at 2, Power Level at 0, Location at 1

5.1.4 Setting a Magical Identifier (Index)

By default, your catalog uses a simple numerical order as its identifier. However, you can set one of the columns as a unique magical identifier (index) for easier retrieval of artifacts. We can use the set_index() spell for this.

print("\nSetting 'Artifact Name' as the magical identifier:")
magic_artifacts_indexed_df = magic_artifacts_df.set_index('Artifact Name')
display(magic_artifacts_indexed_df)


Setting 'Artifact Name' as the magical identifier:

	Power Level	Location
Artifact Name
Phoenix Feather	10	Forbidden Forest
Dragon Scale	9	Dragon Mountains
Unicorn Horn	8	Mystical Meadow
Griffin Claw	7	Sky Peaks
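
It may help to know that set_index() returns a new DataFrame and leaves the original untouched, and that reset_index() reverses the operation. A minimal sketch on a shortened two-row catalog (the variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Artifact Name': ['Phoenix Feather', 'Dragon Scale'],
                   'Power Level': [10, 9]})

indexed = df.set_index('Artifact Name')  # new DataFrame; df itself is unchanged
restored = indexed.reset_index()         # moves the index back into an ordinary column

print(list(df.columns))        # still contains 'Artifact Name'
print(list(indexed.columns))   # 'Artifact Name' is now the index, not a column
print(list(restored.columns))  # same columns as the original
```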

5.1.5 Adding a New Artifact to Your Catalog

To add a new artifact to your catalog, you can create a new DataFrame for the artifact and then use the pd.concat() spell to merge it with your existing magic_artifacts_df.

new_artifact_data = {'Artifact Name': ["Goblin's Gold Coin"],
                     'Power Level': [6],
                     'Location': ["Goblin's Lair"]}
new_artifact_df = pd.DataFrame(new_artifact_data)

print("\nOur new artifact:")
display(new_artifact_df)

# Concatenate the new artifact to the existing DataFrame
magic_artifacts_df = pd.concat([magic_artifacts_df, new_artifact_df], ignore_index=True)

print("\nCatalog with the new artifact added:")
display(magic_artifacts_df)


Our new artifact:

	Artifact Name	Power Level	Location
0	Goblin's Gold Coin	6	Goblin's Lair


Catalog with the new artifact added:

	Artifact Name	Power Level	Location
0	Phoenix Feather	10	Forbidden Forest
1	Dragon Scale	9	Dragon Mountains
2	Unicorn Horn	8	Mystical Meadow
3	Griffin Claw	7	Sky Peaks
4	Goblin's Gold Coin	6	Goblin's Lair
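
The ignore_index=True argument is what gives the new artifact the label 4. The sketch below, on a shortened catalog with illustrative variable names, shows what happens without it: each piece keeps its own labels, so the combined catalog ends up with a repeated label.

```python
import pandas as pd

catalog = pd.DataFrame({'Artifact Name': ['Phoenix Feather', 'Dragon Scale'],
                        'Power Level': [10, 9]})
new_row = pd.DataFrame({'Artifact Name': ["Goblin's Gold Coin"],
                        'Power Level': [6]})

kept = pd.concat([catalog, new_row])                           # original labels are kept
renumbered = pd.concat([catalog, new_row], ignore_index=True)  # labels are rebuilt from 0

print(list(kept.index))        # the label 0 appears twice
print(list(renumbered.index))  # a clean 0, 1, 2
```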

5.1.6 Filtering Artifacts with Logical Comparisons

Just as you can compare numbers or text, you can also use logical comparisons to filter your DataFrame and find artifacts that meet specific criteria. This is like casting a spell to reveal only the artifacts you are interested in!

We can use operators like >, <, ==, >=, <=, and != to create conditions based on the values in our columns.

print("\nFinding artifacts with a Power Level greater than 8:")
powerful_artifacts = magic_artifacts_df[magic_artifacts_df['Power Level'] > 8]
display(powerful_artifacts)

print("\nFinding artifacts located in the 'Forbidden Forest':")
forbidden_forest_artifacts = magic_artifacts_df[magic_artifacts_df['Location'] == 'Forbidden Forest']
display(forbidden_forest_artifacts)

print("\nFinding artifacts with a Power Level less than or equal to 7:")
lesser_artifacts = magic_artifacts_df[magic_artifacts_df['Power Level'] <= 7]
display(lesser_artifacts)


Finding artifacts with a Power Level greater than 8:

	Artifact Name	Power Level	Location
0	Phoenix Feather	10	Forbidden Forest
1	Dragon Scale	9	Dragon Mountains


Finding artifacts located in the 'Forbidden Forest':

	Artifact Name	Power Level	Location
0	Phoenix Feather	10	Forbidden Forest


Finding artifacts with a Power Level less than or equal to 7:

	Artifact Name	Power Level	Location
3	Griffin Claw	7	Sky Peaks
4	Goblin's Gold Coin	6	Goblin's Lair
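
Conditions can also be combined. The following is a minimal sketch, not part of the original catalog code, using & (and), | (or), and ~ (not). Each comparison needs its own parentheses because & and | bind more tightly than the comparison operators in Python. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Artifact Name': ['Phoenix Feather', 'Dragon Scale', 'Unicorn Horn',
                      'Griffin Claw', "Goblin's Gold Coin"],
    'Power Level': [10, 9, 8, 7, 6],
    'Location': ['Forbidden Forest', 'Dragon Mountains', 'Mystical Meadow',
                 'Sky Peaks', "Goblin's Lair"],
})

# Both conditions must hold: Power Level between 7 and 9 inclusive.
mid_power = df[(df['Power Level'] >= 7) & (df['Power Level'] <= 9)]

# Either condition may hold: in the Forbidden Forest, or weaker than 7.
forest_or_weak = df[(df['Location'] == 'Forbidden Forest') | (df['Power Level'] < 7)]

# Negation: everything that is not in the Forbidden Forest.
not_forest = df[~(df['Location'] == 'Forbidden Forest')]

print(list(mid_power['Artifact Name']))
print(list(forest_or_weak['Artifact Name']))
print(len(not_forest))
```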

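5.1.7 Setting a Magical Identifier (Index)

The catalog has gained a new artifact since we first set the index, so we rebuild the indexed catalog to include it.
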
print("\nSetting 'Artifact Name' as the magical identifier:")
magic_artifacts_indexed_df = magic_artifacts_df.set_index('Artifact Name')
display(magic_artifacts_indexed_df)

Now you can retrieve artifacts directly using their magical name:

print("\nRetrieving the 'Dragon Scale' artifact using its magical name:")
display(magic_artifacts_indexed_df.loc['Dragon Scale'])

5.1.8 Assigning New Magical Identifiers (Index)

Instead of using an existing column, you can also assign a new list of magical identifiers to your catalog.

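magic_artifacts_df_new_index = magic_artifacts_df.copy() # Create a copy to keep the original DataFrame

# The index needs exactly one identifier per row. The catalog has grown since it was created,
# so we build the list from its current length instead of hard-coding four names.
new_magical_ids = [f'Artifact_{i}' for i in range(1, len(magic_artifacts_df_new_index) + 1)]
magic_artifacts_df_new_index.index = new_magical_ids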

print("\nCatalog with new magical identifiers:")
display(magic_artifacts_df_new_index)

Now you can use these new magical identifiers to retrieve artifacts:

print("\nRetrieving 'Artifact_3' using its new magical identifier:")
display(magic_artifacts_df_new_index.loc['Artifact_3'])

5.1.10 Sorting Your Magical Artifacts

To bring order to your magical catalog, you can sort the artifacts based on the values in one or more columns. The .sort_values() spell allows you to arrange your artifacts. Let’s sort them by ‘Power Level’ to see which are the most powerful!

print("\nSorting artifacts by Power Level (ascending):")
sorted_artifacts_ascending = magic_artifacts_df.sort_values(by='Power Level')
display(sorted_artifacts_ascending)

print("\nSorting artifacts by Power Level (descending):")
sorted_artifacts_descending = magic_artifacts_df.sort_values(by='Power Level', ascending=False)
display(sorted_artifacts_descending)
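
Two details of .sort_values() are worth seeing in action: it returns a new DataFrame rather than changing the original, and each row keeps its original index label after sorting. The sketch below uses a shortened catalog with illustrative variable names; reset_index(drop=True) is one way to renumber the rows afterwards.

```python
import pandas as pd

df = pd.DataFrame({'Artifact Name': ['Phoenix Feather', 'Dragon Scale', 'Unicorn Horn'],
                   'Power Level': [10, 9, 8]})

ranked = df.sort_values(by='Power Level')   # ascending by default
renumbered = ranked.reset_index(drop=True)  # discard the old labels and count from 0

print(list(ranked.index))       # old labels travel with their rows
print(list(renumbered.index))   # fresh labels
print(list(df['Power Level']))  # the original order is unchanged
```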

5.1.11 Importing Artifacts from Ancient Scrolls (CSV Files)

Sometimes your magical artifacts are stored in ancient scrolls (CSV files). You can import this data directly into your catalog using the pd.read_csv() spell. You can even specify a column to be the magical identifier (index) when you import it using the index_col parameter.

# This sample CSV file ships with Google Colab; replace the path with your own file when running elsewhere
csv_file_path = '/content/sample_data/california_housing_train.csv'

print(f"\nImporting artifacts from the ancient scroll: {csv_file_path}")
housing_df = pd.read_csv(csv_file_path, index_col='longitude')

print("\nBehold! Your new catalog conjured from the ancient scroll:")
display(housing_df.head())
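
If that sample file is not available, the same idea can be tried without any file at all. In this sketch the scroll contents are hypothetical and held in a string; io.StringIO lets pd.read_csv() treat that string as if it were a file, and index_col names the column to use as the index.

```python
import io
import pandas as pd

# Hypothetical scroll contents, standing in for a CSV file on disk.
scroll = io.StringIO(
    "Artifact Name,Power Level,Location\n"
    "Phoenix Feather,10,Forbidden Forest\n"
    "Dragon Scale,9,Dragon Mountains\n"
)

scroll_df = pd.read_csv(scroll, index_col='Artifact Name')

print(list(scroll_df.columns))                       # the index column is no longer a data column
print(scroll_df.loc['Dragon Scale', 'Power Level'])  # look up by artifact name
```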

5.1.12 Selecting Artifacts by Numerical Position (Slicing)

Just like selecting items from a list, you can select a range of artifacts from your catalog using numerical intervals within square brackets []. This is often called “slicing”. Remember that the end of the interval is exclusive, meaning the artifact at the end index is not included.

print("\nRetrieving the first two artifacts using slicing:")
display(magic_artifacts_df[0:2])

print("\nRetrieving artifacts from the third to the fifth (index 2 to 4) using slicing:")
display(magic_artifacts_df[2:5])
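
A caution about this shorthand: square brackets slice rows only when given a range. A single integer such as df[0] is treated as a column name instead. The sketch below, with illustrative variable names, shows that the slice matches .iloc[] and that the single-integer form fails on this catalog because no column is named 0.

```python
import pandas as pd

df = pd.DataFrame({'Artifact Name': ['Phoenix Feather', 'Dragon Scale', 'Unicorn Horn'],
                   'Power Level': [10, 9, 8]})

first_two = df[0:2]
same_as_iloc = first_two.equals(df.iloc[0:2])  # the slice selects rows by position

try:
    df[0]  # looked up as a column label, not as a row position
    column_lookup_failed = False
except KeyError:
    column_lookup_failed = True

print(same_as_iloc)
print(column_lookup_failed)
```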

This is just the beginning of our adventure in the Grand Library of DataFrames! Pandas DataFrames offer many more spells (operations) for manipulating, cleaning, and analyzing your data.
