Week 03: Overview of Python NumPy

Objective: Introduce the essential functionality of NumPy for efficient numerical computations in Python. Cover the creation, manipulation, and analysis of arrays, along with basic statistical operations relevant for data science and engineering applications.

  • Introduction to NumPy; Creating and Manipulating Arrays: An overview of NumPy’s role in Python for efficient numerical and scientific computing, and of its benefits for handling large datasets and performing complex calculations. Learn to create arrays from lists, ranges, and functions like np.zeros and np.ones, then reshape them, change their dimensions, and inspect their structure.
  • Array Operations and Math Functions; Indexing, Slicing, and Filtering: Perform element-wise operations such as addition and multiplication. Explore aggregate functions like np.sum and np.mean for data summarization. Access specific elements or subsets using indexing and slicing. Apply Boolean indexing to filter data based on conditions.
  • Random Number Generation and Statistical Analysis with NumPy: Generate random numbers and arrays with np.random functions, which is useful for simulations, test data, and sampling. Use statistical functions to calculate mean, median, variance, and percentiles, and to analyze distributions and trends within datasets.

Introduction to NumPy

NumPy (Numerical Python) is a powerful Python library for numerical and scientific computing. Its core is implemented in C, which allows for fast operations on arrays and matrices, and it provides an extensive collection of mathematical functions. NumPy is essential for handling large datasets and performing complex calculations efficiently, making it a cornerstone for data science, machine learning, and scientific computing.

Core Concepts of NumPy

1. Getting Started with NumPy

  • Installation (pip install numpy) and importing (import numpy as np)
  • Overview of NumPy’s core functionality and differences from Python lists
  • Advantages: speed, memory efficiency, and broadcasting

2. NumPy Arrays

  • Creating Arrays: Arrays can be created from lists, tuples, or built-in functions.
    • np.array(): Basic array creation from a list or tuple
    • np.zeros(), np.ones(), np.arange(), and np.linspace() for creating arrays with specific patterns
  • Array Dimensions: Understanding 1D, 2D, and ND arrays, with examples
  • Array Indexing and Slicing: Accessing elements, slicing arrays for subsets, and using Boolean indexing for filtering
  • Array Shape and Reshape: Modifying the dimensions of an array (reshape, flatten, ravel)

3. Array Operations and Manipulations

  • Basic Math Operations: Element-wise addition, subtraction, multiplication, and division
  • Matrix Operations: Dot product and matrix multiplication using np.dot(), and cross product using np.cross()
  • Aggregations: Common functions like np.sum(), np.mean(), np.min(), np.max(), and np.std() for data aggregation
  • Broadcasting: How NumPy handles operations on arrays of different shapes, examples of element-wise operations

4. Data Types and Conversions

  • Understanding and specifying data types with dtype
  • Type conversions with .astype() method for changing data types (e.g., integer to float)

5. Random Number Generation

  • Generating Random Arrays: Using np.random for generating random numbers, such as np.random.rand(), np.random.randint(), and np.random.normal()
  • Setting random seed with np.random.seed() for reproducible results

6. Statistical Functions and Analysis

  • Common statistical functions (mean, median, variance, std, percentile)
  • Practical exercises to analyze sample datasets

1. Getting Started with NumPy

To effectively utilize the NumPy library, we need to start with the basics, from installation and setup to understanding its core functionality. NumPy stands out as a powerful library because it enables efficient numerical computations on large datasets like enriched_sales_data.xlsx.

1.1 Installation and Importing NumPy

To use NumPy, you’ll need to have it installed in your Python environment. Install it via pip and import it for use:

pip install numpy

Once installed, you can import it in your code:

import numpy as np

1.2 Overview of NumPy’s Core Functionality

NumPy is built around a key data structure called the array. This array is different from Python lists in several ways:

  • Fixed Size: A NumPy array has a fixed size once it is created; functions that appear to resize it, such as np.append, return a new array. Python lists, by contrast, grow and shrink in place.
  • Type Homogeneity: Each element in a NumPy array must be of the same data type (e.g., all integers or all floats). This uniformity leads to optimized performance and memory usage.
  • Fast Operations: Most NumPy operations run in compiled C code, making them much faster than equivalent Python loops over lists.
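
The first two properties in the list above can be observed directly. The following is a minimal sketch; the variable names are illustrative and not part of the course dataset.

```python
import numpy as np

# Type homogeneity: mixing ints with one float upcasts every element to float64.
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)

# A Python list keeps each element's own type.
py_list = [1, 2, 3.5]
print([type(x).__name__ for x in py_list])

# Fixed size: np.append does not grow the array in place; it builds a new one.
original = np.array([45, 65, 30])
extended = np.append(original, 10)
print(original.shape, extended.shape)
print(extended is original)
```

Because every "resize" allocates a new array, repeatedly appending inside a loop is usually slower than collecting values in a list and converting once with np.array().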

1.3 Differences Between NumPy Arrays and Python Lists

Consider a small subset of enriched_sales_data.xlsx for a practical illustration. Let’s say we want to store and analyze the Units_Sold column for product sales.

Example Using Python Lists:

# Python list
units_sold = [45, 65, 30, 10, 90, 50]

# Adding 5 to each unit sold using a loop
updated_units = [unit + 5 for unit in units_sold]
print(updated_units)  # Output: [50, 70, 35, 15, 95, 55]

Example Using NumPy Arrays:

# NumPy array
units_sold_np = np.array([45, 65, 30, 10, 90, 50])

# Adding 5 to each unit sold using broadcasting
updated_units_np = units_sold_np + 5
print(updated_units_np)  # Output: [50 70 35 15 95 55]

Here, you can see that NumPy’s broadcasting feature (explained in detail below) allows us to perform operations without explicitly looping over elements, making the code cleaner and faster.

1.4 Advantages of Using NumPy

  • Speed: NumPy is highly optimized for numerical calculations. For instance, processing the Units_Sold column with millions of entries would be significantly faster in NumPy than with regular Python lists.

Example: Summing Units Sold

# Generating large random data for simulation
units_sold_np_large = np.random.randint(1, 100, size=1000000)  # 1 million entries

# Summing all units
total_units = np.sum(units_sold_np_large)
print(total_units)

NumPy’s compiled internals typically complete this sum in about a millisecond, whereas adding up a Python list of the same length element by element is usually many times slower.

  • Memory Efficiency: NumPy arrays are more memory-efficient than Python lists because they store raw values of a single type in one contiguous block of memory, whereas a list stores references to separately allocated Python objects. This layout allows for faster access and reduced memory usage, which is particularly beneficial when dealing with extensive datasets like enriched_sales_data.xlsx.
  • Broadcasting: One of NumPy’s standout features is broadcasting, which lets NumPy apply operations to arrays of different shapes. Adding a constant to every element of an array, or performing arithmetic across arrays of different but compatible shapes, needs no explicit loop.

Example: Applying Discounts

Let’s say we want to apply a 10% discount across all entries in the Unit_Price column:

# Original unit prices
unit_prices = np.array([150, 200, 300, 250, 180, 120])

# Apply a 10% discount
discounted_prices = unit_prices * 0.9
print(discounted_prices)  # Output: [135. 180. 270. 225. 162. 108.]

  • Convenience with Multidimensional Data: While Python lists require nested structures to handle multi-dimensional data (like rows and columns in a table), NumPy can handle multi-dimensional arrays seamlessly. For instance, we can load multiple columns like Units_Sold, Unit_Price, and Sales_Amount from enriched_sales_data.xlsx as a 2D array and perform operations across the rows and columns easily.

# Sample data for demonstration
data = np.array([
    [45, 150, 6750],   # Units_Sold, Unit_Price, Sales_Amount
    [60, 100, 6000],
    [75, 200, 15000]
])

# Calculate total units sold, mean unit price, and total sales
total_units = np.sum(data[:, 0])
mean_unit_price = np.mean(data[:, 1])
total_sales = np.sum(data[:, 2])

print("Total Units Sold:", total_units)
print("Mean Unit Price:", mean_unit_price)
print("Total Sales:", total_sales)
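
The speed and memory-efficiency claims in this section can be checked with a rough measurement. The sketch below is illustrative, not a benchmark: the variable names are made up for this example, timings depend on the machine, and the list's memory figure is only an estimate that adds the size of each int object to the size of the list itself.

```python
import sys
import timeit

import numpy as np

n = 1_000_000
# int64 is requested explicitly so the total cannot overflow on any platform.
units_array = np.arange(n, dtype=np.int64)
units_list = units_array.tolist()   # the same values as plain Python ints

# Both containers hold the same data, so the totals must agree.
list_total = sum(units_list)
array_total = int(units_array.sum())
print(list_total == array_total)

# Speed: time five repetitions of each sum (results vary by machine).
list_seconds = timeit.timeit(lambda: sum(units_list), number=5)
array_seconds = timeit.timeit(lambda: units_array.sum(), number=5)
print(f"list sum:  {list_seconds:.4f} s")
print(f"array sum: {array_seconds:.4f} s")

# Memory: the array stores 8 bytes per value in one block; the list stores
# a reference per element plus a separate Python int object for each value.
array_bytes = units_array.nbytes
list_bytes = sys.getsizeof(units_list) + sum(sys.getsizeof(x) for x in units_list)
print(f"array: {array_bytes / 1e6:.1f} MB, list (estimate): {list_bytes / 1e6:.1f} MB")
```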

Summary

Using the enriched_sales_data.xlsx dataset with NumPy will empower us to perform these kinds of efficient, large-scale data operations throughout the course. This foundational understanding of NumPy’s speed, memory efficiency, and broadcasting capabilities will be essential as we explore data manipulation and analysis techniques in upcoming weeks.

The cells below rerun the examples above in a notebook; each cell is followed by its printed output.

import numpy as np

# Example using Python lists:
print('# Python list')
units_sold = [45, 65, 30, 10, 90, 50]

print('# Adding 5 to each unit sold using a loop')
updated_units = [unit + 5 for unit in units_sold]
print(updated_units)  # Output: [50, 70, 35, 15, 95, 55]

# Example using NumPy arrays:
print('# NumPy array')
units_sold_np = np.array([45, 65, 30, 10, 90, 50])

print('# Adding 5 to each unit sold using broadcasting')
updated_units_np = units_sold_np + 5
print(updated_units_np)  # Output: [50 70 35 15 95 55]

Output:

# Python list
# Adding 5 to each unit sold using a loop
[50, 70, 35, 15, 95, 55]
# NumPy array
# Adding 5 to each unit sold using broadcasting
[50 70 35 15 95 55]

print('\n\n# Generating large random data for simulation')
units_sold_np_large = np.random.randint(1, 100, size=1000000)  # 1 million entries
print('# Summing all units')
total_units = np.sum(units_sold_np_large)
print(total_units)

Output (the total varies between runs because no seed is set):

# Generating large random data for simulation
# Summing all units
49973793

print('# Original unit prices')
unit_prices = np.array([150, 200, 300, 250, 180, 120])

print('# Apply a 10% discount')
discounted_prices = unit_prices * 0.9
print(discounted_prices)  # Output: [135. 180. 270. 225. 162. 108.]

Output:

# Original unit prices
# Apply a 10% discount
[135. 180. 270. 225. 162. 108.]

print('# Sample data for demonstration')
data = np.array([
    [45, 150, 6750],   # Units_Sold, Unit_Price, Sales_Amount
    [60, 100, 6000],
    [75, 200, 15000]
])

print('\n\n# Calculate total units sold, mean unit price, and total sales')
total_units = np.sum(data[:, 0])
mean_unit_price = np.mean(data[:, 1])
total_sales = np.sum(data[:, 2])

print("Total Units Sold:", total_units)
print("Mean Unit Price:", mean_unit_price)
print("Total Sales:", total_sales)

Output:

# Sample data for demonstration


# Calculate total units sold, mean unit price, and total sales
Total Units Sold: 180
Mean Unit Price: 150.0
Total Sales: 27750

NumPy Methods and Usage in Python

2. NumPy Arrays

| Category | Function | Explanation | Example Code | Output |
|---|---|---|---|---|
| Creating Arrays | np.array() | Creates an array from a list or tuple. Useful for converting data into a NumPy array format. | np.array([1, 2, 3]) | [1 2 3] |
| Creating Arrays | np.zeros() | Creates an array filled with zeros. Handy for initializing arrays with default values of zero. | np.zeros(3) | [0. 0. 0.] |
| Creating Arrays | np.ones() | Creates an array filled with ones. Useful for initial setups with a constant value of one. | np.ones(3) | [1. 1. 1.] |
| Creating Arrays | np.arange() | Creates an array over a range (start, stop, step); the stop value is excluded. Helps in generating sequences. | np.arange(1, 10, 2) | [1 3 5 7 9] |
| Creating Arrays | np.linspace() | Creates an array of evenly spaced values that includes both endpoints, ideal for smooth transitions (e.g., graphs). | np.linspace(0, 1, 5) | [0. 0.25 0.5 0.75 1.] |
| Array Dimensions | np.array() | Array dimensions show the structure: 1D (line), 2D (table), or ND (multiple layers). | np.array([[1, 2], [3, 4]]) | [[1 2] [3 4]] (2D array) |
| Array Dimensions | np.array() | 3D array: has "depth" (like layers of 2D tables). | np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]]) | [[[1 2] [3 4]] [[5 6] [7 8]]] |
| Indexing and Slicing | Basic Indexing | Accesses elements based on position. arr[1] accesses the second element. | arr = np.array([10, 20, 30]); arr[1] | 20 |
| Indexing and Slicing | Basic Slicing | Slicing extracts ranges; arr[0:2] gets the first two elements. | arr[0:2] | [10 20] |
| Indexing and Slicing | Boolean Indexing | Returns the elements that satisfy a condition. Helps filter data (e.g., values > 15). | arr[arr > 15] | [20 30] |
| Shape and Reshape | .shape | Shows the layout of an array, e.g., (2, 3) for 2 rows, 3 columns. | arr = np.array([[1, 2], [3, 4]]); arr.shape | (2, 2) |
| Shape and Reshape | .reshape() | Changes the layout of an array without modifying its data. | np.arange(6).reshape(2, 3) | [[0 1 2] [3 4 5]] |
| Shape and Reshape | .flatten() | Converts a multi-dimensional array to 1D and always returns a copy. | np.arange(6).reshape(2, 3).flatten() | [0 1 2 3 4 5] |
| Shape and Reshape | .ravel() | Similar to flatten, but returns a "view" (linked to the original) whenever possible and a copy only when a view cannot be made. | np.arange(6).reshape(2, 3).ravel() | [0 1 2 3 4 5] |
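
The difference between a copy and a view, described for .flatten() and .ravel() above, is easiest to see by modifying the results. This is a small sketch with illustrative variable names; np.shares_memory is used only to confirm which result is linked to the original array.

```python
import numpy as np

grid = np.arange(6).reshape(2, 3)   # rows: [0 1 2] and [3 4 5]

flat_copy = grid.flatten()  # always a new, independent array
flat_view = grid.ravel()    # a view here, because grid is contiguous in memory

print(np.shares_memory(grid, flat_copy))  # False
print(np.shares_memory(grid, flat_view))  # True

# Writing through the view changes grid; writing to the copy does not.
flat_view[0] = 99
flat_copy[1] = -1
print(grid)

# Boolean indexing selects only the elements that satisfy the condition.
large_values = grid[grid > 3]
print(large_values)
```

After the two assignments, grid holds 99 in its first position but still holds 1 in its second, since only the view is tied to the original data.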

3. Array Operations and Manipulations

| Category | Function | Explanation | Example Code | Output |
|---|---|---|---|---|
| Basic Math Operations | Addition (+) | Element-wise addition; adds corresponding elements in two arrays. Useful for basic calculations. | np.array([1, 2]) + np.array([3, 4]) | [4 6] |
| Basic Math Operations | Subtraction (-) | Subtracts corresponding elements. | np.array([3, 4]) - np.array([1, 2]) | [2 2] |
| Basic Math Operations | Multiplication (*) | Multiplies corresponding elements. | np.array([1, 2]) * np.array([3, 4]) | [3 8] |
| Matrix Operations | np.dot() | Dot product. For two 1D arrays it returns a single number (the sum of the element-wise products); for 2D arrays it performs matrix multiplication, one value per row and column pair. | np.dot([1, 2], [3, 4]) | 11 |
| Matrix Operations | np.cross() | Cross product; returns a vector perpendicular to the two input vectors. Useful in physics, e.g., for calculating torque. | np.cross([1, 0, 0], [0, 1, 0]) | [0 0 1] |
| Aggregations | np.sum() | Adds up all elements. Often used to get totals in datasets. | np.sum([1, 2, 3]) | 6 |
| Aggregations | np.mean() | Calculates the average, providing a "central" value. Useful in statistics. | np.mean([1, 2, 3]) | 2.0 |
| Aggregations | np.std() | Measures data spread from the mean. High std means more variation; low std means values are close to the mean. | np.std([1, 2, 3]) | 0.816 (rounded) |
| Broadcasting | Addition with Scalar | Adds a single value (scalar) to each element of an array by "stretching" the scalar. | np.array([1, 2, 3]) + 5 | [6 7 8] |
| Broadcasting | Addition with Array | Aligns smaller array dimensions with larger ones for element-wise operations (e.g., 1D added to 2D). | np.array([[1, 2], [3, 4]]) + np.array([5, 6]) | [[6 8] [8 10]] |
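
The table's np.dot() example uses 1D inputs, so it may help to see how element-wise multiplication differs from matrix multiplication on 2D arrays, and how broadcasting treats row and column shapes. The sketch below uses small made-up matrices; the @ operator is shown as an equivalent of np.dot() for 2D arrays and is an addition to the functions listed in the table.

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
b = np.array([[5, 6],
              [7, 8]])

elementwise = a * b        # multiplies matching positions
matrix_product = a @ b     # rows of a times columns of b; same as np.dot(a, b)
print(elementwise)
print(matrix_product)

# Broadcasting: shape (2,) is stretched across rows, shape (2, 1) across columns.
row = np.array([10, 20])
col = np.array([[100], [200]])
plus_row = a + row
plus_col = a + col
print(plus_row)
print(plus_col)

# Shapes that cannot be aligned raise an error instead of guessing.
try:
    a + np.array([1, 2, 3])
    broadcast_failed = False
except ValueError as err:
    broadcast_failed = True
    print("Cannot broadcast:", err)
```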

4. Data Types and Conversions

| Category | Function | Explanation | Example Code | Output |
|---|---|---|---|---|
| Data Types | dtype | Specifies or checks the data type (e.g., integer, float), helping NumPy manage memory effectively. | arr = np.array([1, 2, 3], dtype=np.float64); arr.dtype | float64 |
| Type Conversions | .astype() | Returns a new array converted to another data type (e.g., integer to float or string). Helps with compatibility for calculations or data manipulation. | arr = np.array([1, 2, 3]); arr.astype(float) | [1. 2. 3.] |
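
One detail worth seeing in practice is what .astype() does when a conversion loses information. The following sketch uses made-up price values; np.round is brought in only to contrast rounding with the truncation that .astype() performs on its own.

```python
import numpy as np

prices = np.array([150.75, 200.2, 99.99])
print(prices.dtype)

# Float to integer: astype drops the fractional part (no rounding).
truncated = prices.astype(np.int64)
print(truncated)

# Round first if the nearest whole number is wanted.
rounded = np.round(prices).astype(np.int64)
print(rounded)

# astype returns a new array; the original is left unchanged.
print(prices)

# Numeric strings can be converted to numbers as well.
text_values = np.array(["1", "2", "3"])
as_numbers = text_values.astype(np.int64)
print(as_numbers.sum())
```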

5. Random Number Generation

| Category | Function | Explanation | Example Code | Output |
|---|---|---|---|---|
| Random Arrays | np.random.rand() | Generates random floats from 0 (included) up to 1 (excluded), useful in simulations where unpredictable data is needed. | np.random.rand(3) | [0.45 0.76 0.89] (example) |
| Random Arrays | np.random.randint() | Generates random integers within a range (the upper bound is excluded), useful for random selection in games or sampling. | np.random.randint(1, 10, 3) | [3 7 1] (example) |
| Random Arrays | np.random.normal() | Generates numbers following a "normal" (bell curve) distribution, with most values near the mean. Ideal for simulating real-world distributions in statistics. | np.random.normal(0, 1, 3) | [-0.2 0.4 1.1] (example) |
| Setting Seed | np.random.seed() | Sets a fixed starting point for random numbers, making results reproducible. Great for testing and experiments where you want consistent results. | np.random.seed(42); np.random.rand(3) | [0.37 0.95 0.73] (rounded) |
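
Reproducibility with a seed can be verified by generating the same numbers twice. The sketch below first uses the np.random.seed() approach from the table, then shows np.random.default_rng(), the Generator interface available in newer NumPy versions; the Generator part is an addition that does not appear in the original table.

```python
import numpy as np

# Same seed, same numbers: reseeding restarts the sequence.
np.random.seed(42)
first = np.random.rand(3)
np.random.seed(42)
second = np.random.rand(3)
print(np.array_equal(first, second))  # True

# Generator interface: each generator carries its own seed and state.
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)
sample_a = rng_a.integers(1, 10, size=5)   # integers from 1 up to 9
sample_b = rng_b.integers(1, 10, size=5)
print(np.array_equal(sample_a, sample_b))  # True

# Normally distributed test data with a chosen mean and standard deviation.
simulated = rng_a.normal(loc=0, scale=1, size=1000)
print(simulated.shape)
```

A separate generator object keeps the random state local, which avoids one part of a program silently changing the sequence seen by another.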

6. Statistical Functions and Analysis

| Category | Function | Explanation | Example Code | Output |
|---|---|---|---|---|
| Statistical Functions | np.mean() | Calculates the mean (average) of elements. | np.mean([1, 2, 3, 4]) | 2.5 |
| Statistical Functions | np.median() | Calculates the median (middle value) of elements. | np.median([1, 2, 3, 4]) | 2.5 |
| Statistical Functions | np.var() | Calculates the variance, indicating how spread out the numbers are from the mean. | np.var([1, 2, 3, 4]) | 1.25 |
| Statistical Functions | np.std() | Calculates the standard deviation, showing data spread relative to the mean. | np.std([1, 2, 3, 4]) | 1.118 (rounded) |
| Statistical Functions | np.percentile() | Calculates a specified percentile, useful for determining value distribution (e.g., 50th). | np.percentile([1, 2, 3, 4], 50) | 2.5 |
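
The statistical functions in the table can be applied together to summarize a column. The sketch below reuses the six Units_Sold values from the earlier examples; the ddof=1 argument (sample variance) is an addition not covered in the table, included because np.var() and np.std() divide by n by default.

```python
import numpy as np

units_sold = np.array([45, 65, 30, 10, 90, 50])

mean_units = np.mean(units_sold)
median_units = np.median(units_sold)   # average of the two middle values
print("Mean:", mean_units)
print("Median:", median_units)

# Default variance divides by n (population); ddof=1 divides by n - 1 (sample).
population_var = np.var(units_sold)
sample_var = np.var(units_sold, ddof=1)
print("Population variance:", population_var)
print("Sample variance:", sample_var)

# Standard deviation is the square root of the variance.
population_std = np.std(units_sold)
print("Population std:", population_std)

# Several percentiles can be requested in one call.
q25, q50, q75 = np.percentile(units_sold, [25, 50, 75])
print("Quartiles:", q25, q50, q75)
```

The 50th percentile equals the median, which gives a quick consistency check when reading the output.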
 

