Data Analytics with Python

Week 02: Data Manipulation and Cleaning with Pandas

Objective: Teach participants to clean and prepare data for analysis, handling common issues in business datasets.

Importing and exporting data (CSV, Excel, JSON)
Data manipulation with Pandas: selecting, filtering, merging, and aggregating data
Data cleaning: handling missing values, duplicate data, outliers, and data types
Exploratory Data Analysis (EDA) techniques for business insights
Hands-on exercise: Cleaning and preparing a business dataset
Mini-project: Using Pandas to prepare a dataset for analysis (e.g., sales or financial data)
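
As a rough illustration of the import, cleaning, and aggregation steps listed above, here is a minimal sketch. The column names and the tiny inline dataset are hypothetical, and the median fill is only one of several reasonable ways to handle missing values.

```python
import io

import pandas as pd

# Hypothetical raw sales export; in practice this would be a file path
# passed to pd.read_csv (or pd.read_excel / pd.read_json).
raw_csv = io.StringIO(
    "order_id,region,amount,order_date\n"
    "1,North,100.5,2024-01-05\n"
    "2,South,,2024-01-06\n"
    "2,South,,2024-01-06\n"
    "3,north,250.0,2024-01-07\n"
    "4,East,9999.0,2024-01-08\n"
)

# Import, converting the date column to a datetime type.
sales = pd.read_csv(raw_csv, parse_dates=["order_date"])

# Duplicates: drop rows that are exact copies of an earlier row.
sales = sales.drop_duplicates()

# Inconsistent labels: make "north" and "North" the same category.
sales["region"] = sales["region"].str.strip().str.title()

# Missing values: fill missing amounts with the median observed amount.
sales["amount"] = sales["amount"].fillna(sales["amount"].median())

# Aggregation: total amount per region.
totals = sales.groupby("region")["amount"].sum()

# Export: write the cleaned data back out as CSV text.
cleaned_csv = sales.to_csv(index=False)

print(sales)
print(totals)
```

The 9999.0 order is left in place here; whether such a value is an outlier to remove or a genuine large sale depends on the business context.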

Detailed Overview of Data

Categories of Data

1. Primary Data

Primary data is original and collected firsthand by the researcher for a specific purpose. It is typically gathered through methods like surveys, experiments, or direct observation. Primary data can further be divided into:

Interactive: Data collected through direct interaction with subjects, like interviews, surveys, or questionnaires. Example: Conducting a survey to understand consumer preferences.
Observational: Data collected by observing subjects without direct interaction, typically in a natural setting. Example: Observing customer behavior in a retail store.
Simulation: Data generated by simulating a real-world process or system under controlled conditions. Example: Running a simulation to understand the potential spread of a virus in a population.

2. Secondary Data

Secondary data is pre-existing data that was collected by someone else, usually for a different purpose. It can be accessed through sources like research publications, databases, and historical records. Secondary data can be categorized as:

Cross-Sectional: Data collected at a single point in time, providing a snapshot of a particular phenomenon. Example: Census data that captures the demographic distribution at one specific time.
Time Series: Data collected over a period of time, allowing analysis of trends and changes. Example: Monthly unemployment rates collected over several years to study economic trends.
Panel: A type of data that involves multiple observations over time for the same subjects, combining elements of cross-sectional and time-series data. Example: Tracking the income levels of the same group of individuals over several years.

Each category serves different research purposes, and the choice depends on the nature of the study, the research question, and the availability of resources.
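
To make the difference between cross-sectional, time-series, and panel data concrete, the sketch below stores a small panel in pandas and slices it both ways. The people, years, and income figures are invented for illustration.

```python
import pandas as pd

# Hypothetical panel: yearly income for the same three people.
panel = pd.DataFrame(
    {
        "person": ["A", "A", "B", "B", "C", "C"],
        "year": [2022, 2023, 2022, 2023, 2022, 2023],
        "income": [50, 55, 40, 42, 65, 70],
    }
).set_index(["person", "year"])

# Cross-section: every person observed at a single point in time.
cross_section_2023 = panel.xs(2023, level="year")

# Time series: one person followed over time.
series_a = panel.xs("A", level="person")["income"]

# Panel-style question: how did each person's income change year to year?
change = panel["income"].groupby(level="person").diff().dropna()

print(cross_section_2023)
print(series_a)
print(change)
```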

Data Types in Statistics

1. Quantitative Data:

This type of data can be measured numerically and is often divided into Continuous and Discrete types.

Continuous Data: This type of data has a continuous range of values, like length, weight, or temperature. It can be measured in fractions or decimals.
Discrete Data: This type of data includes specific values only, such as the number of students in a class or the count of coins.

2. Qualitative Data:

This data cannot be measured numerically and is instead categorized based on attributes or qualities.

Nominal Data: In this type, data is categorized by names or labels, without any inherent order, such as gender (male, female) or color (red, blue).
Ordinal Data: This type has a clear order or ranking among categories, but the intervals between them cannot be measured, such as rankings like "Good," "Very Good," "Excellent."
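
One possible way to represent these four data types in pandas is sketched below. The column names and values are made up; the key point is that an ordered categorical keeps the ranking of ordinal labels, while a plain categorical treats nominal labels as unordered.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "weight_kg": [61.5, 72.3, 58.0],               # continuous
        "num_courses": [4, 5, 3],                      # discrete
        "color": ["red", "blue", "red"],               # nominal
        "rating": ["Good", "Excellent", "Very Good"],  # ordinal
    }
)

# Nominal: categories without any order.
df["color"] = df["color"].astype("category")

# Ordinal: categories with an explicit ranking, lowest first.
rating_order = ["Good", "Very Good", "Excellent"]
df["rating"] = pd.Categorical(df["rating"], categories=rating_order, ordered=True)

# Ordered categories support comparisons and min/max.
at_least_very_good = df[df["rating"] >= "Very Good"]
best = df["rating"].max()

print(df.dtypes)
print(at_least_very_good)
```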

Design Data (Social Sciences) vs Organic Data (Computer Sciences)

In social sciences, "design data" refers to structured data collected through planned research designs, such as surveys, experiments, and observational studies. This data is purposefully gathered to answer specific questions or test hypotheses within a controlled environment, allowing researchers to analyze behavior, attitudes, or interactions systematically.

In computer science, "organic data" refers to naturally occurring data, often unstructured, that is generated without a specific research design in mind. This data comes from real-world sources like social media posts, sensor outputs, transaction logs, and web activity. Organic data is typically analyzed to identify patterns, trends, and insights in a way that reflects real-world, spontaneous behavior rather than controlled experimental settings.

Population and Sample in Data Analysis

Population: Refers to the complete group of individuals or items sharing a specific characteristic and targeted in a study. Researchers define the population based on study goals. Example: Studying the eating habits of college students in Pakistan would have all Pakistani college students as the population.
Sample: A subset of the population chosen for analysis, often due to practical constraints (time, cost, etc.). Ideally, the sample is representative of the population to generalize findings. Example: Selecting 500 students from different universities across Pakistan as a sample to represent all college students.

- Types of Sampling:

Probability Sampling: Each population member has a known, non-zero chance of selection. This method is randomized and aims to create a representative sample, enhancing result generalization.
- Types: Simple Random, Systematic, Stratified, and Cluster Sampling.
Non-Probability Sampling: Selection chances are not known for all members, making it less representative and potentially introducing bias. It’s often faster and cheaper but limits generalizability.
- Types: Convenience, Purposive (Judgmental), Quota, and Snowball Sampling.
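
The four probability sampling types can be sketched with pandas as follows. The sampling frame of 100 students in four universities is hypothetical, and the systematic sample uses a fixed start at the first row for simplicity (a random start is more usual in practice).

```python
import pandas as pd

# Hypothetical sampling frame: 100 students across 4 universities.
population = pd.DataFrame(
    {
        "student_id": range(1, 101),
        "university": ["U1"] * 40 + ["U2"] * 30 + ["U3"] * 20 + ["U4"] * 10,
    }
)

# Simple random sampling: every student has the same chance of selection.
simple_random = population.sample(n=20, random_state=42)

# Systematic sampling: every k-th student in the list.
k = 5
systematic = population.iloc[::k]

# Stratified sampling: draw the same fraction from each university.
stratified = population.groupby("university").sample(frac=0.2, random_state=42)

# Cluster sampling: pick whole universities at random, keep all their students.
chosen = population["university"].drop_duplicates().sample(n=2, random_state=42)
cluster = population[population["university"].isin(chosen)]

print(stratified["university"].value_counts())
```

Stratified sampling preserves each university's share of the population, whereas the simple random sample only does so on average.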

Sample Size Determination Using Krejcie and Morgan Table

Population (N)	Sample Size (S)
10	10
50	44
100	80
200	132
500	217
1,000	278
5,000	357
10,000	370
50,000	381
100,000	384
1,000,000	384

Source for detail:

Sample Size Determination Using Krejcie and Morgan Table
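
The table values can be approximated with the formula published by Krejcie and Morgan (1970), s = X²NP(1−P) / (d²(N−1) + X²P(1−P)), where X² = 3.841 (chi-square for 1 degree of freedom at 95% confidence), P = 0.5, and d = 0.05. The function name below is an illustrative choice, and rounded results may differ by one from some printed versions of the table for very large populations.

```python
def krejcie_morgan(population_size, chi_square=3.841, proportion=0.5, margin=0.05):
    """Approximate required sample size for a finite population."""
    variability = chi_square * proportion * (1 - proportion)
    numerator = variability * population_size
    denominator = margin**2 * (population_size - 1) + variability
    return round(numerator / denominator)


for n in (10, 50, 100, 500, 1_000, 10_000, 1_000_000):
    print(n, krejcie_morgan(n))
```

The required sample size flattens out as the population grows, which is why the last rows of the table barely change.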

Pandas

Pandas, named after "panel data" from econometrics, was created by Wes McKinney to simplify handling multi-dimensional, labeled, and time-series data in Python. It emerged as a tool to streamline complex data manipulation tasks, making them accessible with just a few lines of code.
The playful panda bear reference added a friendly touch, making pandas not only technically powerful but also memorable and approachable in the Python ecosystem. Today, it’s an essential library for data manipulation, beloved by data scientists and analysts alike.

"Probability and Statistics for Computer Science"

It covers several key concepts:

Probability Basics: Fundamental principles of probability, including definitions, types of probabilities, and basic operations like union and intersection of events.
Random Variables and Distributions: Explanation of random variables, both discrete and continuous, and how probability distributions apply to each. It also includes types of distributions like binomial, Poisson, and normal.
Central Tendency and Dispersion: Measures like mean, median, mode, variance, and standard deviation to describe the central point and spread of data. Includes tools like quartiles and boxplots.
Entropy and Information Theory: Introduction to entropy as a measure of uncertainty or disorder in data, as well as its applications in machine learning and communication systems.
Data Visualization: Techniques for visualizing data through graphs such as histograms, contour plots, and scatter plots to better understand data distribution and relationships.
Correlation and Dependence: Concepts of correlation, dependence, and methods to analyze the relationship between two or more variables, including covariance and correlation coefficients.
Statistical Hypothesis Testing: Introduction to hypothesis testing for statistical inference, including p-values and confidence intervals, commonly used to validate or refute assumptions about data.
Regression Analysis: Basics of linear regression, modeling relationships between variables, and evaluating the fit of these models.
Time Series Analysis: Methods for analyzing data points collected or sequenced over time, including trends, seasonality, and smoothing techniques.
Applications in Machine Learning: Use of statistical concepts, especially entropy and data distributions, in machine learning to improve model training, understanding, and evaluation.
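
A few of the concepts above (central tendency, dispersion, correlation, and entropy) can be computed with the Python standard library alone. This is a minimal sketch on invented numbers; the helper names `pearson` and `shannon_entropy` are illustrative, not taken from the course.

```python
import math
import statistics
from collections import Counter

# Hypothetical data: monthly sales and advertising spend.
monthly_sales = [12, 15, 15, 18, 20, 22, 30]
ad_spend = [1.0, 1.5, 1.4, 2.0, 2.2, 2.5, 3.5]

# Central tendency and dispersion.
mean = statistics.mean(monthly_sales)
median = statistics.median(monthly_sales)
mode = statistics.mode(monthly_sales)
pop_std = statistics.pstdev(monthly_sales)  # population standard deviation


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def shannon_entropy(labels):
    """Entropy in bits of the empirical distribution of the labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


print(mean, median, mode, pop_std)
print(pearson(ad_spend, monthly_sales))
print(shannon_entropy(["a", "a", "b", "b"]))
```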

Sources of Secondary Data:

Sr. No.	Source Name	Authority	Industry / Indicator Type	Source Locale
1	Pakistan Bureau of Statistics	Government of Pakistan	National Statistics	Local
2	Punjab Statistics Department	Government of Punjab	Provincial Statistics	Local
3	Insurance Association of Pakistan	IAP	Insurance Industry	Local
4	National Database & Registration Authority	NADRA	Population and National ID Registration	Local
5	KSE Stocks	Pakistan Stock Exchange	Stock Market	Local
6	Pakistan Bankers Association	PBA	Banking Industry	Local
7	Pakistan Stock Exchange	PSX	Stock Market	Local
8	State Bank of Pakistan	SBP	Monetary Policy and Banking Indicators	Local
9	Securities and Exchange Commission	SECP	Corporate Governance	Local
10	Crypto Data Download	Binance	Cryptocurrency	International
11	Bloomberg	Bloomberg L.P.	Global Finance and Economic Data	International
12	Federal Reserve Economic Data (FRED)	Federal Reserve	U.S. Economic Indicators	International
13	Economic Freedom Index	Heritage Foundation	Economic Freedom	International
14	Additional IMF Data	International Monetary Fund	Financial Statistics	International
15	IMF Data	International Monetary Fund	Global Economic Indicators	International
16	KOF Globalisation Index	KOF Swiss Economic Institute	Globalisation Indicators	International
17	Oxford Economics Group	Oxford Economics	Global Economic Analysis	International
18	Datastream	Thomson Reuters	Financial and Economic Data	International
19	Trading Economics	Trading Economics	Economic Indicators, Global Statistics	International
20	International Financial Statistics	International Monetary Fund	International Finance	International
21	University of Gothenburg Databases	University of Gothenburg	Academic and Economic Research	International
22	World Bank Data	World Bank	Global Development Indicators	International
23	World Bank Finance	World Bank	Global Financial Indicators	International
24	World Development Indicators	World Bank	Economic Development	International
25	Yahoo Finance	Yahoo	General Finance and Stock Market	International

Mathematical and Statistical Techniques for Applied Data Analysis

Sr. No.	Technique	Type	Category	Purpose
1	Correlation	Analysis	Statistical Techniques	Measure relationship between variables
2	Regression	Analysis	Statistical Techniques	Predict values based on relationships between variables
3	Multiple Regression Analysis	Analysis	Statistical Techniques	Examine relationships with multiple independent variables
4	Logistic Regression Analysis	Analysis	Statistical Techniques	Predict binary outcomes
5	Structural Equation Modeling (SEM)	Theory/Modeling	Statistical Techniques	Examine complex relationships among variables
6	Canonical Correlation	Analysis	Statistical Techniques	Analyze relationships between multiple independent and dependent variables
7	Discriminant Analysis	Classification	Statistical Techniques	Classify observations into predefined groups
8	Factor Analysis	Data Reduction	Statistical Techniques	Identify underlying factors in data
9	Confirmatory Factor Analysis (CFA)	Technique	Statistical Techniques	Validate hypothesized factor structure
10	Exploratory Factor Analysis (EFA)	Technique	Statistical Techniques	Identify latent structure
11	Principal Component Analysis (PCA)	Data Reduction	Statistical Techniques	Reduce data dimensionality
12	Analysis of Variance (ANOVA)	Hypothesis Testing	Statistical Techniques	Test group mean differences
13	Multivariate Analysis of Variance (MANOVA)	Hypothesis Testing	Statistical Techniques	Test group differences across multiple variables
14	Multivariate Analysis of Covariance (MANCOVA)	Hypothesis Testing	Statistical Techniques	Control covariates when comparing groups
15	Cluster Analysis	Grouping	Statistical Techniques	Group data based on similarity
16	Multidimensional Scaling (MDS)	Visualization	Statistical Techniques	Spatially represent data similarities
17	Correspondence Analysis	Dimension Reduction	Statistical Techniques	Reduce dimensions for categorical data
18	Conjoint Analysis	Preference Analysis	Statistical Techniques	Analyze preferences among multiple attributes
19	Hypothesis Testing	Statistical Testing	Statistical Techniques	Test assumptions in data
20	T-Test	Hypothesis Testing	Statistical Techniques	Compare means of two groups
21	Z-Test	Hypothesis Testing	Statistical Techniques	Compare means or proportions when the sample is large or the population variance is known
22	Chi-Square Test	Hypothesis Testing	Statistical Techniques	Test for independence or goodness of fit
23	F-Test	Hypothesis Testing	Statistical Techniques	Compare variances
24	Interpretive Structural Modeling (ISM)	Modeling	Mathematical Techniques	Model complex system structures
25	Total Interpretive Structural Modeling (TISM)	Modeling	Mathematical Techniques	ISM with total interpretation
26	Modified Interpretive Structural Modeling	Modeling	Mathematical Techniques	Modified ISM for specific applications
27	Polarized Interpretive Structural Modeling	Modeling	Mathematical Techniques	Polar perspective in ISM
28	Fuzzy Interpretive Structural Modeling	Modeling	Mathematical Techniques	Incorporate fuzzy logic in ISM
29	Matrix of Cross-Impact Multiplication Applied to Classification (MICMAC)	Classification	Mathematical Techniques	Classify variables by impact and dependency
30	Grey Relational Analysis (GRA)	Decision Making	Mathematical Techniques	Rank alternatives based on criteria
31	Multi-Criteria Decision Making (MCDM)	Decision Making	Mathematical Techniques	Evaluate multiple criteria for decision making
32	Technique for Order Preference by Similarity to Ideal Solution (TOPSIS)	Decision Making	Mathematical Techniques	Rank options by closeness to ideal solution
33	Stepwise Weight Assessment Ratio Analysis (SWARA)	Weight Assessment	Mathematical Techniques	Assess criteria weight in decision making
34	Vise Kriterijumska Optimizacija I Kompromisno Resenje (VIKOR)	Decision Making	Mathematical Techniques	Determine compromise solutions
35	Decision-Making Trial and Evaluation Laboratory (DEMATEL)	Decision Making	Mathematical Techniques	Identify relationships between factors
36	Elimination and Choice Expressing Reality (ELECTRE I & II)	Decision Making	Mathematical Techniques	Multicriteria ranking and selection
37	Preference Ranking Organization Method for Enrichment Evaluation (PROMETHEE)	Decision Making	Mathematical Techniques	Preference ranking for alternatives
38	Wavelet Analysis	Signal Processing	Mathematical Techniques	Analyze frequency and time components
39	Fourier Transform	Signal Processing	Mathematical Techniques	Transform data to frequency domain
40	Data Envelopment Analysis (DEA)	Efficiency Analysis	Mathematical Techniques	Measure efficiency in production
41	Decision Tree Analysis	Predictive Modeling	Other Techniques	Classify or predict outcomes using decision rules
42	Formal Logic	Theory	Other Techniques	Apply logical principles in analysis
43	Content Analysis	Qualitative Analysis	Other Techniques	Analyze text or media content
44	Financial Analysis	Financial Modeling	Other Techniques	Assess financial health and metrics
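
As a small worked example of the first two techniques in the table (correlation and regression), the sketch below fits a straight line with NumPy. The data are invented, and `np.polyfit` is just one convenient way to obtain a least-squares fit.

```python
import numpy as np

# Hypothetical data: advertising spend (x) and sales (y).
ad_spend = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
sales = np.array([3.1, 4.9, 7.2, 8.8, 11.0])

# Regression: least-squares line sales = slope * ad_spend + intercept.
slope, intercept = np.polyfit(ad_spend, sales, deg=1)
predicted = slope * ad_spend + intercept

# Goodness of fit: share of variance in sales explained by the line.
ss_res = np.sum((sales - predicted) ** 2)
ss_tot = np.sum((sales - sales.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# Correlation: Pearson coefficient between the two variables.
r = np.corrcoef(ad_spend, sales)[0, 1]

print(slope, intercept, r, r_squared)
```

With a single predictor, the squared correlation coefficient equals the R² of the fitted line, which ties the two techniques together.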

Software for Data Analysis

Sr. No.	Software	Description	Authority	Techniques Performed	Variables Covered
1	SPSS	Statistical software for complex data analysis, data mining, and descriptive statistics.	IBM	Correlation, Regression, ANOVA, MANOVA, Hypothesis Testing	Social, Behavioral, Health Data
2	AMOS	Used for SEM, path analysis, and confirmatory factor analysis.	IBM	SEM, CFA, Path Analysis	Latent and Observed Variables
3	EViews	Software tailored for time-series, forecasting, and panel data econometrics.	IHS Global Inc.	Regression, Time Series, Forecasting, Panel Data Analysis	Economic, Financial Time-Series Data
4	Stata	General-purpose tool for statistical analysis, data visualization, and regression.	StataCorp	Regression, Hypothesis Testing, ANOVA, MANOVA	Public Health, Economics, Sociology
5	SmartPLS	Variance-based SEM for analyzing complex relationships between variables.	SmartPLS GmbH	SEM, Path Analysis	Latent Variables
6	WarpPLS	SEM software focused on identifying nonlinear relationships.	WarpPLS	SEM, Nonlinear Analysis	Latent and Observed Variables
7	MPlus	Versatile statistical modeling for SEM, growth modeling, and multilevel analysis.	Muthén & Muthén	SEM, Growth Modeling, Multilevel Modeling	Categorical, Continuous Variables
8	DEAP	Efficiency and performance analysis via Data Envelopment Analysis (DEA).	University of Queensland	DEA	Efficiency, Productivity Variables
9	MATLAB	High-level environment for numerical computation and visualization.	MathWorks	Regression, Fourier Transform, Wavelet Analysis	Scientific, Engineering, Financial Data
10	Python	Programming language for data analysis, machine learning, and AI.	Python Software Foundation	Regression, Clustering, Data Visualization	Wide Range (e.g., Social, Economic)
11	R	Open-source programming language for statistical computing and data visualization.	R Foundation	Regression, Clustering, Hypothesis Testing	Statistical, Financial Data
12	Lingo	Optimization and mathematical programming software.	Lindo Systems	Linear/Nonlinear Optimization, Integer Programming	Optimization, Planning Variables
13	LINDO	Used for optimization in linear, nonlinear, and quadratic programming.	Lindo Systems	Linear/Nonlinear Optimization, Integer Programming	Operations, Logistics Variables
14	NVivo	Software for qualitative data analysis, especially unstructured data.	QSR International	Content Analysis, Thematic Analysis	Qualitative Data (Interviews, Media)
15	Edraw Max	Diagramming and vector graphics for flowcharts, mind maps, and diagrams.	Wondershare	Visualization, Flowcharting	Visual Structures
16	LISREL	SEM and path analysis software focused on linear structural relations.	SSI	SEM, Path Analysis	Latent, Structural Variables
17	KH Coder	Text mining for quantitative content and text data analysis.	Koichi Higuchi	Text Mining, Content Analysis	Textual Data (Social Media, Reviews)
18	ATLAS.ti	Qualitative data analysis for text, audio, and video.	ATLAS.ti	Content Analysis, Thematic Coding	Qualitative Data (Interviews, Media)
19	MathPix Snip	Converts handwritten math to LaTeX for digital analysis or research papers.	MathPix	Data Conversion, Equation Parsing	Mathematical Equations
20	Smart ISM	Tool related to ISM for analyzing complex system relationships.	Smart ISM	ISM, Structural Analysis	Complex System Variables
21	ISM	Interpretive Structural Modeling for exploring relationships among variables in complex systems.	Various	ISM, Structural Analysis	System Components, Hierarchical Variables
22	RIDIT Analysis	Non-parametric technique for comparing distributions, especially for ordinal data.	-	RIDIT Analysis	Ordinal Data

Greek Alphabet and Its Pronunciation

Upper Case	Lower Case	Full Name and Pronunciation	Purpose (Common Uses)	Formula Example
Α	α	Alpha (AL-fuh)	Significance level in statistics	$\alpha = 0.05$ (significance level)
Β	β	Beta (BAY-tuh)	Type II error in statistics, regression coefficients	$y = \beta_0 + \beta_1 x$
Γ	γ	Gamma (GAM-uh)	Gamma function in mathematics	$\Gamma(n) = (n - 1)!$ (for positive integer $n$)
Δ	δ	Delta (DEL-tuh)	Change or difference in variables	$\Delta x = x_2 - x_1$
Ε	ε	Epsilon (EP-sil-on)	Arbitrary small quantity in limits	$|f(x) - L| < \epsilon$
Ζ	ζ	Zeta (ZAY-tuh)	Riemann zeta function in number theory	$\zeta(s) = \sum_{n=1}^{\infty} \frac{1}{n^s}$
Η	η	Eta (AY-tuh)	Efficiency in physics	$\eta = \frac{\text{useful energy output}}{\text{total energy input}}$
Θ	θ	Theta (THAY-tuh)	Angle measurement, statistical parameters	$\theta = 45^\circ$
Ι	ι	Iota (eye-OH-tuh)	Rarely used in formulas	-
Κ	κ	Kappa (KAP-uh)	Curvature in geometry, condition number in linear algebra	$\kappa = \frac{1}{R}$ (curvature)
Λ	λ	Lambda (LAM-duh)	Wavelength, eigenvalues in linear algebra	$\lambda = \frac{c}{f}$ (wavelength)
Μ	μ	Mu (MYOO)	Mean in statistics, coefficient of friction	$\mu = \frac{\sum X}{N}$ (mean)
Ν	ν	Nu (NOO)	Frequency in physics	$\nu = \frac{c}{\lambda}$ (frequency)
Ξ	ξ	Xi (KS-EYE)	Random variable in statistics	$\xi = X$ (random variable)
Ο	ο	Omicron (OM-i-KRON)	Rarely used in formulas	-
Π	π	Pi (PIE)	Ratio of circumference to diameter in circles	$\pi \approx 3.14159$
Ρ	ρ	Rho (ROW)	Density in physics, correlation coefficient in statistics	$\rho = \frac{m}{V}$ (density)
Σ	σ	Sigma (SIG-muh)	Summation, standard deviation	$\sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}}$ (standard deviation)
Τ	τ	Tau (TAU)	Torque, time constant in RC circuits	$\tau = R \times C$ (time constant)
Υ	υ	Upsilon (OOP-si-LON)	Upsilon particles in physics	-
Φ	φ	Phi (FEE)	Golden ratio, angle in polar coordinates	$\phi = \frac{1 + \sqrt{5}}{2} \approx 1.618$ (golden ratio)
Χ	χ	Chi (K-EYE)	Chi-square distribution	$\chi^2 = \sum \frac{(O - E)^2}{E}$
Ψ	ψ	Psi (SIGH)	Wave function in quantum mechanics	$\psi(x, t) = A e^{i(kx - \omega t)}$
Ω	ω	Omega (oh-MAY-guh)	Angular velocity, ohms in electrical resistance	$\omega = \frac{v}{r}$
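
Two of the statistical formulas in the table, the population standard deviation (σ) and the chi-square statistic (χ²), translate almost symbol for symbol into Python. The sketch below uses invented numbers, and the function names are illustrative.

```python
import math


def population_std(values):
    """sigma = sqrt(sum((X - mu)^2) / N), with mu = sum(X) / N."""
    n = len(values)
    mu = sum(values) / n
    return math.sqrt(sum((x - mu) ** 2 for x in values) / n)


def chi_square(observed, expected):
    """chi^2 = sum((O - E)^2 / E) over matching categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical measurements.
sigma = population_std([2, 4, 4, 4, 5, 5, 7, 9])

# Hypothetical counts: 100 customers across four stores, expected to split evenly.
chi2 = chi_square([18, 22, 20, 40], [25, 25, 25, 25])

print(sigma, chi2)
```

A larger χ² value indicates that the observed counts sit further from the expected ones; judging significance additionally requires the degrees of freedom and a chosen α.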
