Detailed Overview of Data

Week 02: Data Manipulation and Cleaning with Pandas

Objective: Teach participants to clean and prepare data for analysis, handling common issues in business datasets.

  • Importing and exporting data (CSV, Excel, JSON)
  • Data manipulation with Pandas: selecting, filtering, merging, and aggregating data
  • Data cleaning: handling missing values, duplicate data, outliers, and data types
  • Exploratory Data Analysis (EDA) techniques for business insights
  • Hands-on exercise: Cleaning and preparing a business dataset
  • Mini-project: Using Pandas to prepare a dataset for analysis (e.g., sales or financial data)
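The topics listed above can be sketched in a few lines of Pandas. This is a minimal illustration rather than the course's own exercise: the sales data, the column names (`order_id`, `region`, `amount`, `order_date`), and the choice of median imputation are all assumptions made for the example.

```python
import io
import pandas as pd

# Hypothetical raw sales data with one missing amount and one duplicate row.
raw_csv = io.StringIO(
    "order_id,region,amount,order_date\n"
    "1,North,100.0,2024-01-05\n"
    "2,South,,2024-01-06\n"
    "2,South,,2024-01-06\n"
    "3,North,250.5,2024-01-07\n"
    "4,East,90.0,2024-01-08\n"
)

df = pd.read_csv(raw_csv)                                   # import
df = df.drop_duplicates()                                   # remove duplicate rows
df["amount"] = df["amount"].fillna(df["amount"].median())   # fill missing values
df["order_date"] = pd.to_datetime(df["order_date"])         # fix the data type

north = df[df["region"] == "North"]                         # filter rows
totals = df.groupby("region")["amount"].sum()               # aggregate by group

csv_text = df.to_csv(index=False)                           # export as CSV text
print(totals)
```

Median imputation is only one option; dropping incomplete rows or filling with a business-specific default may suit a given dataset better.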

Detailed Overview of Data

Category of data

1. Primary Data

Primary data is original and collected firsthand by the researcher for a specific purpose. It is typically gathered through methods like surveys, experiments, or direct observation. Primary data can further be divided into:

  • Interactive: Data collected through direct interaction with subjects, like interviews, surveys, or questionnaires. Example: Conducting a survey to understand consumer preferences.
  • Observational: Data collected by observing subjects without direct interaction, typically in a natural setting. Example: Observing customer behavior in a retail store.
  • Simulation: Data generated by simulating a real-world process or system under controlled conditions. Example: Running a simulation to understand the potential spread of a virus in a population.

2. Secondary Data

Secondary data is pre-existing data that was collected by someone else, usually for a different purpose. It can be accessed through sources like research publications, databases, and historical records. Secondary data can be categorized as:

  • Cross-Sectional: Data collected at a single point in time, providing a snapshot of a particular phenomenon. Example: Census data that captures the demographic distribution at one specific time.
  • Time Series: Data collected over a period of time, allowing analysis of trends and changes. Example: Monthly unemployment rates collected over several years to study economic trends.
  • Panel: A type of data that involves multiple observations over time for the same subjects, combining elements of cross-sectional and time-series data. Example: Tracking the income levels of the same group of individuals over several years.

Each category serves different research purposes, and the choice depends on the nature of the study, the research question, and the availability of resources.
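The relationship between cross-sectional, time-series, and panel data can be shown with a small sketch. The records below are invented for illustration: a panel holds the same subjects over several periods, a cross-section is one period of it, and a time series is one subject of it.

```python
# Hypothetical panel: income of the same two individuals over three years.
panel = [
    {"person": "A", "year": 2021, "income": 50_000},
    {"person": "A", "year": 2022, "income": 52_000},
    {"person": "A", "year": 2023, "income": 55_000},
    {"person": "B", "year": 2021, "income": 40_000},
    {"person": "B", "year": 2022, "income": 41_500},
    {"person": "B", "year": 2023, "income": 43_000},
]

# Cross-sectional view: every subject at a single point in time.
cross_section_2022 = [row for row in panel if row["year"] == 2022]

# Time-series view: a single subject followed over time.
time_series_a = [(row["year"], row["income"]) for row in panel if row["person"] == "A"]

print(cross_section_2022)
print(time_series_a)
```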

Data Types in Statistics

1. Quantitative Data:

This type of data can be measured numerically and is often divided into Continuous and Discrete types.

  • Continuous Data: This type of data has a continuous range of values, like length, weight, or temperature. It can be measured in fractions or decimals.
  • Discrete Data: This type of data includes specific values only, such as the number of students in a class or the count of coins.

2. Qualitative Data:

This data cannot be measured numerically and is instead categorized based on attributes or qualities.

  • Nominal Data: In this type, data is categorized by names or labels, without any inherent order, such as gender (male, female) or color (red, blue).
  • Ordinal Data: This type has a clear order or ranking among categories, but the intervals between them cannot be measured, such as rankings like "Good," "Very Good," "Excellent."
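One possible way to represent these four data types in Pandas is sketched below. The column names and values are hypothetical; the point is that nominal data maps to an unordered category, while ordinal data maps to an ordered category that supports comparisons.

```python
import pandas as pd

df = pd.DataFrame({
    "height_cm": [170.2, 165.5, 180.0],            # continuous (quantitative)
    "num_children": [0, 2, 1],                     # discrete (quantitative)
    "color": ["red", "blue", "red"],               # nominal (qualitative)
    "rating": ["Good", "Excellent", "Very Good"],  # ordinal (qualitative)
})

# Nominal: categories with no inherent order.
df["color"] = df["color"].astype("category")

# Ordinal: categories with an explicit ranking.
rating_order = ["Good", "Very Good", "Excellent"]
df["rating"] = pd.Categorical(df["rating"], categories=rating_order, ordered=True)

print(df.dtypes)
print(df["rating"].max())                  # highest-ranked category present
print((df["rating"] > "Good").tolist())    # comparisons respect the ranking
```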

Design Data (Social Sciences) vs Organic Data (Computer Sciences)

In social sciences, "design data" refers to structured data collected through planned research designs, such as surveys, experiments, and observational studies. This data is purposefully gathered to answer specific questions or test hypotheses within a controlled environment, allowing researchers to analyze behavior, attitudes, or interactions systematically.

In computer science, "organic data" refers to naturally occurring data, often unstructured, that is generated without a specific research design in mind. This data comes from real-world sources like social media posts, sensor outputs, transaction logs, and web activity. Organic data is typically analyzed to identify patterns, trends, and insights in a way that reflects real-world, spontaneous behavior rather than controlled experimental settings.

Population and Sample in Data Analysis

  • Population: Refers to the complete group of individuals or items sharing a specific characteristic and targeted in a study. Researchers define the population based on study goals. Example: Studying the eating habits of college students in Pakistan would have all Pakistani college students as the population.
  • Sample: A subset of the population chosen for analysis, often due to practical constraints (time, cost, etc.). Ideally, the sample is representative of the population to generalize findings. Example: Selecting 500 students from different universities across Pakistan as a sample to represent all college students.

Types of Sampling:

  1. Probability Sampling: Each population member has a known, non-zero chance of selection. This method is randomized and aims to create a representative sample, enhancing result generalization.
    • Types: Simple Random, Systematic, Stratified, and Cluster Sampling.
  2. Non-Probability Sampling: Selection chances are not known for all members, making it less representative and potentially introducing bias. It’s often faster and cheaper but limits generalizability.
    • Types: Convenience, Purposive (Judgmental), Quota, and Snowball Sampling.
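Three of the probability sampling methods named above can be sketched with Python's standard library. The population of 1,000 students spread over five universities is invented for illustration, and the function names are hypothetical helpers, not part of any library.

```python
import random
from collections import defaultdict

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 1,000 students spread evenly over 5 universities.
population = [{"id": i, "university": f"U{i % 5}"} for i in range(1000)]

def simple_random_sample(pop, n):
    """Every member has the same chance of being selected."""
    return rng.sample(pop, n)

def systematic_sample(pop, n):
    """Pick a random start, then take every k-th member."""
    step = len(pop) // n
    start = rng.randrange(step)
    return pop[start::step][:n]

def stratified_sample(pop, n, key):
    """Sample each stratum in proportion to its share of the population."""
    strata = defaultdict(list)
    for member in pop:
        strata[member[key]].append(member)
    sample = []
    for members in strata.values():
        k = round(n * len(members) / len(pop))
        sample.extend(rng.sample(members, k))
    return sample

srs = simple_random_sample(population, 100)
systematic = systematic_sample(population, 100)
stratified = stratified_sample(population, 100, "university")
print(len(srs), len(systematic), len(stratified))
```

With strata of unequal size, the rounded per-stratum counts may not add up exactly to the requested sample size, so a production version would need to reconcile the remainder.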

Sample Size Determination Using Krejcie and Morgan Table

Population (N) | Sample Size (S)
10 | 10
50 | 44
100 | 80
200 | 132
500 | 217
1,000 | 278
5,000 | 357
10,000 | 370
50,000 | 381
100,000 | 384
1,000,000 | 384
  • Source for detail: Sample Size Determination Using Krejcie and Morgan Table
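The table values can be approximated with the formula commonly attributed to Krejcie and Morgan (1970), S = X²NP(1−P) / (d²(N−1) + X²P(1−P)). The formula itself is not given in this lesson, so the sketch below is an assumption about how the table was built, using the usual defaults: X² = 3.841 (chi-square for 1 degree of freedom at 95% confidence), P = 0.5, and d = 0.05. The function name is hypothetical.

```python
def krejcie_morgan_sample_size(population_size, chi_square=3.841,
                               proportion=0.5, margin=0.05):
    """Approximate the required sample size for a finite population."""
    variability = chi_square * proportion * (1 - proportion)
    numerator = variability * population_size
    denominator = margin ** 2 * (population_size - 1) + variability
    return round(numerator / denominator)

for n in (10, 100, 500, 1_000, 10_000, 1_000_000):
    print(n, krejcie_morgan_sample_size(n))
```

Published versions of the table can differ from the formula by a unit because of rounding; for N = 100,000, for instance, this sketch gives about 383 rather than 384.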

Pandas

  • Pandas, named after "panel data" from econometrics, was created by Wes McKinney to simplify handling multi-dimensional, labeled, and time-series data in Python. It emerged as a tool to streamline complex data manipulation tasks, making them accessible with just a few lines of code.
  • The playful panda bear reference added a friendly touch, making pandas not only technically powerful but also memorable and approachable in the Python ecosystem. Today, it’s an essential library for data manipulation, beloved by data scientists and analysts alike.

"Probability and Statistics for Computer Science"

It covers several key concepts:

  • Probability Basics: Fundamental principles of probability, including definitions, types of probabilities, and basic operations like union and intersection of events.
  • Random Variables and Distributions: Explanation of random variables, both discrete and continuous, and how probability distributions apply to each. It also includes types of distributions like binomial, Poisson, and normal.
  • Central Tendency and Dispersion: Measures like mean, median, mode, variance, and standard deviation to describe the central point and spread of data. Includes tools like quartiles and boxplots.
  • Entropy and Information Theory: Introduction to entropy as a measure of uncertainty or disorder in data, as well as its applications in machine learning and communication systems.
  • Data Visualization: Techniques for visualizing data through graphs such as histograms, contour plots, and scatter plots to better understand data distribution and relationships.
  • Correlation and Dependence: Concepts of correlation, dependence, and methods to analyze the relationship between two or more variables, including covariance and correlation coefficients.
  • Statistical Hypothesis Testing: Introduction to hypothesis testing for statistical inference, including p-values and confidence intervals, commonly used to validate or refute assumptions about data.
  • Regression Analysis: Basics of linear regression, modeling relationships between variables, and evaluating the fit of these models.
  • Time Series Analysis: Methods for analyzing data points collected or sequenced over time, including trends, seasonality, and smoothing techniques.
  • Applications in Machine Learning: Use of statistical concepts, especially entropy and data distributions, in machine learning to improve model training, understanding, and evaluation.
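A few of the concepts listed above (central tendency, dispersion, entropy, and correlation) can be computed with Python's standard library alone. The numbers are invented for illustration, and `shannon_entropy` and `pearson_r` are hypothetical helper names written for this sketch.

```python
import math
import statistics
from collections import Counter

sales = [12, 15, 15, 18, 20, 22, 40]  # hypothetical daily sales

mean = statistics.mean(sales)
median = statistics.median(sales)
mode = statistics.mode(sales)
spread = statistics.pstdev(sales)            # population standard deviation
quartiles = statistics.quantiles(sales, n=4)  # three cut points: Q1, Q2, Q3

def shannon_entropy(labels):
    """Entropy in bits: higher values mean more uncertainty."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(mean, median, mode, spread, quartiles)
print(shannon_entropy(["a", "b", "a", "b"]))   # two equally likely labels
print(pearson_r([1, 2, 3], [2, 4, 6]))         # perfectly linear relationship
```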

Sources of Secondary Data:

Sr. No. | Source Name | Authority | Industry / Indicator Type | Source Locale
1 | Pakistan Bureau of Statistics | Government of Pakistan | National Statistics | Local
2 | Punjab Statistics Department | Government of Punjab | Provincial Statistics | Local
3 | Insurance Association of Pakistan | IAP | Insurance Industry | Local
4 | National Database & Registration Authority | NADRA | Population and National ID Registration | Local
5 | KSE Stocks | Pakistan Stock Exchange | Stock Market | Local
6 | Pakistan Bankers Association | PBA | Banking Industry | Local
7 | Pakistan Stock Exchange | PSX | Stock Market | Local
8 | State Bank of Pakistan | SBP | Monetary Policy and Banking Indicators | Local
9 | Securities and Exchange Commission | SECP | Corporate Governance | Local
10 | Crypto Data Download | Binance | Cryptocurrency | International
11 | Bloomberg | Bloomberg | Global Finance and Economic Data | International
12 | Federal Reserve Economic Data (FRED) | Federal Reserve | U.S. Economic Indicators | International
13 | Economic Freedom Index | Heritage Foundation | Economic Freedom | International
14 | Additional IMF Data | International Monetary Fund | Financial Statistics | International
15 | IMF Data | International Monetary Fund | Global Economic Indicators | International
16 | KOF Globalisation Index | KOF Swiss Economic Institute | Globalisation Indicators | International
17 | Oxford Economics Group | Oxford Economics | Global Economic Analysis | International
18 | Datastream | Thomson Reuters | Financial and Economic Data | International
19 | Trading Economics | Trading Economics | Economic Indicators, Global Statistics | International
20 | International Financial Statistics | United Nations | International Finance | International
21 | University of Gothenburg Databases | University of Gothenburg | Academic and Economic Research | International
22 | World Bank Data | World Bank | Global Development Indicators | International
23 | World Bank Finance | World Bank | Global Financial Indicators | International
24 | World Development Indicators | World Bank | Economic Development | International
25 | Yahoo Finance | Yahoo | General Finance and Stock Market | International

Mathematical and Statistical Techniques for Applied Data Analysis

Sr. No. | Technique | Type | Category | Purpose
1 | Correlation | Analysis | Statistical Techniques | Measure relationship between variables
2 | Regression | Analysis | Statistical Techniques | Predict values based on relationships between variables
3 | Multiple Regression Analysis | Analysis | Statistical Techniques | Examine relationships with multiple independent variables
4 | Logistic Regression Analysis | Analysis | Statistical Techniques | Predict binary outcomes
5 | Structural Equation Modeling (SEM) | Theory/Modeling | Statistical Techniques | Examine complex relationships among variables
6 | Canonical Correlation | Analysis | Statistical Techniques | Analyze relationships between multiple independent and dependent variables
7 | Discriminant Analysis | Classification | Statistical Techniques | Classify observations into predefined groups
8 | Factor Analysis | Data Reduction | Statistical Techniques | Identify underlying factors in data
9 | Confirmatory Factor Analysis (CFA) | Technique | Statistical Techniques | Validate hypothesized factor structure
10 | Exploratory Factor Analysis (EFA) | Technique | Statistical Techniques | Identify latent structure
11 | Principal Component Analysis (PCA) | Data Reduction | Statistical Techniques | Reduce data dimensionality
12 | Analysis of Variance (ANOVA) | Hypothesis Testing | Statistical Techniques | Test group mean differences
13 | Multivariate Analysis of Variance (MANOVA) | Hypothesis Testing | Statistical Techniques | Test group differences across multiple variables
14 | Multivariate Analysis of Covariance (MANCOVA) | Hypothesis Testing | Statistical Techniques | Control covariates when comparing groups
15 | Cluster Analysis | Grouping | Statistical Techniques | Group data based on similarity
16 | Multidimensional Scaling (MDS) | Visualization | Statistical Techniques | Spatially represent data similarities
17 | Correspondence Analysis | Dimension Reduction | Statistical Techniques | Reduce dimensions for categorical data
18 | Conjoint Analysis | Preference Analysis | Statistical Techniques | Analyze preferences among multiple attributes
19 | Hypothesis Testing | Statistical Testing | Statistical Techniques | Test assumptions in data
20 | T-Test | Hypothesis Testing | Statistical Techniques | Compare means of two groups
21 | Z-Test | Hypothesis Testing | Statistical Techniques | Compare population proportions
22 | Chi-Square Test | Hypothesis Testing | Statistical Techniques | Test for independence or goodness of fit
23 | F-Test | Hypothesis Testing | Statistical Techniques | Compare variances
24 | Interpretive Structural Modeling (ISM) | Modeling | Mathematical Techniques | Model complex system structures
25 | Total Interpretive Structural Modeling (TISM) | Modeling | Mathematical Techniques | ISM with total interpretation
26 | Modified Interpretive Structural Modeling | Modeling | Mathematical Techniques | Modified ISM for specific applications
27 | Polarized Interpretive Structural Modeling | Modeling | Mathematical Techniques | Polar perspective in ISM
28 | Fuzzy Interpretive Structural Modeling | Modeling | Mathematical Techniques | Incorporate fuzzy logic in ISM
29 | Matrix of Cross-Impact Multiplication Applied to Classification (MICMAC) | Classification | Mathematical Techniques | Classify variables by impact and dependency
30 | Grey Relational Analysis (GRA) | Decision Making | Mathematical Techniques | Rank alternatives based on criteria
31 | Multi-Criteria Decision Making (MCDM) | Decision Making | Mathematical Techniques | Evaluate multiple criteria for decision making
32 | Technique for Order Preference by Similarity to Ideal Solution (TOPSIS) | Decision Making | Mathematical Techniques | Rank options by closeness to ideal solution
33 | Stepwise Weight Assessment Ratio Analysis (SWARA) | Weight Assessment | Mathematical Techniques | Assess criteria weight in decision making
34 | Vise Kriterijumska Optimizacija I Kompromisno Resenje (VIKOR) | Decision Making | Mathematical Techniques | Determine compromise solutions
35 | Decision-Making Trial and Evaluation Laboratory (DEMATEL) | Decision Making | Mathematical Techniques | Identify relationships between factors
36 | Elimination and Choice Expressing Reality (ELECTRE I & II) | Decision Making | Mathematical Techniques | Multicriteria ranking and selection
37 | Preference Ranking Organization Method for Enrichment Evaluation (PROMETHEE) | Decision Making | Mathematical Techniques | Preference ranking for alternatives
38 | Wavelet Analysis | Signal Processing | Mathematical Techniques | Analyze frequency and time components
39 | Fourier Transform | Signal Processing | Mathematical Techniques | Transform data to frequency domain
40 | Data Envelopment Analysis (DEA) | Efficiency Analysis | Mathematical Techniques | Measure efficiency in production
41 | Decision Tree Analysis | Predictive Modeling | Other Techniques | Classify or predict outcomes using decision rules
42 | Formal Logic | Theory | Other Techniques | Apply logical principles in analysis
43 | Content Analysis | Qualitative Analysis | Other Techniques | Analyze text or media content
44 | Financial Analysis | Financial Modeling | Other Techniques | Assess financial health and metrics

Software for Data Analysis

Sr. No. | Software | Description | Authority | Techniques Performed | Variables Covered
1 | SPSS | Statistical software for complex data analysis, data mining, and descriptive statistics. | IBM | Correlation, Regression, ANOVA, MANOVA, Hypothesis Testing | Social, Behavioral, Health Data
2 | AMOS | Used for SEM, path analysis, and confirmatory factor analysis. | IBM | SEM, CFA, Path Analysis | Latent and Observed Variables
3 | EViews | Software tailored for time-series, forecasting, and panel data econometrics. | IHS Global Inc. | Regression, Time Series, Forecasting, Panel Data Analysis | Economic, Financial Time-Series Data
4 | Stata | General-purpose tool for statistical analysis, data visualization, and regression. | StataCorp | Regression, Hypothesis Testing, ANOVA, MANOVA | Public Health, Economics, Sociology
5 | SmartPLS | Variance-based SEM for analyzing complex relationships between variables. | SmartPLS GmbH | SEM, Path Analysis | Latent Variables
6 | WarpPLS | SEM software focused on identifying nonlinear relationships. | WarpPLS | SEM, Nonlinear Analysis | Latent and Observed Variables
7 | MPlus | Versatile statistical modeling for SEM, growth modeling, and multilevel analysis. | Muthén & Muthén | SEM, Growth Modeling, Multilevel Modeling | Categorical, Continuous Variables
8 | DEAP | Efficiency and performance analysis via Data Envelopment Analysis (DEA). | University of Queensland | DEA | Efficiency, Productivity Variables
9 | MATLAB | High-level environment for numerical computation and visualization. | MathWorks | Regression, Fourier Transform, Wavelet Analysis | Scientific, Engineering, Financial Data
10 | Python | Programming language for data analysis, machine learning, and AI. | Python Software Foundation | Regression, Clustering, Data Visualization | Wide Range (e.g., Social, Economic)
11 | R | Open-source programming language for statistical computing and data visualization. | R Foundation | Regression, Clustering, Hypothesis Testing | Statistical, Financial Data
12 | Lingo | Optimization and mathematical programming software. | Lindo Systems | Linear/Nonlinear Optimization, Integer Programming | Optimization, Planning Variables
13 | LINDO | Used for optimization in linear, nonlinear, and quadratic programming. | Lindo Systems | Linear/Nonlinear Optimization, Integer Programming | Operations, Logistics Variables
14 | NVivo | Software for qualitative data analysis, especially unstructured data. | QSR International | Content Analysis, Thematic Analysis | Qualitative Data (Interviews, Media)
15 | Edraw Max | Diagramming and vector graphics for flowcharts, mind maps, and diagrams. | Wondershare | Visualization, Flowcharting | Visual Structures
16 | LISREL | SEM and path analysis software focused on linear structural relations. | SSI | SEM, Path Analysis | Latent, Structural Variables
17 | KH Coder | Text mining for quantitative content and text data analysis. | Koichi Higuchi | Text Mining, Content Analysis | Textual Data (Social Media, Reviews)
18 | ATLAS.ti | Qualitative data analysis for text, audio, and video. | ATLAS.ti | Content Analysis, Thematic Coding | Qualitative Data (Interviews, Media)
19 | MathPix Snip | Converts handwritten math to LaTeX for digital analysis or research papers. | MathPix | Data Conversion, Equation Parsing | Mathematical Equations
20 | Smart ISM | Tool related to ISM for analyzing complex system relationships. | Smart ISM | ISM, Structural Analysis | Complex System Variables
21 | ISM | Interpretive Structural Modeling for exploring relationships among variables in complex systems. | Various | ISM, Structural Analysis | System Components, Hierarchical Variables
22 | RIDIT Analysis | Non-Parametric Analysis | Statistical Techniques | Compare distributions, especially for ordinal data | Mid Term

Greek Alphabet and Its Pronunciation

Upper Case | Lower Case | Full Name and Pronunciation | Purpose (Common Uses) | Formula Example
Α | α | Alpha (AL-fuh) | Significance level in statistics | $\alpha = 0.05$ (significance level)
Β | β | Beta (BAY-tuh) | Type II error in statistics, regression coefficients | $y = \beta_0 + \beta_1 x$
Γ | γ | Gamma (GAM-uh) | Gamma function in mathematics | $\Gamma(n) = (n-1)!$ (for positive integer $n$)
Δ | δ | Delta (DEL-tuh) | Change or difference in variables | $\Delta x = x_2 - x_1$
Ε | ε | Epsilon (EP-sil-on) | Arbitrary small quantity in limits | $\lim_{x \to a} f(x) = L \pm \epsilon$
Ζ | ζ | Zeta (ZAY-tuh) | Riemann zeta function in number theory | $\zeta(s) = \sum_{n=1}^{\infty} \frac{1}{n^s}$
Η | η | Eta (AY-tuh) | Efficiency in physics | $\eta = \frac{\text{useful energy output}}{\text{total energy input}}$
Θ | θ | Theta (THAY-tuh) | Angle measurement, statistical parameters | $\theta = 45^\circ$
Ι | ι | Iota (eye-OH-tuh) | Rarely used in formulas | -
Κ | κ | Kappa (KAP-uh) | Curvature in geometry, condition number in linear algebra | $\kappa = \frac{1}{R}$ (curvature)
Λ | λ | Lambda (LAM-duh) | Wavelength, eigenvalues in linear algebra | $\lambda = \frac{c}{f}$ (wavelength)
Μ | μ | Mu (MYOO) | Mean in statistics, coefficient of friction | $\mu = \frac{\sum X}{N}$ (mean)
Ν | ν | Nu (NOO) | Frequency in physics | $\nu = \frac{c}{\lambda}$ (frequency)
Ξ | ξ | Xi (KS-EYE) | Random variable in statistics | $\xi = X$ (random variable)
Ο | ο | Omicron (OM-i-KRON) | Rarely used in formulas | -
Π | π | Pi (PIE) | Ratio of circumference to diameter in circles | $\pi \approx 3.14159$
Ρ | ρ | Rho (ROW) | Density in physics, correlation coefficient in statistics | $\rho = \frac{m}{V}$ (density)
Σ | σ | Sigma (SIG-muh) | Summation, standard deviation | $\sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}}$ (standard deviation)
Τ | τ | Tau (TAU) | Torque, time constant in RC circuits | $\tau = R \times C$ (time constant)
Υ | υ | Upsilon (OOP-si-LON) | Upsilon particles in physics | -
Φ | φ | Phi (FEE) | Golden ratio, angle in polar coordinates | $\phi = \frac{1 + \sqrt{5}}{2} \approx 1.618$ (golden ratio)
Χ | χ | Chi (K-EYE) | Chi-square distribution | $\chi^2 = \sum \frac{(O - E)^2}{E}$
Ψ | ψ | Psi (SIGH) | Wave function in quantum mechanics | $\psi(x, t) = A e^{i(kx - \omega t)}$
Ω | ω | Omega (oh-MAY-guh) | Angular velocity, ohms in electrical resistance | $\omega = \frac{v}{r}$ (angular velocity)
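Two of the statistical formulas in the table, the population standard deviation (σ) and the chi-square statistic (χ²), translate directly into code. The sketch below is illustrative only: the function names and the sample numbers are assumptions, not part of the lesson.

```python
import math

def population_std(values):
    """sigma = sqrt(sum((X - mu)^2) / N), with mu the population mean."""
    mu = sum(values) / len(values)
    return math.sqrt(sum((x - mu) ** 2 for x in values) / len(values))

def chi_square_statistic(observed, expected):
    """chi^2 = sum((O - E)^2 / E) over matching observed/expected counts."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

golden_ratio = (1 + math.sqrt(5)) / 2  # phi, approximately 1.618

print(population_std([2, 4, 4, 4, 5, 5, 7, 9]))    # 2.0
print(chi_square_statistic([18, 22], [20, 20]))    # 0.4
print(round(golden_ratio, 3))                      # 1.618
```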
 

