Detailed Overview of Data
Week 02: Data Manipulation and Cleaning with Pandas
Objective: Teach participants to clean and prepare data for analysis, handling common issues in business datasets.
- Importing and exporting data (CSV, Excel, JSON)
- Data manipulation with Pandas: selecting, filtering, merging, and aggregating data
- Data cleaning: handling missing values, duplicate data, outliers, and data types
- Exploratory Data Analysis (EDA) techniques for business insights
- Hands-on exercise: Cleaning and preparing a business dataset
- Mini-project: Using Pandas to prepare a dataset for analysis (e.g., sales or financial data)
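The topics in this outline can be sketched end to end on a tiny in-memory dataset. This is a minimal illustration, not the course's own exercise: the column names (`order_id`, `region`, `amount`, `order_date`), the values, the median fill, and the 1.5 × IQR outlier rule are all assumptions chosen for the example.

```python
import io

import pandas as pd

# Hypothetical sales extract; columns and values are invented for illustration.
raw_csv = io.StringIO(
    "order_id,region,amount,order_date\n"
    "1,North,120.5,2024-01-05\n"
    "2,South,,2024-01-06\n"
    "2,South,,2024-01-06\n"
    "3,North,98.0,2024-01-07\n"
    "4,East,15000.0,2024-01-08\n"
)

# Import: read the CSV and parse the date column into a datetime dtype.
df = pd.read_csv(raw_csv, parse_dates=["order_date"])

# Duplicates: drop rows that are exact copies of an earlier row.
df = df.drop_duplicates().reset_index(drop=True)

# Missing values: fill missing amounts with the median (one possible choice).
df["amount"] = df["amount"].fillna(df["amount"].median())

# Outliers: flag amounts outside 1.5 * IQR (a common rule of thumb).
q1, q3 = df["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
df["is_outlier"] = (df["amount"] < q1 - 1.5 * iqr) | (df["amount"] > q3 + 1.5 * iqr)

# Aggregation: total amount per region.
sales_by_region = df.groupby("region")["amount"].sum()

# Export: write the cleaned data back out as CSV text.
clean_csv = df.to_csv(index=False)
```

Whether to fill, drop, or keep missing values and outliers depends on the business question; the choices above are only one reasonable default.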
Categories of Data
1. Primary Data
Primary data is original and collected firsthand by the researcher for a specific purpose. It is typically gathered through methods like surveys, experiments, or direct observation. Primary data can further be divided into:
- Interactive: Data collected through direct interaction with subjects, like interviews, surveys, or questionnaires. Example: Conducting a survey to understand consumer preferences.
- Observational: Data collected by observing subjects without direct interaction, typically in a natural setting. Example: Observing customer behavior in a retail store.
- Simulation: Data generated by simulating a real-world process or system under controlled conditions. Example: Running a simulation to understand the potential spread of a virus in a population.
2. Secondary Data
Secondary data is pre-existing data that was collected by someone else, usually for a different purpose. It can be accessed through sources like research publications, databases, and historical records. Secondary data can be categorized as:
- Cross-Sectional: Data collected at a single point in time, providing a snapshot of a particular phenomenon. Example: Census data that captures the demographic distribution at one specific time.
- Time Series: Data collected over a period of time, allowing analysis of trends and changes. Example: Monthly unemployment rates collected over several years to study economic trends.
- Panel: A type of data that involves multiple observations over time for the same subjects, combining elements of cross-sectional and time-series data. Example: Tracking the income levels of the same group of individuals over several years.
Each category serves different research purposes, and the choice depends on the nature of the study, the research question, and the availability of resources.
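The difference between cross-sectional, time-series, and panel data is easy to see in Pandas. The sketch below uses a small invented panel (three people, three years, incomes in arbitrary units); the names `person`, `year`, and `income` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical panel: the same three people observed in three years.
panel = pd.DataFrame(
    {
        "person": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
        "year": [2021, 2022, 2023] * 3,
        "income": [40, 42, 45, 55, 54, 58, 30, 33, 35],
    }
).set_index(["person", "year"])

# Cross-sectional view: every subject at a single point in time.
cross_section_2022 = panel.xs(2022, level="year")

# Time-series view: a single subject followed over time.
series_a = panel.xs("A", level="person")["income"]

# Panel view: year-over-year change computed within each subject.
yearly_change = panel.groupby(level="person")["income"].diff()
```

Slicing one level of the index gives a cross-section or a time series, while operations grouped by subject use the full panel structure.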
Data Types in Statistics
1. Quantitative Data:
This type of data can be measured numerically and is often divided into Continuous and Discrete types.
- Continuous Data: This type of data has a continuous range of values, like length, weight, or temperature. It can be measured in fractions or decimals.
- Discrete Data: This type of data includes specific values only, such as the number of students in a class or the count of coins.
2. Qualitative Data:
This data cannot be measured numerically and is instead categorized based on attributes or qualities.
- Nominal Data: In this type, data is categorized by names or labels, without any inherent order, such as gender (male, female) or color (red, blue).
- Ordinal Data: This type has a clear order or ranking among categories, but the intervals between them cannot be measured, such as rankings like "Good," "Very Good," "Excellent."
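These four data types map naturally onto Pandas dtypes. The following sketch assumes a small invented table; the column names and values are hypothetical, and the ordering of the rating levels is taken from the ordinal example above.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "weight_kg": [61.5, 72.3, 58.0],               # continuous
        "num_courses": [4, 5, 3],                      # discrete
        "color": ["red", "blue", "red"],               # nominal
        "rating": ["Good", "Excellent", "Very Good"],  # ordinal
    }
)

# Nominal data: an unordered categorical.
df["color"] = df["color"].astype("category")

# Ordinal data: a categorical with an explicit order of levels.
rating_levels = pd.CategoricalDtype(["Good", "Very Good", "Excellent"], ordered=True)
df["rating"] = df["rating"].astype(rating_levels)

# Ordered categories support comparisons and min/max; unordered ones do not.
at_least_very_good = df[df["rating"] >= "Very Good"]
best_rating = df["rating"].max()
```

Declaring the order explicitly matters: without it, sorting a rating column would fall back to alphabetical order, which puts "Excellent" before "Good".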
Design Data (Social Sciences) vs Organic Data (Computer Sciences)
In social sciences, "design data" refers to structured data collected through planned research designs, such as surveys, experiments, and observational studies. This data is purposefully gathered to answer specific questions or test hypotheses within a controlled environment, allowing researchers to analyze behavior, attitudes, or interactions systematically.
In computer science, "organic data" refers to naturally occurring data, often unstructured, that is generated without a specific research design in mind. This data comes from real-world sources like social media posts, sensor outputs, transaction logs, and web activity. Organic data is typically analyzed to identify patterns, trends, and insights in a way that reflects real-world, spontaneous behavior rather than controlled experimental settings.
Population and Sample in Data Analysis
- Population: Refers to the complete group of individuals or items sharing a specific characteristic and targeted in a study. Researchers define the population based on study goals. Example: Studying the eating habits of college students in Pakistan would have all Pakistani college students as the population.
- Sample: A subset of the population chosen for analysis, often due to practical constraints (time, cost, etc.). Ideally, the sample is representative of the population to generalize findings. Example: Selecting 500 students from different universities across Pakistan as a sample to represent all college students.
- Types of Sampling:
  - Probability Sampling: Each population member has a known, non-zero chance of selection. This method is randomized and aims to create a representative sample, enhancing result generalization.
    - Types: Simple Random, Systematic, Stratified, and Cluster Sampling.
  - Non-Probability Sampling: Selection chances are not known for all members, making it less representative and potentially introducing bias. It’s often faster and cheaper but limits generalizability.
    - Types: Convenience, Purposive (Judgmental), Quota, and Snowball Sampling.
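The four probability sampling methods listed above can be sketched with Pandas. The sampling frame below is hypothetical (1,000 students across four invented universities), and the sample sizes and seed are arbitrary choices for illustration.

```python
import random

import pandas as pd

# Hypothetical sampling frame: 1,000 students spread over four universities.
population = pd.DataFrame(
    {
        "student_id": range(1000),
        "university": ["U1"] * 400 + ["U2"] * 300 + ["U3"] * 200 + ["U4"] * 100,
    }
)

# Simple random sampling: every student has the same chance of selection.
simple_random = population.sample(n=100, random_state=42)

# Systematic sampling: a random start, then every k-th student.
k = len(population) // 100
start = random.Random(42).randrange(k)
systematic = population.iloc[start::k]

# Stratified sampling: draw 10% from each university (stratum).
stratified = population.groupby("university").sample(frac=0.10, random_state=42)

# Cluster sampling: pick two universities at random and keep all their students.
chosen = population["university"].drop_duplicates().sample(n=2, random_state=42)
cluster = population[population["university"].isin(chosen)]
```

Stratified sampling keeps each university's share of the sample equal to its share of the population, which simple random sampling only achieves on average.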
Sample Size Determination Using Krejcie and Morgan Table
Population (N) | Sample Size (S) |
---|---|
10 | 10 |
50 | 44 |
100 | 80 |
200 | 132 |
500 | 217 |
1,000 | 278 |
5,000 | 357 |
10,000 | 370 |
50,000 | 381 |
100,000 | 384 |
1,000,000 | 384 |
- Source for detail: Sample Size Determination Using Krejcie and Morgan Table
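The table values come from the Krejcie and Morgan (1970) formula, s = χ²·N·P(1−P) / (d²(N−1) + χ²·P(1−P)), with χ² = 3.841 (one degree of freedom, 95% confidence), P = 0.5, and d = 0.05. The function below is a small sketch of that formula; its name and defaults are chosen for illustration, and because it simply rounds the result, it can differ by one from a printed table at some population sizes.

```python
def krejcie_morgan(population: int, chi_sq: float = 3.841,
                   p: float = 0.5, d: float = 0.05) -> int:
    """Approximate required sample size for a finite population.

    chi_sq: chi-square value for 1 degree of freedom at the chosen confidence level
    p:      assumed population proportion (0.5 gives the largest sample size)
    d:      margin of error expressed as a proportion
    """
    numerator = chi_sq * population * p * (1 - p)
    denominator = d**2 * (population - 1) + chi_sq * p * (1 - p)
    return round(numerator / denominator)


sizes = {n: krejcie_morgan(n) for n in (100, 500, 1000, 10000)}
```

The required sample grows quickly for small populations and then levels off near 384, which is why the last rows of the table are almost identical.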
Pandas
- Pandas, named after "panel data" from econometrics, was created by Wes McKinney to simplify handling multi-dimensional, labeled, and time-series data in Python. It emerged as a tool to streamline complex data manipulation tasks, making them accessible with just a few lines of code.
- The playful panda bear reference added a friendly touch, making pandas not only technically powerful but also memorable and approachable in the Python ecosystem. Today, it’s an essential library for data manipulation, beloved by data scientists and analysts alike.
"Probability and Statistics for Computer Science"
It covers several key concepts:
- Probability Basics: Fundamental principles of probability, including definitions, types of probabilities, and basic operations like union and intersection of events.
- Random Variables and Distributions: Explanation of random variables, both discrete and continuous, and how probability distributions apply to each. It also includes types of distributions like binomial, Poisson, and normal.
- Central Tendency and Dispersion: Measures like mean, median, mode, variance, and standard deviation to describe the central point and spread of data. Includes tools like quartiles and boxplots.
- Entropy and Information Theory: Introduction to entropy as a measure of uncertainty or disorder in data, as well as its applications in machine learning and communication systems.
- Data Visualization: Techniques for visualizing data through graphs such as histograms, contour plots, and scatter plots to better understand data distribution and relationships.
- Correlation and Dependence: Concepts of correlation, dependence, and methods to analyze the relationship between two or more variables, including covariance and correlation coefficients.
- Statistical Hypothesis Testing: Introduction to hypothesis testing for statistical inference, including p-values and confidence intervals, commonly used to validate or refute assumptions about data.
- Regression Analysis: Basics of linear regression, modeling relationships between variables, and evaluating the fit of these models.
- Time Series Analysis: Methods for analyzing data points collected or sequenced over time, including trends, seasonality, and smoothing techniques.
- Applications in Machine Learning: Use of statistical concepts, especially entropy and data distributions, in machine learning to improve model training, understanding, and evaluation.
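Several of the concepts listed above (central tendency, dispersion, correlation, and entropy) can be computed in a few lines. The sketch below uses invented monthly figures; the column names `ad_spend` and `sales` and the `segments` series are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly figures for a small business.
data = pd.DataFrame(
    {
        "ad_spend": [10, 12, 15, 18, 20, 25],
        "sales": [110, 118, 135, 150, 160, 190],
    }
)

# Central tendency and dispersion.
mean_sales = data["sales"].mean()
median_sales = data["sales"].median()
std_sales = data["sales"].std()  # sample standard deviation (ddof=1)
q1, q3 = data["sales"].quantile([0.25, 0.75])

# Correlation: Pearson coefficient between two variables.
corr = data["ad_spend"].corr(data["sales"])

# Entropy: Shannon entropy (in bits) of a categorical variable.
segments = pd.Series(["A", "A", "B", "B"])
probs = segments.value_counts(normalize=True)
entropy_bits = float(-(probs * np.log2(probs)).sum())
```

A variable split evenly between two categories has an entropy of one bit, the maximum for two outcomes; a variable with only one category would have an entropy of zero.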
Sources of Secondary Data:
Sr. No. | Source Name | Authority | Industry / Indicator Type | Source Locale |
---|---|---|---|---|
1 | Pakistan Bureau of Statistics | Government of Pakistan | National Statistics | Local |
2 | Punjab Statistics Department | Government of Punjab | Provincial Statistics | Local |
3 | Insurance Association of Pakistan | IAP | Insurance Industry | Local |
4 | National Database & Registration Authority | NADRA | Population and National ID Registration | Local |
5 | KSE Stocks | Pakistan Stock Exchange | Stock Market | Local |
6 | Pakistan Bankers Association | PBA | Banking Industry | Local |
7 | Pakistan Stock Exchange | PSX | Stock Market | Local |
8 | State Bank of Pakistan | SBP | Monetary Policy and Banking Indicators | Local |
9 | Securities and Exchange Commission of Pakistan | SECP | Corporate Governance | Local |
10 | Crypto Data Download | Binance | Cryptocurrency | International |
11 | Bloomberg | Bloomberg L.P. | Global Finance and Economic Data | International |
12 | Federal Reserve Economic Data (FRED) | Federal Reserve | U.S. Economic Indicators | International |
13 | Economic Freedom Index | Heritage Foundation | Economic Freedom | International |
14 | Additional IMF Data | International Monetary Fund | Financial Statistics | International |
15 | IMF Data | International Monetary Fund | Global Economic Indicators | International |
16 | KOF Globalisation Index | KOF Swiss Economic Institute | Globalisation Indicators | International |
17 | Oxford Economics Group | Oxford Economics | Global Economic Analysis | International |
18 | Datastream | Thomson Reuters | Financial and Economic Data | International |
19 | Trading Economics | Trading Economics | Economic Indicators, Global Statistics | International |
20 | International Financial Statistics | United Nations | International Finance | International |
21 | University of Gothenburg Databases | University of Gothenburg | Academic and Economic Research | International |
22 | World Bank Data | World Bank | Global Development Indicators | International |
23 | World Bank Finance | World Bank | Global Financial Indicators | International |
24 | World Development Indicators | World Bank | Economic Development | International |
25 | Yahoo Finance | Yahoo | General Finance and Stock Market | International |
Mathematical and Statistical Techniques for Applied Data Analysis
Sr. No. | Technique | Type | Category | Purpose |
---|---|---|---|---|
1 | Correlation | Analysis | Statistical Techniques | Measure relationship between variables |
2 | Regression | Analysis | Statistical Techniques | Predict values based on relationships between variables |
3 | Multiple Regression Analysis | Analysis | Statistical Techniques | Examine relationships with multiple independent variables |
4 | Logistic Regression Analysis | Analysis | Statistical Techniques | Predict binary outcomes |
5 | Structural Equation Modeling (SEM) | Theory/Modeling | Statistical Techniques | Examine complex relationships among variables |
6 | Canonical Correlation | Analysis | Statistical Techniques | Analyze relationships between multiple independent and dependent variables |
7 | Discriminant Analysis | Classification | Statistical Techniques | Classify observations into predefined groups |
8 | Factor Analysis | Data Reduction | Statistical Techniques | Identify underlying factors in data |
9 | Confirmatory Factor Analysis (CFA) | Technique | Statistical Techniques | Validate hypothesized factor structure |
10 | Exploratory Factor Analysis (EFA) | Technique | Statistical Techniques | Identify latent structure |
11 | Principal Component Analysis (PCA) | Data Reduction | Statistical Techniques | Reduce data dimensionality |
12 | Analysis of Variance (ANOVA) | Hypothesis Testing | Statistical Techniques | Test group mean differences |
13 | Multivariate Analysis of Variance (MANOVA) | Hypothesis Testing | Statistical Techniques | Test group differences across multiple variables |
14 | Multivariate Analysis of Covariance (MANCOVA) | Hypothesis Testing | Statistical Techniques | Control covariates when comparing groups |
15 | Cluster Analysis | Grouping | Statistical Techniques | Group data based on similarity |
16 | Multidimensional Scaling (MDS) | Visualization | Statistical Techniques | Spatially represent data similarities |
17 | Correspondence Analysis | Dimension Reduction | Statistical Techniques | Reduce dimensions for categorical data |
18 | Conjoint Analysis | Preference Analysis | Statistical Techniques | Analyze preferences among multiple attributes |
19 | Hypothesis Testing | Statistical Testing | Statistical Techniques | Test assumptions in data |
20 | T-Test | Hypothesis Testing | Statistical Techniques | Compare means of two groups |
21 | Z-Test | Hypothesis Testing | Statistical Techniques | Compare means or proportions (large samples or known variance) |
22 | Chi-Square Test | Hypothesis Testing | Statistical Techniques | Test for independence or goodness of fit |
23 | F-Test | Hypothesis Testing | Statistical Techniques | Compare variances |
24 | Interpretive Structural Modeling (ISM) | Modeling | Mathematical Techniques | Model complex system structures |
25 | Total Interpretive Structural Modeling (TISM) | Modeling | Mathematical Techniques | ISM with total interpretation |
26 | Modified Interpretive Structural Modeling | Modeling | Mathematical Techniques | Modified ISM for specific applications |
27 | Polarized Interpretive Structural Modeling | Modeling | Mathematical Techniques | Polar perspective in ISM |
28 | Fuzzy Interpretive Structural Modeling | Modeling | Mathematical Techniques | Incorporate fuzzy logic in ISM |
29 | Matrix of Cross-Impact Multiplication Applied to Classification (MICMAC) | Classification | Mathematical Techniques | Classify variables by impact and dependency |
30 | Grey Relational Analysis (GRA) | Decision Making | Mathematical Techniques | Rank alternatives based on criteria |
31 | Multi-Criteria Decision Making (MCDM) | Decision Making | Mathematical Techniques | Evaluate multiple criteria for decision making |
32 | Technique for Order Preference by Similarity to Ideal Solution (TOPSIS) | Decision Making | Mathematical Techniques | Rank options by closeness to ideal solution |
33 | Stepwise Weight Assessment Ratio Analysis (SWARA) | Weight Assessment | Mathematical Techniques | Assess criteria weight in decision making |
34 | Vise Kriterijumska Optimizacija I Kompromisno Resenje (VIKOR) | Decision Making | Mathematical Techniques | Determine compromise solutions |
35 | Decision-Making Trial and Evaluation Laboratory (DEMATEL) | Decision Making | Mathematical Techniques | Identify relationships between factors |
36 | Elimination and Choice Expressing Reality (ELECTRE I & II) | Decision Making | Mathematical Techniques | Multicriteria ranking and selection |
37 | Preference Ranking Organization Method for Enrichment Evaluation (PROMETHEE) | Decision Making | Mathematical Techniques | Preference ranking for alternatives |
38 | Wavelet Analysis | Signal Processing | Mathematical Techniques | Analyze frequency and time components |
39 | Fourier Transform | Signal Processing | Mathematical Techniques | Transform data to frequency domain |
40 | Data Envelopment Analysis (DEA) | Efficiency Analysis | Mathematical Techniques | Measure efficiency in production |
41 | Decision Tree Analysis | Predictive Modeling | Other Techniques | Classify or predict outcomes using decision rules |
42 | Formal Logic | Theory | Other Techniques | Apply logical principles in analysis |
43 | Content Analysis | Qualitative Analysis | Other Techniques | Analyze text or media content |
44 | Financial Analysis | Financial Modeling | Other Techniques | Assess financial health and metrics |
Software for Data Analysis
Sr. No. | Software | Description | Authority | Techniques Performed | Variables Covered |
---|---|---|---|---|---|
1 | SPSS | Statistical software for complex data analysis, data mining, and descriptive statistics. | IBM | Correlation, Regression, ANOVA, MANOVA, Hypothesis Testing | Social, Behavioral, Health Data |
2 | AMOS | Used for SEM, path analysis, and confirmatory factor analysis. | IBM | SEM, CFA, Path Analysis | Latent and Observed Variables |
3 | EViews | Software tailored for time-series, forecasting, and panel data econometrics. | IHS Global Inc. | Regression, Time Series, Forecasting, Panel Data Analysis | Economic, Financial Time-Series Data |
4 | Stata | General-purpose tool for statistical analysis, data visualization, and regression. | StataCorp | Regression, Hypothesis Testing, ANOVA, MANOVA | Public Health, Economics, Sociology |
5 | SmartPLS | Variance-based SEM for analyzing complex relationships between variables. | SmartPLS GmbH | SEM, Path Analysis | Latent Variables |
6 | WarpPLS | SEM software focused on identifying nonlinear relationships. | WarpPLS | SEM, Nonlinear Analysis | Latent and Observed Variables |
7 | Mplus | Versatile statistical modeling for SEM, growth modeling, and multilevel analysis. | Muthén & Muthén | SEM, Growth Modeling, Multilevel Modeling | Categorical, Continuous Variables |
8 | DEAP | Efficiency and performance analysis via Data Envelopment Analysis (DEA). | University of Queensland | DEA | Efficiency, Productivity Variables |
9 | MATLAB | High-level environment for numerical computation and visualization. | MathWorks | Regression, Fourier Transform, Wavelet Analysis | Scientific, Engineering, Financial Data |
10 | Python | Programming language for data analysis, machine learning, and AI. | Python Software Foundation | Regression, Clustering, Data Visualization | Wide Range (e.g., Social, Economic) |
11 | R | Open-source programming language for statistical computing and data visualization. | R Foundation | Regression, Clustering, Hypothesis Testing | Statistical, Financial Data |
12 | Lingo | Optimization and mathematical programming software. | Lindo Systems | Linear/Nonlinear Optimization, Integer Programming | Optimization, Planning Variables |
13 | LINDO | Used for optimization in linear, nonlinear, and quadratic programming. | Lindo Systems | Linear/Nonlinear Optimization, Integer Programming | Operations, Logistics Variables |
14 | NVivo | Software for qualitative data analysis, especially unstructured data. | QSR International | Content Analysis, Thematic Analysis | Qualitative Data (Interviews, Media) |
15 | Edraw Max | Diagramming and vector graphics for flowcharts, mind maps, and diagrams. | Wondershare | Visualization, Flowcharting | Visual Structures |
16 | LISREL | SEM and path analysis software focused on linear structural relations. | SSI | SEM, Path Analysis | Latent, Structural Variables |
17 | KH Coder | Text mining for quantitative content and text data analysis. | Koichi Higuchi | Text Mining, Content Analysis | Textual Data (Social Media, Reviews) |
18 | ATLAS.ti | Qualitative data analysis for text, audio, and video. | ATLAS.ti | Content Analysis, Thematic Coding | Qualitative Data (Interviews, Media) |
19 | MathPix Snip | Converts handwritten math to LaTeX for digital analysis or research papers. | MathPix | Data Conversion, Equation Parsing | Mathematical Equations |
20 | Smart ISM | Tool related to ISM for analyzing complex system relationships. | Smart ISM | ISM, Structural Analysis | Complex System Variables |
21 | ISM | Interpretive Structural Modeling for exploring relationships among variables in complex systems. | Various | ISM, Structural Analysis | System Components, Hierarchical Variables |
22 | RIDIT Analysis | Non-Parametric Analysis | Statistical Techniques | Compare distributions, especially for ordinal data | Mid Term |
Greek Alphabet and Its Pronunciation
Upper Case | Lower Case | Full Name and Pronunciation | Purpose (Common Uses) | Formula Example |
---|---|---|---|---|
Α | α | Alpha (AL-fuh) | Significance level in statistics | $\alpha = 0.05$ (significance level) |
Β | β | Beta (BAY-tuh) | Type II error in statistics, regression coefficients | $y = \beta_0 + \beta_1 x$ |
Γ | γ | Gamma (GAM-uh) | Gamma function in mathematics | $\Gamma(n) = (n-1)!$ (for integer $n$) |
Δ | δ | Delta (DEL-tuh) | Change or difference in variables | $\Delta x = x_2 - x_1$ |
Ε | ε | Epsilon (EP-sil-on) | Arbitrary small quantity in limits | $\lim_{x \to a} f(x) = L \pm \epsilon$ |
Ζ | ζ | Zeta (ZAY-tuh) | Riemann zeta function in number theory | $\zeta(s) = \sum_{n=1}^{\infty} \frac{1}{n^s}$ |
Η | η | Eta (AY-tuh) | Efficiency in physics | $\eta = \frac{\text{useful energy output}}{\text{total energy input}}$ |
Θ | θ | Theta (THAY-tuh) | Angle measurement, statistical parameters | $\theta = 45^\circ$ |
Ι | ι | Iota (eye-OH-tuh) | Rarely used in formulas | - |
Κ | κ | Kappa (KAP-uh) | Curvature in geometry, condition number in linear algebra | $\kappa = \frac{1}{R}$ (curvature) |
Λ | λ | Lambda (LAM-duh) | Wavelength, eigenvalues in linear algebra | $\lambda = \frac{c}{f}$ (wavelength) |
Μ | μ | Mu (MYOO) | Mean in statistics, coefficient of friction | $\mu = \frac{\sum X}{N}$ (mean) |
Ν | ν | Nu (NOO) | Frequency in physics | $\nu = \frac{c}{\lambda}$ (frequency) |
Ξ | ξ | Xi (KS-EYE) | Random variable in statistics | $\xi = X$ (random variable) |
Ο | ο | Omicron (OM-i-KRON) | Rarely used in formulas | - |
Π | π | Pi (PIE) | Ratio of circumference to diameter in circles | $\pi \approx 3.14159$ |
Ρ | ρ | Rho (ROW) | Density in physics, correlation coefficient in statistics | $\rho = \frac{m}{V}$ (density) |
Σ | σ | Sigma (SIG-muh) | Summation, standard deviation | $\sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}}$ (standard deviation) |
Τ | τ | Tau (TAU) | Torque, time constant in RC circuits | $\tau = R \times C$ (time constant) |
Υ | υ | Upsilon (OOP-si-LON) | Upsilon particles in physics | - |
Φ | φ | Phi (FEE) | Golden ratio, angle in polar coordinates | $\phi = \frac{1 + \sqrt{5}}{2} \approx 1.618$ (golden ratio) |
Χ | χ | Chi (K-EYE) | Chi-square distribution | $\chi^2 = \sum \frac{(O - E)^2}{E}$ |
Ψ | ψ | Psi (SIGH) | Wave function in quantum mechanics | $\psi(x, t) = A e^{i(kx - \omega t)}$ |
Ω | ω | Omega (oh-MAY-guh) | Angular velocity, ohms in electrical resistance | $\omega = \frac{v}{r}$ (angular velocity) |
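A few of the statistical formulas in the table above (the mean μ, the standard deviation σ, the chi-square statistic χ², and the golden ratio φ) translate directly into code. The numbers below are invented for illustration, and the variable names simply mirror the Greek symbols.

```python
import math

# Hypothetical population of eight measurements.
observations = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
N = len(observations)

# Mean: mu = sum(X) / N
mu = sum(observations) / N

# Population standard deviation: sigma = sqrt(sum((X - mu)^2) / N)
sigma = math.sqrt(sum((x - mu) ** 2 for x in observations) / N)

# Chi-square statistic: sum((O - E)^2 / E) over observed and expected counts.
observed = [18, 22, 20]
expected = [20, 20, 20]
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Golden ratio: phi = (1 + sqrt(5)) / 2
phi = (1 + math.sqrt(5)) / 2
```

Note that σ here divides by N (the population formula shown in the table); a sample standard deviation would divide by N − 1 instead.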