Data Science and Big Data Hadoop: Data Science Practice

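======== Types of data variables =========
1- Categorical - no direction and no magnitude, e.g. male / female. Arithmetic operations cannot be performed on it.
2- Ordinal - direction but no magnitude.
E.g. shirt sizes XL, L, M, S: they have an order (direction) but no magnitude, so arithmetic operations cannot be performed.
3- Interval - both direction and magnitude, and often found in your dataset.
You may or may not be able to perform arithmetic operations: differences are meaningful, but ratios are not, because there is no true zero.
E.g. temperature in Celsius, or converted to Fahrenheit.
4- Ratio / continuous - both direction and magnitude, with a true zero, so all arithmetic operations apply.
Parametric - both direction and magnitude, e.g. weight and height; these work in N dimensions.
In data analytics (DA) we study and simplify the data so that it works in N dimensions.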

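Median - the 50th percentile of all values in a column.
It can be applied to ordinal, interval and ratio variables.
Mean - the average value; it can only be applied to variables that have both direction and magnitude.
Mode - the value that occurs most often (highest frequency); it can be calculated on any data variable.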
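
A small sketch of which measure fits which variable type, using Python's standard `statistics` module. The three lists are made-up example values (not from these notes), and the size order S < M < L < XL is an assumption used only to rank the ordinal values.

```python
from statistics import mean, median_low, mode

# Hypothetical example values, not taken from the notes.
gender = ["male", "female", "female", "male", "female"]   # categorical
shirt = ["S", "M", "M", "L", "XL", "M", "L"]              # ordinal
weight_kg = [61.5, 72.0, 58.2, 80.4, 66.9]                # ratio

# Mode only counts occurrences, so it works on any variable type.
most_common_gender = mode(gender)

# Median needs an order: map each size to its rank, take the middle rank,
# then map the rank back to a size (assumed order: S < M < L < XL).
order = ["S", "M", "L", "XL"]
ranks = [order.index(size) for size in shirt]
median_size = order[median_low(ranks)]

# Mean needs direction and magnitude, so it is applied to the numeric column.
mean_weight = mean(weight_kg)

print(most_common_gender, median_size, mean_weight)
```

The mean of the shirt sizes is deliberately not computed: the ranks 0, 1, 2, 3 only carry order, not magnitude.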

In data analytics we work with large amounts of data, so we always work on a sample dataset extracted from the population.
Inferential statistics - the process of estimating population values based on samples taken from the population.
Three types:
1- Point estimate: Xbar (sample mean) = population mean (mu)
2- Interval estimate: mu = Xbar +- margin of error
3- Test of hypothesis

Where do we do analytics? On big data, where we have large datasets to analyse. Analytics work starts after the data has been put into HDFS. Many techniques exist for that; one of them is machine learning, which contains supervised learning and unsupervised learning algorithms, among many other algorithms.

Calculate the median for the data set below:
15, 21, 54, 23, 89, 58, 72, 33, 45, 68

Ans - Median = 49.5 (after sorting, the two middle values are 45 and 54, and (45 + 54) / 2 = 49.5)
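
The answer can be checked with a few lines of Python. This is only a sketch of the usual rule for an even number of values (average the two middle values after sorting), compared against the standard library's `statistics.median`.

```python
from statistics import median

data = [15, 21, 54, 23, 89, 58, 72, 33, 45, 68]

ordered = sorted(data)
n = len(ordered)

# Even number of values: average the two middle ones.
manual_median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2

print(ordered)
print(manual_median)
print(median(data))
```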

[2:24 PM, 9/20/2018] Nitin Damle: Standard deviation =
square root of ( summation of (X - Xmean)² / N )
N - total number of values
X - each value of the variable in turn
Xmean - mean of X
Calculate the SD for the given age data:
12
85
54
56
69
56
32
78
74
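
A sketch of the SD calculation for the age data above, step by step. Dividing by N gives the population SD, as in the formula. The sample SD (dividing by N - 1) is shown as well only for comparison, since these notes keep stressing that we work on samples.

```python
from math import sqrt
from statistics import pstdev, stdev

ages = [12, 85, 54, 56, 69, 56, 32, 78, 74]

n = len(ages)
x_mean = sum(ages) / n

# Squared distance of every value from the mean.
squared_deviations = [(x - x_mean) ** 2 for x in ages]

sd_population = sqrt(sum(squared_deviations) / n)        # divide by N
sd_sample = sqrt(sum(squared_deviations) / (n - 1))      # divide by N - 1

print(round(x_mean, 2), round(sd_population, 2), round(sd_sample, 2))

# Cross-check against the standard library.
print(round(pstdev(ages), 2), round(stdev(ages), 2))
```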
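Standard error (SE) = sigma (population SD) / square root of N

or s (sample standard deviation) / square root of n
As we know, we always work on a sample drawn from the population.
Important formula for measuring how far a sample mean lies from the population mean, in units of standard error:
(sample mean Xbar - population mean mu) / SE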
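
A sketch of the standard error and of the (sample mean - population mean) / SE ratio, reusing the age data from the SD exercise. The population mean is normally unknown, so `mu_assumed = 50` is a purely hypothetical value chosen for illustration.

```python
from math import sqrt
from statistics import mean, stdev

ages = [12, 85, 54, 56, 69, 56, 32, 78, 74]
mu_assumed = 50  # hypothetical population mean, an assumption for this example

n = len(ages)
s = stdev(ages)          # sample standard deviation (divides by n - 1)
se = s / sqrt(n)         # standard error of the mean

# Distance of the sample mean from the assumed population mean, in SE units.
z = (mean(ages) - mu_assumed) / se

print(round(se, 3), round(z, 3))
```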

===============================
Margin of error
Confidence interval
T test
Z test
Z table
Chi-square test
P value
Alpha value
================ Central limit theorem (CLT) ======================
It is a basic of statistics
and an important theorem for data analytics.
Prerequisites are
1. Population mean mu
2. Selection of samples with no bias
3. Sample mean
4. Sample std deviation s
5. Population std deviation sigma
6. Std error

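What is CLT:- the means of unbiased samples taken from a population are approximately normally distributed around the population mean when the samples are large enough. This is what lets us estimate the population mean based on samples taken from the population.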

Population mean mu
We almost never know it, because we cannot calculate it. E.g. suppose we want to calculate the average age of all people living in India: it is difficult to go to every home and reach everyone living in the country.
It would be time consuming and costly.
Thus we always work on samples.
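
The idea that sample means gather around the population mean can be tried out with a small simulation. Everything here is an assumption made for illustration: a made-up population of 20,000 values from a skewed (exponential) distribution, a sample size of 50, and 2,000 repeated samples. In real work the population would not be available like this.

```python
import random
from statistics import mean, pstdev

random.seed(42)  # fixed seed so the run is repeatable

# Hypothetical population: 20,000 values from a skewed distribution (mean about 30).
population = [random.expovariate(1 / 30) for _ in range(20_000)]
mu = mean(population)
sigma = pstdev(population)

# Draw many unbiased samples and record each sample mean.
sample_size = 50
sample_means = [
    mean(random.sample(population, sample_size)) for _ in range(2_000)
]

# The average of the sample means should sit close to mu,
# and their spread should be close to sigma / sqrt(sample_size).
predicted_se = sigma / sample_size ** 0.5
observed_se = pstdev(sample_means)

print(round(mu, 2), round(mean(sample_means), 2))
print(round(predicted_se, 2), round(observed_se, 2))
```

Even though the population itself is skewed, the sample means cluster around mu with a spread close to the standard error.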

Population ==> pull out ==> samples

Everyone has to get an equal chance of being extracted into the sample; this is called sampling with no bias.
Simple random sample - randomly select elements from the population.
Cluster sample - select samples from a particular or defined region or area.
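
A sketch of the two sampling methods using Python's `random` module. The population of 1,000 people in 10 regions is made up, and the cluster step shown here (pick 2 regions at random and keep everyone in them) is just one simple way to do it, assumed for illustration.

```python
import random

random.seed(7)  # fixed seed so the run is repeatable

# Hypothetical population: 1,000 people spread evenly over 10 regions.
population = [{"id": i, "region": f"region_{i % 10}"} for i in range(1000)]

# Simple random sample: every person has the same chance of being picked.
simple_sample = random.sample(population, 50)

# Cluster sample: pick whole regions at random, then keep everyone in them.
regions = sorted({person["region"] for person in population})
chosen_regions = random.sample(regions, 2)
cluster_sample = [p for p in population if p["region"] in chosen_regions]

print(len(simple_sample), chosen_regions, len(cluster_sample))
```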

Confidence interval
For 95% confidence the z value used in the margin of error is +-1.96
For 99% it is +-2.58 (2.576)
These values are read from the z table
z = (X - mu) / sigma
Case 95% CI -
As per the CLT, the means of samples extracted from the population approach the population mean.
But a sample mean may still fall in the 2.5% tail on the +ve side or the 2.5% tail on the -ve side; that is the chance we accept.
The deviations are squared and then the square root is taken, so that +ve and -ve deviations do not cancel each other out.
The 95% interval is therefore the sample mean +- 1.96 standard errors.
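
A sketch of where 1.96 and 2.58 come from and how a 95% interval is built, using `statistics.NormalDist` from the standard library in place of a printed z table. The age data from the SD exercise is reused here as the sample, which is an assumption made only to have numbers to work with.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

ages = [12, 85, 54, 56, 69, 56, 32, 78, 74]

# 95% confidence leaves 2.5% in each tail, 99% leaves 0.5% in each tail.
z95 = NormalDist().inv_cdf(0.975)
z99 = NormalDist().inv_cdf(0.995)

x_bar = mean(ages)
se = stdev(ages) / sqrt(len(ages))

# Margin of error = z value * standard error.
margin = z95 * se
ci_95 = (x_bar - margin, x_bar + margin)

print(round(z95, 3), round(z99, 3))
print(round(ci_95[0], 2), round(ci_95[1], 2))
```

With only nine values a t value would normally be used instead of z; the z value is kept here only to match the 1.96 in the notes.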
