Using machine learning and Python in a tool like Jupyter can be extremely powerful when attempting to perform predictive analysis.
Read the transcript of the video below:
Hi, it’s Peter Koebel here again from Data Sciencing Consultants, making another video for datasciencing.com. In this video I’ll be using machine learning and multiple linear regression to predict the profit of startups, and we’re using Python in a Jupyter notebook.

First we’ll bring in the code here for doing the preprocessing. I’ll be importing these libraries, and next I’ll be importing my data set. Now let’s check out the data set. Here we have 50 startups, and the columns are research and development expenditures, administration expenditures, marketing expenditures, the state where the startup is located, and its profit. Next, we will clean up the data.
In our X values we don’t need the profit, because that is the dependent variable and the other columns are the independent variables. So X just needs the four independent variables. Let’s take a look at those. And y will just be the dependent variable.
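The split into independent and dependent variables described above can be sketched with pandas. This is a minimal illustration, not the code shown in the video: the column names and the four rows of data are made up for the example, and the real data set has 50 rows.

```python
import pandas as pd

# Illustrative rows only: the column names and values are assumptions,
# not the data set used in the video.
startups = pd.DataFrame({
    "rd_spend": [160000.0, 150000.0, 140000.0, 130000.0],
    "administration": [135000.0, 150000.0, 100000.0, 120000.0],
    "marketing_spend": [470000.0, 440000.0, 400000.0, 380000.0],
    "state": ["New York", "California", "Florida", "New York"],
    "profit": [190000.0, 188000.0, 180000.0, 175000.0],
})

# X holds the four independent variables; y holds the dependent variable.
X = startups.drop(columns="profit")
y = startups["profit"]

print(X.columns.tolist())
print(y.name)
```

Keeping X as everything except the target column means the same two lines still work if more predictor columns are added later.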
So y here is just the profit, the dependent variable; it’s just the column at the end of the data set.

Now we need to encode the categorical data in the independent variables, because here the state is a name, and that doesn’t help with the machine learning because it’s a category. We’ve got to turn those into numbers that we can use, so we’ll just turn them into zeros and ones. If a row is New York, then it will get a one in the New York column and a zero for California and Florida. Let’s see how that goes.

Let’s look at X now. This is the California column, here is Florida, and here is New York. The first row has a one for New York and a zero for California and Florida. The second one is California, and the Florida row has a one here as well. So here’s the first row, and here’s the second row: it’s yes for California, no for Florida, no for New York. And this one is no for California, yes for Florida, no for New York.

With these three categories, a startup that is not in New York and not in Florida has to be in California, so we don’t need all three dummy variables. We’re going to use this to get rid of one of them, so we don’t have the extra variable in there that might mess up our analysis later.
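The encoding step above, including dropping the redundant dummy column, can be sketched with `pandas.get_dummies`. This is one possible way to do it and is not necessarily the encoder used in the video; the four state values are illustrative.

```python
import pandas as pd

# Illustrative state values only.
states = pd.Series(["New York", "California", "Florida", "California"], name="state")

# One 0/1 column per state; the columns come out in alphabetical order.
all_dummies = pd.get_dummies(states, dtype=int)

# drop_first=True removes the first column (California). A row with
# Florida = 0 and New York = 0 can only be California, so nothing is lost.
dummies = pd.get_dummies(states, dtype=int, drop_first=True)

print(all_dummies)
print(dummies)
```

Keeping all three columns would make them sum to one in every row, which duplicates the constant term of the regression. That redundancy is the reason one column is dropped.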
All right, now we’re going to do one of the main parts of machine learning: we’re going to split our data set into the training set and the test set. We’re going to take 80% of our data to do the machine learning, and that’s going to train the model to predict profit. Then we’ll use the 20% left over to test, basically to check that model. We’re not trying to find the exact profit for our sample of 50 here. We want to be able to predict, so that if we get a list of a thousand startups with this information, we will be able to predict what the profit is. So we’re going to use the 80% here to train.
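An 80/20 split like the one described above can be sketched with scikit-learn’s `train_test_split`. The arrays below are placeholders with the same shape as the prepared data (50 rows, five predictor columns after dropping one dummy), and the `random_state` value is an arbitrary choice made here so the split is repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 50 rows and 5 predictor columns (assumed shape).
X = np.arange(50 * 5, dtype=float).reshape(50, 5)
y = np.arange(50, dtype=float)

# test_size=0.2 holds back 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)
```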
I’ll just show you here: these are the 80% that were picked, and here are the 20% that we’ll use for testing afterwards. Now we’re going to fit the multiple linear regression to the training set, and then we’re going to predict the test set results. We’ll look at the predicted y variable and see what happens.

This is what it would predict. Based on the training set, we take the 20% from the test set, and this is what the model predicts the profit would be for those startups. Now we will look at the actual profit and compare, to see how well the model did. The actual profit for this one is 103,282 and we predicted it would be 103,015, so that’s pretty close.
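The fit, predict, and compare steps above can be sketched with scikit-learn’s `LinearRegression`. The data here is synthetic and generated only for illustration, so the printed numbers will not match the figures read out in the video.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for this sketch): profit depends mostly
# on the first column, plus random noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 100_000, size=(50, 3))
noise = rng.normal(0, 5_000, size=50)
y = 50_000 + 0.8 * X[:, 0] + 0.05 * X[:, 2] + noise

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit on the training rows only, then predict the held-back test rows.
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

# Compare actual and predicted profit side by side.
for actual, predicted in zip(y_test, y_pred):
    print(f"actual {actual:>10,.0f}   predicted {predicted:>10,.0f}")
```

Comparing the two columns row by row is the same check done by eye in the video. A summary metric such as the mean absolute error would condense it into one number.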
Now look at the next one. Here the actual is 144,259 and the model predicted 132,582, so that’s not great. This one is 132,447 against 146 thousand, so that’s a bit off. This one here is a bit closer, and then this one is 178 thousand against 192 thousand, which is not great. This one is closer, at about 105 thousand, though it is still about ten thousand off, and the one at 81 thousand is about thirteen thousand off, which isn’t great. But these ones are better: 97 and 98, then 113 and 110, and here 166 and 167. So it wasn’t perfect, but it got most of these fairly accurate.

Now we will build an optimal model using backward elimination, to see if any of these variables are not important, that is, not statistically significant for predicting the profit. With backward elimination, we first put all of the variables into the model and then find which ones are needed. We’ll use a p-value threshold of 0.05, so any variable whose p-value comes out above that when we do the calculations, which I’ll show later, can be eliminated.
But first, to make this work for this model, we need to add a column to X. After we created the dummy variables and got rid of one of them, we need to add a one for the constant. Here, I’ll show the picture. This is simple linear regression: y = b0 + b1*x1, so there is just one independent variable. And here is multiple linear regression: here we have multiple independent variables, here are the coefficients, and here is the constant. Technically there would be an x0 here, but it equals one, so it is ignored. In our model, though, we need to have it so the model knows that there is something there. That’s what we’ll do: just add a 1 at the front for the constant.
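Adding the column of ones for the constant can be sketched with NumPy. The three rows below are illustrative, and this is just one way to build the column rather than the exact call used in the video.

```python
import numpy as np

# Illustrative predictor rows: two dummy columns and one spend column.
X = np.array([
    [0.0, 1.0, 160000.0],
    [0.0, 0.0, 150000.0],
    [1.0, 0.0, 140000.0],
])

# One column of ones with as many rows as X, placed at the front so that
# x0 = 1 multiplies the constant b0 in y = b0*x0 + b1*x1 + ...
ones = np.ones((X.shape[0], 1))
X_with_const = np.hstack([ones, X])

print(X_with_const)
```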
So, using the statsmodels library, we put that in here. It’s creating ones in 50 rows and just one column. Let’s see what it looks like now: the beginning of each row has the one to multiply by the constant, just so there’s a value there for the model to calculate.

All right, now we will start optimizing the model, and we’ll use ordinary least squares. Here are all our variables. In Python the index starts at 0, so we have six: this is our constant, and these are Florida, New York, research and development, administration, and marketing. Let’s run this and see what we get.

Here are our results. Here’s the constant and here are the variables. We’re looking for the largest p-value, which is x2 here. That is New York, so it is not relevant to predicting the profit and we can remove it. That’s the second one here, so we get rid of that variable and see what we get.

Now the highest p-value is x1, which is Florida. That is also not relevant to predicting a startup’s profit, so we can remove that variable.

Again, now we have the next biggest one. Here’s the constant, so it is this variable here, which is administration. So more administration spending doesn’t help predict greater or lesser profit, depending on the relationship between the independent and dependent variables, and we can remove that one and see what we have.

These zeros here: the p-value isn’t really zero, it’s just really low, so it goes beyond the decimal places shown here. There could be a one or something past the fourth decimal place. Now the highest p-value is for marketing, and it’s really close to the p-value threshold we picked, which was 0.05. We’ll get rid of it here, but it’s really close, so there are other methods you can use to see whether it should be removed or not, because it seems really close to being statistically significant for predicting the profit. We will remove it for these purposes.
And now here we go: what we’re left with is the research and development column. Since its p-value is so low, it is highly statistically significant for predicting the profit of a startup. If we had those variables for, say, a thousand startups, we could just run them through here and this would be able to predict the profit, and the main variable you need is research and development.

Thank you for watching, and please like, share, and subscribe to my YouTube channel. Also check out my website, datasciencing.com. I’m available for data sciencing consulting, whether you have short-term projects that take a couple of hours or longer-term projects that run for a couple of months. You can email me at Peter Koebel at hotmail.com, or fill out the form on my website.
If you need help with your Python, machine learning, or Jupyter, I’m Peter Koebel the owner of Data Sciencing Consultants. You can reach me at 204-770-6437, or email me at peter.koebel@datasciencing.com, or fill out the form on our website https://datasciencing.com. You can also check out our YouTube channel for more awesome Excel tips!
For more awesome Python and machine learning tips, like this one about Likert scales in Tableau, check out the Data Sciencing Consultants blog.