
Predicting Profits of Start Ups using Machine Learning and Python

by | Dec 15, 2022 | Uncategorized | 0 comments

Join our founder Peter Koebel as he walks through Jupyter using machine learning and Python.

Using machine learning and Python in a tool like Jupyter can be extremely powerful when attempting to perform predictive analysis.

Read the transcript of the video below:

Hi, it’s Peter Koebel here again from Data Sciencing Consultants, making another video for datasciencing.com. In this video I’ll be using machine learning and multiple linear regression to predict the profit of startups, and we’ll be using Python in a Jupyter notebook.

First we’ll bring in the code here for the preprocessing. I’ll be importing these libraries, and next I’ll be importing my data set. Now let’s check out the data set. Here we have 50 startups, and the categories they have are research and development expenditures, administration expenditures, marketing expenditures, the state where the startup is, and their profit.

Next we will clean the data up a bit, because in our X values we don’t need the profit. That is the dependent variable, and these are the independent variables. In the X variable we just need the four independent variables, so we’ll take a look at those. And y here is just the profit, which is the dependent variable.
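The separation into X and y described above can be sketched with pandas. This is a minimal sketch, not the notebook from the video: the column names and the three rows of numbers are made up for illustration, and the real data set has 50 rows.

```python
import pandas as pd

# Illustrative rows only (made-up values); the column names are assumptions.
startups = pd.DataFrame({
    "rd_spend": [165000.0, 162000.0, 153000.0],
    "administration": [136000.0, 151000.0, 101000.0],
    "marketing_spend": [471000.0, 443000.0, 407000.0],
    "state": ["New York", "California", "Florida"],
    "profit": [192000.0, 191000.0, 190000.0],
})

# X keeps the four independent variables; y is the dependent variable (profit).
X = startups.drop(columns=["profit"])
y = startups["profit"]

print(X.shape)  # (3, 4)
print(y.shape)  # (3,)
```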

Now we will encode the categorical data, which is an independent variable, because here the state is a name, and that doesn’t help with the machine learning because it’s a category. We’ve got to turn those into numbers that we can use, so we’ll just turn them into zeros and ones. If this row is New York, then it will be one in the New York column and zero for California and Florida. We’ll see how that goes here.

Let’s look at X now. This is the California column, here is Florida, and here’s New York. The first one here has a one for New York and zero for California and Florida. The second one here is California, and the Florida one has its one here as well. So here’s the first row, and here’s the second row: it’s yes for California, no for Florida, no for New York. And this one was no for California, yes for Florida, no for New York.

With these three categorical variables, if it’s not New York and not Florida, it has to be California, so we don’t need three dummy variables. We’re going to use this to get rid of one of them, so we don’t have the extra variable in there that might mess up our analysis later.
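The dummy columns and the dropped extra column can be sketched as follows. This uses pandas `get_dummies` as a stand-in, which is an assumption; the video may use a different encoder. The three example states follow the row order described above.

```python
import pandas as pd

# Three example rows in the order described: New York, California, Florida.
states = pd.DataFrame({"state": ["New York", "California", "Florida"]})

# One 0/1 column per state, in alphabetical order.
all_dummies = pd.get_dummies(states["state"], dtype=int)
print(all_dummies)

# drop_first=True removes the first category (California here), since a row
# with 0 for Florida and 0 for New York can only be California.
reduced = pd.get_dummies(states["state"], drop_first=True, dtype=int)
print(reduced)
```

Keeping two dummy columns instead of three avoids a column that is fully determined by the others, which is the "extra variable" mentioned above.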

All right, now we’re going to do one of the main parts of machine learning: we’re going to split our data set into the training set and the test set. We’re going to take 80% of our data to do the machine learning, and that’s going to train the model to predict profit. Then we’ll use the 20% left over to test, basically to check that model. We’re not trying to find the exact profit for our sample of 50 here. We just want to be able to predict: if we get a list of a thousand startups with this information, we want to be able to predict what the profit is.

So we’re going to use the 80% here to train. I’ll just show you here: these are the 80% that were picked, and here is the 20% that we’ll use for testing afterwards.
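An 80/20 split like the one described above can be sketched with scikit-learn’s `train_test_split`. The placeholder arrays and the `random_state` value are assumptions for illustration, not the values from the video.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data (an assumption): 50 rows with 5 feature columns.
X = np.arange(50 * 5, dtype=float).reshape(50, 5)
y = np.arange(50, dtype=float)

# Hold out 20% of the rows for testing; train on the remaining 80%.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)  # (40, 5) (10, 5)
```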

Now we’re going to fit the multiple regression to the training set here, and then we’re going to predict the test set results. We’ll look at the y_pred variable and see what happens here. This is what it would predict: based on the training set, we take the 20% from the test set, and this is what the model predicts the profit would be for those.

Now we will look at the actual profit and compare, to see how well the model did. The actual for this one is 103,282, and we predicted it would be 103,015, so that’s pretty close. Now look at the next one: here the actual is 144,259 and the model predicted 132,582. That’s not great. This one is 132,447 against 146, which is a bit off. This one here is a bit closer. Then this one here, 178 against 192, is not great. This one’s closer, 105 and 116. And this one, at 81 thousand, is about thirteen thousand off, which isn’t great. But these ones are better: 97 and 98, then 113 and 110, and here 166 and 167. So it wasn’t perfect, but we got most of these fairly accurate.
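The fit, predict, and compare steps above can be sketched like this. It is a minimal sketch on synthetic data (an assumption, since the 50-startup data set is not reproduced here), so the printed numbers will not match the ones read out in the video.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption): 50 rows, 3 numeric features,
# with "profit" driven mostly by the first feature plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 100_000, size=(50, 3))
y = 50_000 + 0.8 * X[:, 0] + rng.normal(0, 5_000, size=50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit multiple linear regression on the training set only.
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Predict profit for the held-out rows and line them up against the actuals.
y_pred = regressor.predict(X_test)
for actual, predicted in zip(y_test, y_pred):
    print(f"actual {actual:>10.0f}   predicted {predicted:>10.0f}   "
          f"off by {abs(actual - predicted):>8.0f}")
```

Reading the rows side by side like this is the same check done by eye above; a single summary number such as `regressor.score(X_test, y_test)` is another option.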

Now we will build an optimal model using backward elimination, to see if any of these variables aren’t important here, that is, if any of them are not statistically significant for predicting the profit. With backward elimination, first we’ll use them all, putting all of these into the model, and then we will find which ones are needed. We’ll use a p-value of 0.05, so any of these that come out above that when we do the calculations, which I’ll show later, we can eliminate.

But first, to make this work for this model, we need to add an x0. After we had these dummy variables and got rid of the one, we need to add a one for the constant. Here, I’ll show the picture. This is simple linear regression: it’s y equals b0 plus b1 times x1, so just one independent variable. And here’s multiple linear regression. Here we have multiple independent variables, here are the coefficients, and here’s the constant. Technically it’d have an x0 here, but that’s one, so it’s ignored. In our model, though, we need to have that so it knows there’s something there. So that’s what we’ll do: add just a one here at the front for the constant.
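Written out, the two equations described above look roughly like this. The symbols follow the spoken description (b0 for the constant, b1 onward for the coefficients); the exact notation in the video’s picture is an assumption.

```latex
% Simple linear regression: one independent variable
y = b_0 + b_1 x_1

% Multiple linear regression: n independent variables
y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n

% The same model with the constant's variable written out, where x_0 = 1
y = b_0 x_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n, \qquad x_0 = 1
```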

So we’re using the statsmodels library, and we put that in here. It’s adding ones: 50 rows and just one column. Let’s see what it looks like now. Now the beginning of each row has the one to multiply with the constant, just so there’s a value there for the model to calculate.
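Putting a column of ones at the front of the feature matrix can be sketched with NumPy. The placeholder matrix is an assumption, and this is not necessarily the exact call used in the video.

```python
import numpy as np

# Placeholder feature matrix (an assumption): 50 rows, 5 columns.
X = np.arange(250, dtype=float).reshape(50, 5)

# 50 rows by 1 column of ones, placed in front of the existing columns.
ones = np.ones((50, 1))
X_with_const = np.append(ones, X, axis=1)

print(X_with_const.shape)  # (50, 6)
print(X_with_const[0])     # first row now starts with 1.0
```

statsmodels also provides `add_constant` for the same purpose.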

All right, now we will start optimizing the model, and we’ll use ordinary least squares. Here are all our variables. In Python, counting starts at 0, so we have six there: this is our constant, and these are Florida, New York, research and development, administration, and marketing. We’ll run this and see what we get.

Here are our results. Here’s the constant and here are the variables. We’re looking for the largest p-value, which is x2 here. That is New York, so it is not relevant to predicting the profit, and we can remove it. That is the second one here, so we get rid of that variable and see what we get.

Now the highest p-value is x1, which is Florida. That also is not relevant to predicting the profit of the startup, so we can remove that variable. Again, now we have the next biggest one. Here’s the constant, so this variable here is administration. More spending on administration doesn’t help predict greater or lesser profit, depending on the relationship between the independent and dependent variables, so we can remove that one and see what we have.

These zeros here aren’t really zero; the value is just really low, so it goes beyond the decimal places shown. There could be a one or something past the fourth decimal spot. Now here, this is the highest p-value, and that is for marketing. It’s really close to the p-value we picked, which was 0.05. We’ll get rid of it here, but it’s really close, so there are other methods to use to see whether this should be removed or not, because it seems like it’s really close to being statistically significant for predicting the profit. We will remove it for these purposes.

And now here we go. What we’re left with is the research and development column. Since the p-value is so low, it is highly statistically significant for predicting the profit of a startup. If you had those variables for, like, a thousand startups, you could just run them through here and this would be able to predict. The main variable you need is research and development.

Thank you for watching, and please like, share, and subscribe to my YouTube channel. Also check out my website, datasciencing.com. I’m available for data sciencing consulting if you have short-term projects that are a couple of hours, or if you want longer-term projects for a couple of months. You can email me or fill out the form on my website.


If you need help with your Python, machine learning, or Jupyter, I’m Peter Koebel, the owner of Data Sciencing Consultants. You can reach me at 204-770-6437, email me, or fill out the form on our website. You can also check out our YouTube channel for more awesome Excel tips!

For more awesome Python and machine learning tips, like this one about predicting startup profits, check out the Datasciencing Consultants blog.
