What exactly is Data Science?
Before answering this question, it is important to understand why there is so much confusion around it.
Why are there so many confusing definitions?
One reason is that data science is an assortment of several tasks, all of which are part of the data science pipeline. The importance of each task also varies from organization to organization and from application to application. Because tasks are distributed and prioritized differently, and because most people work on only some of the data science tasks, different opinions and definitions come from different sources.
What are the tasks involved in Data Science?
1) Collect the Data
2) Store the Data
3) Process the Data
4) Describe the Data
5) Model the Data
Data Science is the science of collecting, storing, processing, describing and modeling data.
1) Collecting Data
What is involved in data collection?
It depends on the question the data scientist is trying to answer and on the environment in which the data scientist is working.
Example 1: A data scientist working at an e-commerce company
In many cases an e-commerce company is a data-rich organization that already holds a lot of data about its customers. Because the company already has the data, the data scientist does not actually need to collect and store it, and only needs to process, describe and model it. The data scientist writes SQL queries to access the data, and may also need to write Python or Java code to run those queries.
One question a data scientist might be interested in here is: which items do customers buy together?
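The "bought together" question can be sketched with a SQL self-join run from Python. This is only a minimal illustration: the order_items table, its columns and the sample orders are invented for the example, and it uses an in-memory SQLite database in place of whatever database the company actually runs. It also assumes each item appears at most once per order.

```python
import sqlite3

# Invented sample data: (order_id, item). Table and column names are assumptions.
ORDER_ITEMS = [
    (1, "phone"), (1, "phone case"), (1, "charger"),
    (2, "phone"), (2, "phone case"),
    (3, "tv"), (3, "hdmi cable"),
    (4, "phone"), (4, "charger"),
]

# Join the table to itself on order_id to get every pair of items in the same
# order. The condition a.item < b.item keeps each pair once and skips self-pairs.
PAIR_QUERY = """
SELECT a.item AS item_a, b.item AS item_b, COUNT(*) AS orders_together
FROM order_items AS a
JOIN order_items AS b
  ON a.order_id = b.order_id AND a.item < b.item
GROUP BY a.item, b.item
ORDER BY orders_together DESC, item_a, item_b
"""


def items_bought_together(rows):
    """Load the rows into an in-memory database and count co-purchased pairs."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE order_items (order_id INTEGER, item TEXT)")
        conn.executemany("INSERT INTO order_items VALUES (?, ?)", rows)
        return conn.execute(PAIR_QUERY).fetchall()
    finally:
        conn.close()


pairs = items_bought_together(ORDER_ITEMS)
for item_a, item_b, count in pairs:
    print(f"{item_a} + {item_b}: bought together in {count} orders")
```

On real data the same query would run against the company's existing order tables, and the Python part would mostly be the glue that sends the query and collects the result.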
Example 2: A data scientist working for a political party
The government has implemented a new policy.
What are people saying about the new policy? Is it good or bad, do people like it or not, what different opinions are there, and so on.
In this case the data exists, because people are discussing the policy on different social media platforms and public forums, but it is not owned by the government or by us, and it is not stored in a structured way.
In this case the data scientist needs some hacking skills to scrape the data from different sources on the web. This requires basic knowledge of Python, Java or another programming language.
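As a rough idea of what scraping involves, the sketch below pulls comment text out of a page using only Python's standard library. The HTML snippet and the "comment" class name are made up for illustration, since every real site has its own markup (and its own terms of use), and the downloading step is left out so the example runs without a network connection.

```python
from html.parser import HTMLParser

# Invented page snippet; a real forum or social media page will look different.
SAMPLE_PAGE = """
<html><body>
  <div class="comment">The new policy helps small businesses.</div>
  <div class="ad">Buy now!</div>
  <div class="comment">I do not like this policy at all.</div>
</body></html>
"""


class CommentExtractor(HTMLParser):
    """Collects the text inside every <div class="comment"> element.

    Simplification: nested <div> elements inside a comment are not handled.
    """

    def __init__(self):
        super().__init__()
        self.comments = []
        self._in_comment = False

    def handle_starttag(self, tag, attrs):
        if tag == "div" and dict(attrs).get("class") == "comment":
            self._in_comment = True
            self.comments.append("")

    def handle_endtag(self, tag):
        if tag == "div":
            self._in_comment = False

    def handle_data(self, data):
        if self._in_comment:
            self.comments[-1] += data


def extract_comments(html):
    """Turn unstructured HTML into a plain list of comment strings."""
    parser = CommentExtractor()
    parser.feed(html)
    parser.close()
    return [comment.strip() for comment in parser.comments]


comments = extract_comments(SAMPLE_PAGE)
for comment in comments:
    print(comment)
```

The output is a simple list of strings, which is the point: scraping turns data that lives on someone else's pages into something structured enough to store and analyze.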
Example 3: A data scientist working with farmers
What is the effect of the type of seed, fertilizer and irrigation on yield?
In this case the data is not available within the organization and is not readily available in public either, so the situation is really different. Now you have to design an experiment and collect the data yourself.
For example, if a particular seed is not giving a good yield, does that mean the seed is bad, or is the problem with the irrigation method or the fertilizer? You design an experiment that covers all of these effects. Take a piece of land and divide it into 9 parts. Each part gets a different combination of seed, irrigation method and fertilizer. Once these experiments produce enough sample data, we can draw insights from it and find which combination gives a good yield. This requires some statistical knowledge, so that conclusions can be drawn using hypothesis testing and analysis of variance.
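To give a feel for analysis of variance, here is a minimal one-way ANOVA computed by hand. It is a simplified sketch: it looks at only one factor (seed type) instead of seed, irrigation and fertilizer together, and the yield numbers are invented purely to make the arithmetic visible.

```python
from statistics import mean

# Invented yields (for example kg per plot) for 9 plots, 3 plots per seed type.
YIELDS = {
    "seed_a": [20.0, 22.0, 21.0],
    "seed_b": [25.0, 27.0, 26.0],
    "seed_c": [20.0, 21.0, 19.0],
}


def one_way_anova_f(groups):
    """Return the F statistic: between-group variance over within-group variance."""
    all_values = [value for group in groups for value in group]
    grand_mean = mean(all_values)
    k = len(groups)        # number of groups (seed types)
    n = len(all_values)    # total number of plots

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual plots around their own group mean.
    ss_within = sum((value - mean(g)) ** 2 for g in groups for value in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within


f_stat = one_way_anova_f(list(YIELDS.values()))
print(f"F statistic: {f_stat:.2f}")
```

A large F value means the differences between seed types are large compared with the variation between plots that got the same seed. Turning F into a p-value requires the F distribution with (k - 1, n - k) degrees of freedom, which a statistics library would normally provide.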
So what skills are required to collect the data?
Intermediate level of programming
Knowledge of databases
Knowledge of statistics
2) Store the Data
Structured data such as customer data, employee data and product inventory data is stored in relational databases. Companies often keep data in multiple databases, so a data warehouse is used to bring the data from those databases together for analytics purposes.
Unstructured data includes text, images, video and speech. Big data systems use data lakes to store unstructured, semi-structured and structured data.
3) Processing Data
Extract the data from different sources, transform and clean the data required for the data science project, and load it.
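The extract, transform and load steps can be sketched in a few lines of Python. Everything specific here is an assumption made for the example: the CSV text stands in for a real source system, the column names are invented, and an in-memory SQLite table stands in for the target database.

```python
import csv
import io
import sqlite3

# Invented raw export with typical problems: stray spaces, inconsistent
# capitalization and a missing amount.
RAW_CSV = """customer,amount
 Alice ,120.50
bob,
Carol,80
alice,30
"""


def extract(text):
    """Extract: read the raw rows from the source (here, CSV text)."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: clean names, convert amounts, drop rows with no amount."""
    cleaned = []
    for row in rows:
        amount = (row["amount"] or "").strip()
        if not amount:
            continue
        cleaned.append((row["customer"].strip().title(), float(amount)))
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM sales GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
```

Keeping the three steps as separate functions makes it easy to swap the source or the target later without touching the cleaning rules.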
4) Describe the Data
After loading the clean data, visualize it and summarize it with plots such as bar graphs and grouped bar graphs.
Example: A graph of mobile phone or TV sales over the past 3 years, which makes the data easy to understand and communicate.
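A small sketch of describing data: a numeric summary plus a text-only bar chart, so it runs without any plotting library. The sales figures and the year labels are placeholders invented for the example. In practice a plotting library would draw a proper bar graph.

```python
from statistics import mean

# Invented unit sales per year; numbers and labels are placeholders.
SALES = {
    "mobile phones": {"year 1": 120, "year 2": 150, "year 3": 180},
    "TV": {"year 1": 60, "year 2": 55, "year 3": 70},
}


def summarize(series):
    """Return a few summary numbers for one product."""
    values = list(series.values())
    return {"min": min(values), "max": max(values), "mean": mean(values)}


def text_bar_chart(series, unit=10):
    """One line per label, with one '#' per `unit` sales (rounded down)."""
    return [
        f"{label:<8}{'#' * (value // unit)} {value}"
        for label, value in series.items()
    ]


for product, series in SALES.items():
    print(product, summarize(series))
    print("\n".join(text_bar_chart(series)))
```

Even this crude chart makes the trend visible at a glance, which is the purpose of the describe step.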
5) Model the Data
Statistical Modeling: -
Statistical modeling assumes an underlying data distribution, such as the normal distribution, or a simple relationship between variables, such as a linear one.
In statistical modeling, we assume simple models, which allows robust statistical analysis.
Example: Readings of patients' blood sugar level, height, weight, etc.
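As an illustration of assuming a simple model, the sketch below treats a handful of blood sugar readings as samples from a normal distribution and estimates its parameters with Python's standard library. The readings and the threshold are invented, and whether a normal distribution is actually a good fit is something that would have to be checked on real data.

```python
from statistics import NormalDist

# Invented fasting blood sugar readings in mg/dL, for illustration only.
READINGS = [88, 92, 95, 90, 85, 99, 91, 94, 87, 89]

# Assumption: the readings follow a normal distribution. Estimate its mean
# and standard deviation from the sample.
model = NormalDist.from_samples(READINGS)

# Under that assumption, estimate the share of patients above a threshold.
THRESHOLD = 100  # hypothetical cut-off chosen for the example
share_above = 1 - model.cdf(THRESHOLD)

print(f"estimated mean: {model.mean:.1f}, standard deviation: {model.stdev:.2f}")
print(f"estimated share above {THRESHOLD}: {share_above:.3f}")
```

The whole model is just two numbers, the mean and the standard deviation, which is what makes this kind of analysis simple and robust.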
Algorithmic Modeling: -
A model represents what was learned by a machine learning algorithm. The model is the “thing” that is saved after running a machine learning algorithm on training data. It represents the rules, numbers, and any other algorithm-specific data structures required to make predictions.
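To make the idea of a saved model concrete, here is a minimal sketch using one of the simplest learning algorithms, a least-squares line fit. The training data is invented, and JSON is used only as a convenient way to show that the learned numbers can be stored and loaded again later. Real machine learning libraries have their own model formats.

```python
import json

# Invented training data: (height in cm, weight in kg).
TRAINING_DATA = [(150, 50), (160, 58), (170, 66), (180, 74)]


def train(data):
    """Fit weight = slope * height + intercept by ordinary least squares."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    slope = sxy / sxx
    return {"slope": slope, "intercept": mean_y - slope * mean_x}


def predict(model, height):
    """Use the learned numbers to predict a weight for a new height."""
    return model["slope"] * height + model["intercept"]


model = train(TRAINING_DATA)   # run the algorithm on the training data
saved = json.dumps(model)      # the "thing" that gets saved
restored = json.loads(saved)   # later: load it back, no training data needed

print(saved)
print(f"predicted weight at 175 cm: {predict(restored, 175):.1f} kg")
```

Notice that the restored model makes predictions without access to the training data: everything the algorithm learned is contained in the two saved numbers.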