Introduction to Data Science
What is Data
Data
- Data are a set of values of qualitative or quantitative variables about one or more persons or objects.
- Datum (singular of data) is a single value of a single variable.
Information
- Raw data is usually redundant and uncertain; it only becomes information suitable for making decisions once it has been analyzed in some fashion.
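As a minimal illustration of this step, the sketch below turns a handful of raw, redundant, partly missing readings into one summary that could support a decision. The sensor readings and the `summarize` helper are made up for the example.

```python
from statistics import mean

# Hypothetical raw data: (sensor_id, timestamp, temperature in °C).
# It contains an exact duplicate and a missing value.
raw_readings = [
    ("s1", "2024-01-01T10:00", 21.5),
    ("s1", "2024-01-01T10:00", 21.5),   # redundant duplicate
    ("s1", "2024-01-01T11:00", None),   # missing measurement
    ("s1", "2024-01-01T12:00", 22.5),
    ("s1", "2024-01-01T13:00", 23.0),
]

def summarize(readings):
    """Drop duplicates and missing values, then compute the mean temperature."""
    unique = list(dict.fromkeys(readings))  # keeps the first occurrence, preserves order
    values = [temp for _, _, temp in unique if temp is not None]
    return {"n_raw": len(readings), "n_used": len(values), "mean_temp": mean(values)}

info = summarize(raw_readings)
print(info)
```

The five raw rows carry only three usable measurements; the summary, not the raw list, is what a decision would be based on.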
Knowledge
- Knowledge is personal understanding built on extensive experience of dealing with information on a subject.
Wisdom
- Wisdom is the ability to think and act using knowledge, experience, understanding, common sense and insight.
The age of data
- Volume of data created worldwide in zettabytes ($10^{21}$ bytes):
- Data in the 21st century is like oil in the 18th century. Like oil, for those who see data’s fundamental value and learn to extract and use it, there will be huge rewards.
What to do with data?
Data engineering
A data engineer is responsible for building, testing and maintaining the data architecture.
Roles:
- Develop, construct, test and maintain architectures and processing workflows
- Build robust, efficient and reliable data pipelines
- Develop solutions for data acquisition
- Ensure architecture supports business requirements
- Develop dataset processes for data modeling, mining, and production
- Drive the collection of new data and refinement of existing data sources
- Recommend ways to improve data reliability, efficiency, and quality
Core skills:
- Database systems:
  * SQL-based systems like MySQL, PostgreSQL, Microsoft SQL Server, and Oracle Database
  * NoSQL databases including MongoDB, Cassandra, Couchbase, and Oracle NoSQL Database
- Data warehouse software:
  * Cloud-based data warehouses like Amazon Redshift, Panoply, BigQuery, and Snowflake
- Coding ability:
  * Python
  * Scala
  * Java
  * …
- Big Data tools:
  * Apache Spark
  * Apache Kafka
  * Apache Hadoop
  * Apache Cassandra
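To make the idea of a data pipeline concrete, here is a minimal extract-transform-load sketch using only the Python standard library. The CSV content, the `orders` table, and the function names are assumptions made for illustration; a production pipeline would typically rely on tools such as those listed above rather than an in-memory SQLite database.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from a file, an API, or a message queue.
RAW_CSV = """order_id,customer,amount
1,alice,10.50
2,bob,
3,alice,4.25
3,alice,4.25
"""

def extract(text):
    """Acquire: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Clean: drop rows without an amount, convert types, skip repeated order ids."""
    seen, clean = set(), []
    for row in rows:
        if not row["amount"] or row["order_id"] in seen:
            continue
        seen.add(row["order_id"])
        clean.append((int(row["order_id"]), row["customer"], float(row["amount"])))
    return clean

def load(rows, conn):
    """Store: write the cleaned rows into a SQL table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(row_count)
```

Keeping the three stages as separate functions makes each one easy to test and to replace independently, which is one common way to keep a pipeline reliable.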
Data Analyst:
Guided by business questions, data analysts explore data to extract the information needed to answer them.
Roles:
- Collecting data based on a specific request from leadership
- Becoming familiar with the parameters of the data set (types of data, how it can be sorted)
- Pre-processing: making sure the data is free of errors
- Interpreting the data and analyzing how it helps solve the business problem
- Drawing conclusions from the analysis
- Visualizing and presenting the findings to managers
- Determining the meaning of the data
- Creating data quality dashboards and KPI reports about the data
- Documenting the structures and types of business data
Core skills:
- Statistics (knowledge of statistics makes exploring data easier and helps in avoiding logical errors):
  * SPSS
  * SAS
  * MATLAB
- SQL (to extract data from the data warehouse for analysis)
- Data visualization tools (to create visual representations of complex data sets to make them easy for others to understand):
  * Tableau
  * Infogram
  * QuickSight
  * Power BI
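The three skills above can be sketched together in a few lines: SQL to pull the data, basic statistics to describe it, and a simple chart to present it. The `sales` table and its values are invented for the example, and the text bar chart is only a stand-in for a real visualization tool.

```python
import sqlite3
from statistics import mean, median

# Hypothetical warehouse table, built in memory for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [
    ("north", 120.0), ("north", 80.0), ("north", 100.0),
    ("south", 50.0), ("south", 70.0),
])

# SQL: total revenue per region.
totals = dict(conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
))

# Statistics: describe the individual sales of one region.
north = [amount for (amount,) in conn.execute(
    "SELECT amount FROM sales WHERE region = ?", ("north",)
)]
summary = {"mean": mean(north), "median": median(north)}

# Visualization: one '#' per 20 units of revenue.
for region, total in totals.items():
    print(f"{region:<6} {'#' * int(total // 20)} {total:.0f}")
print(summary)
```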
Data Scientist:
Data scientists rely on predictive analytics, machine learning, data conditioning, mathematical modeling, and statistical analysis.
Roles:
- Apply quantitative techniques from fields such as statistics, econometrics, optimization, and machine and deep learning to solve important business problems from many areas of the automotive and mobility industry
- Utilize statistical approaches to build predictive models
- Enable evidence-based decision making by extracting insights from structured and unstructured data sets
- Identify novel data sources and explore their potential use in developing actionable business insights
- Explore new technologies and analytic solutions for use in quantitative model development
- Design and develop customized interactive reports and dashboards
- Help maintain and improve existing models
Core skills:
- Python
- R
- SQL
- Hadoop
- Algebra
- Calculus
- Statistics
- Machine Learning
- Deep Learning
- Data visualization tools
- Business acumen
- Communication skills
- …
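As one small example of using a statistical approach to build a predictive model, the sketch below fits a straight line by ordinary least squares in plain Python. The data (advertising spend against sales) and the helper names `fit_line` and `predict` are assumptions made for illustration; real projects usually rely on a dedicated library for this.

```python
from statistics import mean

# Hypothetical training data: advertising spend (x) and resulting sales (y).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return slope, intercept

slope, intercept = fit_line(xs, ys)

def predict(x):
    """Predict sales for a spend value the model has not seen."""
    return slope * x + intercept

print(f"slope={slope:.2f}, intercept={intercept:.2f}, prediction for x=6: {predict(6.0):.2f}")
```

The slope is the covariance of x and y divided by the variance of x, which is where the algebra, calculus, and statistics in the skills list meet.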
How do these specialities fit together?
Watch this video by Krish Naik for more explanation.
What is Data Science?
- A multi-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data.
- Data Science is the fourth paradigm of science:
  * Empirical Science (the past thousands of years)
  * Theoretical Science (the past few hundred years)
  * Computational Science (the last fifty years)
  * Data Science (recent years)
Data Science Lifecycle
- Ask the right questions to begin the discovery process
- Acquire data
- Process and clean the data
- Integrate and store data
- Perform initial data investigation and exploratory data analysis
- Perform feature extraction and feature engineering
- Choose one or more potential models and algorithms
- Apply data science techniques, such as machine learning, statistical modeling, and artificial intelligence
- Measure and improve results
- Present the final results to stakeholders
- Make adjustments based on feedback
- Repeat the process to solve a new problem
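The lifecycle above can be walked through on a toy problem. The following is a sketch under stated assumptions, not a reference implementation: the question, the student records, the threshold model, and all function names are hypothetical and chosen only to keep each step visible.

```python
# 1. Question (hypothetical): can study hours predict whether a student passes?
# 2. Acquire: hard-coded toy records stand in for a real data source.
raw = [
    {"hours": "1", "passed": "no"}, {"hours": "2", "passed": "no"},
    {"hours": "3", "passed": "no"}, {"hours": "", "passed": "yes"},  # missing value
    {"hours": "6", "passed": "yes"}, {"hours": "7", "passed": "yes"},
    {"hours": "8", "passed": "yes"}, {"hours": "2.5", "passed": "no"},
    {"hours": "6.5", "passed": "yes"}, {"hours": "4", "passed": "no"},
    {"hours": "5", "passed": "yes"},
]

def clean(records):
    """3-6. Clean and extract features: drop missing hours, convert types."""
    return [(float(r["hours"]), r["passed"] == "yes") for r in records if r["hours"]]

def accuracy(threshold, data):
    """Share of records where 'hours >= threshold' matches the true label."""
    return sum((hours >= threshold) == label for hours, label in data) / len(data)

def fit_threshold(train):
    """7-8. Model: pick the cut-off on hours that classifies the training set best."""
    candidates = sorted(hours for hours, _ in train)
    return max(candidates, key=lambda t: accuracy(t, train))

data = clean(raw)
train, test = data[:7], data[7:]          # hold out the last records for evaluation

threshold = fit_threshold(train)

# 9. Measure: compare against a baseline that always predicts the majority class.
majority = sum(label for _, label in train) * 2 > len(train)
baseline_acc = sum(label == majority for _, label in test) / len(test)
model_acc = accuracy(threshold, test)

# 10. Present: a one-line report for stakeholders.
print(f"cut-off={threshold}h, model accuracy={model_acc:.2f}, baseline={baseline_acc:.2f}")
```

The model beats the baseline but still misclassifies one held-out student, which is the kind of finding that feeds the "make adjustments" and "repeat" steps.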
Watch this video by Krish Naik or this video by Arbita Chakravarty for more explanation.