Introduction to Data Science

What is Data

Data

  • Data are a set of values of qualitative or quantitative variables about one or more persons or objects.
  • Datum (singular of data) is a single value of a single variable.
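
To make the distinction concrete, here is a minimal sketch in Python. The dataset, the variable names (`name`, `age`, `eye_color`) and all values are made up for illustration and are not taken from any real source.

```python
# A tiny, hypothetical dataset: each row describes one person.
data = [
    {"name": "Ana", "age": 34, "eye_color": "brown"},
    {"name": "Ben", "age": 27, "eye_color": "blue"},
    {"name": "Cho", "age": 41, "eye_color": "green"},
]

# "age" is a quantitative variable; "eye_color" is a qualitative one.
ages = [row["age"] for row in data]
eye_colors = [row["eye_color"] for row in data]

# A datum is a single value of a single variable (here: Ben's age).
datum = data[1]["age"]
print(datum)  # 27
```

Here `data` is the whole set of values, `ages` and `eye_colors` are the values of one variable each, and `datum` is one value of one variable.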

Information

  • Data is usually redundant and uncertain. It only becomes information suitable for making decisions once it has been analyzed in some fashion.
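
As a rough illustration of this step from data to information, the sketch below removes duplicate rows (redundancy) and summarizes noisy values (uncertainty) with a mean and a standard deviation. The readings are hypothetical, and this is only one of many possible kinds of analysis.

```python
import statistics

# Hypothetical raw readings: (time, temperature in degrees Celsius).
# Some rows are exact duplicates, and the values are noisy.
raw = [
    ("09:00", 21.4), ("09:00", 21.4),
    ("09:10", 21.9),
    ("09:20", 22.3), ("09:20", 22.3),
    ("09:30", 21.8),
]

# Remove redundancy: keep each distinct reading once.
unique = sorted(set(raw))
temps = [temp for _, temp in unique]

# Summarize: a typical value plus a measure of its spread.
mean_temp = statistics.mean(temps)
spread = statistics.stdev(temps)

print(f"{len(raw)} raw rows -> {len(unique)} unique readings")
print(f"mean = {mean_temp:.2f}, stdev = {spread:.2f}")
```

The two summary numbers are something a decision can be based on; the six raw rows on their own are not.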

Knowledge

  • Knowledge is the personal understanding based on extensive experience dealing with information on a subject.

Wisdom

  • Wisdom is the ability to think and act using knowledge, experience, understanding, common sense and insight.

The age of data

  • Figure: volume of data created worldwide, in zettabytes ($10^{21}$ bytes).

  • Data in the 21st century is like oil in the 18th century. Like oil, for those who see data's fundamental value and learn to extract and use it, there will be huge rewards.

What to do with data?

Data Engineer:

A data engineer is responsible for building, testing and maintaining the data architecture.

Roles:

  • Develop, construct, test and maintain architectures and processing workflows
  • Build robust, efficient and reliable data pipelines
  • Develop solutions for data acquisition
  • Ensure architecture supports business requirements
  • Develop dataset processes for data modeling, mining, and production
  • Drive the collection of new data and refinement of existing data sources
  • Recommend ways to improve data reliability, efficiency, and quality

Core skills:

  • Database systems:
    • SQL-based systems like MySQL, PostgreSQL, Microsoft SQL Server, and Oracle Database
    • NoSQL databases including MongoDB, Cassandra, Couchbase, and Oracle NoSQL Database
  • Data warehouse software:
    • Cloud-based data warehouses like Amazon Redshift, Panoply, BigQuery, and Snowflake
  • Coding ability:
    • Python
    • Scala
    • Java
    • …
  • Big Data tools:
    • Apache Spark
    • Apache Kafka
    • Apache Hadoop
    • Apache Cassandra
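
A data pipeline of the kind described in the roles above can be sketched as three small stages: extract, transform, load. The example below is a minimal illustration using only the Python standard library. The CSV content, the `orders` table and the function names `extract`, `transform` and `load` are all hypothetical, and an in-memory SQLite database stands in for a real database or warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this might come from a file,
# an API, or a message queue.
RAW_CSV = """order_id,customer,amount
1,alice,19.90
2,bob,
3,alice,5.00
3,alice,5.00
"""


def extract(text):
    """Read raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Drop rows with a missing amount or a repeated order id; cast types."""
    seen = set()
    clean = []
    for row in rows:
        if not row["amount"] or row["order_id"] in seen:
            continue
        seen.add(row["order_id"])
        clean.append((int(row["order_id"]), row["customer"], float(row["amount"])))
    return clean


def load(rows, conn):
    """Write the cleaned rows into a SQL table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(loaded)  # 2
```

Keeping the stages as separate functions makes each one easy to test on its own, which is one common way to work toward the "robust and reliable" goal; production pipelines typically add scheduling, logging and error handling on top.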

Data Analyst:

Guided by business questions, data analysts explore data to extract the information needed to answer them.

Roles:

  • Collect data based on a specific request from leaders
  • Become familiar with the parameters of the data set (types of data, how it can be sorted)
  • Pre-process the data to make sure it is free of errors
  • Interpret the data and analyze how it can solve the business problem
  • Draw conclusions from the analysis
  • Visualize and present the findings to managers
  • Determine the meaning of data
  • Create data quality dashboards and KPI reports about data
  • Document structures and types of business data

Core skills:

  • Statistics (the knowledge of stats makes exploring data easier and helps in avoiding logical errors):
    • SPSS
    • SAS
    • MATLAB
  • SQL (to extract data from the data warehouse for analysis)
  • Data visualization tools (to create visual representations of complex data sets to make them easy for others to understand):
    • Tableau
    • Infogram
    • QuickSight
    • Power BI
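
As an example of the SQL skill listed above, the sketch below runs a typical aggregate query from Python. The `sales` table, its columns and its values are hypothetical, and an in-memory SQLite database is used here as a stand-in for a data warehouse.

```python
import sqlite3

# Stand-in for a data warehouse: a small in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [
        ("north", 120.0),
        ("south", 80.0),
        ("north", 60.0),
        ("south", 40.0),
        ("east", 100.0),
    ],
)

# Business question: which regions bring in the most revenue?
query = """
SELECT region, COUNT(*) AS n_orders, SUM(amount) AS revenue
FROM sales
GROUP BY region
ORDER BY revenue DESC
"""
summary = conn.execute(query).fetchall()

for region, n_orders, revenue in summary:
    print(f"{region:<6} orders={n_orders} revenue={revenue:.2f}")
```

The resulting `summary` rows are what an analyst would then chart in a visualization tool or include in a KPI report.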

Data Scientist:

Data scientists rely on predictive analytics, machine learning, data conditioning, mathematical modeling, and statistical analysis.

Roles:

  • Apply quantitative techniques from fields such as statistics, econometrics, optimization, and machine and deep learning to solve important business problems from many areas of the automotive and mobility industry
  • Utilize statistical approaches to build predictive models
  • Enable evidence-based decision making by extracting insights from structured and unstructured data sets
  • Identify new and novel data sources and explore their potential use in developing actionable business insights
  • Explore new technologies and analytic solutions for use in quantitative model development
  • Design and develop customized interactive reports and dashboards
  • Help maintain and improve existing models

Core skills:

  • Python
  • R
  • SQL
  • Hadoop
  • Algebra
  • Calculus
  • Statistics
  • Machine Learning
  • Deep Learning
  • Data visualization tools
  • Business acumen
  • Communication skills

How do these specialities fit together?

Watch this video by Krish Naik for more explanation.

What is Data Science?

  • A multi-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data.

  • Data Science is the fourth paradigm of science:

    • Empirical Science (past thousands of years)
    • Theoretical Science (past few hundred years)
    • Computational Science (Last fifty years)
    • Data Science (Recent years)

Data Science Lifecycle

  1. Ask the right questions to begin the discovery process

  2. Acquire data

  3. Process and clean the data

  4. Integrate and store data

  5. Initial data investigation and exploratory data analysis

  6. Feature Extraction and Feature Engineering

  7. Choose one or more potential models and algorithms

  8. Apply data science techniques, such as machine learning, statistical modeling, and artificial intelligence

  9. Measure and improve results

  10. Present final result to stakeholders

  11. Make adjustments based on feedback

  12. Repeat the process to solve a new problem
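
Several of the steps above can be sketched end to end on a toy problem. The example below is only an illustration under stated assumptions: the data (hours studied versus exam score) is made up, the helper `fit_line` is hypothetical, the model is a simple least-squares line chosen for brevity, and the train/test split is a plain slice rather than a random sample. It covers steps 2, 3, 5, 7, 8 and 9; the remaining steps involve people and infrastructure rather than code.

```python
import statistics

# Step 1 (question): can hours studied predict the exam score?

# Step 2: acquire data (hypothetical values; None marks a missing entry).
raw = [(1, 52), (2, 55), (3, 61), (None, 70), (4, 64),
       (5, 71), (6, 74), (7, 79), (8, 85)]

# Step 3: process and clean (drop rows with missing values).
rows = [(h, s) for h, s in raw if h is not None and s is not None]

# Step 5: exploratory analysis (basic summary statistics).
hours = [h for h, _ in rows]
scores = [s for _, s in rows]
print(f"mean hours = {statistics.mean(hours):.2f}, "
      f"mean score = {statistics.mean(scores):.2f}")

# Steps 7-8: choose and fit a model (here, a least-squares line).
train, test = rows[:6], rows[6:]


def fit_line(points):
    """Return (slope, intercept) of the ordinary least-squares line."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(train)

# Step 9: measure results on held-out data (mean absolute error).
errors = [abs(y - (slope * x + intercept)) for x, y in test]
mae = statistics.mean(errors)
print(f"score = {slope:.2f} * hours + {intercept:.2f}, test MAE = {mae:.2f}")
```

In a real project the measurement in step 9 would feed back into earlier steps, for example by collecting more data or engineering new features, which is why the process is described as a cycle.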

Watch this video by Krish Naik or this video by Arbita Chakravarty for more explanation.