The Story of Data Science from Past to Present, and the Types of Data

Brief History of Data Science

Ever since the first information technologies appeared in the 19th century, the ways we generate, capture, and process data have grown more complex and far more powerful. More than 50 years ago, the mathematician John W. Tukey laid a foundation for data science in his 1962 article "The Future of Data Analysis." Other prominent names in this domain include:

1. John Chambers, Consulting Professor at Stanford University, was presented the ACM Software System Award for the design of the S system in 1999. The S system is the direct predecessor of the R language and influenced many later statistical programming languages.

2. Jeff Wu, Coca-Cola Chair in Engineering Statistics and Professor at Georgia Tech, is widely credited with popularizing the term "Data Science" in 1997.

3. William Cleveland, Distinguished Professor of Statistics and Professor of Computer Science at Purdue University, authored many books on data visualization.

4. Leo Breiman, distinguished statistician at the University of California, Berkeley, was one of the pioneers in "machine learning," which is one of the advanced data science techniques.

Each of them played a crucial role in the emergence of data science. Over the last five decades, the field has grown into an indisputable discipline in its own right. The 21st century, in particular, has witnessed drastic advancements driven by several developments.

 

Definition of data:

        ·        Data refers to raw, unprocessed facts, figures, and details collected from various sources.

        ·        In data science, data is the foundational material that is analyzed to extract insights and knowledge.

        ·        Data can come from various sources such as databases, sensors, social media, surveys, or experiments.

        ·        These can include numbers, text, images, or other forms of information that represent observations, measurements, or descriptions of entities, events, or phenomena.

Data is typically categorized into several types:

Structured Data: Organized in a predefined format, usually in tables with rows and columns, such as databases or spreadsheets.

        ·        A rough estimate often cited is that about 20% of all data is structured.

        ·        An example of structured data generated by humans is survey responses stored in a database.

        ·        Examples of structured data generated by machines include sensor data from an Internet of Things (IoT) device, server logs, and GPS data.
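
To make the idea of rows and columns concrete, here is a minimal sketch in Python. It assumes a small, made-up set of IoT sensor readings in CSV form; the column names (device_id, timestamp, temperature_c), the values, and the helper functions are all hypothetical and only illustrate why a fixed schema makes data easy to query.

```python
import csv
import io

# Hypothetical IoT sensor readings; column names and values are invented for illustration.
RAW_CSV = """device_id,timestamp,temperature_c
sensor-1,2024-01-01T00:00:00,21.5
sensor-1,2024-01-01T00:05:00,21.7
sensor-2,2024-01-01T00:00:00,19.0
"""


def load_readings(text):
    """Parse CSV text into a list of dicts, one per row, with a numeric temperature."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        row["temperature_c"] = float(row["temperature_c"])
        rows.append(row)
    return rows


def mean_temperature(rows, device_id):
    """Average temperature for one device; returns None if the device has no rows."""
    values = [r["temperature_c"] for r in rows if r["device_id"] == device_id]
    return sum(values) / len(values) if values else None


readings = load_readings(RAW_CSV)
print(len(readings))
print(mean_temperature(readings, "sensor-1"))
```

Because every row has the same columns, a question such as "what is the average temperature for one device?" becomes a simple filter over a known field.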

Unstructured Data: Lacks a specific structure or format.

        ·        Various studies and industry reports suggest that 80% to 90% of all data generated is unstructured.

        ·        Examples of unstructured data generated by humans include text, images, videos, and comments shared on platforms like Twitter, Facebook, Instagram, and LinkedIn, as well as voice recordings and handwritten notes.

        ·        Examples of unstructured data generated by machines include surveillance footage (video recordings from security cameras, which capture continuous video streams without predefined structure); satellite imagery (high-resolution images captured by satellites, often used for geographic information systems (GIS) or environmental monitoring); and medical imaging (images from MRI, CT scans, X-rays, and ultrasounds generated by medical equipment, which require specialized processing to extract structured information).

Semi-Structured Data: Falls between structured and unstructured data, containing elements of both.

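        ·        Examples are JSON (JavaScript Object Notation) and XML (eXtensible Markup Language) files, where data is organized with keys or tags but not in a strict tabular format.

        ·        It is believed that 5% to 20% of all data generated falls into the semi-structured category. Like the figures above, this is a rough estimate, so the three shares should not be expected to add up to exactly 100%.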
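
The "organized but not strictly tabular" nature of semi-structured data can be sketched in Python as follows. The records, field names, and the helpers flatten and to_table are hypothetical and written only for illustration; the sketch assumes that nested objects can be flattened into dotted column names and that missing fields are simply left empty.

```python
import json

# Hypothetical records; the field names and values are invented for illustration.
RAW_JSON = """
[
  {"name": "Asha", "email": "asha@example.com", "tags": ["admin", "beta"]},
  {"name": "Ben", "address": {"city": "Pune"}},
  {"name": "Chen", "email": "chen@example.com",
   "address": {"city": "Delhi", "zip": "110001"}}
]
"""


def flatten(record, prefix=""):
    """Flatten nested dicts into dotted keys; lists are joined into one string."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        elif isinstance(value, list):
            flat[name] = ";".join(str(v) for v in value)
        else:
            flat[name] = value
    return flat


def to_table(records):
    """Return (columns, rows): the union of all keys, with None for missing fields."""
    flat_records = [flatten(r) for r in records]
    columns = sorted({key for rec in flat_records for key in rec})
    rows = [[rec.get(col) for col in columns] for rec in flat_records]
    return columns, rows


columns, rows = to_table(json.loads(RAW_JSON))
print(columns)
for row in rows:
    print(row)
```

Each record carries its own keys, so the records do not share one fixed schema. Turning them into a table means choosing a set of columns and accepting gaps where a record has no value.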

 

Fig. 1. Structured, semi-structured, and unstructured data.
