Fatskills
Practice. Master. Repeat.
Study Guide: CompTIA Data+ DA0-001 Exam: Understanding Common Data Structures and File Formats
Source: https://www.fatskills.com/introdution-to-engineering/chapter/comptia-data-da0-001-exam-understanding-common-data-structures-and-file-formats

CompTIA Data+ DA0-001 Exam: Understanding Common Data Structures and File Formats

By Fatskills Exam Guides Team — the exam nerds behind 28,500+ quizzes and 2.1M practice questions across 500+ global exams.


This guide covers Objective 1.3 (Compare and contrast common data structures and file formats) of the CompTIA Data+ exam and includes the following topics:
- Structured vs. unstructured data
- Various file formats

This guide covers topics related to data structures, structured data, unstructured data, and the file formats encountered in everyday work and personal life. These can be as simple as rows and columns from a structured database or an Excel sheet or as complex as highly unstructured datasets used for artificial intelligence and machine learning. In addition, there are many familiar file formats, such as text or flat files, as well as less common formats, such as XML. It is the variety of data formats across storage, retrieval, and presentation that makes data so interesting.

Structured vs. Unstructured Data
Data structures have been used since the invention of computers to store, retrieve, and process data. A data structure is a representation of data in a memory space that is used to perform retrieval and storage operations effectively. In simple terms, a data structure provides a way to store and organize data so that it can be used efficiently and effectively. Efficiency or effectiveness implies speedy access to data (timeliness) as well as accuracy of data being structured in a certain format.
While different programming languages have different data structures (for example, a stack, an array, a linked list, queues, or trees), the concept of data structures is relevant across almost all languages. There are some exceptions where one or another data type is not found or is not prevalent; however, without data structures, there is no way a programmer or a data engineer can store data in either a structured or unstructured format, and this renders the usability of data close to meaningless.
Data, whether structured or unstructured, can be acquired from many sources, including social media, emails, photos, instant messages, blogs, and articles. These are all examples of human-generated data sources. In addition, machine-generated data is produced by software applications, IoT sensors, and other hardware devices. Examples include security alerts from a commercial warehouse security system and readings from IoT sensors used for air quality control. All this data, if processed, needs to be stored in a certain format, and this is where the distinction between structured and unstructured data comes into play.


The key types of data in terms of storage, processing, and querying are structured, unstructured, and semi-structured data. Each type has its place in data handling and processing, and these types are a key topic for the CompTIA Data+ exam.
The following sections discuss the differences between structured and unstructured data and briefly touch on semi-structured data and metadata.

Structured vs. Unstructured Data
Data can be broadly classified into structured and unstructured data.

TABLE: Differences Between Structured and Unstructured Data

Characteristic | Structured Data | Unstructured Data
Data storage format | Data is stored in predefined formats, such as columns and rows (in relational databases). | Data is stored in undefined and native (or raw) formats.
Data types | Data types can be dates, strings, and numbers. | Data types can be audio, video, word processing files, images, and emails.
Storage space requirements | Structured data requires less space for storage compared to unstructured data. | Unstructured data needs much more storage space compared to structured data.
Security and legacy compatibility | Structured data is simpler to secure and process/handle with legacy solutions. | Unstructured data is much more difficult to secure and process/handle with legacy solutions and requires modern solutions for management.
Typical repository | Structured data is often stored in data warehouses. | Unstructured data is often stored in data lakes.
Volume of data in % | Approximately 20% of organizational data is structured data. | Approximately 80% of organizational data is unstructured data.
Quantitative vs. qualitative | Structured data is quantitative. | Unstructured data is qualitative.
Ease of accessing data | With structured data, it is easy to search and query against the defined fields. | As there is no structure to unstructured data, specialized mechanisms are needed to access it efficiently.


The following sections cover the key aspects of structured and unstructured data and provide real-world examples.

Structured Data
Structured data resides in a fixed field within a record or a file. Structured data may be text, numbers, and any other values that can be stored in a well-defined and linked format (used to capture relationships between different entities) such that a query against the data would yield meaningful results in an expected time frame. Structured data is often stored in tabular form.

Figure: An Abstract View of Structured Data

For example, in a cookbook, a recipe page displays data about the ingredients, cooking time, calories, temperature, and other details. In addition, each recipe may be related to a table of contents (TOC), which makes it easier for readers to locate the information they want. A reader who is looking for a vegetarian curry recipe or a barbecue recipe, for example, can look at the TOC for the page number where the recipe can be found and jump straight to it. Figure 4.2 illustrates this structured relationship.
As another example, structured data makes it simpler for online search engines to understand what data should be displayed as the result of a query. Structured data improves the accuracy and efficiency of the search engine.

Figure: TOC of a Book as Structured Data

With massive volumes of unstructured data, the speed at which the data can be queried decreases, and alternative algorithms may be needed to search through the unstructured data quickly. This topic and its related intricacies are beyond the scope of the CompTIA Data+ exam.
Structured data is mostly organized in tables as rows and columns (as well as key/value pairs). In the relational model, a database is represented as a collection of relations, and a relation is defined as a table. Relational database systems such as Microsoft SQL Server and PostgreSQL leverage the relational structure to store and process data. The next section discusses rows and columns in the context of relational databases.

Rows and Columns
A row consists of related cell values that run horizontally across a table. A table can contain one or more rows, and all rows are by default independent of other rows in the table. Figure 4.3 provides an example of rows in an Excel sheet.

Figure: Rows in an Excel Sheet

A column consists of vertically stacked cell values in a table. Like rows, columns are also independent of other columns in a table. A table can contain one or more columns.

Figure: Columns in an Excel Sheet

Rows and columns by themselves are of limited use because they do not capture real-world relationships. When they intersect, however, they form relations (which is why it is called a relational database), and they become far more useful. A relation is simply a table. A column header in a relational database is known as an attribute of a relation, and a row is known as a tuple.

Figure: Attributes, Tuples, and Relation
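
The terms above can be sketched in a few lines of Python. This is only an illustration, not how a relational database stores data internally; the `Employee` relation, its attribute names, and the `column` helper are hypothetical names chosen for this example.

```python
from collections import namedtuple

# Attributes (column headers) of a hypothetical "employee" relation.
Employee = namedtuple("Employee", ["employee_id", "name", "department"])

# The relation (table) is a collection of tuples (rows);
# each tuple supplies one value per attribute.
employee_relation = [
    Employee(111, "Ashley James", "Sales"),
    Employee(222, "Jaime Angus", "Marketing"),
]


def column(relation, attribute):
    """Return every value stored under one attribute (one column)."""
    return [getattr(row, attribute) for row in relation]


attributes = Employee._fields          # the attribute names
first_tuple = employee_relation[0]     # one row of the relation
departments = column(employee_relation, "department")
```

Reading across `first_tuple` gives one row, while `column(...)` reads down a single attribute.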

From a big data perspective, structured data is simpler for applications to consume for analytics than unstructured data. However, most modern data analytics solutions are making great strides in the area of unstructured data.

Key/Value Pairs
A key/value pair has two elements (hence the word pair): a key, which is a unique identifier that refers to the value, and a value, which is the data itself and may be based on a set of variables.
Consider an example of a customer record being identified as a key/value pair, as shown here:

Key | Value
C899-1 | Always Light Technologies
C899-2 | Kongrad St, Ethos, EU
C899-3 | Antivirus, Anti-malware


Here, the key is a unique identifier that points to the relevant value, which can be customer name, customer address, or a product sold to customers.
The advantage of key/value pairs is that they are very flexible and offer very fast lookups for reads/writes as a single key returns its related value. This can be beneficial when you are not looking to run complex queries against linked tables.
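
As a rough sketch of why single-key lookups are fast, the customer record above can be modeled with a Python dictionary. A dictionary is used here only as a stand-in for a key/value store; `customer_store` and `lookup` are hypothetical names.

```python
# Hypothetical in-memory key/value store built from the customer record above.
customer_store = {
    "C899-1": "Always Light Technologies",
    "C899-2": "Kongrad St, Ethos, EU",
    "C899-3": "Antivirus, Anti-malware",
}


def lookup(store, key, default=None):
    """Return the value for a key, or a default when the key is absent."""
    return store.get(key, default)


customer_name = lookup(customer_store, "C899-1")
missing = lookup(customer_store, "C899-9", default="<no such key>")
```

Each lookup goes straight from key to value; there is no join or scan across linked tables.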

Unstructured Data
Unstructured data is data that is not organized according to a schema or per a defined structure of preset data. In other words, it is data that does not conform to any data model or schemas. The two most relevant types of unstructured data are multimedia (for example, video, audio, images) and text. Unstructured data does not have a predefined framework, and it exists in all forms.
The following figure provides an abstract view of unstructured data.

Figure: An Abstract View of Unstructured Data


Unstructured data is becoming more plentiful and common compared to structured data due to the spurt in data growth over the past few decades. There has been a proliferation of unstructured data from web searches, online sites, applications, software, email, and social media; in addition, machine data, point-of-sale (PoS) data, Internet of Things (IoT) sensor data, and other automated forms of data are becoming increasingly common. Most such data is left in its raw state—which is unstructured data.
Modern toolsets and software are required to process and analyze unstructured data as traditional tools and methods are not efficient with unstructured data. Specialized data analytics tools may be needed for preprocessing and management before queries can be run. For example, audio, video, text, image, and other unstructured data cannot be segmented into simple row-and-column constructs unless they have been preprocessed with specialized applications. One of the ways to handle unstructured data is to leverage non-RDBMS (relational database management system) or NoSQL tools/applications.
It is common for data lake and analytics platforms such as Google Cloud Platform (GCP) BigQuery and Microsoft Azure Synapse to leverage unstructured data for data analytics as well as machine learning. Raw data tends to be unstructured by default, and instead of structuring it, data lakes use unstructured data with advanced algorithms to reduce analysis time and to increase efficiency in terms of searching and indexing.
Most of the machine-generated data that is created automatically by the operations and activities of networked devices (for example, smartphones, PCs, linked wearable products, IoT sensors, and embedded systems) is by default unstructured and in the raw form (even, in most cases, after processing). A marketer can, for example, utilize the unstructured data by leveraging reports or dashboards to understand the trends and formulate the content to continue connecting with potential audiences and consumers.
Finally, the fields in a database can be defined as null, or they can be undefined. A null value means an empty field, that is, a field with no value. Remember that null is not zero. You can use the SQL conditions IS NULL and IS NOT NULL to find out whether a field has a null value. Undefined implies that the field might contain a variable for which a value is not yet defined.
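
The difference between null and zero can be checked with a small experiment. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the `customer` table, its columns, and the sample rows are assumptions made up for this illustration.

```python
import sqlite3

# In-memory database with a hypothetical customer table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, balance REAL)"
)
conn.executemany(
    "INSERT INTO customer (id, name, balance) VALUES (?, ?, ?)",
    [
        (1, "Ashley", 0.0),   # zero is a real value
        (2, "Jaime", None),   # Python None is stored as SQL NULL
        (3, "Siva", 25.5),
    ],
)

null_rows = conn.execute(
    "SELECT name FROM customer WHERE balance IS NULL ORDER BY id"
).fetchall()
not_null_rows = conn.execute(
    "SELECT name FROM customer WHERE balance IS NOT NULL ORDER BY id"
).fetchall()
zero_rows = conn.execute(
    "SELECT name FROM customer WHERE balance = 0 ORDER BY id"
).fetchall()
conn.close()
```

Only the row with no balance matches `IS NULL`, and the row whose balance is zero is returned by `IS NOT NULL` and by `balance = 0`, not by the null check.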

Semi-structured Data
Semi-structured data is somewhere between structured and unstructured data. It shares the characteristics of both. Semi-structured data is mostly textual in nature and conforms to some level of structure (though it may not conform to the rigid structure used in relational databases). Semi-structured data follows certain patterns and schemas. Common examples of semi-structured data are JSON, XML, and CSV documents. While structured data is the easiest type of data to process, semi-structured data is the next easiest, and it is more straightforward to process than unstructured data. Data analytics tools are required for preprocessing and managing semi-structured data. Figure 4.7 provides an abstract view of semi-structured data.

Figure: An Abstract View of Semi-structured Data

Metadata
Data about data is known as metadata. It provides a description of and context for data. Metadata is used to find, organize, and analyze data. An example of metadata is the information contained in photographs, such as:
- The date and time the photo was taken
- The location where the photo was taken
- The filename of the photo
While photographers don’t typically use photo metadata, search engines and analytical tools can use metadata to sort, organize, analyze, and describe a photo.
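
To make the idea concrete, the sketch below reads file-system metadata (name, size, and modification time) with Python's standard library. Note the assumptions: it reads operating system metadata rather than the camera metadata embedded inside a photo, and the file `holiday_photo.jpg` is a hypothetical placeholder filled with dummy bytes.

```python
import os
import tempfile
from datetime import datetime, timezone

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "holiday_photo.jpg")  # hypothetical filename
    with open(path, "wb") as handle:
        handle.write(b"\x00" * 1024)  # placeholder bytes, not a real image

    info = os.stat(path)  # metadata about the file, not its content
    metadata = {
        "filename": os.path.basename(path),
        "size_bytes": info.st_size,
        "modified_utc": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc),
    }
```

None of the values in `metadata` come from the file's content; they describe the file, which is what makes them metadata.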

Data File Formats
Data (including numeric, text, video, and audio data as well as images) can be represented in multiple file formats.

Figure: Data File Formats

A data file format may include scripts, text, and documentation. For example, a text file and a web page may both be written in a word processor, and both are regarded as data files.
Data file formats, such as flat files and XML files, are an important area of focus for the CompTIA Data+ exam.
The sections that follow cover the various data file formats.

Text/Flat File
Using a flat file (also known as a text database) is the simplest way to store information in plaintext. In this format, all the information (including numeric and alphanumeric values) is stored as text. Each line of the file contains one record of the dataset, as shown in Figure 4.9.
With plaintext, the key benefit is that complex software is not needed to create or process a text file. In addition, it is easy to view and modify plaintext data. Flat files are mostly portable across different systems and require a low-level interface.
The main drawback of plaintext files is their simplicity. This may seem counterintuitive, based on what we just discussed as the benefits of plaintext. However, there are no standards that specify the data format of plaintext, and the process of accessing information in plaintext is inefficient compared to standardized databases.
For each line in a flat file, there are two main approaches to differentiating fields: using delimited format and using fixed-width format. With the delimited format, delimiters (such as commas, semicolons, or tabs) mark the boundary between one field and the next, so fields can vary in length, and the delimiters make it easier to find different fields within a record.

Figure: Flat File Delimiter Format

With the fixed-width format, each column is allocated a fixed width in number of characters with one entry per row. Refer to Figure 4.9 for an example of fixed-width format.
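
The two approaches can be compared side by side. The following is a minimal sketch, assuming a three-field employee record and column widths of 5, 15, and 10 characters; the widths, the sample record, and both helper functions are hypothetical.

```python
# Assumed column widths for the fixed-width layout (in characters).
WIDTHS = [5, 15, 10]

delimited_line = "111,Ashley James,Sales"
# Build the fixed-width line by left-aligning and padding each field.
fixed_width_line = f"{'111':<5}{'Ashley James':<15}{'Sales':<10}"


def parse_delimited(line, delimiter=","):
    """Split a record wherever the delimiter appears."""
    return line.split(delimiter)


def parse_fixed_width(line, widths):
    """Slice a record at fixed character offsets and trim the padding."""
    fields, start = [], 0
    for width in widths:
        fields.append(line[start:start + width].strip())
        start += width
    return fields


from_delimited = parse_delimited(delimited_line)
from_fixed_width = parse_fixed_width(fixed_width_line, WIDTHS)
```

Both calls recover the same three fields. The delimited version needs to know only the delimiter, while the fixed-width version needs to know every column width in advance.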

Tab-Delimited File
A tab-delimited file is composed of records, with one record per row; as the name implies, the delimiter is a tab character. Each record typically comprises more than one piece of information, and every piece of information is known as a field. In a tab-delimited file, the first row usually consists of headers for the column names; this provides a structured format for tab-delimited files.
Another term for delimiter is field separator, because delimiters (such as a tab character) separate fields from one another.

Figure: Saving in Tab-Delimited Format

The tab-delimited file format stores data from a spreadsheet or database in a tabular format. Tab-delimited format is also referred to as tab-separated values, or TSV format.
It is important to remember that tab-delimited format and comma-delimited format (discussed in the next section) are text file formats. As you can see in Figure 4.11, Excel warns that saving in tab-delimited or comma-delimited format causes all formatting to be lost.

Figure: Tab-Delimited Format is Plaintext Format
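
As an illustration, a tab-delimited file with a header row can be read with Python's standard `csv` module by setting the delimiter to a tab. The sample text below is made up for this sketch and is held in memory rather than in a real file.

```python
import csv
import io

# Hypothetical tab-delimited content: a header row, then one record per line.
tsv_text = (
    "Employee\tDepartment\tEmployee ID\n"
    "Ashley James\tSales\t111\n"
    "Jaime Angus\tMarketing\t222\n"
)

# DictReader uses the first row as field names.
reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
records = list(reader)
field_names = reader.fieldnames
```

Every value comes back as a plain string (for example, `"111"` rather than the number 111), which reflects the point that the format is plaintext and carries no formatting or type information.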

Comma-Delimited File
As you have likely guessed, in the comma-delimited format, the data is separated by commas. This is one of the most common file formats for exchanging information between applications, and almost all data systems are capable of exporting and importing comma-delimited information. Also known as comma-separated values (CSV) format, the comma-delimited format is commonly used in many applications, such as Microsoft Excel and Google Sheets.

Figure: Comma-Delimited Format

As you can see, comma-delimited format is essentially a text file with datasets delimited by commas.

Figure: Comma-Delimited Format is a Text File

It is important to note that the delimiter is typically not part of a field itself but is used to separate the fields from one another. To prevent a delimiter that appears inside a field's data from being mistaken for a field separator, qualifiers are used. A qualifier is placed around a field so that any delimiter inside it is treated as part of the data. The most common qualifier with CSV files is double quotes (that is, " ").
CSV files are commonly used in data science projects. In some files, semicolons are used instead of commas to delimit values; in such a case, the file would be called a delimiter-separated values (DSV) file.
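
The role of qualifiers can be seen with Python's standard `csv` module, which by default wraps a field in double quotes only when the field contains the delimiter. This is a sketch using made-up rows; the naive `split(",")` is included only to show what goes wrong when qualifiers are ignored.

```python
import csv
import io

rows = [
    ["C899-1", "Always Light Technologies"],
    ["C899-2", "Kongrad St, Ethos, EU"],  # this value contains commas
]

# Write the rows; the writer adds double-quote qualifiers where needed.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()

# A CSV-aware reader honors the qualifiers and recovers the original fields.
parsed = list(csv.reader(io.StringIO(csv_text)))

# A naive split on commas breaks the address into several pieces.
naive = csv_text.splitlines()[1].split(",")
```

For a semicolon-delimited (DSV) file, the same reader and writer accept `delimiter=";"`.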

JavaScript Object Notation (JSON)
JSON is a lightweight text-based file format for storing and transporting data. It is often used when data is sent from a web server to a client web page. The JSON format was first detailed in March 2001 by Douglas Crockford. RFC 8259 (see https://datatracker.ietf.org/doc/html/rfc8259), which is the main reference for JSON data interchange format, was published in December 2017 by the Internet Engineering Task Force (IETF).
JSON is an open standard format and easy to understand. JSON is self-describing, as it enables the reader to read the actual content in a hierarchical manner. JSON is simple for users to write and read, and it makes it easy for machines to generate and parse the data. Example 4.1 shows an example of JSON syntax.

Example 4.1 JSON Syntax

{
  "EmployeeInfo": [
    {
      "firstName": "Ashley",
      "lastName": "James",
      "department": "sales",
      "ID": 111
    },
    {
      "firstName": "Jaime",
      "lastName": "Angus",
      "department": "marketing",
      "ID": 222
    }
  ]
}


JSON syntax follows these rules:
- Data is stored in name/value pairs. In a JSON file, a name/value pair consists of a field name in double quotes, followed by a colon, followed by a value (string values are also written in double quotes). In Example 4.1, "firstName": "Ashley" is a name/value pair. The values can be of the following types:
  - String
  - Number
  - Boolean (true or false)
  - Object
  - Array
  - null

- Data fields are delimited by commas. A comma separates each name/value pair from the next; JSON does not permit a trailing comma after the last pair in an object or the last element in an array.
- Square brackets ([]) are used to hold arrays. An array in JSON is an ordered collection of values.
- Curly braces ({}) are used to hold objects. An object in JSON is an unordered set of name/value pairs. An object begins with { and ends with }.
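
These rules can be checked by parsing a small document with Python's standard `json` module. The sketch below uses employee data modeled on Example 4.1; the `is_valid_json` helper is a hypothetical name introduced for this illustration.

```python
import json

json_text = """
{
  "EmployeeInfo": [
    {"firstName": "Ashley", "lastName": "James", "department": "sales", "ID": 111},
    {"firstName": "Jaime", "lastName": "Angus", "department": "marketing", "ID": 222}
  ]
}
"""

data = json.loads(json_text)       # the outer {} becomes a dict (object)
employees = data["EmployeeInfo"]   # the [] becomes a list (array)
employee_ids = [employee["ID"] for employee in employees]


def is_valid_json(text):
    """Return True if the text parses as JSON, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


# A comma after the last name/value pair is rejected by the parser.
trailing_comma_ok = is_valid_json('{"ID": 111,}')
```

Objects map to dictionaries, arrays map to lists, and unquoted numbers such as 111 come back as integers rather than strings.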

Extensible Markup Language (XML)
XML is a platform-independent (or platform-agnostic) standard markup language that uses the same rules of data formatting and encoding across platforms. The XML file format was created for storing and transporting data without being dependent on the underlying platform. In fact, the X in XML stands for Extensible, which implies that the format can be extended with any number of user-defined tags, based on the user's requirements.
Moreover, XML is described as extensible because it is not a fixed format (as Hypertext Markup Language [HTML] is), and it makes it simple to denote metadata in a reusable and portable format. XML enables the use of structured and portable data for display on wireless devices such as smartphones. XML files have the extension .xml.
Standard Generalized Markup Language (SGML) is an international standard for the definition of markup languages. In other words, it is a metalanguage. Both XML and HTML are document formats derived from SGML.
Like many other web languages, XML is both human and machine comprehensible. XML stores data in plaintext in order to enable data exchange between incompatible systems. XML is widely used for exchanging data over the World Wide Web (WWW) as well as data storage. This is a key difference from HTML, which is used primarily for data representation on the web rather than for data transfer. HTML is covered in detail in the next section.
It is important to remember that XML is not only suited for web use but can be used across multiple platforms to achieve various outcomes. For example, XML can be used for sharing data between Internet of Things (IoT) sensors and IoT platforms.

Example 4.2 shows a standard tab-delimited file structured in XML.

Example 4.2 XML Document on Employee Information

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
"http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd">
<article lang="">
<para>        Employee        Department        Employee ID</para>
<para>Record 1        Ashley James        Sales        111</para>
<para>Record 2        Jaime Angus        Marketing        222</para>
</article>

To create this example, the authors used an online XML tool to generate XML from the tab-delimited file shown earlier.

Let’s examine the structure of this file and the syntax of XML in general:
- XML prolog: This component, added at the beginning of an XML document, includes the XML declaration, DOCTYPE, comments, and processing instructions. In Example 4.2, the XML prolog is:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
"http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd">
- XML declaration: This line right at the top (in Example 4.2, <?xml version="1.0" encoding="UTF-8"?>) tells the device reading the file that this is an XML document and gives the version of XML (in this case, version 1.0).
- XML tag: XML tags are used to mark the beginning and end of elements. In Example 4.2, the opening tag is <para>, and the closing tag is </para>.
- XML element: An XML element consists of an opening tag, attributes, content, and a closing tag. An element can contain:
  - Text
  - Attributes
  - Other elements
  - A mix of the above

In Example 4.2, one of the elements is
<para>Record 1        Ashley James        Sales        111</para>

Example 4.3 provides another XML document example.

Example 4.3 XML Document on Customer Information

<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<?xml-stylesheet type="text/css" href="/style/design"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<customer_list>
  <customer>
    <lastname> Behl </lastname>
    <firstname> Akhil </firstname>
    <location> AU </location>
  </customer>
  <customer>
    <lastname> G S </lastname>
    <firstname> Siva </firstname>
    <location> UK </location>
  </customer>
</customer_list>

As you can see in Examples 4.2 and 4.3, XML does not have predefined tags as HTML does; in fact, tags can vary from one XML document to another. XML tags are used to identify, arrange, and store the data rather than to denote how to show it (whereas HTML tags are used to actually show the data).
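
Because the tags describe the data, a program can read values by tag name. The following sketch parses a simplified customer list (without the stylesheet and DOCTYPE lines) using Python's standard `xml.etree.ElementTree` module; the `is_well_formed` helper is a hypothetical name introduced for this illustration.

```python
import xml.etree.ElementTree as ET

# Simplified customer list modeled on Example 4.3.
xml_bytes = b"""<?xml version="1.0" encoding="UTF-8"?>
<customer_list>
  <customer>
    <lastname> Behl </lastname>
    <firstname> Akhil </firstname>
    <location> AU </location>
  </customer>
  <customer>
    <lastname> G S </lastname>
    <firstname> Siva </firstname>
    <location> UK </location>
  </customer>
</customer_list>
"""

root = ET.fromstring(xml_bytes)

# Turn each <customer> element into a dictionary keyed by child tag name.
customers = [
    {child.tag: child.text.strip() for child in customer}
    for customer in root.findall("customer")
]


def is_well_formed(text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# An opening tag without its matching closing tag is rejected.
broken_ok = is_well_formed("<customer><lastname>Behl</customer>")
```

The parser accepts the document only if every opening tag has a matching closing tag, which is why a missing closing tag makes the second check fail.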

Hypertext Markup Language (HTML)
Hypertext refers to machine-readable text that links to other text, and markup refers to structuring that text in a particular format. HTML is the code used to define the structure of a web page and its content. HTML is the basic markup language of the web and is used by web browsers to render pages on the WWW. Hypertext permits a user to click a link and be taken to the referenced page automatically.
It is important to remember that HTML is a presentation language, whereas XML is a data-description language. Basically, HTML defines the way a user sees a web page, whereas XML defines the way data is stored and transmitted between a server and a client or between different systems. HTML is the standard markup language for web pages. HTML files have the extension .html or .htm.
HTML comprises a set of elements that allows a browser to show content. In HTML, content is structured using elements such as paragraphs, bulleted lists, data tables, and embedded multimedia.
Let’s take a look at an example of HTML. Example 4.4 shows the structure of a standard HTML document.

Example 4.4 HTML Document Structure

<html>
<head>
<title> Title of the page</title>
</head>
<body>
<h1> My First heading </h1>
<p> My First paragraph. </p>
</body>
</html>

In Example 4.4:
- <html> indicates the HTML page root element.
- <head> indicates the HTML page metadata.
- <title> denotes the HTML page title.
- <body> denotes the body of the document, which is the visible content.
- <h1> indicates a heading on the web page.
- <p> indicates a paragraph.

Much like XML documents, HTML documents include tags, and with each opening tag (for example, <html>), there is a closing tag (for example, </html>). Unlike in XML, however, in HTML the tags are predefined.
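
To see how software separates tags from content, the sketch below feeds the document from Example 4.4 to Python's standard `html.parser` module. The `TagCollector` class is a hypothetical helper written for this illustration; a real browser does far more than this.

```python
from html.parser import HTMLParser

html_text = """<html>
<head>
<title> Title of the page</title>
</head>
<body>
<h1> My First heading </h1>
<p> My First paragraph. </p>
</body>
</html>
"""


class TagCollector(HTMLParser):
    """Record opening tags and the visible text between them."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.text_chunks = []

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_data(self, data):
        # Skip chunks that are only whitespace between tags.
        if data.strip():
            self.text_chunks.append(data.strip())


collector = TagCollector()
collector.feed(html_text)
collector.close()
```

After parsing, `collector.open_tags` holds the tag names in document order, and `collector.text_chunks` holds only the text a reader would see.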


Figure: HTML Content in a Text File

Now you can save this file with the extension .html, as shown in Figure 4.16.

Figure: Saving the HTML File

If you now launch this page in a browser, you can see the effect the HTML tags have on the content of the file (see Figure 4.17).

Figure: HTML Page Viewed in a Browser

As the figure above shows, the browser does not display the HTML tags. Rather, it uses them to determine how to display the document’s content. The figure uses Google Chrome, though you can use any web browser for this code; go ahead and try it for yourself!


TABLE: Comparing HTML, XML, and JSON

Characteristic | HTML | XML | JSON
What it is based on | HTML is based on SGML. | XML is based on SGML. | JSON is based on the JavaScript programming language.
What it is | HTML is a markup language. | XML offers a framework to define markup languages. | JSON is a lightweight format that is used for data interchange.
Static vs. dynamic | HTML is static in nature and focused on data presentation. | XML is dynamic in nature and focused on data storage and transfer from databases. | JSON represents objects and is dynamic in nature.
Readability | HTML is relatively easy for humans to read. | XML is more difficult for humans to read and interpret. | JSON is easy for humans to read.
Case sensitivity | HTML is not case sensitive. | XML is case sensitive. | JSON is case sensitive.

 

 
