By Fatskills Exam Guides Team — the exam nerds behind 28,500+ quizzes and 2.1M practice questions across 500+ global exams.
This guide covers Objective 1.3 (Compare and contrast common data structures and file formats) of the CompTIA Data+ exam and includes the following topics:
- Structured vs. unstructured data
- Various file formats
This guide covers data structures, structured and unstructured data, and the file formats encountered in everyday work and personal life. Data can be as simple as the rows and columns of a relational database or an Excel sheet, or as complex as the highly unstructured datasets used for artificial intelligence and machine learning. File formats range from the very familiar, such as text or flat files, to the less familiar, such as XML. This variety of formats across storage, retrieval, and presentation is what makes data so interesting.

Structured vs. Unstructured Data

Data structures have been used since the early days of computing to store, retrieve, and process data. A data structure is a way of organizing data in memory so that storage and retrieval operations can be performed efficiently. Here, efficiency means both fast access to the data (timeliness) and keeping the data accurate within a defined format.

Different programming languages offer different data structures (for example, stacks, arrays, linked lists, queues, and trees), and a particular structure may be missing or rarely used in a given language, but the concept applies across almost all languages. Without data structures, a programmer or data engineer has no practical way to store data, whether structured or unstructured, and the data becomes close to unusable.

Data, whether structured or unstructured, can be acquired from many sources, including social media, emails, photos, instant messages, blogs, and articles. These are all examples of human-generated data. Machine-generated data, by contrast, is produced by software applications, IoT sensors, and other hardware devices.
Examples include security alerts from a commercial warehouse security system and readings from IoT air-quality sensors. All of this data, once processed, needs to be stored in some format, and this is where the distinction between structured and unstructured data comes into play.
In terms of storage, processing, and querying, the key types of data are structured, unstructured, and semi-structured. Each has its place in data handling and processing, and the distinction is a key topic for the CompTIA Data+ exam. The following sections discuss the differences between structured and unstructured data and briefly touch on semi-structured data and metadata. Data can be broadly classified into structured and unstructured data.

TABLE: Differences Between Structured and Unstructured Data
The following sections cover the key aspects of structured and unstructured data and provide real-world examples.

Structured Data

Structured data resides in a fixed field within a record or a file. It may be text, numbers, or any other values that can be stored in a well-defined, linked format (one that captures relationships between different entities), so that a query against the data yields meaningful results in an expected time frame. Structured data is often stored in tabular form.

Figure: An Abstract View of Structured Data

For example, a recipe page in a cookbook presents data about the ingredients, cooking time, calories, temperature, and other details. Each recipe is also listed in the table of contents (TOC), which makes it easier for readers to locate the information they want. A reader who is looking for a vegetarian curry or a barbecue recipe can check the TOC for the page number and jump straight to the recipe. Figure 4.2 illustrates this structured relationship.

As another example, structured data makes it simpler for online search engines to understand what should be displayed as the result of a query, which improves the accuracy and efficiency of the search engine.

Figure: TOC of a Book as Structured Data

With massive volumes of unstructured data, by contrast, queries slow down, and alternative algorithms may be needed to search the data at an acceptable speed. This topic and its intricacies are beyond the scope of the CompTIA Data+ exam.

Structured data is mostly organized in tables as rows and columns, and sometimes as key/value pairs. In the relational model, a database is represented as a collection of relations, and a relation is defined as a table. Relational database systems such as Microsoft SQL Server and PostgreSQL use this relational structure to store and process data.
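As a minimal sketch of the relational idea described above, the following Python snippet uses the standard-library sqlite3 module with an in-memory database. The table name employee, its column names, the two sample rows, and the helper functions are all assumptions made for illustration; they do not come from any real system.

```python
import sqlite3


def build_employee_table():
    """Create an in-memory table (a relation) and return the open connection."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE employee ("
        "id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT, department TEXT)"
    )
    # Each tuple below becomes one row of the table (sample values only).
    rows = [
        (111, "Ashley", "James", "sales"),
        (222, "Jaime", "Angus", "marketing"),
    ]
    conn.executemany("INSERT INTO employee VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def employees_in(conn, department):
    """Return (first_name, last_name) pairs for one department, ordered by id."""
    cursor = conn.execute(
        "SELECT first_name, last_name FROM employee "
        "WHERE department = ? ORDER BY id",
        (department,),
    )
    return cursor.fetchall()


if __name__ == "__main__":
    connection = build_employee_table()
    print(employees_in(connection, "sales"))
    connection.close()
```

Because every row has the same fixed columns, a query can name the column it filters on, which is what makes structured data straightforward to search.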
The next section discusses rows and columns in the context of relational databases.

Rows and Columns

A row consists of related cell values that run horizontally across a table. A table can contain one or more rows, and each row is by default independent of the other rows in the table. Figure 4.3 provides an example of rows in an Excel sheet.

Figure: Rows in an Excel Sheet

A column consists of vertically stacked cell values in a table. Like rows, columns are independent of the other columns in a table, and a table can contain one or more columns.

Figure: Columns in an Excel Sheet

Rows and columns are not very useful on their own, because in isolation they do not capture real-world relationships. When they intersect, however, they form a relation (which is why it is called a relational database!), and they become far more useful. In relational terms, the table itself is the relation, a column header is an attribute of the relation, and a row is a tuple.

Figure: Attributes, Tuples, and Relation

From a big data perspective, structured data is simpler for applications to consume for analytics than unstructured data. However, modern data analytics solutions are making great strides in handling unstructured data.

Key/Value Pairs

A key/value pair has two elements (hence the word pair): a key, which is a unique identifier that refers to the value, and a value, which is the data itself and may be made up of a set of variables. Consider, for example, a customer record identified as a key/value pair.
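One way to picture such a customer record is as a Python dictionary. The following is a minimal sketch; the customer IDs, names, addresses, products, and the lookup helper are hypothetical and chosen only for illustration.

```python
# A dictionary acts as a tiny key/value store: each unique key maps to one value.
# All keys and values below are made-up sample data.
customers = {
    "C1001": {"name": "Ashley James", "address": "12 Example Street", "product": "laptop"},
    "C1002": {"name": "Jaime Angus", "address": "34 Sample Road", "product": "monitor"},
}


def lookup(store, key):
    """Return the value stored under key, or None when the key is absent."""
    return store.get(key)


if __name__ == "__main__":
    record = lookup(customers, "C1001")
    print(record["name"])
    print(lookup(customers, "C9999"))
```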
Here, the key is a unique identifier that points to the relevant value, which can be a customer name, a customer address, or a product sold to the customer. Key/value pairs are very flexible and offer very fast lookups for reads and writes, because a single key returns its related value. This is beneficial when you do not need to run complex queries against linked tables.

Unstructured Data

Unstructured data is data that is not organized according to a schema or a predefined structure. In other words, it does not conform to any data model. The two most relevant types of unstructured data are multimedia (for example, video, audio, and images) and text. Because it has no predefined framework, unstructured data exists in all forms.

Figure: An Abstract View of Unstructured Data
Unstructured data has become more plentiful than structured data because of the surge in data growth over the past few decades. It proliferates from web searches, websites, applications, email, and social media; in addition, machine data, point-of-sale (PoS) data, Internet of Things (IoT) sensor data, and other automatically generated data are increasingly common. Most such data is left in its raw, unstructured state.

Traditional tools and methods are not efficient with unstructured data, so modern toolsets are required to process and analyze it. Specialized data analytics tools may be needed for preprocessing and management before queries can be run. For example, audio, video, text, and image data cannot be segmented into simple row-and-column constructs unless it has been preprocessed with specialized applications. One way to handle unstructured data is to use non-relational (NoSQL) tools and applications rather than a relational database management system (RDBMS).

Data lakes, together with cloud analytics services such as Google Cloud Platform (GCP) BigQuery and Microsoft Azure Synapse, commonly use unstructured data for data analytics as well as machine learning. Raw data tends to be unstructured by default, and instead of structuring it up front, data lakes store it as is and rely on advanced algorithms to keep searching, indexing, and analysis efficient. Most machine-generated data created automatically by networked devices (for example, smartphones, PCs, connected wearables, IoT sensors, and embedded systems) is unstructured by default and often remains in raw form.
A marketer can, for example, use reports or dashboards built on unstructured data to understand trends and to shape content that keeps connecting with potential audiences and consumers.

Finally, a field in a database can be null, or it can be undefined. A null value means an empty field, that is, a field with no value. Remember that null is not zero. In SQL, you can use the IS NULL and IS NOT NULL operators to test whether a field holds a null value. Undefined implies that the field might contain a variable for which a value has not yet been defined.

Semi-structured Data

Semi-structured data sits between structured and unstructured data and shares characteristics of both. It is mostly textual and conforms to some level of structure, though not the rigid structure used in relational databases. Semi-structured data follows certain patterns and schemas. Common examples are JSON, XML, and CSV documents. Structured data is the easiest type of data to process, semi-structured data is the next easiest, and unstructured data is the hardest. Data analytics tools are required for preprocessing and managing semi-structured data. Figure 4.7 provides an abstract view of semi-structured data.

Figure: An Abstract View of Semi-structured Data

Metadata

Data about data is known as metadata. It provides a description of and context for data, and it is used to find, organize, and analyze data. An example of metadata is the information stored with a photograph, such as:
- The date and time the photo was taken
- The location where the photo was taken
- The filename of the photo

Even if photographers rarely look at this metadata themselves, search engines and analytical tools can use it to sort, organize, analyze, and describe a photo.
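As a rough illustration of metadata, the sketch below reads the file-system metadata of a file (its name, size, and last-modified time) with Python's standard library, without looking at the file's content. Note the assumptions: the describe_file helper and the photo.jpg name are hypothetical, the file is a placeholder created by the script rather than a real photo, and metadata embedded inside an image (such as the capture location) would need an image library that is not shown here.

```python
import os
import tempfile
from datetime import datetime, timezone


def describe_file(path):
    """Collect basic file-system metadata for path without reading its content."""
    info = os.stat(path)
    return {
        "filename": os.path.basename(path),
        "size_bytes": info.st_size,
        # st_mtime is a POSIX timestamp; convert it to a timezone-aware datetime.
        "modified_utc": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc),
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        sample_path = os.path.join(folder, "photo.jpg")
        with open(sample_path, "wb") as handle:
            handle.write(b"\x00" * 1024)  # placeholder bytes, not a real image
        print(describe_file(sample_path))
```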
Data File Formats

Data (including numeric, text, video, and audio data as well as images) can be represented in multiple file formats.

Figure: Data File Formats
A data file format may include scripts, text, and documentation. For example, a text file and a web page can both be written in a text editor, and both are regarded as data files. Data file formats such as flat files and XML files are an important area of focus for the CompTIA Data+ exam. The sections that follow cover the various data file formats.

Text/Flat File

A flat file (also known as a text database) is the simplest way to store information in plaintext. In this format, all the information (including numeric and alphanumeric values) is stored as text. Each line of the file contains one record of the dataset, as shown in Figure 4.9.

The key benefit of plaintext is that complex software is not needed to create or process a text file. It is also easy to view and modify plaintext data. Flat files are mostly portable across different systems and need only a simple interface to read and write.

The main drawback of plaintext files is also their simplicity. This may seem counterintuitive after the benefits just discussed, but no standard specifies the data format of plaintext, and accessing information in plaintext is inefficient compared with a database.

Within each line of a flat file, there are two main approaches to differentiating fields: the delimited format and the fixed-width format. In the delimited format, a delimiter character (a comma, semicolon, tab, and so on) marks where one field ends and the next begins, which makes it easy to find the different fields within a record.

Figure: Flat File Delimiter Format
With the fixed-width format, each column is allocated a fixed width in characters, with one record per row. Refer to Figure 4.9 for an example of the fixed-width format.

Tab-Delimited File

A tab-delimited file is composed of records laid out one per row; as the name implies, the delimiter is a tab character. Every record typically comprises more than one piece of information, and each piece of information is known as a field. In the tab-delimited file format, the first row usually holds the column names as headers, which gives the file its structure. Another term for delimiter is field separator, because delimiters (such as a tab character) separate fields from one another.

Figure: Saving in Tab-Delimited Format

The tab-delimited file format stores data from a spreadsheet or database in a tabular format. It is also referred to as tab-separated values (TSV) format. It is important to remember that tab-delimited format and comma-delimited format (discussed in the next section) are text file formats. As you can see in Figure 4.11, Excel warns that saving in tab-delimited or comma-delimited format causes all formatting to be lost.

Figure: Tab-Delimited Format is Plaintext Format

Comma-Delimited File

As you have likely guessed, in the comma-delimited format, the data is separated by commas. This is one of the most common file formats for exchanging information between applications, and almost all data systems are capable of exporting and importing comma-delimited information. Also known as comma-separated values (CSV) format, it is supported by many applications, such as Microsoft Excel and Google Sheets.

Figure: Comma-Delimited Format

As you can see, the comma-delimited format is essentially a text file with fields delimited by commas.

Figure: Comma-Delimited Format is a Text File
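The three flat-file layouts described above (comma-delimited, tab-delimited, and fixed-width) can be read with a few lines of Python. This is a minimal sketch using the standard-library csv module; the sample records, the column widths, and the helper names parse_delimited and parse_fixed_width are assumptions made for illustration.

```python
import csv
import io


def parse_delimited(text, delimiter):
    """Parse delimited text whose first row holds the column headers."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return list(reader)


def parse_fixed_width(line, widths):
    """Split one fixed-width record into fields using the given column widths."""
    fields = []
    start = 0
    for width in widths:
        # Slice out the column, then drop the padding spaces.
        fields.append(line[start:start + width].strip())
        start += width
    return fields


# Sample data (illustrative only).
CSV_TEXT = "first_name,last_name,department,id\nAshley,James,sales,111\n"
TSV_TEXT = "first_name\tlast_name\tdepartment\tid\nJaime\tAngus\tmarketing\t222\n"
FIXED_LINE = "Ashley".ljust(10) + "James".ljust(10) + "sales".ljust(10) + "111"

if __name__ == "__main__":
    print(parse_delimited(CSV_TEXT, ","))
    print(parse_delimited(TSV_TEXT, "\t"))
    print(parse_fixed_width(FIXED_LINE, [10, 10, 10, 3]))
```

Every parsed value, including the ID, comes back as a string, which reflects the point that a flat file stores all information as text.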
It is important to note that the delimiter is meant to separate fields from one another, not to appear within a field. To keep delimiters that occur in the data from being mistaken for field separators, qualifiers are used. A qualifier is placed around a field so that any delimiter character inside it is read as part of the field's data. The most common qualifier in CSV files is the double quote (that is, " "). For example, "Sydney, AU" is read as a single field even though it contains a comma. CSV files are commonly used in data science projects. In some files, semicolons or other characters are used instead of commas to delimit values; such a file is more generally called a delimiter-separated values (DSV) file.

JavaScript Object Notation (JSON)

JSON is a lightweight, text-based format for storing and transporting data. It is often used when data is sent from a web server to a client web page. The JSON format was first specified by Douglas Crockford in the early 2000s. RFC 8259 (see https://datatracker.ietf.org/doc/html/rfc8259), the main reference for the JSON data interchange format, was published in December 2017 by the Internet Engineering Task Force (IETF). JSON is an open standard and easy to understand. It is self-describing, in that a reader can follow the content in a hierarchical manner. JSON is simple for people to read and write and easy for machines to generate and parse. Example 4.1 shows JSON syntax.

Example 4.1 JSON Syntax

    {
      "EmployeeInfo": [
        {"firstName": "Ashley", "lastName": "James", "department": "sales", "ID": 111},
        {"firstName": "Jaime", "lastName": "Angus", "department": "marketing", "ID": 222}
      ]
    }
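To see how a program consumes such a document, here is a minimal sketch using Python's standard-library json module. The string mirrors the employee records of Example 4.1, written as strictly valid JSON (no trailing commas); the helper names employee_names and is_valid_json are hypothetical.

```python
import json

# Employee records in the shape of Example 4.1, as strictly valid JSON.
DOCUMENT = """
{
  "EmployeeInfo": [
    {"firstName": "Ashley", "lastName": "James", "department": "sales", "ID": 111},
    {"firstName": "Jaime", "lastName": "Angus", "department": "marketing", "ID": 222}
  ]
}
"""


def employee_names(text):
    """Parse the JSON text and return each employee's full name."""
    data = json.loads(text)  # object -> dict, array -> list, number -> int/float
    return [f'{item["firstName"]} {item["lastName"]}' for item in data["EmployeeInfo"]]


def is_valid_json(text):
    """Return True when text parses as JSON, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


if __name__ == "__main__":
    print(employee_names(DOCUMENT))
    print(is_valid_json('{"ID": 111,}'))  # a trailing comma is not valid JSON
```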
JSON syntax follows these rules:
- Data is stored in name/value pairs. A name/value pair consists of a field name in double quotes, followed by a colon, followed by a value. In Example 4.1, "firstName": "Ashley" is a name/value pair. A value can be of the following types:
  - String (in double quotes)
  - Number
  - Boolean (true or false)
  - Object
  - Array
  - null
- Name/value pairs and array items are separated by commas. In Example 4.1, note the comma between consecutive name/value pairs; no comma follows the last one.
- Square brackets ([]) hold arrays. An array in JSON is an ordered collection of values.
- Curly braces ({}) hold objects. An object in JSON is an unordered set of name/value pairs. An object begins with { and ends with }.

Extensible Markup Language (XML)

XML is a platform-independent (or platform-agnostic) standard markup language that uses the same rules of data formatting and encoding across platforms. The XML file format was created for storing and transporting data without depending on the underlying platform. The X in XML stands for Extensible, which means that users can define their own tags to suit their requirements. XML is extensible because, unlike Hypertext Markup Language (HTML), it does not have a fixed set of tags, and this makes it simple to express metadata in a reusable and portable form. XML enables the use of structured and portable data for display on wireless devices such as smartphones. XML files have the extension .xml.

Standard Generalized Markup Language (SGML) is an international standard for the definition of markup languages. In other words, it is a metalanguage. Both XML and HTML are derived from SGML.

Like many other web languages, XML is both human and machine readable. XML stores data in plaintext in order to enable data exchange between otherwise incompatible systems. XML is widely used for exchanging data over the World Wide Web (WWW) as well as for data storage.
This is a key difference from HTML, which is used primarily for presenting data on the web rather than for transferring it. HTML is covered in detail in the next section. It is important to remember that XML is suited not only to web use but can be used across multiple platforms to achieve various outcomes. For example, XML can be used for sharing data between Internet of Things (IoT) sensors and IoT platforms. Example 4.2 shows a tab-delimited file restructured as XML.

Example 4.2 XML Document on Employee Information

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
      "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd">
    <article lang="">
      <para> Employee Department Employee ID</para>
      <para>Record 1 Ashley James Sales 111</para>
      <para>Record 2 Jaime Angus Marketing 222</para>
    </article>

To create this example, the authors used an online XML tool to generate XML from the tab-delimited file shown earlier. Let’s examine the structure of this file and the syntax of XML in general:
- XML prolog: This component at the beginning of an XML document can include the XML declaration, the DOCTYPE declaration, comments, and processing instructions. In Example 4.2, the prolog consists of the XML declaration and the DOCTYPE declaration that references "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd".
- XML declaration: This line right at the top (in Example 4.2, <?xml version="1.0" encoding="UTF-8"?>) tells the parser reading the file that this is an XML document and gives the version of XML (in this case, version 1.0) and the character encoding (UTF-8).
- XML tag: XML tags mark the beginning and end of an element. In Example 4.2, <para> is an opening tag, and </para> is the matching closing tag.
- XML element: An XML element consists of an opening tag, any attributes, content, and a closing tag. An element can contain:
  - Text
  - Attributes
  - Other elements
  - A mix of the above
  In Example 4.2, one of the elements is <para>Record 1 Ashley James Sales 111</para>
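The tag and element terminology above can be explored with Python's standard-library xml.etree.ElementTree module. The following is a minimal sketch: the document is a simplified version of Example 4.2 with the DOCTYPE omitted (an assumption, so that the sketch does not involve an external DTD), and the helper names para_texts and is_well_formed are hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified version of Example 4.2 (DOCTYPE omitted for this sketch).
DOCUMENT = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b"<article>"
    b"<para>Record 1 Ashley James Sales 111</para>"
    b"<para>Record 2 Jaime Angus Marketing 222</para>"
    b"</article>"
)


def para_texts(xml_bytes):
    """Return the root tag name and the text content of each <para> element."""
    root = ET.fromstring(xml_bytes)
    return root.tag, [para.text for para in root.findall("para")]


def is_well_formed(xml_bytes):
    """Return True when every opening tag has a matching closing tag."""
    try:
        ET.fromstring(xml_bytes)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    print(para_texts(DOCUMENT))
    print(is_well_formed(b"<para>missing closing tag"))
```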
Example 4.3 provides another XML document example.

Example 4.3 XML Document on Customer Information

    <?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
    <?xml-stylesheet type="text/css" href="/style/design"?>
    <customer_list>
      <customer>
        <lastname> Behl </lastname>
        <firstname> Akhil </firstname>
        <location> AU </location>
      </customer>
      <customer>
        <lastname> G S </lastname>
        <firstname> Siva </firstname>
        <location> UK </location>
      </customer>
    </customer_list>

As you can see in Examples 4.2 and 4.3, XML does not have predefined tags as HTML does; tags can vary from one XML document to another. XML tags identify, arrange, and store the data rather than specify how to display it (whereas HTML tags control how the data is shown).

Hypertext Markup Language (HTML)

Hypertext refers to text that contains links to other text, and markup refers to annotating content so that it is structured in a particular format. HTML is the code used to define the structure of a web page and its content. It is a markup language rather than a scripting language, and web browsers use it to render pages on the WWW. Hypertext permits a user to click a link and be taken to the referenced page automatically.

It is important to remember that HTML is a presentation language, whereas XML is a data-description language. Basically, HTML defines the way a user sees a web page, whereas XML defines the way data is stored and transmitted between a server and a client or between different systems.

HTML is the standard markup language for web pages. HTML files have the extension .html or .htm. HTML comprises a set of elements that tell a browser how to show content. In HTML, content is structured using elements such as paragraphs, data tables, multimedia, and bulleted lists. Let’s take a look at an example of HTML. Example 4.4 shows the structure of a standard HTML document.
Example 4.4 HTML Document Structure

    <html>
      <head>
        <title> Title of the page</title>
      </head>
      <body>
        <h1> My First heading </h1>
        <p> My First paragraph. </p>
      </body>
    </html>
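As a rough illustration of how software reads the tags in Example 4.4, the sketch below walks the same markup with Python's standard-library html.parser module and records each opening tag and each piece of visible text. The TagCollector class and the collect helper are hypothetical names; a browser does far more than this when it renders a page.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record opening tags and non-empty text in the order they appear."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)

    def handle_data(self, data):
        # Skip the whitespace that sits between tags.
        if data.strip():
            self.text.append(data.strip())


# The markup of Example 4.4.
PAGE = (
    "<html> <head> <title> Title of the page</title> </head> "
    "<body> <h1> My First heading </h1> <p> My First paragraph. </p> </body> </html>"
)


def collect(page):
    """Return (opening tags, text pieces) found in the HTML page."""
    parser = TagCollector()
    parser.feed(page)
    parser.close()
    return parser.start_tags, parser.text


if __name__ == "__main__":
    tags, text = collect(PAGE)
    print(tags)
    print(text)
```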
In Example 4.4:
- <html> indicates the root element of the HTML page.
- <head> contains the metadata of the HTML page.
- <title> denotes the title of the HTML page.
- <body> denotes the body of the document, which holds the visible content.
- <h1> indicates a heading on the web page.
- <p> indicates a paragraph.

Much like XML documents, HTML documents are built from tags, and most opening tags (for example, <html>) have a matching closing tag (for example, </html>). Unlike in XML, however, the tags in HTML are predefined.

Figure: HTML Content in a Text File

Now you can save this file with the extension .html, as shown in Figure 4.16.

Figure: Saving the HTML File

If you now launch this page in a browser, you can see the effect the HTML tags have on the content of the file (see Figure 4.17).

Figure: HTML Page Viewed in a Browser

As shown in Figure 4.17, which uses Google Chrome (though you can use any web browser for this code, so go ahead and try it for yourself!), the browser does not display the HTML tags. Rather, it uses them to determine how to display the document’s content.

TABLE: Comparing HTML, XML, and JSON