First, let's cover the different kinds of structures data is stored in, and how it's stored to facilitate analytics and other purposes. We'll call this analytic data architecture. A database, or a database management system, or DBMS, is a software application for storing, organizing, indexing, processing, and querying data. Some are general purpose, while others specialize in particular types of data or analytics, and some rely on specialized hardware. Data is typically organized into collections called tables, with records consisting of fields. It's very similar to your Excel workbooks or Google Sheets, but built to interface directly with business applications and able to handle a high volume and velocity of data. Relational databases, or RDBMSs, store data in rows and columns, while non-relational databases use more sophisticated, distributed, and often higher-performance methods.

A data set is a somewhat generic term for a collection of related records that could be in a database, a spreadsheet, or just a document. For instance, a data set of your customer contact information or a data set of labor statistics. An extract is a data set that's copied from a database. Extracts are generally used to make a duplicate of the data somewhere else, and they're usually temporary. A staging area is where extracts are collected, then integrated and imported into a data structure like a data warehouse or a data lake, which we'll get to in a moment.

A data warehouse is a special data structure, or schema, or data model, architected to integrate data from multiple source systems for high-performance data querying and analysis. Usually, general-purpose but specifically tuned DBMSs are used. Data in the data warehouse is updated periodically, often hourly or daily. This means that any piece of data may not be current, but as a whole the data is sufficient for most kinds of aggregate analytics. Data warehouses have four key characteristics. First, they're subject oriented. That is, data is organized not by source system but by subject area. For example, all data about customers, be it financial data, sales data, or support data, is linked together to provide a complete picture of any customer. Second, they're historical. Data includes not just current data but also historical data, like what was the customer's old address, what was their previous mortgage, and even who used to be our customer. Third, data warehouses have time-variant data. That is, the data is captured and organized into time slices, so you get a complete picture of the business at any point in history. Fourth, they're non-volatile. Data in a data warehouse is not typically overwritten. When data in the source system is updated, the new values are included in the data warehouse without overwriting the old ones, which of course is what enables time variance and historical data. We'll see a small sketch of that idea in a moment.

Now, a data mart is a kind of mini data warehouse that's specific to a particular subject, like customer data or financial data, or to a particular department or business unit. Data marts are sometimes extracted from data warehouses, and often they have a different kind of schema or structure called a multidimensional schema, which makes navigation and querying easier. An operational data store, or ODS, is a data warehouse that integrates data from multiple systems on a more frequent, real-time or near real-time basis. But often it doesn't include the history or time slices, and data is overwritten, so only the current data exists.
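To make the data warehouse's time-variant and non-volatile characteristics concrete, here's a minimal sketch using Python's built-in sqlite3 module. The customer_history table, its columns, and the dates are all hypothetical illustrations, not a prescribed design; the point is simply that changes add new rows with effective dates rather than overwriting old values.

```python
import sqlite3

# Minimal sketch: a time-variant, non-volatile customer table.
# Instead of overwriting a customer's address, each change gets a new
# row with effective dates, so history is preserved.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer_history (
        customer_id INTEGER,
        address     TEXT,
        valid_from  TEXT,   -- date this version became current
        valid_to    TEXT    -- NULL means "still current"
    )
""")

# Customer 42 moves: the old row is closed out, a new row is added.
conn.execute("INSERT INTO customer_history VALUES (42, '12 Oak St',   '2021-01-01', '2023-06-30')")
conn.execute("INSERT INTO customer_history VALUES (42, '99 Pine Ave', '2023-07-01', NULL)")

# Current picture of the customer...
current = conn.execute(
    "SELECT address FROM customer_history WHERE customer_id = 42 AND valid_to IS NULL"
).fetchone()

# ...and the picture as of any point in time (time variance).
as_of_2022 = conn.execute(
    """SELECT address FROM customer_history
       WHERE customer_id = 42
         AND valid_from <= '2022-05-01'
         AND (valid_to IS NULL OR valid_to >= '2022-05-01')"""
).fetchone()

print(current[0], as_of_2022[0])   # 99 Pine Ave  12 Oak St
```

Because nothing is deleted or overwritten, the same table answers both "what is true now?" and "what was true then?", which is exactly the behavior the four characteristics describe.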
An ODS is used for analyzing or looking up current data in a way that minimizes the disruption to, and complexity of, analyzing data from multiple source systems.

Then there are data lakes. A data lake is the latest craze in analytic data structures. Whereas the data warehouse, data mart, and operational data store typically impose a strict and inflexible structure, or schema, on the data when it's stored, a data lake typically allows data to be stored pretty much as is, with structures being designed and applied at the time the data is queried. Now, this puts an extra onus on business analysts and data scientists, but it enables incredible flexibility and expedience for including new kinds of data and doing new kinds of analyses. Most data lakes don't use traditional database products but rather ones that are purpose-built for distributed, high-performance, programmatic analytics.

In many large, complex businesses, the analytic architecture takes on more of a hybrid form, with a combination of the data structures we just discussed. This is often referred to as a logical data warehouse. A virtual data warehouse is when what's called a view is created that makes it look to a data analyst as though he or she is querying a single data warehouse, when in fact the query is being distributed across multiple source systems. Special-purpose software can enable this, and virtual data warehouses can be created very expediently, but they're usually no complete substitute for an actual, physical data warehouse, for a variety of reasons.

Sometimes you'll hear people talk about a workspace or a sandbox. It's a special area where data scientists or advanced business analysts play with extracts of data, often samples rather than entire data sets. They do this to test hypotheses and develop algorithms. Think of it as an R&D lab for data. You'll probably also hear the term archive. An archive is a slower, cheaper form of database and storage where seldom-used data is kept so it won't affect the performance of online data. Of course, then there's the backup, which is an exact copy of a database, usually updated continuously. Backups are important for maintaining business continuity if, for example, the main database breaks, the data center goes dark due to fire or some other disaster, or data is inadvertently or intentionally damaged or hijacked. Backups are usually maintained off premises, as opposed to on premises, for obvious reasons. Backups can be online, so they can be accessed or queried, or they can be offline, that is, not available for direct access, only for restoring data.

I mentioned schemas and data models earlier. Some of the terms you might hear are cubes, star schemas, or snowflake schemas. These are merely special ways to organize data for ease of navigation and querying. They involve representing data in a multidimensional data model, consisting of separate entities or tables containing what are called facts and dimensions. Another type of schema you might hear about is the infrequently used but powerful hybrid schema called the data vault. Multidimensional databases are those that automatically create or impose multidimensional schemas, organizing data into what are called cubes.
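To picture facts and dimensions, here's a minimal star-schema sketch, again using Python's sqlite3 module. The table and column names are hypothetical illustrations: one fact table holds the measurable events, and the dimension tables around it hold the who, what, and when used to slice those facts.

```python
import sqlite3

# Minimal star-schema sketch: a central fact table surrounded by dimensions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT, region TEXT);
    CREATE TABLE dim_product  (product_key  INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_date     (date_key     INTEGER PRIMARY KEY, full_date TEXT, year INTEGER, month INTEGER);

    -- Each fact row points at one row in each dimension and carries numeric measures.
    CREATE TABLE fact_sales (
        customer_key INTEGER REFERENCES dim_customer(customer_key),
        product_key  INTEGER REFERENCES dim_product(product_key),
        date_key     INTEGER REFERENCES dim_date(date_key),
        quantity     INTEGER,
        amount       REAL
    );

    INSERT INTO dim_customer VALUES (1, 'Acme Corp', 'West'), (2, 'Bolt Ltd', 'East');
    INSERT INTO dim_product  VALUES (10, 'Widget', 'Hardware');
    INSERT INTO dim_date     VALUES (20240115, '2024-01-15', 2024, 1);
    INSERT INTO fact_sales   VALUES (1, 10, 20240115, 3, 29.97), (2, 10, 20240115, 1, 9.99);
""")

# A typical analytic query: total sales by region and month,
# slicing the facts along the customer and date dimensions.
rows = conn.execute("""
    SELECT c.region, d.year, d.month, SUM(f.amount) AS total_sales
    FROM fact_sales f
    JOIN dim_customer c ON c.customer_key = f.customer_key
    JOIN dim_date     d ON d.date_key     = f.date_key
    GROUP BY c.region, d.year, d.month
""").fetchall()
print(rows)   # e.g. [('East', 2024, 1, 9.99), ('West', 2024, 1, 29.97)]
```

A snowflake schema simply normalizes those dimension tables further, and a cube is the same facts-by-dimensions idea materialized for fast slicing and dicing.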
Now, most databases are best at handling structured data like numbers, codes, and short segments of text. But special systems exist for storing, searching, and managing documents, with special features for versioning, workflow, and collaboration, and other systems are purpose-built for multimedia content like images and video.

You've probably also heard people talk about blockchain. A blockchain is a distributed, decentralized public ledger of transactions or other records. Originally designed for cryptocurrencies, blockchains are finding other emerging uses, such as smart contracts, where immutable, secure, trusted, yet low-volume transaction recording is required.

Now, let's examine how we get data from the source, or operational, systems into these analytic databases and data structures. There's a class of special software applications that one programs to extract, transform, and load data into data warehouses or other analytic structures, called ETL software for obvious reasons; we'll sketch that pattern in a moment. The applications then run continuously or periodically. Sometimes the applications extract, load, and then transform the data, in which case you'll hear them referred to as ELT. An application programming interface, or API, is a defined method provided by business application developers to access an application's functionality or its data without having to query the underlying database directly. APIs help maintain the integrity of extracted data when it must be pulled from multiple tables within the application's database. To expose an application's functionality, such as data access, over the web, a special API called a web service might be used. Then there are what are called data preparation, or self-service data prep, tools. These are a class of simple-to-use technology for business users, rather than programmers, to help them identify, pull together, transform, and integrate data from across the organization.
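Here's the promised ETL sketch, written with only Python's standard library. The source file name, its columns, and the dw_orders target table are hypothetical stand-ins; real ETL tools add scheduling, monitoring, and error handling around the same basic extract-transform-load loop.

```python
import csv
import sqlite3

# Minimal ETL sketch: pull rows from a source extract file, clean them up,
# and load them into a warehouse-style table.
def run_etl(source_csv: str, conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dw_orders (order_id TEXT, customer TEXT, amount_usd REAL)"
    )

    with open(source_csv, newline="") as f:
        for row in csv.DictReader(f):              # Extract: read each source record
            cleaned = (                             # Transform: trim, standardize, convert types
                row["order_id"].strip(),
                row["customer"].strip().title(),
                round(float(row["amount"]), 2),
            )
            conn.execute("INSERT INTO dw_orders VALUES (?, ?, ?)", cleaned)  # Load

    conn.commit()

# Usage (hypothetical file and database names):
# run_etl("daily_orders.csv", sqlite3.connect("warehouse.db"))
```

An ELT flow would simply swap the last two steps: land the raw rows in the target first, then run the cleanup and integration inside the analytic database itself.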