Have you ever heard someone say, “I need access to all of our data for my analysis, including historical data going back at least 10 years”?
The reality is that this request is becoming increasingly common across organizations. Organizations face new data challenges as companies that use their data effectively consistently outperform those that don’t. For example, marketing departments suddenly want a 360-degree customer view: a full picture of every touchpoint a customer has with the organization. In the real world, customer data is everywhere (websites, customer service, sales, marketing) and spans years of history, which makes a 360-degree customer view challenging to provide. The sales department moves away from commission-based pay, and executives want to see the impact of that change on the organization’s total sales over time, which requires multiple years of historical data. Customer service departments want to extract value from call center transcripts without manually reading each one, or from customer-uploaded photos of damaged products without manually sifting through each photo individually.
Just 10 years ago, each of these analyses may have been out of reach for the average organization. Modern organizations have data spread across multiple platforms, such as Customer Relationship Management (CRM) systems, Enterprise Resource Planning (ERP) systems, and cloud-based applications like Salesforce, Workday, or Google Analytics for corporate websites. Each of these applications stores its data in a completely different back-end database, making it difficult to aggregate and analyze all of that information collectively. Many organizations work in silos, analyzing the data from Salesforce alone or the data from the CRM alone, but the true value of data comes from analyzing an organization’s entire collection together. Today, we have the power to aggregate all of an organization’s data, including photos, PDFs, emails, and other unconventional forms of data, in a highly cost-effective way: a data lake.
Data lakes are designed to contain all of an organization’s data, including the data needed for in-depth analysis or data science work. Truly data-driven organizations don’t limit the amount or type of data they collect, store, and retain, which means the total volume can be enormous. They store social media data, email text, written documents, video content, photo images, and more: all formats that were previously expensive to store because of their file sizes.
Data lakes are highly cost-effective, cloud-based data stores that hold huge volumes of raw data in formats that were previously difficult (and sometimes impossible) to store in standard transactional databases.
Data lakes are designed to be accessed by sophisticated users, such as database engineers, DBAs, software engineers, technical business analysts, or data scientists. They’re meant to work in conjunction with a data warehouse and solve a different business need than a data warehouse does, a distinction we’ll explore in a future blog post.
While data lakes solve the data access issues facing many modern organizations, they do require some expertise to construct. Careful attention must be paid to how the data are stored and organized within a data lake so that the organization can query it in a timely way. Haphazardly throwing data into a data lake results in what’s known as a “data swamp”: a disorganized collection of data where even simple extractions can take hours upon hours. Built correctly, however, a data lake will give your organization the next-level, timely data access it needs to compete with even the most data-driven organizations in its industry.
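To make that organization concrete, here is a minimal sketch of one common technique for keeping a lake queryable: partitioning raw data by date as it lands, so that later queries can skip irrelevant files instead of scanning everything. The bucket path, column names, and sample records below are hypothetical, and real lakes would apply the same idea at far larger scale.

```python
# A minimal sketch: landing raw event data in a lake as Parquet files,
# partitioned by year/month so queries can prune irrelevant partitions.
# The paths, schema, and sample rows here are hypothetical.
import pandas as pd

# Raw events as they might arrive from an upstream source.
events = pd.DataFrame({
    "event_id": [101, 102, 103],
    "customer_id": ["C-001", "C-002", "C-001"],
    "channel": ["web", "call_center", "email"],
    "event_ts": pd.to_datetime([
        "2023-01-15 09:30:00",
        "2023-01-16 14:05:00",
        "2023-02-01 11:20:00",
    ]),
})

# Derive partition columns from the event timestamp.
events["year"] = events["event_ts"].dt.year
events["month"] = events["event_ts"].dt.month

# Write Hive-style partitions (e.g., .../year=2023/month=1/...).
# In the cloud, the local path would be replaced with something like
# s3://your-data-lake/raw/customer_events (bucket name hypothetical).
events.to_parquet(
    "data_lake/raw/customer_events",
    partition_cols=["year", "month"],
    index=False,
)
```

With a layout like this, a query for last month’s events reads only one partition rather than the whole lake, which is exactly the difference between a well-organized data lake and a data swamp.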