Data is getting more attention from both technical and business people. More organizations are recognizing the opportunities data creates and the impact it has on their operations, and they are investing in data management and analytics tools.
The amount of data being created keeps rising, and the need for efficient tools to ingest, process, and query it has given rise to many software solutions. Some of these technologies are relatively new, and industry standards have yet to be established, so it is not uncommon to see different organizations using very different technologies.
The exponential growth in data volume has led to the term Big Data. Definitions of the term vary, but generally speaking, data that is very large and still growing in size will pass as Big Data. This kind of data requires very powerful technologies: its ingestion, storage, and analysis cannot be handled by common software.
BigData Boutique understands the challenges that come with massive amounts of data. Whether you plan to design a data platform from scratch or to redesign and upgrade your existing data architecture, you can be sure that skilled engineers with experience in the most powerful and efficient technologies in the data stack will deliver the best solutions for you.
To understand Big Data and the kind of technologies that can handle it, a good place to start is understanding Data Platforms. Data warehouses and data lakes are data storage architectures that are often mixed up. So what are these “Data Platforms” about?
What Is A Data Platform?
The term “Data Platform” is used here on purpose. Whether it is a data lake, a data warehouse, or a lakehouse, the architecture is similar when properly implemented. Hence, we will refer to all of them as Data Platforms.
Data processing is an essential aspect of Data Platforms. It comes in two flavors: online analytical processing (OLAP) and online transaction processing (OLTP). Their difference is hinted at by their names and becomes obvious in practice.
An OLAP system is designed to support business decisions, data mining, and complex business calculations. An OLTP system is designed to support real-time data transactions, which are usually numerous yet individually simple.
OLTP data is typically up-to-date, real-time data that later becomes the historical data OLAP systems work with. While OLTP systems make processes easy for end users, OLAP systems help organizations make data-driven decisions.
What does a typical Data Platform look like today?
Data Source
Data flowing into the system comes in varying forms depending on the kind of source. The source repository could be in the cloud or on-premises, and the data could follow a specific schema or be unstructured.
A Queue System
In a literal queue, movement is first-in, first-out. Data sources can be many, and a queue system decouples them from downstream consumers, making the ETL process efficient; it is the backbone of the system. Kafka, Pulsar, AWS Kinesis, and Google Pub/Sub are great examples.
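To make the queue's role concrete, here is a minimal sketch of publishing an event to a Kafka topic with the kafka-python client; the broker address, topic name, and event fields are placeholders, not part of any particular platform.

```python
# Minimal producer sketch (kafka-python); broker and topic are placeholders.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda record: json.dumps(record).encode("utf-8"),
)

# Each event is appended to the topic; downstream consumers read it in
# order (first-in, first-out within a partition).
producer.send("raw-events", {"user_id": 42, "action": "page_view"})
producer.flush()
```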
Stream Analytics
In streaming, data is processed continuously, as it arrives. It would not be wrong to say that stream analytics is essential for data in motion.
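As a rough illustration of data in motion, the sketch below uses Spark Structured Streaming to count events per minute as they arrive from a Kafka topic. The broker address, topic name, and console sink are assumptions made for the example.

```python
# Continuous, windowed aggregation over a stream of events (PySpark sketch).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.appName("stream-analytics").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "raw-events")
    .load()
)

# Count events in one-minute windows, continuously, as data arrives.
counts = events.groupBy(window(col("timestamp"), "1 minute")).count()

query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```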
ETL/ELT Engines
ETL (Extract-Transform-Load) and ELT (Extract-Load-Transform) define how data moves from queue to storage, queue to queue, storage to storage, and so on. While the difference between the two looks like a swap of two letters, the practical implications are more significant. In both ETL and ELT, data is extracted from the source in whatever form it exists. In ETL, the data is transformed into the required format before it is loaded into its destination, often a table. In ELT, the data is loaded first and transformed afterwards. The system's requirements determine which engine to use.
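The contrast is easier to see in code. Below is a rough PySpark sketch of the two flows; the bucket paths, column names, and destinations are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date

spark = SparkSession.builder.appName("etl-vs-elt").getOrCreate()

raw = spark.read.json("s3://example-bucket/raw/orders/")        # Extract

# ETL: transform first, then load only the cleaned result.
cleaned = (
    raw.dropDuplicates(["order_id"])                             # Transform
       .withColumn("order_date", to_date(col("created_at")))
       .filter(col("amount") > 0)
)
cleaned.write.mode("append").parquet(                            # Load
    "s3://example-bucket/warehouse/orders/"
)

# ELT: load the raw data as-is, transform later where it lands.
raw.write.mode("append").parquet("s3://example-bucket/staging/orders/")  # Load
spark.read.parquet("s3://example-bucket/staging/orders/") \
     .createOrReplaceTempView("staging_orders")
spark.sql("""
    SELECT DISTINCT order_id, CAST(created_at AS DATE) AS order_date, amount
    FROM staging_orders
    WHERE amount > 0
""").write.mode("overwrite").parquet(                            # Transform
    "s3://example-bucket/warehouse/orders_elt/"
)
```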
Storage
There are numerous data storage options available: S3, Ceph, HDFS in older systems, Google Cloud Storage, etc. Data storage ensures analytics and further processing can be done without disrupting the system or losing data.
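As a minimal illustration (the bucket and key are placeholders), landing a raw file in object storage can be as simple as an upload with boto3:

```python
import boto3

# Push a raw data file to S3 so downstream jobs can read it without
# touching the source system. Bucket and key are placeholders.
s3 = boto3.client("s3")
s3.upload_file(
    Filename="events-2024-01-01.json",
    Bucket="example-data-lake",
    Key="raw/events/dt=2024-01-01/events.json",
)
```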
Batch Processing
Batch processing refers to processing data in groups. Where data storage exists, we can run batch jobs on the available data, and this is also where machine learning models come in: the data a model needs is read from storage in large batches.
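A typical batch job reads what has already landed in storage and produces aggregates or a training set. A rough PySpark sketch, with illustrative paths and column names:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-batch").getOrCreate()

# Read a full day of data at rest and compute per-user metrics that a
# downstream ML model or report can consume.
events = spark.read.parquet("s3://example-data-lake/events/dt=2024-01-01/")
daily_features = (
    events.groupBy("user_id")
          .agg(F.count("*").alias("event_count"),
               F.sum("amount").alias("total_amount"))
)
daily_features.write.mode("overwrite").parquet(
    "s3://example-data-lake/features/dt=2024-01-01/"
)
```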
Analytics & Reporting
What use is data if we cannot understand it and draw insights from it? We cannot do this by looking at data sitting in queues or storage. After ETL jobs have run, we should be able to get a high-level view of the data in the form of dashboards and interactive query tools. In a well-designed data platform, these interfaces are updated periodically to reflect new data. While the data will usually not be available in real time, batch processing ensures that as much data as possible is provided routinely.
As more powerful technologies are designed, the line between jobs you run in batch processing, when data is at rest, and stream processing with engines like Flink, Spark Streaming, and Kafka Streams is getting thinner. Flink, for example, is designed to let you run the same job as both a streaming job and a batch job.
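The same idea can be sketched with Spark Structured Streaming: one transformation function applied to both a batch DataFrame and a streaming DataFrame, with only the read and write calls changing. Paths and column names below are placeholders.

```python
from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.builder.appName("unified-batch-stream").getOrCreate()

def count_by_action(df: DataFrame) -> DataFrame:
    # The business logic is identical whether the input is bounded or not.
    return df.groupBy("action").count()

# Batch: data at rest.
batch_df = spark.read.parquet("s3://example-data-lake/events/")
count_by_action(batch_df).write.mode("overwrite").parquet(
    "s3://example-data-lake/reports/action_counts/"
)

# Streaming: the same function over data in motion.
stream_df = spark.readStream.schema(batch_df.schema).parquet(
    "s3://example-data-lake/events/"
)
(count_by_action(stream_df)
    .writeStream.outputMode("complete")
    .format("memory")
    .queryName("action_counts_live")
    .start())
```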
While the typical architecture looks simple, each point in the illustration comes with its own challenges: data collection, data cleanup, data quality, query and storage optimization, running ETL jobs, and more. Each of these deserves a blog post of its own, given its complexity, its relevance to the overall performance of the data platform, and its impact on the others.
Does Storage Impact Querying Data?
Querying data is a task that data professionals of all kinds come across. The larger the data, the more important it is for queries to be optimized for speed and efficiency in delivering results.
What if the data storage pattern impacts query efficiency?
Hive is a name you cannot miss in data platform discussions. The Hive table format is very popular for its data storage pattern, and some more sophisticated storage formats build on it.
The key idea in the Hive format is how the data is organized. Data is laid out in a directory-like structure, one directory per partition, so a query only reads the files in the partitions relevant to it and ignores everything else. This is known as partition pruning, and it makes querying very efficient because the amount of data to be traversed is significantly reduced.
A very common approach is to partition based on dates and the time of day: for example, one partition per day, each further partitioned by hour of the day.
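A rough sketch of what that layout, and the pruning it enables, looks like with partitioned Parquet data in Spark; the paths and partition columns are illustrative, and the raw data is assumed to already carry dt and hour columns.

```python
# Hive-style layout on disk, one directory per partition value:
#   s3://example-data-lake/events/dt=2024-01-01/hour=00/part-0000.parquet
#   s3://example-data-lake/events/dt=2024-01-01/hour=01/part-0000.parquet
#   s3://example-data-lake/events/dt=2024-01-02/hour=00/part-0000.parquet
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("partition-pruning").getOrCreate()

# Writing with partitionBy creates the directory tree shown above.
events = spark.read.json("s3://example-data-lake/raw/events/")
events.write.partitionBy("dt", "hour").parquet("s3://example-data-lake/events/")

# Filtering on the partition columns lets the engine skip every directory
# outside dt=2024-01-01 / hour=9: that is partition pruning.
morning = (
    spark.read.parquet("s3://example-data-lake/events/")
         .filter((F.col("dt") == "2024-01-01") & (F.col("hour") == 9))
)
morning.count()
```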
However, regardless of its benefits, the Hive format is not flawless. Here are some challenges with the Hive format:
- As data size increases, the amount of work required for storage and partitioning significantly increases. Identifying files and partitions becomes even harder. The Hive Metastore (HMS) is one of the ways to fix this problem.
- Storage systems track files.
While partitions are created to make querying faster, many file systems still need to list and track the files in each partition. Others, such as S3 and Google Cloud Storage, have largely solved this problem.
- Some file formats (and even file sizes) are not optimized for query operations. Efficient queries require files to be laid out on disk in an optimal format and at an optimal size.
- State is kept in both the metastore and the filesystem. This requires refreshes, and the two sometimes go out of sync.
- No room for atomic changes.
Syncing the metastore and filesystem is hard to do atomically.
- Getting the partition layout right can be challenging.
Changing an already defined partitioning scheme is costly, since files are already distributed on disk according to their partition key. This is part of why date and time, which rarely need to change, are such popular partition keys.
- There can be only one hierarchy of partitions.
There is no way to have multiple leading partition columns. For example, if you partition a table by day and hour, the hierarchy has to be day then hour or hour then day; the two cannot both lead.
Requirements for a Modern Data Platform
In designing a modern data platform, there are specific requirements the system must meet; a functional and efficient data platform cannot do without them.
- Support for UPDATE and DELETE of records.
- This is usually required for compliance with regulations like GDPR, for anonymization, and for fixing faulty data-write operations.
- Because columnar file formats are immutable, an UPDATE or DELETE forces entire files, and often entire partitions, to be rewritten (a sketch at the end of this post illustrates this).
- There is no ACIDity. This limitation comes from both the storage formats and the storage systems: data cannot reliably be queried while an update procedure is in progress, making results unpredictable.
- Unlike threaded programs, where multiple operations run in parallel, several update operations here cannot safely run in parallel; the setup is effectively not thread-safe.
- Support for late arrivals
- Late-arriving data is effectively an update: records must be allowed to land in a partition long after that partition was originally created.
- Avoid duplicate records
- Duplicate records can significantly impact the outcome of analytics and BI activities.
- Removing duplicates on the data warehouse level might be more efficient than deduping while running the ETL jobs.
- Appropriate Data Partitions to prune data and support faster data querying
- Partitions are usually hierarchical and costly to change. Hence, they must be well thought out before implementation.
- Partitions are mostly defined early in the system design before data challenges are encountered.
- Changing a partitioning scheme is costly today, but in a modern platform it should be possible and not prohibitively expensive.
- Both streaming and batch use cases should be supported.
- As mentioned earlier, some frameworks can support both stream and batch analytics within the same engine.
- Schema Evolution
- As new data is added, schemas need to evolve to accommodate data changes.
- There must be support for addition, removal, or other operations on columns.
- However, the schema is saved within the ORC/Parquet files themselves, so updating it means rewriting all files.
- Most schema-evolution operations are unsafe because of their impact on the rest of the system; they can break queries and existing code.
- Data Validation and Quality Controls
- This is a crucial requirement for protecting the system from data errors, crashes, and bugs.
- Such controls could include enforcing a cutoff timestamp that no record may be earlier than, regex constraints on field formats, validating that PII does not get stored in fields not meant to hold PII, etc. (a sketch at the end of this post shows a few such checks).
- Point-in-time view of Data
- It should be possible to query or view deleted data or view existing data the way it was before changes were applied to it.
- Support for Derived Tables and Views
- Data aggregation and merges can increase the size of data.
- It should be possible to save joins and costly aggregations in separate tables of their own or in materialized views, so they can be queried efficiently.
- Use of robust Cloud support
- Multicloud or a Federated Cloud is more of a nice-to-have feature. It makes it easier to manage demand spikes.
- For reasons of compliance, redundancy, and efficiency, a multicloud or federated cloud setup is a good option.
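To make the rewrite cost mentioned under the UPDATE and DELETE requirement concrete, here is a minimal sketch of a GDPR-style delete against plain Parquet files. Without a table format that supports row-level deletes, the whole partition has to be read, filtered, and written again; the paths and the user_id column are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("gdpr-delete").getOrCreate()

source = "s3://example-data-lake/events/dt=2024-01-01/"
rewrite = "s3://example-data-lake/events_rewrite/dt=2024-01-01/"

# Parquet files are immutable, so removing one user's records means
# reading the whole partition, filtering the rows out, and rewriting it.
events = spark.read.parquet(source)
events.filter(F.col("user_id") != "user-to-forget").write.parquet(rewrite)

# The rewritten partition must then be swapped in place of the original.
# That swap is a separate, non-atomic step, which is exactly the missing
# "ACIDity" the requirements above call out.
```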
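To illustrate the kind of validation and quality controls listed above, here is a minimal sketch of a few checks run before loading a batch; the cutoff date, the regex patterns, and the column names are assumptions made for the example.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("quality-checks").getOrCreate()
batch = spark.read.parquet("s3://example-data-lake/staging/orders/")

# Reject records older than an agreed cutoff timestamp.
too_old = batch.filter(F.col("event_time") < "2020-01-01").count()

# Enforce a simple format constraint with a regex.
bad_ids = batch.filter(~F.col("order_id").rlike(r"^ORD-\d{8}$")).count()

# Refuse to load values that look like emails into a non-PII column.
leaked_pii = batch.filter(F.col("free_text").rlike(r"[^@\s]+@[^@\s]+")).count()

if too_old or bad_ids or leaked_pii:
    raise ValueError(
        f"Validation failed: {too_old} stale records, {bad_ids} malformed ids, "
        f"{leaked_pii} possible PII leaks"
    )
```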