Getting Started On Data Platforms


What Is A Data Platform?

Does Storage Impact Querying Data?

Querying data is a task that data professionals of all kinds come across. The larger the data, the more important it is for queries to be optimized for speed and efficiency in delivering results.

What if the data storage pattern impacts query efficiency?

Hive comes up in almost every discussion of data platforms. The Hive table format is very popular for its data storage pattern, and some more complex storage formats build on it.

The key idea in the Hive format is how the data is organized. Data is laid out in a directory-like structure keyed by partition, so a query reads only the files in the relevant partitions and ignores the rest. This is known as partition pruning. It makes querying very efficient because the amount of data to be traversed is significantly reduced.
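Partition pruning can be illustrated with a minimal sketch. It assumes Hive-style `key=value` directory names; the file listing and the helper names `partition_values` and `prune` are hypothetical, and a real engine would resolve partitions through its catalog rather than by parsing strings.

```python
from pathlib import PurePosixPath

# Hypothetical file listing for a table partitioned by `dt`.
FILES = [
    "events/dt=2024-01-01/part-0000.parquet",
    "events/dt=2024-01-01/part-0001.parquet",
    "events/dt=2024-01-02/part-0000.parquet",
    "events/dt=2024-01-03/part-0000.parquet",
]


def partition_values(path: str) -> dict:
    """Extract key=value pairs from the directory components of a path."""
    values = {}
    for component in PurePosixPath(path).parent.parts:
        if "=" in component:
            key, _, value = component.partition("=")
            values[key] = value
    return values


def prune(files: list, **wanted: str) -> list:
    """Keep only files whose partition values match every requested filter."""
    return [
        f for f in files
        if all(partition_values(f).get(k) == v for k, v in wanted.items())
    ]


# Only the files under dt=2024-01-02 are selected; the rest are never opened.
selected = prune(FILES, dt="2024-01-02")
print(selected)
```

The filter is decided from the path alone, which is why no data file outside the matching partition needs to be read.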

A very common approach is to partition by date and time of day: for example, one partition per day, with each day further partitioned by hour.
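As a rough sketch of day-then-hour partitioning, the function below derives a partition directory from an event timestamp. The table name, the `dt` and `hour` key names, and the choice of UTC are assumptions made for illustration.

```python
from datetime import datetime, timezone


def partition_path(table: str, event_time: datetime) -> str:
    """Build a day-then-hour partition directory for an event timestamp."""
    utc = event_time.astimezone(timezone.utc)  # assumed: partitions use UTC
    return f"{table}/dt={utc:%Y-%m-%d}/hour={utc:%H}"


ts = datetime(2024, 1, 2, 5, 30, tzinfo=timezone.utc)
print(partition_path("events", ts))  # events/dt=2024-01-02/hour=05
```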

Despite its benefits, the Hive format is not flawless. Here are some of its challenges:

  • As data size increases, the amount of work required for storage and partitioning significantly increases. Identifying files and partitions becomes even harder. The Hive Metastore (HMS) is one of the ways to fix this problem.
  • Storage systems track files. While partitions are created to make querying faster, some file systems still need to track the files in each partition. Others, such as S3 and Google Cloud Storage, have largely solved this problem.

  • Some file formats (and even file sizes) are not optimized for querying. Efficient queries require files on disk to be in a suitable format.
  • Because both the metastore and the filesystem exist, state is kept in both. This requires refreshes, and the two sometimes go out of sync.
  • No room for atomic changes. Syncing the metastore and filesystem is hard to do atomically.

  • Getting the partition layout right can be challenging. Changing an already defined partition can be costly since files are already distributed based on their partition key. This explains why date and time are popular for partitioning.

  • There can be only one hierarchy of partitions. There is no room for more than one leading partition key. For example, if you partition a table by day and hour, the hierarchy has to be day then hour, or hour then day; both cannot lead.
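The single-hierarchy limitation can be made concrete with a small sketch. It assumes a hypothetical layout of three days with 24 hours each, with the day as the leading key, and counts how many directory listings are needed to locate the matching partitions. The `plan` function is illustrative only, not how any particular engine plans queries.

```python
# Hypothetical layout: 3 days x 24 hours, with the day as the leading key.
LAYOUT = {
    f"dt=2024-01-0{d}": [f"hour={h:02d}" for h in range(24)]
    for d in (1, 2, 3)
}


def plan(day=None, hour=None):
    """Return (directory listings needed, matching partition paths)."""
    listings = 1  # listing the table root to discover the day directories
    matches = []
    for day_dir, hour_dirs in LAYOUT.items():
        if day is not None and day_dir != f"dt={day}":
            continue  # pruned on the leading key without being listed
        listings += 1  # this day directory must be listed to see its hours
        for hour_dir in hour_dirs:
            if hour is None or hour_dir == f"hour={hour}":
                matches.append(f"{day_dir}/{hour_dir}")
    return listings, matches


by_day = plan(day="2024-01-02")
by_hour = plan(hour="05")
print(by_day[0], len(by_day[1]))    # 2 listings, 24 partitions
print(by_hour[0], len(by_hour[1]))  # 4 listings, 3 partitions
```

A filter on the leading key skips whole subtrees, while a filter on the second key alone still has to visit every day directory. With years of daily partitions, that difference grows with the number of days.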


Requirements for a Modern Data Platform

A modern data platform must meet a specific set of requirements. A functional and efficient data platform cannot do without them.

  • Support for UPDATE and DELETE of records.

    • This is usually required to comply with regulations like GDPR. It supports anonymization and fixing the results of faulty data-write operations.
    • Since file formats are columnar, an UPDATE or DELETE means whole files have to be rewritten, even within partitions.
    • There are no ACID guarantees. This limitation comes from both the storage formats and the storage systems. Querying data while an update is in progress is not supported, which makes results unpredictable.
    • Unlike threads, which can run multiple operations in parallel, several update operations cannot run in parallel here, so updates are not safe under concurrency.
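The cost of a DELETE on immutable files can be sketched as follows. A JSON Lines file stands in for a columnar file here, which is a simplification, and the `user_id` field and `delete_records` helper are hypothetical. The point is only that removing one record means writing every remaining record again.

```python
import json
import os
import tempfile


def delete_records(path: str, user_id: str) -> int:
    """Remove one user's rows by rewriting the whole file; returns rows written."""
    with open(path, encoding="utf-8") as src:
        rows = [json.loads(line) for line in src if line.strip()]
    kept = [row for row in rows if row["user_id"] != user_id]

    # Write a full replacement next to the original, then swap it in.
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as dst:
        for row in kept:
            dst.write(json.dumps(row) + "\n")
    os.replace(tmp_path, path)
    return len(kept)


with tempfile.TemporaryDirectory() as tmp:
    data_file = os.path.join(tmp, "part-0000.jsonl")
    with open(data_file, "w", encoding="utf-8") as f:
        for uid in ("u1", "u2", "u3"):
            f.write(json.dumps({"user_id": uid}) + "\n")

    rows_rewritten = delete_records(data_file, "u2")
    with open(data_file, encoding="utf-8") as f:
        remaining = [json.loads(line)["user_id"] for line in f]

print(rows_rewritten, remaining)
```

The swap protects a single file, but nothing in this sketch coordinates the change with a metastore or with other writers, which is the gap the ACID requirement describes.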

  • Support for late arrivals

    • A late arrival is effectively an update, and it can happen long after the partition was originally created.

  • Avoid duplicate records

    • Duplicate records can significantly impact the outcome of analytics and BI activities.
    • Removing duplicates on the data warehouse level might be more efficient than deduping while running the ETL jobs.
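One common way to deduplicate is to keep only the newest version of each record by key. The sketch below assumes hypothetical `id` and `updated_at` fields; it is an in-memory illustration, not a warehouse-level implementation.

```python
def deduplicate(records, key="id", version="updated_at"):
    """Keep the newest record per key, based on the version field."""
    latest = {}
    for record in records:
        k = record[key]
        if k not in latest or record[version] > latest[k][version]:
            latest[k] = record
    return list(latest.values())


records = [
    {"id": 1, "updated_at": 1, "value": "a"},
    {"id": 2, "updated_at": 1, "value": "b"},
    {"id": 1, "updated_at": 3, "value": "c"},
    {"id": 1, "updated_at": 2, "value": "late"},  # arrives late, already superseded
]

unique = deduplicate(records)
print(unique)
```

Comparing on a version field rather than arrival order means a late-arriving but older record does not overwrite a newer one.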

  • Appropriate Data Partitions to prune data and support faster data querying

    • Partitions are usually hierarchical and costly to change. Hence, they must be well thought out before implementation.
    • Partitions are mostly defined early in the system design before data challenges are encountered.
    • Changing a partitioning scheme is costly today. On a modern platform, it should be possible and inexpensive.

  • Both streaming and batch use cases should be supported. Some frameworks can support both stream and batch analytics within the same framework.

  • Schema Evolution

    • As new data is added, schemas need to evolve to accommodate data changes.
    • There must be support for addition, removal, or other operations on columns.
    • Although schema updates have their use cases, the schema is stored within the ORC/Parquet files themselves. This means that updating the schema rewrites all files.
    • Most operations in schema evolution are unsafe. This is due to their impact on the rest of the system. They can break queries and existing code.

  • Data Validation and Quality Controls

    • This is a crucial requirement for protecting the system from data errors, crashes, and bugs.
    • Such controls could include enforcing a minimum timestamp that no record may be earlier than, regex constraints, and validating that PII is not stored in fields that do not accept PII.
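The three kinds of control mentioned above can be sketched as simple record checks. The cutoff date, field names, regular expressions, and the list of PII-enabled fields are all assumptions made for illustration. The email pattern in particular is a rough heuristic, not a complete PII detector.

```python
import re
from datetime import datetime, timezone

MIN_EVENT_TIME = datetime(2020, 1, 1, tzinfo=timezone.utc)  # assumed cutoff
COUNTRY_CODE = re.compile(r"[A-Z]{2}")                      # assumed format rule
EMAIL_LIKE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")        # rough PII heuristic
PII_ALLOWED_FIELDS = {"contact_email"}                      # hypothetical


def validate(record: dict) -> list:
    """Return a list of human-readable validation errors for one record."""
    errors = []
    if record["event_time"] < MIN_EVENT_TIME:
        errors.append("event_time is earlier than the allowed minimum")
    if not COUNTRY_CODE.fullmatch(record["country"]):
        errors.append("country must be a two-letter uppercase code")
    for field, value in record.items():
        if (field not in PII_ALLOWED_FIELDS and isinstance(value, str)
                and EMAIL_LIKE.search(value)):
            errors.append(f"{field} appears to contain an email address")
    return errors


good = {
    "event_time": datetime(2024, 5, 1, tzinfo=timezone.utc),
    "country": "DE",
    "contact_email": "ana@example.com",
    "note": "first purchase",
}
bad = {
    "event_time": datetime(2019, 12, 31, tzinfo=timezone.utc),
    "country": "deu",
    "note": "reach me at ana@example.com",
}

print(validate(good))
print(validate(bad))
```

Returning all errors instead of stopping at the first one makes it easier to quarantine a bad record with a full explanation.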

  • Point-in-time view of Data

    • It should be possible to query or view deleted data or view existing data the way it was before changes were applied to it.
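A point-in-time view can be sketched with a table that keeps every committed version. The `VersionedTable` class below is hypothetical and copies the full table on each commit for simplicity. A real table format would more likely share unchanged files between versions.

```python
class VersionedTable:
    """Toy key-value table that keeps a snapshot for every commit."""

    def __init__(self):
        self._snapshots = [{}]  # version 0 is the empty table

    def commit(self, upserts=None, deletes=()):
        """Apply changes on top of the latest snapshot; return the new version."""
        current = dict(self._snapshots[-1])
        current.update(upserts or {})
        for key in deletes:
            current.pop(key, None)
        self._snapshots.append(current)
        return len(self._snapshots) - 1

    def as_of(self, version):
        """Return the table contents as they were at the given version."""
        return dict(self._snapshots[version])


table = VersionedTable()
v1 = table.commit(upserts={"a": 1, "b": 2})
v2 = table.commit(upserts={"a": 10}, deletes=["b"])

print(table.as_of(v1))  # {'a': 1, 'b': 2}
print(table.as_of(v2))  # {'a': 10}
```

The deleted key `b` and the old value of `a` remain queryable through the earlier version, which is the behavior this requirement asks for.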

  • Support for Derived Tables and Views

    • Data aggregation and merges can increase the size of data.
    • It should be possible to save joins and costly aggregations in separate tables or materialized views, so that they can be queried efficiently.

  • Robust cloud support

    • Multicloud or federated cloud support is more of a nice-to-have feature. It makes it easier to manage demand spikes.
    • For compliance, reduced redundancy, and efficiency, opting for either a multicloud or a federated cloud setup is a good option.
