Data and DataOps Fundamentals

Most projects involve some form of data storage, data processing, and DataOps. For these projects, as with all others, we follow the general guidelines laid out in other sections around security, testing, observability, and CI/CD.

Goal

The goal of this section is to briefly describe how to apply the fundamentals to data-heavy projects, or to the data-heavy portions of a project.

Isolation

Be aware of which isolation level you are using. Even with a database that offers serializability, it is possible that within a transaction or connection you are leveraging a lower isolation level than the database supports. In particular, read uncommitted (or eventual consistency) can have unpredictable side effects and introduce bugs that are difficult to reason about. Eventually consistent systems should be treated as a last resort for achieving your scalability requirements; batching, sharding, and caching are all recommended solutions to increase your scalability. If none of these options are tenable, consider evaluating the "NewSQL" databases, such as CockroachDB or TiDB, before leveraging an option that relies on eventual consistency.
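
For illustration, here is a minimal sketch of explicitly requesting serializable isolation with psycopg2 against PostgreSQL; the connection string, table, and column are placeholders. Many drivers otherwise run at the server default, which on PostgreSQL is read committed.

import psycopg2

# Hypothetical connection string; adjust for your environment.
conn = psycopg2.connect("dbname=app user=app")

# Request serializable isolation explicitly rather than relying on
# the driver or server default (typically read committed).
conn.set_session(isolation_level="SERIALIZABLE")

with conn:  # transaction scope: commit on success, rollback on error
    with conn.cursor() as cur:
        cur.execute("SELECT balance FROM accounts WHERE id = %s", (42,))
        (balance,) = cur.fetchone()
        # ... any reads and writes here run under serializable isolation ...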

There are other isolation and consistency models beyond the four classic ANSI isolation levels. Some of them have subtle differences and can be difficult to compare directly. Snapshot isolation, strict serializability, "read your own writes", monotonic reads, bounded staleness, causal consistency, and linearizability are all terms you can look into to learn more on the subject.

Concurrency Control

Your systems should (almost) always leverage some form of concurrency control, to ensure correctness among competing requests and to prevent data races. The two forms of concurrency control are pessimistic and optimistic.

A pessimistic transaction involves a first request to "lock the data" and a second request to write the data. In between these requests, no other requests touching that data will succeed. See Two-Phase Locking (2PL) for more information; note that this is distinct from Two-Phase Commit (2PC), which is a protocol for coordinating distributed transactions.
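
As an illustration of the pessimistic pattern, the sketch below uses SELECT ... FOR UPDATE through psycopg2, a common single-database form of pessimistic locking; the DSN, table, and columns are assumptions. The row lock is held until the transaction commits, blocking competing writers.

import psycopg2

conn = psycopg2.connect("dbname=app user=app")  # hypothetical DSN

with conn:  # transaction scope: commit on success, rollback on error
    with conn.cursor() as cur:
        # Acquire a row lock; concurrent transactions trying to lock
        # the same row will block until this transaction finishes.
        cur.execute(
            "SELECT quantity FROM inventory WHERE sku = %s FOR UPDATE",
            ("ABC-123",),
        )
        (quantity,) = cur.fetchone()
        cur.execute(
            "UPDATE inventory SET quantity = %s WHERE sku = %s",
            (quantity - 1, "ABC-123"),
        )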

The (more) recommended approach is optimistic concurrency, where a user reads the object at a specific version, and the update succeeds if and only if the object has not changed since that read. This is typically done via the ETag header.
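
At the HTTP layer, this is the standard conditional request flow: read the resource and its ETag, then send the write with an If-Match header and handle 412 Precondition Failed by re-reading and retrying. A minimal sketch with the requests library; the URL and payload are placeholders.

import requests

url = "https://api.example.com/widgets/42"  # placeholder resource

resp = requests.get(url)
resp.raise_for_status()
etag = resp.headers["ETag"]
widget = resp.json()

widget["name"] = "renamed"

# Only apply the update if the resource is still at the version we read.
update = requests.put(url, json=widget, headers={"If-Match": etag})
if update.status_code == 412:  # Precondition Failed: someone else wrote first
    print("Conflict detected; re-read the resource and retry")
else:
    update.raise_for_status()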

A simple way to accomplish this on the database side is to increment a version number on each update, and to make the update conditional on the version the client last read. This can be done in a single statement:

WARNING: the statement below will not behave correctly under read uncommitted isolation or in eventually consistent systems.

-- Pseudo code: adjust table, column, and parameter names as necessary.

UPDATE <table_name>
SET field1 = value1, ..., fieldN = valueN, version = $new_version
WHERE id = $id AND version = $expected_version
-- Zero rows affected means the version check failed:
-- the caller should re-read the row and retry.
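
On the application side, that check is just the statement's affected-row count. Here is a self-contained sketch using Python's built-in sqlite3 module; the documents table is a hypothetical example.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, body TEXT, version INTEGER)")
conn.execute("INSERT INTO documents VALUES (1, 'draft', 1)")

def update_document(conn, doc_id, body, expected_version):
    cur = conn.execute(
        "UPDATE documents SET body = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (body, doc_id, expected_version),
    )
    # rowcount of 0 means the version check failed: a concurrent writer
    # already bumped the version, so the caller must re-read and retry.
    return cur.rowcount == 1

assert update_document(conn, 1, "final", expected_version=1)      # succeeds
assert not update_document(conn, 1, "stale", expected_version=1)  # lost the race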

Data Tiering (Data Quality)

Develop a common understanding of the quality of your datasets so that everyone knows what each dataset can be trusted for, along with its expected use cases and limitations.

A common data quality model is Bronze, Silver, and Gold, often referred to as the medallion architecture.

Divide your data lake into three major areas containing your Bronze (raw data as ingested), Silver (cleansed and validated data), and Gold (curated, consumption-ready data) datasets.

Note: Additional storage areas for malformed data, intermediate (sandbox) data, and libraries/packages/binaries are also useful when designing your storage organization.
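
As one illustration of how data might flow between the tiers, the sketch below promotes a dataset from Bronze to Silver to Gold with pandas; all paths, columns, and rules are assumptions, not a prescribed layout.

import pandas as pd

# Bronze: raw data exactly as ingested, warts and all.
bronze = pd.read_csv("lake/bronze/orders/2024-01-01.csv")  # hypothetical path

# Silver: validated and cleansed; malformed rows are quarantined, not dropped silently.
valid = bronze.dropna(subset=["order_id", "amount"])
valid = valid[valid["amount"] > 0]
quarantined = bronze.drop(valid.index)
valid.to_parquet("lake/silver/orders/2024-01-01.parquet")
quarantined.to_csv("lake/malformed/orders/2024-01-01.csv", index=False)

# Gold: business-level aggregates ready for consumption.
revenue = valid.groupby("customer_id", as_index=False)["amount"].sum()
revenue.to_parquet("lake/gold/revenue_by_customer/2024-01-01.parquet")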

Data Validation

Validate data early in your pipeline
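
For example, a pipeline can fail fast on structural problems before any transformation runs, which is far cheaper than debugging a corrupted downstream table. A minimal sketch using plain pandas checks; the schema and rules are assumptions.

import pandas as pd

EXPECTED_COLUMNS = {"order_id", "customer_id", "amount"}  # assumed schema

def validate(df: pd.DataFrame) -> pd.DataFrame:
    """Fail fast on structural problems before any transformation runs."""
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {sorted(missing)}")
    if df["order_id"].duplicated().any():
        raise ValueError("Duplicate order_id values found")
    if (df["amount"] < 0).any():
        raise ValueError("Negative amounts found")
    return df

df = validate(pd.read_csv("lake/bronze/orders/2024-01-01.csv"))  # hypothetical path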

Idempotent Data Pipelines

Make your data pipelines re-playable and idempotent
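
One common technique is to derive the output location deterministically from the input partition and overwrite it wholesale, so replaying a run produces the same result instead of appended duplicates. A sketch, with hypothetical paths:

import pandas as pd

def run_pipeline(run_date: str) -> None:
    """Re-running for the same run_date overwrites the same partition,
    so replays are safe and produce identical output (idempotent)."""
    source = f"lake/bronze/orders/{run_date}.csv"      # hypothetical layout
    target = f"lake/silver/orders/{run_date}.parquet"  # one partition per day

    df = pd.read_csv(source)
    df = df.drop_duplicates(subset=["order_id"])  # dedupe defensively

    # Full overwrite of the partition: no append, no double counting.
    df.to_parquet(target)

run_pipeline("2024-01-01")
run_pipeline("2024-01-01")  # replay: same input, same output, no duplicates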

Testing

Ensure data transformation code is testable
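
Keeping transformations as pure functions over dataframes (no I/O inside the function) makes them straightforward to unit test without any infrastructure. A sketch in pytest style; the function and columns are illustrative.

import pandas as pd

def add_total(df: pd.DataFrame) -> pd.DataFrame:
    """Pure transformation: no I/O, so it is trivial to unit test."""
    out = df.copy()
    out["total"] = out["quantity"] * out["unit_price"]
    return out

def test_add_total():
    df = pd.DataFrame({"quantity": [2, 3], "unit_price": [5.0, 1.0]})
    result = add_total(df)
    assert result["total"].tolist() == [10.0, 3.0]
    # The input frame must not be mutated by the transformation.
    assert "total" not in df.columns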

CI/CD, Source Control and Code Reviews

Security and Configuration

Observability

Monitor infrastructure, pipelines and data
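
At a minimum, each pipeline run can emit structured metrics such as row counts, durations, and run identifiers so that failures and data drift are visible. A sketch using Python's standard logging module; the metric names are illustrative.

import json
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

def process(rows):
    start = time.monotonic()
    processed = [r for r in rows if r is not None]  # stand-in transformation
    # Structured log line: easy to ship to a metrics/monitoring backend.
    log.info(json.dumps({
        "event": "pipeline_run",
        "rows_in": len(rows),
        "rows_out": len(processed),
        "duration_s": round(time.monotonic() - start, 3),
    }))
    return processed

process([1, None, 3])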

End-to-End and Azure Technology Samples

The DataOps for the Modern Data Warehouse repo contains both end-to-end and technology-specific samples showing how to implement DataOps on Azure.

Image: CI/CD for Data pipelines on Azure (from the DataOps for the Modern Data Warehouse repo)