Setup for development
Installation
- Clone the repository github.com/OpenEnergyPlatform/open-dapro and open it.
-
Install your Dagster code location as a Python package. By using the -e (editable) flag, pip will install your Python package in "editable mode" so that as you develop, local code changes will automatically apply.
-
Next, make sure you have docker and docker-compose installed. You can check this by running:
-
You now need to rename the
.env.template
file to.env
and change your credentials if needed. The.env
file will not be uploaded to git. Note that these credentials have to match the database created with thedevelopment/docker-compose.yml
file. -
To initialize the database and to create the docker container, run:
Check if the database is running on the server and port specified in the.env
file. -
Start the Dagster UI web server:
If the environment variables were loaded successfully, you should see the following line:
-
Open 127.0.0.1:3000 with your browser to see the project. You can start writing your own assets in
energy_dagster/assets.py
. The assets are automatically loaded into the Dagster code location as you define them.
Tools we use
pre-commit hooks
In this project, we use pre-commit hooks to lint the code before committing. The hooks are defined in the .pre-commit-config.yaml
file. To install the hooks, run the following command:
This will install the hooks in your local repository. They will be executed before every commit and check for linting errors using the sqlfluff and black packages.
dbt osmosis
You can use dbt-osmosis
for creating, updating, and deleting dbt property files.
This can be done using the following command:
dbt Style Guide
This style guide is a shortened version of the dbt-labs style guide.
Model Organization
Our models (typically) fit into two main categories:
Category | Description |
---|---|
Staging | Contains models which clean and standardize data |
Marts | Contains models which combine or heavily transform data |
Things to note:
- There are different types of models
that typically exist in each of the above categories.
See Model Layers for more information.
- Read How we structure our dbt projects for an example and more details around organization.
Model Layers
- Only models in
staging
should select from sources - Models not within the
staging
folder should select from refs. -
The following are the DAG stages that we tend to utilize:
dag_stage Typically found in description seed_ /seeds - Indicates a data set created from
dbt seed
.stg_ /models/staging - Indicates a data set that is being cleaned and standardized.
- In absence of a base_ layer, it representthe 1:1 relationship between the source and first layer of models. li>
int_ /models/marts - Indicates a logical step towards creating a final data set.
- Typically used for:
- Breaking ua very large fct_ or dim_ model into smaller pieces to reduce complexity
- Creating a reusable data set to reference in multiple downstream fctand dim_ models
dim_ /models/marts - Flags data which is used to describe an entity.
- Indicates a final data which is robust, versatile, anready for consumption.
base_ /models/staging - Indicates cleaning and standardization on a data set before joining to other data sets in
stg_
models.- Typically used when multiple sources are rarely used independently.
Example:
Location data in our org is seldom used partially, so we want to create one cleaned data set which puts it all together.
Step 1: Models to clean and standardize each data set:- base_location__addresses.sql
- base_location__countries.sql
- base_location__states.sql
Step 2: A model to join all location data as one entity for use in downstream modeling:- stg_location__locations.sql
- Indicates a data set created from
Model File Naming and Coding
-
All objects should be plural.
Example:stg_stripe__invoices.sql
vs.stg_stripe__invoice.sql
-
All objects should have a prefix to indicate their DAG stage in the flow.
See Model Layers for more information. -
All models should use the naming convention
<type/dag_stage>_<source/topic>__<additional_context>
. See this article for more information. - For models in the marts folder
__<additional_context>
is optional. -
Models in the staging folder should use the source's name as the
<source/topic>
and the entity name as theadditional_context
.Examples: - seed_snowflake_spend.csv - base_stripe__invoices.sql - stg_stripe__customers.sql - stg_salesforce__customers.sql - int_customers__unioned.sql - fct_orders.sql
-
Schema, table and column names should be in
snake_case
. -
Limit use of abbreviations that are related to domain knowledge. An onboarding employee will understand
current_order_status
better thancurrent_os
. -
Each model should have a primary key that can identify the unique row, and should be named
<object>_id
, e.g.account_id
– this makes it easier to know whatid
is being referenced in downstream joined models. -
For
base
orstaging
models, columns should be ordered in categories, where identifiers are first and date/time fields are at the end.
Example: -
Date/time columns should be named according to these conventions:
-
Timestamps:
<event>_at
Format: UTC
Example:created_at
-
Dates:
<event>_date
Format: Date
Example:created_date
-
Booleans should be prefixed with
is_
orhas_
.
Example:is_active_customer
andhas_admin_access
-
Avoid using reserved words (such as these for Snowflake) as column names.
-
Consistency is key! Use the same field names across models where possible.
Example: a key to thecustomers
table should be namedcustomer_id
rather thanuser_id
.
Model Configurations
- Model configurations at the folder level should be considered (and if applicable, applied) first.
- More specific configurations should be applied at the model level using one of these methods.
- Models within the
marts
folder should be materialized astable
orincremental
. - By default,
marts
should be materialized astable
withindbt_project.yml
. - If switching to
incremental
, this should be specified in the model's configuration.
Testing
- At a minimum,
unique
andnot_null
tests should be applied to the expected primary key of each model.
CTEs
For more information about why we use so many CTEs, check out this glossary entry.
-
Where performance permits, CTEs should perform a single, logical unit of work.
-
CTE names should be as verbose as needed to convey what they do.
-
CTEs with confusing or noteable logic should be commented with SQL comments as you would with any complex functions, and should be located above the CTE.
-
CTEs that are duplicated across models should be pulled out and created as their own models.
-
CTEs fall in to two main categories:
Term Definition Import Used to bring data into a model. These are kept relatively simple and refrain from complex operations such as joins and column transformations. Logical Used to perform a logical step with the data that is brought into the model toward the end result. -
All
{{ ref() }}
or{{ source() }}
statements should be placed within import CTEs so that dependent model references are easily seen and located. -
Where applicable, opt for filtering within import CTEs over filtering within logical CTEs. This allows a developer to easily see which data contributes to the end result.
-
SQL should end with a simple select statement. All other logic should be contained within CTEs to make stepping through logic easier while troubleshooting. Example:
select * from final
-
SQL and CTEs within a model should follow this structure:
with
statement- Import CTEs
- Logical CTEs
- Simple select statement
Example SQL with CTEs
-- Jaffle shop went international!
with
-- Import CTEs
regions as (
select * from {{ ref('stg_jaffle_shop__regions') }}
),
nations as (
select * from {{ ref('stg_jaffle_shop__nations') }}
),
suppliers as (
select * from {{ ref('stg_jaffle_shop__suppliers') }}
),
-- Logical CTEs
locations as (
select
{{ dbt_utils.generate_surrogate_key([
'regions.region_id',
'nations.nation_id'
]) }} as location_sk,
regions.region_id,
regions.region,
regions.region_comment,
nations.nation_id,
nations.nation,
nations.nation_comment
from regions
left join nations
on regions.region_id = nations.region_id
),
final as (
select
suppliers.supplier_id,
suppliers.location_id,
locations.region_id,
locations.nation_id,
suppliers.supplier_name,
suppliers.supplier_address,
suppliers.phone_number,
locations.region,
locations.region_comment,
locations.nation,
locations.nation_comment,
suppliers.account_balance
from suppliers
inner join locations
on suppliers.location_id = locations.location_sk
)
-- Simple select statement
select * from final
SQL style guide
- DO NOT OPTIMIZE FOR FEWER LINES OF CODE.
New lines are cheap, brain time is expensive; new lines should be used within reason to produce code that is easily read.
-
When dealing with long
when
orwhere
clauses, predicates should be on a new line and indented.
Example: -
Lines of SQL should be no longer than 80 characters and new lines should be used to ensure this.
Example: -
Use all lowercase unless a specific scenario needs you to do otherwise. This means that keywords, field names, function names, and file names should all be lowercased.
-
The
as
keyword should be used when aliasing a field or table -
Fields should be stated before aggregates / window functions
-
Aggregations should be executed as early as possible before joining to another table.
-
Ordering and grouping by a number (eg. group by 1, 2) is preferred over listing the column names (see this rant for why). Note that if you are grouping by more than a few columns, it may be worth revisiting your model design. If you really need to, the dbt_utils.group_by function may come in handy.
-
Prefer
union all
tounion
* -
Avoid table aliases in join conditions (especially initialisms) – it's harder to understand what the table called "c" is compared to "customers".
-
If joining two or more tables, always prefix your column names with the table alias. If only selecting from one table, prefixes are not needed.
-
Be explicit about your join (i.e. write
inner join
instead ofjoin
).left joins
are the most common,right joins
often indicate that you should change which table you selectfrom
and which one youjoin
to. -
Avoid the
using
clause in joins, preferring instead to explicitly list the CTEs and associated join keys with anon
clause. -
Joins should list the left table first (i.e., the table you're joining data to)
Example:
Example SQL
with
my_data as (
select * from {{ ref('my_data') }}
where not is_deleted
),
some_cte as (
select * from {{ ref('some_cte') }}
),
some_cte_agg as (
select
id,
sum(field_4) as total_field_4,
max(field_5) as max_field_5
from some_cte
group by 1
),
final as (
select [distinct]
my_data.field_1,
my_data.field_2,
my_data.field_3,
-- use line breaks to visually separate calculations into blocks
case
when my_data.cancellation_date is null
and my_data.expiration_date is not null
then expiration_data
when my_data.cancellation_date is null
then my_data.start_date + 7
else my_data.cancellation_date
end as cancellation_date,
some_cte_agg.total_field_4,
some_cte_agg.max_field_5
from my_data
left join some_cte_agg
on my_data.id = some_cte_agg.id
where
my_data.field_1 = 'abc'
and (
my_data.field_2 = 'def'
or my_data.field_2 = 'ghi'
)
qualify row_number() over(
partition by my_data.field_1
order by my_data.start_date desc
) = 1
)
select * from final
YAML and Markdown style guide
-
Every subdirectory contains their own
.yml
file(s) which contain configurations for the models within the subdirectory. -
YAML and markdown files should be prefixed with an underscore (
_
) to keep it at the top of the subdirectory. -
YAML and markdown files should be named with the convention
_<description>__<config>
.
Examples: _jaffle_shop__sources.yml
, _jaffle_shop__docs.md
description
is typically the folder of models you're setting configurations for.
Examples:core
,staging
,intermediate
config
is the top-level resource you are configuring.
Examples:docs
,models
,sources
-
Indents should use two spaces.
-
List items should be indented.
-
Use a new line to separate list items that are dictionaries, where appropriate.
-
Lines of YAML should be no longer than 80 characters.
-
Items listed in a single .yml or .md file should be sorted alphabetically for ease of finding in larger files.
-
Each top-level configuration should use a separate
.yml
file (i.e, sources, models) Example: -
dbt_project.yml
configurations should be prefixed with+
to avoid namespace collision with directories.
Example:
Example YAML
_jaffle_shop__models.yml
:
version: 2
models:
- name: base_jaffle_shop__nations
description: This model cleans the raw nations data
columns:
- name: nation_id
tests:
- unique
- not_null
- name: base_jaffle_shop__regions
description: >
This model cleans the raw regions data before being joined with nations
data to create one cleaned locations table for use in marts.
columns:
- name: region_id
tests:
- unique
- not_null
- name: stg_jaffle_shop__locations
description: "{{ doc('jaffle_shop_location_details') }}"
columns:
- name: location_sk
tests:
- unique
- not_null
#### Example Markdown
_jaffle_shop__docs.md
:
```markdown
Although most of our data sets have statuses attached, you may find some
that are enumerated. The following table can help you identify these statuses.
| Status | Description |
|--------|---------------|
| 1 | ordered |
| 2 | shipped |
| 3 | pending |
| 4 | order_pending |
{% enddocs %}
{% docs statuses %}
Statuses can be found in many of our raw data sets. The following lists
statuses and their descriptions:
| Status | Description |
|---------------|-----------------------------------------------------------------------------|
| ordered | A customer has paid at checkout. |
| shipped | An order has a tracking number attached. |
| pending | An order has been paid, but doesn't have a tracking number. |
| order_pending | A customer has not yet paid at checkout, but has items in their cart. |
{% enddocs %}
```
Jinja style guide
-
Jinja delimiters should have spaces inside of the delimiter between the brackets and your code.
Example:{{ this }}
instead of{{this}}
-
Use whitespace control to make compiled SQL more readable.
-
An effort should be made for a good balance in readability for both templated and compiled code. However, opt for code readability over compiled SQL readability when needed.
-
A macro file should be named after the main macro it contains.
-
A file with more than one macro should follow these conventions:
- There is one macro which is the main focal point
- The file is named for the main macro or idea
-
All other macros within the file are only used for the purposes of the main idea and not used by other macros outside of the file.
-
Use new lines to visually indicate logical blocks of Jinja or to enhance readability.
Example:
{%- set orig_cols = adapter.get_columns_in_relation(ref('fct_orders')) %} {%- set new_cols = dbt_utils.star( from=ref('fct_order_items'), except=orig_cols ) %} -- original columns. {{ col }} is indented here, but choose what will satisfy -- your own balance for Jinja vs. SQL readability. {%- for col in orig_cols %} {{ col }} {% endfor %} -- column difference {{ new_cols }}
-
Use new lines within Jinja delimiters and arrays if there are multiple arguments.
Example:
Metrics style guide
Organizing Metrics
- Metrics are categorized by entity (object grain that the metrics occurs), and filenames directly correspond to metrics.
Filenames are prefixed withbase__
only if they are pre-calculated inputs to derived metrics in other files.
Metrics Conventions
dbt Metrics fall into four broad categories: 1. Company metrics 2. Team KPIs 3. OKRs 4. Specific metrics related to a product area, business unit, or business function that is not necessarily a team KPI, but important to track nonetheless.
Because of the wide socialization of these docs and downstream usage in the BI layer, consistency and clarity are very important. Below are the general standards and examples of how we format and implement metrics at dbt Labs: * Metrics names must begin with a letter, cannot contain whitespace, and should be all lowercase. * The minimum required properties must be present in the metric definition. * Tags and/or Meta properties should match the categories above and be used to organize metrics at the category or business function level. * Meta properties should be used to track metric definition ownership. * For up-to-date information on metrics, please see the metrics docs on defining a metric or the dbt-labs/metrics README
Example Metrics YAML
version: 2
metrics:
- name: base__total_nps_respondents_cloud
label: (Base) Total of NPS Respondents (Cloud)
model: ref('fct_customer_nps')
description: >
'The count of users responding to NPS surveys in dbt Cloud.'
tags: ['Company Metric']
calculation_method: count
expression: unique_id
timestamp: created_at
time_grains: [day, month, quarter, year]
dimensions:
- feedback_source
filters:
- field: feedback_source
operator: '='
value: "'dbt_cloud_nps'"
meta:
metric_level: 'Company'
owner(s): 'Jane Doe'
- name: base__count_nps_promoters_cloud
label: (Base) Count of NPS Promoters (Cloud)
model: ref('fct_customer_nps')
description: >
'The count of dbt Cloud respondents that fall into the promoters segment.'
tags: ['Company Metric']
calculation_method: count
expression: unique_id
timestamp: created_at
time_grains: [day, month, quarter, year]
filters:
- field: feedback_source
operator: '='
value: "'dbt_cloud_nps'"
- field: nps_category
operator: '='
value: "'promoter'"
meta:
metric_level: 'Company'
owner(s): 'Jane Doe'
- name: promoters_pct
label: Percent Promoters (Cloud)
description: 'The percent of dbt Cloud users in the promoters segment.'
tags: ['Company Metric']
calculation_method: derived
expression: "{{metric('base__count_nps_promoters_cloud')}} / {{metric('base__total_nps_respondents_cloud')}}"
timestamp: created_at
time_grains: [day, month, quarter, year]
meta:
metric_level: 'Company'
owner(s): 'Jane Doe'