Setup for development

Installation

Clone the repository github.com/OpenEnergyPlatform/open-dapro and open it.
Install your Dagster code location as a Python package. By using the -e (editable) flag, pip will install your Python package in "editable mode" so that as you develop, local code changes will automatically apply.
```
pip install -e ".[dev]"
```
Next, make sure you have docker and docker-compose installed. You can check this by running:
```
docker-compose --version
```
You now need to rename the .env.template file to .env and change your credentials if needed. The .env file will not be uploaded to git. Note that these credentials have to match the database created with the development/docker-compose.yml file.
To initialize the database and to create the docker container, run:
```
python development/initialize.py
```
Check if the database is running on the server and port specified in the .env file.

Start the Dagster UI web server:

dagster dev

If the environment variables were loaded successfully, you should see the following line:

dagster - INFO - Loaded environment variables from .env file: pwd,uid,server,db,port,schema, MASTR_DOWNLOAD_DATE

Open 127.0.0.1:3000 with your browser to see the project. You can start writing your own assets in energy_dagster/assets.py. The assets are automatically loaded into the Dagster code location as you define them.

Tools we use

pre-commit hooks

In this project, we use pre-commit hooks to lint the code before committing. The hooks are defined in the .pre-commit-config.yaml file. To install the hooks, run the following command:

pre-commit install

This will install the hooks in your local repository. They will be executed before every commit and check for linting errors using the sqlfluff and black packages.

dbt osmosis

You can use dbt-osmosis for creating, updating, and deleting dbt property files. This can be done using the following command:

dbt-osmosis yaml refactor .\models\marts\

dbt Style Guide

This style guide is a shortened version of the dbt-labs style guide.

Model Organization

Our models (typically) fit into two main categories:

Category	Description
Staging	Contains models which clean and standardize data
Marts	Contains models which combine or heavily transform data

Things to note: - There are different types of models that typically exist in each of the above categories.
See Model Layers for more information.

Read How we structure our dbt projects for an example and more details around organization.

Model Layers

Only models in staging should select from sources
Models not within the staging folder should select from refs.

The following are the DAG stages that we tend to utilize:

dag_stage	Typically found in	description
seed_	/seeds	Indicates a data set created from `dbt seed`.
stg_	/models/staging	Indicates a data set that is being cleaned and standardized. In absence of a base_ layer, it representthe 1:1 relationship between the source and first layer of models. li>
int_	/models/marts	Indicates a logical step towards creating a final data set. Typically used for: Breaking ua very large fct_ or dim_ model into smaller pieces to reduce complexity Creating a reusable data set to reference in multiple downstream fctand dim_ models
dim_	/models/marts	Flags data which is used to describe an entity. Indicates a final data which is robust, versatile, anready for consumption.
base_	/models/staging	Indicates cleaning and standardization on a data set before joining to other data sets in `stg_` models. Typically used when multiple sources are rarely used independently. *Example: Location data in our org is seldom used partially, so we want to create one cleaned data set which puts it all together. Step 1: Models to clean and standardize each data set: base_location__addresses.sql base_location__countries.sql base_location__states.sql Step 2*: A model to join all location data as one entity for use in downstream modeling: stg_location__locations.sql

Model File Naming and Coding

All objects should be plural.
Example: stg_stripe__invoices.sql vs. stg_stripe__invoice.sql
All objects should have a prefix to indicate their DAG stage in the flow.
See Model Layers for more information.
All models should use the naming convention <type/dag_stage>_<source/topic>__<additional_context>. See this article for more information.
For models in the marts folder __<additional_context> is optional.
Models in the staging folder should use the source's name as the <source/topic> and the entity name as the additional_context.

Examples: - seed_snowflake_spend.csv - base_stripe__invoices.sql - stg_stripe__customers.sql - stg_salesforce__customers.sql - int_customers__unioned.sql - fct_orders.sql
Schema, table and column names should be in snake_case.
Limit use of abbreviations that are related to domain knowledge. An onboarding employee will understand current_order_status better than current_os.
Each model should have a primary key that can identify the unique row, and should be named <object>_id, e.g. account_id – this makes it easier to know what id is being referenced in downstream joined models.

For base or staging models, columns should be ordered in categories, where identifiers are first and date/time fields are at the end.
Example:

transformed as (
    select
        -- ids
        order_id,
        customer_id,

        -- dimensions
        order_status,
        is_shipped,

        -- measures
        order_total,

        -- date/times
        created_at,
        updated_at,

        -- metadata
        _sdc_batched_at
    from source
)

Date/time columns should be named according to these conventions:
Timestamps: <event>_at
Format: UTC
Example: created_at
Dates: <event>_date
Format: Date
Example: created_date
Booleans should be prefixed with is_ or has_.
Example: is_active_customer and has_admin_access
Avoid using reserved words (such as these for Snowflake) as column names.
Consistency is key! Use the same field names across models where possible.
Example: a key to the customers table should be named customer_id rather than user_id.

Model Configurations

Model configurations at the folder level should be considered (and if applicable, applied) first.
More specific configurations should be applied at the model level using one of these methods.
Models within the marts folder should be materialized as table or incremental.
By default, marts should be materialized as table within dbt_project.yml.
If switching to incremental, this should be specified in the model's configuration.

Testing

At a minimum, unique and not_null tests should be applied to the expected primary key of each model.

CTEs

For more information about why we use so many CTEs, check out this glossary entry.

Where performance permits, CTEs should perform a single, logical unit of work.
CTE names should be as verbose as needed to convey what they do.
CTEs with confusing or noteable logic should be commented with SQL comments as you would with any complex functions, and should be located above the CTE.
CTEs that are duplicated across models should be pulled out and created as their own models.

CTEs fall in to two main categories:

Term	Definition
Import	Used to bring data into a model. These are kept relatively simple and refrain from complex operations such as joins and column transformations.
Logical	Used to perform a logical step with the data that is brought into the model toward the end result.

All {{ ref() }} or {{ source() }} statements should be placed within import CTEs so that dependent model references are easily seen and located.
Where applicable, opt for filtering within import CTEs over filtering within logical CTEs. This allows a developer to easily see which data contributes to the end result.
SQL should end with a simple select statement. All other logic should be contained within CTEs to make stepping through logic easier while troubleshooting. Example: select * from final
SQL and CTEs within a model should follow this structure:
with statement
Import CTEs
Logical CTEs
Simple select statement

Example SQL with CTEs

 -- Jaffle shop went international!
with

-- Import CTEs
regions as (
    select * from {{ ref('stg_jaffle_shop__regions') }}
),

nations as (
    select * from {{ ref('stg_jaffle_shop__nations') }}
),

suppliers as (
    select * from {{ ref('stg_jaffle_shop__suppliers') }}
),

-- Logical CTEs
locations as (
    select
        {{ dbt_utils.generate_surrogate_key([
            'regions.region_id',            
            'nations.nation_id'
        ]) }} as location_sk,
        regions.region_id,
        regions.region,
        regions.region_comment,
        nations.nation_id,
        nations.nation,
        nations.nation_comment
    from regions
    left join nations
        on regions.region_id = nations.region_id
),

final as (
    select
        suppliers.supplier_id,
        suppliers.location_id,
        locations.region_id,
        locations.nation_id,
        suppliers.supplier_name,
        suppliers.supplier_address,
        suppliers.phone_number,
        locations.region,
        locations.region_comment,
        locations.nation,
        locations.nation_comment,
        suppliers.account_balance
    from suppliers
    inner join locations
        on suppliers.location_id = locations.location_sk
)

-- Simple select statement
select * from final

SQL style guide

DO NOT OPTIMIZE FOR FEWER LINES OF CODE.

New lines are cheap, brain time is expensive; new lines should be used within reason to produce code that is easily read.

When dealing with long when or where clauses, predicates should be on a new line and indented.
Example:

where 
    user_id is not null
    and status = 'pending'
    and location = 'hq'

Lines of SQL should be no longer than 80 characters and new lines should be used to ensure this.
Example:

sum(
    case
        when order_status = 'complete'
            then order_total 
    end
) as monthly_total,



{{ get_windowed_values(
      strategy='sum',
      partition='order_id',
      order_by='created_at',
      column_list=[
          'final_cost'
      ]
) }} as total_final_cost

Use all lowercase unless a specific scenario needs you to do otherwise. This means that keywords, field names, function names, and file names should all be lowercased.
The as keyword should be used when aliasing a field or table
Fields should be stated before aggregates / window functions
Aggregations should be executed as early as possible before joining to another table.
Ordering and grouping by a number (eg. group by 1, 2) is preferred over listing the column names (see this rant for why). Note that if you are grouping by more than a few columns, it may be worth revisiting your model design. If you really need to, the dbt_utils.group_by function may come in handy.
Prefer union all to union *
Avoid table aliases in join conditions (especially initialisms) – it's harder to understand what the table called "c" is compared to "customers".
If joining two or more tables, always prefix your column names with the table alias. If only selecting from one table, prefixes are not needed.
Be explicit about your join (i.e. write inner join instead of join). left joins are the most common, right joins often indicate that you should change which table you select from and which one you join to.
Avoid the using clause in joins, preferring instead to explicitly list the CTEs and associated join keys with an on clause.

Joins should list the left table first (i.e., the table you're joining data to)
Example:

select
    trips.*,
    drivers.rating as driver_rating,
    riders.rating as rider_rating
from trips
left join users as drivers
   on trips.driver_id = drivers.user_id
left join users as riders
    on trips.rider_id = riders.user_id

Example SQL

with

my_data as (
    select * from {{ ref('my_data') }}
    where not is_deleted
),

some_cte as (
    select * from {{ ref('some_cte') }}
),

some_cte_agg as (
    select
        id,
        sum(field_4) as total_field_4,
        max(field_5) as max_field_5
    from some_cte
    group by 1
),

final as (
    select [distinct]
        my_data.field_1,
        my_data.field_2,
        my_data.field_3,

        -- use line breaks to visually separate calculations into blocks
        case
            when my_data.cancellation_date is null
                and my_data.expiration_date is not null
                then expiration_data
            when my_data.cancellation_date is null
                then my_data.start_date + 7
            else my_data.cancellation_date
        end as cancellation_date,

        some_cte_agg.total_field_4,
        some_cte_agg.max_field_5
    from my_data
    left join some_cte_agg  
        on my_data.id = some_cte_agg.id
    where 
        my_data.field_1 = 'abc'
        and (
            my_data.field_2 = 'def'
            or my_data.field_2 = 'ghi'
        )
    qualify row_number() over(
        partition by my_data.field_1
        order by my_data.start_date desc
    ) = 1
)

select * from final

YAML and Markdown style guide

Every subdirectory contains their own .yml file(s) which contain configurations for the models within the subdirectory.
YAML and markdown files should be prefixed with an underscore ( _ ) to keep it at the top of the subdirectory.
YAML and markdown files should be named with the convention _<description>__<config>.

Examples: _jaffle_shop__sources.yml, _jaffle_shop__docs.md

description is typically the folder of models you're setting configurations for.
Examples: core, staging, intermediate
config is the top-level resource you are configuring.
Examples: docs, models, sources
Indents should use two spaces.
List items should be indented.
Use a new line to separate list items that are dictionaries, where appropriate.
Lines of YAML should be no longer than 80 characters.
Items listed in a single .yml or .md file should be sorted alphabetically for ease of finding in larger files.

Each top-level configuration should use a separate .yml file (i.e, sources, models) Example:

models
├── marts
└── staging
    └── jaffle_shop
        ├── _jaffle_shop__docs.md

        ├── _jaffle_shop__models.yml
        ├── _jaffle_shop__sources.yml
        ├── stg_jaffle_shop__customers.sql
        ├── stg_jaffle_shop__orders.sql
        └── stg_jaffle_shop__payments.sql

dbt_project.yml configurations should be prefixed with + to avoid namespace collision with directories.

Example:

models:
  my_project:
    marts:
      +materialized: table

Example YAML

_jaffle_shop__models.yml:

version: 2

models:

  - name: base_jaffle_shop__nations

    description: This model cleans the raw nations data
    columns:
      - name: nation_id
        tests:
          - unique
          - not_null   

  - name: base_jaffle_shop__regions
    description: >
      This model cleans the raw regions data before being joined with nations
      data to create one cleaned locations table for use in marts.
    columns:
      - name: region_id
        tests:
          - unique
          - not_null

  - name: stg_jaffle_shop__locations

    description: "{{ doc('jaffle_shop_location_details') }}"

    columns:
      - name: location_sk
        tests:
          - unique
          - not_null

#### Example Markdown _jaffle_shop__docs.md:

```markdown

  Although most of our data sets have statuses attached, you may find some
  that are enumerated. The following table can help you identify these statuses.
  | Status | Description                                                                 |
  |--------|---------------|
  | 1      | ordered       |
  | 2      | shipped       |
  | 3      | pending       |
  | 4      | order_pending |


{% enddocs %}

{% docs statuses %}

  Statuses can be found in many of our raw data sets. The following lists
  statuses and their descriptions:
  | Status        | Description                                                                 |
  |---------------|-----------------------------------------------------------------------------|
  | ordered       | A customer has paid at checkout.                                            |
  | shipped       | An order has a tracking number attached.                                    |
  | pending       | An order has been paid, but doesn't have a tracking number.                 |
  | order_pending | A customer has not yet paid at checkout, but has items in their cart. |

{% enddocs %}

```

Jinja style guide

Jinja delimiters should have spaces inside of the delimiter between the brackets and your code.
Example: {{ this }} instead of {{this}}
Use whitespace control to make compiled SQL more readable.
An effort should be made for a good balance in readability for both templated and compiled code. However, opt for code readability over compiled SQL readability when needed.
A macro file should be named after the main macro it contains.
A file with more than one macro should follow these conventions:
There is one macro which is the main focal point
The file is named for the main macro or idea
All other macros within the file are only used for the purposes of the main idea and not used by other macros outside of the file.

Use new lines to visually indicate logical blocks of Jinja or to enhance readability.
Example:

{%- set orig_cols = adapter.get_columns_in_relation(ref('fct_orders')) %}

{%- set new_cols = dbt_utils.star(
      from=ref('fct_order_items'),
      except=orig_cols
) %}

-- original columns. {{ col }} is indented here, but choose what will satisfy
-- your own balance for Jinja vs. SQL readability. 
{%- for col in orig_cols %}
      {{ col }}
{% endfor %}

-- column difference
{{ new_cols }}

Use new lines within Jinja delimiters and arrays if there are multiple arguments.
Example:

{%- dbt_utils.star(
      from=ref('stg_jaffle_shop__orders'),
      except=[
          'order_id',
          'ordered_at',
          'status'
      ],
      prefix='order_'
) %}

Metrics style guide

Organizing Metrics

Metrics are categorized by entity (object grain that the metrics occurs), and filenames directly correspond to metrics.
Filenames are prefixed with base__ only if they are pre-calculated inputs to derived metrics in other files.

├── dbt_project.yml
└── models
    ├── marts
    ├── staging
    └── metrics
        ├── projects
        |   ├── active_projects.yml
        ├── accounts
        |   ├── active_cloud_accounts.yml
        └── users
            ├── base__net_promoter_score.yml
            └── net_promoter_score.yml

Metrics Conventions

dbt Metrics fall into four broad categories: 1. Company metrics 2. Team KPIs 3. OKRs 4. Specific metrics related to a product area, business unit, or business function that is not necessarily a team KPI, but important to track nonetheless.

Because of the wide socialization of these docs and downstream usage in the BI layer, consistency and clarity are very important. Below are the general standards and examples of how we format and implement metrics at dbt Labs: * Metrics names must begin with a letter, cannot contain whitespace, and should be all lowercase. * The minimum required properties must be present in the metric definition. * Tags and/or Meta properties should match the categories above and be used to organize metrics at the category or business function level. * Meta properties should be used to track metric definition ownership. * For up-to-date information on metrics, please see the metrics docs on defining a metric or the dbt-labs/metrics README

Example Metrics YAML

version: 2

metrics:
  - name: base__total_nps_respondents_cloud
    label: (Base) Total of NPS Respondents (Cloud)
    model: ref('fct_customer_nps')
    description: >
      'The count of users responding to NPS surveys in dbt Cloud.'
    tags: ['Company Metric']

    calculation_method: count
    expression: unique_id

    timestamp: created_at
    time_grains: [day, month, quarter, year]

    dimensions:
      - feedback_source

    filters:
      - field: feedback_source
        operator: '='
        value: "'dbt_cloud_nps'"

    meta:
      metric_level: 'Company'
      owner(s): 'Jane Doe'


  - name: base__count_nps_promoters_cloud
    label: (Base) Count of NPS Promoters (Cloud)
    model: ref('fct_customer_nps')
    description: >
      'The count of dbt Cloud respondents that fall into the promoters segment.'
    tags: ['Company Metric']

    calculation_method: count
    expression: unique_id

    timestamp: created_at
    time_grains: [day, month, quarter, year]

    filters:
      - field: feedback_source
        operator: '='
        value: "'dbt_cloud_nps'"
      - field: nps_category
        operator: '='
        value: "'promoter'"

    meta:
      metric_level: 'Company'
      owner(s): 'Jane Doe'

  - name: promoters_pct
    label: Percent Promoters (Cloud)
    description: 'The percent of dbt Cloud users in the promoters segment.'
    tags: ['Company Metric']

    calculation_method: derived
    expression: "{{metric('base__count_nps_promoters_cloud')}} / {{metric('base__total_nps_respondents_cloud')}}" 

    timestamp: created_at
    time_grains: [day, month, quarter, year]

    meta:
      metric_level: 'Company'
      owner(s): 'Jane Doe'