Browsed by
Category: ETL

Simple full load SSIS package in Biml

Simple full load SSIS package in Biml

In this post I will show a simple full load SSIS package and how it can be created in Biml.

Introduction

I will try to show how to deal with these components of SSIS in Biml:

  • Connections (Connections Managers)
  • Variables
  • Package(s)
  • Tasks
  • Containers
  • Precedence Constraints
  • Data Flow (as a specific type of Task):
    • How to Extract data
    • How to Transform data
    • How to Load data

So quite a lot of new stuff to learn.

There are, however, several assumptions, which are:

  • We have a table in our Source System from where we get data
    • Source System name is Sales (Let’s try to imagine that this is your Sales System 🙂 )
    • Database name is SalesDB
    • Source table name is dbo.Sales
  • We have a table in our Staging area of our DWH to where we load data
    • Our DataWarehouse name is DWH
    • Destination table name is stg.Sales (in stg schema in order to indicate that this is a staging area, purely a convention)
  • There is 1:1 mapping between table in Source System and Destination DWH in terms of structure
    • The same fields in both tables with the same data types
    • There is an additional field in Destination Table connected with Data Lineage
  • We want to do a Full Load of data every time from Source table to Destination table:
    • We truncate destination table every time
    • We extract data from Source, add Lineage data and load into Destination

Below you can find a conceptual image how this kind of ETL process may look like:

Conceptual image of SSIS package
Conceptual image of SSIS package

Here is the full code of the SSIS package written in Biml (you will find a step-by-step explanation below the code):

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
    
    <Connections>
        <Connection Name = "Src_SalesDB" CreateInProject="true" ConnectionString = "Data Source=MUSPELHEIM;Initial Catalog=SalesDB;Provider=SQLOLEDB.1;Integrated Security=SSPI;Auto Translate=False;" />
        <Connection Name = "Dst_DWH" CreateInProject="true" ConnectionString = "Data Source=MUSPELHEIM;Initial Catalog=DWH;Provider=SQLOLEDB.1;Integrated Security=SSPI;Auto Translate=False;" />
        <Connection Name = "Metadata" CreateInProject="true" ConnectionString = "Data Source=MUSPELHEIM;Initial Catalog=DWH_Metadata;Provider=SQLOLEDB.1;Integrated Security=SSPI;Auto Translate=False;" />
    </Connections>
    
    <Packages>
    
        <Package Name = "Source2Staging_SalesDB_Sales" ConstraintMode = "Parallel" ProtectionLevel = "DontSaveSensitive">
        
            <Variables>
                <Variable Name="var_ExecutionId" DataType="Int32">0</Variable>
            </Variables> 
        
            <Tasks>
            
                <ExecuteSQL Name = "SQL Start Package" ConnectionName = "Metadata" ResultSet="SingleRow">
                    
                    <DirectInput>EXECUTE etl.StartPackage 'Source2Staging_SalesDB_Sales'</DirectInput>
                    <Results>
                        <Result VariableName="User.var_ExecutionId" Name="0" />
                    </Results>
                
                </ExecuteSQL>
                
                <Container Name = "SC Package Logic">
                    
                    <PrecedenceConstraints>
                        <Inputs>
                            <Input OutputPathName="SQL Start Package.Output" />
                        </Inputs>
                    </PrecedenceConstraints>
            
                    <Tasks>
            
                        <ExecuteSQL Name = "SQL Truncate table stg__SalesDB_Sales" ConnectionName = "Dst_DWH">
                            <DirectInput>TRUNCATE TABLE stg.SalesDB_Sales</DirectInput>
                        </ExecuteSQL>
                        
                        <Dataflow Name = "Load Data">
                    
                            <PrecedenceConstraints>
                                <Inputs>
                                    <Input OutputPathName="SQL Truncate table stg__SalesDB_Sales.Output" />
                                </Inputs>
                            </PrecedenceConstraints>
                                
                            <Transformations>
                            
                                <OleDbSource Name="OLE DB Source dbo__Sales" ConnectionName="Src_SalesDB">
                                    
                                    <DirectInput>
                                        SELECT
                                          CustomerName
                                          ,ProductName
                                          ,RegionName
                                          ,SalesDate
                                          ,Quantity
                                          ,PricePerItem
                                        FROM dbo.Sales;
                                    </DirectInput>
                                
                                    
                                </OleDbSource>
                                
                                <DerivedColumns Name="Lineage Column">
                                
                                    <Columns>
                                        <Column Name="LoadId" DataType="Int32">@[User::var_ExecutionId]</Column>
                                    </Columns>
                                
                                    
                                </DerivedColumns>
                                
                                <OleDbDestination Name="OLE DB Destination stg__Sales" ConnectionName="Dst_DWH">
                                    
                                    <ExternalTableOutput Table="stg.SalesDB_Sales" />
                                    
                                </OleDbDestination>
                                
                            </Transformations>
                    
                        </Dataflow>
            
                    </Tasks>
                    
                </Container>

             
                <ExecuteSQL Name = "SQL End Package" ConnectionName = "Metadata">
                
                    <PrecedenceConstraints>
                        <Inputs>
                            <Input EvaluationOperation="Constraint" EvaluationValue="Completion" OutputPathName="SC Package Logic.Output" />
                        </Inputs>
                    </PrecedenceConstraints>
                
                    <DirectInput>EXECUTE etl.EndPackage ?</DirectInput>
    
                    <Parameters>
                        <Parameter VariableName="User.var_ExecutionId" Name="0" DataType="Int32" Direction="Input" />
                    </Parameters>
                
                </ExecuteSQL>
            
            </Tasks>
        
        </Package>
            
            
    </Packages>
    
</Biml>

And here is a screenshot of the created package by using above-mentioned Biml code:

Final SSIS package generated from Biml
Final SSIS package generated from Biml

Detailed Explanation of the code

Below there is a detailed description and explanation of the above-mentioned code, part by part:

Connections

Lines from 3 to 7 define Connection Managers. I defined 3 Connection Managers:

  • one for my Source Database (Name attribute in Connection element is set to Src_SalesDB),
  • one for my Destination Database (Name attribute in Connection element is set to Dst_DWH),
  • one for my MetaData database (Name attribute in Connection element is set to Metadata).

Some remarks:

  • All Connection Managers are configured to be project-level CM (CreateInProject attribute in Connection element is set to true),
  • In Data Source attribute on Connection element you need to define a full connections string to your database. Please notice that Provider is SQLOLEDB.1 , which means that I used OLE DB for SQL Server Connection Manager.
<Connections>
    <Connection Name = "Src_SalesDB" CreateInProject="true" ConnectionString = "Data Source=MUSPELHEIM;Initial Catalog=SalesDB;Provider=SQLOLEDB.1;Integrated Security=SSPI;Auto Translate=False;" />
    <Connection Name = "Dst_DWH" CreateInProject="true" ConnectionString = "Data Source=MUSPELHEIM;Initial Catalog=DWH;Provider=SQLOLEDB.1;Integrated Security=SSPI;Auto Translate=False;" />
    <Connection Name = "Metadata" CreateInProject="true" ConnectionString = "Data Source=MUSPELHEIM;Initial Catalog=DWH_Metadata;Provider=SQLOLEDB.1;Integrated Security=SSPI;Auto Translate=False;" />
</Connections>

Packages

This is an element that gather all your defined ETL packages. In the example code it is from line 9 to line 113.

Package

Some remarks:

  • Inside of Packages element you can define all your packages,
  • You can define as many packages as you want,
  • Every single definition of a package is within Package element.

In my example, there is only one package called Source2Staging_SalesDB_Sales, where I defined in line 11:

<Package Name = "Source2Staging_SalesDB_Sales" ConstraintMode = "Parallel" ProtectionLevel = "DontSaveSensitive">

Variables

This is a variables’ declaration section.

Some remarks:

  • I defined only one variable called var_ExecutionId (Name attribute of Variable element),
  • Data type is an Integer data type (DataType attribute of Variable element) (int for SQL Server, Int32 for SSIS to be precise),
  • Because my variable is a Int32 data type (which is a number), it can not be empty, that is why I initiated this variable with a value of 0 (zero).

What is important to mention is the scope of the variable. Because I defined the variable at the highest level of the package (it is not nested in any Task), so that it is visible for the entire package (the scope value is Source2Staging_SalesDB_Sales which is the package name). If you define variable inside Task/Container, the scope of the variable is not the entire package anymore: it is only the task/container you defined the variable and all its children tasks/containers.

This section is between lines 13 and 15 in the example code.

<Variables>
    <Variable Name="var_ExecutionId" DataType="Int32">0</Variable>
</Variables>

Tasks

Each SSIS ETL package should consists of Tasks, which are defined inside Tasks element. In my example – lines 17-108.
Inside Tasks element, I defined all my ETL package logic. Below there is an explanation of all of them:

Task – SQL Start Package

Some remarks:

  • This piece of code creates an Execute SQL Task,
  • The name of the task is SQL Start Package (Name attribute of ExecuteSQL element),
  • Connection Manager is set to Metadata (ConnectionName attribute of ExecuteSQL element),
  • As we expect to have a single row as a result, appropriate attribute is set to SingleRow (ResultSet attribute of ExecuteSQL element),
  • In DirectInput element an T-SQL statement to execute is defined,
  • As we expect some results, a result is mapped to appropriate variable (Results element and Result sub-element) with appropriate order.
    Notice that:

    • we defined variable as var_ExecutionId (line 14),
    • inside ETL package this variable is visible as @[User::var_ExecutionId],
    • in Biml code it is visible as User.var_ExecutionId.

The code of the task is between lines 19 and 26:

<ExecuteSQL Name = "SQL Start Package" ConnectionName = "Metadata" ResultSet="SingleRow">
    
    <DirectInput>EXECUTE etl.StartPackage 'Source2Staging_SalesDB_Sales'</DirectInput>
    <Results>
        <Result VariableName="User.var_ExecutionId" Name="0" />
    </Results>

</ExecuteSQL>

Sequence Container – SC Package Logic

If you want to create a Sequence Container where your Tasks will be executed, well – sequentially. Then you must wrap up all your Tasks inside Container element.
I created a Sequence Container named SC Package Logic and I wrapped up there 2 Tasks:

  • SQL Truncate table stg__SalesDB_Sales (ExecuteSQL Task),
  • Load Data (DataFlow Task).

Here is the code how to create a Sequence Container (line 28):

<Container Name = "SC Package Logic">

Task – SQL Truncate table stg__SalesDB_Sales

Some remarks:

  • This piece of code creates an Execute SQL Task,
  • The name of the task is SQL Truncate table stg__SalesDB_Sales (Name attribute of ExecuteSQL element),
  • Connection Manager is set to Dst_DWH (ConnectionName attribute of ExecuteSQL element),
  • In DirectInput element an T-SQL statement to execute is defined:
    • A simple truncation of the destination table is executed, as we want to implement Full Loading of data.

The code of the task is between lines 38 and 40:

<ExecuteSQL Name = "SQL Truncate table stg__SalesDB_Sales" ConnectionName = "Dst_DWH">
    <DirectInput>TRUNCATE TABLE stg.SalesDB_Sales</DirectInput>
</ExecuteSQL>

 Task – Load Data

Some remarks:

  • This piece of code creates a Data Flow Task,
  • The name of the task is Load Data (Name attribute of Dataflow element),
  • Precedence Constraint was defined for this Task (PrecedenceConstraints element):
    • Input for this Task is the default output of SQL Truncate table stg__SalesDB_Sales Task (OutputPathName attribute)

Data Flow Task consists of elements that are called Transformations (in Biml every step inside Data Flow Task is a Transformation not a Component, regardless whether they are Source Component, Transformation Component or Destination Component). In the example Data Flow Task called Load Data there are 3 Transformations:

  • OLE DB Source dbo__Sales (OleDbSource transformation),
  • Lineage Columns (DerivedColumns transformation),
  • OLE DB Destination stg__Sales (OleDbDestination transformation).

Below all of these transformations are described in more details:

Transformation – OLE DB Source dbo__Sales

Some remarks:

  • This piece of code creates an OLE DB Source Component
  • The name of the component is OLE DB Source dbo__Sales (Name attribute of OleDbSource element)
  • Connection Manager is set to Src_SalesDB (ConnectionName attribute of OleDbSource element)
  • In DirectInput element an T-SQL SELECT statement to execute is defined:
    • A simple SELECT statement that defines what data we want to grab from our Source database

The code of the task is between lines 52 and 66:

<OleDbSource Name="OLE DB Source dbo__Sales" ConnectionName="Src_SalesDB">
                                    
                                    <DirectInput>
                                        SELECT
                                          CustomerName
                                          ,ProductName
                                          ,RegionName
                                          ,SalesDate
                                          ,Quantity
                                          ,PricePerItem
                                        FROM dbo.Sales;
                                    </DirectInput>
                                
                                    
                                </OleDbSource>

Transformation – Lineage Columns

Some remarks:

  • This piece of code creates an Derived Column Component
  • The name of the component is Lineage Column (Name attribute of DerivedColumns element)
  • In Columns element we define what columns we want to derive. Each column must be defined in a separate Column element.
    In the example, there is one derived column:

    • Name of the column is LoadId (Name attribute of Column element),
    • Data type is Int32 (DataType attribute of Column element),
    • We define that LoadId has a value of this expression: @[User::var_ExecutionId] (so in fact, we assign a value of our var_ExecutionId variable to LoadId derived column).

The code of the task is between lines 68 and 75:

<DerivedColumns Name="Lineage Column">

    <Columns>
        <Column Name="LoadId" DataType="Int32">@[User::var_ExecutionId]</Column>
    </Columns>

    
</DerivedColumns>

 Transformation – OLE DB Destination stg__Sales

Some remarks:

  • This piece of code creates an OLE DB Destination Component
  • The name of the component is OLE DB Destination stg__Sales (Name attribute of OleDbSource element)
  • Connection Manager is set to Dst_DWH (ConnectionName attribute of OleDbSource element)
  • In ExternalTableOutput element destination table where data is loaded is defined

The code of the task is between lines 77 and 81:

<OleDbDestination Name="OLE DB Destination stg__Sales" ConnectionName="Dst_DWH">
    
    <ExternalTableOutput Table="stg.SalesDB_Sales" />
    
</OleDbDestination>

Task – SQL End Package

Some remarks:

  • This piece of code creates an Execute SQL Task
  • The name of the task is SQL End Package (Name attribute of ExecuteSQL element)
  • Connection Manager is set to Metadata (ConnectionName attribute of ExecuteSQL element)
  • Contrary to SQL Start Package we do not expect any result
  • In DirectInput element an T-SQL statement to execute is defined
  • Our T-SQL accepts a parameter which is defined in Parameters element and Parameter sub-element.
    Notice that we need to define some attributes in Parameter element:

    • The name of the variable (VariableName attribute)
    • Position of the parameter in our T-SQL code (Name attribute)
    • Data type (DataType attribute)
    • Direction of the parameter which is Input in our case (Direction attribute)
  • Precedence Constraint was defined for this Task (PrecedenceConstraints element):
    • Input for this Task is the default output of SC Package Logic Container (OutputPathName attribute)
    • Type of precedence constraint is set to Completion (EvaluationValue attribute)

The code of the task is between lines 92 and 106:

<ExecuteSQL Name = "SQL End Package" ConnectionName = "Metadata">

    <PrecedenceConstraints>
        <Inputs>
            <Input EvaluationOperation="Constraint" EvaluationValue="Completion" OutputPathName="SC Package Logic.Output" />
        </Inputs>
    </PrecedenceConstraints>

    <DirectInput>EXECUTE etl.EndPackage ?</DirectInput>
    
    <Parameters>
        <Parameter VariableName="User.var_ExecutionId" Name="0" DataType="Int32" Direction="Input" />
    </Parameters>

</ExecuteSQL>

Conclusion

Based on this code, a fully working SSIS ETL package will be created. For smaller packages like this creation from GUI might be faster than writing Biml code, but when it comes to creation of larger packages or creation of multiple packages, Biml shows its strength against Visual Studio GUI for SSIS. In the next article from this series, I will show how to create several packages by having only one Biml code, which drastically reduce development time.

Dynamic Parameterization of Azure SSIS Data Flow Components

Dynamic Parameterization of Azure SSIS Data Flow Components

In this article I would like to explain how to dynamically parameterize some of the Azure Data Flow Components settings.

Introduction

With Azure Feature Pack for SSIS, 4 new Data Flow Components were added:

  • For Blob
    • Azure Blob Source
    • Azure Blob Destination
  • For Data Lake
    • Azure Data Lake Store Source
    • Azure Data Lake Store Destination

 

Data Flow Components shipped with Azure Feature Pack for SSIS
Data Flow Components shipped with Azure Feature Pack for SSIS

Very often there is a need to parametrize some of the settings of these components.
One of the most useful parameters can be:

  • For Blob Components (Azure Blob Source & Azure Blob Destination)
    • Blob container name
    • Blob name
  • For Data Lake Components (Azure Data Lake Source & Azure Data Lake Destination)
    • File Path

On below screenshots Azure Blob Destination Editor (left hand side) and Azure Data Lake Store Destination Editor (right hand side) windows are shown as an example (Azure Blob Source Editor and Azure Data Lake Store Source Editor look similar). What we can easily see is that on:

  • Azure Blob Destination Editor
    • Blob container name
    • Blob name
  • Azure Data Lake Store Destination Editor
    • File Path

are just a text boxes without any chance of dynamic manipulation of the values.

 

Azure Blob Destination Editor
Azure Blob Destination Editor

Azure Data Lake Store Editor
Azure Data Lake Store Editor

 

When we check the Properties of these 4 components (RMB click on the component, then select Properties), we indeed see that these settings are visible, but still there is no mechanism to parametrize them (Below example for Azure Blob Destination Editor (left hand side) and Azure Data Lake Store Destination Editor (right hand side)).

 

Azure Blob Destination Properties
Azure Blob Destination Properties

Azure Data Lake Store Destination Properties
Azure Data Lake Store Destination Properties

 

The Solution

The solution is:

You are allowed to parametrize these settings on a whole Data Flow component level

In order to do this, instead of editing Properties of these 4 Components itself, edit the Properties of the parent Data Flow Component that encapsulates these 4 Components (RMB click on the whole Data Flow component, then select Properties).

When you do this, you see Expression Property Item. You can click for […] button (again, examples only for: Azure Blob Destination Editor (left hand side) and Azure Data Lake Store Destination Editor (right hand side)):

 

Properties on Data Flow Task level for Azure Blob Destination
Properties on Data Flow Task level for Azure Blob Destination

Properties on Data Flow Task level for Azure Data Lake Store Destination
Properties on Data Flow Task level for Azure Data Lake Store Destination

 

Then you can assign dynamic values based on your variables and/or parameters.

List of Properties that can be parameterized for these 4 Components are:

  • For Blob
    • Azure Blob Source
      • Blob Container
      • Blob Name
    • Azure Blob Destination
      • Blob Container
      • Blob Name
  • For Data Lake
    • Azure Data Lake Store Source
      • FilePath
      • FirstRowAsColumnNames
    • Azure Data Lake Store Destination
      • BathSize
      • ColumnDelimiter
      • FileFormat
      • FilePath
      • FirstRowAsColumnNames
      • IsBatchMode

 

Below just a visual representation of all Properties for these 4 Azure Data Flow Components:

 

Azure Blob Source Component properties
Azure Blob Source Component properties

Azure Blob Destination Component properties
Azure Blob Destination Component properties

Azure Data Lake Store Source Component properties
Azure Data Lake Store Source Component properties

Azure Data Lake Store Destination Component properties
Azure Data Lake Store Destination Component properties