Matillion provides different types of components, each serving a different purpose in solving the ETL problem at hand. Variables are a familiar concept from programming, and although Matillion is a tool, it has a programming layer behind the scenes. Accordingly, Matillion has a few types of variables: job, environment, and grid. We will discuss these variables in another section.
A grid variable is similar to a two-dimensional array in a programming language. It has multiple columns, and each row stores one data record.
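To make the array analogy concrete, here is a minimal sketch in plain Python. It is not Matillion's API; the name `job_grid`, the helper `column`, and the sample values are assumptions made up for illustration.

```python
# Hypothetical grid variable modelled as a list of rows.
# Each inner list is one record; each position is one column.
job_grid = [
    ["T_ONE", 1],
    ["T_TWO", 2],
    ["T_THREE", 3],
]

def column(grid, index):
    """Return every value stored in one column of the grid."""
    return [row[index] for row in grid]

job_names = column(job_grid, 0)      # first column: job names
job_sequences = column(job_grid, 1)  # second column: job sequences
```

Reading a column this way mirrors how a grid iterator maps grid columns to variables, one row per iteration.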
An iterator lets us loop over a collection and execute the same set of instructions again and again for different entities. If you come from a programming background, you probably already know loops such as for, while, do-while, and foreach. Matillion has a few types of iterators that serve the same purpose. At the time of writing, Matillion supports file, fixed, loop, table, and grid iterators. These iterators suit different scenarios, data types, and cases, and a data engineer can choose the one that fits the problem. In this article we will look at the grid iterator, with special attention to its concurrency mode.
You can read about the grid iterator in detail in the Matillion documentation. The documentation explains how to use the component, which properties you can customize, and which have defaults, so you can decide whether the grid iterator is right for you. The component also exports some useful information for analysis, including how many iterations were actually generated and how many succeeded. From this information you can tell at which point a job failed, rerun only the failed part, and save time.
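As a rough illustration of how those counts help, the plain-Python sketch below summarizes a set of iteration outcomes and lists the ones to rerun. The `outcomes` dictionary and the `summarize` helper are hypothetical; they do not reproduce the names of the values Matillion exports.

```python
def summarize(outcomes):
    """Return (attempted, successful, failed names) for a dict of name -> success flag."""
    attempted = len(outcomes)
    successful = sum(1 for ok in outcomes.values() if ok)
    failed = [name for name, ok in outcomes.items() if not ok]
    return attempted, successful, failed

# Hypothetical results of five iterations (True means the iteration succeeded).
outcomes = {
    "iteration_1": True,
    "iteration_2": True,
    "iteration_3": False,
    "iteration_4": True,
    "iteration_5": False,
}

attempted, successful, to_rerun = summarize(outcomes)
```

Only the entries in `to_rerun` would need to be executed again.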
Concurrency in Grid Iterator
The grid iterator has a set of properties that can be customized to fit your specific problem. In this blog we will focus on concurrency. This property defines how the generated iterations are executed. As simple as that sounds, there are some challenges to work around. Because each iteration has its own set of variable values, the property can be set according to the requirement.
For this article, we will take one possible application of the grid iterator that relies on this property: running other jobs in a preset order. Instead of running the individual jobs manually one by one, we can have a single orchestration job whose grid iterator runs the other orchestration or transformation jobs. In this sense, the property can also be used for scheduling. This is not the only way to schedule jobs, but it is one application of the grid iterator component. Let’s look at it in more detail.
Let’s say we have five transformation jobs, T_ONE to T_FIVE. The job names are stored in a Snowflake table along with their job sequences. We will call this table “JOB_CONTROL”. The job sequence defines the priority: it determines each job's turn, that is, which job must run first and which jobs can run in any order. The job table may look like the one below:
Five Orchestration jobs for illustration
Main Orchestration job with grid Iterator
We will now look at the two possible scenarios.
Sequential: When some jobs depend on others, we need sequential mode, because each iteration starts only after the previous one has finished processing. This way, a dependent job finds the resources it needs already created and ready to consume. In Matillion, this is the default execution mode for the grid iterator. It is the best choice for a sequence of tasks in which task two uses the output generated by task one, and so on.
In our example, the first case is when every job depends on the previous one: T_TWO depends on T_ONE, T_THREE depends on T_TWO, and so on, which means T_ONE must run before T_TWO, T_TWO before T_THREE, and so forth. For this case we should use sequential mode, which is also the default.
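The sequential case can be sketched in plain Python as a simple ordered loop. This is only an analogy for what the grid iterator does, not Matillion code; the sequence values in `job_control` and the `run_job` stand-in are assumptions, since the actual table contents are not reproduced here.

```python
# Hypothetical JOB_CONTROL rows: (job name, job sequence), deliberately unordered.
job_control = [
    ("T_THREE", 3),
    ("T_ONE", 1),
    ("T_FIVE", 5),
    ("T_TWO", 2),
    ("T_FOUR", 4),
]

def run_job(name, log):
    """Stand-in for running one transformation job; it only records the name."""
    log.append(name)

def run_sequential(rows):
    """Run the jobs one at a time, ordered by job sequence."""
    log = []
    for name, _sequence in sorted(rows, key=lambda row: row[1]):
        run_job(name, log)  # the next job starts only after this call returns
    return log

execution_order = run_sequential(job_control)
```

Because each call returns before the next one starts, T_TWO can safely rely on whatever T_ONE produced.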
Concurrent: In this scenario, the jobs are treated as independent. Each iteration has its own set of variable values and can run independently of the other iterations. This setting works best when none of the jobs requires resources generated by another job, which makes it the right choice for fully independent tasks.
We will take the same jobs, but now the job sequence does not matter, so we can set the property to concurrent.
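For comparison, the concurrent case can be sketched with Python's standard `concurrent.futures` module. Again, this is an analogy rather than Matillion's implementation; `run_job` and the `max_workers` value are assumptions chosen for the example.

```python
from concurrent.futures import ThreadPoolExecutor

job_names = ["T_ONE", "T_TWO", "T_THREE", "T_FOUR", "T_FIVE"]

def run_job(name):
    """Stand-in for one independent job; it uses only its own input value."""
    return f"{name} finished"

def run_concurrent(names, max_workers=5):
    """Submit every job at once and collect the results once all have completed."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(run_job, name) for name in names}
        return {name: future.result() for name, future in futures.items()}

results = run_concurrent(job_names)
```

The jobs may finish in any order, which is acceptable only because none of them depends on another job's output.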
Written By: Prabhjot Kaur