Background:
I was working with a customer who needed to back up files housed in a specific storage account and container to a separate storage account. Geo-redundant storage (GRS) was enabled at the storage account level, providing physical redundancy of the customer’s data across two separate data centers; however, the customer was storing content for a product deployed in Azure and wanted to ensure they could restore a specific customer’s file should it be deleted accidentally through their online solution.
While Azure backup solutions exist that permit file-level restores for VMs and for servers/workstations on-premises and in Azure, general purpose storage accounts in Azure can store virtually any data, so there isn’t a solution available (yet) to wire a storage account up to a backup solution as files are deposited into it. Azure storage accounts can hold arbitrary types of data, are accessible via native APIs for Java, .NET, Ruby, Python, PHP, Node.js and many others, and provide facilities to manage storage accounts and their contents via PowerShell and a cross-platform command-line interface (CLI).
In my customer’s case, they were using a combination of Node.js libraries and the available REST interface to manage files behind the scenes for their product.
Solution:
I decided to use Azure Automation to trigger automated execution of PowerShell jobs to handle keeping the storage accounts in sync. If you have not used it before, Azure Automation provides a way to run tasks written in PowerShell, triggered by webhooks or on a schedule, without needing a server to patch, maintain and monitor. Moreover, it is free to have a job published; you are only charged/metered for the execution time while the script runs on a pooled resource in Azure. Azure Automation also supports running a specialized type of PowerShell script called a workflow, which I’ve opted to leverage because workflows provide a number of benefits when running within an Azure Automation account, specifically:
- Since Azure Automation jobs have a run-time execution limit of 3 hours, at which point they are put to sleep, workflows allow us to save our state and pick up where we left off. PowerShell workflows support this with the Checkpoint-Workflow command; another example of how this works is outlined here.
- PowerShell workflows allow us to invoke operations/tasks in parallel, which is especially helpful when enumerating collections, as shown in my published script and in the minimal skeleton below.
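For context, here is a minimal, self-contained workflow skeleton illustrating both constructs; the workflow name and the item list are placeholders for illustration, not part of the published script:

workflow Sync-Example
{
    # Placeholder work items; the published script builds its collections from storage account listings
    $items = 1..10

    # Persist the workflow state so a suspended job can resume from this point
    Checkpoint-Workflow

    # Process items concurrently, capping the number of simultaneous activities
    ForEach -Parallel -ThrottleLimit 5 ($item in $items)
    {
        "Processing item $item"
    }
}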
The example script pulls a list of files from the destination storage account into a hashtable, then walks through the files in the source storage account in blocks of 400, processing 25 files at a time in parallel. If a file exists in the source but not in the destination, or if the date/time on the source file differs, the file is copied to the destination, as outlined in this process flow:
Walkthrough:
Setup: You will need to populate the variables with the source/destination storage account keys and names as well as the name of the container to use in both the source/destination accounts:
[string]$containerName = 'SourceAndDestinationContainer'
$Srcstorageaccountkey = "SrcKey";
$srcStorageAccount = "SrcStorageAccount";
$deststorageAccount = "DestStorageAccount";
$DestAccountKey = "DestinationAccountKey";
You can also vary the number of files pulled from the source account at a time for comparison; experiment with this value to balance throughput against the overhead and resource constraints of Azure Automation that may slow your job down.
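The later snippets also reference $MaxReturn (the batch size) and $Token (the blob listing continuation token). If you are assembling the workflow yourself, a minimal initialization, with values you can tune, would look like this:

$MaxReturn = 400   # number of source blobs to pull per iteration
$Token = $null     # continuation token; $null starts the listing at the beginning of the container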
PowerShell workflows run in a specialized “workspace” in which all PowerShell objects are serialized to the workflow runtime. This is what gives workflows their pause/continue capability, but it causes problems with specialized objects that we need to invoke operations on, or that do not serialize into a usable form when rehydrated by the runtime. The storage context used by the PowerShell storage commands is one of these objects, and thus needs to reside within an “InlineScript” block, where the commands run in a standard, non-workflow PowerShell session. This introduces some other challenges: we need to reference our workflow variables inside the block and also return output from the script block back to our workflow.
To do this, there are some nuances to deal with: to work with our workflow-level variables inside the block, we prefix them as “$using:variable”. In the block below, you can see I’m referring to things like the string-based container name simply as “$using:containerName”. If I left out the $using:, the InlineScript block would create a new local variable (an empty string in our case) and we would not get the desired results.
Returning data from an InlineScript back to the workflow is another step we need to take. In the example below, we take the hashtable $DestBlobsHashTemp, emit it as the last line of the InlineScript block, and assign the result to our workflow variable $DestBlobsHash.
$DestBlobsHash = InlineScript
{
    $DestBlobsHashTemp = @{}

    # The storage context must be created inside the InlineScript session (it does not survive workflow serialization)
    $destContext = New-AzureStorageContext -StorageAccountName $using:deststorageAccount -StorageAccountKey $using:DestAccountKey

    # Build a hashtable of destination blob names and their last-modified timestamps
    # (the full destination listing is pulled here, so no continuation token handling is needed)
    Get-AzureStorageBlob -Context $destContext -Container $using:containerName |
        Select-Object -Property Name, LastModified |
        ForEach-Object { $DestBlobsHashTemp[$_.Name] = $_.LastModified.UtcDateTime };

    # The last expression is returned to the workflow and assigned to $DestBlobsHash
    $DestBlobsHashTemp;
}
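Loading the destination listing into a hashtable keyed on blob name is a deliberate choice: the comparison later on becomes a single lookup per source blob rather than a rescan of the destination listing for every file, which keeps the job fast even when the container holds a large number of blobs.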
Parallel ForEach loop:
Now we need to reach into the source storage account and pull groups of files based on the variable defined earlier (400 in our case). On each iteration you will also see the Checkpoint-Workflow command discussed earlier, which saves our workflow state so that, in the event this runs for longer than 3 hours, Azure Automation will restart where we left off. Worst case, we would repeat cycling through some of the 400 files if the job was paused in the middle of a loop.
DO {
    "Saving State...";
    Checkpoint-Workflow    # persist our workflow state on each iteration

    "Source Array Population...";
    $SrcBlobs = InlineScript
    {
        # The storage context must be created inside the InlineScript session
        $sourceContext = New-AzureStorageContext -StorageAccountName $Using:srcStorageAccount -StorageAccountKey $using:Srcstorageaccountkey

        # Pull the next batch of source blobs, resuming from the continuation token
        $SrcBlobs = Get-AzureStorageBlob -Context $sourceContext -Container $Using:containerName -MaxCount $using:MaxReturn -ContinuationToken $using:Token |
            Select-Object -Property Name, LastModified, ContinuationToken;

        if ($null -ne $SrcBlobs) {
            $cnt = $SrcBlobs.Count;
            Write-Host " ** Files Found: $cnt - Pulling $using:MaxReturn files at a time...";
        }

        # Return the batch to the workflow
        $SrcBlobs;
    }
Once we have a hashtable to compare against, there is a ForEach block where I specify the actions to be performed in parallel, throttled to a limit of 25 threads. This is another value you can experiment with to determine what works best for your scenario.
# Experiment with the throttle; I found 25 to work well within Azure Automation for invoking copy operations
ForEach -Parallel -ThrottleLimit 25 ($SrcBlob in $SrcBlobs) {
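The body of that parallel loop is not reproduced in the excerpt above. Below is a sketch of what the comparison and copy could look like, reusing the variable names from the earlier snippets and the standard Start-AzureStorageBlobCopy cmdlet; treat it as an illustration of the approach rather than the published script verbatim. It repeats the ForEach line for context and also shows how the surrounding DO loop advances the continuation token and terminates.

ForEach -Parallel -ThrottleLimit 25 ($SrcBlob in $SrcBlobs) {
    InlineScript
    {
        # Pull the loop variable and the destination hashtable into this non-workflow session
        $blob      = $using:SrcBlob
        $destBlobs = $using:DestBlobsHash

        # Copy when the blob is missing from the destination or its last-modified time differs
        if ((-not $destBlobs.ContainsKey($blob.Name)) -or ($destBlobs[$blob.Name] -ne $blob.LastModified.UtcDateTime))
        {
            $srcContext  = New-AzureStorageContext -StorageAccountName $using:srcStorageAccount -StorageAccountKey $using:Srcstorageaccountkey
            $destContext = New-AzureStorageContext -StorageAccountName $using:deststorageAccount -StorageAccountKey $using:DestAccountKey

            Start-AzureStorageBlobCopy -Context $srcContext -SrcContainer $using:containerName -SrcBlob $blob.Name `
                -DestContext $destContext -DestContainer $using:containerName -DestBlob $blob.Name -Force
        }
    }
}

# Advance the continuation token from the last blob in the batch and repeat until the listing is exhausted
$Token = InlineScript
{
    $blobs = $using:SrcBlobs
    $blobs[$blobs.Count - 1].ContinuationToken
}
} While ($Token -ne $null)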
Deploying our Azure Automation Job:
The first step is to create an Azure Automation account if you don’t already have one; instructions for this step are outlined here.
Once you have an automation account in place, you can copy and paste this workflow into a new Runbook:
When you create a new runbook, be sure to select PowerShell Workflow as the runbook type, as shown in step 4:
When the runbook is created, click on edit to open the editor for the job:
Once you paste in the code, you need to ensure the name of the workflow matches the name of the runbook as shown below:
(Notice the RunBook is named BlobSync_New and the workflow is also named BlobSync_New)
That’s it! You can hit ‘Publish’ to make this runbook available to invoke manually or via a schedule!
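As an alternative to pasting the code into the portal editor, you could also import and publish the runbook from PowerShell. Below is a rough sketch using the AzureRM Automation cmdlets; it assumes the AzureRM.Automation module is installed and you are already signed in, and the resource group, automation account and file path are placeholder names for illustration:

# Placeholder names; substitute your own resource group, Automation account and saved workflow file
$resourceGroup     = 'MyResourceGroup'
$automationAccount = 'MyAutomationAccount'

# Import the saved workflow as a PowerShell Workflow runbook named to match the workflow (BlobSync_New)
Import-AzureRmAutomationRunbook -ResourceGroupName $resourceGroup `
    -AutomationAccountName $automationAccount `
    -Name 'BlobSync_New' `
    -Type PowerShellWorkflow `
    -Path '.\BlobSync_New.ps1'

# Publish the imported draft so it can be started manually or attached to a schedule
Publish-AzureRmAutomationRunbook -ResourceGroupName $resourceGroup `
    -AutomationAccountName $automationAccount `
    -Name 'BlobSync_New'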
Resources: