| name | project-architecture |
| description | Detailed architecture, data flow, pipeline execution, dependencies, and system design for the Unify data migration project. Use when you need deep understanding of how components interact. |
Comprehensive architecture documentation for the Unify data migration project.
## Medallion Architecture

### Bronze Layer
- Purpose: Raw data ingestion from parquet files
- Location: python_files/pipeline_operations/bronze_layer_deployment.py
- Process: loads raw source data into the bronze_cms, bronze_fvms, and bronze_nicherms databases

### Silver Layer
- Purpose: Validated, standardized data organized by source
- Location: python_files/silver/ (cms, fvms, nicherms subdirectories)
- Process: transformation scripts under python_files/silver/

### Gold Layer
- Purpose: Business-ready, aggregated analytical datasets
- Location: python_files/gold/
- Process: aggregation scripts under python_files/gold/ write to the gold_data_model database

## Azure Infrastructure
- Storage containers: bronze-layer, code-layer, legacy_ingestion
- Authentication: managed identity (AZURE_MANAGED_IDENTITY_CLIENT_ID)
- Path format: abfss://container@account.dfs.core.windows.net/path
- Key Vault: AuE-DataMig-Dev-KV for secret management
- Synapse workspace: auedatamigdevsynwsdm8c64gb
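ABFSS paths are composed from a container name and the storage account. The sketch below is illustrative only, assuming the `auedatamigdevlake` account that appears in the environment-detection snippet further down; the helper name and example relative path are not part of the project.

```python
# Illustrative helper, not part of the project's codebase.
STORAGE_ACCOUNT = "auedatamigdevlake"  # taken from the Synapse DATA_PATH_STRING below

def abfss_path(container: str, relative_path: str = "") -> str:
    """Build an abfss://container@account.dfs.core.windows.net/path URI."""
    base = f"abfss://{container}@{STORAGE_ACCOUNT}.dfs.core.windows.net"
    return f"{base}/{relative_path}" if relative_path else base

# Hypothetical example; the actual folder layout is defined by the pipeline scripts.
print(abfss_path("bronze-layer", "cms/example_table"))
```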
if "/home/trusted-service-user" == env_vars["HOME"]:
# Azure Synapse Analytics production environment
import notebookutils.mssparkutils as mssparkutils
spark = SparkOptimiser.get_optimised_spark_session()
DATA_PATH_STRING = "abfss://code-layer@auedatamigdevlake.dfs.core.windows.net"
else:
# Local development environment using Docker Spark container
from python_files.utilities.local_spark_connection import sparkConnector
config = UtilityFunctions.get_settings_from_yaml("configuration.yaml")
connector = sparkConnector(...)
DATA_PATH_STRING = config["DATA_PATH_STRING"]
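Once the Spark session and DATA_PATH_STRING are resolved, downstream scripts can read data the same way in both environments. A hedged usage sketch; the relative path and table name below are hypothetical:

```python
# Hypothetical read using the detected spark session and DATA_PATH_STRING;
# the subpath "example_source/example_table.parquet" is illustrative only.
source_df = spark.read.parquet(f"{DATA_PATH_STRING}/example_source/example_table.parquet")
source_df.printSchema()
```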
## Table Utilities

Shared helpers in the TableUtilities class:
- add_row_hash(): Change detection
- save_as_table(): Standard table save with timestamp conversion
- clean_date_time_columns(): Intelligent timestamp parsing
- drop_duplicates_simple/advanced(): Deduplication strategies
- filter_and_drop_column(): Remove duplicate flags
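For orientation, here is a minimal sketch of what an add_row_hash()-style change-detection helper typically looks like in PySpark. The signature, column handling, and hash choice are assumptions, not the project's actual implementation.

```python
# Illustrative sketch only; the real add_row_hash() lives in TableUtilities.
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

def add_row_hash(df: DataFrame, hash_col: str = "row_hash") -> DataFrame:
    """Add a deterministic hash over all columns for change detection."""
    cols = [F.coalesce(F.col(c).cast("string"), F.lit("")) for c in df.columns]
    return df.withColumn(hash_col, F.sha2(F.concat_ws("||", *cols), 256))
```

Comparing this hash between loads makes it cheap to detect changed rows without comparing every column individually.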
## Configuration

Central YAML configuration includes:
- In-scope tables (*_IN_SCOPE variables)
- Data paths: local (/workspaces/data) vs Azure (abfss://)

## Error Handling

Use the @synapse_error_print_handler decorator for consistent error handling.

## Data Filtering

TableUtilities.save_as_table() automatically filters to the last N years when a date_created column exists, controlled by the NUMBER_OF_YEARS global variable in session_optimiser.py. This prevents full dataset processing in local development.
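A minimal sketch of the kind of year-based filter described above, assuming a PySpark DataFrame; the real logic lives in TableUtilities.save_as_table() and session_optimiser.py and may differ in detail.

```python
# Illustrative only; NUMBER_OF_YEARS is actually defined in session_optimiser.py.
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

NUMBER_OF_YEARS = 3  # assumed default for the sketch

def filter_recent_rows(df: DataFrame) -> DataFrame:
    """Keep only rows whose date_created falls within the last N years."""
    if "date_created" not in df.columns:
        return df
    cutoff = F.add_months(F.current_date(), -12 * NUMBER_OF_YEARS)
    return df.filter(F.col("date_created") >= cutoff)
```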
## Testing

- python_files/testing/: Unit and integration tests
- medallion_testing.py: Full pipeline validation
- bronze_layer_validation.py: Bronze layer tests
- ingestion_layer_validation.py: Ingestion tests
## Local DuckDB Warehouse

After running pipelines, build a local DuckDB database for fast SQL analysis:
- Database location: /workspaces/data/warehouse.duckdb
- Build command: make build_duckdb

## Other Key Components

- Synapse environment: unify_2_1_dm_synapse_env_d10
- Pipeline execution: file_executor.py, file_finder.py
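A hedged sketch of querying the warehouse once it has been built with `make build_duckdb`; the commented-out table name is hypothetical.

```python
# Requires the duckdb Python package.
import duckdb

con = duckdb.connect("/workspaces/data/warehouse.duckdb", read_only=True)
print(con.sql("SHOW TABLES").fetchall())
# print(con.sql("SELECT COUNT(*) FROM some_gold_table").fetchone())  # hypothetical table
con.close()
```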