Revolutionising Energy Planning: How Synthetic Time Series Generative AI Powers Forward-Thinking Development
- Dr. Daniel Dimanov

- May 1, 2024
- 5 min read
In the rapidly evolving landscape of urban development and energy management, the need for robust, forward-thinking planning tools has never been more critical. Traditional energy forecasting methods, heavily reliant on historical consumption data, are increasingly challenged by the dynamic nature of urban growth and technological innovation. These methods often struggle to adapt to new scenarios or account for unprecedented changes, leaving planners and policymakers grappling with incomplete or outdated information.
As new areas develop and existing networks undergo significant changes, the ability to anticipate and effectively manage energy demand becomes a cornerstone of sustainable development. Current state-of-the-art forecasting techniques, while helpful, typically extend past data trends into the future, assuming that what has been will continue to be.
This approach can miss the mark in today’s fast-changing world, where demographic shifts, economic fluctuations, and new technologies frequently disrupt old patterns.
Enter the innovative realm of generative artificial intelligence (Gen AI), particularly time series generative adversarial networks (GANs). These advanced models offer more than just predictions; they serve as tools for generating new, synthetic time series data from a variety of inputs. Whilst the example provided utilises random noise and demographic insights from census data to generate half-hourly load measurements of low-voltage feeders, it is important to note that this is just one potential application. In reality, the generation of synthetic data would typically incorporate a broader range of data types, not just demographic information. Although demographic data are correlated with energy usage patterns, they represent only part of the complex factors influencing energy consumption. This enhanced capability not only deepens our understanding of potential future scenarios but also significantly improves data-driven decision-making across multiple sectors:
Proactive Energy Planning
By generating realistic energy consumption profiles for new developments, planners can design more effective infrastructure and adjust to planned network changes with greater confidence.
Scenario Testing
In the spirit of digital twins—as highlighted at Nvidia's 2024 GTC—synthetic data generation allows for extensive scenario testing. Planners can simulate various 'what-if' conditions to evaluate potential impacts on energy systems before they occur, reducing risks and improving system resilience.
Data Imputation
Synthetic time series data can fill gaps in incomplete datasets, ensuring that analyses and forecasts are based on comprehensive information sets.
Research and Development
Because synthetic datasets created by time series GANs retain the statistical integrity of real data without exposing it, they boost accessibility for researchers, paving the way for innovations in energy management.
Extension of Limited Data Samples
When actual data is scarce, synthetic time series generated by GANs can be used as a foundation, subsequently extended using state-of-the-art forecasting methods.
Expanding Training Datasets through Synthetic Data
By generating synthetic data, GANs enable us to vastly increase the amount of data available for training machine learning algorithms. This expansion is crucial for improving the accuracy and robustness of predictive models, especially in scenarios where real data is scarce or costly to obtain.
Enhancing Data Fairness by Addressing Bias
GANs can also play a significant role in mitigating data bias by effectively upsampling under-represented groups in datasets. This approach helps create a more balanced data environment, which is essential for reducing biases in machine learning models and ensuring fairer outcomes in algorithmic decision-making.
As we delve deeper into the capabilities and applications of time series GANs, it becomes clear that these tools are not just another step in the evolution of forecasting—they are redefining the very approach to energy management and planning in an increasingly uncertain world.
Unpacking Time Series Generative Adversarial Networks
The Core of GANs
Generative Adversarial Networks (GANs) represent a powerful class of AI models designed to generate new data that mimics real data. At its core, a GAN consists of two main components engaged in a continuous game: the generator, which creates data, and the discriminator, which evaluates it. This setup enables the model to improve iteratively, with the generator aiming to produce increasingly realistic data, and the discriminator becoming better at distinguishing the fake from the real.
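To make this adversarial game concrete, below is a minimal, self-contained sketch in PyTorch. The toy data and network sizes are purely illustrative assumptions and bear no relation to the energy models discussed later; the point is simply the alternating generator/discriminator updates.

import torch
import torch.nn as nn

# Toy "real" data: 16-dimensional samples drawn around 0.5.
real_batch = lambda n: 0.5 + 0.1 * torch.randn(n, 16)

noise_dim, data_dim = 8, 16
G = nn.Sequential(nn.Linear(noise_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 32), nn.ReLU(), nn.Linear(32, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(200):
    # Discriminator update: real samples labelled 1, generated samples labelled 0.
    real = real_batch(64)
    fake = G(torch.randn(64, noise_dim)).detach()
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator update: try to make the discriminator label the fakes as real.
    fake = G(torch.randn(64, noise_dim))
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()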
Adapting GANs for Time Series Data
GANs are most commonly deployed in computer vision, where they are designed to generate and discriminate among images. Adapting GANs to generate time series data introduces specific challenges and considerations. Unlike static images or isolated data points, time series data involves sequences that are temporally correlated. Handling these dynamics requires modifications to the traditional GAN architecture:
Temporal Consistency
The generator must not only produce statistically realistic data points but also ensure these points follow logical temporal progressions. This often involves integrating recurrent neural networks (RNNs) or Long Short-Term Memory (LSTM) networks into the generator.
Feature Integration
Time series GANs must effectively incorporate both time-dependent features (such as past energy consumption levels) and static attributes (like demographic data). This dual-input model ensures that the generated sequences respect both the temporal dynamics and the context provided by static attributes.
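As an illustration of both points, the hedged sketch below shows one way a generator could combine static attributes with per-step noise inside an LSTM. The architecture, dimensions, and output activation are assumptions for illustration only, not the model used in our experiments.

import torch
import torch.nn as nn

class ConditionalSeriesGenerator(nn.Module):
    """Sketch: generate a sequence conditioned on a vector of static attributes.

    At every time step the LSTM sees the (fixed) attribute vector concatenated
    with a fresh noise vector, so the output respects both the temporal
    dynamics learnt by the LSTM and the static context.
    """
    def __init__(self, attr_dim, noise_dim, hidden_dim=64):
        super().__init__()
        self.noise_dim = noise_dim
        self.lstm = nn.LSTM(attr_dim + noise_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)   # one consumption value per step

    def forward(self, attrs, seq_len):
        batch = attrs.size(0)
        # Repeat the static attributes across all time steps and add per-step noise.
        attrs_rep = attrs.unsqueeze(1).expand(batch, seq_len, attrs.size(1))
        noise = torch.randn(batch, seq_len, self.noise_dim)
        out, _ = self.lstm(torch.cat([attrs_rep, noise], dim=-1))
        return torch.sigmoid(self.head(out)).squeeze(-1)   # values in [0, 1]

# Example: 128 static attributes -> 48 half-hourly values for one day.
gen = ConditionalSeriesGenerator(attr_dim=128, noise_dim=8)
series = gen(torch.rand(4, 128), seq_len=48)   # shape (4, 48)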
Evaluation Metrics
To train a time-series GAN effectively, the right evaluation metrics need to be chosen. These metrics should address fidelity and diversity, use an appropriate discriminator loss, and include an adapted Inception Score to capture the clarity of the generated samples. Here, we follow the methodology described in https://arxiv.org/pdf/1909.13403 and use the following metrics:
Mean Absolute Error (MAE)
This metric measures the average magnitude of the errors in a set of predictions, without considering their direction. It's particularly useful for quantifying the error in predictions of continuous variables.
Two-Sample Tests
These are applied to assess whether the distributions of the real and generated data are statistically different. Common tests include the Kolmogorov-Smirnov test, which compares the cumulative distributions, and the MMD (Maximum Mean Discrepancy), which effectively measures the distance between the sample means in a high-dimensional space.
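For concreteness, here is one way the Kolmogorov-Smirnov test and an RBF-kernel MMD could be computed with SciPy and NumPy. Flattening the profiles for the KS test and the kernel bandwidth are illustrative choices, not necessarily those of the cited methodology.

import numpy as np
from scipy.stats import ks_2samp

def rbf_mmd(x, y, gamma=1.0):
    """Biased estimate of squared MMD with an RBF kernel (adequate for comparisons)."""
    def k(a, b):
        d2 = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
        return np.exp(-gamma * d2)
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

real = np.random.rand(200, 48)    # stand-in: 200 real daily profiles
synth = np.random.rand(200, 48)   # stand-in: 200 synthetic daily profiles

ks_stat, p_value = ks_2samp(real.ravel(), synth.ravel())  # KS on pooled values
mmd2 = rbf_mmd(real, synth)                               # MMD on whole daily profiles
print(ks_stat, p_value, mmd2)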
Precision and Recall
In the context of GANs, precision measures how many of the generated samples are realistic, while recall assesses how well the generated samples capture the characteristics of the real data. These metrics can be adapted to time-series data by considering each time-series sequence as a sample.
Diversity Score
This assesses the variety within the generated samples to ensure they are not merely replicating a subset of the training data. This can be quantified using entropy measures or more sophisticated clustering and variability assessments.
Inception Score (IS)
Originally designed for images, the Inception Score can be adapted for time-series data to measure both the diversity of the data generated by the model and how well the generated data are classified compared to real data when using a pre-trained classifier.
Fréchet Inception Distance (FID)
Measures the distance between feature vectors calculated for real and generated samples. For time-series, features can be extracted using a model pre-trained on a related task or the same model architecture trained on real data.
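Below is a hedged sketch of the Fréchet distance calculation, assuming feature vectors have already been extracted by some encoder; the random arrays merely stand in for those features.

import numpy as np
from scipy import linalg

def frechet_distance(feat_real, feat_synth):
    """Fréchet distance between Gaussian fits of two sets of feature vectors (as in FID)."""
    mu_r, mu_s = feat_real.mean(0), feat_synth.mean(0)
    cov_r = np.cov(feat_real, rowvar=False)
    cov_s = np.cov(feat_synth, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_s, disp=False)
    covmean = covmean.real   # discard tiny imaginary parts from numerical error
    return np.sum((mu_r - mu_s) ** 2) + np.trace(cov_r + cov_s - 2 * covmean)

# Stand-in features, e.g. hidden states of an encoder trained on real load data.
fid = frechet_distance(np.random.rand(500, 32), np.random.rand(500, 32))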
Enhancing Realism and Reliability
Time series GANs incorporate several techniques to enhance the realism and reliability of the generated data. The one we are most interested in is conditional generation. By conditioning the generation process on given attributes or past data, time series GANs can tailor the synthetic sequences to specific scenarios or conditions. This, however, complicates training, so advanced loss functions are needed, including ones that measure temporal dependencies or multi-dimensional accuracy; these help fine-tune the generator and discriminator and enhance the overall quality of the synthetic data.
Applications in Energy Management
In the context of energy management, time series GANs can transform how utilities and planners forecast demand, plan infrastructure, and test scenarios. These tools enable a proactive approach by allowing stakeholders to visualise the impacts of various hypothetical conditions and make informed decisions based on comprehensive, realistic data simulations.
As we explore the capabilities of time series GANs, it becomes evident why their application is particularly revolutionary in fields like energy management. However, one model, DoppelGANger, stands out by pushing the boundaries further, optimising the generation process to handle an even broader array of inputs and complexities. In the next section, we will delve into how DoppelGANger specifically adapts and advances these principles to provide unprecedented tools for scenario planning and data synthesis.
DoppelGANger: A Closer Look at Its Innovative Architecture
Architectural Overview
DoppelGANger is structured around two main components: the Attribute Generator and the Time Series Generator, each paired with its respective discriminator. The Attribute Generator is designed to produce categorical and continuous attributes, while the Time Series Generator focuses on generating corresponding time series data based on the attributes produced.
Attribute Generator
The Attribute Generator uses a feedforward neural network to output a set of attributes. This network takes a random noise vector as input and transforms it through multiple layers to generate a vector of attributes. These attributes are then passed to the Time Series Generator as conditional inputs, influencing the generation of time series data.
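A minimal sketch of such a feedforward attribute generator follows; the layer sizes and output activation are illustrative assumptions rather than the reference implementation.

import torch
import torch.nn as nn

class AttributeGenerator(nn.Module):
    """Map a noise vector to a vector of normalised attributes in [0, 1]."""
    def __init__(self, noise_dim=8, attr_dim=128, hidden=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, attr_dim), nn.Sigmoid(),   # attributes scaled to [0, 1]
        )

    def forward(self, z_attr):
        return self.net(z_attr)

attrs = AttributeGenerator()(torch.randn(4, 8))   # 4 synthetic attribute vectors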
Time Series Generator
The Time Series Generator in DoppelGANger is more complex due to the sequential nature of its output. It starts with a combination of random noise and the output from the Attribute Generator. This combined input is processed through a series of recurrent neural network layers (specifically, LSTM units), which are well-suited for handling sequential data due to their ability to maintain information across time steps. The LSTM layers are designed to progressively refine the generated time series, ensuring that it aligns with the conditioned attributes and mimics the statistical properties of real-world data.
Discriminators
Each generator (Attribute and Time Series) has a corresponding discriminator (feedforward for attributes and recurrent neural network for time series). These discriminators are also neural networks trained to distinguish between real data (from the training dataset) and synthetic data (from the generators). The Attribute Discriminator evaluates the realism of the generated attributes, while the Time Series Discriminator assesses the authenticity of the generated time series data.
Training Process
The training of DoppelGANger follows the standard GAN training regime but with additional complexity due to its dual nature. The generators and discriminators are trained simultaneously in an adversarial manner: the generators try to produce data that is indistinguishable from real data, fooling the discriminators, while the discriminators strive to become better at telling fake data from real data. Training therefore alternates between updating the discriminators on real and generated data, and updating the generators based on the discriminators' feedback. What differentiates DoppelGANger from a standard GAN is its combination of loss functions: an adversarial loss (how well the discriminator is fooled by the synthetic data), a reconstruction loss (how well the generator can reconstruct the original sequence) and, most importantly, an attribute consistency loss, which ensures that the generated time series data are consistent with the input attributes.
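To make that combination concrete, below is a schematic sketch (not the reference DoppelGANger code) of how a generator loss could combine the adversarial, reconstruction, and attribute-consistency terms described above; the helper names, inputs, and weights are illustrative assumptions.

import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()
mse = nn.MSELoss()

def generator_loss(d_ts_fake_logits, x_fake, x_real, a_recovered, a_input,
                   w_adv=1.0, w_rec=1.0, w_attr=1.0):
    """Schematic combined generator loss (terms and weights are illustrative).

    d_ts_fake_logits : discriminator scores for the synthetic sequences
    x_fake, x_real   : synthetic and paired real sequences (reconstruction-style term)
    a_recovered      : attributes estimated back from the synthetic sequence
    a_input          : attributes the generation was conditioned on
    """
    adv = bce(d_ts_fake_logits, torch.ones_like(d_ts_fake_logits))  # fool the discriminator
    rec = mse(x_fake, x_real)                                       # stay close to the real sequence
    attr = mse(a_recovered, a_input)                                # respect the conditioning attributes
    return w_adv * adv + w_rec * rec + w_attr * attr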
Mathematical Problem Definition
Let's explore the process in mathematical terms. Let $G_{\mathrm{attr}}$ and $G_{\mathrm{ts}}$ represent the Attribute Generator and Time Series Generator, respectively, and let $D_{\mathrm{attr}}$ and $D_{\mathrm{ts}}$ represent the corresponding discriminators. The input to $G_{\mathrm{attr}}$ is a noise vector $z_{\mathrm{attr}}$, and the output is a set of attributes $a$. The input to $G_{\mathrm{ts}}$ includes a noise vector $z_{\mathrm{ts}}$ and the generated attributes $a$, and it outputs a time series $x$.
The objective function for the GAN setup in DoppelGANger can be represented as follows:
$$\min_{G_{\mathrm{attr}},\,G_{\mathrm{ts}}}\;\max_{D_{\mathrm{attr}},\,D_{\mathrm{ts}}} V(D_{\mathrm{attr}}, D_{\mathrm{ts}}, G_{\mathrm{attr}}, G_{\mathrm{ts}}) = \mathbb{E}_{a,x\sim p_{\mathrm{data}}}\big[\log D_{\mathrm{attr}}(a) + \log D_{\mathrm{ts}}(x)\big] + \mathbb{E}_{z_{\mathrm{attr}},z_{\mathrm{ts}}\sim p_z}\big[\log\big(1 - D_{\mathrm{attr}}(G_{\mathrm{attr}}(z_{\mathrm{attr}}))\big) + \log\big(1 - D_{\mathrm{ts}}(G_{\mathrm{ts}}(z_{\mathrm{ts}}, G_{\mathrm{attr}}(z_{\mathrm{attr}})))\big)\big]$$
Importantly, the term $G_{\mathrm{ts}}(z_{\mathrm{ts}}, a)$ indicates that the generation of the time series $x$ by $G_{\mathrm{ts}}$ is conditioned on the attributes $a$ generated by $G_{\mathrm{attr}}$. This conditioning ensures that the time series data correspond appropriately to the given attributes.
Key Hyperparameters and Their Influence
Learning Rate: Controls how quickly the model updates its weights in response to the error it predicts. A lower learning rate may lead to more stable convergence but slower training.
Batch Size: Affects the gradient estimation each step of training, with larger batch sizes generally providing more stable but slower updates.
Number of Epochs: Determines how many times the training dataset is passed through the network. More epochs can lead to better learning but increase the risk of overfitting.
Noise Dimensions $z_{\mathrm{attr}}$ and $z_{\mathrm{ts}}$: The size of the noise vectors directly influences the diversity of the generated attributes and time series. Larger dimensions can capture more complex patterns but require more model capacity to manage effectively.
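For reference, such settings are often gathered into a single configuration object; the values below are illustrative defaults only, not the settings used in our experiments.

# Illustrative hyperparameter set for a DoppelGANger-style model (values are examples only).
config = {
    "learning_rate": 1e-3,      # lower values: more stable but slower convergence
    "batch_size": 64,           # larger batches: smoother but slower gradient updates
    "epochs": 400,              # more passes over the data, at the risk of overfitting
    "attr_noise_dim": 8,        # size of z_attr
    "ts_noise_dim": 8,          # size of z_ts
    "max_sequence_len": 1488,   # 48 half-hours x 31 days
}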
Next, we present a feasibility study using an example use case to show the potential of such approaches. In practice, such systems would incorporate far more data, refinement, and data sources. For our trial we do not use features such as weekdays/weekends, temperature data, sunlight data, or specific property data (age/efficiency, size, occupancy, gas heating, Economy 7 tariffs), among many others. We simply want to try out the approach with a simple experiment to evaluate its potential.
Use Case: Generating the Energy Profile of an LV Feeder Based on Demographics
In order to test the utility of generating realistic energy profiles based on demographics, we used several different data sources and merged them together. We will first go through the data sources we used, then discuss how we handled, processed, combined, and fed the data into DoppelGANger, and finally present our results from the small-scale experiments we conducted.
Data Sources
First, we acquired two different datasets from UK Power Networks. The first details usage and other information about secondary substations in the UK for January 2024, while the second contains smart-meter energy readings at the LV feeder level.
The Secondary Substation data contains information about the functional location of the substations, voltage, number of transformers, customer count, the postcode of the substation, and many more useful features.
The LV Feeder Smart Meter dataset has half-hourly readings from 1,988 unique LV feeders associated with different secondary substations. While the data details both the import and export of energy, for the purpose of this experiment we only use the total consumption and the secondary substation ID to associate the smart meter readings with the substations.
We can then use the attributes from the substation data to predict the time series features (half-hourly consumption for the whole of January). The problem is that the attributes in the substation dataset are not especially informative, so instead we cross-referenced the postcodes of the secondary substations with the corresponding Census records obtained from the Office for National Statistics (ons.gov.uk) website. Using this data, we built a comprehensive postcode-to-demographics lookup for the postcodes covered in the dataset. This new, attribute-rich dataset, containing key attributes about the secondary substations together with the relevant demographic attributes, constitutes our 128 total attributes.
Data Processing
To make the dataset compatible with DoppelGANger, several preprocessing steps need to be applied. First, after the required postcode attributes are collected, they need to be aggregated and added to the secondary substation records, and, because of DoppelGANger's requirements, they all need to be represented as a number between 0 and 1. Here is how we normalise the Census data features, for those of you looking to reproduce our experiments:
Attribute 0 to 1 normalisation
# `df` is assumed to hold the Census summary for one postcode: the category
# label is in the 'Category' column, the local figure in the third column
# (df.columns[2]) and the UK-wide figure in the fourth (df.columns[3]).
data = {
# Population
'Population': df[df['Category'] == 'Population'][df.columns[2]].iloc[0] / 100,
'Number of households': float(df[df['Category'] == 'Total households'][df.columns[2]].iloc[0]) / float(df[df['Category'] == 'Total households'][df.columns[3]].iloc[0]) * 1000,
'People 0 to 14': df[df['Category'].isin(['Aged 0 to 4', 'Aged 5 to 9', 'Aged 10 to 14'])][df.columns[2]].sum() / 100,
'People 15 to 19': df[df['Category'] == 'Aged 15 to 19'][df.columns[2]].sum() / 100,
'People 20 to 24': df[df['Category'] == 'Aged 20 to 24'][df.columns[2]].sum() / 100,
'People 25 to 29': df[df['Category'] == 'Aged 25 to 29'][df.columns[2]].sum() / 100,
'People 30 to 55': df[df['Category'].isin(['Aged 30 to 34', 'Aged 35 to 39', 'Aged 40 to 44', 'Aged 45 to 49', 'Aged 50 to 54'])][df.columns[2]].sum() / 100,
'People 55 to 70': df[df['Category'].isin(['Aged 55 to 59', 'Aged 60 to 64', 'Aged 65 to 69'])][df.columns[2]].sum() / 100,
'People 70+': df[df['Category'].isin(['Aged 70 to 74', 'Aged 75 to 79', 'Aged 80 to 84', 'Aged 85 and over'])][df.columns[2]].sum() / 100,
'Male%': df[df['Category'] == 'Male'][df.columns[2]].iloc[0] / 100,
'Single %': df[df['Category'] == 'Never married and never registered a civil partnership'][df.columns[2]].iloc[0] / 100,
'Born in UK %': df[df['Category'] == 'Born in the UK'][df.columns[2]].iloc[0] / 100,
'Household size 1-person%': df[df['Category'] == '1 person in household'][df.columns[2]].iloc[0] / 100,
'Household size 2-people %': df[df['Category'] == '2 people in household'][df.columns[2]].iloc[0] / 100,
'Household size 3-people %': df[df['Category'] == '3 people in household'][df.columns[2]].iloc[0] / 100,
'Household size 4+people%': df[df['Category'] == '4 or more people in household'][df.columns[2]].iloc[0] / 100,
'Household is not deprived in any dimension%': df[df['Category'] == 'Household is not deprived in any dimension'][df.columns[2]].iloc[0] / 100,
'Household is deprived in one dimension%': df[df['Category'] == 'Household is deprived in one dimension'][df.columns[2]].iloc[0] / 100,
'Household is deprived in two dimensions%': df[df['Category'] == 'Household is deprived in two dimensions'][df.columns[2]].iloc[0] / 100,
'Household is deprived in three dimensions%': df[df['Category'] == 'Household is deprived in three dimensions'][df.columns[2]].iloc[0] / 100,
'Household is deprived in four dimensions%': df[df['Category'] == 'Household is deprived in four dimensions'][df.columns[2]].iloc[0] / 100,
'Ethnic Asian%': df[df['Category'] == 'Asian, Asian British or Asian Welsh'][df.columns[2]].iloc[0] / 100,
'Ethnic Black%': df[df['Category'] == 'Black, Black British, Black Welsh, Caribbean or African'][df.columns[2]].iloc[0] / 100,
'Ethnic Mixed%': df[df['Category'] == 'Mixed or Multiple ethnic groups'][df.columns[2]].iloc[0] / 100,
'Ethnic White%': df[df['Category'] == 'White'][df.columns[2]].iloc[0] / 100,
'Ethnic Other%': df[df['Category'] == 'Other ethnic group'][df.columns[2]].iloc[0] / 100,
'No Religion%': df[df['Category'] == 'No religion'][df.columns[2]].iloc[0] / 100,
'Christian %': df[df['Category'] == 'Christian'][df.columns[2]].iloc[0] / 100,
'Buddhist %': df[df['Category'] == 'Buddhist'][df.columns[2]].iloc[0] / 100,
'Hindu%': df[df['Category'] == 'Hindu'][df.columns[2]].iloc[0] / 100,
'Jewish%': df[df['Category'] == 'Jewish'][df.columns[2]].iloc[0] / 100,
'Muslim%': df[df['Category'] == 'Muslim'][df.columns[2]].iloc[0] / 100,
'Sikh %': df[df['Category'] == 'Sikh'][df.columns[2]].iloc[0] / 100,
'Other%': df[df['Category'] == 'Other religion'][df.columns[2]].iloc[0] / 100,
'Not answered%': df[df['Category'] == 'Not answered'][df.columns[2]].iloc[0] / 100,
'Very good health %': df[df['Category'] == 'Very good health'][df.columns[2]].iloc[0] / 100,
'Good health%': df[df['Category'] == 'Good health'][df.columns[2]].iloc[0] / 100,
'Fair health%': df[df['Category'] == 'Fair health'][df.columns[2]].iloc[0] / 100,
'Bad health%': df[df['Category'] == 'Bad health'][df.columns[2]].iloc[0] / 100,
'Very bad health%': df[df['Category'] == 'Very bad health'][df.columns[2]].iloc[0] / 100,
'Disabled under the Equality Act %': df[df['Category'] == 'Disabled under the Equality Act'][df.columns[2]].iloc[0] / 100,
'Not disabled under the Equality Act %': df[df['Category'] == 'Not disabled under the Equality Act'][df.columns[2]].iloc[0] / 100,
'Provides no unpaid care%': df[df['Category'] == 'Provides no unpaid care'][df.columns[2]].iloc[0] / 100,
'Provides 19 hours or less unpaid care a week%': df[df['Category'] == 'Provides 19 hours or less unpaid care a week'][df.columns[2]].iloc[0] / 100,
'Provides 20 to 49 hours unpaid care a week%': df[df['Category'] == 'Provides 20 to 49 hours unpaid care a week'][df.columns[2]].iloc[0] / 100,
'Provides 50 or more hours unpaid care a week%': df[df['Category'] == 'Provides 50 or more hours unpaid care a week'][df.columns[2]].iloc[0] / 100,
'Can speak English well%': df[df['Category'].isin(['Can speak English very well', 'Can speak English well'])][df.columns[2]].sum() / 100,
'Cannot speak English well%': df[df['Category'].isin(['Cannot speak English well', 'Cannot speak English'])][df.columns[2]].sum() / 100,
'Whole house or bungalow%': df[df['Category'] == 'Whole house or bungalow'][df.columns[2]].iloc[0] / 100,
'Flat, maisonette or apartment%': df[df['Category'] == 'Flat, maisonette or apartment'][df.columns[2]].iloc[0] / 100,
'A caravan or other mobile or temporary structure%': df[df['Category'] == 'A caravan or other mobile or temporary structure'][df.columns[2]].iloc[0] / 100,
'No cars or vans in household%': df[df['Category'] == 'No cars or vans in household'][df.columns[2]].iloc[0] / 100,
'1 car or van in household%': df[df['Category'] == '1 car or van in household'][df.columns[2]].iloc[0] / 100,
'2 cars or vans in household%': df[df['Category'] == '2 cars or vans in household'][df.columns[2]].iloc[0] / 100,
'3 or more cars or vans in household%': df[df['Category'] == '3 or more cars or vans in household'][df.columns[2]].iloc[0] / 100,
'Does have central heating%': df[df['Category'] == 'Does have central heating'][df.columns[2]].iloc[0] / 100,
'1 bedroom%': df[df['Category'] == '1 bedroom'][df.columns[2]].iloc[0] / 100,
'2 bedrooms%': df[df['Category'] == '2 bedrooms'][df.columns[2]].iloc[0] / 100,
'3 bedrooms%': df[df['Category'] == '3 bedrooms'][df.columns[2]].iloc[0] / 100,
'4+ bedrooms%': df[df['Category'] == '4 or more bedrooms'][df.columns[2]].iloc[0] / 100,
'Occupancy rating +2 %': df[df['Category'] == '+2 or more'][df.columns[2]].iloc[0] / 100,
'Occupancy rating +1 %': df[df['Category'] == '+1'][df.columns[2]].iloc[0] / 100,
'Occupancy rating 0 %': df[df['Category'] == '0'][df.columns[2]].iloc[0] / 100,
'Occupancy rating -1 %': df[df['Category'] == '-1'][df.columns[2]].iloc[0] / 100,
'Occupancy rating -2 %': df[df['Category'] == '-2 or less'][df.columns[2]].iloc[0] / 100,
'Owns outright%': df[df['Category'] == 'Owns outright'][df.columns[2]].iloc[0] / 100,
'Owns with a mortgage or loan or shared ownership%': df[df['Category'] == 'Owns with a mortgage or loan or shared ownership'][df.columns[2]].iloc[0] / 100,
'Social rented%': df[df['Category'] == 'Social rented'][df.columns[2]].iloc[0] / 100,
'Private rented or lives rent free%': df[df['Category'] == 'Private rented or lives rent free'][df.columns[2]].iloc[0] / 100,
'No second address%': df[df['Category'] == 'No second address'][df.columns[2]].iloc[0] / 100,
'Second address is in the UK%': df[df['Category'] == 'Second address is in the UK'][df.columns[2]].iloc[0] / 100,
'Second address is outside UK %': df[df['Category'] == 'Second address is outside the UK'][df.columns[2]].iloc[0] / 100,
'Less than 10km from work %': df[df['Category'] == 'Less than 10km'][df.columns[2]].iloc[0] / 100,
'10km to less than 30km %': df[df['Category'] == '10km to less than 30km'][df.columns[2]].iloc[0] / 100,
'30km and over%': df[df['Category'] == '30km and over'][df.columns[2]].iloc[0] / 100,
'Distance travelled mainly from home%': df[df['Category'] == 'Works mainly from home'][df.columns[2]].iloc[0] / 100,
'Other working distance %': df[df['Category'] == 'Other'][df.columns[2]].iloc[0] / 100,
'Method of travel - Work mainly at or from home %': df[df['Category'] == 'Work mainly at or from home'][df.columns[2]].iloc[0] / 100,
'Underground, metro, light rail, tram %': df[df['Category'] == 'Underground, metro, light rail, tram'][df.columns[2]].iloc[0] / 100,
'Train%': df[df['Category'] == 'Train'][df.columns[2]].iloc[0] / 100,
'Bus%': df[df['Category'] == 'Bus, minibus or coach'][df.columns[2]].iloc[0] / 100,
'Taxi%': df[df['Category'] == 'Taxi'][df.columns[2]].iloc[0] / 100,
'Motorcycle%': df[df['Category'] == 'Motorcycle, scooter or moped'][df.columns[2]].iloc[0] / 100,
'Driving a car or van%': df[df['Category'] == 'Driving a car or van'][df.columns[2]].iloc[0] / 100,
'Passenger in a car or van%': df[df['Category'] == 'Passenger in a car or van'][df.columns[2]].iloc[0] / 100,
'Bicycle%': df[df['Category'] == 'Bicycle'][df.columns[2]].iloc[0] / 100,
'On foot %': df[df['Category'] == 'On foot'][df.columns[2]].iloc[0] / 100,
'Other method of travel to work %': df[df['Category'] == 'Other method of travel to work'][df.columns[2]].iloc[0] / 100,
'Economically active: In employment%': df[df['Category'] == 'Economically active: In employment'][df.columns[2]].iloc[0] / 100,
'Economically active: Unemployed%': df[df['Category'] == 'Economically active: Unemployed'][df.columns[2]].iloc[0] / 100,
'Economically inactive%': df[df['Category'] == 'Economically inactive'][df.columns[2]].iloc[0] / 100,
'Not in employment: Worked in the last 12 months%': df[df['Category'] == 'Not in employment: Worked in the last 12 months'][df.columns[2]].iloc[0] / 100,
'Not in employment: Not worked in the last 12 months%': df[df['Category'] == 'Not in employment: Not worked in the last 12 months'][df.columns[2]].iloc[0] / 100,
'Not in employment: Never worked%': df[df['Category'] == 'Not in employment: Never worked'][df.columns[2]].iloc[0] / 100,
'Managers, directors and senior officials%': df[df['Category'] == '1. Managers, directors and senior officials'][df.columns[2]].iloc[0] / 100,
'Professional occupations%': df[df['Category'] == '2. Professional occupations'][df.columns[2]].iloc[0] / 100,
'Associate professional and technical occupations%': df[df['Category'] == '3. Associate professional and technical occupations'][df.columns[2]].iloc[0] / 100,
'Administrative and secretarial occupations%': df[df['Category'] == '4. Administrative and secretarial occupations'][df.columns[2]].iloc[0] / 100,
'Skilled trades occupations%': df[df['Category'] == '5. Skilled trades occupations'][df.columns[2]].iloc[0] / 100,
'Caring, leisure and other service occupations%': df[df['Category'] == '6. Caring, leisure and other service occupations'][df.columns[2]].iloc[0] / 100,
'Sales and customer service occupations%': df[df['Category'] == '7. Sales and customer service occupations'][df.columns[2]].iloc[0] / 100,
'Process, plant and machine operatives%': df[df['Category'] == '8. Process, plant and machine operatives'][df.columns[2]].iloc[0] / 100,
'Elementary occupations%': df[df['Category'] == '9. Elementary occupations'][df.columns[2]].iloc[0] / 100,
'L1, L2 and L3: Higher managerial, administrative and professional occupations%': df[df['Category'] == 'L1, L2 and L3: Higher managerial, administrative and professional occupations'][df.columns[2]].iloc[0] / 100,
'L4, L5 and L6: Lower managerial, administrative and professional occupations%': df[df['Category'] == 'L4, L5 and L6: Lower managerial, administrative and professional occupations'][df.columns[2]].iloc[0] / 100,
'L7: Intermediate occupations%': df[df['Category'] == 'L7: Intermediate occupations'][df.columns[2]].iloc[0] / 100,
'L8 and L9: Small employers and own account workers%': df[df['Category'] == 'L8 and L9: Small employers and own account workers'][df.columns[2]].iloc[0] / 100,
'L10 and L11: Lower supervisory and technical occupations%': df[df['Category'] == 'L10 and L11: Lower supervisory and technical occupations'][df.columns[2]].iloc[0] / 100,
'L12: Semi-routine occupations%': df[df['Category'] == 'L12: Semi-routine occupations'][df.columns[2]].iloc[0] / 100,
'L13: Routine occupations%': df[df['Category'] == 'L13: Routine occupations'][df.columns[2]].iloc[0] / 100,
'L14.1 and L14.2: Never worked and long-term unemployed%': df[df['Category'] == 'L14.1 and L14.2: Never worked and long-term unemployed'][df.columns[2]].iloc[0] / 100,
'L15: Full-time students%': df[df['Category'] == 'L15: Full-time students'][df.columns[2]].iloc[0] / 100,
'Hours- Part-time: 15 hours or less worked%': df[df['Category'] == 'Part-time: 15 hours or less worked'][df.columns[2]].iloc[0] / 100,
'Hours-Part-time: 16 to 30 hours worked%': df[df['Category'] == 'Part-time: 16 to 30 hours worked'][df.columns[2]].iloc[0] / 100,
'Hours- Full-time: 31 to 48 hours worked%': df[df['Category'] == 'Full-time: 31 to 48 hours worked'][df.columns[2]].iloc[0] / 100,
'Hours - Full-time: 49 or more hours worked%': df[df['Category'] == 'Full-time: 49 or more hours worked'][df.columns[2]].iloc[0] / 100,
'No qualifications%': df[df['Category'] == 'No qualifications'][df.columns[2]].iloc[0] / 100,
'Level 1,2,3 qualification%': df[df['Category'] == 'Level 1, 2 or 3 qualifications'][df.columns[2]].iloc[0] / 100,
'Apprenticeship%': df[df['Category'] == 'Apprenticeship'][df.columns[2]].iloc[0] / 100,
'Level 4 qualifications and above%': df[df['Category'] == 'Level 4 qualifications and above'][df.columns[2]].iloc[0] / 100,
'Other qualifications%': df[df['Category'] == 'Other qualifications'][df.columns[2]].iloc[0] / 100,
}

After this stage is completed, the data can be merged with the relevant features from the substation attributes collected from UK Power Networks. Then all that is left is to match the correct substation attributes to the corresponding unique smart meter time series and reformat everything according to the data format described in the DoppelGANger GitHub repository (fjxmlzn/DoppelGANger: "Using GANs for Sharing Networked Time Series Data: Challenges, Initial Promise, and Open Questions", IMC 2020 Best Paper Finalist).
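The join logic itself is straightforward; the sketch below illustrates it with pandas, using hypothetical column names and toy values rather than the actual UK Power Networks or ONS schemas.

import pandas as pd

# Tiny illustrative frames (column names are hypothetical, shown only for the join logic).
substations = pd.DataFrame({
    "substation_id": ["S1", "S2"],
    "postcode": ["AB1 2CD", "EF3 4GH"],
    "customer_count": [120, 85],
})
census_lookup = pd.DataFrame({
    "postcode": ["AB1 2CD", "EF3 4GH"],
    "household_size_1_person_pct": [0.31, 0.22],
    "no_cars_pct": [0.18, 0.09],
})
readings = pd.DataFrame({
    "substation_id": ["S1", "S1", "S2"],
    "half_hour": [0, 1, 0],
    "consumption_wh": [1520.0, 1490.0, 980.0],
})

# Attach demographic attributes to each substation via its postcode,
# then match every feeder reading to its substation's attribute row.
attributes = substations.merge(census_lookup, on="postcode", how="left")
dataset = readings.merge(attributes, on="substation_id", how="inner")
print(dataset.head())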
Choosing the normalisation maximum values is straightforward for any values that represent percentages, but for quantities such as the number of households or the power usage, we either used the UK-wide maximum and expressed the value as a percentage of the UK total, or, in the case of power usage, researched the theoretical maximum values for the particular feature. As with everything we do, we aim for scalability, so even though the maximum consumption is typically expected to be in the range of 10-20 kW, we divide the consumption by 100,000 W to ensure the feature embedding can capture outliers in the total active import consumption.
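A minimal sketch of that scaling step, assuming the 100,000 W ceiling described above:

import numpy as np

MAX_CONSUMPTION_W = 100_000  # scaling ceiling described above

def normalise_consumption(watts):
    """Express raw half-hourly consumption as a fraction of 100,000 W (range 0 to 1)."""
    return np.asarray(watts, dtype=float) / MAX_CONSUMPTION_W

def denormalise_consumption(fraction):
    """Convert model outputs back into watts."""
    return np.asarray(fraction, dtype=float) * MAX_CONSUMPTION_W

print(normalise_consumption([3894.6, 12500.0]))   # -> [0.038946 0.125]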
Early results
Although with the technique described above we only had 1,939 records (LV feeders) with 1,488 feature values each (half-hourly ticks for the 31 days of January), we trained DoppelGANger for a few hundred epochs before reviewing the results. To train it, we used all the normalised attributes and the DoppelGANger noise generators in an attempt to generate the half-hourly energy demand of the LV feeder smart meter data based solely on the attributes. Below is a figure that depicts how the data from our newly trained model compares to the real data after renormalisation. It is worth mentioning that the raw output values are still normalised between 0 and 1 and can be converted back, as they represent a fraction of 100,000 W.
As seen in this figure for one of the generated entries, while the result is not perfect, the synthetic generator has captured most of the trends in energy usage. Since demographics alone are insufficient for accurate estimation, we also looked at which other key attributes could benefit the approach. A key finding was that, even with more attributes, more time-series data is required for accurate estimation throughout the month, and that three additional time-series features could greatly benefit the generation process: daylight hours throughout the month, the temperature forecast for the month, and working days versus holidays or weekends. Although we had a plan for extracting and adding these, we only conducted preliminary experiments to evaluate the feasibility of applying this approach in the energy scenario. Below, we also present a day view of the first day of the month, for which we had full real data. While there are some clear inaccuracies, the correlation between the two time series remains high, and they display the same intraday pattern, predicting the peak energy demand at around the right level.
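The day-level comparison can be quantified with a few lines of NumPy; the arrays below are placeholders for the real and synthetic half-hourly profiles of that day.

import numpy as np

# Hypothetical arrays: 48 half-hourly real vs synthetic values for the same day.
real_day = np.random.rand(48)
synth_day = np.random.rand(48)

# Pearson correlation between the two intraday profiles.
corr = np.corrcoef(real_day, synth_day)[0, 1]

# Simple check on whether the peak lands in roughly the same half-hour slot.
peak_offset_slots = abs(int(real_day.argmax()) - int(synth_day.argmax()))
print(f"correlation={corr:.2f}, peak offset={peak_offset_slots} half-hour slots")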
As discussed in the introduction, the main purpose of these experiments was to determine the feasibility of the generative aspect, which we explore next. First, we modify the attributes of the same record shown in the figures above and simply double the number of households.
As expected, the synthetic energy demand jumps substantially. Interestingly, the mean energy demand with the modified attributes is 6,521.2 Wh, while the original is 3,894.6 Wh, which is not exactly double but rather roughly a 67% increase. The standard deviation has decreased from 1,930.3 Wh to 1,471.3 Wh (~24%), and the maximum has increased to 10,429.1 Wh compared to 8,259.9 Wh (only a ~26% increase).
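For transparency, the percentage changes quoted above follow directly from the reported figures:

# Quick check of the changes quoted above (values in Wh, taken from the text).
mean_orig, mean_mod = 3894.6, 6521.2
std_orig, std_mod = 1930.3, 1471.3
max_orig, max_mod = 8259.9, 10429.1

print(f"mean: {(mean_mod / mean_orig - 1) * 100:+.0f}%")   # ~ +67%
print(f"std:  {(std_mod / std_orig - 1) * 100:+.0f}%")     # ~ -24%
print(f"max:  {(max_mod / max_orig - 1) * 100:+.0f}%")     # ~ +26%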
In our exploration, the impact of varying a single attribute reveals some intriguing patterns, notably visible in the synthetic data generated. For example, consider the noticeable trough around 5 PM, which might typically coincide with the end of the workday. Interestingly, this pattern in the synthetic data, while closely mirroring real data, presents a slight deviation—a half-hour difference. This discrepancy may be influenced by several factors such as the percentage of single-person households among others. It's also conceivable that this variance could highlight a potential limitation of the model, possibly stemming from the restricted volume of training data.
Conclusion: Assessing the Promise of DoppelGANger in Energy Planning
Despite the inherent challenges, such as limited data and the model's reliance solely on static attributes (excluding dynamic factors like weather conditions, daily sunrise/sunset, cloudiness, temperature, electricity prices, or special dates), the DoppelGANger approach offers a groundbreaking solution to a critical issue in energy planning. By synthesising realistic time-series data based on demographic attributes, DoppelGANger enables prediction without historical data and supports scenario testing. This capability is vital for effective energy management, particularly in the planning, construction, and development phases of urban and rural areas. In essence, tools like DoppelGANger not only enhance our ability to model and understand complex systems but also underscore the increasing importance of synthetic data in strategic decision-making and policy development. As we refine these models and expand their training datasets, their potential to revolutionise energy planning and management becomes even more pronounced, bridging the gap between theoretical planning and real-world application.




