ML setup #6
Machine Learning setup and training issues
In this issue I will document the current training setup, the results, and the problems I encounter while trying to converge to a good solution.
Data
The data comes from a 24 h simulation (1 h output resolution) over the whole Arctic.
I am focusing on a small region (400 km radius) in the central Arctic. The goal is to predict the next position of each vertex.
For training I use all the vertices at times 1, 3, 5, 7, 9 and 11, and for validation I use the same area at times 12, 14 and 16.
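Purely for illustration, the split amounts to indexing snapshots by output hour. `snapshots` here is a hypothetical container, not the repo's actual data layout:

```python
import torch

# Hypothetical per-hour store: hour -> (num_vertices, num_features) tensor.
snapshots = {h: torch.randn(100, 4) for h in range(1, 25)}

train_hours = [1, 3, 5, 7, 9, 11]  # training snapshots
val_hours = [12, 14, 16]           # validation snapshots, same region
train_set = [snapshots[h] for h in train_hours]
val_set = [snapshots[h] for h in val_hours]
```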
Representation and input
To represent the mesh I use two graphs, one containing vertex information and the other element information.
The input is composed of a central vertex with all the neighbouring elements/vertices around it.
As input features for the vertex graph I use the position (x, y) and the forcing fields (wind, ocean). For the element graph I use the position (x, y), thickness and concentration. All inputs are standardized.
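Standardization is presumably the usual zero-mean/unit-variance scaling; a minimal sketch, assuming the statistics come from the training snapshots only (the function name and shapes are mine):

```python
import torch

def standardize(x: torch.Tensor, train_x: torch.Tensor) -> torch.Tensor:
    """Zero-mean / unit-variance scaling with training-set statistics."""
    mean = train_x.mean(dim=0)
    std = train_x.std(dim=0).clamp_min(1e-8)  # guard against constant features
    return (x - mean) / std
```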
Model
The model processes both graphs with GCNNs and then combines the two feature vectors to predict the velocity of the vertex (class `GCNN_2G(nn.Module)`).
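For context, a minimal sketch of what a two-branch GCNN of this shape could look like, written with PyTorch Geometric. Only the class name and the two-graph idea come from the issue; the layer widths, depth, feature counts and mean pooling are my assumptions:

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv, global_mean_pool

class GCNN_2G(nn.Module):
    """Two GCN branches (vertex graph and element graph) whose pooled
    embeddings are concatenated and mapped to a 2-D vertex velocity."""

    def __init__(self, vertex_feats=4, element_feats=4, hidden=64):
        super().__init__()
        self.v_conv1 = GCNConv(vertex_feats, hidden)
        self.v_conv2 = GCNConv(hidden, hidden)
        self.e_conv1 = GCNConv(element_feats, hidden)
        self.e_conv2 = GCNConv(hidden, hidden)
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 2)
        )

    def forward(self, x_v, edge_v, batch_v, x_e, edge_e, batch_e):
        # Vertex branch: two rounds of message passing, then one vector per graph.
        h_v = torch.relu(self.v_conv1(x_v, edge_v))
        h_v = global_mean_pool(torch.relu(self.v_conv2(h_v, edge_v)), batch_v)
        # Element branch, same structure.
        h_e = torch.relu(self.e_conv1(x_e, edge_e))
        h_e = global_mean_pool(torch.relu(self.e_conv2(h_e, edge_e)), batch_e)
        # Combine both embeddings and predict the (u, v) velocity.
        return self.head(torch.cat([h_v, h_e], dim=1))
```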
Loss Function
A custom loss function accounts for velocity angle and position; it can be parametrized via the A, B, C coefficients.
```python
import torch.nn as nn

class CustomIceLoos(nn.Module):
    def __init__(self, A=1, B=0, C=0, step=1, d_time=3600):
        super(CustomIceLoos, self).__init__()
        self.mae = nn.L1Loss()
        self.A = A
        self.B = B
        self.C = C
        self.step = step
        self.d_time = d_time
```
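The snippet is truncated before the forward pass. As a hedged guess at how the three terms could combine (the actual formulation is not shown here; `pred_vel`, `true_vel`, `pos` and `next_pos` are hypothetical arguments):

```python
import torch.nn.functional as F

# Hypothetical forward pass: A weights an L1 velocity term, B an angular
# term, C a position term after advecting for `step * d_time` seconds.
def forward(self, pred_vel, true_vel, pos, next_pos):
    vel_term = self.mae(pred_vel, true_vel)
    # 1 - cosine similarity penalizes misaligned velocity directions.
    angle_term = (1 - F.cosine_similarity(pred_vel, true_vel, dim=-1)).mean()
    # Advect the vertex with the predicted velocity and compare positions.
    pos_term = self.mae(pos + pred_vel * self.step * self.d_time, next_pos)
    return self.A * vel_term + self.B * angle_term + self.C * pos_term
```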
Training and results
Training dynamics are not bad and there is no overfitting over time (with these data splits). After standardizing the features I had to clip the gradient norm to 1 to stabilize training; otherwise the gradients exploded. Typical training curves look like the one attached (note the log scale).
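The clipping itself is one line in the loop; a sketch of where it sits (the loader and batch layout are placeholders, not the actual training code):

```python
import torch

def train_epoch(model, loader, loss_fn, optimizer):
    for inputs, targets in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        # Clip the global gradient norm to 1; without this the gradients
        # exploded after feature standardization.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
```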
Results are on the order of ~65 RMSE.
Problems
One of the main problems is that training on the GPU does not speed up the process, even with large batches (128 or 512).
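One way to narrow this down is to time a single forward pass with explicit CUDA synchronization; if the per-step time barely changes between CPU and GPU, the bottleneck is likely CPU-side data loading, graph construction or host-to-device transfer rather than compute, which is common with graphs this small. A rough sketch (function and argument names are mine):

```python
import time
import torch

def time_forward(model, inputs, device):
    """Time one forward pass; cuda synchronization is required for honest
    GPU timings because kernel launches are asynchronous."""
    model = model.to(device)
    inputs = inputs.to(device)
    if device.startswith("cuda"):
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    model(inputs)
    if device.startswith("cuda"):
        torch.cuda.synchronize()
    return time.perf_counter() - t0
```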