Jamesmoore/arch 224 silo support device groups #63

Merged
merged 182 commits into from
Feb 27, 2025
b246386
Added some scripts for testing branch dev
jimmyaxod Dec 13, 2024
091c5c4
Test working w hack/start hack/migrate
jimmyaxod Dec 13, 2024
c1b0483
Started pulling out silo bits for peer migrate_to
jimmyaxod Dec 13, 2024
fa941b7
Pulled out silo make_migratable bits
jimmyaxod Dec 13, 2024
1a552a2
Pulled out silo migrate_from (protocol)
jimmyaxod Dec 13, 2024
1c45c0f
Pulled out migrate_from (fs) silo bits
jimmyaxod Dec 13, 2024
7488186
Simplified migrate from local a little
jimmyaxod Dec 13, 2024
a167ee3
Split out shared devices from Silo devs
jimmyaxod Dec 13, 2024
4498bac
Refactor silo dev schema for local
jimmyaxod Dec 13, 2024
9bef1ea
migrate (local fs) now using silo.dg
jimmyaxod Dec 13, 2024
7ac2c55
Split initial migrate from dirty migrate phase
jimmyaxod Dec 13, 2024
181f949
Little tidy
jimmyaxod Dec 14, 2024
6134366
Cleaned out make_migratable
jimmyaxod Dec 14, 2024
09f9827
renames
jimmyaxod Dec 14, 2024
9857d5c
Simplification
jimmyaxod Dec 14, 2024
684fc24
Little more refactor
jimmyaxod Dec 14, 2024
4f0b236
Things kind of starting to work for some definition of work.
jimmyaxod Dec 15, 2024
9fc741d
Couple of tidy ups
jimmyaxod Dec 15, 2024
b16febd
WIP. Getting there... migrate works
jimmyaxod Dec 15, 2024
23916bc
Tidied up exposeSiloDeviceAsFile
jimmyaxod Dec 15, 2024
2fb88a3
Removed some stage vars
jimmyaxod Dec 15, 2024
aee7e85
Removed more stage bits
jimmyaxod Dec 15, 2024
941e3c7
More work
jimmyaxod Dec 15, 2024
9978037
Removed stage2Inputs
jimmyaxod Dec 16, 2024
a15c767
Removed some unused hooks. Logging/metrics better
jimmyaxod Dec 16, 2024
3a9b7c4
More simplification
jimmyaxod Dec 16, 2024
cfe41d3
Pulled out vm_state
jimmyaxod Dec 16, 2024
742e9ae
Removed hack scripts for testing
jimmyaxod Dec 16, 2024
947fc00
First stab at dirty refactor
jimmyaxod Dec 16, 2024
3b00f0f
Added NewDirtyManager etc
jimmyaxod Dec 16, 2024
70b46f3
Switch to new Dirty migrate code
jimmyaxod Dec 16, 2024
fa7c55e
Bit further
jimmyaxod Dec 16, 2024
ef4ddb6
First ver working with single auth transfer event
jimmyaxod Dec 17, 2024
1c1ea7c
Some tidying up
jimmyaxod Dec 17, 2024
d1cdd0b
Fixed silo import, and removed unused hook
jimmyaxod Dec 17, 2024
18001d0
Some more tidy, removed old eventHandler
jimmyaxod Dec 17, 2024
ad636e8
Tidied up close
jimmyaxod Dec 17, 2024
e84ee7b
exposed progress handler
jimmyaxod Dec 18, 2024
5a70332
Updated to have progress handler
jimmyaxod Dec 18, 2024
acbe144
Removed more
jimmyaxod Dec 18, 2024
912653f
More simplification
jimmyaxod Dec 18, 2024
cb70634
Little further
jimmyaxod Dec 18, 2024
4734ff2
Mostly tidy
jimmyaxod Dec 18, 2024
f2a911e
Bump silo to v0.1.6
jimmyaxod Dec 19, 2024
b0937a2
Unit test start on dirty manager
jimmyaxod Jan 4, 2025
191cfa7
First test for migrateFromFs
jimmyaxod Jan 4, 2025
bd59185
Added nbd to workflow, more refactor
jimmyaxod Jan 4, 2025
171b55f
Adjusted test run
jimmyaxod Jan 4, 2025
5378193
Adjust test workflow
jimmyaxod Jan 4, 2025
4a578d0
Bumped silo ver and added some debug
jimmyaxod Jan 4, 2025
6937ebf
Fix for ARCH-235/drafter-auth-transfer-dirty-block-lists
jimmyaxod Jan 5, 2025
bffebaa
Got rid of possible delay in dirty migrate timeline. Removed mig debu…
jimmyaxod Jan 5, 2025
f3d5f69
Started on some logging for migrations
jimmyaxod Jan 6, 2025
48358ad
First simple migration unit test
jimmyaxod Jan 6, 2025
2c6a065
Mig test now correctly checks data in equal
jimmyaxod Jan 6, 2025
8803bf1
Reorg peer files, and fixed dirty manager test
jimmyaxod Jan 6, 2025
3b89d99
Removed some commented out code
jimmyaxod Jan 6, 2025
cb0404b
Tidy up of missing inputs to migrateFrom
jimmyaxod Jan 6, 2025
1b6ec4b
Added schema tweak for initial dg start
jimmyaxod Jan 7, 2025
aa921a9
Updated migrations+test to do custom data xfer
jimmyaxod Jan 14, 2025
796cba4
Added hooks to transfer, and receive custom data
jimmyaxod Jan 14, 2025
71ad457
Temp files now created locally rather than in tmp
jimmyaxod Jan 15, 2025
3d58bbd
Protection if customPayload hooks not set
jimmyaxod Jan 15, 2025
b7bdc4e
Added hack scripts
jimmyaxod Jan 15, 2025
873165d
Revert to previous behaviour on destination tweak schema
jimmyaxod Jan 15, 2025
8850d23
Fixed compound nil ptr err, and now mkdirAll on incoming dev
jimmyaxod Jan 15, 2025
36d02f6
Fixes (again) the issue where small file eg state can become corrupte…
jimmyaxod Jan 15, 2025
de58a93
Fixed up config reading via device
jimmyaxod Jan 15, 2025
c0597ce
START of refactoring for things other than peer
jimmyaxod Jan 16, 2025
c99afba
First stab at refactoring mounter
jimmyaxod Jan 16, 2025
2c7dd82
Tidied up dg usage
jimmyaxod Jan 16, 2025
84c53a4
Switched ResumedPeer to static close/wait funcs
jimmyaxod Jan 16, 2025
5a4395c
Switched peer close/wait to static
jimmyaxod Jan 16, 2025
32e2b07
Moved migratedPeer close to static
jimmyaxod Jan 16, 2025
e8c8e31
Tidied close/wait inside peer
jimmyaxod Jan 16, 2025
92fc84d
Logging for peer migrations
jimmyaxod Jan 17, 2025
a37a210
Switched peer logging over
jimmyaxod Jan 20, 2025
e050198
Pulled out peer networking
jimmyaxod Jan 20, 2025
17ce19b
Removed some unused hooks etc
jimmyaxod Jan 20, 2025
56e8056
Fix drafter-mounter for removed hooks
jimmyaxod Jan 20, 2025
61c6b27
-some generics
jimmyaxod Jan 20, 2025
2597222
Merged migrated peer into peer.
jimmyaxod Jan 21, 2025
d30a5d7
Merged peer + resumed_peer
jimmyaxod Jan 21, 2025
894cf23
Moved and removed unused peer errors
jimmyaxod Jan 21, 2025
5e69cab
WIP. start->migrate works. start->ctrl+c->start does not
jimmyaxod Jan 21, 2025
00337cf
Tidied some default devs
jimmyaxod Jan 21, 2025
ce12c3e
Some tidying of snapshotter
jimmyaxod Jan 23, 2025
e2b699e
Fixed graceful exit
jimmyaxod Jan 23, 2025
4c1ccaf
WIP package_test
jimmyaxod Jan 24, 2025
947fb43
Start on next couple of tests
jimmyaxod Jan 24, 2025
eb0f1d6
Package test
jimmyaxod Jan 25, 2025
d8aaa85
Extracted runtime provider from peer, and reinstated generic rpc impl…
jimmyaxod Jan 27, 2025
3a163df
RuntimeProvider correctly sets Remote after resume
jimmyaxod Jan 27, 2025
4d52498
Moved config/vsock port bit inside runtimeProvider
jimmyaxod Jan 28, 2025
395cdcc
First e2e peer migrate test working (No writes from vm)
jimmyaxod Jan 28, 2025
509be43
Tidied up log messages in peer
jimmyaxod Jan 28, 2025
5a84b3b
Full e2e mocked vm migration with the vm writing data to devices duri…
jimmyaxod Jan 28, 2025
e74fdf7
Added a bit more output on test
jimmyaxod Jan 28, 2025
2d137df
Reorg runtimes
jimmyaxod Jan 29, 2025
096acb0
Fixed shutdown behaviour hopefully within peer/runtime
jimmyaxod Jan 29, 2025
0965ff0
Should fix RPC select bug https://github.com/loopholelabs/drafter/iss…
jimmyaxod Jan 29, 2025
90cd64c
Added baseline config for S3 assisted Silo devices
jimmyaxod Jan 29, 2025
cdfc146
Can now expose silo prom metrics from drafter-peer
jimmyaxod Jan 29, 2025
e038c89
Removed debug
jimmyaxod Jan 29, 2025
7324f32
Added a hook to MigrateFromHook for completion event
jimmyaxod Jan 30, 2025
9863f99
Fixed hang on connection close mid migration, added test
jimmyaxod Jan 30, 2025
6b595ff
fix: Don't try to access `err` in `peer.go` unless it's not-nil
pojntfx Jan 31, 2025
4beca3a
fix: Don't try to run suspend RPCs if resumes failed
pojntfx Jan 31, 2025
be74f29
fix: Don't try to access `err` in `peer.go` unless it's not-nil
pojntfx Jan 31, 2025
72c0061
Adjusted to call onComplete only on success. On error, the error will…
jimmyaxod Jan 31, 2025
d18a8ee
Added peer migrate using cow unit test
jimmyaxod Jan 31, 2025
1ec03b9
Added test for multiple migrations
jimmyaxod Jan 31, 2025
cbc7937
multi migration test failing due to not picking up dirty data
jimmyaxod Jan 31, 2025
f8e8443
Hack workaround write-path bug passes
jimmyaxod Jan 31, 2025
c519831
bump silo to fix multi-migration issue, and added test
jimmyaxod Jan 31, 2025
b8ddcd5
Little test tidy and removed printf from migrations
jimmyaxod Jan 31, 2025
e27f240
Merge remote-tracking branch 'origin/jamesmoore/arch-224-silo-support…
pojntfx Jan 31, 2025
dd7f1d6
Merge pull request #75 from loopholelabs/fix-logging-panic
pojntfx Feb 3, 2025
d5ebce4
Merge remote-tracking branch 'origin/main' into integrate-main
pojntfx Feb 3, 2025
a1d4257
Merge pull request #77 from loopholelabs/integrate-main
pojntfx Feb 3, 2025
c651f86
feat: Allow using S3 sync feature with HTTP servers
pojntfx Feb 3, 2025
c79cc8a
Bumped silo +cow migrate. Added flag sharedBase to enable
jimmyaxod Feb 4, 2025
f80ffe3
fix: Call `OnCompletion` hook regardless of whether it is local or re…
pojntfx Feb 5, 2025
4a568dc
feat: Add `--disable-postcopy-migration` flag to `drafter-peer`
pojntfx Feb 5, 2025
aeaa386
test: Add tests that verify `OnCompletion` gets called in all paths
pojntfx Feb 5, 2025
30562e6
chore: Use proper casing for flag
pojntfx Feb 5, 2025
028b3e2
If ROSourceShared set, then S3Sync.Config.OnlyDirty set now
jimmyaxod Feb 5, 2025
07d60b1
Merge pull request #79 from loopholelabs/add-missing-oncompletion-hook
pojntfx Feb 5, 2025
4f25f0b
Bump Silo to fix cpu 100% after migration
jimmyaxod Feb 5, 2025
910d444
Merge branch 'jamesmoore/arch-224-silo-support-device-groups' of gith…
jimmyaxod Feb 5, 2025
b703a9b
Bumped silo, updated to WAIT before resume if disable-postcopy-migrat…
jimmyaxod Feb 6, 2025
a8a0ba8
Bumped silo ver. added s3concurrency config per device.
jimmyaxod Feb 7, 2025
5d8eeb4
Bump silo
jimmyaxod Feb 7, 2025
9730f64
New replay runtime. New test for valkey replay. Fixed possible issue …
jimmyaxod Feb 7, 2025
6d5a54d
Merge branch 'jamesmoore/arch-224-silo-support-device-groups' into ad…
pojntfx Feb 8, 2025
6137984
Merge pull request #78 from loopholelabs/add-s3-secure-bool
pojntfx Feb 8, 2025
a424c10
Bump Silo version
jimmyaxod Feb 11, 2025
f0df236
Updated start script to use sharedBase for cow migration
jimmyaxod Feb 11, 2025
9fb2d27
Added couple of drafter metrics
jimmyaxod Feb 11, 2025
5180594
Added missing protocol to silo metrics
jimmyaxod Feb 11, 2025
e3b7487
Bumped silo
jimmyaxod Feb 11, 2025
3b919f8
Fix metrics a little, tests pass again
jimmyaxod Feb 12, 2025
f9d8347
Moved metrics creation out of peer
jimmyaxod Feb 12, 2025
d157852
Added ID to peer for metrics
jimmyaxod Feb 12, 2025
75d005f
Bump Silo, update to use instance_id for metrics
jimmyaxod Feb 12, 2025
f52b99c
Updated/fixed cow multi migrate test to correctly use a shared base i…
jimmyaxod Feb 12, 2025
a354097
new mock_runtime, and tidy up tests. Started on CoW+S3 migration test
jimmyaxod Feb 14, 2025
0544d1d
Bumped Silo. Fixed context cancel issue on S3 grab. Tests passing.
jimmyaxod Feb 17, 2025
e1416bd
Added a few more drafter status metrics
jimmyaxod Feb 17, 2025
99ae15b
Blocksize now taken from source not local config
jimmyaxod Feb 18, 2025
73a9e9e
Possible select fix
jimmyaxod Feb 18, 2025
0a06595
Added helper to log all device hashes to verify data integrity
jimmyaxod Feb 18, 2025
1db0bea
Tidied protocol handle no log on EOF
jimmyaxod Feb 18, 2025
ab88489
Clean up log a little
jimmyaxod Feb 18, 2025
5b054cf
Removed printf
jimmyaxod Feb 18, 2025
770c5be
Added run state and protection to firecracker runtime.
jimmyaxod Feb 19, 2025
5459072
Added boot args for non pvm
jimmyaxod Feb 19, 2025
aab4cdb
Bumped panrpc
jimmyaxod Feb 19, 2025
5c894fc
Fix panic if no metrics in peer. Tests pass again
jimmyaxod Feb 19, 2025
0ccb08d
Added final check in cow+s3 test
jimmyaxod Feb 19, 2025
089eee8
Started refactor+tidy firecracker runtime plugin
jimmyaxod Feb 19, 2025
0dfe653
Removed panics from snapshotter
jimmyaxod Feb 19, 2025
5356652
Split out some firecracker types
jimmyaxod Feb 19, 2025
b60a2f3
Tidy up snapshotter, add logging
jimmyaxod Feb 20, 2025
07ff761
Extracted RPC from snapshotter. removed panics, goroutinemanager etc
jimmyaxod Feb 20, 2025
f564bdc
Start on fc runner refactor
jimmyaxod Feb 20, 2025
752b7a1
More rpc tidying. runner now has access to a log
jimmyaxod Feb 20, 2025
a9ace58
Chased generics out of runner
jimmyaxod Feb 20, 2025
edb77ba
Tidied fc runner, added logging
jimmyaxod Feb 20, 2025
de55dc6
Added fc runtime snapshotter test
jimmyaxod Feb 21, 2025
d9f2c0d
Extracted resumed_runner. Runner clean
jimmyaxod Feb 21, 2025
bcfc53a
rpc extraction
jimmyaxod Feb 21, 2025
5dd9549
More rr/rpc tidying up
jimmyaxod Feb 21, 2025
69d4995
Getting there
jimmyaxod Feb 21, 2025
fbad4d6
More reorg
jimmyaxod Feb 21, 2025
c49c7ef
resumed runner fairly tidy
jimmyaxod Feb 21, 2025
9bb1253
Started on resume suspend test
jimmyaxod Feb 21, 2025
81688f1
fc vm resume/suspend test working
jimmyaxod Feb 21, 2025
4726685
resume/suspend test passing locally
jimmyaxod Feb 21, 2025
788b1d0
fc vm tests
jimmyaxod Feb 24, 2025
975cc32
Merged from main
jimmyaxod Feb 27, 2025
6479a27
Put fc tests behind integration flag
jimmyaxod Feb 27, 2025
First e2e peer migrate test working (No writes from vm)
Signed-off-by: Jimmy Moore <[email protected]>
jimmyaxod committed Jan 28, 2025

commit 395cdcc3fa9b76c4323dd600454559b1d6eb2cfb
194 changes: 192 additions & 2 deletions pkg/peer/peer_test.go
@@ -1,8 +1,198 @@
package peer

// import "testing" (deleted by this commit; superseded by the import block below)
import (
"context"
"fmt"
"io"
"math/rand"
"os"
"path"
"sync"
"testing"
"time"

"github.com/loopholelabs/drafter/pkg/common"
"github.com/loopholelabs/drafter/pkg/mounter"
"github.com/loopholelabs/logging"
"github.com/loopholelabs/silo/pkg/storage/migrator"
"github.com/stretchr/testify/assert"
)

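// MockRuntimeProvider is a no-op runtime provider that only logs each call.
// It does not yet modify any device while running (this commit tests migration with no writes from the VM).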
type MockRuntimeProvider struct {
HomePath string
}

func (rp *MockRuntimeProvider) Start(ctx context.Context, rescueCtx context.Context) error {
fmt.Printf(" ### Start %s\n", rp.HomePath)
return nil
}

func (rp *MockRuntimeProvider) Close() error {
fmt.Printf(" ### Close %s\n", rp.HomePath)
return nil
}

func (rp *MockRuntimeProvider) DevicePath() string {
fmt.Printf(" ### DevicePath %s\n", rp.HomePath)
return rp.HomePath
}

func (rp *MockRuntimeProvider) GetVMPid() int {
fmt.Printf(" ### GetVMPid %s\n", rp.HomePath)
return 0
}

func (rp *MockRuntimeProvider) SuspendAndCloseAgentServer(ctx context.Context, timeout time.Duration) error {
fmt.Printf(" ### SuspendAndCloseAgentServer %s\n", rp.HomePath)
return nil
}

func (rp *MockRuntimeProvider) FlushData(ctx context.Context) error {
fmt.Printf(" ### FlushData %s\n", rp.HomePath)
return nil
}

func (rp *MockRuntimeProvider) Resume(resumeTimeout time.Duration, rescueTimeout time.Duration) error {
fmt.Printf(" ### Resume %s\n", rp.HomePath)
return nil
}

const testPeerSource = "test_peer_source"
const testPeerDest = "test_peer_dest"

func TestPeer(t *testing.T) {

// TODO: Test a peer migration. Need to abstract out complete VM and then construct a mockVM.
log := logging.New(logging.Zerolog, "test", os.Stderr)
// log.SetLevel(types.TraceLevel)

err := os.Mkdir(testPeerSource, 0777)
assert.NoError(t, err)
err = os.Mkdir(testPeerDest, 0777)
assert.NoError(t, err)

t.Cleanup(func() {
err := os.RemoveAll(testPeerSource)
assert.NoError(t, err)
err = os.RemoveAll(testPeerDest)
assert.NoError(t, err)
})

devicesInit := make([]common.MigrateFromDevice, 0)
devicesTo := make([]common.MigrateToDevice, 0)
devicesFrom := make([]common.MigrateFromDevice, 0)

// Create some device source files, and setup devicesTo and devicesFrom for migration.
for _, n := range common.KnownNames {
// Create some initial devices...
fn := common.DeviceFilenames[n]

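// Each device gets a random size between 1 MiB and 5 MiB.
dataSize := (1 + rand.Intn(5)) * 1024 * 1024
// The buffer is zero-filled, so the final comparison only verifies size and all-zero content.
buffer := make([]byte, dataSize)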
err = os.WriteFile(path.Join(testPeerSource, fn), buffer, 0777)
assert.NoError(t, err)

devicesInit = append(devicesInit, common.MigrateFromDevice{
Name: n,
Base: path.Join(testPeerSource, fn),
Overlay: "",
State: "",
BlockSize: 1024 * 1024,
Shared: false,
})

devicesTo = append(devicesTo, common.MigrateToDevice{
Name: n,
MaxDirtyBlocks: 10,
MinCycles: 1,
MaxCycles: 3,
CycleThrottle: 1 * time.Second,
})

devicesFrom = append(devicesFrom, common.MigrateFromDevice{
Name: n,
Base: path.Join(testPeerDest, fn),
Overlay: "",
State: "",
BlockSize: 1024 * 1024,
Shared: false,
})

}

rp := &MockRuntimeProvider{
HomePath: testPeerSource,
}
peer, err := StartPeer(context.TODO(), context.Background(), log, rp)
assert.NoError(t, err)

hooks1 := mounter.MigrateFromHooks{
OnLocalDeviceRequested: func(id uint32, path string) {},
OnLocalDeviceExposed: func(id uint32, path string) {},
OnLocalAllDevicesRequested: func() {},
OnXferCustomData: func(data []byte) {},
}

err = peer.MigrateFrom(context.TODO(), devicesInit, nil, nil, hooks1)
assert.NoError(t, err)

fmt.Printf("Resume\n")

err = peer.Resume(context.TODO(), 10*time.Second, 10*time.Second)
assert.NoError(t, err)

// Now we have a "resumed peer"

rp2 := &MockRuntimeProvider{
HomePath: testPeerDest,
}
peer2, err := StartPeer(context.TODO(), context.Background(), log, rp2)
assert.NoError(t, err)

r1, w1 := io.Pipe()
r2, w2 := io.Pipe()

hooks := MigrateToHooks{
OnBeforeSuspend: func() {},
OnAfterSuspend: func() {},
OnAllMigrationsCompleted: func() {},
OnProgress: func(p map[string]*migrator.MigrationProgress) {},
GetXferCustomData: func() []byte { return []byte{} },
}

var wg sync.WaitGroup
wg.Add(1)
go func() {
defer wg.Done()
fmt.Printf("Calling MigrateTo...\n")
// Use a goroutine-local err so this does not race with the err written by the test goroutine.
err := peer.MigrateTo(context.TODO(), devicesTo, 10*time.Second, 10, []io.Reader{r1}, []io.Writer{w2}, hooks)
fmt.Printf("MigrateTo returned %v\n", err)
assert.NoError(t, err)
}()

hooks2 := mounter.MigrateFromHooks{
OnLocalDeviceRequested: func(id uint32, path string) {},
OnLocalDeviceExposed: func(id uint32, path string) {},
OnLocalAllDevicesRequested: func() {},
OnXferCustomData: func(data []byte) {},
}
fmt.Printf("MigrateFrom...\n")
err = peer2.MigrateFrom(context.TODO(), devicesFrom, []io.Reader{r2}, []io.Writer{w1}, hooks2)
fmt.Printf("MigrateFrom returned %v\n", err)
assert.NoError(t, err)

wg.Wait()

// Make sure everything migrated as expected...

for _, n := range common.KnownNames {
fn := common.DeviceFilenames[n]
buff1, err := os.ReadFile(path.Join(testPeerSource, fn))
assert.NoError(t, err)
buff2, err := os.ReadFile(path.Join(testPeerDest, fn))
assert.NoError(t, err)

// Check the data is identical
assert.Equal(t, buff1, buff2)
}
}