forked from davidpanderson/science_united
notes.txt
problems
host 2191
first rpc jan 22 2019
GPU flops 4e16
too-large CPU times?
maybe because of existing (pre-SU) totals?
====================
Allocation
--------------
The way in which computing power is divided among projects and users.
"Hard allocation": one that is guaranteed, based on assumptions
about the volunteer pool.
--------------
goals: multi-objective
honor hard allocations
respect soft allocations
respect user prefs
maximize throughput
--------------
We can estimate the total throughput of the system,
and we can break it down by resource type,
and for a given set of keywords we can see how much is not vetoed.
Goal: someone comes to SU with an allocation request, which includes
- the available app version types
(CPU, NVIDIA, AMD, platform, VM)
- the set of keywords
- # FLOPs
We can tell them whether this can be granted, and if so starting when
Parameters:
- what % of resources to use for allocations (50?)
- max % for a single allocation (10?)
Projects also get an unallocated share,
with a fraction determined by the popularity of their keywords
to process an allocation request:
select hosts that aren't ruled out by keywords
total their flops for eligible resources
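A minimal sketch of the two steps above, assuming each host record carries its peak FLOPS per resource type and its owner's "no" keyword IDs. The Host fields and the function name are hypothetical, not existing SU code.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    # Peak FLOPS per resource type, e.g. {"CPU": 1e10, "NVIDIA": 5e12} (assumed layout)
    flops: dict
    # Keyword IDs the host's owner has marked "no"
    no_keywords: set = field(default_factory=set)


def eligible_flops(hosts, request_keywords, request_resources):
    """Total FLOPS of hosts not vetoed by keywords, counting only eligible resources."""
    wanted = set(request_keywords)
    total = 0.0
    for h in hosts:
        if h.no_keywords & wanted:
            continue  # a "no" keyword rules this host out
        total += sum(f for rsc, f in h.flops.items() if rsc in request_resources)
    return total


hosts = [
    Host({"CPU": 1e10, "NVIDIA": 5e12}),
    Host({"CPU": 2e10}, no_keywords={7}),   # vetoes keyword 7
    Host({"CPU": 3e10, "AMD": 1e12}),       # AMD is not in the request
]
available = eligible_flops(hosts, [7, 9], {"CPU", "NVIDIA"})
```

The result is what the request can be checked against, after applying the "% for allocations" and "max % per allocation" parameters.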
------------
SU serves as a scheduler at the level of project.
It keeps track of targets, allocations, balances, host assignments.
SU allocation is at the granularity of project.
If a project needs app- or user-level granularity it can either
- run multiple BOINC projects, 1 per user/app
- use BOINC's built-in allocation system
An "allocation" consists of
- an initial balance
- the rate of balance accrual
- the interval of balance accrual
where "balance" is in terms of FLOPs
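A rough sketch of an allocation record, assuming "rate" means FLOPs credited per interval and "interval" is in seconds. Both interpretations, and all field names, are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Allocation:
    balance: float           # FLOPs currently owed to the project
    accrual_rate: float      # FLOPs credited per interval (assumed meaning)
    accrual_interval: float  # seconds between accruals (assumed unit)
    last_accrual: float = 0.0

    def accrue(self, now):
        """Credit every whole interval that has elapsed since the last accrual."""
        n = int((now - self.last_accrual) // self.accrual_interval)
        if n > 0:
            self.balance += n * self.accrual_rate
            self.last_accrual += n * self.accrual_interval


alloc = Allocation(balance=1e15, accrual_rate=1e14, accrual_interval=86400)
alloc.accrue(now=2.5 * 86400)  # two whole days have elapsed
```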
There are two measures of work done:
1) REC, as reported directly by clients
2) credit, as reported by projects
2) is cheat-resistant, 1) is not
1) is up-to-date, 2) is not (long jobs, validation delay)
Approach:
use 1) for SU scheduling, but don't show it to volunteers
(eliminate incentive for cheating)
show 2) to volunteers.
?? what if SU assigns a host to a project, and the project doesn't have work?
SU should work in a reasonable way for sporadic workloads
BOINC client:
if devices are idle, and scheduler RPC returns nothing
do an AM RPC, but only every so often
(min AM RPC interval?)
Need way for SU to check if project has work?
--------------
give projects more allocation if they supply computing?
No. nanoHub users are going to attach to nanoHub, not SU
--------------
========================================
projects
tacc
http://129.114.6.131/tacctest/
Biology
United States
nanohub:
https://devboinc.nanohub.org/boincserver/
Nanotechnology
United States
SETI@home
Astronomy, SETI
United States, University of California
BOINC Test Project
http://boinc.berkeley.edu/test/
Einstein@home
Astronomy, Physics
Germany, United States
Rosetta
Biology, Medicine
United States, University of Washington
-----------------
categorize keywords as major or minor
major: astro, bio, env, physics, math
US, Europe, Asia
categorize projects as auto or on-demand
auto: account creation initiated on SU account creation
====================
User Web site
functions:
new user experience
return user experience
catalog of research projects
where does this come from?
news from projects
--------------
main page (index.php)
If not logged in
call to action; describe VC, science; pictures; safety
big Join button, small login link
if logged in
if never got RPC from this user
need to download; link
if last RPC is old or client version is old
advise to download; link
if problem accounts, link to problem account page
progress/accounting stuff
show graph of recent EC
"In the last 24 hours your computers have contributed
CPU time/EC, GPU time/EC, #jobs success
You are contributing to science projects doing x, x, x
located in y, y, y"
link to Account page
Add new device
Android:
log in from device,
menu bar (all pages)
Your account (if logged in)
projects
science prefs
computing prefs
computing stats
certificate?
Computing
projects
leaderboards
Community
message boards
teams
profiles
community prefs
user search
user of the day
Site
search
languages
about SU
help
--------------------
Join page (su_join.php)
create account info
email
screen name
password
basic prefs
send cookie
goes to...
--------------------
Download page (su_download.php)
says you need to install BOINC
Download link (direct)
goes via concierge; sets cookies
When user runs app, welcome screen says
You are now running SU as user Fred
No further action is needed.
SU will run science jobs in the background.
OK
small print: if BOINC version >= 7.10 and VBox already installed:
In BOINC Manager
select Tools / Use account manager
choose SU
enter the email address and password of your SU account
goes to...
--------------------
Getting started (su_intro.php)
Welcome, thanks etc.
add to your other devices too
tell your family and friends
Explain
computing prefs
keyword prefs
join a team
--------------------
User account page (su_home.php)
like BOINC home page:
left:
problem accounts
graphs of recent work
link to list of hosts
links to prefs
keywords
computing prefs
community prefs
right:
account info
social functions
--------------------
host list (su_hosts.php)
columns:
name
boinc version
last RPC
active projects
--------------------
keyword prefs edit page (su_prefs.php)
show list of major keywords, yes/maybe/no radio buttons
"more" button shows all keywords
if user has prefs for non-major keywords, show all in that category
--------------------
global prefs edit page
if min/med/max
show radio buttons
link to all prefs
--------------------
connect account page (su_connect.php)
explain the problem
show list of problem accounts; click for details
details page:
let the user enter password of existing account.
OK: contacts project, tell user if fail
---------------------
Accounting pages (su_acct.php)
Totals: tables and graphs of totals
projects: table of projects (su_projects_acct.php; merge)
project: graphs for 1 project
users: list of top users
user: graphs for 1 user
---------------------
news items
would be good to show news items from project
easiest way: each project exports RSS feed
news items are tagged with keywords
SU aggregates these
if logged in, filter/order by keyword
show combined stream to others
Give projects more allocation if they supply news?
forums
for discussion of projects, keywords?
==================
RPC handler
when are accounts created?
on initial registration:
pick N best projects based on prefs
(same logic as in RPC handler)
start account creation for those projects
on RPC:
pick N best projects
if not registered with any of them
return highest-score project we are registered with
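One possible reading of the on-RPC rule, as a sketch. It assumes that if any of the N best projects already has an account we return those, and otherwise fall back to the best registered project. The function name and the score dict are hypothetical.

```python
def projects_for_rpc(scores, registered, n):
    """scores: dict project -> score; registered: set of projects with accounts."""
    best = sorted(scores, key=scores.get, reverse=True)[:n]
    usable = [p for p in best if p in registered]
    if usable:
        return usable
    # None of the N best is registered yet: fall back to the
    # highest-score project we do have an account with, if any.
    reg = [p for p in scores if p in registered]
    return [max(reg, key=scores.get)] if reg else []


scores = {"a": 3.0, "b": 2.0, "c": 1.0}
fallback = projects_for_rpc(scores, registered={"c"}, n=2)
```

Here the two best projects ("a" and "b") have no account yet, so the fallback is "c".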
==================
client changes
represent user keyword sets as two vector<int>s
The following have keyword sets:
AM_ACCOUNT
Scheduler request: include
Eventually (to show keywords in GUI),
client needs to have master keyword list (keywords.xml)
bundle with installer;
refresh every 2 weeks;
refresh if see unknown keyword ID
The following have keyword sets:
PROJECT
WORKUNIT
==================
server changes
workunit table has keywords as varchar 256
WU_RESULT has keywords as string
scheduler: maintain array of vector<int> corresponding to cache
if look at job, keywords nonempty, vector empty: parse
when remove from cache, clear vector
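A sketch of the lazy-parse idea, in Python rather than the scheduler's C++. It assumes the keywords string is a whitespace-separated list of integer IDs; the class and method names are made up for illustration.

```python
class JobCacheKeywords:
    """Parsed keyword vectors, one per job-cache slot, filled in on demand."""

    def __init__(self, size):
        self.parsed = [[] for _ in range(size)]

    def get(self, slot, keywords_str):
        # Keywords nonempty but vector empty: parse once and remember.
        if keywords_str and not self.parsed[slot]:
            self.parsed[slot] = [int(x) for x in keywords_str.split()]
        return self.parsed[slot]

    def clear(self, slot):
        # Call when the job is removed from the cache.
        self.parsed[slot] = []


cache = JobCacheKeywords(size=4)
ids = cache.get(2, "12 40 7")
```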
remote job submission:
batch can have keywords
each job in batch can have keywords
local job submission:
create_work takes kw ids
(multi-job as well as single)
==================
keyword-based scheduling
submission: batch has keywords
copied to results (new DB field, 256 chars)
each volunteer reports yes/no keywords in sched request
(AM can supply "project opaque" text, passed to projects)
score-based scheduling:
"no" keyword match returns -1
"yes" match adds 1
ensure non-starvation of jobs in cache
add job age to score
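The scoring rule above could look roughly like this. The age term (one point per day in the cache, via age_weight_per_day) is an assumed weighting; the notes only say that job age is added to the score.

```python
def keyword_score(job_keywords, yes_keywords, no_keywords):
    """-1 if any job keyword is a "no" keyword, else the number of "yes" matches."""
    score = 0
    for kw in job_keywords:
        if kw in no_keywords:
            return -1
        if kw in yes_keywords:
            score += 1
    return score


def job_score(job_keywords, yes_keywords, no_keywords, age_seconds,
              age_weight_per_day=1.0):
    """Keyword score plus an age bonus so old jobs in the cache don't starve."""
    s = keyword_score(job_keywords, yes_keywords, no_keywords)
    if s < 0:
        return -1  # vetoed jobs stay vetoed regardless of age
    return s + age_weight_per_day * (age_seconds / 86400.0)


fresh = job_score([1, 2, 3], {1, 3}, {9}, age_seconds=0)
vetoed = job_score([1, 9], {1}, {9}, age_seconds=10 * 86400)
```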
Project keywords include a fraction flag
(estimate fraction of work from that project with that keyword).
100 means all work from project will have that keyword
If a user has a "no" keyword matching a project "all" keyword,
don't attach to that project.
Can compute:
- the fraction of work from project that user can do
- the fraction of work from project that user wants to do
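A sketch of the attach rule and of a bound on the "can do" fraction. It assumes project keywords are stored as keyword ID -> percent of work. Since the overlap between keywords is unknown, the second function gives only an upper bound, not the exact fraction. Names are hypothetical.

```python
def should_attach(project_keywords, user_no_keywords):
    """False if a user "no" keyword covers all (100%) of the project's work."""
    return not any(
        pct >= 100 and kw in user_no_keywords
        for kw, pct in project_keywords.items()
    )


def max_fraction_user_can_do(project_keywords, user_no_keywords):
    """Upper bound on the fraction of the project's work not vetoed by the user."""
    vetoed = [pct for kw, pct in project_keywords.items() if kw in user_no_keywords]
    return 1.0 - max(vetoed, default=0) / 100.0


project = {1: 100, 2: 25}  # keyword 1 on all work, keyword 2 on a quarter of it
attach = should_attach(project, user_no_keywords={2})
bound = max_fraction_user_can_do(project, user_no_keywords={2})
```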
==================
Project plan
- web site for register and prefs
- admin web interface for editing projects, apps, keywords, quotas
- RPC stubs for account creation, setting prefs
- integrated client download
==================
docs
Overview
SU targets a different population than the current BOINC user base:
largely young (20s, 30s)
~half female
primary motivation: science goals
non-motivation: competition
show user their own participation level over time
no comparisons, leader boards
science-oriented
non-technical
don't use words like FLOPS, CPU, or GPU
OK: computer, job
Simplicity
hide information; TMI is a turn-off for target users
hide features
Graphical rather than textual display
Keep the notion of "project" in the background.
SU vs BOINC
Ideally, we'd like an SU-branded version of BOINC.
But until then, we explain that BOINC is software used by SU
instructions for alpha testers
Design doc
==================
testing and roll-out plan
- get scienceunited.berkeley.edu, su.berkeley.edu; get SSL working
- make 7.9 for win/mac;
project list includes SU and boinc test
- announce SU to alpha testers; start testing
When SU is stable:
add SU to project list, non-test
fork/test/release 7.10
------------
project stuff
Github
learn project interface
communication channels
---------------------------
project app version info
need to know whether a host
a) can do work for a project at all
b) can use its GPU(s) for the project
where to store project AV info?
digest projects.xml and for each project produce a list of triples
(platform, GPU type, vbox flag)
store these as JSON, XML, or serialized data
function host_project_compatible() returns whether
- host can do work for project at all
- it can do work w/ vbox
- it can do work using a GPU
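A sketch of host_project_compatible() over the (platform, GPU type, vbox flag) triples. The Host fields and the Compat result type are assumptions about how this info would be stored, and the real function would presumably be PHP.

```python
from typing import NamedTuple, Optional


class AppVersionType(NamedTuple):
    platform: str            # e.g. "windows_x86_64"
    gpu_type: Optional[str]  # e.g. "NVIDIA"; None means CPU-only
    needs_vbox: bool


class Host(NamedTuple):
    platforms: frozenset
    gpu_types: frozenset
    has_vbox: bool


class Compat(NamedTuple):
    can_do_work: bool
    can_use_vbox: bool
    can_use_gpu: bool


def host_project_compatible(host, app_version_types):
    can_do = vbox = gpu = False
    for av in app_version_types:
        if av.platform not in host.platforms:
            continue
        if av.gpu_type is not None and av.gpu_type not in host.gpu_types:
            continue
        if av.needs_vbox and not host.has_vbox:
            continue
        # This app version type is usable on the host.
        can_do = True
        vbox = vbox or av.needs_vbox
        gpu = gpu or av.gpu_type is not None
    return Compat(can_do, vbox, gpu)


host = Host(frozenset({"windows_x86_64"}), frozenset({"NVIDIA"}), has_vbox=False)
avs = [
    AppVersionType("windows_x86_64", None, True),      # needs VBox: unusable here
    AppVersionType("windows_x86_64", "NVIDIA", False),
    AppVersionType("x86_64-pc-linux-gnu", None, False),
]
result = host_project_compatible(host, avs)
```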
-------------------
notion of backup project
Send 1-2 extra projects w/ zero resource share as backups,
in case higher-score project has no work, is down,
doesn't work for this host, etc.
More generally: we should notice if a host can't get work done
for a particular project, and treat it differently.
===========
New allocation model
share/total share = fraction of total FLOPS this project gets
balance = FLOPs owed to this project
daily accounting:
x = total FLOPS in last day
for each project p
p.balance += x*(p.share/total_share)
record balance, share in project accounting record
RPC
for each project p
p.balance -= reported EC
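The accounting above, written out as a small runnable sketch. The Project class and function names are illustrative; x is taken to be the total work done in the last day.

```python
from dataclasses import dataclass


@dataclass
class Project:
    name: str
    share: float
    balance: float = 0.0  # FLOPs owed to this project


def daily_accounting(projects, total_flops_last_day):
    """Credit each project its share of yesterday's total throughput."""
    total_share = sum(p.share for p in projects)
    if total_share <= 0:
        return
    for p in projects:
        p.balance += total_flops_last_day * (p.share / total_share)


def on_rpc(project, reported_ec):
    """Debit work the client reports having done for this project."""
    project.balance -= reported_ec


a = Project("a", share=3)
b = Project("b", share=1)
daily_accounting([a, b], total_flops_last_day=4e15)
on_rpc(a, reported_ec=1e15)
```

After this, a is owed 2e15 and b is owed 1e15, so b is relatively more "owed" per unit of share.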
projected balance
The idea is that there will be a ~24-hour period from
when we start attaching a project to when
the first reports of EC come in, and its balance declines.
During that period we might over-attach it.
So we could maintain a separate "projected balance",
reset periodically to the balance,
and decremented each time we attach the project by,
say, half the estimated EC over the next day
I'm not sure we need this; let's not do it for now.
================
Admin page
30-day graphs
FLOPS
jobs success/fail
#active hosts
#active users
Text:
24-hour accounting
total accounting
links to
project list
-------
project detail page
30-day graphs
FLOPS
jobs success/fail
text:
24-hours
total
================
Use of email
email prefs:
- send status emails: never / weekly / monthly
default: weekly
store this in user.send_email
seti_last_result_time: when to send next email
status email
if some device has been active in last week/month
show totals for that period
for devices not active in that period
show
if no device has ever been active
link to help
=============
Message boards
News
Questions
get help
Problems
report things that are broken
Suggestions
suggest changes or additions
=============
on home page:
"Science projects" links to projects page
user projects page:
show all projects
put ones w/ work at top
fix "since"
new cols:
account status (created, pending, problem, none)
allowed by prefs?
opt-out checkbox
========================
starvation
Some projects may be down, or have no jobs, for arbitrary periods.
If we attach a client to such projects, they'll starve.
This applies only to dynamic AMs;
use "send_rec" flag as proxy for this
Questions:
- When should the client make a "starved" RPC to SU?
use exponential backoff 10 min .. 1 day to prevent flooding SU
when a resource is idle
we know this after doing a rr simulation
write a func to do it fast!
scan minimal # of jobs
keep track of:
first_starved: start of starvation interval, or zero
starved_rpc_backoff: delay between starved RPCs, or zero
starved_rpc_min_time: earliest time to do a starved RPC
if using dynamic AM
every 1 min
check idle rsc
if none
first_starved = 0
else
if first_starved == 0
first_starved = now;
starved_rpc_backoff = 10 min
starved_rpc_min_time = now + 10 min
else if now > starved_rpc_min_time
do AM RPC
starved_rpc_backoff = min(starved_rpc_backoff*2, 1 day)
starved_rpc_min_time = now + starved_rpc_backoff
A starved RPC may not return any new projects (e.g. because of prefs)
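The polling logic above as a runnable sketch (Python here; the client itself is C++). The class name and the poll() interface are made up. It keeps zero as the "not starved" sentinel, so it assumes now is never 0.

```python
from dataclasses import dataclass

MIN_BACKOFF = 10 * 60    # 10 min
MAX_BACKOFF = 24 * 3600  # 1 day


@dataclass
class StarvationState:
    first_starved: float = 0.0         # start of starvation interval, or zero
    starved_rpc_backoff: float = 0.0   # delay between starved RPCs, or zero
    starved_rpc_min_time: float = 0.0  # earliest time to do a starved RPC

    def poll(self, now, have_idle_resource):
        """Call about once a minute; returns True if an AM RPC should be done now."""
        if not have_idle_resource:
            self.first_starved = 0.0
            return False
        if self.first_starved == 0:
            self.first_starved = now
            self.starved_rpc_backoff = MIN_BACKOFF
            self.starved_rpc_min_time = now + MIN_BACKOFF
            return False
        if now > self.starved_rpc_min_time:
            self.starved_rpc_backoff = min(self.starved_rpc_backoff * 2, MAX_BACKOFF)
            self.starved_rpc_min_time = now + self.starved_rpc_backoff
            return True
        return False


s = StarvationState()
first = s.poll(1000, True)    # starvation begins; no RPC yet
second = s.poll(1601, True)   # past the 10-minute mark: do the RPC
```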
- What additional info should be in the AM RPC request?
for each project: flags saying we didn't get work
we expected from this project, namely:
- <sched_req_failed/>: last sched request failed
get this from PROJECT::nrpc_failures
- <sched_req_no_work>rsc</sched_req_no_work>
last sched request requested work for rsc and didn't get any
(not counting "too soon" error)
set this stuff in PROJECT after scheduler req
- What changes to the scheduling algorithm are needed?
in scan over projects:
if last request failed or zero jobs, add to list but
don't mark any resources
if didn't get work for some resources, don't mark those resources
=============================
Anonymous accounts
Project accounts created by SU should be anonymous:
- User name is "Science United user random-str"
- email address is [email protected]
different per project?
I think so.
- password hash is random string (use new crypt code)
Purposes:
- simplify GDPR implementation
- reduce security risk (password guessing)
- eliminate stuff for resolving passwords
- let projects see how much computing comes from SU users
Implementation
DB:
add email_addr, passwd_hash to su_account
Web code:
change account creation
if "email already exists", try new one
remove password resolution code
make script for handling existing accounts
===================
Group accounts
Projects may stop supporting account-create RPC.
So instead manually create accounts on each project.
Name "Science United",
email "[email protected]"
Add to project table
authenticator
weak_authenticator
Populate these for E@h, WCG
When creating account records for these projects,
create them immediately
====================
when a project changes from http: to https:
AM RPC:
if request includes a project w/ http: that is in our DB as https:,
tell it to detach from that project.
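A sketch of the check, assuming project URLs are compared as plain strings and differ only in the scheme. The function name is hypothetical.

```python
def projects_to_detach(request_urls, db_urls):
    """Return http: URLs from the AM RPC request whose https: form is in our DB."""
    db = set(db_urls)
    detach = []
    for url in request_urls:
        if url.startswith("http://"):
            https_url = "https://" + url[len("http://"):]
            if https_url in db:
                detach.append(url)
    return detach


stale = projects_to_detach(
    ["http://a.example/p/", "https://b.example/", "http://c.example/"],
    ["https://a.example/p/", "https://b.example/"],
)
```

Only the first URL is reported: the second is already https:, and the third has no https: counterpart in the DB.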
=================
tables or graphs for progress report:
per month for last X years:
# of active volunteers
# of active hosts
# CPU cores
# GPUs
CPU TeraFLOPS
GPU TeraFLOPS
# jobs