Muskie is implemented as a restify
service, and is deployed highly redundantly, both with a zone and in many
zones. In a typical production deployment, there are 16 muskie processes
per zone. All logging from muskies in a production
setting is done through bunyan to
/var/log/muskie.log
.
At runtime, muskie depends on mahi for
authentication/authorization, the
electric-moray stack for
object traffic, and mako for writing objects.
The same moray instance that
marlin uses is contacted for all APIs
involving job management. Lastly,
medusa is needed for mlogin
functionality.
One additional dependency in muskie is on a Moray shard (by convention the same
one that marlin uses), to store the storage node statvfs
information. Muskie
periodically refreshes this and always selects storage nodes from the local
cache on writes, as opposed to hitting Moray directly for this purpose. This is
purely in-memory so all muskie processes have a cached copy.
There is a second server running in each muskie process to facilitate
monitoring. This is accessible at port + 800
. For example, the first (of 16
in production) muskie process within a zone usually runs on port 8081, so the
monitoring server would be accessible on port 8881 from both the manta
and
admin
networks. The monitoring server exposes Kang debugging information and
Prometheus application metrics.
Kang debugging information is accessible from the route GET /kang/snapshot
.
For more information on Kang, see the documentation on
GitHub.
Application metrics can be retrieved from the route GET /metrics
. The metrics
are returned in Prometheus v0.0.4 text format.
The following metrics are collected:
- Time-to-first-byte latency for all requests
- Count of requests completed
- Count of bytes streamed to and from storage
- Count of object bytes deleted
Each of the metrics returned include the following metadata labels:
- Datacenter name (i.e. us-east-1)
- CN UUID
- Zone UUID
- PID
- Operation (i.e. 'putobject')
- Method (i.e. 'PUT')
- HTTP response status code (i.e. 203)
The metric collection facility provided is intended to be consumed by a monitoring service like a Prometheus or InfluxDB server.
On a PutObject
request, muskie will, in order:
- Lookup metadata for parent directory and current key (if overwrite)
- Ensure the parent directory exists
- If etags, enforce etag
- select random makos
- stream bytes
- save metadata back into electric-moray
Note that there is a slight race where users could theoretically be writing an object while they delete the parent directory, leading to an "orphaned" object, but this could only happen if there were no objects in the directory, and an upload was happening while the user deleted the parent.
Probably the most interesting aspect of PutObject that's hard to figure out from
code inspection is how storage node selection works. In essence, we always
select random "stripes" of hosts (where a stripe is as wide as
durability-level
), across datacenters (if a multi-DC deployment) that have
heartbeated in the last N seconds and have enough space. Muskie then
synchronously starts requests to all of the servers in the stripe, and if
any fail, it abandons and moves to the next stripe. Assuming 1 of the stripes
has 100% connect success rate, muskie completes the request. If all the stripes
are exhausted, muskie returns an HTTP 503 error code. To make a practical
example, assume the following topology of storage nodes:
DC1 | DC2 | DC3
-----+-----+-----
A,D,G|B,E,H|C,F,I
On a Put request, muskie might select the following configurations (totally random):
1. B,C
2. I,A,
3. E,D
4. G,F
And supposing C
and A
are somehow down, then muskie would end up writing the
user data to pair E,D
.
On a GET request, muskie will go fetch the metadata for the record from electric-moray and then start requests to all of the storage nodes that hold the object. The first one to respond is the one that ultimately streams back to the user, and the other requests are abandoned.
Deletion of objects is actually very simple - muskie simply deletes the metadata record in moray. Actual blobs of data are asynchronously garbage collected. See manta-mola for more information.
Creating or updating a directory in Manta can be thought of a special case of creating an object where all aspects of streaming the user byte stream are ignored. But otherwise muskie does the same metadata lookups and enforcement, and then simply saves the directory record into Moray.
Additionally, directories are setup with postgres triggers to maintain a count
of entries (this is returned in the result-set-size
HTTP header). See
moray.js
in node-libmanta for the
actual meat, but this allows muskie to not ask moray to do SELECT count(*)
,
which is known to be very slow at scale in postgres.
Listing a directory in muskie is just a bounded listing of entries in moray
(e.g. a findobjects
) request. The manta_directory_counts
(trigger-maintained) bucket is used to fill in result-set-size
.
Deletion of a directory simply ensures there are no children and then drops the dirent; again there is a tiny race here as this spans shards.