Fix for element types where size can't be defined (#24)
* Fix for element types where size can't be defined

* Add more documentation

* fix badges

* Improve coverage

* More tests

* Bump version
meggart authored Jan 6, 2021
1 parent 3b7eddd commit 317a6fb
Showing 8 changed files with 117 additions and 103 deletions.
48 changes: 48 additions & 0 deletions .github/workflows/ci.yml
@@ -0,0 +1,48 @@
name: CI
on:
  pull_request:
    branches:
      - master
  push:
    branches:
      - master
    tags: '*'
jobs:
  test:
    name: Julia ${{ matrix.version }} - ${{ matrix.os }} - ${{ matrix.arch }} - ${{ github.event_name }}
    runs-on: ${{ matrix.os }}
    strategy:
      fail-fast: false
      matrix:
        version:
          - '1.2' # Replace this with the minimum Julia version that your package supports. E.g. if your package requires Julia 1.5 or higher, change this to '1.5'.
          - '1' # Leave this line unchanged. '1' will automatically expand to the latest stable 1.x release of Julia.
          - 'nightly'
        os:
          - ubuntu-latest
          - windows-latest
          - macos-latest
        arch:
          - x64
    steps:
      - uses: actions/checkout@v2
      - uses: julia-actions/setup-julia@v1
        with:
          version: ${{ matrix.version }}
          arch: ${{ matrix.arch }}
      - uses: actions/cache@v1
        env:
          cache-name: cache-artifacts
        with:
          path: ~/.julia/artifacts
          key: ${{ runner.os }}-test-${{ env.cache-name }}-${{ hashFiles('**/Project.toml') }}
          restore-keys: |
            ${{ runner.os }}-test-${{ env.cache-name }}-
            ${{ runner.os }}-test-
            ${{ runner.os }}-
      - uses: julia-actions/julia-buildpkg@v1
      - uses: julia-actions/julia-runtest@v1
      - uses: julia-actions/julia-processcoverage@v1
      - uses: codecov/codecov-action@v1
        with:
          file: lcov.info
22 changes: 0 additions & 22 deletions .gitlab-ci.yml

This file was deleted.

34 changes: 0 additions & 34 deletions .travis.yml

This file was deleted.

2 changes: 1 addition & 1 deletion Project.toml
@@ -1,7 +1,7 @@
name = "DiskArrays"
uuid = "3c3547ce-8d99-4f5e-a174-61eb10b00ae3"
authors = ["Fabian Gans <[email protected]>"]
-version = "0.2.6"
+version = "0.2.7"

[compat]
julia = "1.0"
14 changes: 12 additions & 2 deletions README.md
@@ -5,8 +5,8 @@
![Lifecycle](https://img.shields.io/badge/lifecycle-retired-orange.svg)
![Lifecycle](https://img.shields.io/badge/lifecycle-archived-red.svg)
![Lifecycle](https://img.shields.io/badge/lifecycle-dormant-blue.svg) -->
-[![Build Status](https://travis-ci.com/meggart/DiskArrays.jl.svg?branch=master)](https://travis-ci.com/meggart/DiskArrays.jl)
-[![codecov.io](http://codecov.io/github/meggart/DiskArrays.jl/coverage.svg?branch=master)](http://codecov.io/github/meggart/DiskArrays.jl?branch=master)
+[![Build Status][ci-img]][ci-url]
+[![codecov.io][codecov-img]][codecov-url]

This package is an attempt to collect utilities for working with n-dimensional array-like data
structures that have considerable overhead for single read operations. Most important
@@ -162,3 +162,13 @@ There are situations where one wants to read every other value along a certain a
In this case a backend can define `readblock!(a,aout,r::OrdinalRange...)` and the respective `writeblock!`
method, which will override the fallback behavior that would read the whole block of data and only return
the desired range.

## Arrays that do not implement eachchunk

Some arrays live on disk but are not split into rectangular chunks, so the `haschunks` trait returns `Unchunked()`. To still enable broadcasting and reductions for these arrays, a chunk size is estimated such that a memory limit per chunk is not exceeded. This limit defaults to 100MB and can be modified by changing `DiskArrays.default_chunk_size[]`. The chunk size is then computed from the element size of the array. However, there are cases where the size of the element type is undefined, e.g. for Strings or variable-length vectors. In these cases one can overload the `DiskArrays.element_size` function for specific container types so that it returns an approximate element size (in bytes). Otherwise the element size is assumed to equal the value stored in `DiskArrays.fallback_element_size[]`, which defaults to 100 bytes.
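
As a minimal sketch of such an overload, the snippet below mirrors the fallback logic of `element_size` from this commit in a standalone function, so that it runs without the package. The container type `FixedWidthStrings` and its `width` field are hypothetical names introduced only for illustration. With the package loaded, the method would presumably be added to `DiskArrays.element_size` rather than to a local stand-in.

```julia
# Hypothetical container (not part of DiskArrays): strings known to
# occupy at most `width` bytes each.
struct FixedWidthStrings <: AbstractArray{String,1}
    data::Vector{String}
    width::Int
end
Base.size(a::FixedWidthStrings) = size(a.data)
Base.getindex(a::FixedWidthStrings, i::Int) = a.data[i]

# Local stand-ins that mirror the names used in this commit.
const fallback_element_size = Ref(100)   # bytes

function element_size(a::AbstractArray)
    T = eltype(a)
    if isbitstype(T)
        return sizeof(T)
    elseif isbitstype(Base.nonmissingtype(T))
        return sizeof(Base.nonmissingtype(T))
    else
        # Size is undefined for e.g. String, so use the configurable fallback.
        return fallback_element_size[]
    end
end

# The overload: this container knows its own approximate element size.
element_size(a::FixedWidthStrings) = a.width

element_size(rand(Float32, 3))                    # sizeof(Float32)
element_size(["a", "b"])                          # fallback value
element_size(FixedWidthStrings(["ab", "cd"], 2))  # value of `width`
```

Dispatching on the container type rather than on the element type lets two arrays with the same `eltype` report different approximate sizes.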


[ci-img]: https://github.com/meggart/DiskArrays.jl/workflows/CI/badge.svg
[ci-url]: https://github.com/meggart/DiskArrays.jl/actions?query=workflow%3ACI
[codecov-img]: http://codecov.io/github/meggart/DiskArrays.jl/coverage.svg?branch=master
[codecov-url]: http://codecov.io/github/meggart/DiskArrays.jl?branch=master
43 changes: 0 additions & 43 deletions appveyor.yml

This file was deleted.

27 changes: 26 additions & 1 deletion src/chunks.jl
@@ -47,8 +47,15 @@ function Base.iterate(g::GridChunks, state)
end

#Define the approx default maximum chunk size (in MB)
"The target chunk size for processing for unchunked arrays in MB, defaults to 100MB"
const default_chunk_size = Ref(100)

"""
A fallback element size (in bytes) used to estimate a chunk size for arrays whose
elements have unknown size, like strings. Defaults to 100 bytes
"""
const fallback_element_size = Ref(100)

#Here we implement a fallback chunking for a DiskArray although this should normally
#be over-ridden by the package that implements the interface

@@ -62,7 +69,25 @@ struct Unchunked end
function haschunks end
haschunks(x) = Unchunked()

-estimate_chunksize(a::AbstractArray) = estimate_chunksize(size(a), sizeof(eltype(a)))
"""
    element_size(a::AbstractArray)

Returns the approximate size of an element of a in bytes. This falls back to calling `sizeof` on
the element type or to the value stored in `DiskArrays.fallback_element_size`. Methods can be added for
custom containers.
"""
function element_size(a::AbstractArray)
    if isbitstype(eltype(a))
        return sizeof(eltype(a))
    elseif isbitstype(Base.nonmissingtype(eltype(a)))
        return sizeof(Base.nonmissingtype(eltype(a)))
    else
        @warn "Can not determine size of element type. Using DiskArrays.fallback_element_size[] = $(fallback_element_size[]) bytes"
        return fallback_element_size[]
    end
end

estimate_chunksize(a::AbstractArray) = estimate_chunksize(size(a), element_size(a))
function estimate_chunksize(s, si)
    ii = searchsortedfirst(cumprod(collect(s)),default_chunk_size[]*1e6/si)
    ntuple(length(s)) do idim
30 changes: 30 additions & 0 deletions test/runtests.jl
@@ -229,4 +229,34 @@ import Base.PermutedDimsArrays.invperm
a_disk1 = permutedims(_DiskArray(rand(9,2,10), chunksize=(3,2,5)),p)
test_broadcast(a_disk1)
end

@testset "Unchunked String arrays" begin
    a = reshape(1:200000,200,1000)
    b = string.(a)
    c = collect(Union{Int,Missing},a)

    DiskArrays.default_chunk_size[] = 100
    DiskArrays.fallback_element_size[] = 100
    @test DiskArrays.estimate_chunksize(a) == (200,1000)
    @test DiskArrays.eachchunk(a) == DiskArrays.GridChunks(a,(200,1000))
    @test DiskArrays.estimate_chunksize(b) == (200,1000)
    @test DiskArrays.eachchunk(b) == DiskArrays.GridChunks(b,(200,1000))
    @test DiskArrays.estimate_chunksize(c) == (200,1000)
    @test DiskArrays.eachchunk(c) == DiskArrays.GridChunks(c,(200,1000))
    DiskArrays.default_chunk_size[] = 1
    @test DiskArrays.estimate_chunksize(a) == (200,625)
    @test DiskArrays.eachchunk(a) == DiskArrays.GridChunks(a,(200,625))
    @test DiskArrays.estimate_chunksize(b) == (200,50)
    @test DiskArrays.eachchunk(b) == DiskArrays.GridChunks(b,(200,50))
    @test DiskArrays.estimate_chunksize(c) == (200,625)
    @test DiskArrays.eachchunk(c) == DiskArrays.GridChunks(c,(200,625))
    DiskArrays.fallback_element_size[] = 1000
    @test DiskArrays.estimate_chunksize(a) == (200,625)
    @test DiskArrays.eachchunk(a) == DiskArrays.GridChunks(a,(200,625))
    @test DiskArrays.estimate_chunksize(b) == (200,5)
    @test DiskArrays.eachchunk(b) == DiskArrays.GridChunks(b,(200,5))
    @test DiskArrays.estimate_chunksize(c) == (200,625)
    @test DiskArrays.eachchunk(c) == DiskArrays.GridChunks(c,(200,625))
end

end
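
The expected chunk sizes in the new testset follow from simple arithmetic: the memory limit in MB is converted to a number of elements, and leading dimensions are kept whole until that budget is exhausted. The sketch below reproduces those numbers in a standalone function. The diff shows only the first two lines of the body of `estimate_chunksize`, so the branch logic here is an assumption chosen to match the test expectations, and `sketch_chunksize` with its `limit_mb` keyword is a hypothetical name rather than part of the package.

```julia
# Hypothetical helper: chunk shape for an array of size `s` whose elements
# take `elsize` bytes, given a memory limit of `limit_mb` megabytes per chunk.
function sketch_chunksize(s::Tuple, elsize::Integer; limit_mb = 100)
    budget = limit_mb * 1e6 / elsize                     # elements per chunk
    # First dimension at which the cumulative element count reaches the budget.
    ii = searchsortedfirst(cumprod(collect(s)), budget)
    ntuple(length(s)) do idim
        if idim < ii
            s[idim]                                      # dimension fits entirely
        elseif idim > ii
            1                                            # beyond the budget
        else
            before = idim == 1 ? 1 : prod(s[1:idim-1])
            floor(Int, budget / before)                  # split this dimension
        end
    end
end

# 200x1000 Int64 array, 100 MB limit: 1.25e7 elements allowed, array has 2e5.
sketch_chunksize((200, 1000), 8)                    # whole array
# 1 MB limit, 8-byte elements: 125000 elements, 125000 / 200 = 625 columns.
sketch_chunksize((200, 1000), 8; limit_mb = 1)
# 1 MB limit, 100-byte fallback: 10000 elements, 10000 / 200 = 50 columns.
sketch_chunksize((200, 1000), 100; limit_mb = 1)
# 1 MB limit, 1000-byte fallback: 1000 elements, 1000 / 200 = 5 columns.
sketch_chunksize((200, 1000), 1000; limit_mb = 1)
```

The `Union{Int,Missing}` array `c` gets the same chunks as `a` because `element_size` uses the size of the non-missing type, which is 8 bytes in both cases.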

2 comments on commit 317a6fb

@meggart
Collaborator Author

@meggart meggart commented on 317a6fb Jan 6, 2021

@JuliaRegistrator
Registration pull request created: JuliaRegistries/General/27430

After the above pull request is merged, it is recommended that a tag is created on this repository for the registered package version.

This will be done automatically if the Julia TagBot GitHub Action is installed, or can be done manually through the github interface, or via:

git tag -a v0.2.7 -m "<description of version>" 317a6fb96dc17b74522e5f1f74c34bd7bbfa32f3
git push origin v0.2.7
