client.WriteSeries returns: Server returned (400): IO error: /opt/influxdb/shared/data/db/shard_db_v2/00190/MANIFEST-000006: No such file or directory #985
Conversation
this might also be useful
When I manually retry the same write later, it works fine. So maybe it uses another dir then, or it was a race condition between creating the dir and trying to use it?
Today I got another, slightly more exotic variant of this:
Nothing in particular in /var/log/messages or dmesg, and plenty of space and inodes available. Again, resuming my program where it left off (i.e. doing the same write that failed) seems to work fine.
Similar to #1009 and #1013. This is caused by concurrently closing a shard and opening it at the same time; that operation needs to be goroutine-safe. I'm not sure why the shards are being dropped, though. @Dieterbe, are you trying to write points in the past, or is the data collection lagging behind?
Yes, this is an import of old data, with timestamps anywhere between 2 years ago and now.
What's the retention and duration of those shards?
Not sure, I have recreated the db a couple of times in the meantime. I think I've usually kept the shard duration at 7d; retention was probably 365 or 730 days. (It's possible that some of the points being written have timestamps older than what the shard cares about.)
Cool, just wanted to make sure my guess makes sense.
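To illustrate the concurrent close/open race described a few comments above, here is a minimal sketch of a goroutine-safe shard registry; the types and method names are hypothetical, not InfluxDB's actual code:

```go
// Minimal sketch of a goroutine-safe shard registry, illustrating the
// concurrent close/open problem described above. Names are hypothetical,
// not InfluxDB's actual code.
package shards

import "sync"

type Shard struct {
	ID     uint32
	closed bool
}

type ShardStore struct {
	mu     sync.Mutex
	shards map[uint32]*Shard
}

func NewShardStore() *ShardStore {
	return &ShardStore{shards: make(map[uint32]*Shard)}
}

// GetOrCreateShard returns the open shard for id, creating it if needed.
// Holding the mutex prevents racing with a concurrent DropShard, so a write
// cannot be handed a shard while its files are being removed.
func (s *ShardStore) GetOrCreateShard(id uint32) *Shard {
	s.mu.Lock()
	defer s.mu.Unlock()
	if sh, ok := s.shards[id]; ok && !sh.closed {
		return sh
	}
	sh := &Shard{ID: id}
	s.shards[id] = sh
	return sh
}

// DropShard closes and removes the shard under the same lock. Without a
// reference count (see the commit below), a writer that obtained the shard
// earlier can still hit "No such file or directory" once the files are gone.
func (s *ShardStore) DropShard(id uint32) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if sh, ok := s.shards[id]; ok {
		sh.closed = true
		delete(s.shards, id)
		// remove the shard's on-disk directory here
	}
}
```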
* shard_datastore.go (Deleteshard): Check the reference count of the shard and mark it for deletion if there are still more references out there. Otherwise, delete the shard immediately. Also refactor the deletion code into deleteShard(); see below.
* shard_datastore.go (ReturnShard): Check to see if the shard is marked for deletion.
* shard_datastore.go (deleteShard): Refactor the code that used to be in Deleteshard into its own method. Use `closeShard` instead of doing the cleanup ourselves.
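A rough sketch of the reference-counting scheme this commit message describes; the types, fields, and method bodies below are assumptions for illustration, not the actual shard_datastore.go implementation:

```go
// Rough sketch of the reference-counting scheme described in the commit
// message above. Types, fields, and bodies are assumptions, not the actual
// shard_datastore.go code.
package shards

import "sync"

type countedShard struct {
	refCount          int // incremented by an acquire path (not shown), decremented by ReturnShard
	markedForDeletion bool
}

type Datastore struct {
	mu     sync.Mutex
	shards map[uint32]*countedShard
}

// ReturnShard is called when a caller is done with a shard. If the shard was
// marked for deletion while still in use, it is deleted once the last
// reference comes back.
func (d *Datastore) ReturnShard(id uint32) {
	d.mu.Lock()
	defer d.mu.Unlock()
	sh, ok := d.shards[id]
	if !ok {
		return
	}
	sh.refCount--
	if sh.refCount <= 0 && sh.markedForDeletion {
		d.deleteShard(id)
	}
}

// DeleteShard deletes the shard immediately when nobody holds a reference;
// otherwise it only marks the shard so the last ReturnShard cleans it up.
func (d *Datastore) DeleteShard(id uint32) {
	d.mu.Lock()
	defer d.mu.Unlock()
	sh, ok := d.shards[id]
	if !ok {
		return
	}
	if sh.refCount > 0 {
		sh.markedForDeletion = true
		return
	}
	d.deleteShard(id)
}

// deleteShard performs the actual cleanup; callers must hold d.mu.
func (d *Datastore) deleteShard(id uint32) {
	delete(d.shards, id)
	// close the underlying storage and remove the on-disk directory here
}
```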
Looks good to me.
lgtm
Background of the bug: prior to this patch we actually tried writing points that were older than the retention period of the shard. This caused a race condition when it came to writing points to a shard that's being dropped, which will happen frequently if the user is loading old data (by accident). This is demonstrated in the test in this commit. This bug was previously addressed in #985, but it turns out the fix for #985 wasn't enough. A user reported in #1078 that some shards are left behind and not deleted. It turns out that while the shard is being dropped, more write requests could come in and end up on line `cluster/shard.go:195`, which will cause the datastore to create a shard on disk that isn't tracked anywhere in the metadata. This shard will live forever and never get deleted. This fix addresses the issue by not writing old points at all; there are still some edge cases with the current implementation, but at least it's not as bad as current master. Close #1078
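The core of this fix is refusing points older than the shard space's retention period before they ever reach a shard. A minimal sketch of that check, with a hypothetical function name and signature:

```go
// Hypothetical sketch of the idea behind the fix: drop points that fall
// outside the retention window before they ever reach a shard, so no write
// targets a shard that is about to be (or has already been) dropped.
package shards

import "time"

// filterExpiredPoints keeps only timestamps newer than now minus the
// retention period. The function name and signature are illustrative.
func filterExpiredPoints(timestamps []time.Time, retention time.Duration, now time.Time) []time.Time {
	cutoff := now.Add(-retention)
	kept := make([]time.Time, 0, len(timestamps))
	for _, ts := range timestamps {
		if ts.After(cutoff) {
			kept = append(kept, ts)
		}
	}
	return kept
}
```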
Using https://github.com/vimeo/whisper-to-influxdb/, which invokes
influxClient.WriteSeriesWithTimePrecision(toCommit, client.Second)
to write a series called "servers.dfvimeostatsd1.diskspace.root.inodes_free" with 60643 records in (time, sequence_number, value) format to my graphite database, which I recreated from scratch yesterday after I upgraded.
I got this response:
My influxdb is 0.8.3 with debug logging enabled, but the log only contains messages matching (GraphiteServer committing|Executing leader loop|Dumping the cluster config|Testing if we should|Checking for shards to drop), no other messages. I also checked dmesg, no errors there; ditto for /var/log/messages, nothing useful there.
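For reference, the failing write path roughly corresponds to the following use of the 0.8-era Go client. Only `WriteSeriesWithTimePrecision` and `client.Second` appear in the report above; the constructor, config fields, and sample point are assumptions for illustration, not taken from whisper-to-influxdb:

```go
// Rough sketch of the failing write path using the 0.8-era Go client
// (github.com/influxdb/influxdb/client); details are assumptions, only
// WriteSeriesWithTimePrecision and client.Second come from the report.
package main

import (
	"log"

	"github.com/influxdb/influxdb/client"
)

func main() {
	c, err := client.NewClient(&client.ClientConfig{
		Host:     "localhost:8086",
		Username: "root",
		Password: "root",
		Database: "graphite",
	})
	if err != nil {
		log.Fatal(err)
	}

	// One of ~60k (time, sequence_number, value) records being imported;
	// the timestamp may be up to two years in the past.
	toCommit := []*client.Series{{
		Name:    "servers.dfvimeostatsd1.diskspace.root.inodes_free",
		Columns: []string{"time", "sequence_number", "value"},
		Points:  [][]interface{}{{1346025600, 1, 1000000.0}},
	}}

	// Timestamps are given in seconds, hence client.Second.
	if err := c.WriteSeriesWithTimePrecision(toCommit, client.Second); err != nil {
		// e.g. "Server returned (400): IO error: ... MANIFEST-000006: No such file or directory"
		log.Fatal(err)
	}
}
```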