ramfjord_tgrep -- challenge question 2 for the reddit application
My version of tgrep uses an interpolated search: it estimates the distance
in the file to the start time, jumps that far, checks, and then either
re-estimates or starts a linear search. This is basically what you do when
you look a word up in a real dictionary. While estimating the distance, I
assumed that the logs were uniformly distributed across all the time
periods. This worked well with the log-generator script, but I suspect it
would be a little worse in practice. (A rough sketch of the idea appears
under SKETCH at the end of this README.)

RUN TIME:

The average-case run time depends on the distribution of the logs. If it
is normal, the run time becomes constant as the file grows (the estimate
gets more and more accurate). The worst case, however, is O(n), i.e.
linear: that would be a file composed almost entirely of logs from a
single second.

On the test file, the standard tgrep always showed 3-4 lseeks. Considering
that 1 is used to find the last time in the file, and 1 is used to go back
to where we just read after getting close enough, that's not bad. I believe
I know why I sometimes didn't get 3, but it wasn't worth fixing, and it's
too long to explain here -- you can ask me about it in my interview. As the
file size increased, the number of lseeks did not go up, because growing
the file just made the distribution more normal.

In addition to the interpolated search, there is a real binary search that
you can run by adding "-b". Its worst-case and average-case run times are
both O(log n). Because log(n) goes up sooooo slowly with n, the
interpolated search's constant time really isn't much better. Note that the
binary search would still technically be O(n) if literally the whole
document except the last entry consisted of the second before the last
entry. I debated making this the default version, but I think the
interpolated search will be slightly faster on real log files. It does, of
course, run faster on the generated log files, because they are normally
distributed. On the test file, the binary search usually showed several
more lseeks; the max I got was 19, on a 3 GB file. That probably seems like
a lot more, but it's still a trivial amount.

ISSUES:

This will probably not work on files with multiple midnight crossings. I
essentially use a boolean variable to keep track of the day, because I
didn't want to write modular arithmetic for days of the month. It would be
relatively easy to change that if necessary, though.

If you enter a time range that crosses the beginning/end of the file, you
get an error message, unless you add the parameter "-s" or "-e":

    -s: truncates the [s]tarting time you entered to the first time in the file
    -e: truncates the [e]nding time you entered to the end time of the file

Ex: file goes from 7 am Monday to 8 am Tuesday

    $ tgrep 5-9     - error
    $ tgrep 5-9 -s  - logs from 7-9 am Monday
    $ tgrep 5-9 -e  - logs from 5-8 am Tuesday

If you enter a time range that occurs multiple times in a single file,
tgrep will always display the first occurrence. This won't happen if the
file actually covers 24 hours. I could add an option to change this, but I
don't know if it's necessary.

I assumed that a log line can't be larger than 1000 characters; if one is,
the search might fail. An easy way to fix this would be to increase the
values MAX_LINE_LEN and SEARCH_B_SIZE in tgrep.h. Just make sure
MAX_LINE_LEN is never > SEARCH_B_SIZE.
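For reference, those two constants might look something like this in
tgrep.h (the numbers below are illustrative guesses, not necessarily the
values actually in the repo):

    #define MAX_LINE_LEN  1000    /* longest log line the search handles */
    #define SEARCH_B_SIZE 4096    /* bytes read around each lseek; keep >= MAX_LINE_LEN */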
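SKETCH:

For the curious, here is a minimal sketch in C of the interpolated search
described at the top. This is NOT the code in this repo: the "HH:MM:SS"
timestamp format, the function names, and the LINEAR_THRESHOLD cutoff are
all assumptions made for illustration.

    #include <stdio.h>

    #define BUF_SIZE 4096          /* must hold at least one full log line */
    #define LINEAR_THRESHOLD 8192  /* below this, just scan linearly */

    /* Assumed format: lines start with "HH:MM:SS". Returns seconds since
     * midnight, or -1 on a parse failure. (Midnight crossings are ignored
     * here, per the ISSUES section.) */
    static long parse_time(const char *line) {
        int h, m, s;
        if (sscanf(line, "%d:%d:%d", &h, &m, &s) != 3)
            return -1;
        return h * 3600L + m * 60L + s;
    }

    /* Seek near off, discard the partial line we land in, and read the
     * timestamp of the next complete line. Returns that line's offset. */
    static long time_at(FILE *fp, long off, long *tstamp) {
        char buf[BUF_SIZE];
        fseek(fp, off, SEEK_SET);
        if (off > 0)
            fgets(buf, sizeof buf, fp);
        long start = ftell(fp);
        if (fgets(buf, sizeof buf, fp) == NULL)
            return -1;
        *tstamp = parse_time(buf);
        return start;
    }

    /* Return the offset of the first line with time >= target. lo must be
     * the offset of a line start (0 works); lo_t/hi_t are the times already
     * read at lo/hi. Uses long offsets for brevity; a 3 GB file would need
     * fseeko()/off_t. */
    long interp_search(FILE *fp, long target,
                       long lo, long lo_t, long hi, long hi_t) {
        while (hi - lo > LINEAR_THRESHOLD && hi_t > lo_t) {
            /* Estimate the offset as if lines were spread uniformly in
             * time -- the same assumption described at the top. */
            long guess = lo + (long)((double)(target - lo_t)
                                     / (hi_t - lo_t) * (hi - lo));
            if (guess <= lo) guess = lo + 1;
            if (guess >= hi) guess = hi - 1;

            long t, line_off = time_at(fp, guess, &t);
            if (line_off < 0 || t < 0)
                break;                      /* parse trouble: fall back */
            if (t < target) {
                if (line_off <= lo) break;  /* no progress: fall back */
                lo = line_off; lo_t = t;
            } else {
                if (line_off >= hi) break;
                hi = line_off; hi_t = t;
            }
        }
        /* Bracket is small (or degenerate): finish with a linear scan. */
        char buf[BUF_SIZE];
        fseek(fp, lo, SEEK_SET);
        long off = lo;
        while (fgets(buf, sizeof buf, fp)) {
            long t = parse_time(buf);
            if (t >= 0 && t >= target)
                return off;
            off = ftell(fp);
        }
        return -1;                          /* target is past the end */
    }

A caller would seed lo with 0 and hi with the file length, with lo_t/hi_t
read from the first and last lines of the file. Replacing the guess
computation with the midpoint (lo + hi) / 2 turns this into the "-b"
binary-search variant.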