- A DemandCube Project
NeverwinterDP the Big Data Pipeline for Hadoop
- Mailing List
- IRC channel #neverwinterdp on irc.freenode.net
- [Google Hangout](http://www.google.com/hangouts/)
- Committers use it to chat directly, but you have to connect with us on the mailing list first, then request an invite there.
- [List of Issues](https://github.com/DemandCube/NeverwinterDP/issues?labels=&state=open)
- [Kanban Board of Issues](https://huboard.com/DemandCube/NeverwinterDP/#/)
Neverwinter is an open source distributed data ingestion system/framework for capturing large amounts of data (ranging from gigabytes to petabytes) to be processed or saved in real time to one or more downstream databases or repositories (e.g. Hadoop, HDFS, S3, MySQL, HBase, Storm).
Neverwinter was designed and written from the ground up for reliability, scalability, and operational maintainability, to meet the growing needs of event and message data collection at scale in startups and enterprise organizations.
Neverwinter is the real-time log/data pipeline. It combines Sparkngin (Nginx), Kafka, and Scribengin, leverages processing in Hadoop and Storm, and integrates with Logstash, Ganglia, and Nagios. It is a replacement for Flume, but it can also be integrated with it.
It Supports:
- Batch Analytics
- Realtime Analytics
Neverwinter is the combination of three major open source projects that leverage the best in open source.
- Sparkngin (powered by Nginx)
- Kafka
- Scribengin
Now that we have used enough buzzwords: Neverwinter reliably captures lots of data and saves it to Hadoop and other systems.
Neverwinter allows data ingestion from any system that can emit HTTP/REST (or ZeroMQ) calls, and then publishes this data to a downstream database, including Hive, HBase, relational databases, or even proprietary data stores. A single Neverwinter pipeline can combine data from multiple sources and deliver it to multiple destinations, so the same data can reach multiple teams or an entire organization.
Neverwinter is targeted at data and analytics engineering teams who expect response times ranging from sub-second to minutes. Neverwinter breaks the false choice between a batch and a real-time system, and the false choice between a fast and a maintainable system.
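The multiple-sources-to-multiple-destinations idea can be pictured with a small in-memory sketch. This is not Neverwinter code: the `Pipeline` class, its method names, and the topic and source names are hypothetical, and a plain Python list stands in for the persistent queue that Kafka would provide.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Sink = Callable[[dict], None]


class Pipeline:
    """Toy in-memory stand-in for the endpoint -> queue -> distributor flow."""

    def __init__(self) -> None:
        self._buffer: List[dict] = []                           # stands in for the persistent queue
        self._sinks: Dict[str, List[Sink]] = defaultdict(list)  # topic -> registered sinks

    def add_sink(self, topic: str, sink: Sink) -> None:
        self._sinks[topic].append(sink)

    def ingest(self, source: str, topic: str, payload: dict) -> None:
        # Any number of sources can write into the same buffer (fan-in).
        self._buffer.append({"source": source, "topic": topic, "payload": payload})

    def drain(self) -> int:
        # Deliver every buffered event to every sink registered for its topic (fan-out).
        delivered = 0
        while self._buffer:
            event = self._buffer.pop(0)
            for sink in self._sinks[event["topic"]]:
                sink(event)
                delivered += 1
        return delivered


# Two lists play the role of two downstream stores (hypothetical names).
hdfs_like: List[dict] = []
hbase_like: List[dict] = []

pipeline = Pipeline()
pipeline.add_sink("clicks", hdfs_like.append)
pipeline.add_sink("clicks", hbase_like.append)
pipeline.ingest("web-1", "clicks", {"page": "/home"})
pipeline.ingest("mobile-1", "clicks", {"page": "/cart"})
delivered = pipeline.drain()
```

Both sinks end up with both events, so two sources feed two destinations through one buffer. A real deployment would replace the list with a durable queue so that a slow or failed sink does not lose data.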
Example - [Twitch's Data Pipeline](http://ossareh.posthaven.com/the-twitch-statistics-pipeline)
- Http/Rest/ZeroMQ Log Collection Endpoint - Sparkngin
- Data Bus
- Kafka or Amazon Kinesis (Version 2)
- Data Pump/Transport
- vagrant-flow: Vagrant plugin that allows for a better Ansible flow; it also generates Ansible inventory files and runs playbooks
- ansible-flow: Ansible modules to make working with Ansible easier
- DeveloperPlayBooks: Ansible playbooks to quickly set up and install DemandCube projects
- DemandSpike: Load testing framework for distributed applications
- KafkaSphere: Web console for Kafka that has a web tier that talks to a REST API tier
- High Level
+-----------+    +-------------+    +---------------+    +------------+    +------------+
|Source     |    |Rest         |    |Persistent     |    |Data        |    |Data        |
| Client    |+-->| Endpoint    |+-->| Queue/Buffer  |+-->| Distributor|+-->| Sink       |
|           |    |(Sparkngin)  |    |(Kafka/Kinesis)|    |(Scribengin)|    |(Hive/HBase)|
+-----------+    +-------------+    +---------------+    +------------+    +------------+
- Mid Level
- Source Client
- Collector Endpoint
- Collector Producer
- Persistent Queue/Buffer
- Data Distributor / Stream Processor (CEP - Complex Event Processing)
- Data Sink
- Binary
- Framework
- Schema
- Encryption
- Binary
- This level transports any arbitrary data
- Framework
- This level transports any data, wrapped in the fields the framework needs for monitoring
- Schema
- This level adds schemas to the data being transported in the framework layer
- Encryption
- This level adds encryption around the data in the schema or framework transport
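One way to picture how the four levels nest is the sketch below. The field names (`source`, `timestamp`, `size`, `body`, `schema`), the JSON and base64 encodings, and the schema id are all assumptions made for illustration; the project notes elsewhere mention Avro, Protocol Buffers, or Thrift for the envelope. The `encrypt` argument is a placeholder hook, not a real cipher.

```python
import base64
import json
import time
from typing import Callable, Optional


def wrap_framework(raw: bytes, source: str, now: Optional[float] = None) -> dict:
    """Framework level: wrap opaque bytes in fields used for monitoring."""
    return {
        "source": source,
        "timestamp": time.time() if now is None else now,
        "size": len(raw),
        "body": base64.b64encode(raw).decode("ascii"),
    }


def wrap_schema(envelope: dict, schema_id: str) -> dict:
    """Schema level: tag the envelope with the schema that describes its body."""
    return {"schema": schema_id, "envelope": envelope}


def wrap_encrypted(message: dict, encrypt: Callable[[bytes], bytes]) -> bytes:
    """Encryption level: serialize, then hand the bytes to a caller-supplied cipher."""
    return encrypt(json.dumps(message, sort_keys=True).encode("utf-8"))


raw = b"GET /index.html 200"                      # Binary level: arbitrary bytes
framed = wrap_framework(raw, source="web-1", now=1400000000.0)
typed = wrap_schema(framed, schema_id="access_log.v1")
# Identity function used as a placeholder; a real deployment would pass a real cipher.
sealed = wrap_encrypted(typed, encrypt=lambda data: data)
```

Each level only adds a wrapper around the one below it, so a consumer that understands just the framework level can still monitor traffic without knowing the schema.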
There are many ways you can contribute towards the project. A few of these are:
Jump in on discussions: It is possible that someone initiates a thread on the Mailing List describing a problem that you have dealt with in the past. You can help the project by chiming in on that thread and guiding that user to overcome or workaround that problem or limitation.
File Bugs: If you notice a problem and are sure it is a bug, then go ahead and file a GitHub Issue. If however, you are not very sure that it is a bug, you should first confirm it by discussing it on the Mailing List.
Review Code: If you see that a GitHub Issue has a "patch available" status, go ahead and review it. The other way, and the preferred one, is to review code submitted with a pull request. It cannot be stressed enough that you must be kind in your review and explain the rationale for your feedback and suggestions. Also note that not all review feedback is accepted; often it is a compromise between the contributor and the reviewer. If you are happy with the change and do not spot any major issues, then +1 it.
Provide Patches: We encourage you to assign the relevant GitHub Issue to yourself and supply a patch or pull request for it. The patch you provide can be code, documentation, tests, configs, build changes, or any combination of these.
- Create an issue on NeverwinterDP for the work (linking only by convention)
- Announce the issue on the mailing list and discuss the design there
- Sign the CLA if you haven't yet - request it from neema ( at ) demandcube.com
- Do development according to the Git Workflow Summary
- Request a code review on the NeverwinterDP mailing list, e.g. "Review Request Issue #111: Title", followed by a link to the pull request
- A committer reviews the issue and the changes in the pull request (accepting them, or requesting changes first)
Git Workflow Summary
- Fork the Repository or Update Fork
- **Create a branch for your feature development, off of master or the appropriate branch**
- Do your Development
- Stay in your feature branch
- Squash commit on your feature branch (Optional)
- Update local master from upstream repository
- Merge from feature branch onto your local master branch
- Issue pull request on github
- Stop making changes on the master branch till merged
- Step 1 (New Fork):
- Step 1 (Existing Fork, e.g. "YourUserName/NeverwinterDP"):
git pull --no-ff https://github.com/DemandCube/NeverwinterDP.git master
- Step 2:
git checkout -b feature/featurename master
- Step 3:
- Step 4 (Optional, but recommended):
git checkout feature/featurename
git rebase -i master
- Step 5:
git checkout master
git pull --no-ff https://github.com/DemandCube/NeverwinterDP.git master
- Step 6:
git checkout master
git merge feature/featurename
git push origin master
- Step 7:
- Add the remote, call it "upstream":
git remote add upstream git@github.com:DemandCube/NeverwinterDP.git
- Fetch all the branches of that remote into remote-tracking branches, such as upstream/master:
git fetch upstream
- Make sure that you're on your master branch:
git checkout master
- Merge upstream changes to your master branch
git merge upstream/master
- Create a patch
- Make sure it applies cleanly against trunk
- Test
- If the patch contains code, supply tests, including unit tests
- Propose New Features or API
- Document the new Feature or API in the Wiki, then get consensus by discussing it on the mailing list
- Open a GitHub Ticket
- Create the patch or pull request, attach your patch or pull request to the Issue.
- Your changes should be well-formatted, readable, and thoroughly commented
- Add tests
- Add to documentation, especially if it changes configs
- Add documentation so developers can understand the feature or API and continue to contribute
- Document information about the issue and approach you took to fix it and put it in the issue.
- Send a message on the mailing list requesting that a committer review it.
- Nag the list if we (committers) don't review it, and follow up with us.
- How to create a patch file:
- The preferred naming convention for Sparkngin patches is SPARKNGIN-12345-0.patch, where 12345 is the Issue number and 0 is the version of the patch.
- Patch Command:
$ git diff > /path/to/SPARKNGIN-1234-0.patch
- How to apply someone else's patch file:
$ cd ~/src/Sparkngin # or wherever you keep the root of your Sparkngin source tree
$ patch -p1 < SPARKNGIN-1234-0.patch # Default when using git diff
$ patch -p0 < SPARKNGIN-1234-0.patch # When using git diff --no-prefix
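The patch naming convention above (project, Issue number, patch version) is easy to check or bump mechanically. The helper below is a hypothetical convenience sketch, not part of the project tooling; the function names and the regular expression are assumptions derived only from the SPARKNGIN-12345-0.patch example.

```python
import re
from typing import NamedTuple

# PROJECT-ISSUE-VERSION.patch, e.g. SPARKNGIN-12345-0.patch
PATCH_NAME = re.compile(r"^(?P<project>[A-Z]+)-(?P<issue>\d+)-(?P<version>\d+)\.patch$")


class PatchName(NamedTuple):
    project: str
    issue: int
    version: int


def parse_patch_name(filename: str) -> PatchName:
    """Split a patch file name into project, Issue number, and patch version."""
    match = PATCH_NAME.match(filename)
    if match is None:
        raise ValueError(f"not a PROJECT-ISSUE-VERSION.patch name: {filename!r}")
    return PatchName(match["project"], int(match["issue"]), int(match["version"]))


def next_patch_name(filename: str) -> str:
    """Name for the next revision of the same patch."""
    name = parse_patch_name(filename)
    return f"{name.project}-{name.issue}-{name.version + 1}.patch"
```

For example, `next_patch_name("SPARKNGIN-1234-0.patch")` gives the name to use when you upload a revised patch for the same Issue.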
- Reviewing Patches
- Find issues with the label "Patch Available", look them over, and give your feedback in the issue as necessary. If there are questions, discuss them on the Mailing List.
- Pull Request
- Issue pull request
- Announce on the mailing list and request code review
- How to push from your local repo to GitHub
- How to send a pull request
- How to sync a forked repo on github
- Other Suggested Git Workflows
If you can't actually move issues around let me (Steve) know.
- "Accepted" - are tickets you plan to start working on this week.
- "Working on" - are tickets you're actively working on
- "In code review" - are tickets that need a code review (you should have put a code review request on the mailing list; if no one responded, it's up to you to follow up)
- "Working on documentation and automated tests" - are tickets where you're finishing the documentation and creating unit, integration, and configuration management/deployment (Ansible) installation tests.
- "In documentation and automated test review" - review specifically of the documentation and tests. Follows the same process as code reviews. A review should be requested on the mailing list.
- "Done" - The task should pass the automated integration test review from Jenkins
HA Testing
- Testing using SimianArmy and Chaos Monkey, and Jenkins
Providing
- Data Streaming/Collection Framework
- High Availability and Scalable Data Collection
- Data Monitoring
- Autoconfiguration - with ZeroConf - Stored in Zookeeper
- Data Partition Notification
Additional Features: High Availability and Performant Log Collection
- Log Distribution - Multi-data center
- Log Replay
- Log Monitoring
- Log Search
- Log Operational Watchdog
- Log Reporting
- Log Alerting
Logs are fed into
- HDFS
- Elastic Search
- Hive
- Mysql
- HBase
- Vertica
- Put in contributor information and update projects to reference (Kafka and Flume)
- Log Stash to Neverwinter Plugin
- Log Collection Standard
- Neverwinter Nginx Plugin to Kafka
- Neverwinter Kafka Queue Monitor in Kibana
- Develop NW Distributed Data Pump -> HCatalog - Think can be distributed framework for managing a cluster of writers to Flume or Logstash
- Add other main github projects
Prototype framework with zmq in python
Topics
- Registry
- Heartbeat
- Stats
- LogTopics
- Develop - Protocol
- Sparkngin
- Connector Component
Out of the box super easy plugin to
- Nagios
- Ganglia
[Nginx] -> Openresty, libkafka with spillover buffer, spillagent, window registration and monitoring
- Nginx -> Kafka -> Flume -> Hcatalog -> Hive
[Log]
- Epoch timestamp, ip, process and optional type and version
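A log line header with these fields could look like the sketch below. The tab delimiter, the field order, and the `LogHeader`, `format_header`, and `parse_header` names are assumptions for illustration; the notes above only list which fields exist and that type and version are optional.

```python
from typing import NamedTuple, Optional


class LogHeader(NamedTuple):
    timestamp: int                    # epoch seconds
    ip: str
    process: str
    log_type: Optional[str] = None    # optional
    version: Optional[str] = None     # optional, only written when log_type is set


def format_header(header: LogHeader) -> str:
    """Join the fields with tabs, leaving out trailing optional fields."""
    fields = [str(header.timestamp), header.ip, header.process]
    if header.log_type is not None:
        fields.append(header.log_type)
        if header.version is not None:
            fields.append(header.version)
    return "\t".join(fields)


def parse_header(line: str) -> LogHeader:
    """Parse a header produced by format_header."""
    parts = line.rstrip("\n").split("\t")
    if not 3 <= len(parts) <= 5:
        raise ValueError(f"expected 3 to 5 tab-separated fields, got {len(parts)}")
    timestamp, ip, process, *optional = parts
    return LogHeader(int(timestamp), ip, process, *optional)
```

Because the optional fields are positional in this sketch, a version can only be written when a type is also present.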
[Monitoring] Log normal, error, watchdog, normal spill, error spill, watchdog spill
- Type, Lines, Size per minute per process per server
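The per-minute counters described above can be sketched as two dictionaries keyed by server, process, log type, and minute. The `LogStats` class and its method name are hypothetical, and the server and process names in the usage lines are made up.

```python
from collections import defaultdict
from typing import Dict, Tuple

Key = Tuple[str, str, str, int]   # (server, process, log type, minute since epoch)


class LogStats:
    """Counts lines and bytes per minute, per process, per server, per log type."""

    def __init__(self) -> None:
        self.lines: Dict[Key, int] = defaultdict(int)
        self.size: Dict[Key, int] = defaultdict(int)

    def record(self, server: str, process: str, log_type: str,
               timestamp: int, line: bytes) -> None:
        # Integer division buckets the epoch timestamp into whole minutes.
        key = (server, process, log_type, timestamp // 60)
        self.lines[key] += 1
        self.size[key] += len(line)


stats = LogStats()
stats.record("web-1", "nginx", "normal", 120, b"GET / 200\n")
stats.record("web-1", "nginx", "normal", 150, b"GET /a 200\n")
stats.record("web-1", "nginx", "error", 180, b"upstream timed out\n")
```

The first two records fall into the same minute bucket, so they share one counter; the error line lands in the next minute under its own log type.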
[Concept/Abstraction] Emitter Client
- Zero Conf - module to -> zookeeper
- async
- persistence
- send
- spillover
- spillover recovery agent
- LogEmitter
- LogFormat
- LogVersion
- LogType
- WatchDogEmitter
- WatchDog Register
- WatchDog Monitor and Alerts
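The send, spillover, and spillover recovery items above could fit together roughly as follows. This is a simplified sketch under several assumptions: the `SpilloverEmitter` class and its methods are hypothetical, records are assumed to contain no newlines, delivery is synchronous rather than async, and a failed send is signalled by `OSError`.

```python
import os
import tempfile
from typing import Callable, List


class SpilloverEmitter:
    """Sends records to the data bus and spills them to a local file on failure."""

    def __init__(self, send: Callable[[bytes], None], spill_path: str) -> None:
        self._send = send
        self._spill_path = spill_path

    def emit(self, record: bytes) -> bool:
        """Try to send; on failure append the record to the spill file."""
        try:
            self._send(record)
            return True
        except OSError:
            with open(self._spill_path, "ab") as spill:
                spill.write(record.rstrip(b"\n") + b"\n")
            return False

    def recover(self) -> int:
        """Replay spilled records; the ones that still fail are spilled again."""
        if not os.path.exists(self._spill_path):
            return 0
        with open(self._spill_path, "rb") as spill:
            pending = [line.rstrip(b"\n") for line in spill if line.strip()]
        # Simplification: a crash between remove and replay would lose records.
        os.remove(self._spill_path)
        return sum(1 for record in pending if self.emit(record))


delivered: List[bytes] = []
bus = {"up": False}               # toggled to simulate the data bus going down


def send(record: bytes) -> None:
    if not bus["up"]:
        raise ConnectionError("data bus unreachable")
    delivered.append(record)


spill_file = os.path.join(tempfile.mkdtemp(), "emitter.spill")
emitter = SpilloverEmitter(send, spill_file)
emitter.emit(b"event-1")          # bus is down, so this lands in the spill file
bus["up"] = True
emitter.emit(b"event-2")
recovered = emitter.recover()     # replays event-1 from the spill file
```

Note that the replayed record arrives after the newer one, so consumers in a design like this would need to order by timestamp rather than by arrival.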
[Reporting]
- Summary Report
- LogType Report
- Key Coverage Report
- Value Coverage Report
[Support]
- Data Processing Assessment
- Process management
- Annual Data Assessment
[Dependencies]
- OpenResty
- Nginx
- Kafka
- Hadoop
- Ganglia
- Cacti
- ElasticSearch
- LogStash
- HCatalog
- Hive
- Storm
- HBase
- Flume
- Kibana
- Doxygen
- Zeromq
[ To investigate ]
- Nginx -> Kafka
- Nginx -> Logrotate
- Nginx -> module timestamp
- Nginx -> logrotation module
- Logrotate frequency mod
- Logtail - find reference to old project that I looked at
- Filehandle monitor
- fuser
- inotify-tools
Capabilities
- file handle monitor
- file process monitor
- file tail
- Kafka efficient socket to file transfer
+------------+    +-----------+    +------------+    +----------------+
|NW          |    |NW         |    |NW          |    |                |
|            |    |           |    |            |    |                |
| Front End  |    | Data Bus  |    | Data Pump  |    | End Point      |
| Emitter    |+-->|           |+-->|            |+-->|- HDFS          |
|  - Http Get|    |           |    |            |    |- Elastic Search|
|  - Json    |    |           |    |            |    |                |
|  - Avro    |    |           |    |            |    |                |
+------------+    +-----------+    +------------+    +----------------+
+-----------+     +---------+     +-----+    +--------+
| Log Stash |     |Sparkngin|     |Kafka|    |Hadoop  |
|-----------|     |---------|     |-----|    |--------|
|           |+--->|         |+--->|     |+-->|HCatalog|
|           |     |         |     |     |    |HBase   |
+-----------+     +---------+     +-----+    +--------+
                                    +
                                    |        +--------+
                                    +------->|Storm   |
                                             |--------|
                                             |        |
                                             |        |
                                             +--------+
http://www.asciiflow.com/#Draw
Should a distributed fault-tolerant data transport layer from Kafka to Hadoop be built on
- Storm
- Yarn
[ Front End Emitter ]
- The concept is to have a nginx front end that will ship logs to
[ log collection (logstash) ] -> [ rest end point (nginx) ] -> [ data bus (kafka) ] -> [ data pump/transport (storm or yarn) ] -> [ rdbms (hive - data registration live) | file system (hdfs) | key store (hbase) ]
Look at developing the protocol prototype with an Avro Producer using zmq and an Avro Consumer communicating through Kafka: Version/Lineage, Heartbeat, Source, Header/Footer. Take design aspects from Camus; it must provide built-in monitoring. Messages need to carry a source timestamp and a system timestamp, and there needs to be a way to inspect where hour boundaries exist on the queue. Additionally, there needs to be a way to register servers, and to record when they come online and go offline, for log registration.
Should there be the ability for schema registration, so that schemas can be pushed downstream?
Should there be a mapping and general payload support. Json support
Should Avro / Thrift / Protobuf / HBase / Hive / Storm - Type Mappings be maintained?
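The two timestamps and the hour-boundary inspection could be prototyped as below. The `Message` fields and the `hour_boundaries` function are hypothetical names, and the sketch assumes system timestamps never decrease in queue order.

```python
from typing import List, NamedTuple, Tuple


class Message(NamedTuple):
    offset: int       # position on the queue
    source_ts: int    # epoch seconds stamped by the emitting application
    system_ts: int    # epoch seconds stamped when the pipeline received it
    payload: bytes


def hour_boundaries(messages: List[Message]) -> List[Tuple[int, int]]:
    """Return (offset, hour_start) for the first message of each system-time hour."""
    boundaries = []
    previous_hour = None
    for message in messages:
        # Round the system timestamp down to the start of its hour.
        hour = message.system_ts - message.system_ts % 3600
        if hour != previous_hour:
            boundaries.append((message.offset, hour))
            previous_hour = hour
    return boundaries


queue = [
    Message(0, 3580, 3590, b"a"),
    Message(1, 3598, 3599, b"b"),
    Message(2, 3599, 3600, b"c"),
    Message(3, 7100, 7201, b"d"),
]
boundaries = hour_boundaries(queue)
```

A consumer that writes hourly files could use the returned offsets to decide where one file ends and the next begins.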
- https://github.com/criteo/kafka-ganglia
- Creating Author: Maxime Brugidou [email protected]
- Interested Potential Contributor: Andrew Otto [email protected]
- Level 0 Raw
- Level 1 Envelope Framework (Leverage Avro/Protocol Buffers/Thrift)
- Level 2 Event Payload
- Level 3 Encrypted Payload
- Events configured to flow by topic and get partitioned by either server timestamp or application supplied
- Sparkngin in-memory encryption layer.
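Partitioning events by topic and by either the server timestamp or an application-supplied one might look like the sketch below. The hourly granularity, the `topic/YYYY/MM/DD/HH` layout, UTC, and the `partition_path` name are all assumptions; the notes above do not fix any of them.

```python
from datetime import datetime, timezone
from typing import Optional


def partition_path(topic: str, server_ts: int, app_ts: Optional[int] = None) -> str:
    """Hourly partition for an event; the application timestamp wins when supplied."""
    ts = server_ts if app_ts is None else app_ts
    hour = datetime.fromtimestamp(ts, tz=timezone.utc)
    return f"{topic}/{hour:%Y/%m/%d/%H}"
```

With this layout, `partition_path("clicks", 0)` yields `clicks/1970/01/01/00`, and passing `app_ts` moves a late-arriving event into the hour in which the application says it happened.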
Preferred Development Tools
- [Ansible](http://www.ansibleworks.com/)
- [Vagrant](http://www.vagrantup.com/)
- [Virtualbox](https://www.virtualbox.org/)
- [Gradle](http://www.gradle.org/)