Inherit mesos-slave attributes #154

Open
wants to merge 135 commits into
base: master
135 commits
dc20de4
Enable checkpointing
b0r6 Feb 23, 2017
dfd9b61
model: Added nodes namespace
b0r6 Feb 23, 2017
3698f50
Added nodes protobuf file
b0r6 Feb 23, 2017
7469be4
added nodes.proto
b0r6 Feb 23, 2017
9bf9f81
Pass user specified agent-id and mesos generated agent-id to zookeeper.
b0r6 Feb 23, 2017
7af96ab
Insert the data, not just an empty byte array.
b0r6 Feb 23, 2017
33f2a7f
Added crashed_nodes
b0r6 Feb 24, 2017
3bd00e4
Added crashed_nodes.proto
b0r6 Feb 24, 2017
ec74b73
Added add()-function for crashed_nodes
b0r6 Feb 24, 2017
6aeca42
Added missing import
b0r6 Feb 24, 2017
db1a1a7
Rename type to avoid conflict
b0r6 Feb 24, 2017
b3728e3
Protobuf: change variable type
b0r6 Feb 24, 2017
27ceacd
Added generated protobuf for crashed_nodes
b0r6 Feb 24, 2017
1e2dc1f
Added functions for getting node either by uuid or mesos-generated id
b0r6 Feb 24, 2017
b56c04f
Don't return array since they won't share the same uuid.
b0r6 Feb 24, 2017
5dd3da6
Add agent to list of crashed nodes if mesos detects a disconnect.
b0r6 Feb 24, 2017
68708a1
Fix invalid variable type
b0r6 Feb 24, 2017
3c1c1b5
Added missing protobuf function
b0r6 Feb 24, 2017
e025a0e
Fixed a few lines that used the wrong types.
b0r6 Feb 24, 2017
74a71b6
Renamed some variables and functions
b0r6 Feb 27, 2017
bab7ba5
Use uuid for crashed_nodes
b0r6 Feb 27, 2017
30480bd
Added mesosAgentID to crashed_nodes
b0r6 Feb 27, 2017
b12e7c3
crashed_nodes: Added getters
b0r6 Feb 27, 2017
6e10265
Fixed broken function
b0r6 Feb 27, 2017
af2af02
Fixed a place where two functions had accidentally gotten their names s…
b0r6 Feb 27, 2017
3552c58
Fixed eternal loop
b0r6 Feb 27, 2017
7f88edd
Fixed another place where an invalid function call would result in an…
b0r6 Feb 27, 2017
f4b6f98
Return nil if there weren't any matches
b0r6 Feb 27, 2017
7ae06fa
Detect disconnected agents when they come back online.
b0r6 Feb 27, 2017
a551c43
Only try to return first result if there were any results at all
b0r6 Feb 28, 2017
b21733a
Added possibility to filter instances by AgentMesosID
b0r6 Feb 28, 2017
a178639
Fetch all instances previously launched by the crashed agent and add …
b0r6 Feb 28, 2017
413b87b
Return nil here as well if there weren't any matches
b0r6 Feb 28, 2017
8d7a73a
rename agent attribute
b0r6 Feb 28, 2017
3663af3
Add new config option to scheduler.toml
b0r6 Feb 28, 2017
4e167e3
Fix invalid type
b0r6 Feb 28, 2017
2e6f338
Return if node already exists
b0r6 Feb 28, 2017
10c89fc
Rearrange some stuff
b0r6 Feb 28, 2017
6859d9c
Don't return an error if node already exists
b0r6 Mar 2, 2017
879e29c
protobuf: Added field to crashed_nodes
b0r6 Mar 2, 2017
3728f8e
protobuf: Added reconnectedAt timestamp to crashed_nodes
b0r6 Mar 2, 2017
ec3c118
crashed_nodes: Added SetReconnected()
b0r6 Mar 2, 2017
f2eb5a1
If node comes back up online, set as reconnected
b0r6 Mar 2, 2017
1d1613b
Hypervisor: added type ContainerState
b0r6 Mar 3, 2017
1a346b6
hypervisor: Added GetContainerState()
b0r6 Mar 3, 2017
be08631
executor: get states of both instance and container
b0r6 Mar 6, 2017
d22c35f
Some changes to error handling
b0r6 Mar 6, 2017
cf9122e
executor: added recoverInstance()
b0r6 Mar 6, 2017
d9c1225
Updated model.proto
b0r6 Mar 7, 2017
818309f
protobuf: added autorecovery and connectionstatus
b0r6 Mar 7, 2017
066527c
crashed_nodes.go: a few changes to add()
b0r6 Mar 7, 2017
f87de1f
Removed global autorecovery. Should be decided individually for insta…
b0r6 Mar 7, 2017
e7fab90
executor.recoverinstance: handle instanceState RUNNING and STOPPED fo…
b0r6 Mar 7, 2017
9ea9da9
Protobuf adjustments
b0r6 Mar 7, 2017
5597772
model: added updateConnectionStatus()
b0r6 Mar 7, 2017
1f7af1c
Model: define recently added function
b0r6 Mar 8, 2017
21b9c8e
Fixed invalid memory address bug
b0r6 Mar 8, 2017
3c7f2b8
executor: update instance connectionState
b0r6 Mar 8, 2017
1a168cf
instances: set default connectionState at creation
b0r6 Mar 10, 2017
1fc4343
Scheduler: refactored and split code into functions, handle disconnec…
b0r6 Mar 10, 2017
3de2641
Prevent offer from being used more than once when relaunching tasks
b0r6 Mar 10, 2017
2821a57
crashed_nodes getByAgentID(): Only get the most recent disconnected one
b0r6 Mar 10, 2017
3ef5995
Get crashed nodes by using unique id instead rather than the mesos-ge…
b0r6 Mar 10, 2017
541972c
Instances: set autorecovery to true as default for now.
b0r6 Mar 10, 2017
7b544a7
Check if instance has autoRecovery set to true before adding to relau…
b0r6 Mar 10, 2017
26ea4a8
Update instance.slaveId when adding it to relaunch queue.
b0r6 Mar 10, 2017
c02584d
Combine two functions into one
b0r6 Mar 10, 2017
9723567
Detect crashed nodes after master crash
b0r6 Mar 10, 2017
e4b3cb4
Merge branch 'master' into crash-recovery
b0r6 Mar 13, 2017
27aaeb4
When detecting a new crashed node after master crash; set relevant in…
b0r6 Mar 13, 2017
4e23e96
nodes: added update function
b0r6 Mar 13, 2017
0fe4855
Slight change to update function
b0r6 Mar 13, 2017
ef4c61b
Update node agentMesosId upon reconnection
b0r6 Mar 13, 2017
d22d8b1
Solved a few bugs
b0r6 Mar 13, 2017
a6ed25f
protobuf: Added functions that were missing for some reason.
b0r6 Mar 13, 2017
522e68d
Added missing function to types_test.go
b0r6 Mar 13, 2017
33d5c7a
Merge branch 'master' into crash-recovery
Mar 16, 2017
fb4bccc
api: Add AutoRecovery flag to CreateRequest
Mar 16, 2017
bb6c0f7
cli/create,run: Add --auto-recovery flag
Mar 16, 2017
4eb6d11
Merge branch 'master' into crash-recovery
Mar 22, 2017
78bbee3
Merge branch 'crash-recovery' into inherit-slave-attr
Mar 27, 2017
66a11b4
executor: Defer HypervisorProvider initialization until mesos.Registered
Mar 27, 2017
afb9cd5
executor: Remove "hypervisor" flag
Mar 27, 2017
7c900ec
Re-arrange some stuff to let the hypervisordriver handle the recovery…
b0r6 Mar 27, 2017
fa13eed
exec: Removed GetStates() function since it's not necessary anymore
b0r6 Mar 27, 2017
ad30a40
hypervisor: remove GetContainerState() since executor doesn't need th…
b0r6 Mar 27, 2017
dc7a5b7
Remove unused enum
b0r6 Mar 27, 2017
ab7fd08
Renamed variables
b0r6 Mar 27, 2017
ebf3a51
%v -> %s
b0r6 Mar 27, 2017
3a4c619
Add switch case to Recover
b0r6 Mar 27, 2017
c102a86
devbox: Add attribute openvdc-node-id
b0r6 Mar 27, 2017
c616484
Updated types_test.go
b0r6 Mar 27, 2017
e3c8fe9
Added crash recovery diagrams
b0r6 Mar 27, 2017
f929317
Add bg color to svg
b0r6 Mar 27, 2017
141d1b8
attempt #2
b0r6 Mar 27, 2017
9cb03ae
renamed variables
b0r6 Mar 27, 2017
c99bc41
ci/multibox: Append openvdc-node-id
Mar 27, 2017
b202274
Merge branch 'crash-recovery' into inherit-slave-attr
Mar 28, 2017
231f523
Merge branch 'master' into crash-recovery
Mar 28, 2017
4dbc97f
lxc: Fix container name reference
Mar 28, 2017
85c2d9d
Merge branch 'master' into inherit-slave-attr
Mar 28, 2017
4cd69d4
ci/multibox: Remove "hypervisor" config item
Mar 28, 2017
13ea517
Merge branch 'master' into crash-recovery
Apr 3, 2017
0bf67f8
Merge branch 'master' into inherit-slave-attr
Apr 3, 2017
eb23899
executor: Forgot to call LoadConfig()
Apr 4, 2017
3398669
Merge branch 'fix-zk-version' into crash-recovery
Apr 6, 2017
beb6f3e
Merge branch 'fix-zk-version' into inherit-slave-attr
Apr 6, 2017
3dbe3e6
Merge branch 'master' into crash-recovery
Apr 6, 2017
2a6d35f
more error logging.
Apr 6, 2017
d8c128b
proto: Add json_name options
Apr 6, 2017
2b0e4f6
Merge branch 'crash-recovery' into inherit-slave-attr
Apr 6, 2017
68cd21c
executor: Inherit openvdc-node-id from mesos slave.
Apr 6, 2017
91b12f7
Merge branch 'master' into inherit-slave-attr
Apr 13, 2017
a47d9a1
executor: Fix crash due to eager HypervisorProvider checking
Apr 13, 2017
b92aca7
Merge branch 'master' into crash-recovery
b0r6 Apr 28, 2017
31bc4ea
Added base for crash-recovery test
b0r6 Apr 28, 2017
d1e20e8
Fixed wrong function getting called
b0r6 Apr 28, 2017
4e6a907
Tweaked test
b0r6 Apr 28, 2017
41022f8
Tweaked test
b0r6 Apr 28, 2017
e8884b6
Fixed variable being declared twice
b0r6 May 8, 2017
20d3277
Changed test
b0r6 May 11, 2017
38c835b
Give scheduler some time to reconnect stuff
b0r6 May 11, 2017
3cb5bd8
Check if value was returned
b0r6 May 11, 2017
bc2bf9e
Ignore instances unless they have AutoRecovery set to true
b0r6 May 11, 2017
71dcbe7
Tweak waiting time
b0r6 May 11, 2017
3155cfc
test
b0r6 May 11, 2017
d9a2170
Remove import
b0r6 May 11, 2017
1154542
agent>slave
b0r6 May 11, 2017
b8d7226
Merge branch 'master' into inherit-slave-attr
May 25, 2017
fed79cd
executor: Improve error case of SendStatusUpdate() calls in LaunchTask
May 25, 2017
8efbb96
Merge branch 'master' into inherit-slave-attr
May 25, 2017
362ae95
Merge branch 'master' into crash-recovery
May 27, 2017
b6b8210
Merge branch 'master' into crash-recovery
Jun 9, 2017
849fefc
Revert "Merge files".
Jun 9, 2017
8d1fffc
Merge branch 'crash-recovery' into inherit-slave-attr
Jun 9, 2017
8 changes: 6 additions & 2 deletions api/instance_service.go
@@ -23,7 +23,8 @@ type InstanceAPI struct {

 func (s *InstanceAPI) Create(ctx context.Context, in *CreateRequest) (*CreateReply, error) {
 	inst, err := model.Instances(ctx).Create(&model.Instance{
-		Template: in.GetTemplate(),
+		Template:     in.GetTemplate(),
+		AutoRecovery: in.GetAutoRecovery(),
 	})
 	if err != nil {
 		log.WithError(err).Error()
@@ -94,7 +95,10 @@ func (s *InstanceAPI) Run(ctx context.Context, in *CreateRequest) (*RunReply, er
 	if err := checkSupportAPI(in.GetTemplate(), ctx); err != nil {
 		return nil, err
 	}
-	res1, err := s.Create(ctx, &CreateRequest{Template: in.GetTemplate()})
+	res1, err := s.Create(ctx, &CreateRequest{
+		Template:     in.GetTemplate(),
+		AutoRecovery: in.GetAutoRecovery(),
+	})
 	if err != nil {
 		log.WithError(err).Error("Failed InstanceAPI.Run at Create")
 		return nil, err
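
The change above makes `Run` forward `AutoRecovery` to `Create` instead of building a request that carries only the template. A minimal sketch of why that matters follows. It assumes simplified stand-in types: `CreateRequest`, `Instance`, `create` and `run` here are hypothetical and are not the generated protobuf types or the real `InstanceAPI` methods.

```go
package main

import "fmt"

// CreateRequest is a simplified, hypothetical stand-in for the generated
// protobuf request type; the real one carries a template message.
type CreateRequest struct {
	Template     string
	AutoRecovery bool
}

// Nil-safe getters in the style of protobuf-generated Go code.
func (r *CreateRequest) GetTemplate() string {
	if r == nil {
		return ""
	}
	return r.Template
}

func (r *CreateRequest) GetAutoRecovery() bool {
	if r == nil {
		return false
	}
	return r.AutoRecovery
}

// Instance is a hypothetical stand-in for model.Instance.
type Instance struct {
	Template     string
	AutoRecovery bool
}

// create copies both fields from the request into the stored instance.
func create(in *CreateRequest) Instance {
	return Instance{
		Template:     in.GetTemplate(),
		AutoRecovery: in.GetAutoRecovery(),
	}
}

// run builds a fresh request for create. Any field it does not copy
// falls back to its zero value, so AutoRecovery must be forwarded.
func run(in *CreateRequest) Instance {
	return create(&CreateRequest{
		Template:     in.GetTemplate(),
		AutoRecovery: in.GetAutoRecovery(),
	})
}

func main() {
	inst := run(&CreateRequest{Template: "centos/7/lxc", AutoRecovery: true})
	fmt.Println(inst.AutoRecovery) // prints "true"
}
```

If `run` copied only `Template`, the instance would be stored with `AutoRecovery` set to `false` regardless of what the caller asked for.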
143 changes: 76 additions & 67 deletions api/v1.pb.go

Some generated files are not rendered by default.

@@ -1 +1 @@
-hypervisor:null
+hypervisor:null;openvdc-node-id:null1
@@ -1,5 +1,3 @@
-[hypervisor]
-driver = "null"

 [zookeeper]
 endpoint = "zk://10.0.100.10:2181,10.0.100.11:2181,10.0.100.12:2181/openvdc"
@@ -1 +1 @@
-hypervisor:lxc;node-groups:linuxbr
+hypervisor:lxc;openvdc-node-id:lxc1;node-groups:linuxbr
@@ -1,5 +1,4 @@
[hypervisor]
driver = "lxc"
script-path = "/etc/openvdc/scripts/"
image-server-uri = "http://10.0.100.12/images"
cache-path = "/var/cache/lxc/"
@@ -1 +1 @@
-hypervisor:lxc;node-groups:ovs
+hypervisor:lxc;openvdc-node-id:lxc2;node-groups:ovs
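
The attributes files changed above use a `key:value;key:value` layout. As a rough illustration of that layout only, the sketch below splits such a string into a map and reads `openvdc-node-id` from it. `parseAttributes` is a hypothetical helper and not code from this PR. The executor presumably receives the attributes from Mesos once it is registered rather than parsing the file, and the sketch treats every value as plain text.

```go
package main

import (
	"fmt"
	"strings"
)

// parseAttributes is a hypothetical helper, not part of the PR. It splits a
// mesos-slave attributes string of the form "key:value;key:value" into a map.
// Entries without a colon are skipped, and every value is kept as text.
func parseAttributes(s string) map[string]string {
	attrs := make(map[string]string)
	for _, pair := range strings.Split(strings.TrimSpace(s), ";") {
		kv := strings.SplitN(pair, ":", 2)
		if len(kv) != 2 {
			continue
		}
		attrs[strings.TrimSpace(kv[0])] = strings.TrimSpace(kv[1])
	}
	return attrs
}

func main() {
	attrs := parseAttributes("hypervisor:lxc;openvdc-node-id:lxc1;node-groups:linuxbr")
	fmt.Println(attrs["hypervisor"], attrs["openvdc-node-id"], attrs["node-groups"])
	// prints "lxc lxc1 linuxbr"
}
```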
35 changes: 35 additions & 0 deletions ci/citest/acceptance-test/tests/crash_recovery_test.go
@@ -0,0 +1,35 @@
// +build acceptance

package tests

import (
	"strings"
	"testing"
	"time"
)

func TestCrashRecovery(t *testing.T) {
	stdout, _ := RunCmdAndReportFail(t, "openvdc", "run", "centos/7/lxc")
	instance_id := strings.TrimSpace(stdout.String())

	WaitInstance(t, 5*time.Minute, instance_id, "RUNNING", []string{"QUEUED", "STARTING"})
	RunSshWithTimeoutAndReportFail(t, executor_lxc_ip, "sudo lxc-info -n "+instance_id, 10, 5)

	stdout, _ = RunCmdAndReportFail(t, "openvdc", "stop", instance_id)
	WaitInstance(t, 5*time.Minute, instance_id, "STOPPED", []string{"RUNNING", "STOPPING"})

	//Simulate crash
	RunSshWithTimeoutAndReportFail(t, executor_lxc_ip, "sudo systemctl stop mesos-slave", 10, 5)

	//Give scheduler some time to register crash
	time.Sleep(3 * time.Minute)

	RunSshWithTimeoutAndReportFail(t, executor_lxc_ip, "sudo systemctl start mesos-slave", 10, 5)
	time.Sleep(1 * time.Minute)

	//stdout, _ = RunCmdAndReportFail(t, "openvdc", "stop", instance_id)
	//WaitInstance(t, 5*time.Minute, instance_id, "STOPPED", []string{"RUNNING", "STOPPING"})

	stdout, _ = RunCmdWithTimeoutAndReportFail(t, 10, 5, "openvdc", "destroy", instance_id)
	WaitInstance(t, 5*time.Minute, instance_id, "TERMINATED", nil)
}
2 changes: 1 addition & 1 deletion ci/devbox/devbox-centos7.json
@@ -89,7 +89,7 @@
 "systemctl enable zookeeper",
 "systemctl enable mesos-master",
 "systemctl enable mesos-slave",
-"echo 'hypervisor:lxc' > /etc/mesos-slave/attributes",
+"echo 'hypervisor:lxc;openvdc-node-id:lxc1' > /etc/mesos-slave/attributes",
 "echo 'false' > /etc/mesos-slave/switch_user",
 "echo '{\"PATH\":\"/home/vagrant/go/src/github.com/axsh/openvdc:/usr/libexec/mesos:/usr/bin:/usr/sbin:/usr/local/bin\"}' > /etc/mesos-slave/executor_environment_variables",
 "#firewall-cmd --permanent --zone=public --add-port=5050/tcp",
15 changes: 15 additions & 0 deletions cmd/lxc-openvdc/init_linux.go
@@ -0,0 +1,15 @@
package main

import (
	"log/syslog"
	"github.com/Sirupsen/logrus"
	logrus_syslog "github.com/Sirupsen/logrus/hooks/syslog"
)

func init() {
	hook, err := logrus_syslog.NewSyslogHook("", "", syslog.LOG_DEBUG, "lxc-openvdc")
	if err != nil {
		logrus.Fatal("Failed to initialize syslog hook: ", err)
	}
	logrus.AddHook(hook)
}
10 changes: 0 additions & 10 deletions cmd/lxc-openvdc/main.go
@@ -5,7 +5,6 @@ import (
 	"fmt"
 	"io"
 	"io/ioutil"
-	"log/syslog"
 	"net/http"
 	"net/url"
 	"os"
@@ -14,7 +13,6 @@ import (
 	"strings"

 	log "github.com/Sirupsen/logrus"
-	logrus_syslog "github.com/Sirupsen/logrus/hooks/syslog"
 	"github.com/mholt/archiver"
 	"github.com/pkg/errors"
 )
@@ -29,14 +27,6 @@ var dist string
 var release string
 var arch string

-func init() {
-	hook, err := logrus_syslog.NewSyslogHook("", "", syslog.LOG_DEBUG, "lxc-openvdc")
-	if err != nil {
-		log.Fatal("Failed to initialize syslog hook: ", err)
-	}
-	log.AddHook(hook)
-}
-
 func main() {

 	_dist := flag.String("dist", "centos", "Name of the distribution")