Skip to content
This repository has been archived by the owner on May 12, 2021. It is now read-only.

shimv2: Add tracing #2980

Merged

Conversation

cmaf
Copy link
Contributor

@cmaf cmaf commented Sep 22, 2020

Added tracing with tracer being created in Create, which should ideally
happen in New. Currently getting no trace or span output when the tracer
is created in New or StartShim, or Create before create() is called.
Also no log output from these functions.

To do:

  • Determine which context is more appropriate to associate with spans
    (ctx passed in or ctx saved in service struct - currently implemented
    using service ctx)
  • Find out what in create() allows tracing to work and implement it in
    New if possible

Signed-off-by: Chelsea Mafrica [email protected]

@cmaf cmaf linked an issue Sep 22, 2020 that may be closed by this pull request
@cmaf cmaf added the wip Work in Progress (PR incomplete - needs more work or rework) label Sep 22, 2020
@cmaf
Copy link
Contributor Author

cmaf commented Sep 22, 2020

@egernst @jodh-intel could either of you test locally? also maybe @amshinde wants to take a look

@cmaf
Copy link
Contributor Author

cmaf commented Sep 22, 2020

/test

@cmaf cmaf self-assigned this Sep 22, 2020
@codecov
Copy link

codecov bot commented Sep 23, 2020

Codecov Report

Merging #2980 (0279c81) into master (2a98f43) will decrease coverage by 0.05%.
The diff coverage is 29.09%.

@@            Coverage Diff             @@
##           master    #2980      +/-   ##
==========================================
- Coverage   50.35%   50.30%   -0.06%     
==========================================
  Files         120      120              
  Lines       15918    15978      +60     
==========================================
+ Hits         8016     8037      +21     
- Misses       6812     6851      +39     
  Partials     1090     1090              

@cmaf cmaf force-pushed the tracing-shimv2-create-nodebug branch 2 times, most recently from b302c1f to 003a824 Compare September 23, 2020 00:36
@cmaf cmaf added the needs-forward-port Changes need to be applied to a newer branch / repository label Sep 23, 2020
@cmaf cmaf force-pushed the tracing-shimv2-create-nodebug branch 3 times, most recently from 1b06857 to eddc957 Compare September 23, 2020 03:42
@cmaf
Copy link
Contributor Author

cmaf commented Sep 24, 2020

The answer to why tracing doesn't work earlier is that tracer must be created after katautils.LoadConfiguration. This happens relatively early in the cli, but doesn't happen in shimv2 until create (from Create in service.go) calls loadRuntimeConfig.

We could try to call LoadConfiguration earlier in shimv2; I'm not sure if there was a reason for calling it after New and StartShim.

@cmaf
Copy link
Contributor Author

cmaf commented Sep 24, 2020

Maybe @bergwolf or @lifupan have advice on whether or not LoadConfiguration should be called earlier?

@lifupan
Copy link
Member

lifupan commented Sep 25, 2020

Maybe @bergwolf or @lifupan have advice on whether or not LoadConfiguration should be called earlier?

Hi @cmaf

Since sihmv2 support loading config files from different places which can be passed from containerd through Create call, thus it's better to load it in Create service.

@cmaf cmaf force-pushed the tracing-shimv2-create-nodebug branch from eddc957 to c328c4d Compare September 26, 2020 01:15
@cmaf
Copy link
Contributor Author

cmaf commented Sep 26, 2020

@jodh-intel @amshinde @egernst this traces from New which works when you load a runtime config in New.

I still need to: (1) add check for error after initial LoadRuntimeConfig, (2) check for tracing enabled before creating tracer, (3) ensure runtime config gets overwritten in Create, (4) test additional scenarios.

@cmaf cmaf force-pushed the tracing-shimv2-create-nodebug branch 4 times, most recently from 8941634 to 0533bcf Compare September 30, 2020 03:42
@cmaf
Copy link
Contributor Author

cmaf commented Sep 30, 2020

/test

@egernst
Copy link
Member

egernst commented Oct 1, 2020

Thanks @cmaf!
What's next here then? Will you start at the new call now if the config allows? What if you create tracer in new but the loaded config has tracing disabled?

Copy link
Contributor

@jodh-intel jodh-intel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @cmaf. A few comments.

Comment on lines 211 to 183
// Stop tracing because a new tracer will be created when New is called again after StartShim
defer katautils.StopTracing(s.ctx)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if it might be better to check if s.tracer in New() and calling StopTracing() there if tracing is already active.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jodh-intel I'm not sure I understand this. I could call StopTracing() before the tracer is created in New(), but s.tracer doesn't exist until halfway through the function. A new context is used every time New() is called. New() is called twice per container, initially followed by StartShim(), so there is one context used between those two, and a different context for the second New() and subsequent functions.

Copy link
Contributor Author

@cmaf cmaf Oct 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jodh-intel I thought about this some more and tracing can't be stopped earlier in New() because stopping tracing depends on using the specific context associated with the tracer, which isn't available since a new one is used in that function. It is possible I'm missing functions in between New() and StartShim() (or after), but StartShim() is the only function that I've seen called after New() using the same context associated with the tracer where we can end the first trace.

func trace(ctx context.Context, tracer opentracing.Tracer, name string) opentracing.Span {
if tracer != opentracing.Tracer(nil) {
span, _ := opentracing.StartSpanFromContext(ctx, name)
span.SetTag("subsystem", "runtime")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you could set this tag in the root span to avoid having to set every time here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jodh-intel I tried setting it only at New where the tracer is created and don't see a subsystem for the other functions, so I'm leaving it in for now.

@cmaf
Copy link
Contributor Author

cmaf commented Oct 1, 2020

What's next here then? Will you start at the new call now if the config allows? What if you create tracer in new but the loaded config has tracing disabled?

@egernst yes it starts at New now if the config allows. I haven't tested cases yet where it is enabled in the first config and disabled in the second (or vice versa), but with the former it should create a trace containing New and StartShim with no other effects, and with the latter tracing still should not work at all since the tracer is created in New and not Create (using the original place where the config is loaded requires the tracer to be created after create() in Create).

@cmaf cmaf force-pushed the tracing-shimv2-create-nodebug branch 2 times, most recently from 680c0a6 to 5c4e754 Compare October 6, 2020 18:00
@cmaf
Copy link
Contributor Author

cmaf commented Oct 6, 2020

@jodh-intel Thanks, I updated it based on your feedback and am testing a few locations for StopTracing().

@cmaf cmaf added no-backport-needed Changed do not need to be applied to an older branch / repository needs-review Needs to be assessed by the team. labels Oct 28, 2020
@jodh-intel
Copy link
Contributor

Hi @cmaf - I've atlast had a chance to try out this PR. Although I'm seeing trace spans, they are a couple of problems:

  • I'm not seeing all the spans I'd expect to.
  • The spans are not grouped into a single transaction.

Adding a bit of debug shows an initially somewhat confusing set of shimv2 calls being made:

New
StartShim
newCommand
New
Create
Wait
Wait
Start
Start
State
State
State
State
Delete
Delete
Shutdown
Shutdown

However, adding more shows the real picture: the Kata v2 shim (which is loaded for the lifetime of a single container) is forked twice. The group of API calls divides into two sets:

# Initial Kata shimv2 pid
New
StartShim
newCommand

# Double-forked Kata shimv2 pid
New
Create
Wait
Wait
Start
Start
State
State
State
State
Delete
Delete
Shutdown
Shutdown

This now makes more sense and leads to the following possibilities:

Option Create Tracer in function Stop tracing function Notes
1 New()
(when ppid != 1)
Shutdown()
(when ppid == 1)
Earliest opportunity to trace
(and "widest" root span possible as it spans both pids).
2 New()
(when ppid == 1)
Shutdown()
(when ppid == 1)
"Sweet spot"? Only trace the double-forked shimv2 pid.
Simpler. Nice wide root span.
3 Create()
(when ppid == 1)
Delete()
(when ppid == 1)
The narrowest root span that would make sense.

I think Option 2 would be the best. And if you changed pkg/katautils/tracing.go::StopTracing() to set tracing=false at the end of the function, the fact that Shutdown() gets called twice by the same pid shouldn't be a problem (I think it's not unreasonable to end tracing at the first call to Shutdown() ;)

I appreciate that we're not tracing as early as possible with Option 2 (we wouldn't be tracing StartShim()), but we are atleast "spanning" the duration of a container.

@jodh-intel
Copy link
Contributor

I should have mentioned that I'm using containerd 1.4.1:

$ ctr version
Client:
  Version:  v1.4.1
  Revision: c623d1b36f09f8ef6536a057bd658b3aa8632828
  Go version: go1.13.15

ctx, cancel := context.WithCancel(ctx)

s := &service{
id: id,
pid: uint32(os.Getpid()),
ctx: ctx,
tracer: tracer,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cmaf Could you help me understand how this Tracer instance is used? I'm probably overlooking something but I don't see it referenced anywhere.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pmores, thanks for catching this. You're not overlooking anything, I previously used s.tracer in the code but that had been removed. I removed this instance.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cmaf Thanks!

@cmaf cmaf force-pushed the tracing-shimv2-create-nodebug branch from 9ce6522 to 1063b13 Compare November 20, 2020 01:01
@cmaf
Copy link
Contributor Author

cmaf commented Nov 20, 2020

/test

@cmaf
Copy link
Contributor Author

cmaf commented Nov 20, 2020

I think Option 2 would be the best. And if you changed pkg/katautils/tracing.go::StopTracing() to set tracing=false at the end of the function, the fact that Shutdown() gets called twice by the same pid shouldn't be a problem (I think it's not unreasonable to end tracing at the first call to Shutdown() ;)

@jodh-intel Updated with a ppid check for scenario 2. If you generate a trace now you should see several ungrouped spans not specific to shimv2, and then a single New span with the spans traced in service.go grouped under it. Without the check, you will see the several ungrouped spans, and two New spans, one with StartShim grouped under it and the other with the remaining spans. I wasn't able to replicate the functions being called twice so I did not add the tracing check to pkg/katautils/tracing.go::StopTracing().

The test for tracing in shimv2 will be in the 2.0-dev branch: kata-containers/tests#3038.

if os.Getppid() == 1 {
span, _ := trace(s.ctx, "ResizePty")
defer span.Finish()
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cmaf Are the ppid checks useful though outside of New(), Shutdown(), Create() and Delete()? I would assume that the first fork exits after spawning the second one...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pmores you're right, they actually don't need to be in all functions. However I am removing all the checks because we decided to trace all functions irrespective of ppid.

@amshinde
Copy link
Member

Hi @cmaf, I just tried the latest version of your PR to test out the tracing output, but I am not seeing any traces related to CreateSandbox as I was seeing earlier. I do see that I have enabled tracing for the runtime in the configuration.toml file, but the only traces I am seeing are traces consisting of a single span for grpc.Health/Check. Can you check if the tracing works for you?

@cmaf
Copy link
Contributor Author

cmaf commented Nov 20, 2020

@amshinde yes it is working for me. I see createSandbox under New

@@ -274,7 +298,20 @@ func getTopic(e interface{}) string {
return cdruntime.TaskUnknownTopic
}

func trace(ctx context.Context, name string) (opentracing.Span, context.Context) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cmaf Why not just reuse the Trace() function in pkg/katautils/tracing.go and modify it if needed if you want to add the check for nil context. cc @jodh-intel

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@amshinde Trace() in pkg/katautils/tracing.go sets a tag for the cli, which we don't want in this case. I noticed virtcontainers had its own implementation of trace() so I implemented one for shimv2. I think we should fix this in a subsequent PR so that we're not replicating code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, but this seems like a simple fix, and probably easier to do in this commit than at a later time when the two functions may have diverged (they already began diverging here)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@c3d virtcontainers has 13 different trace() implementations which exist to add different tags to spans (@jodh-intel, correct me if I'm wrong). I think this should be changed so that all use a general Trace() function that takes a list of tags, but since it is not essential to shimv2 tracing functionality and applies to many other functions, it should have its own issue & PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #3093

@cmaf cmaf force-pushed the tracing-shimv2-create-nodebug branch from 2de728b to 7fad6fe Compare November 21, 2020 01:30
@pmores
Copy link

pmores commented Nov 23, 2020

With the latest changes, this PR looks good to me at this point.

@@ -274,7 +298,20 @@ func getTopic(e interface{}) string {
return cdruntime.TaskUnknownTopic
}

func trace(ctx context.Context, name string) (opentracing.Span, context.Context) {
if ctx == nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test does not exist in the Trace() function in katautils.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@c3d you're right, it does not exist there but it exists in the sandbox trace function and a few other ones in virtcontainers. The CI fails with a SIGSEGV without it so I kept it in.

@@ -274,7 +298,20 @@ func getTopic(e interface{}) string {
return cdruntime.TaskUnknownTopic
}

func trace(ctx context.Context, name string) (opentracing.Span, context.Context) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, but this seems like a simple fix, and probably easier to do in this commit than at a later time when the two functions may have diverged (they already began diverging here)

ctx = context.Background()
}
span, ctx := opentracing.StartSpanFromContext(ctx, name)
span.SetTag("subsystem", "runtime")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Trace() function sets the source tag to runtime, not the subsystem one. What is the reason for the difference?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Trace() function has an additional created span message, which according to the comment, is essential for tracing testing. Thag begs the question: how do you plan to test this newly added tracing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@c3d I switched it to source, it should be consistent.

Regarding the created span message, the test is undergoing significant changes in a PR (kata-containers/tests#2129) and will not check for that string.

@@ -330,6 +367,9 @@ func (s *service) Cleanup(ctx context.Context) (_ *taskAPI.DeleteResponse, err e

// Create a new sandbox or container with the underlying OCI runtime
func (s *service) Create(ctx context.Context, r *taskAPI.CreateTaskRequest) (_ *taskAPI.CreateTaskResponse, err error) {
span, _ := trace(s.ctx, "Create")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't you want to have a few span.LogField() added giving more information about what is being created?
(This same comment applies to practically all functions below)

Or do you plan to add that later?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think at least to start, the basic spans would be good -- it shows the trace of execution very clearly. if feasible, I think we should look at adding in follow on PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@c3d @egernst we can consider this for a later PR

@egernst
Copy link
Member

egernst commented Dec 2, 2020

/test

@cmaf cmaf force-pushed the tracing-shimv2-create-nodebug branch from 7fad6fe to 82ede0e Compare December 2, 2020 05:53
@cmaf
Copy link
Contributor Author

cmaf commented Dec 2, 2020

/test

Add trace calls to shimv2 that create spans for functions in service.go.
Tracing starts in New(), which is forked twice and is followed by either
StartShim() or Create().

Tracing cannot start without the value for Trace enabled from the
runtime config so load the config in New(), which results in it being
loaded every time New() is called in addition to where it is originally
loaded after Create().

Fixes kata-containers#1807

Signed-off-by: Chelsea Mafrica <[email protected]>
@cmaf cmaf force-pushed the tracing-shimv2-create-nodebug branch from 82ede0e to 0279c81 Compare December 2, 2020 06:22
@cmaf
Copy link
Contributor Author

cmaf commented Dec 2, 2020

/test

@cmaf
Copy link
Contributor Author

cmaf commented Dec 2, 2020

@c3d, thanks for the feedback! I think I responded to everything, please let me know if you still have concerns.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
needs-forward-port Changes need to be applied to a newer branch / repository needs-review Needs to be assessed by the team. no-backport-needed Changed do not need to be applied to an older branch / repository
Projects
None yet
Development

Successfully merging this pull request may close these issues.

shimv2: Add tracing for Kata shimv2 binary
7 participants