2018-04-25 OW Tech Interchange - Meeting Notes

Attendees: James, Matt, Andy Steed, Dan B., Vincent, Christian, Olivier, Moritz, Andreas, Dave G., Michele, dankico (Dan), Markus, Tyson, Jason, Duy, Dragos, Rodric, Priti, Sandeep, himavanth, Martin, Vadim, Carlos

Notes:

James Thomas is moderating today
last meeting was 2018-04-11

Introductions of new attendees

Andy Steed (Adobe): starting to work on OW project, getting on-boarded
Dan Kelso: comes from Apache Sling, looking at OW and Java tooling…
Moritz: introduced himself; joined the project 4 weeks ago

Open comments on status/updates in a few areas:

Main/core OpenWhisk

https://github.com/apache/incubator-openwhisk/pulls?utf8=%E2%9C%93&q=is%3Apr+is%3Amerged
Status updates:

Markus: notable commits incl:

Christian adding perf. tests in core repo. (moved into core)
Introduced notion of cluster discovery; cluster orchestration becomes a first-class citizen of OW

needed for LB clusters (Mesos, Kube for example)

ArtifactStores, first bits merged

Concurrent Activations (Tyson updates, shares screen) [presentation: concurrent activations.pdf]

Use cases at Adobe have throughput reqs. where tolerances for latency are very low
mixed events from diff sources, system becomes “cold”, warm containers are lost, etc.
goal: run fewer containers with larger load, keep containers warm to account for latency
PR 2795 “enable concurrent activation processing”

opened some time ago, now getting back to

Action Image changes:

Steps to achieve concurrency

Action image changes…
PR 31 under “nodejs”: remove the state tracking that is there to avoid concurrent activations (allows better external control)
Markus says “nuke it”

Actions are “interleaved” in log, need to prevent this interleaving…
At Adobe, we have had to disable log collection and rely on external collection via a Docker log driver to solve this.
Rodric: could we use/add the activation ID on all log messages?
Tyson: we have no structure today, and this is not something everyone may want to do? Other approaches may work (coerce log output from action to be structured)
Markus: have thought about this as well… from my data, it is clear logging is a problem today, more logging slows system, would like to have act. ID in logs as well, to help with container management.

We should start a “dev” list discussion on this topic; it is valuable for many deployments

Tyson: follow-up, do we allow user to log using unstruct. format? Do we bubble up a struct. to user? Do we stop ad-hoc writing to logs?
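
To make the activation-ID idea from the discussion above concrete, here is a minimal sketch in Python. It is not OpenWhisk code: the `ActivationLogger` class, the `split_by_activation` helper, and the JSON field names are all hypothetical, chosen only to show how tagging every line lets interleaved output from concurrent activations be separated again afterwards.

```python
import io
import json
from collections import defaultdict


class ActivationLogger:
    """Hypothetical helper: writes one JSON object per log line,
    tagged with the activation ID that produced it."""

    def __init__(self, stream, activation_id):
        self.stream = stream
        self.activation_id = activation_id

    def log(self, message):
        record = {"activationId": self.activation_id, "message": message}
        self.stream.write(json.dumps(record) + "\n")


def split_by_activation(raw):
    """Group interleaved structured log lines by activation ID."""
    grouped = defaultdict(list)
    for line in raw.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        grouped[record["activationId"]].append(record["message"])
    return dict(grouped)


# Two activations share one container log stream and interleave their output.
shared = io.StringIO()
first = ActivationLogger(shared, "act-1")
second = ActivationLogger(shared, "act-2")
first.log("start")
second.log("start")
first.log("done")
second.log("done")

logs_by_activation = split_by_activation(shared.getvalue())
```

The open question Tyson raises still applies: this only works if the action (or the runtime wrapping it) writes structured lines, which not every user may want.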

Invoker changes:

Change message feed (buffer where msgs are pulled from Kafka and delivered to container pool)

May need to allow tuning of how we adjust pulling of messages to ack. concurrent activations vs. not (now it is maxed out based on max conc. setting)

How do we decide if container is “free” vs “busy”… ContainerPool mgmt.: container remains in free pool until it is maxed out; logic needs to change
CounterProxy: Activation count, “ready” calculation needs adjustment
Deep-rooted change incl. HttpUtils (client) needs to be switched to multi-threaded connection manager, there is a “pooling” conn. manager that we could leverage… otherwise we see how system “bogs down” heavily
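
The “free” vs. “busy” change described above can be sketched as follows. This is an illustration in Python, not the actual Scala ContainerPool: the `Container` and `ContainerPool` classes, their fields, and `containers_needed` are hypothetical names. The assumption is that a container stays “free” until its activation count reaches the max concurrency setting.

```python
from dataclasses import dataclass


@dataclass
class Container:
    name: str
    max_concurrent: int = 1
    active: int = 0  # activations currently running in this container

    def is_free(self):
        # "Free" now means "below the concurrency limit", not "idle".
        return self.active < self.max_concurrent


class ContainerPool:
    def __init__(self, max_concurrent=1):
        self.max_concurrent = max_concurrent
        self.containers = []

    def schedule(self):
        """Route an activation to the first container with spare capacity,
        starting a new container only when all are maxed out."""
        for container in self.containers:
            if container.is_free():
                container.active += 1
                return container
        container = Container(
            name=f"c{len(self.containers)}",
            max_concurrent=self.max_concurrent,
            active=1,
        )
        self.containers.append(container)
        return container

    def complete(self, container):
        """Mark one activation as finished, freeing a slot."""
        container.active -= 1


def containers_needed(activations, max_concurrent):
    """Count containers started for a burst of simultaneous activations."""
    pool = ContainerPool(max_concurrent)
    for _ in range(activations):
        pool.schedule()
    return len(pool.containers)
```

With the limit at 1, `containers_needed(3, 1)` starts three containers; with the limit at 200, the same burst fits in one.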

Rodric: pool manager, uses Apache HttpClient, historically; at one point we tried Akka client, but we saw deadlock… or timeouts… maybe Akka has fixed these issues? Need to try again, but be aware that we tried this… Markus may recall these issues
Markus: we can try this route again, need a threaded model, vs. pool mgr w/ diff threads.
Tyson: Dan M. is set to research this at Adobe, in next week or 2
Markus: need to limit scope and assure change keeps in line with throughput today
Dragos/Tyson: should we have an integer value with a limit on the total? Or should we have a boolean where TRUE = all activations for a current action are routed to the same container, regardless of total # (force use of same container… vs. only 10 per container)
Tyson: current PR is a global switch… default is 1,

can set to 200, meaning all containers can run with 200 conc. actions (not a long-term solution; needs to be settable as a per-action configuration vs. a global system setting).

Tyson: discusses “worst case” and “best case” results/risks for this PR
Tyson: could have a “concurrent peek ratio” that auto adjusts?
Tyson: testing added as part of this PR of course…

throughput tests (existing) plus async (new)
Shows some results with #conc. actions=200 and =1 set shown for sync vs async cases
slightly better, not amazingly better…

Markus: why is max 200 slower than max 1?
Tyson: need to investigate.. perhaps running on local machine, vs. a true cluster?
Rodric: could be the bigger look-ahead buffer…?
Markus: #containers * max. concurrent may be same as today?
Tyson: may need to disable logging as well to get more acc. numbers…
Tyson: activation leaks may be an issue as a result…
James: customers would like this (log tracking, more user control)… be able to track conc. activations better. Feature equiv. as to what other platforms may have?
Tyson: challenges with OW, state of invokers is not shared anywhere… so this level of control is not exposed.
James: let’s take these things to “dev” list

Areas of concern:

addressed previous week's concern… Vagrant (hello/ez-up) should now be healthy again; fixes made to image (disk size, etc.)

Release process:

Matt: Cwiki: Maturity model posted… in progress
Matt: release repo docs now list license policies/exclusion

getting ready for Apache Inc. board to show well-doc processes
More update to LICENSE/NOTICE still to be completed

Mesos/Splunk update:

Tyson: at some point (came up on slack), resource advert. and request for developers

need to see if we can allow GPU access to actions…
will pursue on “dev” list

Kubernetes:

https://github.com/apache/incubator-openwhisk-deploy-kube/pulls?utf8=%E2%9C%93&q=is%3Apr+is%3Amerged
More in our default testing matrix: diff. versions of Minikube for example, etc.
Helm charts now being added, extended and working in Travis
Stuck on “trusty” image, may want xenial image...

API Gateway

https://github.com/apache/incubator-openwhisk-apigateway/pulls
Dragos: Actively working to get Docker Compose working

Catalog/Packages/Samples

Matt: Updating installation to use wskdeploy (starting with “utils”) PR in-progress

Tooling/Utilities

Go Runtime / Docker Skeleton (handlers for perf. improvements based upon Go experience)

Michele: presents an update: Presentation here: https://www.slideshare.net/MicheleSciabarr/openwhisk-goswiftbinaries-runtime

Final PR submitted, under review
Led to dev. of a faster/better Docker Skeleton…
Improve JSON serialization
Support generic binaries
new feature: using runtimes as “compilers”
raw actions are much faster (golang and swift tested)

James: cannot see your screen…
Michele: now shares his screen

shows diagram of “Action Loop” in new Go HttpServer
using File Descriptor #3 (i.e., "fd3") instead of stdout… but preserves compatibility
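
A rough sketch of the “Action Loop” idea, written in Python for illustration only (the runtime presented is in Go). The one-JSON-object-per-line framing, the `action_loop` function, and the `hello` handler are assumptions; the notes only state that results go to fd3 instead of stdout. Streams are passed in as parameters so the sketch runs anywhere; in a real process the `results` stream would wrap file descriptor 3, leaving stdout free for ordinary log output.

```python
import io
import json


def action_loop(requests, results, logs, handler):
    """Read one JSON request per line, run the handler, and write each
    result as one JSON line to the dedicated results stream."""
    for line in requests:
        line = line.strip()
        if not line:
            continue
        args = json.loads(line)
        output = handler(args, logs)
        results.write(json.dumps(output) + "\n")
        results.flush()  # the caller is waiting on each result


def hello(args, logs):
    """Hypothetical action: logs freely without corrupting the results."""
    name = args.get("name", "world")
    logs.write(f"greeting {name}\n")
    return {"greeting": "Hello, " + name}


# In-memory stand-ins: stdin for requests, fd 3 for results, stdout for logs.
requests = io.StringIO('{"name": "Mike"}\n{}\n')
results = io.StringIO()
logs = io.StringIO()
action_loop(requests, results, logs, hello)
```

Keeping results on a separate descriptor means an action that prints to stdout cannot be mistaken for a result, which is one way the compatibility mentioned above could be preserved.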

Michele: idea: use images as compilers…

place source in “/src” dir…
docker run with “compile” flag
perhaps support in wskdeploy (better option than cli)
result: binary left out of main

James: awesome and exciting work!

once documented, can write some blogs, etc. perf improves, compiler etc.

Matt: would love to support in wskdeploy… please open an issue
Carlos: we had this idea before “init” actions…
Michele: idea to add new option to wskdeploy (precompile action option)
Carlos: this would apply to NodeJs as well, instead of doing local install from NPM… need local modules we can resolve..

QQ: can we start with Go 1.10? skip 1.9?

Michele: can do this, 1.9 now more widely used, no backward compat. issues…
Carlos: let’s discuss offline… 1.10 would help with support…

James: helping create a conference called “ServerlessDAYS”

Contact me if interested, need to get OW on schedule/talks/sessions

Confirm moderator for next call

Andy volunteers for May 9th meeting
adjourn 11:00 AM US Central

raw chat log:

From rodric rabbah to Everyone: (10:31 AM)

Would be nice to explain the difference @tyson

also need to consider the risk of losing more activations with a deeper peek buffer since that bounds the number of activations that can be lost

From Markus Thömmes to Everyone: (10:33 AM)

that was discussed though?

From rodric rabbah to Everyone: (10:33 AM)

Was it? I must have missed it. Where does the peek buffer get backed up to?

From Markus Thömmes to Everyone: (10:34 AM)

It was stated that there is that risk and there might be strategies to tweak the peek with a ratio to limit the danger

From James Thomas to Everyone: (10:45 AM)

Can anyone else see michele’s screen?

From dgrove to Everyone: (10:45 AM)

no

From Tyson Norris to Everyone: (10:45 AM)

no

From vadimraskin to Everyone: (10:45 AM)

nope
