Attendees: James, Matt, Andy Steed, Dan B., Vincent, Christian, Olivier, Moritz, Andreas, Dave G., Michele, dankico (Dan), Markus, Tyson, Jason, Duy, Dragos, Rodric, Priti, Sandeep, himavanth, Martin, Vadim, Carlos
Notes:
  • James Thomas is moderating today
  • last meeting was 2018-04-11
Introductions of new attendees
  • Andy Steed (Adobe): starting to work on OW project, getting on-boarded
  • Dan Kelso: comes from Apache Sling; looking at OW and Java tooling…
  • Moritz: introduced himself; joined the project 4 weeks ago
Open comments on status/updates in a few areas:
  • Main/core OpenWhisk 
    • https://github.com/apache/incubator-openwhisk/pulls?utf8=%E2%9C%93&q=is%3Apr+is%3Amerged
    • Status updates:
      • Markus: notable commits incl:
        • Christian adding perf. tests in core repo. (moved into core)
        • Introduced the notion of cluster discovery; cluster orchestrators become first-class citizens of OW
          • needed for LB clusters (Mesos, Kube for example)
        • ArtifactStores, first bits merged
    • Concurrent Activations (Tyson updates, shares screen) [presentation: concurrent activations.pdf]
      • Use cases at Adobe have throughput reqs. where tolerances for latency are very low
      • mixed events from diff sources, system becomes “cold”, warm containers are lost, etc.
      • goal: run fewer containers with larger load, keep containers warm to account for latency
      • PR 2795 “enable concurrent activation processing”
        • opened some time ago, now getting back to it
      • Action Image changes:
        • Steps to achieve concurrency
          • Action image changes… 
          • PR 31 under “nodejs”: remove the state tracking used to avoid concurrent activations (allows better external control)
          • Markus says “nuke it”
        • Actions are “interleaved” in log, need to prevent this interleaving…
        • At Adobe, we have had to disable log collection and rely on external collection via a Docker log driver to solve this.
        • Rodric: could we use/add the activation ID in all log messages?
        • Tyson: we have no structure today, and this is not something everyone may want to do? Other approaches may work (coerce log output from the action to be structured)
        • Markus: have thought about this as well… from my data, it is clear logging is a problem today, more logging slows system, would like to have act. ID in logs as well, to help with container management.
          • We should start a “dev” list discussion on this topic; it is valuable for many deployments
        • Tyson: follow-up, do we allow the user to log using an unstructured format? Do we bubble up a structured format to the user? Do we stop ad-hoc writing to logs?
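The activation-ID idea raised above can be sketched as follows. This is only an illustration, assuming a hypothetical JSON-lines record with `activationId`, `stream`, and `message` fields; the notes do not settle on a log format, and `tag_line` and `group_by_activation` are made-up helper names.

```python
import json
from collections import defaultdict


def tag_line(activation_id, stream, message):
    """Wrap one raw log line in a JSON record carrying the activation ID.

    The record layout is an assumption for illustration, not an OpenWhisk format.
    """
    return json.dumps(
        {"activationId": activation_id, "stream": stream, "message": message}
    )


def group_by_activation(lines):
    """Split an interleaved container log back into per-activation messages."""
    grouped = defaultdict(list)
    for line in lines:
        record = json.loads(line)
        grouped[record["activationId"]].append(record["message"])
    return dict(grouped)


# Two activations ("a1", "b2") writing to the same container log, interleaved.
interleaved = [
    tag_line("a1", "stdout", "start"),
    tag_line("b2", "stdout", "start"),
    tag_line("a1", "stdout", "done"),
    tag_line("b2", "stderr", "failed"),
]
per_activation = group_by_activation(interleaved)
```

With every line tagged, interleaving in the shared log no longer matters, because the lines can be regrouped per activation afterwards. The cost is that output is no longer free-form, which is the trade-off Tyson raises.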
      • Invoker changes:
        • Change message feed (buffer where msgs are pulled from kafka and delivered to container pool)
          • May need to allow tuning of how we adjust pulling of messages to ack. concurrent activations vs. not (now it is maxed out based on max conc. setting)
        • How do we decide if a container is “free” vs. “busy”… ContainerPool mgmt.: a container remains in the free pool until it is maxed out; logic needs to change
        • ContainerProxy: activation count, “ready” calculation needs adjustment
        • Deep-rooted change incl. HttpUtils (client) needs to be switched to multi-threaded connection manager, there is a “pooling” conn. manager that we could leverage… otherwise we see how system “bogs down” heavily
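The “free” vs. “busy” question above could look roughly like this. This is a toy model rather than the actual ContainerPool code: `Container`, `is_free`, `schedule`, and `release` are hypothetical names, and preferring the busiest container that still has capacity is an assumption chosen to match the stated goal of running fewer containers with larger load.

```python
from dataclasses import dataclass


@dataclass
class Container:
    name: str
    active: int = 0  # activations currently running in this container


def is_free(container, max_concurrent):
    """A container stays schedulable until it reaches the concurrency limit."""
    return container.active < max_concurrent


def schedule(pool, max_concurrent):
    """Pick the busiest container that still has capacity, or None if all are full."""
    candidates = [c for c in pool if is_free(c, max_concurrent)]
    if not candidates:
        return None
    chosen = max(candidates, key=lambda c: c.active)
    chosen.active += 1
    return chosen


def release(container):
    """Call when an activation completes to hand its slot back."""
    container.active -= 1


pool = [Container("c1"), Container("c2")]
placed = [schedule(pool, 2).name for _ in range(4)]
```

With `max_concurrent` set to 1, `is_free` reduces to “free means idle”, so the single-activation behavior is a special case of the same check.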
      • Rodric: pool manager, uses Apache HttpClient, historically; at one point we tried Akka client, but we saw deadlock… or timeouts… maybe Akka has fixed these issues? Need to try again, but be aware that we tried this…  Markus may recall these issues
      • Markus: we can try this route again, need a threaded model, vs. pool mgr w/ diff threads.
      • Tyson: Dan M. is set to research this at Adobe, in next week or 2
      • Markus: need to limit scope and assure change keeps in line with throughput today
      • Dragos/Tyson: should we have an integer value with limit of total? or should we have a boolean that indicates TRUE = all activations for a current action are routed to same container, regardless of # of total (force use of same container… vs only 10 per container)
      • Tyson: current PR is a global switch… default is 1
        • can set to 200 meaning all containers can run with 200 con. actions (not a long term solution, needs to be settable on a per-action configuration vs. a global system setting)
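A rough sketch of the per-action vs. global setting discussed above. All names here are hypothetical, including `effective_limit`, `containers_needed`, and the action name `thumbnail`; only the default of 1 and the example value of 200 come from the notes.

```python
GLOBAL_MAX_CONCURRENT = 1  # default of the global switch, per the notes


def effective_limit(action_limits, action_name, global_max=GLOBAL_MAX_CONCURRENT):
    """A per-action setting wins; otherwise fall back to the global setting."""
    return action_limits.get(action_name, global_max)


def containers_needed(in_flight, limit):
    """Ceiling division: containers required to hold `in_flight` activations."""
    return -(-in_flight // limit)


# Hypothetical per-action configuration: only "thumbnail" opts in to concurrency.
limits = {"thumbnail": 200}
thumbnail_limit = effective_limit(limits, "thumbnail")
other_limit = effective_limit(limits, "other")
```

Under these assumptions, 1000 in-flight activations fit in 5 containers at a limit of 200, versus 1000 containers at the default of 1. That is the arithmetic behind running fewer containers with larger load.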
      • Tyson: discusses “worst case” and “best case” results/risks for this PR
      • Tyson: could have a “concurrent peek ratio” that auto adjusts?
      • Tyson: testing added as part of this PR of course…
        • throughput tests (existing) plus async (new)
        • Shows some results with #conc. actions=200 and =1 set shown for sync vs async cases
        • slightly better, not amazingly better…
      • Markus: why is max 200 slower than max 1?
      • Tyson: need to investigate… perhaps running on local machine, vs. a true cluster?
      • Rodric: could be the bigger look-ahead buffer…?
      • Markus: #containers * max. concurrent may be same as today?
      • Tyson: may need to disable logging as well to get more acc. numbers…
      • Tyson: activation leaks may be an issue as a result…
      • James: customers would like this (log tracking, more user control)… be able to track conc. activations better.  Feature equiv. as to what other platforms may have?
      • Tyson: challenges with OW, state of invokers is not shared anywhere… so this level of control is not exposed.
      • James: let’s take these things to “dev” list
    • Areas of concern:
      • addressed previous week's concern… Vagrant (hello/ez-up) should now be healthy again; fixes made to image (disk size, etc.)
  • Release process:
    • Matt: Cwiki: Maturity model posted… in progress
    • Matt: release repo docs now list license policies/exclusion
      • getting ready to show the Apache Inc. board well-documented processes
      • More updates to LICENSE/NOTICE still to be completed
  • Mesos/Splunk update:
    • Tyson: at some point (came up on slack), resource advert. and request for developers
      • need to see if we can allow GPU access to actions…
      • will pursue on “dev” list
  • Kubernetes:
  • API Gateway
  • Catalog/Packages/Samples
    • Matt: Updating installation to use wskdeploy (starting with “utils”) PR in-progress
  • Tooling/Utilities
  • Go Runtime / Docker Skeleton (handlers for perf. improvements based upon Go experience)
    • Michele: presents an update: Presentation here: https://www.slideshare.net/MicheleSciabarr/openwhisk-goswiftbinaries-runtime
      • Final PR submitted, under review
      • Led to development of a faster/better Docker Skeleton…
      • Improve JSON serialization
      • Support generic binaries
      • new feature: using runtimes as “compilers”
      • raw actions are much faster (golang and swift tested)
    • James: cannot see your screen…
    • Michele: now shares his screen
      • shows diagram of “Action Loop” in new Go HttpServer
      • using File Descriptor #3 (i.e., "fd3")  instead of stdout… but preserves compatibility
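The fd3 idea above can be illustrated with a small loop. This is a sketch in Python rather than the Go implementation presented, and it assumes a line-delimited JSON protocol (one request per input line, one result per output line), which the notes do not spell out. `action_loop` and `hello` are hypothetical names.

```python
import io
import json


def action_loop(requests, results, handler):
    """Read one JSON request per line and write one JSON result per line.

    Results go to a dedicated stream so that stdout remains free for logs.
    """
    for line in requests:
        line = line.strip()
        if not line:
            continue
        output = handler(json.loads(line))
        results.write(json.dumps(output) + "\n")
        results.flush()


def hello(args):
    print("log line from the action")  # ordinary logging still goes to stdout
    return {"greeting": "Hello, " + args.get("name", "world")}


# In-memory streams stand in for stdin and file descriptor 3.
requests = io.StringIO('{"name": "Mike"}\n{}\n')
results = io.StringIO()
action_loop(requests, results, hello)
```

In a real process, `requests` would be `sys.stdin` and `results` would be `os.fdopen(3, "w")`, assuming the parent process has opened descriptor 3. Keeping results off stdout means an action can print freely without corrupting its return value.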
    • Michele: idea: use images as compilers…
      • place source in “/src” dir…
      • docker run with “compile” flag
      • perhaps support in wskdeploy (better option than cli)
      • result: binary left out of main
    • James: awesome and exciting work!
      • once documented, can write some blogs, etc. perf improves, compiler etc.
    • Matt: would love to support in wskdeploy… please open an issue
    • Carlos: we had this idea before “init” actions…
    • Michele: idea to add new option to wskdeploy (precompile action option)
    • Carlos: this would apply to NodeJs as well, instead of doing local install from NPM… need local modules we can resolve..
      • QQ: can we start with Go 1.10? skip 1.9?
    • Michele: can do this, 1.9 now more widely used, no backward compat. issues…
    • Carlos: let’s discuss offline… 1.10 would help with support…
  • James: helping create a conference called “ServerlessDAYS”
    • Contact me if interested, need to get OW on schedule/talks/sessions
Confirm moderator for next call
  • Andy volunteers for May 9th meeting
  • adjourn 11:00 AM US Central

raw chat log:
From rodric rabbah to Everyone: (10:31 AM)
Would be nice to explain the difference @tyson
also need to consider the risk of losing more activations with a deeper peek buffer since that bounds the number of activations that can be lost
From Markus Thömmes to Everyone: (10:33 AM)
that was discussed though?
From rodric rabbah to Everyone: (10:33 AM)
Was it? I must have missed it. Where does the peek buffer get backed up to?
From Markus Thömmes to Everyone: (10:34 AM)
It was stated that there is that risk and there might be strategies to tweak the peek with a ratio to limit the danger
From James Thomas to Everyone: (10:45 AM)
Can anyone else see michele’s screen?
From dgrove to Everyone: (10:45 AM)
no
From Tyson Norris to Everyone: (10:45 AM)
no
From vadimraskin to Everyone: (10:45 AM)
nope