Tuesday, July 13, 2021

Lag : Consoles, Emulators, FPGAs

When playing classic video games on non-original hardware, one should always be conscious of the amount of extra lag that method may introduce compared to the original hardware.  Along with accuracy, latency is one of the most important tangible benefits of using original hardware and display technology (CRTs) over emulators and current display technology (LCDs).  Latency has always existed in some form, and in this article I will give an overview of how it has evolved over time.  

Additionally, the use of FPGA chips to simulate original hardware has become increasingly popular over the past five years.  FPGAs can offer the benefit of lower latency compared to traditional software-based emulation and can offer a high degree of accuracy using relatively inexpensive hardware.  FPGAs are not without their own issues, however, and in this article I will go over some of the problems with using FPGAs as a replacement for original hardware.

Units of time

Before we begin discussing latency, it is helpful to review the units of time that are going to be used throughout this article and their relationships.

1 second = 1,000 milliseconds (ms) = 1,000,000 microseconds (us) = 1,000,000,000 nanoseconds (ns)

1/60 second = 16.7 milliseconds, 1/50 second = 20 milliseconds

Latency in Classic Consoles

It is incorrect to state that classic consoles and original hardware had "no lag".  Our classic consoles and home computers certainly had input latency.  In fact, there are multiple sources of latency in any computing device which accepts input.  The sources of lag in a classic console are input reading time, CPU processing time, program code and display output.

The first step when a program wants to receive input is to execute CPU instructions which read from a port or a memory location assigned to an input device.  For a classic console like the NES, the controllers are read at a pair of specific memory locations.  To obtain the button state from a controller, the program first instructs the CPU to write to a specific memory address to "strobe" the controller.  The strobe signal tells the shift register in the NES controller to latch the button states, which it then sends to the console serially.  The CPU reads the memory address eight times, one bit per read, to obtain the state of all eight buttons on the controller.  Then the program's code performs logical operations to figure out what those bits mean.  An efficient controller read routine takes about 125 CPU cycles.  The NES's CPU has approximately 1,789,773 cycles to play with every second (1.79MHz).  Each CPU cycle takes 0.000000559 seconds, or 559ns.  If the CPU takes 125 cycles to read that controller, then the read takes about 69.8us. 
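To put numbers on that, here is a quick back-of-the-envelope sketch using the figures above (the 125-cycle routine is an assumed round number; real games vary):

```python
# Back-of-the-envelope cost of an NES controller read, using the figures
# from the text (assumed, not cycle-exact for any particular game).
NES_CPU_HZ = 1_789_773          # NTSC NES CPU clock
READ_ROUTINE_CYCLES = 125       # assumed cost of an efficient read routine

cycle_time_ns = 1e9 / NES_CPU_HZ
read_time_us = READ_ROUTINE_CYCLES * cycle_time_ns / 1000

print(f"One CPU cycle: {cycle_time_ns:.0f} ns")     # ~559 ns
print(f"Controller read: {read_time_us:.1f} us")    # ~69.8 us
```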

The second and third steps are handled by the CPU and the program.  Games vary in how often they read the controller.  They typically read controllers during the vertical blanking period, the period when the system is not actively drawing the graphics.  One game may read the controller every frame, another every other frame or every third frame.  Some games read the controllers more than once per frame (typically those using timing-sensitive peripherals like light guns and analog paddles and joysticks).  The controllers may also be read several times to allow the buttons to settle or "debounce", or because something else may interfere with controller reads (the NES's DPCM APU channel can interfere with controller reads in NTSC consoles).  

The final step is the graphics output to the TV.  In the classic console era, this invariably meant a CRT; LCD TVs are a phenomenon of the 21st century.  The console outputs analog video serially to the TV.  A CRT does not wait for a complete frame to be ready before displaying it, as an LCD does.  It draws the display line by line, approximately 262 lines per frame for NTSC (roughly 312 for PAL), 60 or 50 times per second, for classic consoles.  An NTSC TV takes 16.7ms to draw a frame and a PAL TV takes 20ms.  An NTSC scanline takes about 63.6us to draw.  Most consoles before the PlayStation, N64 and Saturn had just enough video memory to buffer one complete frame (except the Atari 2600, which had no video memory and could only buffer one line).
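Those per-frame and per-scanline figures fall straight out of simple division (a quick sketch using nominal NTSC numbers; the real NTSC field rate is closer to 59.94Hz):

```python
# Rough NTSC frame and scanline timing, matching the figures quoted above.
LINES_PER_FRAME = 262       # NTSC; PAL uses roughly 312
FRAMES_PER_SECOND = 60      # nominal rate

frame_time_ms = 1000 / FRAMES_PER_SECOND
line_time_us = frame_time_ms * 1000 / LINES_PER_FRAME

print(f"Frame: {frame_time_ms:.1f} ms")      # ~16.7 ms
print(f"Scanline: {line_time_us:.1f} us")    # ~63.6 us
```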

Because a part of the screen is being drawn constantly, it cannot be said that a CRT has zero lag, or that it has one frame of lag.  The correct approach is to give the average latency over the drawing of a single frame, so we say that a CRT has 8.3ms of latency.  To illustrate this, let's take three samples of a player hitting a button while the game is drawing its image on different lines of the screen.  We will assume the game reads the controller once per frame during vblank.  In sample 1, the player hits the button when the console is drawing scanline 1.  In sample 2, the player hits the button when the console is drawing scanline 131, and in sample 3 the player hits it on scanline 250.  Of these, the third sample has the lowest latency relative to the reading routine and the first sample has the highest, but only if the button reading routine is called after scanline 250; if the game reads the controller at the very start of vblank, the third press misses that frame's read and must wait for the next one.  Scanline 240 is where the vertical blanking period begins in NTSC consoles and many PAL consoles.
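Here is a small sketch of those three samples, assuming a single controller read per frame at scanline 255, late in vblank (an assumption chosen for illustration; real games read at various points after scanline 240):

```python
# Sketch of the three samples above: how long a button press waits until
# the next controller read.  Assumes one read per frame at a configurable
# scanline inside vblank (an assumption; real games differ).
LINES_PER_FRAME = 262
LINE_TIME_US = 63.6

def wait_until_read(press_line, read_line):
    # If the press lands after this frame's read, it waits for next frame's.
    lines = (read_line - press_line) % LINES_PER_FRAME
    return lines * LINE_TIME_US / 1000   # milliseconds

for press in (1, 131, 250):
    print(press, f"{wait_until_read(press, read_line=255):.1f} ms")
# press on scanline 1   -> ~16.2 ms
# press on scanline 131 -> ~7.9 ms
# press on scanline 250 -> ~0.3 ms (lowest, since the read happens after it)
```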

Latency with Modern Emulators

Today's PCs are many orders of magnitude faster and more capable than any home console, handheld console or home computer designed and released during the 20th century.  This allows them to run emulators of classic systems with a high level of accuracy.  The more advanced the emulated system, the more powerful the PC required to emulate that system at reference-level accuracy.  Reference level is "near perfection" in that the emulator will run about 98% of licensed games without obvious issues.

There are four main sources of latency with emulators: the emulator itself, the operating system, the input and the display.  These sources (except for the display) must be considered in addition to the latency of the game and console itself.  

Game inputs to a modern system are handled through the USB protocol.  USB is a packet-based protocol and, as it can support any kind of peripheral, it is much more complex and uses significantly more data to transmit button states than an old-school controller interface.  A SNES controller can give its state in two bytes, but a USB version of that controller may take twenty with all the overhead of the packet.  USB devices can vary quite a bit in how quickly they send packets to the USB controller in the system.  Good quality controllers should be able to hit 1ms over a wired USB connection.  
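For a sense of how compact a classic controller's report is, here is a small sketch packing the twelve SNES buttons into the two bytes mentioned above (the bit order follows the commonly documented SNES serial report; the last four bits are unused padding):

```python
# Sketch: the 12 SNES buttons fit in the two bytes mentioned above.
# Bit order follows the commonly documented SNES serial report.
SNES_BUTTONS = ["B", "Y", "Select", "Start", "Up", "Down", "Left", "Right",
                "A", "X", "L", "R"]

def pack_snes_state(pressed):
    """Return the 16-bit controller state as two bytes."""
    word = 0
    for i, name in enumerate(SNES_BUTTONS):
        if name in pressed:
            word |= 1 << (15 - i)   # first button reported = most significant bit
    return word.to_bytes(2, "big")

print(pack_snes_state({"B", "Start"}).hex())   # two bytes: '9000'
```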

Emulators have not read PC input ports directly since the days of Nesticle and DOS-based emulators, and even then they had to use a parallel, serial or game port.  A modern emulator, whether running on Windows, Linux or macOS, has to communicate with the USB controller through an API.  The OS has to give varying levels of priority to programs and the dozens of services running in the background.  The efficiency of the interaction between emulator and OS may have a significant effect on the resulting input latency.  For more advanced or demanding emulators, limited CPU speed and resources may also add latency if the CPU cannot keep up with the emulator's demands.

Modern Display Latency

By the end of the first decade of the 21st century, the flat-panel display had completely displaced the CRT as the mainstream display method of choice.  Computers had pretty much dispensed with the huge, heavy, high-resolution CRT screens by 2005.  Other technologies like plasma and DLP displays have also been consigned to the obsolete pile.  

Until the advent of "game mode" sometime around 2015, LCD display manufacturers did not care about latency (except for computer displays) and there were few consumer-accessible ways to measure it.  One display could take 30ms to display a frame (1.8 frames), another could take 60ms (3.6 frames) and a third could take 120ms (7.2 frames).  If the display had to upscale a lower resolution to its native resolution, the video processing would often add significant delay to the frame updating.  Pixel response times, especially in older panels and in the passive-matrix displays of 1990s laptops and handheld consoles, could also add latency as pixels lazily transitioned from one state to another.  

If you were watching static content like a TV show or a movie, latency did not matter as long as the sound and the video were in sync.  Recent game consoles factored latency into their reading routines, as wireless controllers like the Wiimote and the PS3 DualShock tended to add a more significant and more variable amount of lag than the wired, simpler controllers of previous generations.  Moreover, the increasing use of the 2.4GHz band for wireless peripherals could also cause delay as controller receivers struggle to separate signal from noise.  Modern games accept the higher latency that comes from these controllers and structure their controller reading routines to be much more forgiving than classic games.  However, for classic consoles designed to be played on a CRT, the high-latency updating of these LCDs just will not do for an ideal experience.

Today we have a highly competitive display manufacturing market that has come to appreciate the marketing benefits of advertising low-latency displays.  We also have reviewers that take panel latency into account, good YouTube channels and affordable lag testers.  These days, about one frame (16-17ms) is generally considered the most display latency acceptable for a display that hopes to compete with a CRT.  An LCD does not display all the pixels of a frame instantly or all at one time, so latency may vary a bit depending on where you hit the button relative to what is happening on the screen.  If you can measure 8.3ms at the center of the screen, then you have got yourself a fast-response display.

FPGAs : Benefits and Drawbacks Regarding Latency and Accuracy

An FPGA console can provide significant benefits over a traditional PC emulator, but none of those benefits are guaranteed.  An FPGA console can demonstrate a very high level of accuracy to the original systems with relatively modestly priced hardware.  FPGAs have the advantage of programmable logic, and that logic can be programmed to simulate the original circuitry and the various quirks, edge cases and delays associated with those circuits.  While a highly accurate SNES emulator like Higan requires a rather powerful modern PC to maintain perfect synchronization between the emulated CPU, PPU and APU at a stable 60fps, the Analogue Super Nt and the SNES MiSTer core can do the same for a fraction of the cost of the PC hardware needed for that level of fidelity.  

With the FPGA consoles from Analogue and others, supported controllers can be read at the same speed as on the original consoles (provided the controller was intended for that console; Analogue's Nt Mini and Nt Mini Noir must translate NES controller signals for the Atari 2600 core, for example).  MiSTer is designed for USB controllers (including adapters which translate original controller data into USB data).  MiSTer was not always designed for low latency; in the early days, when home computer cores were being ported over, latency was in the neighborhood of 2 frames.  While this did not matter quite as much for most home computers, once home console and handheld cores began to be ported over, the relatively slow translation of USB input into native input was no longer good enough.  MiSTer can now poll compatible USB controllers as fast as 1ms (1000Hz) and process that input with very little delay, so for most people USB input will be good enough.  MiSTer also has a low-latency option, the SNAC (Serial Native Accessory Converter), which provides a port to read an original controller directly for the controllers supported (NES, SNES, Genesis, Turbo Grafx).  There is also a lower-than-USB latency option called BLiSSTer, which can translate native controller input to a core more quickly than generic USB but not as directly as SNAC.

FPGA devices benefit because there is no operating system dictating which programs have access to resources at which times; when the FPGA is running the console, it has full control over video, audio and input.  FPGA devices have a menu and a framework, but these are not called unless the user enters the menu.  Having access to a menu does not induce latency; even though the system must be on the lookout for a menu key press, it can usually piggyback on the program's own input reading routines without butting into the gameplay.  Flash carts like the EverDrive also use this method to bring up in-game menus.  

Programming an FPGA requires a specialized skill set.  The two commonly used hardware description languages, Verilog and VHDL, have a C-like syntax, but that is where the similarities end.  Their code describes how the logic gates that make up the hardware are supposed to work.  So if you do not know your adders from your shift registers and your flip-flops, then the FPGA might as well be programmed in ancient Sanskrit for all the good its development environment will do for you.  

FPGA cores benefit the most when the hardware being simulated is well understood and has been comprehensively tested.  Old chips are often full of edge cases, bugs and glitches, and in many cases these are poorly documented.  Ideally one should have a die shot of the decapsulated chip to help the FPGA programmer determine how the logic works.  Reading and making sense of a die shot also requires a special skill in understanding how logic circuits are laid out lithographically; to most of us the die photographs look like innumerable roads and junctions.  An FPGA also cannot reproduce all circuits in the same manner; you cannot take a schematic of a chip and translate it 1:1 into Verilog.  Chips can do things, like tri-state logic, that cannot be directly replicated inside an FPGA.  And ideally, one should be able to write test ROMs to determine how the original hardware behaves in response to edge cases.  Perfecting FPGA simulation cores requires a very specialized skill set possessed by an extremely limited number of people.  Many FPGA cores are based mainly on observed behavior, using the same methods software emulators use to determine hardware behavior, not on a circuit analysis of the essential chips and the schematics of a console.  

The Right Amount of Latency - The Goldilocks Problem

The goal of reducing latency is to get it as close as possible to original hardware running on a CRT.  Obviously, using original hardware is sometimes not possible or simply not convenient.  The Analogue consoles, whether with the Analogue DAC (Super Nt, Mega Sg) or without it (Nt Mini, Nt Mini Noir), can run on a 15KHz CRT and give identical latency to an original console.  MiSTer with an I/O board and SNAC (for those cores which support it) can also give identical latency to an original console.  Not everyone has a CRT anymore, but with a low-latency LCD you can get competitive with CRT latency.

The RetroArch emulator frontend has a most interesting way of reducing latency called "run-ahead", available for those of its cores which support the feature.  RetroArch runs the emulator 2-3 frames ahead of the current frame, using save states; when a button is pressed, it calculates the difference between the button press and the "current frame" and seamlessly shows the frame which corresponds to the button press.  It has been advertised as providing better-than-original-hardware latency.  But this method requires calibration for every game based on how frequently that game checks for input, and it can become complicated or non-ideal if the game checks for button presses with different frequencies at different times or more than once per frame.  It also requires a faster computer than running without the feature, because the emulator has to render more frames than normal, so there is always that potential trade-off between accuracy and performance.  Additionally, if the button press would change the program's behavior during those 2-3 run-ahead frames, the program must take time to make sure that the "valid path" was followed.
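To make the idea concrete, here is a heavily simplified sketch of a run-ahead style loop, assuming a hypothetical emulator core with save_state(), load_state() and run_frame() methods; it illustrates the concept only and is not RetroArch's actual implementation.

```python
# A heavily simplified sketch of the run-ahead idea.  It assumes a
# hypothetical emulator core with save_state(), load_state() and
# run_frame(buttons) methods, where run_frame() returns the rendered
# frame.  This illustrates the concept, not RetroArch's actual code.
RUN_AHEAD_FRAMES = 2     # how many frames of internal game lag to hide

def emulate_one_displayed_frame(core, buttons, display):
    state = core.save_state()           # remember the "true" timeline

    # Speculatively run ahead with the newest input.  Only the last of
    # these hidden frames is shown, skipping past the game's lag frames.
    frame = None
    for _ in range(RUN_AHEAD_FRAMES):
        frame = core.run_frame(buttons)
    display.show(frame)

    # Rewind to the saved state and advance the real timeline by one frame.
    core.load_state(state)
    core.run_frame(buttons)
```

Note that this naive loop emulates three frames for every one it displays, which is where the extra CPU demand mentioned above comes from.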

Ideally, what is needed is neither slower-than-original nor faster-than-original latency, but to maintain the original latency to the closest extent possible.  Lower latency gives people an advantage over people who play the game as intended.  Higher latency gives a disadvantage, and no one wants that.  Remember that a classic console can read a controller very quickly, much more quickly than even a USB controller which can respond 1,000 times per second.  When timing critical jumps or fast combos, the difference between 1ms and 500us may not seem like much and is not obviously noticeable, but it adds up to missed opportunities over time.  You, the player, are "racing the beam" when you perform timing-critical moves.  Your character is in the path of a bullet and you move to avoid it.  Did you press the button in time for the game to read it in that frame's vblank, or does the game have to wait until the next vblank to register the press, by which time it is too late and you die?

But the issue does not end there, because wired USB controllers of today are no longer dog slow compared to classic controllers.  A USB controller which samples 1,000 times per second (1000Hz polling) has plenty of button samples to provide to a game's reading routine; even a routine which reads 3x per frame is still only sampling 180x per second.  Many reading routines read the controllers 2-3 times to eliminate inconsistent results from button debounce or to reject spurious corrupted input.  So while it may be possible to lose the occasional input because the USB controller's polling is not synchronized to the game's reading, it is unlikely to have a noticeable effect for most gamers.  Speedrunners, being something of a superstitious lot, may have a different view as applied to their play.
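Putting those figures side by side (a simple back-of-the-envelope comparison using the numbers above):

```python
# Back-of-the-envelope comparison using the figures in the text.
usb_poll_hz = 1000            # 1ms wired USB polling
reads_per_frame = 3           # a game that reads the controller 3x per frame
frames_per_second = 60

game_reads_per_second = reads_per_frame * frames_per_second
polls_per_game_read = usb_poll_hz / game_reads_per_second

print(game_reads_per_second)            # 180 reads per second
print(f"{polls_per_game_read:.1f}")     # ~5.6 USB polls per game read
```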

Conclusion

Latency is inescapable; the only real question is how much of it can be tolerated while still remaining faithful to the original hardware.  LCDs are much faster than they used to be, and with careful shopping one can buy one with lag measurements that are competitive with CRTs.  Software emulation, given the right hardware and OS environment, can produce latency on par with original hardware, as can FPGA hardware that mimics that hardware, and the FPGA tends to do so much more cheaply.  How much extra lag can be tolerated comes down to the individual user's preference, but the standard I encourage anyone who plays classic video games to strive for is "as good as original hardware".  If that cannot be maintained, then the goal should be "as little extra as possible", but never less latency than the original hardware could manage.

5 comments:

  1. Your units of time table is wrong. 1 second = 1,000 milliseconds (ms) = 1,000,000 microseconds (us) = 1,000,000,000 nanoseconds (ns)

  2. Interesting article! I just started playing with a Mister and a lot of my C64 and Amiga games feel right for the first time despite a cheap USB controller and pretty nasty 1080p LCD over HDMI. It feels way better than an emulator on my high end gaming PC and monitor. I think there’s something to be said for the consistency and fidelity, not just latency in an FPGA.

    Been having a great time for very little money 🙂

  3. PC Gaming monitors have truly excellent lag numbers these days, at least on the higher end : https://www.rtings.com/monitor/tests/inputs/input-lag

  4. While there's lots of good info here, there's a bit of misleading stuff about LCDs. LCDs don't wait for an entire frame and then start rendering. The range of delay between the start of the source vsync and pixels appearing at the top of the screen is quite wide but can be as low as a couple milliseconds, even on hardware released nearly 20 years ago (here, I'm thinking of PC TN LCDs; TVs that fast took a much longer time to appear). Here are some hard numbers for a range of LCDs dating back to the mid 2000s:

    http://alantechreview.blogspot.com/2020/06/pilagtesterpro-input-lag-test-result.html

    native min lag is the column of interest, representing lag at the top of the screen, measured relative to when that line of the image was sent over the data cable (hdmi, dvi, etc).

    as you can see plenty of old PC LCDs (and even a few TVs) start drawing the input just 5 ms after receiving it. Almost all LCDs scan out just like CRTs did, one line at a time starting at the top (or occasionally, the bottom), just with some extra delay. Your broad point is certainly true that there's much more lag on your average LCD than on a CRT, but if you are going to get nerdy about it, might as well get the small details right.
