change: Finished almost my last draft on the paper

This commit is contained in:
Nicholas Novak 2023-12-14 02:23:51 -08:00
parent c95b02bd7e
commit 8d8a1e0634
4 changed files with 178 additions and 112 deletions


@@ -27,7 +27,7 @@ or three-dimensional pixels.
% Applications of voxels
A voxel\cite{enwiki:1186283262} represents a single point or cube in a
three-dimensional grid, at a variable size. This feature allows them to
approximately model many three-dimensional structures, and to reduce the
computational complexity in analyzing the shape, which has led to many
data-related use cases outside of computer science. For example, to model the
inner workings of the brain, neuroscientists track oxygen concentration through
@@ -37,7 +37,7 @@ reflections for visual effects\cite{museth2013vdb}. The output of MRI scans in
hospitals are very high-resolution voxel grids. Most recently, machine learning
models are being trained on the LIDAR data from self-driving
cars\cite{li2020deep} in order to better process their environments. However,
voxels are not often thought of as a way to permanently store three-dimensional
shapes, and existing research focuses mainly on efficiently representing and
processing shapes. My approach reframes this problem of voxel storage and
representation as a problem of database design.
@@ -205,9 +205,7 @@ advantage of this speedup. In VDB\cite{museth2013vdb} Museth demonstrates that
by modeling a sparse voxel grid at different resolutions, a computer cluster can
efficiently approximate physical structures such as clouds, in order to
calculate expensive lighting operations.

Williams\cite{williams1992voxel} expands upon the uses of a voxel database to
model graph and mesh-based problems. Taking advantage of the parallelism in the
grid, many problems can be reframed in the representation of voxels, and solve
@@ -216,7 +214,7 @@ voxel is stored in shared memory, making this process only viable to solve
problems that can be modeled on one machine, and that are far more
computationally expensive than data-intensive.

\subsection{Storing Large Voxel Data Sets}
Another approach to the problem of storing voxel data is the distributed
approach of Gorte et al.\cite{gorte2023analysis}. Since memory is limited
@@ -229,6 +227,28 @@ of the data that they are working on. In the paper, Gorte acknowledges the need
to split large datasets up into smaller regions, which is similar to the concept
of ``chunks'' in my implementation.
\subsection{Chunk Systems in Other Games}
The decision to use chunks to represent game data has many justifications. As
\cite{gorte2023analysis} mentions, an infinite grid of voxels needs to be broken
up in a way that lets applications store data efficiently, and many other games
converge on this same implementation. Another voxel-based game,
Veloren\cite{https://veloren.net}, uses the same chunk-based system, although it
differs in its storage method. The game switches between several different
storage implementations in each chunk, depending on how dense or sparse the
voxel data within the chunk is. For sparser data, the game stores block
information in a simple key-value hash map. As the number of voxels increases,
the game breaks this information up further, creating several smaller sections
within the chunk. Finally, for very dense data, the game stores a compressed
version using Zlib compression\cite{veloren32}. This suggests many options for
data compression in my database, and also shows how the database can be adapted
to store sparser structures more efficiently if the focus of the project ever
needs to change.

Since this game is based not on Minecraft but on an independent project named
Cube World, the game arrived at a similar data structure on its own, and its
developers document the performance considerations for using such a structure.
The benchmarks that they show suggest about an order-of-magnitude improvement
over using a key-value store.
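The density-dependent scheme described above can be sketched in Go. This is a minimal illustration, not Veloren's actual code: the threshold value and type names are assumptions, and Veloren's real implementation (and its Zlib stage) is more involved.

```go
package main

import "fmt"

const sectionVolume = 16 * 16 * 16

// sparseThreshold is an illustrative cutoff, not Veloren's actual value.
const sparseThreshold = 256

// SectionStore switches representations by density, in the spirit of the
// hybrid scheme described above: a hash map while the data is sparse, and a
// flat array once it grows dense.
type SectionStore struct {
	sparse map[int]string // voxel index -> state, used while small
	dense  []string       // one slot per voxel, used once dense
}

func NewSectionStore() *SectionStore {
	return &SectionStore{sparse: make(map[int]string)}
}

func (s *SectionStore) Set(idx int, state string) {
	if s.dense != nil {
		s.dense[idx] = state
		return
	}
	s.sparse[idx] = state
	if len(s.sparse) > sparseThreshold {
		// Promote to the dense representation once the map grows too large.
		s.dense = make([]string, sectionVolume)
		for i, st := range s.sparse {
			s.dense[i] = st
		}
		s.sparse = nil
	}
}

func (s *SectionStore) Get(idx int) string {
	if s.dense != nil {
		return s.dense[idx]
	}
	return s.sparse[idx]
}

func main() {
	s := NewSectionStore()
	s.Set(42, "stone")
	fmt.Println(s.Get(42)) // stone
}
```

The tradeoff is the usual one: the map costs memory per occupied voxel, while the array costs a fixed 4096 slots per section regardless of occupancy.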
\subsection{Previous Special-Purpose Databases}
The design of my database was also inspired by the LSM tree and data-driven
@@ -242,11 +262,14 @@ and replicate these in real-time.
\section{Methods}
\subsection{The Interface for the Database}
For developers to interact with the database, the database is implemented as a
library that provides a simple application programming interface to read and
write data, consisting of the following operations. The performance
considerations for each of these operations can be found below.
\begin{itemize}
\item Read a single block
\item Write a single block
@@ -254,34 +277,44 @@ the following:
\item Read a pre-defined ``chunk'' of blocks
\end{itemize}
\subsection{Reading and Writing a Single Voxel}
The process of updating the data for a single point in the world starts with the
voxel's position. Because the world is infinite on the horizontal $x$ and $z$
axes, but limited in the vertical $y$ axis, the world is implemented as a system
of ``chunks'': fixed-size $16 \times 16$ columns of voxels, 256 voxels high. The
size of these chunks is chosen so that they are large enough to be efficiently
cached, and so that many operations can occur within the same chunk, but not so
large that the hundred or so chunks sent to the user upon joining the world
cause a network slowdown. Given a point's $x$ and $z$ positions, the chunk that
the voxel belongs to can be found with a fast modulus operation, in constant
time.
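This coordinate mapping can be sketched as follows. The function names are illustrative, not the paper's: floored division selects the chunk, and the remainder (the modulus the text refers to) gives the voxel's position inside it, handling negative world coordinates correctly.

```go
package main

import "fmt"

const ChunkSize = 16 // chunks are 16x16 columns of voxels

// chunkCoords maps a voxel's world x/z position to the coordinates of the
// chunk that contains it, plus the voxel's local position within that chunk.
// Both steps are constant time.
func chunkCoords(x, z int) (chunkX, chunkZ, localX, localZ int) {
	chunkX = floorDiv(x, ChunkSize)
	chunkZ = floorDiv(z, ChunkSize)
	localX = x - chunkX*ChunkSize
	localZ = z - chunkZ*ChunkSize
	return
}

// floorDiv rounds toward negative infinity, so that e.g. x = -1 falls in
// chunk -1 rather than chunk 0 (Go's `/` truncates toward zero).
func floorDiv(a, b int) int {
	q := a / b
	if a%b != 0 && (a < 0) != (b < 0) {
		q--
	}
	return q
}

func main() {
	cx, cz, lx, lz := chunkCoords(-1, 37)
	fmt.Println(cx, cz, lx, lz) // -1 2 15 5
}
```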
To fetch the data for that chunk, the database needs to read it from disk. The
database stores this information in combined files that I call ``unity files''
(shown in figure \ref{fig:unity}), which consist of a single file on disk in
which the encoded data for each chunk is stored as a start index and size, so
that the \verb|seek| syscall can be used to efficiently query this data while
keeping only one file open. This scheme replaced the previous system of storing
chunk files separately, because the filesystem had a hard time searching through
the hundreds of thousands of chunk files in larger worlds. The start position
and size are stored in an auxiliary hash map that maps every chunk's position to
its metadata within the unity file. This structure uses a minimal amount of
memory, and also allows a chunk to be fetched from disk in a constant number of
disk reads.
\begin{figure}
\centering
\includegraphics[width=8cm]{unity-file.drawio.png}
\caption{The Layout of a Unity File}
\label{fig:unity}
\end{figure}
Each chunk is further divided into sections: each chunk consists of 16 stacked
$16 \times 16 \times 16$ cubes of voxels, which results in a total of 4096 block
states per section. Using the voxel's $y$ position, the section for a block can
be found with another modulus. Once this is found, a perfect hash function is
used to map the voxel's position to an array index within the section. Both of
these steps are done in constant time.
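One common perfect hash for a $16 \times 16 \times 16$ cube packs the three local coordinates into a single index; the sketch below assumes that packing (`y*256 + z*16 + x`), which may differ from the paper's exact formula. Every distinct local position maps to a distinct slot in 0..4095, and every slot is used, which is what makes the hash perfect.

```go
package main

import "fmt"

const SectionSize = 16 // voxels per side of a cubic section

// sectionIndex returns which of the chunk's 16 stacked sections holds a voxel,
// plus the voxel's array index within that section. Both operations are
// constant time.
func sectionIndex(localX, worldY, localZ int) (section, idx int) {
	section = worldY / SectionSize // which 16-voxel-tall slab (0..15)
	localY := worldY % SectionSize // position within that slab
	// Perfect hash: a bijection from (x, y, z) in [0,16)^3 onto [0, 4096).
	idx = localY*SectionSize*SectionSize + localZ*SectionSize + localX
	return
}

func main() {
	s, i := sectionIndex(3, 70, 9)
	fmt.Println(s, i) // 4 1683
}
```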
Every section additionally stores a look-up table that maps a
\textit{palette index} to the state of a block. When the value for the point is
@@ -289,8 +322,8 @@ retrieved from the section, the value returned is not the block's state, but
simply an index into this palette. The palette lookup is done in constant time,
and when a new block added into the section needs an additional state in the
palette, this value is added in constant time as well. The existence of this
palette supports the efficient operation of changing large portions of blocks in
the world.
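A palette of this kind can be sketched as a pair of mappings; the type and method names here are illustrative, not the paper's. Interning a state and resolving an index are both (amortized) constant time.

```go
package main

import "fmt"

// Palette maps compact indices to full block states, so each voxel in a
// section stores only a small palette index instead of the whole state.
type Palette struct {
	states  []string       // palette index -> block state
	indices map[string]int // block state -> palette index
}

func NewPalette() *Palette {
	return &Palette{indices: make(map[string]int)}
}

// Index returns the palette index for a state, adding the state in amortized
// constant time if the section has not seen it before.
func (p *Palette) Index(state string) int {
	if i, ok := p.indices[state]; ok {
		return i
	}
	i := len(p.states)
	p.states = append(p.states, state)
	p.indices[state] = i
	return i
}

// State resolves a stored palette index back to the block state.
func (p *Palette) State(i int) string { return p.states[i] }

func main() {
	p := NewPalette()
	fmt.Println(p.Index("stone"), p.Index("air"), p.Index("stone")) // 0 1 0
	fmt.Println(p.State(1))                                         // air
}
```

Because each voxel stores only an index, replacing every voxel of one state can be done by rewriting a single palette entry rather than touching 4096 slots, which is presumably what makes the bulk-change operation mentioned above cheap.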
Once the value of the point is found in the palette, the value can be returned
to the user. A visual diagram of this process can be found in figure
@@ -407,28 +440,29 @@ chunks, so that chunk data could be retrieved without decoding the entire chunk.
However, this would require a much more constrained data layout, and limit the
implementation of different voxels.

Additionally, compression would also reduce the amount of data sent from the
disk to the application.
\section{Ethical Considerations}
\subsection{Considerations of Computing Resources}
Since a database is at the core of most software systems, it is important that
the database is designed to work on a wide variety of computers, in order to
ensure all parties are able to take advantage of the improvements. I designed my
database to run on entry-level commodity hardware, as well as alongside existing
application programs that can require far more resources. Additionally, by
focusing on disk storage, which is far cheaper than equivalent capacities of
memory, my design further allows researchers or individuals to run large
datasets on a single machine.

My system targets far less memory usage than existing commercial applications
\footnote{\url{https://docs.oracle.com/en/database/oracle/oracle-database/12.2/ntdbi/oracle-database-minimum-hardware-requirements.html}}
\footnote{\url{https://wiki.lustre.org/Lustre_Server_Requirements_Guidelines}}.
The large hardware requirements of these databases come from the environments
where they are implemented: at many of these companies, the ability to keep
buying faster hardware allows the company to work on other things that are more
important. However, what this does to the player is effectively price them out
of the game that they would already be playing, especially since the database
would also have to run alongside the existing Java application of Minecraft,
which can quickly exhaust system memory.

In the design of my server I have to prioritize performance, to take advantage
of the existing hardware, but also make sure that the accessibility of the
application does not decrease as a result.
\subsection{Considerations of Complexity}
Another factor to consider in the implementation of my database is how complex
@@ -436,22 +470,20 @@ the existing systems are. Some of the most popular SQL databases, PostgreSQL and
MySQL have 1.4 and 4.4 million lines of code respectively
\footnote{\url{https://news.ycombinator.com/item?id=24813239}}.
Because these systems are so complex, the number of people who can effectively
work with and maintain them decreases, limiting this role to larger companies
that can afford teams of people to solve these problems for them. By avoiding
the significant complexity that comes with caching logic, and keeping a simple
implementation for the server, I allow more companies and developers to use this
database for their own needs, and expand with them. In addition, many decisions
were made to help in the debugging process, including the choice of JSON
serialization for the chunk data, which allows users to read the contents of
files more easily, and recover potentially corrupted data.
\subsection{Considerations in Security}
Since databases are very complex, there is also the risk that a server reachable
over the internet through the Minecraft game server might be left open to
attacks. While this is a large issue, an even more important implication is the
ability to configure the database correctly. Since these
@@ -461,37 +493,31 @@ breaches\footnote{\url{https://www.zdnet.com/article/hacker-ransoms-23k-mongodb-
that involve a single server, even at larger companies that have dedicated
teams for handling a data breach.
I mitigate this risk by implementing the database in a memory-safe programming
language, Go, which should remove the risk class of memory-unsafety bugs; these
account for around 70\% of all bugs in the Chromium browser
engine\footnote{\url{https://www.chromium.org/Home/chromium-security/memory-safety/}},
which is written entirely in non-memory-safe C++.
However, there is the possibility that information stored in the database is
exposed, whether because the database is not secured or via an application
error. Here, my database follows the threat model of many other databases, and
leaves security up to the user implementing the application. Implementing
features such as encryption would provide an additional layer of security, but
would also likely decrease performance and increase complexity, which are
harmful to security in their own ways. Ultimately, I rely on a set of defaults
that does not make any assumptions about the security of the system.
\subsection{Considerations in Fairness}
In the implementation of databases, it can often be beneficial to make certain
operations faster, at the expense of others that are not done as often. For
instance, if I notice that researchers often write more to the database than
they read from it, I can take advantage of this assumption to speed up the
database for the most common operations. However, this can be problematic if the
things that I choose to sacrifice affect a certain group of users.
This tradeoff between speed and reliability occurs so often in Computer Science
that it is described in terms of percentiles. For instance, if we notice that some
@@ -501,15 +527,9 @@ Similarly, if an event only occurs 1\% of the time, we can say it occurs in the
like this is written about by Google \cite{dean2013tail}, who have to make every
decision like this at their scale.
My database plans to keep a consistent set of guarantees in regards to the
complexity of the basic operations, and to provide constant-time implementations
for most of them.
\subsection{Considerations in Accessibility}
@@ -518,24 +538,9 @@ require a certain type of computer. Requiring a certain operating system or a
more powerful computer would limit access to many of the people that were
playing the game before.

However, with the previous performance goals, as well as an implementation in a
portable language, the program is available on as many systems as the Go
compiler supports.
\section{Future Work and Conclusion}
\printbibliography


@@ -305,3 +305,11 @@ How storage works in database systems, and the evolution of how data is stored
  year={2010},
  publisher={ACM New York, NY, USA}
}
@misc{veloren32,
  title={This Week In Veloren 32},
  author={AngelOnFira},
  month={September},
  year={2019},
  url={https://veloren.net/blog/devblog-32/}
}

paper/unity-file.drawio Normal file

@@ -0,0 +1,53 @@
<mxfile host="Electron" modified="2023-12-14T09:51:26.683Z" agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) draw.io/22.0.2 Chrome/114.0.5735.289 Electron/25.8.4 Safari/537.36" etag="iOiW5F6x8VUFkmnMflTj" version="22.0.2" type="device">
<diagram name="Page-1" id="TafIrdbnw2cWi4bqOyK2">
<mxGraphModel dx="1114" dy="999" grid="1" gridSize="10" guides="1" tooltips="1" connect="1" arrows="1" fold="1" page="1" pageScale="1" pageWidth="850" pageHeight="1100" math="0" shadow="0">
<root>
<mxCell id="0" />
<mxCell id="1" parent="0" />
<mxCell id="f65CT_Lw4DzFi_7RwwvQ-1" value="" style="rounded=0;whiteSpace=wrap;html=1;" vertex="1" parent="1">
<mxGeometry x="40" y="20" width="120" height="200" as="geometry" />
</mxCell>
<mxCell id="f65CT_Lw4DzFi_7RwwvQ-2" value="Chunk 1" style="rounded=1;whiteSpace=wrap;html=1;fillColor=#fff2cc;strokeColor=#d6b656;" vertex="1" parent="1">
<mxGeometry x="50" y="50" width="100" height="40" as="geometry" />
</mxCell>
<mxCell id="f65CT_Lw4DzFi_7RwwvQ-3" value="Chunk 2" style="rounded=1;whiteSpace=wrap;html=1;fillColor=#fff2cc;strokeColor=#d6b656;" vertex="1" parent="1">
<mxGeometry x="50" y="100" width="100" height="40" as="geometry" />
</mxCell>
<mxCell id="f65CT_Lw4DzFi_7RwwvQ-6" value="" style="endArrow=none;dashed=1;html=1;dashPattern=1 3;strokeWidth=2;rounded=0;" edge="1" parent="1">
<mxGeometry width="50" height="50" relative="1" as="geometry">
<mxPoint x="100" y="210" as="sourcePoint" />
<mxPoint x="100" y="150" as="targetPoint" />
</mxGeometry>
</mxCell>
<mxCell id="f65CT_Lw4DzFi_7RwwvQ-7" value="Metadata" style="swimlane;fontStyle=0;childLayout=stackLayout;horizontal=1;startSize=30;horizontalStack=0;resizeParent=1;resizeParentMax=0;resizeLast=0;collapsible=1;marginBottom=0;whiteSpace=wrap;html=1;" vertex="1" parent="1">
<mxGeometry x="230" y="40" width="140" height="90" as="geometry" />
</mxCell>
<mxCell id="f65CT_Lw4DzFi_7RwwvQ-8" value="Start: 0, Size: 2" style="text;strokeColor=none;fillColor=none;align=left;verticalAlign=middle;spacingLeft=4;spacingRight=4;overflow=hidden;points=[[0,0.5],[1,0.5]];portConstraint=eastwest;rotatable=0;whiteSpace=wrap;html=1;" vertex="1" parent="f65CT_Lw4DzFi_7RwwvQ-7">
<mxGeometry y="30" width="140" height="30" as="geometry" />
</mxCell>
<mxCell id="f65CT_Lw4DzFi_7RwwvQ-9" value="Start: 2, Size 3" style="text;strokeColor=none;fillColor=none;align=left;verticalAlign=middle;spacingLeft=4;spacingRight=4;overflow=hidden;points=[[0,0.5],[1,0.5]];portConstraint=eastwest;rotatable=0;whiteSpace=wrap;html=1;" vertex="1" parent="f65CT_Lw4DzFi_7RwwvQ-7">
<mxGeometry y="60" width="140" height="30" as="geometry" />
</mxCell>
<mxCell id="f65CT_Lw4DzFi_7RwwvQ-11" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=1;entryY=0;entryDx=0;entryDy=0;" edge="1" parent="1" source="f65CT_Lw4DzFi_7RwwvQ-8" target="f65CT_Lw4DzFi_7RwwvQ-2">
<mxGeometry relative="1" as="geometry">
<Array as="points">
<mxPoint x="190" y="85" />
<mxPoint x="190" y="50" />
</Array>
</mxGeometry>
</mxCell>
<mxCell id="f65CT_Lw4DzFi_7RwwvQ-12" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;entryX=1;entryY=0;entryDx=0;entryDy=0;" edge="1" parent="1" source="f65CT_Lw4DzFi_7RwwvQ-9" target="f65CT_Lw4DzFi_7RwwvQ-3">
<mxGeometry relative="1" as="geometry">
<Array as="points">
<mxPoint x="190" y="115" />
<mxPoint x="190" y="100" />
</Array>
</mxGeometry>
</mxCell>
<mxCell id="f65CT_Lw4DzFi_7RwwvQ-14" value="Unity File" style="text;html=1;strokeColor=none;fillColor=none;align=center;verticalAlign=middle;whiteSpace=wrap;rounded=0;" vertex="1" parent="1">
<mxGeometry x="70" y="20" width="60" height="30" as="geometry" />
</mxCell>
</root>
</mxGraphModel>
</diagram>
</mxfile>

paper/unity-file.drawio.png Normal file

Binary file not shown.
