Chapter [ ]: Technical Assessments
[Several in-person and take-home challenges]
Questions About Data Science Projects
You have data on the duration of calls to a call center. Generate a plan for how you would code and analyze these data. Explain a plausible scenario for what the distribution of these durations might look like. How could you test (even graphically) whether your expectations are borne out?
How can you detect individual paid accounts shared by multiple users? This is a big problem for publishers, digital newspapers, software developers offering API access, the music / movie industry (file sharing issues) and organizations offering monthly paid access (flat fee not based on the numbers of accesses or downloads) to single users, to view or download content, as it results in revenue losses.
How can you optimize a web crawler to run much faster, extract better information, and better summarize data to produce cleaner databases? Answer: Putting a time-out threshold to less than 2 seconds, extracting no more than the first 20 KB of each pages, and not revisiting pages already crawled (that is, avoid recurrence in your algorithm) are good starting points.
How would you come up with a solution to identify plagiarism?
You are about to send 1 million e-mails (marketing campaign). How do you optimize delivery? How do you optimize response? Can you optimize both separately? Answer: Not really.
How would you turn unstructured data into structured data? Is it really necessary? Is it OK to store data as flat text files rather than in a SQL-powered RDBMS?
How would you build nonparametric confidence intervals, for scores?
How can you detect the best rule set for a fraud detection scoring technology? How can you deal with rule redundancy, rule discovery, and the combinatorial nature of the problem (for finding optimum rule set—the one with best predictive power)? Can an approximate solution to the rule set problem be OK? How would you find an OK approximate solution? How would you decide it is good enough and stop looking for a better one?
How can you create keyword taxonomies?
What is a botnet? How can it be detected?
How does Zillow's algorithm work to estimate the value of any home in the United States?
How can you detect bogus reviews or bogus Facebook accounts used for bad purposes?
How would you create a new anonymous digital currency? Please focus on the aspects of security - how do you protect this currency against Internet pirates? Also, how do you make it easy for stores to accept it? For instance, each transaction has a unique ID used only once and expiring after a few days; and the space for unique ID’s is very large to prevent hackers from creating valid ID’s just by chance.
You design a robust nonparametric statistic (metric) to replace correlation or R square that is independent of sample size, always between –1 and +1 and based on rank statistics. How do you normalize for sample size? Write an algorithm that computes all permutations of n elements. How do you sample permutations (that is, generate tons of random permutations) when n is large, to estimate the asymptotic distribution for your newly created metric? You may use this asymptotic distribution for normalizing your metric. Do you think that an exact theoretical distribution might exist, and therefore, you should find it and use it rather than wasting your time trying to estimate the asymptotic distribution using simulations?
Here’s a more difficult, technical question related to previous one. There is an obvious one-to-one correspondence between permutations of n elements and integers between 1 and factorial n. Design an algorithm that encodes an integer less than factorial n as a permutation of n elements. What would be the reverse algorithm used to decode a permutation and transform it back into a number? Hint: An intermediate step is to use the factorial number system representation of an integer. You can check this reference online to answer the question. Even better, browse the web to find the full answer to the question. (This will test the candidate's ability to quickly search online and find a solution to a problem without spending hours reinventing the wheel.)
How many “useful” votes will a Yelp review receive?
Answer: Eliminate bogus accounts or competitor reviews. (How to detect them: Use taxonomy to classify users, and location. Two Italian restaurants in same ZZIP code could badmouth each other and write great comments for themselves.) Detect fake likes: Some companies (for example, FanMeNow.com) will charge you to produce fake accounts and fake likes. Eliminate prolific users who like everything; those who hate everything. Have a blacklist of keywords to filter fake reviews. See if an IP address or IP block of a reviewer is in a blacklist such as Stop Forum Spam. Create a honeypot to catch fraudsters. Also watch out for disgruntled employees badmouthing their former employer. Watch out for two or three similar comments posted the same day by three users regarding a company that receives few reviews. Is it a new company? Add more weight to trusted users. (Create a category of trusted users.) Flag all reviews that are identical (or nearly identical) and come from the same IP address or the same user. Create a metric to measure the distance between two pieces of text (reviews). Create a review or reviewer taxonomy. Use hidden decision trees to rate or score review and reviewers.
Can you estimate and forecast sales for any book, based on Amazon.com public data? Hint: See http://www.fonerbooks.com/surfing.htm.
This is a question about experimental design (and a bit of computer science) with Legos. Say you purchase two sets of Legos (one set to build a car A and a second set to build another car B). Now assume that the overlap between the two sets is substantial. There are three different ways that you can build the two cars. The first step consists in sorting the pieces (Legos) by color and maybe also by size. The three ways to proceed are (1) Sequentially: Build one car at a time. This is the traditional approach. (2) Semi-parallel system: Sort all the pieces from both sets simultaneously so that the pieces will be blended. Some in the red pile will belong to car A; some will belong to car B. Then build the two cars sequentially, following the instructions in the accompanying leaflets. (3) In parallel: Sort all the pieces from both sets simultaneously, and build the two cars simultaneously, progressing simultaneously with the two sets of instructions. Which is the most efficient way to proceed?
Answer: The least efficient is sequential. Why? If you are a good at multitasking, the full parallel approach is best. Why? Note that with the semi-parallel approach, the first car that you build will take less time than the second car (due to easier way to find the pieces that you need because of higher redundancy) and less time than needed if you used the sequential approach (for the same reason). To test these assumptions and help you become familiar with the concept of distributed architecture, you can have your kid build two Lego cars, A and B, in parallel and then two other cars, C and D, sequentially. If the overlap between A and B (the proportion of Lego pieces that are identical in both A and B) is small, then the sequential approach will work best. Another concept that can be introduced is that building an 80-piece car takes more than twice as much time as building a 40-piece car. Why? (The same also applies to puzzles).
How can you design an algorithm to estimate the number of LinkedIn connections your colleagues have? The number is readily available on LinkedIn for your first-degree connections with up to 500 LinkedIn connections. However, if the number is above 500, it just says 500+. The number of shared connections between you and any of your LinkedIn connections is always available.
Answer: A detailed algorithm can be found in the book, in the section “job interview questions”.
How would proceed to reverse engineer the popular BMP image format and create your own images byte by byte, with a programming language?
Answer: The easiest BMP format is the 24-bit format, where 1 pixel is stored on 3 bits, the RGB (red/green/blue) components of the pixel, which determines its pixel. After producing a few sample images, it becomes obvious that the format starts with a 52-bit header. Starts with a blank image. Modify the size. This will help you identify where and how the size is encoded in the 52-bit header. On a test image (for example, 40 * 60 pixels), change the color of one pixel. See what happens to the file: Three bytes have a new value, corresponding to the red, green, and blue components of the color attached to your modified pixel. Now you can understand the format and play with it any way you want with low-level access.